LPLM: A Neural Language Model for Cardinality Estimation of LIKE-Queries

Cardinality estimation is an important step in cost-based database query optimization. The accuracy of the estimates directly affects the ability of an optimizer to identify the most efficient query execution plan correctly. In this paper, we study cardinality estimation of LIKE-queries, i.e., queries that use the LIKE-operator to match a pattern with wildcards against string-valued attributes. While both traditional and machine-learning-based approaches have been proposed to tackle this problem, we argue that they all suffer from drawbacks. Most importantly, many state-of-the-art approaches are not designed for patterns that contain wildcards in-between characters. Based on past research on neural language models, we introduce the LIKE-Pattern Language Model (LPLM) that uses a new language and a novel probability distribution function to capture the semantics of general LIKE-patterns. We also propose a method to generate training data for our model. We demonstrate that our method outperforms state-of-the-art approaches in terms of precision (Q-error), while offering comparable runtime performance and memory requirements.


INTRODUCTION
In cost-based query optimization, the result size or cardinality of a query or subquery is used to assign a cost to each enumerated query execution plan, which is used in turn to find the plan with the lowest overall cost. Since actual cardinalities are not known at optimization time, a typical query optimizer will estimate them for each operator in the query plan by multiplying the size of the input with the estimated selectivity of the operator [33]. In relational database systems, the problem of cardinality estimation is well-studied [12,13,26,30,31,35,37] and the precision of the estimated cardinalities has been shown to have a great impact on the quality of the selected query execution plan in terms of runtime [24].
LIKE-queries filter input tuples by matching a string-valued attribute against a pattern containing two types of wildcards (i.e., % and _). Estimating the result cardinality of such LIKE-queries is a subproblem of cardinality estimation that has received relatively little attention. While both traditional and machine-learning-based approaches exist, they all put restrictions on the structure of the LIKE-patterns that they actually support in order to achieve a practical trade-off between the size of the summary data structure, e.g., a histogram or trained model, and the precision of the cardinality estimation. Either they do not support both types of wildcards, or they limit where wildcards can occur within a LIKE-pattern. For example, several state-of-the-art approaches focus exclusively on prefix, suffix, or substring patterns and do not support LIKE-patterns containing wildcards between characters. Therefore, they cannot estimate the cardinality of LIKE-predicates with a pattern such as Dav% Ma_er that matches string values such as David Maier and Dave Mayer.
In this paper, we study cardinality estimation for queries with predicates that match string-valued attributes against a general LIKE-pattern, i.e., a pattern that is not restricted w.r.t. its structure or the permissible types of wildcards. Our approach builds on and extends the ideas of neural language models [3]. We first define a new probability distribution function for our neural language model to predict the next-token probability for a given pattern. We also introduce a novel language to capture the syntax and semantics of general LIKE-patterns. Our method first translates a LIKE-pattern into words of our novel language and then utilizes our probability distribution function to estimate the cardinality of the given LIKE-pattern. To train our model, we present a method that generates a set of random LIKE-patterns from a given sequence database. While existing techniques for LIKE-query cardinality estimation may perform better in specific and narrow cases, our experimental evaluation demonstrates that our approach substantially outperforms state-of-the-art approaches in general.
Specifically, we claim the following contributions.
• We introduce the LIKE-Pattern Language Model (LPLM) that uses a new language and a novel probability distribution function to support general LIKE-patterns (Section 4).
• We propose a method to generate training and test data for our model and show how to estimate the cardinality of LIKE-patterns using LPLM (Section 5).
• Our experimental evaluation demonstrates that LPLM outperforms the state of the art, both in terms of accuracy and types of LIKE-patterns supported (Section 6).
We begin by discussing related work in Section 2, introduce preliminaries in Section 3, and give concluding remarks in Section 7.

RELATED WORK
Our work is positioned at the intersection of result cardinality estimation for substring predicates with wildcards and machine-learning-based cardinality estimation.

Traditional Approaches
Various aspects of cardinality estimation for LIKE-queries in SQL queries have been studied in the literature [1,2,5,14,18,21,22,25]. The work of Krishnan et al. [18] was the first to study cardinality estimation for substring predicates with wildcards. They proposed an algorithm called KVI that uses pruned suffix trees to estimate the cardinality of substring LIKE-predicates in the presence of wildcards. Since its introduction, though, KVI has been outperformed by many cardinality estimators [5,14,25,34].
Lee et al. [22] proposed the LBS algorithm to estimate the cardinality of an approximate substring query. In the same paper, the authors also present an extension of the LBS algorithm for cardinality estimation of SQL LIKE-predicates. The LBS algorithm builds on the authors' previous work [21], which uses minimal base substrings and a q-gram table to estimate the cardinality of a predicate. LBS is the first algorithm to support LIKE-patterns that contain both wildcards (% and _). Hence, we chose LBS as a first baseline approach to experimentally compare our method to the state of the art in terms of support for multiple wildcards.
SPH [1] and its successor P-SPH [2] are two algorithms that estimate the cardinality of LIKE-predicates based on histograms. For example, the more recent P-SPH uses a histogram structure based on frequent positional patterns to estimate the cardinality of LIKE-patterns. Specifically, it mines the complete set of frequent positional patterns from a string-valued column and then constructs a histogram based on the mined frequent positional patterns. During query execution, for a given LIKE-pattern, P-SPH visits all buckets of the histogram and estimates the cardinality of the LIKE-predicate according to the match type (exact match, encapsulated match, and no match) between the pattern and the endpoint values of the histogram buckets. If the query pattern does not match any bucket of the histogram, P-SPH returns 10% of the minimum support value used during frequent pattern mining as the cardinality. Both SPH and P-SPH focus on LIKE-patterns that contain a high number of wildcards in-between characters, but only support %-wildcards. Since P-SPH has been shown to outperform LBS for this specific use case, we choose it as a second baseline approach to compare our method to the state of the art in our experimental evaluation.

Machine-Learning-based Approaches
Recently, deep-learning-based approaches to estimate the result cardinality of predicates have shown promise [9,10,16,17,27,36]. Astrid-EMBED [34] and Astrid-NLM [34] are two recent deep-learning-based methods to estimate the selectivity of prefix, suffix, and substring queries, which are, therefore, related to cardinality estimation for (certain types of) LIKE-predicates. Astrid-EMBED estimates result cardinality through word embedding. It consists of an embedding learner and a cardinality estimator. The embedding of a string is learned by using syntactic and cardinality similarities between the string and other strings in the suffix tree. Then, a fully connected neural network learns from the embedded strings and their corresponding cardinalities to estimate the cardinalities of prefix, suffix, and substring queries. The second method, Astrid-NLM, adapts a neural language model to estimate result cardinality. Specifically, Astrid-NLM uses a character-based neural language model that processes a pattern one character at a time and computes the probability distribution for the next character given the preceding characters. Then, the probability distributions of the characters are used to estimate the cardinality of prefix, suffix, and substring queries. Since both methods have been shown to outperform traditional approaches for LIKE-patterns that have prefix, suffix, or substring form, and Astrid-EMBED is superior to Astrid-NLM according to Shetiya et al. [34], we use Astrid-EMBED as the third and final baseline approach to experimentally evaluate the performance of our approach w.r.t. prefix, suffix, and substring queries. In the remainder of the paper, we refer to Astrid-EMBED simply as Astrid.
While other deep-learning-based approaches exist to estimate the cardinality of string-based queries, they are not closely related to our work. For example, a recent work by Kwon et al. [19] presents a deep-learning-based method called DREAM for cardinality estimation of approximate substring matching. More specifically, the authors aim to estimate the number of strings similar to a query string w.r.t. the edit distance measure. Since the focus of this technique is on string similarity, it does not support cardinality estimation for LIKE-predicates or any patterns that contain wildcards.

Limitations of Prior Approaches
Traditional approaches face a trade-off between the size of the summary data structure and the precision of the estimate that is often impossible to balance in practice. Getting accurate estimates for LIKE-predicates would require a complete summary data structure, which is impractical as it is typically considerably larger than the original dataset and cannot be entirely stored in memory. In order to reduce memory requirements, the summary data structure therefore needs to be pruned by, for example, removing all entries whose frequency is less than a given threshold. However, pruning the summary data structure leads in turn to poor estimates for LIKE-patterns that are not in the summary data structure.
Existing cardinality estimation approaches based on machine learning only offer limited support for general LIKE-patterns. More specifically, these approaches are designed to learn contextual and syntactic similarities between patterns in the form of a sequence of characters, i.e., prefix, suffix, and substring patterns. In contrast, the structure of general LIKE-patterns can be more complex, as they may contain wildcards (% and _) in-between the characters of a pattern. Currently, approaches based on neural language models are unable to encode the semantics of these general LIKE-patterns, and simply treating % as another character during embedding and estimation will inevitably lead to poor cardinality estimates.

PRELIMINARIES AND PROBLEM STATEMENT
Our approach to estimating cardinalities for general LIKE-predicates follows in the tradition of language models and, in particular, neural language models.In this section, we introduce the foundations we build on and end with a formal statement of the problem addressed in this paper.

Language Models
Let s = ⟨t_1, t_2, ..., t_n⟩ be a sequence of tokens from a vocabulary T. A token can be either a character or a wildcard, i.e., % or _. We denote the i-th token t_i in s by s[i], and the length n of a sequence s is given by |s|. We also denote a subsequence of s from the i-th to and including the j-th token by s[i:j], where 1 ≤ i ≤ j ≤ n. Furthermore, let S be the set of all sequences. A language model defines a probability distribution over sequences drawn from S by assigning probabilities such that

∀s ∈ S : 0 ≤ P(s) ≤ 1 and Σ_{s ∈ S} P(s) = 1,   (1)

where P(·) is the probability distribution over S. To assign a probability to a given sequence s = ⟨t_1, t_2, ..., t_n⟩, a language model factorizes the probability as

P(s) = Π_{i=1}^{n} P(t_i | t_1, ..., t_{i−1}).

Language models require excessively large amounts of training samples to capture the statistical characteristics of the distribution of sequences of tokens in the training data. The number of unique tokens in the vocabulary usually rises as the number of training samples increases. A learning algorithm requires at least one example for each relevant combination of tokens from the vocabulary to learn the joint probability distribution of all possible sequences. This requirement causes the problem of the curse of dimensionality [3] for language models during training.
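As an illustration of this factorization, the probability of a sequence is the product of per-token conditional probabilities; the lookup table below is an invented toy conditional model, not part of the paper.

```python
# Toy conditional model P(t | prefix), invented purely for illustration.
cond = {
    ("", "A"): 0.5,
    ("A", "B"): 0.4,
    ("AB", "C"): 0.25,
}

def sequence_probability(s: str) -> float:
    """Chain-rule factorization: P(s) = prod_i P(s[i] | s[:i])."""
    p = 1.0
    for i, t in enumerate(s):
        p *= cond[(s[:i], t)]
    return p

print(sequence_probability("ABC"))  # 0.5 * 0.4 * 0.25 = 0.05
```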

Neural Language Models
A neural language model [3] is a neural-network-based language model that uses distributed representations of input examples to reduce the impact of the curse of dimensionality during learning. A distributed representation of a token t is a vector of non-exclusive features that characterize the token's meaning in the given text. Specifically, given a sequence of tokens s = ⟨t_1, t_2, ..., t_n⟩, a neural language model assigns each token in s a unique d-dimensional continuous vector as follows:

e_t = E(t),

where E is an embedding function and e_t is the embedded feature vector of t, i.e., a unique d-dimensional continuous vector for token t that contains the learned features of t. Tokens that are semantically close have similar embedded feature vectors. Neural language models train on embedded feature vectors and use the softmax function as an activation function to find the probability of the next token given a sequence of tokens up to that point. The softmax function normalizes the input vector so that all components are in the interval [0, 1] and sum to 1. Last but not least, a distributed representation approach allows the neural language model to generalize well to sequences that are not in the training data by using the similarity between embedding vectors. Figure 1a shows the architecture of a traditional neural language model.

Word Embedding
In natural language processing, word embedding refers to a group of models and feature selection methods that describe the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word so that words that are close in the vector space are expected to be similar in meaning. Let W = ⟨w_1, w_2, ..., w_n⟩ be a sequence of words. Given the sequence W and a word w_i, the meaning of w_i is determined by a set of surrounding words, w_{i−d}, ..., w_{i+d}, in a sliding window, where d represents the length of the window. BERT [7], word2vec [28], doc2vec [20], and FastText [4] are popular word embedding techniques.

Problem Statement
The SQL LIKE-operator selects all tuples in a relation with an attribute value that matches the specified LIKE-pattern. For example, the SQL predicate name LIKE 'Ais' returns all tuples in a relation with a name-value that exactly matches the string 'Ais'. As defined by ISO/IEC 9075-2:2023 (Section 8.5, General Rule 3.b.ii), the LIKE-operator supports the two wildcards % and _, either independently or in combination. The percent sign matches zero or more characters, whereas the underscore matches exactly one character. For example, the SQL predicate name LIKE 'Ais%' matches all tuples with a name-value that begins with the prefix 'Ais'. In contrast, the SQL predicate name LIKE 'A_s' matches all tuples that have a name-value of length three with the first character being 'A' and the last being 's'.
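For readers less familiar with these semantics, LIKE matching can be sketched by translating a pattern into an anchored regular expression; this is an illustrative helper, not part of the proposed method.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a LIKE-pattern into an anchored regular expression:
    % matches zero or more characters, _ matches exactly one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"

def like_match(pattern: str, value: str) -> bool:
    """Evaluate `value LIKE pattern` on a single string."""
    return re.match(like_to_regex(pattern), value) is not None

print(like_match("Dav% Ma_er", "David Maier"))  # True
print(like_match("A_s", "Ais"))                 # True
print(like_match("Ais%", "Ai"))                 # False
```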
Given a relation R with a string-valued attribute a and a LIKE-pattern q, the problem addressed in this paper is to estimate the result cardinality of the predicate a LIKE q, where q is a sequence drawn from the vocabulary T ∪ {%, _}, under the condition that the time complexity of the estimation is strictly lower than the time required to evaluate the pattern q over the relation R.

LPLM: LIKE-PATTERN LANGUAGE MODEL
Standard language models are designed to determine the probability of a given sequence of characters. Figure 2a shows the probability distributions for each character of the sequence ABBC from the vocabulary T = {A, ..., Z} in a standard neural language model. However, in addition to characters, LIKE-patterns may contain wildcards (% and _) that have semantics different from encoding the corresponding character. Since standard neural language models cannot encode these semantics of wildcards, they cannot be applied to cardinality estimation of LIKE-predicates out-of-the-box. As an example, consider the two LIKE-patterns %AB% and %A%B% that are syntactically very similar, but can have arbitrarily different cardinalities. Simply treating % and _ as regular characters in the embedding will inevitably lead to poor cardinality estimates.
Furthermore, directly applying the probability distribution function of standard neural language models to LIKE-patterns leads to problems. On the one hand, if a neural language model treats wildcards as regular characters, the function would simply determine the probability of the wildcard as the next character in an input sequence. Since wildcards are not in the vocabulary and, therefore, are unknown tokens, a standard neural language model would assign a low probability to a wildcard. Consequently, treating wildcards as regular characters violates wildcard semantics and leads to inaccurate estimates for LIKE-patterns. On the other hand, if the neural language model were to combine a wildcard with the following regular character of the LIKE-pattern, the function could return a probability distribution with a sum greater than 1, thus violating Equation 1.
Example 4.1. Consider the vocabulary T = {A, ..., E} ∪ {%, _} and the LIKE-pattern %AB%C that selects all rows from the database given in Table 1 that start with any number of tokens followed by AB and end with C, with any number of tokens in-between. Using the approach of combining wildcards with the next regular character, the probability distribution over all possible next tokens given %AB% would be {P(%AB%A%|%AB%), P(%AB%B%|%AB%), ..., P(%AB%E%|%AB%)}. The sum of the probabilities of this probability distribution will be greater than 1 if there are tuples in a relation that satisfy multiple conditional probabilities. Therefore, generating conditional probabilities for every token in the vocabulary does not work for patterns that contain wildcards in-between characters.
To overcome these limitations of standard neural language models and to accurately estimate the cardinalities of LIKE-patterns, we propose the LIKE-Pattern Language Model (LPLM). First, we propose a new probability distribution function that assigns a probability only to the next likely token, instead of all possible next tokens in the vocabulary. LPLM operates on the character level [15] because word-based neural language models are too coarse-grained to represent LIKE-patterns. Second, we define a new language to embed LIKE-patterns that encodes the semantics of the % and _ wildcards. Finally, we adapt the output function of traditional neural language models to our proposed probability distribution function. Figure 1 compares the architecture of a traditional neural language model to that of our method.

Probability Distribution Function
Our new probability distribution function follows from combining wildcards with the next regular character of a LIKE-pattern. However, instead of returning a set of probabilities for all possible next tokens in the vocabulary, as in traditional language models, our function only returns the probability of the next likely token. The general idea of this approach is as follows. During the computation of the distribution, our probability distribution function always checks whether the next token in a given sequence of tokens from a LIKE-pattern is %. If the next token is %, it combines it with the character that follows the % to determine that character's conditional probability for the given sequence of tokens. If not, it first divides the probability of the next token into two new probabilities according to the type of the next token and then computes them.
Algorithm 1 demonstrates the computation of the sequence of probabilities P for the tokens of a given LIKE-pattern q. The algorithm first checks whether the first token is a wildcard or a regular character and then initializes the sequence of probabilities P accordingly in Lines 1-8. Specifically, if the first token is a character, then the probability of the character surrounded by % is added to the sequence in Line 2. Otherwise, the probability of the second token in the sequence surrounded by % is added to the sequence in Line 5. If the first token is a _, we also append (denoted by ⊕) the probability of the first two tokens followed by % given the second token surrounded by % in Line 7. Subsequently, the algorithm examines the intermediate tokens in sequence in Lines 9-20. In Lines 10-12, if the current token q[i] is not a wildcard, we add two conditional probabilities to P regarding the following sequences: a) the subsequence of q up to q[i−1] followed by q[i] surrounded by %, and b) the subsequence of q up to q[i] followed by %. Otherwise, if q[i] is the wildcard %, we add a conditional probability to P regarding the subsequence of q up to q[i+1] followed by % in Line 16, while if q[i] is the wildcard _, we add two conditional probabilities regarding the subsequence up to q[i−1] followed by q[i+1] surrounded by %, and the subsequence up to q[i+1] followed by % in Lines 18-19. Finally, the algorithm checks the last token of q in Lines 21-25. If the last token is not a wildcard, it adds the probability of q given q followed by %. Otherwise, if the last token is _, it adds the probability of q given the subsequence up to the last token followed by %.

Algorithm 1 Compute Probability Sequence
[Pseudocode of Algorithm 1 omitted.]

Example 4.2. Consider the LIKE-pattern q = %AB%C as input to Algorithm 1. Since the first token q[1] is a % wildcard, it is combined with the following regular character A, and the probability P(%A%) is appended to P. Since B is a regular character that immediately follows the regular character A, the probability P(%AB%|%A%) is decomposed into two new probabilities P(%A%B%|%A%) and P(%AB%|%A%B%) that are also appended to P. Since the next token is the wildcard %, the probability P(%AB%C%|%AB%) is appended to P. Finally, since the last token of q is not a wildcard, the probability P(%AB%C|%AB%C%) is appended to P. Thus, the final sequence of probabilities for all tokens in the pattern q that is returned by Algorithm 1 is P = ⟨P(%A%), P(%A%B%|%A%), P(%AB%|%A%B%), P(%AB%C%|%AB%), P(%AB%C|%AB%C%)⟩.
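The steps of Algorithm 1 can be sketched as follows. This is a reconstruction from the prose description above, so names and some corner-case behavior are our assumptions; the probabilities are represented symbolically as strings.

```python
def prob_sequence(q: str) -> list[str]:
    """Sketch of Algorithm 1: build the sequence of conditional-probability
    expressions for a LIKE-pattern q. Assumes q contains no consecutive %
    wildcards and that a leading wildcard is followed by a character."""
    P = []
    n = len(q)
    # Initialization according to the first token (Lines 1-8).
    if q[0] not in "%_":
        P.append(f"P(%{q[0]}%)")
        cond = f"%{q[0]}%"
        i = 1
    else:
        P.append(f"P(%{q[1]}%)")
        cond = f"%{q[1]}%"
        if q[0] == "_":
            P.append(f"P(_{q[1]}%|%{q[1]}%)")
            cond = f"_{q[1]}%"
        i = 2
    # Intermediate tokens (Lines 9-20).
    while i <= n - 2:
        t = q[i]
        if t not in "%_":
            # Decompose into: prefix + %t% given cond, then prefix+t+% given that.
            a = q[:i] + f"%{t}%"
            b = q[:i + 1] + "%"
            P.append(f"P({a}|{cond})")
            P.append(f"P({b}|{a})")
            cond = b
            i += 1
        elif t == "%":
            # Combine % with the following character (consumes q[i+1]).
            e = q[:i + 2] + "%"
            P.append(f"P({e}|{cond})")
            cond = e
            i += 2
        else:  # t == "_"
            a = q[:i] + f"%{q[i + 1]}%"
            b = q[:i + 2] + "%"
            P.append(f"P({a}|{cond})")
            P.append(f"P({b}|{a})")
            cond = b
            i += 2
    # The last token (Lines 21-25).
    if q[-1] not in "%_":
        P.append(f"P({q}|{q}%)")
    elif q[-1] == "_":
        P.append(f"P({q}|{q[:-1]}%)")
    return P
```

For q = %AB%C this sketch reproduces the five probabilities of Example 4.2.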
We then define the probability (selectivity) of a LIKE-pattern as the product of all conditional probabilities of its tokens. For example, the selectivity of the LIKE-pattern %AB%C is

P(%AB%C) = P(%A%) · P(%A%B%|%A%) · P(%AB%|%A%B%) · P(%AB%C%|%AB%) · P(%AB%C|%AB%C%).

The cardinality of the LIKE-pattern is given by the product of its selectivity and the total number of tuples in the relation.
Note that by adding more constraints to the sequence of probabilities P, the model can identify whether a given LIKE-pattern is a subpattern of another LIKE-pattern and estimate cardinalities accordingly.For instance, our probability distribution function guarantees that the estimated cardinality of the LIKE-pattern %A%B% will always be greater than or equal to the estimated cardinality of %AB%.

Embedding: The LIKE-Pattern Language
In a traditional neural language model, the length of the output is always the same as the length of the input. However, in LPLM, the number of tokens in the LIKE-pattern q might differ from the length of the corresponding sequence of probabilities P due to how we encode the semantics of wildcards. Note that the output length of P depends on the implicit structure and type of the LIKE-pattern q. Therefore, to obtain an input of the same length as the desired output, information about the structure and type of the LIKE-pattern must be explicitly encoded before inputting it into the neural language model. We define a transformation from a LIKE-pattern q into a word w in a new LIKE-pattern language that encodes this information.
This transformation must satisfy two requirements to avoid poor estimates. First, it must preserve the semantics of the % and _ wildcards. Second, it must not be pattern-specific. As such, we introduce the function g that is applied to every token of the LIKE-pattern. If two adjacent tokens are both regular characters, g encodes that information by placing • after the second token. If two adjacent tokens are a _ wildcard followed by a regular character, g encodes this information by placing * after the second token. If the last token is a character, g encodes that information by placing ⋄ after the token. If the first token is a character, g encodes that information by placing • after the token. Otherwise, g assumes that there is a % wildcard in-between the characters.
Given a LIKE-pattern q as a sequence drawn from the vocabulary T ∪ {%, _} and 1 ≤ i ≤ |q|, the function g is given as

g(q[i]) = ε       if q[i] ∈ {%, _}
g(q[i]) = q[i]⋄   if q[i] ∈ T and i = |q|
g(q[i]) = q[i]*   if q[i] ∈ T and q[i−1] = _
g(q[i]) = q[i]•   if q[i] ∈ T and (i = 1 or q[i−1] ∈ T)
g(q[i]) = q[i]    otherwise,

where ε is the empty token and the first matching case applies. Notice that after this transformation the length of the resulting word w, i.e., |w| = 5 for q = %AB%C, matches the length of the desired sequence of probabilities P (cf. the previous example) that the neural language model needs to generate as output. Figure 2b illustrates this process step by step for the LIKE-pattern q. As a second, more complex example, consider the LIKE-pattern q′ = %AB_A%C%D, which is transformed by applying g into w′ = AB•A*CD⋄.
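The transformation can be sketched as follows. This is our reconstruction of the marker rules described above (the precedence among markers is an assumption), and it reproduces the examples in this section.

```python
def transform(q: str) -> str:
    """Sketch of the LIKE-pattern-language transformation g:
    drop wildcards and annotate each kept character with a marker
    (⋄ last token, * after _, • after a character or at position 1)."""
    out = []
    n = len(q)
    for i, t in enumerate(q):
        if t in "%_":
            continue  # wildcards themselves produce the empty token
        out.append(t)
        if i == n - 1:
            out.append("⋄")   # last token is a character
        elif i > 0 and q[i - 1] == "_":
            out.append("*")   # character preceded by a _ wildcard
        elif i > 0 and q[i - 1] not in "%_":
            out.append("•")   # two adjacent regular characters
        elif i == 0:
            out.append("•")   # pattern starts with a character
    return "".join(out)

print(transform("%AB%C"))     # AB•C⋄
print(transform("%AB%C%"))    # AB•C
print(transform("%AB_A%C%D")) # AB•A*CD⋄
```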

Activation Function
Traditional neural language models generate probability distributions by applying a softmax function that assigns a probability to each token in the vocabulary given the preceding token sequence. The softmax function enforces the requirement given in Equation 1, i.e., that the probabilities of the tokens in the vocabulary for a given preceding token sequence must sum up to 1. Our probability distribution function assigns a conditional probability only to the next likely token for a given preceding sequence instead of all possible tokens in the vocabulary. Since our probability distribution function does not generate multiple probabilities, the softmax function is not a good classifier for our neural language model. Instead, LPLM uses the sigmoid, another popular activation function. The sigmoid function takes a single real value as input and maps it to the interval [0, 1] to represent the next token's probability. Formally, the sigmoid function is given by

σ(z) = 1 / (1 + e^{−z}),

where z represents a weighted sum of the feature vector for a specific LIKE-pattern in our case.
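A minimal sketch of this activation function:

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))  # 0.5
```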

CARDINALITY ESTIMATION USING LPLM
In this section, we describe how cardinalities of LIKE-predicates are estimated using the LIKE-Pattern Language Model (LPLM). Figure 3 gives an overview of the process.

Training Data Generation
To obtain training data for LPLM, we propose a method that generates a subset of all possible LIKE-patterns for a given sequence database. More specifically, we generate k LIKE-patterns from randomly selected rows of a given sequence database D. We begin by randomly picking a row r from D. Then, we pick a random set I of item indexes from r. Next, we iterate over all tokens in r. We replace each token with the % or _ wildcard if the index of the token is in I and the previous token has not already been replaced by the % wildcard. This way, we ensure that the pattern will not contain consecutive % wildcards. The procedure is repeated until k LIKE-patterns have been generated. Algorithm 2 shows the pseudocode of our method for randomly generating LIKE-patterns.
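The generation procedure can be sketched as follows; this is a reconstruction under our naming, and the index-selection policy is an assumption.

```python
import random

def make_pattern(row, idxs, choose):
    """One step of the procedure described above: replace the tokens of
    `row` whose (1-based) index is in `idxs` by a wildcard, skipping a
    replacement when the previous output token already is % (so no
    consecutive % wildcards are produced)."""
    out = []
    for i, t in enumerate(row, start=1):
        if i in idxs:
            if out and out[-1] == "%":
                continue  # token absorbed by the preceding % wildcard
            out.append(choose(i))
        else:
            out.append(t)
    return "".join(out)

def generate_patterns(db, k, rng=random):
    """Repeat the procedure until k LIKE-patterns have been generated."""
    patterns = []
    while len(patterns) < k:
        row = rng.choice(db)
        idxs = set(rng.sample(range(1, len(row) + 1), rng.randint(1, len(row))))
        patterns.append(make_pattern(row, idxs, lambda i: rng.choice("%_")))
    return patterns

# Example 5.1: replacing indexes {1, 2, 5} of ACECBE, with % chosen for
# index 1 and _ for index 5, yields the pattern %EC_E.
print(make_pattern("ACECBE", {1, 2, 5}, lambda i: "%" if i == 1 else "_"))
```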
Example 5.1. Consider the sample database in Table 1. Given the sequence ACECBE, randomly drawn from the sequence database, |I| = 3, and the list of random indexes I = [1, 2, 5], the output of the algorithm will be the LIKE-pattern %EC_E.

LIKE-Pattern Transformation
Given a LIKE-pattern q from the output of Algorithm 2, we first transform q into a word w in the new LIKE-pattern language. Then, we assign a probability to each token in w. These probabilities are computed by evaluating the corresponding patterns against the sequence database D. Finally, w is associated with a vector containing the probability of each token.
Example 5.2. Consider the LIKE-pattern q = %AB%C%. The pattern q is first transformed into a word in our LIKE-pattern language, resulting in w = AB•C. Then, the vector of ground truth probabilities based on the sequence database from Table 1 is computed as follows.
As D contains four sequences, the ground truth cardinality of the LIKE-pattern q = %AB%C% can then be computed as 1. Note that Sequence 1 is the only sequence in D that matches the given LIKE-pattern q.

Cardinality Estimation
The last step of LPLM is the cardinality estimator. We train the cardinality estimator on all transformed LIKE-patterns together with the corresponding vectors of probabilities. More specifically, our cardinality estimator processes each transformed LIKE-pattern sequentially, one character at a time, and predicts the conditional probability of the following character given the preceding characters. Training our cardinality estimator can be considered a regression problem where one-hot-encoded LIKE-patterns are the independent features and the vectors of conditional probabilities are the dependent features. We use binary cross-entropy to compute the loss between the probability distributions of the predicted and actual values for a character in a LIKE-pattern q. The binary cross-entropy is defined as

BCE(g(q)) = −(1 / |g(q)|) Σ_{i=1}^{|g(q)|} ( ŷ_i log(y_i) + (1 − ŷ_i) log(1 − y_i) ),

where y_i and ŷ_i are the predicted and actual probability, respectively, for the character in the transformed LIKE-pattern that occurs at position i, and g(q) is the transformation of q. Our proposed method seeks to minimize the average binary cross-entropy over the entire set of LIKE-patterns as

L = (1 / N) Σ_{j=1}^{N} BCE(g(q_j)),

where N is the number of LIKE-patterns in the training set.
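For illustration, the per-pattern loss can be computed as follows; this sketch adds a clipping guard against log(0), a standard numerical precaution the paper does not discuss.

```python
import math

def bce(y_hat, y):
    """Binary cross-entropy between the actual (y_hat) and predicted (y)
    per-token probabilities of one transformed LIKE-pattern."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for a, p in zip(y_hat, y):
        p = min(max(p, eps), 1.0 - eps)
        total += -(a * math.log(p) + (1.0 - a) * math.log(1.0 - p))
    return total / len(y)

# The loss is minimal when the predicted probabilities equal the targets:
loss = bce([0.25, 0.5], [0.25, 0.5])
```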
Our neural language model accepts a transformed LIKE-pattern as a query and returns the vector of conditional probabilities of the LIKE-pattern as an estimation. In order to estimate the result cardinality of the corresponding LIKE-predicate, we use the product of the conditional probabilities as its selectivity and multiply it with the number of rows in the sequence database.
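Turning the model's predicted conditional probabilities into a cardinality estimate is then a simple product; the probability values below are hypothetical model outputs.

```python
import math

def estimate_cardinality(cond_probs, num_rows):
    """Selectivity is the product of the predicted conditional probabilities;
    the cardinality estimate is selectivity times the number of rows."""
    selectivity = math.prod(cond_probs)
    return selectivity * num_rows

# Hypothetical model output for a five-token transformed pattern
# over a table with one million rows:
est = estimate_cardinality([0.5, 0.8, 0.6, 0.7, 0.9], 1_000_000)
```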

Deep Learning Architecture
A number of deep learning architectures, such as LSTM [11] and GRU [6], have been proposed for neural language modeling. We use a Gated Recurrent Unit (GRU) to learn LPLM efficiently. We studied different numbers of layers (1, 2) and hidden units (128, 256, 512) and decided to use a GRU with one layer and 256 hidden units. We also evaluated our model with different batch sizes (16, 32, 64, 128, 256, 512) and various learning rates (0.1, 0.01, 0.001, 0.0001, 0.00001). The model returns its best performance with a batch size of 128 and a learning rate of 0.0001. We used 64 epochs to train our model.

Handling Data Updates
LPLM is currently not designed to support incremental re-training. LPLM learns the probability distribution of each pattern from transformed LIKE-patterns and the ground truth probabilities. After completing the training phase, the LIKE-pattern samples and the ground truth probabilities are discarded to save space. Hence, the model needs to be re-trained in case of data updates. Naturally, this re-training does not need to happen often. One could track the accuracy of the estimates and trigger re-training only when the accuracy deteriorates below a user-defined threshold. However, incremental re-training of LPLM is out of the scope of this paper and an interesting direction for future work.

EXPERIMENTAL EVALUATION
We first examine the efficacy of LPLM compared to both traditional and deep-learning-based state-of-the-art approaches (Section 6.1). In addition, we conduct an ablation study to investigate the effect that different parameters have on the efficacy of LPLM (Section 6.2). Finally, we evaluate the performance of LPLM in terms of end-to-end query runtimes in PostgreSQL 14.5 (Section 6.3). We begin by describing the experimental setup. Our implementation of LPLM is publicly available.
Data Sets. We perform all experiments on four benchmark datasets from three different sources.
More specifically, we use author names (DBLP-AN) obtained from DBLP, actor names (IMDb-AN) and movie titles (IMDb-MT) obtained from IMDb, and part names (TPCH-PT) obtained from TPC-H. All of these data sets have been frequently used in previous evaluations of LIKE-query cardinality estimation [1,2,21,22,34]. Table 2 reports the characteristics of these four data sets in terms of the total and unique number of sequences as well as the minimum, maximum, and average sequence length.
Generating Training Data for LPLM. For each data set, we use Algorithm 2 to generate a pair of a training and a validation set for LPLM. Due to the large number of LIKE-patterns, it would be infeasible in terms of time and memory consumption to train LPLM using the set of all possible LIKE-patterns. Hence, for each data set, we generate 5 million and 1 million random LIKE-patterns for the training and the validation set, respectively.
State-of-the-Art Methods. To evaluate the efficacy of LPLM, we compare it against state-of-the-art techniques that use either a traditional or a deep-learning approach. As outlined in Section 2, most state-of-the-art techniques put restrictions on the structure of supported LIKE-patterns in order to reduce the size of the summary data structure. Table 3 summarizes the capabilities of the chosen baseline approaches and relates them to LPLM.
In what follows, we describe these baseline approaches and how we used them in our experiments in more detail.
• Astrid [34] is the deep-learning-based state-of-the-art approach for the cardinality estimation of prefix, suffix, and substring LIKE-patterns that do not contain any wildcard (% and _) in-between characters. Astrid estimates the selectivity of queries based on word embeddings.
To apply Astrid to estimate the cardinalities of LIKE-patterns that contain % in-between characters, we first split each LIKE-pattern into subqueries, i.e., one prefix, one suffix, and one or more substring queries, according to the % wildcard. For example, for the LIKE-pattern AB%C%D, the subqueries are the prefix query AB%, the substring query %C%, and the suffix query %D. We first compute the selectivity of all subqueries, and then we estimate the cardinality of the LIKE-pattern as the product of the selectivities of the subqueries multiplied by the number of rows in the database. If the resulting estimated cardinality is greater than zero but less than one, then we assume the cardinality of the LIKE-pattern is equal to one.
• P-SPH [2] is the state-of-the-art approach for the cardinality estimation of substring LIKE-patterns that may contain % wildcards. P-SPH estimates the selectivity of LIKE-patterns using a new type of histogram structure that is based on frequent sequence patterns. To build this histogram, we use 2048 buckets for each dataset. To mine frequent sequence patterns, we use 1.5% of the dataset size as the minimum support threshold. According to the authors of the original paper, these two values represent the optimal configuration for P-SPH. As for Astrid, we assume that the cardinality of a LIKE-pattern is equal to one if the estimated cardinality is greater than zero but less than one.
• LBS [22] is the state-of-the-art approach for the cardinality estimation of substring LIKE-patterns that may contain both % and _ wildcards. LBS exploits an extended N-gram table as a summary structure to store the cardinalities of possible base strings with length up to N. It then estimates the cardinality of a query based on the cardinalities of the minimal base strings of the query. We set N = 6 and the maximum edit distance threshold to 1 to exploit an extended set of all N-grams. We prune the N-gram table by removing base strings that have true cardinalities lower than 20. According to the authors, these settings represent the best configuration. Notice, however, that in this configuration, LBS will perform poorly for LIKE-patterns that have a length greater than 6 and a true cardinality less than 20.
• PostgreSQL is the only state-of-the-art approach for cardinality estimation that supports all types of LIKE-patterns. PostgreSQL's query planner [32] uses column statistics to estimate cardinalities during query planning. These statistics include information about the distribution of values and the presence of patterns in a column. PostgreSQL collects these statistics in the form of histograms. To obtain cardinality estimates for LIKE-patterns, we use the histogram-based cardinality estimation of PostgreSQL. We obtain these estimates from the query planner using the EXPLAIN statement and a single table for each dataset.
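The %-based decomposition that we use to apply Astrid to general patterns can be sketched as follows. The function names are illustrative; the sketch assumes a pattern that, like AB%C%D, starts and ends with characters.

```python
def split_like_pattern(pattern):
    """Split a LIKE-pattern at its % wildcards into one prefix, zero or
    more substring, and one suffix subquery, e.g. AB%C%D -> AB%, %C%, %D.
    Assumes the pattern starts and ends with characters."""
    parts = [p for p in pattern.split('%') if p]
    if len(parts) <= 1:
        return [pattern]  # no inner %: the pattern is its own subquery
    return ([parts[0] + '%'] +                       # prefix query
            ['%' + p + '%' for p in parts[1:-1]] +   # substring queries
            ['%' + parts[-1]])                       # suffix query

def combine_estimate(selectivities, num_rows):
    """Multiply the subquery selectivities by the table size; estimates
    strictly between 0 and 1 are rounded up to 1, as described above."""
    estimate = num_rows
    for s in selectivities:
        estimate *= s
    return 1 if 0 < estimate < 1 else estimate
```

For the running example, `split_like_pattern("AB%C%D")` yields the three subqueries `AB%`, `%C%`, and `%D`, whose selectivities are then multiplied and scaled by the row count.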
Query Sets. In order to test the accuracy of our cardinality estimator, we use five benchmarks that differ in terms of the query types that their query sets contain. This setup is necessary to ensure fair comparisons in a setting where most state-of-the-art approaches do not support all query types. Table 4 summarizes the statistics of the benchmark query sets used in the evaluation.
In what follows, we explain what query types the five benchmarks contain and how we generated them in more detail.
• The Astrid Benchmark is generated using the same approach as Suraj et al. [34] for evaluating Astrid. This query set includes LIKE-queries with patterns that have prefix, suffix, and substring form.
• The LPLM Benchmark consists of LIKE-patterns that are generated by Algorithm 2. As the LIKE-patterns are randomly generated, their structure is unrestricted.
• The P-SPH Benchmark is generated using the same approach as described by the authors to evaluate P-SPH. Queries in this set have the form %s1%s2% ... %sk%, where each si is a sequence of one or more characters. We refer the reader to the work of Aytimur and Çakmak [2] that describes this process in detail.
• The LBS Benchmark is generated using the same approach as Lee et al. [22] for evaluating LBS. This query set includes queries of the form %w0% and %w1%w2%, where each wi is a word with a length between 5 and 12. More specifically, to generate this query set, a word w with a length between 5 and 12 is randomly drawn. Then, a random number (from 0 to 2) of _ wildcards is inserted at random positions in w.
• The Negative Queries set consists of queries that do not match any rows in the database, i.e., that have a true cardinality of zero. We use LIKE-patterns that all methods from Table 3 can support. Hence, we only generate substring LIKE-patterns for the negative query set. To generate such queries, we extend Algorithm 2: instead of replacing characters from a randomly drawn sequence with wildcards, the extended algorithm first removes the characters at the selected indexes and then shuffles the characters in the remaining sequence. We then follow the same procedure as in Algorithm 2 and keep all LIKE-patterns that have a true cardinality of zero. Each dataset has 10k negative LIKE-patterns.
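The extension of Algorithm 2 for negative queries (remove the characters at the selected indexes, then shuffle the remainder) can be sketched as follows. The function name and parameters are illustrative, and a real generator would additionally verify a true cardinality of zero against the database before keeping the pattern.

```python
import random

def negative_substring_pattern(sequence, num_remove, seed=None):
    """Remove num_remove characters at random indexes from the sequence,
    shuffle the remaining characters, and wrap the result as a substring
    LIKE-pattern. Checking that the pattern matches no rows is left to
    the caller, as in the procedure described above."""
    rng = random.Random(seed)
    removed = set(rng.sample(range(len(sequence)), num_remove))
    remaining = [c for i, c in enumerate(sequence) if i not in removed]
    rng.shuffle(remaining)
    return '%' + ''.join(remaining) + '%'
```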
Evaluation Metric. To measure the accuracy of all cardinality estimation approaches, we report the Q-error [29], which measures the factor by which the estimated cardinality deviates from the true cardinality. The Q-error is defined as max(c, ĉ) / min(c, ĉ), where c is the actual number of tuples returned by the query and ĉ is the estimated cardinality. To use the Q-error as an evaluation metric for the negative queries, we use 1 as the true cardinality. The best Q-error of 1 is obtained when the estimated cardinality is equal to the actual cardinality, and the worst Q-error is unbounded.
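A direct implementation of this metric, including the convention of using 1 as the true cardinality for negative queries, could look as follows; clamping estimates below 1 mirrors the treatment described for the baselines above.

```python
def q_error(true_card, est_card):
    """Q-error: the factor by which the estimate deviates from the truth.
    Both values are clamped to at least 1, so a true cardinality of 0
    (negative query) is treated as 1, as described above."""
    c = max(true_card, 1)
    e = max(est_card, 1)
    return max(c, e) / min(c, e)
```

Note that the metric is symmetric: an estimate of 10 for a true cardinality of 100 and an estimate of 1000 for the same true cardinality both yield a Q-error of 10.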
Environment. All our experiments were performed on a machine with two NVIDIA RTX 3090 24 GB GPUs, an AMD Ryzen Threadripper 3970X CPU, and 256 GB of RAM, running Ubuntu 20.04 LTS. We used PyTorch 1.10.1 for building the deep-learning models.

Comparison to the State of the Art
We report results of the experimental comparison of LPLM to Astrid, P-SPH, LBS, and PostgreSQL in terms of estimation accuracy, run-time performance, and model size on all five test benchmarks.
Estimation Accuracy. Table 5 gives the results of our experimental evaluation of the accuracy (Q-error) of LPLM and the state-of-the-art approaches. Learning-based approaches were trained with queries from the corresponding query set. All approaches were applied to all query sets that they support. We report geometric mean, mean, median, 90th, and 99th percentile errors. We also created an additional query set, the Full Benchmark, that contains all queries from all other benchmarks. The only approaches that support all queries in the Full Benchmark are LPLM and PostgreSQL. LBS outperforms LPLM on the LBS Benchmark. These results are a direct consequence of the characteristics of the queries in this benchmark. Recall that base strings with a cardinality less than 20 are pruned from its N-gram table summary structure and that LBS returns an estimated cardinality of 0 for infrequent LIKE-patterns. Since most queries in the LBS Benchmark have a true cardinality of 1, the Q-error is therefore almost always 1.
Astrid outperforms LPLM on the Astrid Benchmark for the DBLP-AN, IMDb-AN, and IMDb-MT datasets. These results are not surprising, as Astrid is specifically designed for the type of queries in this benchmark. Furthermore, most queries in the training and test set again have a true cardinality of 1, with the same consequence as above. In general, we observe that such training and test datasets have a tendency to favor Astrid. However, Astrid's performance drops substantially on the TPCH-PT dataset. The main reason for this degradation is that the part name strings are combinations of only 96 unique words, leading to a low number of unique prefixes, suffixes, and substrings. Hence, Astrid is unable to learn the underlying structure of the TPCH-PT dataset by training on a small number of samples.
On the P-SPH Benchmark, LPLM outperforms all state-of-the-art approaches. Since Astrid is trained on LIKE-patterns that do not contain % wildcards in-between characters, it consistently underestimates the cardinality of such LIKE-patterns. Again, LBS returns an estimated cardinality of 0 for most queries in this set. To estimate the cardinality of a LIKE-pattern that contains % in-between sequences of characters, LBS finds the estimated sets of tuple ids in the database that match the subqueries of the LIKE-pattern and then takes the size of the union of these estimated sets. LIKE-patterns in this benchmark have a higher number of % wildcards in-between characters, which mostly results in a small union set. Hence, LBS underestimates the cardinalities of LIKE-patterns in this test benchmark. Surprisingly, P-SPH is outperformed by both LPLM and LBS on its own benchmark. The reason is that the underlying histogram structure does not accurately approximate the data sets, as most LIKE-patterns do not match any endpoint values.
While PostgreSQL supports all types of LIKE-patterns like our method, it has worse accuracy than LPLM in all cases. PostgreSQL returns a constant number for most of the LIKE-queries, leading to overestimation for LIKE-patterns that have low actual cardinalities.
For the most part, the effect that we have observed so far is a strong dependence between an approach and its way of generating training and test queries. Nevertheless, LPLM performs comparably to or better than the state-of-the-art approaches on the queries they support, and it clearly outperforms PostgreSQL, the only state-of-the-art approach that supports all queries, in all cases.
Next, we study the estimation accuracy (Q-error) of LPLM and the state-of-the-art approaches depending on the true cardinalities of the LIKE-queries. Figure 4 plots the accuracy (geometric mean) over four different bins of queries from the Full Benchmark query set. We observe that there is no dependency between the true cardinalities of the queries and the accuracy of LPLM. The reason for this result is that LPLM is trained on randomly generated LIKE-patterns that provide a more representative summary of a dataset. In contrast to LPLM, most state-of-the-art approaches show a substantial dependency between accuracy and true cardinality. The minimum support threshold value is a key bottleneck for P-SPH. P-SPH uses 6,750 (DBLP-AN), 8,750 (IMDb-AN), 7,500 (IMDb-MT), and 3,000 (TPCH-PT) as the minimum support thresholds. If a LIKE-pattern in the bins [0, 10], [11, 20], and [20, 51] matches an endpoint value in the histogram, the estimated cardinality will be at least the minimum support, which is excessively larger than the true cardinalities of the LIKE-queries in that benchmark. However, if a LIKE-pattern in these bins falls into the non-match case, P-SPH returns 10% of the minimum support value as the cardinality estimate. Hence, P-SPH always overestimates LIKE-queries that have a true cardinality of less than 10% of the minimum support. The accuracy of Astrid decreases as the actual cardinality increases, since Astrid is trained on LIKE-patterns that do not contain wildcards and, therefore, have low true cardinalities. Consequently, Astrid cannot capture the underlying structure of LIKE-patterns that have higher cardinalities and contain wildcards. LBS shows its worst performance for LIKE-queries in the bin [11, 20]. The reason that LBS fails for these queries is that LBS estimates the cardinality of a LIKE-query based on the cardinalities of the N-grams in its N-gram table. Since the N-gram table is pruned to keep only N-grams with a cardinality greater than 19, there are few matches between the base strings of these LIKE-queries and the N-grams in the N-gram table. Therefore, LBS mostly returns an estimated cardinality of less than 1 for LIKE-patterns that have an actual cardinality lower than 20. LBS shows its best performance for LIKE-patterns with cardinalities less than 10 because most of the LIKE-queries in the bin [0, 10] have an actual cardinality of 1, which results in a Q-error of 1.
Negative Query Set. In this experiment, we evaluate the performance of LPLM and the state-of-the-art approaches on the negative query set. Table 6 shows the estimation accuracy of all methods for negative LIKE-patterns on all datasets. LBS has the best performance for the negative query set because it returns 0 as the estimated cardinality for every query in this set. The LBS summary structure only retains extended N-grams whose true cardinalities equal or exceed 20. Consequently, LBS achieves a Q-error of 1 for the negative queries, giving it the best performance. Astrid has the second-best performance on the negative query set because most queries in its training set have a true cardinality of 1, which works in Astrid's favor here. LPLM learns from the order of tokens and their conditional probabilities. Since the order of tokens and the conditional probabilities of the LIKE-patterns in the negative set are not learned by LPLM, it does not estimate the cardinality of negative LIKE-patterns as accurately as the cardinality of queries that yield at least one result. PostgreSQL returns a constant number for each query in the negative query set. P-SPH has the worst performance because its underlying histogram structure only contains frequent patterns, and the negative LIKE-patterns do not match any endpoint values.
Runtime. We conduct a runtime assessment for the entire pipeline of LPLM and the state-of-the-art approaches. While we use a GPU for the training phase of the machine-learning-based methods, we only use the CPU for inference in order to get a fair comparison between machine-learning-based and traditional approaches. Our analysis is divided into three phases for LPLM and Astrid, and into two phases for LBS and P-SPH. The pre-processing phase encompasses the computation of the training dataset for LPLM and Astrid, along with the summary structure for LBS and P-SPH. In the training phase, we report the training time for LPLM and Astrid. In the online phase, we capture the query (estimation) time for all methods. To obtain the runtime for each approach, we randomly select 10,000 LIKE-queries that each method can support for each data set. The results of our analysis are reported in Table 7.
Impact of Model Size. In this section, we report on the memory requirements of each method in our study. First, we report the average model size over all data sets for the configuration of each method used in the previous experiments. LPLM requires around 0.93 MB to store its selectivity estimation model. Astrid requires around 0.51 MB to store the embedding and the selectivity estimation model. LBS requires around 260 MB to store the N-gram table. P-SPH requires around 140 kB to store its histogram. Finally, PostgreSQL requires only 1.9 kB to store the most common values and a histogram.
To ensure a fair comparison of all methods, we therefore evaluate the accuracy of all methods w.r.t. different model sizes (0.1, 0.5, 1.0, and 2.0 MB). For LPLM and Astrid, we alter the hyperparameters to adjust the size of the models. For P-SPH, we adjust the number of histogram buckets. For LBS, we adjust the pruning threshold. Finally, for PostgreSQL, we modify the statistics target for a specific column in a table. We report the Q-error on 10,000 randomly selected LIKE-queries that all methods support for the different model sizes in Figure 5. The results are consistent with those reported in Table 5. Specifically, LPLM demonstrates the lowest Q-error, followed by Astrid. Both LPLM and Astrid show an improvement with increasing model size, particularly between 0.5 and 2.0 MB. PostgreSQL comes third, with its accuracy being constant, as the memory required by PostgreSQL is so low that it is always well below the given space budget. Finally, LBS and P-SPH profit somewhat from a larger model size, but their accuracy is still clearly worse than that of all other methods.

Ablation Study
In this section, we conduct an ablation study to examine the effectiveness of our proposed cardinality estimator LPLM. First, we report results for different training data set sizes on the accuracy of LPLM. Second, we evaluate the effectiveness of LPLM for different types of LIKE-patterns. Finally, we evaluate the effect of the dataset size on the estimation accuracy.

Impact of Training Data Set Size.
To study the effect of the training data set size on the accuracy of LPLM, we conduct an experiment varying the size of the training data set between 50 thousand and 5 million LIKE-patterns for all data sets, and we report the results in Figure 6. We observe an improvement in the performance of LPLM as we increase the size of the training data for all data sets. The improvement is more pronounced between 50 thousand and 1 million LIKE-patterns and is even more apparent for the 90th and the 99th percentile. We observe a smaller improvement in all measures of LPLM between 1 and 5 million LIKE-patterns. Based on these results, we conclude that our method can produce accurate cardinality estimates, even if the training data set is much smaller than the one we used in the first part of our evaluation.

Impact of LIKE-Pattern Type.
Next, we investigate the impact of different types of LIKE-patterns on the performance of LPLM. For this experiment, we classify LIKE-patterns into the following types: exact, substring, prefix-suffix, and patterns that contain at least one wildcard in-between characters (cf. Table 3). Table 8 reports the results of our measurements. LPLM shows the highest accuracy for the exact type of LIKE-patterns because this type imposes two constraints on the sequence of probabilities, one at the beginning and one at the end. The difference is more apparent in the 90th and 99th percentile. For the other types of LIKE-patterns, we observe no substantial difference in the accuracy of LPLM. In fact, LPLM yields similar accuracy for all pattern types, with the exception of the 99th percentile for queries that contain wildcard(s) in-between characters, where the Q-error is somewhat higher.

Impact of Data Set Size.
In this section, we study the impact of the dataset size on the accuracy of LPLM and on the model training time. We used all actor names (4.16M) and movie titles (2.52M) from the IMDb benchmark, all author names (3.49M) from the DBLP benchmark, and all part names (4M) from the TPC-H benchmark. The average training time over all datasets was 5.6 hours. We report the Q-errors for LPLM in Table 9. Comparing these results with the results for LPLM reported in Table 5, we observe only a slight increase in the Q-errors for the larger data sets. This increase, although not very significant, can be attributed to the larger vocabulary size.

End-to-End Query Runtimes
In this section, we evaluate the performance of LPLM in terms of end-to-end query runtimes in PostgreSQL 14.5. We used the IMDb dataset and executed the Join Order Benchmark (JOB) [23] workload. To inject cardinalities into PostgreSQL, we extended Han et al.'s patch [8] to accept outside estimations for LIKE and equality predicates on string attributes in addition to the existing support for predicates over categorical and numerical attributes. We trained LPLM on the contents of the columns movie_companies.note, title.title, name.name, and link_type.link. We measure end-to-end runtimes for a total of 61 queries that include LIKE and/or equality predicates on these columns.
In addition to the baseline given by PostgreSQL, we study four different configurations. The first configuration (TrueCard) injects true cardinalities into the PostgreSQL optimizer for all types of predicates. The second configuration (TrueCard+PG) uses true cardinalities for string predicates (LIKE and equality) but uses PostgreSQL's own estimates for predicates on categorical and numerical attributes. The third configuration (LPLM+TrueCard) uses our estimates for string predicates but true cardinalities for the other predicates. Finally, the fourth configuration (LPLM+PG) uses our estimates for string predicates, but PostgreSQL's estimates for the other types of predicates. Since PostgreSQL and LPLM are the only two techniques that support general LIKE-patterns (cf. Table 3), we have decided to focus on them in this experiment.
The results of this end-to-end runtime experiment are reported in Tables 10a and 10b. In Table 10a, we report the number of queries out of the 61 total queries that improved and regressed, as well as the respective average improvement and regression. Before we discuss these results in more detail, we need to point out that the PostgreSQL optimizer shows unexpected behavior in this experiment that seems to be inconsistent with the findings of Leis et al. [23], in the sense that injecting true cardinalities does not yield better plans. Nevertheless, LPLM+PG improves the largest number of queries in this experiment. Also, as reported in Table 10b, the same configuration improves the average and cumulative runtime of all 61 queries the most. Table 11 reports the estimation accuracy (Q-error) of PostgreSQL and LPLM on JOB. Comparing these numbers to the Q-errors reported in Table 5 leads us to the conclusion that the LIKE-predicates in JOB are less challenging than the ones in the query sets proposed together with the state-of-the-art methods, which further explains the relatively small total runtime improvement gained by using LPLM+PG on JOB. We consider these initial end-to-end runtime results to be promising. However, more work is required (a) to study why the PostgreSQL optimizer shows this inconsistent behavior and (b) to establish more comprehensive and challenging benchmarks that enable us to study the impact of cardinality estimation for LIKE-predicates more systematically.

CONCLUSION AND FUTURE WORK
In this paper, we introduced the LIKE-Pattern Language Model (LPLM), a novel method for estimating the cardinality of general LIKE-patterns. Our approach employs a new language and a novel probability distribution function to capture the syntax and semantics of general LIKE-patterns. To train our neural language model, we also proposed a method that extracts random LIKE-patterns from randomly selected rows of a sequence database. Through an experimental evaluation on four real-world data sets, we demonstrated that our method outperforms the state of the art in terms of accuracy (Q-error) while supporting more types of LIKE-patterns.
As a next step, our goal is to extend LPLM to support cardinality estimation for regular expressions.We also plan to examine the applicability of our model for estimating the cardinality of approximate string matching queries.

Fig. 4. Geometric mean of Q-errors of different methods for LIKE-patterns varying the actual result cardinality.

Fig. 6. Geometric mean of Q-errors of LPLM for different sizes of the training data set.

Table 2. Statistics for the data sets used in our evaluation.

Table 3. Types of LIKE-queries that are supported by the estimation methods.

Table 4. Summary of the statistics of the different query sets used in the experimental evaluation.

Table 5. Comparison of the estimation accuracy (Q-error) of LPLM against traditional and machine-learning-based baselines.

Table 8. Estimation accuracy of LPLM for different types of LIKE-patterns.

Table 9. Estimation accuracy of LPLM for larger datasets.

Table 10. Experiments on the Join Order Benchmark.

Table 11. Estimation accuracy of LPLM and PostgreSQL on the Join Order Benchmark.