Beyond Two-Tower Matching: Learning Sparse Retrievable Cross-Interactions for Recommendation

Two-tower models are a prevalent matching framework for recommendation and have been widely deployed in industrial applications. The success of two-tower matching is attributed to its efficiency in retrieval among a large number of items, since the item tower can be precomputed and used for fast Approximate Nearest Neighbor (ANN) search. However, it suffers from two main challenges: limited feature interaction capability and reduced accuracy in online serving. Existing approaches attempt to design novel late interactions instead of dot products, but they still fail to support complex feature interactions or lose retrieval efficiency. To address these challenges, we propose a new matching paradigm named SparCode, which supports not only sophisticated feature interactions but also efficient retrieval. Specifically, SparCode introduces an all-to-all interaction module to model fine-grained query-item interactions. In addition, we design a discrete code-based sparse inverted index that is jointly trained with the model to achieve effective and efficient model inference. Extensive experiments on open benchmark datasets demonstrate the superiority of our framework. The results show that SparCode significantly improves the accuracy of candidate item matching while retaining the same level of retrieval efficiency as two-tower models. Our source code will be available at MindSpore/models.


INTRODUCTION
Industrial recommender systems generally consist of candidate matching and ranking phases, where the candidate matching phase efficiently retrieves user-preferred items out of a large pool of candidates. The matching result directly determines the input of the ranking phase and is therefore critical to the quality of a recommender system. Due to the large number of candidate items (e.g., millions or more), matching models need not only high recall but also efficient retrieval to achieve low latency.
Two-tower models [7,22,33,35] are a primary paradigm for candidate matching due to their good accuracy and support for efficient top-k retrieval. As shown in Figure 1(a), a two-tower model uses dual encoders (i.e., query and item encoders) to obtain the query and item representations separately and computes the final score with a simple dot product or cosine similarity. For inference, item embeddings can be pre-computed and cached, while only the user embedding is computed online. By utilizing the cached item embeddings and fast Approximate Nearest Neighbor (ANN) search (e.g., Faiss [11]), two-tower matching has proven to be a good industrial practice in real-world applications [6,33,35].
However, two-tower models have the following limitations. (a) Limitation 1: Limited feature interaction capability. There is only one dot product interaction between the two towers, resulting in a limited ability to model fine-grained feature interactions between queries and items. Although the dot product is an effective feature interaction method [27], it is not always optimal, especially when rich content features are available. (b) Limitation 2: Reduced accuracy in online serving. In many industrial scenarios, the candidate item pool is so large that exhaustively comparing against all candidate items becomes impractical under inference time constraints. ANN search methods (e.g., HNSW [21]) have been used to build indexes and speed up retrieval, but this may reduce the recall rate since the model and the index are usually not trained end-to-end.
To address the above limitations, we propose a new matching paradigm, SparCode, for improving recommendation accuracy and retrieval efficiency. SparCode consists of two key designs: all-to-all interaction and a sparse inverted index. To enhance feature interaction capability, the all-to-all interaction module utilizes a single expressive encoder to capture fine-grained interactions between all query features and all item features, as shown in Figure 1(b). It allows different forms of encoder structures (e.g., Self-Attention [30], MLPs, CrossNet [32]) and can learn more fine-grained and predictive feature interactions between the query and the item (i.e., early fusion) than the simple dot product in two-tower models (i.e., late fusion). Fine-grained feature interactions improve model capacity and thus accuracy, as reported in previous research [20,25]. However, compared with two-tower models, the joint query-item encoding makes it hard to use existing ANN approaches for efficient retrieval, which makes such models impractical for matching.
To speed up model inference, a straightforward solution is to pre-compute the scores of user-item pairs and cache them. Thus, SparCode introduces the sparse inverted index that is widely used in sparse retrieval. However, there are two problems with building an index over users: (1) The number of users is too large to enumerate, resulting in unacceptably expensive precomputation. (2) Each user (index key) corresponds to a large number of items, which brings great storage pressure. In response to these problems, SparCode leverages vector quantization (VQ) as a bridge between all-to-all interaction and sparse inverted indexing: a query is quantized into a series of discrete codes and corresponding code representations, where the code replaces the query as the index key and the code representation is used for all-to-all interaction. Since the number of codes is controllable and much smaller than the number of queries, the problem of an excessive number of index keys is alleviated. Further, SparCode designs a controlled sparse score function that allows each index key to store only the most relevant candidates, greatly relieving storage pressure.
In summary, we present the advantages of SparCode from both the model structure and inference perspectives as follows. For model structure, SparCode supports sophisticated forms of all-to-all interactions and efficient top-k retrieval over large-scale candidates by linking VQ with a sparse inverted index (for Limitation 1). SparCode introduces all-to-all interaction directly into the model structure, significantly enhancing feature interaction capability, rather than using all-to-all interaction models indirectly (e.g., Knowledge Distillation (KD) [3]) or using parameter-less interaction methods (e.g., ColBERT [13]). In addition, sparse inverted indexes are supported by well-established search tools, such as ElasticSearch, ensuring usability in industrial applications.
For model inference, SparCode reduces the gap between model training and inference, since the model and index structure are trained end-to-end (for Limitation 2). In the index structure of SparCode, each index key is a code, which maps to the scores between that code and the candidate items. The mapping between queries and codes, as well as the scores, are learned during model training instead of being built independently after training in ANN libraries such as Faiss. Thus, we reduce the performance drop in this stage, which is also empirically verified in our experiments (Sec. 4.4).
All in all, SparCode, as an all-to-all interaction-based retrieval paradigm, improves accuracy through enhanced feature interactions while keeping inference efficient. Although we mainly focus on recommender systems in this work, SparCode has the potential to be extended to other tasks, such as cross-modal retrieval. We conduct extensive experiments on two public datasets and show that SparCode is significantly more accurate than the two-tower model (Sec. 4.2) and comparably efficient to the ANN-based two-tower model (Sec. 4.4). Our contributions are summarized as follows.
• To the best of our knowledge, SparCode is the first sparse retrieval framework that supports sophisticated forms of all-to-all interactions and controllable sparse inverted indexing for recommender systems.
• SparCode converts queries to discrete codes, and thus makes pre-computed scores possible. Besides, SparCode enables efficient retrieval with a sparse inverted index structure, which has mature indexing tool support for industrial deployment.
• Our experiments on public datasets show that SparCode brings a significant improvement in accuracy, while achieving comparable efficiency to the two-tower matching.

RELATED WORK

2.1 Two-tower Models and Variants
Two-tower matching, the combination of the two-tower model and an ANN framework, is the dominant paradigm in dense retrieval, owing to its high accuracy and efficient top-k retrieval. Two-tower matching is widely deployed in various applications [2,6,7,34]. For some of these two-tower models, BARS [39] provides experimental results and leaderboards on multiple datasets for further understanding. The feature interaction capability between the query and item towers is limited by the dot product, especially for scenarios with rich features. This has been verified in cross-modal retrieval tasks. For example, in the text-image retrieval model ViLT [14], all-to-all interaction models perform significantly better than standard two-tower models. To sum up, enhancing feature interaction between the two towers is an important direction for performance improvement.
There are two kinds of variants that try to overcome the above limitations: (a) using more advanced shallow interactions such as MaxSim [13] or attention mechanisms [9]; (b) knowledge distillation, i.e., transferring knowledge from a stronger interaction-based model to the two-tower model [3]. However, these variants still do not support true all-to-all feature interaction. In addition, some work [10,37] attempts to mitigate the performance loss during inference by jointly training the model and the index. Unlike these methods, SparCode supports sophisticated forms of all-to-all interaction models with superior performance, and uses a sparse inverted index instead of the ANN framework.

Vector Quantization
Vector quantization refers to the discretization of vectors into codes, while product quantization (PQ) is a variant that sharply increases the number of codes by applying VQ multiple times. Traditional product quantization is applied as a post-processing step that converts vectors into multiple discrete codes to speed up nearest neighbor search. In recent years, deep quantization has been applied successfully to generation tasks [23,24,29]. In retrieval tasks, VQ is also used for multimodal unified modeling [19] or representation reconstruction [36]. In general, the code representation obtained by deep quantization is used as an alternative to the query representation to express similar information.
Different from these works, we use VQ to generate codes for the inverted indexes.The corresponding code representations of the query are used for the all-to-all interaction with the item representations.

OUR METHODOLOGY

3.1 Overview
In this section, we present our matching framework SparCode. An overview of SparCode is shown in Figure 2. To summarize, the Tokenizer and Quantizer are designed to obtain multiple discrete representations of a query, supporting sparse inverted indexes for maintaining efficiency. The all-to-all interaction-based scorer enhances the interaction between query and item for improving accuracy.
Briefly, we compare the differences between the two-tower model, the all-to-all interaction model, and our SparCode as follows.
Two-tower model:
$$y(q, i) = f_1(q)^{\top} f_2(i) \tag{1}$$
All-to-all interaction model:
$$y(q, i) = f(q, i) \tag{2}$$
SparCode:
$$y(q, i) = \sum_{c \in T(q)} f(c, i) \tag{3}$$
where $q$ and $i$ represent a query and an item respectively; $f_1(\cdot)$ and $f_2(\cdot)$ are the query and item encoders respectively; $f(\cdot)$ refers to the all-to-all interaction encoder, which outputs the score; and $T(\cdot)$ refers to the Tokenizer and Quantizer, which converts $q$ into token embeddings and looks up the codes and their alternative code representations. With these equations, we highlight the design motivation of SparCode. Two-tower models (i.e., Eq. 1) adopt dot product scoring to support efficient embedding-based retrieval. All-to-all interaction models (i.e., Eq. 2) support arbitrarily advanced encoders for fine-grained feature interaction. Eq. 2 is usually used for ranking a small number of candidates due to its inefficient inference. SparCode (i.e., Eq. 3), based on Eq. 2, defines the overall score function as the sum of the scores between each code and the candidate. This means that arbitrary queries can be replaced by codes that share a code vocabulary of manageable size. Considering codes as words, the inverted indexes widely used in text retrieval [28] can be easily transferred to the recommendation task. With this definition, SparCode not only introduces a powerful all-to-all interaction but also achieves efficient retrieval similar to sparse retrieval.
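The contrast between the three scoring schemes can be sketched in a few lines of Python. This is an illustrative sketch only: `toy_scorer` is a hypothetical stand-in for a learned interaction encoder, and the vectors are made-up values rather than outputs of any trained model.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_tower_score(query_vec: Vector, item_vec: Vector) -> float:
    # Two-tower style: both sides are encoded independently, then a dot product.
    return dot(query_vec, item_vec)


def all_to_all_score(scorer: Callable[[Vector, Vector], float],
                     query_vec: Vector, item_vec: Vector) -> float:
    # All-to-all style: one joint scorer sees the query and the item together.
    return scorer(query_vec, item_vec)


def sparcode_score(scorer: Callable[[Vector, Vector], float],
                   code_vecs: Sequence[Vector], item_vec: Vector) -> float:
    # SparCode style: the query is replaced by its code representations and
    # the final score is the sum of the per-code scores.
    return sum(scorer(c, item_vec) for c in code_vecs)


def toy_scorer(a: Vector, b: Vector) -> float:
    # Hypothetical non-linear scorer, used only to make the example runnable.
    return max(0.0, dot(a, b)) ** 2


item = [1.0, 2.0]
codes = [[0.5, 0.5], [1.0, -1.0]]  # code representations of one query
print(two_tower_score([1.0, 1.0], item))
print(sparcode_score(toy_scorer, codes, item))
```

Because each per-code score depends only on a (code, item) pair, it can be computed before any query arrives, which is what makes the inverted index possible.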

Matching Framework: SparCode
3.2.1 Tokenizer. To support a sparse inverted index, and inspired by sparse retrieval, we first encode the query $q$ into multiple token embeddings. The role of the Tokenizer is to represent the query as $K_q$ tokens. Specifically, a query in recommendation (i.e., a user) is encoded into $K_q$ tokens, representing the multiple interests of the user.
A query may consist of images, text, sequential information, or categorical features. Uniformly, we represent a query $q \in \mathcal{Q}$ and a candidate item $i \in \mathcal{I}$ as a series of embeddings, i.e.,
$$E_q \in \mathbb{R}^{n_q \times d}, \qquad E_i \in \mathbb{R}^{n_i \times d},$$
where $E_q$ and $E_i$ are the $n_q$ and $n_i$ embeddings of a given $q$ and $i$, respectively; $d$ is the dimension of each embedding in $E_q$ and $E_i$. Then, we formalize the $k$-th token representation as follows:
$$t_q^k = \mathrm{Tokenizer}_k(E_q) \in \mathbb{R}^{d_t},$$
where $d_t$ is the embedding dimension of a token, which depends on the tokenizer settings and usually equals $d$. The specific form of the Tokenizer depends on the given task and features; e.g., if the query is a sequence of historical user clicks, the Tokenizer can be chosen from GRU [4], Self-Attention [12,17], Capsule Network [17], etc. Similarly, we denote an item as $K_i$ tokens $t_i \in \mathbb{R}^{K_i \times d_t}$.
3.2.2 Quantizer. For text matching tasks, even though the number of queries is very large, they share the same token table (i.e., vocabulary), which limits the number of keys in the inverted index to the size of the token table. However, query token embeddings are dense and not shared between queries, making it impossible to build a reasonable number of index keys. Thus, in this section, the Quantizer transforms query token embeddings into shared codes and their representations by discretization.
In vector quantization [29], a codebook refers to a series of numbered vectors whose numbers are called codes. Quantization takes a query as input and returns a code and the corresponding vector by looking up the codebook, so that arbitrary queries share the same codebook. By utilizing VQ, we quantize token embeddings into discrete codes. We construct $M$ codebooks $C^{(m)} \in \mathbb{R}^{N \times d_t/M}$, each of size $N$, i.e., holding $N$ ordered embeddings. For $t_q^k$, we split it into $M$ sub-embeddings and replace each one with its nearest codebook entry:
$$t_q^k = [t_q^{k,1}; \dots; t_q^{k,M}], \qquad c^{k,m} = \arg\min_{n} \left\| t_q^{k,m} - e_n^{(m)} \right\|_2, \qquad \hat{t}_q^{k,m} = e_{c^{k,m}}^{(m)},$$
where $C^{(m)}$ is the $m$-th codebook and $e_n^{(m)}$ is its $n$-th embedding; $\arg\min$ stands for looking up the most similar sub-token embedding from the given codebook, with similarity defined by the Euclidean distance between the two sub-embeddings. Thus, we get the complete updated token embedding:
$$\hat{t}_q^k = [\hat{t}_q^{k,1}; \dots; \hat{t}_q^{k,M}].$$
We use the indexes of the $M$ replaced sub-embeddings to form the corresponding discrete code. For example, in Figure 2, suppose that $M = 2$ and that the 2nd and $N$-th sub-embeddings are taken from $C^{(1)}$ and $C^{(2)}$ respectively; then the discrete code is $(2, N)$.
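The lookup described above can be sketched in plain Python. This is a minimal illustration assuming nearest-neighbor search by squared Euclidean distance and 1-based code indexes (to match the example in the text); the function name `quantize` and the toy codebooks are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(token: Sequence[float],
             codebooks: Sequence[Sequence[Sequence[float]]]
             ) -> Tuple[Tuple[int, ...], List[float]]:
    """Split `token` into len(codebooks) sub-embeddings and replace each one
    with its nearest entry in the matching codebook.

    Returns the discrete code (1-based entry indexes) and the concatenated
    replacement embedding.
    """
    num_books = len(codebooks)
    if len(token) % num_books != 0:
        raise ValueError("token dimension must be divisible by the number of codebooks")
    sub_dim = len(token) // num_books
    code: List[int] = []
    quantized: List[float] = []
    for j, book in enumerate(codebooks):
        sub = token[j * sub_dim:(j + 1) * sub_dim]
        # arg min over the codebook entries for this sub-embedding
        nearest = min(range(len(book)), key=lambda n: squared_l2(sub, book[n]))
        code.append(nearest + 1)
        quantized.extend(book[nearest])
    return tuple(code), quantized


# Two toy codebooks, each holding three 2-dimensional sub-embeddings.
book_1 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
book_2 = [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]
code, replaced = quantize([0.9, 1.2, 4.0, 4.5], [book_1, book_2])
print(code, replaced)
```

In this toy run the first half of the token is closest to the 2nd entry of the first codebook and the second half to the last entry of the second codebook, mirroring the $(2, N)$ example above.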
There are two considerations for the design of the codebooks. Firstly, the size $N$ of each codebook should not be set too large, since an excessively large codebook slows down the $\arg\min$ lookup. Secondly, $M$ should not be too small, since we need enough model capacity to represent different queries. With these considerations in mind, we discuss the choice of $M$ and $N$. The number of discrete codes is at most $N \times N \times \dots \times N = N^M$. If $M$ is 1, no matter how many queries there are, there will ultimately be only $N$ different query embeddings. To adequately represent the different queries, $N$ would then have to be large, but this rapidly increases the number of parameters and may reduce the speed of quantization. If $M$ is greater than or equal to 2, it is possible to represent a sufficient number of distinct queries with a small number of parameters and a fast quantization speed.
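The trade-off can be made concrete with a little arithmetic. The sketch below assumes the symbols used above (number of codebooks $M$, capacity $N$, token dimension $d_t$) and that each codebook stores sub-embeddings of size $d_t / M$; the helper names are hypothetical.

```python
def num_codes(num_codebooks: int, capacity: int) -> int:
    """Upper bound on distinct discrete codes: capacity ** num_codebooks."""
    return capacity ** num_codebooks


def codebook_parameters(num_codebooks: int, capacity: int, token_dim: int) -> int:
    """Floats stored when each codebook holds `capacity` sub-embeddings of
    size token_dim / num_codebooks."""
    return num_codebooks * capacity * (token_dim // num_codebooks)


def distance_computations(num_codebooks: int, capacity: int) -> int:
    """Sub-embedding distance evaluations needed to quantize one token."""
    return num_codebooks * capacity


for m, n in [(1, 1024), (2, 64)]:
    print(f"M={m}, N={n}: codes={num_codes(m, n)}, "
          f"params={codebook_parameters(m, n, 64)}, "
          f"distances={distance_computations(m, n)}")
```

With a 64-dimensional token, two codebooks of 64 entries offer four times as many codes as one codebook of 1024 entries while storing a sixteenth of the parameters and needing an eighth of the distance evaluations.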
Furthermore, as "arg min" is a non-differentiable operation, there is a non-negligible optimization problem here. Specifically, the original token embedding $t_q^k$ is unable to obtain the gradient from the updated token embedding $\hat{t}_q^k$, which ultimately results in the preceding parameters (e.g., the tokenizer) not being updated. We describe the corresponding model training solution in Section 3.2.4.

3.2.3 All-to-all Interaction-based Scorer. As a parameter-free operation, the dot product of two-tower models accomplishes both feature interaction and scoring between queries and items, which potentially limits the expressiveness of the model.
Different from the previous variants of the two-tower model, we design a parameterized, learnable scorer that supports complex interactions between queries and items, named the All-to-all Interaction-based Scorer.
The proposed scorer does not restrict the specific form of feature interaction, either explicit (inner product, FM [26], CrossNet [32], Attention [18,30]) or implicit (e.g., DNN). As an example, we combine the inner product and MLPs to give a hybrid scoring function (Eq. 9), where $y^k \in \mathbb{R}^1$ is the matching score between a code and a candidate item $i$, and $\mathrm{sg}[\cdot]$ represents the stop-gradient operation. Since the codebooks and the rest of the model are optimized separately, $\mathrm{sg}[\cdot]$ is introduced to avoid affecting the parameters of the codebooks. Besides, since the query is represented by $K_q$ token embeddings, the above scoring function yields $K_q$ separate scores.
Considering that a query or token is only related to a part of the candidate items, we define the Sparse Score and the Final Score as follows:
$$\hat{y}^k = \mathrm{ReLU}(y^k - b) \tag{10}$$
$$y = \sum_{k=1}^{K_q} \hat{y}^k \tag{11}$$
where $b$ is a learnable bias for training. The scoring function, especially the part corresponding to Eq. 10, is simple but critical for sparse indexing. The bias term $b$ of the ReLU in Eq. 10 is the threshold that determines whether the relevance score $\hat{y}^k$ is set to 0. If $\hat{y}^k$ equals 0, there is no need to cache it in advance for online serving (see Section 3.3 for details).
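A small sketch may help to see why the thresholded score leads to a sparse cache. It assumes the sparse score takes the form ReLU(raw score minus bias), which is one reading of the text (the bias acts as a threshold); the exact sign convention and the function names here are assumptions.

```python
from typing import Iterable


def sparse_score(raw_score: float, bias: float) -> float:
    # Assumed form: ReLU(raw_score - bias). Scores at or below the bias
    # become exactly 0 and therefore never need to be cached.
    return max(0.0, raw_score - bias)


def final_score(raw_scores: Iterable[float], bias: float) -> float:
    # Final score: sum of the sparse scores over all codes of the query.
    return sum(sparse_score(s, bias) for s in raw_scores)


# Raw scores of one item against the three codes of a query (made-up values).
raw = [1.0, 0.25, -0.5]
print([sparse_score(s, 0.5) for s in raw])
print(final_score(raw, 0.5))
```

Only the first code contributes a non-zero score here, so only one (code, item) entry would have to be stored for this item.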

3.2.4 Optimization.
To make training of the quantization process more stable and faster, we use the exponential moving average (EMA) to update the codebooks and back-propagation to update the rest of the model, as in [24]. In each mini-batch, the parameters in the codebooks and the remainder of the model are updated by their corresponding methods.
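Model Training. The model here does not include the codebooks. The matching task is the most important objective, and SparCode adopts the sampled softmax loss for training as follows:
$$\mathcal{L}_{match} = -\sum_{i \in \mathcal{I}_q^{+}} \log \frac{\exp\left(y(q, i)\right)}{\sum_{j \in \mathcal{I}_q^{+} \cup \mathcal{I}_q^{-}} \exp\left(y(q, j)\right)}$$
where $\mathcal{I}_q^{+}$ and $\mathcal{I}_q^{-}$ represent the positive samples and the negative samples drawn for $q$, respectively.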
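For intuition, a sampled softmax loss for a single positive item can be sketched as below. This is a generic, numerically stable formulation written for illustration; it assumes one positive per query and is not taken from the authors' implementation.

```python
import math
from typing import Sequence


def sampled_softmax_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Negative log-probability of the positive item under a softmax taken
    over the positive and the sampled negatives."""
    scores = [pos_score, *neg_scores]
    peak = max(scores)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - pos_score


print(sampled_softmax_loss(2.0, [0.5, -1.0, 0.0]))
print(sampled_softmax_loss(4.0, [0.5, -1.0, 0.0]))  # larger margin, lower loss
```

The loss shrinks as the positive score moves further above the sampled negatives, which is the behavior the matching objective relies on.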

As mentioned earlier, arg min is a non-differentiable operation that blocks gradient propagation, so some parameters such as the tokenizer (or embedding table) cannot be updated. To update these parameters and make training more stable, we introduce the commitment loss as follows:
$$\mathcal{L}_{commit} = \sum_{k=1}^{K_q} \left\| t_q^k - \mathrm{sg}[\hat{t}_q^k] \right\|_2^2$$
Thus, the final loss is:
$$\mathcal{L} = \mathcal{L}_{match} + \beta \, \mathcal{L}_{commit} \tag{14}$$
where $\beta$ is a hyperparameter, usually set to 1 or 0.25.
Codebook Update. Following [24], we update the codebooks by EMA, where each embedding in a codebook is iteratively updated by a combination of itself and the token embeddings mapped to it; the decay factor of this moving average is a hyperparameter that adjusts the update rate of the codebook.
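The two training signals described above can be sketched as follows. The EMA rule shown is a simplified variant (moving average toward the mean of the assigned token embeddings), chosen for brevity; the parameter names `beta` and `decay` and the default values are assumptions, and the stop-gradient is only conceptual in this plain-Python sketch.

```python
from typing import List, Sequence


def commitment_loss(token: Sequence[float], quantized: Sequence[float]) -> float:
    # Squared L2 distance between a token embedding and its codebook
    # replacement. In a real framework the replacement would be wrapped in
    # a stop-gradient so that only the tokenizer side receives gradients.
    return sum((t - z) ** 2 for t, z in zip(token, quantized))


def total_loss(matching_loss: float, commit: float, beta: float = 0.25) -> float:
    # Final objective: matching loss plus beta times the commitment loss.
    return matching_loss + beta * commit


def ema_update(entry: Sequence[float],
               assigned_tokens: Sequence[Sequence[float]],
               decay: float = 0.99) -> List[float]:
    """Move a codebook entry toward the mean of the token embeddings that
    were mapped to it in the current mini-batch (simplified EMA)."""
    if not assigned_tokens:
        return list(entry)  # nothing mapped to this entry: keep it unchanged
    count = len(assigned_tokens)
    mean = [sum(tok[d] for tok in assigned_tokens) / count for d in range(len(entry))]
    return [decay * e + (1.0 - decay) * m for e, m in zip(entry, mean)]


print(commitment_loss([1.0, 2.0], [0.0, 0.0]))
print(total_loss(1.0, 4.0, beta=0.25))
print(ema_update([0.0, 0.0], [[2.0, 4.0], [4.0, 8.0]], decay=0.5))
```

A decay close to 1 makes the codebook change slowly, while a smaller decay lets it track the current batch more aggressively.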

Indexing and Retrieval
A query is transformed into $K_q$ codes by the Tokenizer and Quantizer, and each code has a score $y^k$ with each item $i$ given by the all-to-all interaction-based scorer. However, the storage cost may be unacceptable if the scores of all (code, item) pairs are cached. In most cases, a query is only highly relevant to a small part of the items, which makes a sparse inverted index possible since only the scores of the top relevant items need to be kept for each code.
Inspired by sparse retrieval and attracted by the efficiency of inverted indexes, we design a sparse inverted index mechanism for SparCode. We use 0 as a threshold to decide whether to store a score for inference. We rewrite Eq. 10 to use a controllable bias instead of the bias term learned at training time, named "Sparsity Control", as follows:
$$\hat{y}^k = \mathrm{ReLU}(y^k - b) \tag{15}$$
where $b$ is now set manually. Depending on the latency requirements and memory constraints, we adjust $b$ in Eq. 15 to determine the sparsity of the index. Figure 3(b) illustrates SparCode's online serving process. The right, blue part shows how the scores are cached, i.e., each code is sparsely cached with the scores of candidate items. It is important to note that these scores are pre-computed.
When serving online, the token embeddings are first obtained by the tokenizer, and the codes are looked up from the codebooks. Then, the corresponding item scores and filtered candidate itemsets are loaded from the cache. For example, in Figure 3, the itemsets cached under the query's codes are loaded and merged into the itemset $\{i_9, i_{24}, i_{13}, i_{25}\}$. Finally, Eq. 11 is used to obtain the final scores, and the items with the highest scores are taken as the recommended results.
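The offline caching and online lookup can be sketched with ordinary dictionaries. This is an illustrative sketch, not the authors' implementation: it assumes the ReLU(raw score minus bias) thresholding, uses made-up scores, and the names `build_index` and `retrieve` are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Code = Tuple[int, ...]


def build_index(code_item_scores: Dict[Code, Dict[Hashable, float]],
                bias: float) -> Dict[Code, Dict[Hashable, float]]:
    """Offline step: keep only the (code, item) scores that stay positive
    after subtracting the controllable bias."""
    index: Dict[Code, Dict[Hashable, float]] = {}
    for code, scores in code_item_scores.items():
        index[code] = {item: s - bias for item, s in scores.items() if s - bias > 0.0}
    return index


def retrieve(index: Dict[Code, Dict[Hashable, float]],
             query_codes: Sequence[Code], k: int) -> List[Tuple[Hashable, float]]:
    """Online step: load the cached itemsets of the query's codes, sum the
    scores per item, and return the k highest-scoring items."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for code in query_codes:
        for item, score in index.get(code, {}).items():
            totals[item] += score
    return heapq.nlargest(k, totals.items(), key=lambda kv: (kv[1], kv[0]))


# Made-up raw scores for two codes; item ids are placeholders.
raw_scores = {
    (2, 3): {"i9": 1.5, "i24": 0.75, "i13": 0.25},
    (1, 1): {"i13": 1.0, "i25": 2.0, "i9": 0.0},
}
index = build_index(raw_scores, bias=0.5)
print(index)
print(retrieve(index, [(2, 3), (1, 1)], k=2))
```

Raising `bias` shrinks every posting list, trading recall for memory and latency, which is the knob the Sparsity Control step exposes.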

Comparison with Two-tower Models
Figure 3 shows the comparison between SparCode and two-tower matching. We discuss these differences below from two perspectives: the modeling stage and the inference stage.
Modeling Stage. SparCode supports advanced all-to-all interaction encoders, providing better feature interaction capabilities than two-tower matching, which only supports parameter-free interactions like the dot product. Additionally, SparCode's quantizer allows sub-embeddings to be shared across different query representations, while each query's representation is independent in the two-tower model.
Inference Stage. As shown in the right part of Figure 3(a) and (b), we summarize the differences: (1) The cached content. Instead of caching the item embeddings, SparCode selectively caches the pre-computed (code, item) scores. (2) The cache structure. The SparCode cache consists of sparse hash tables for a sparse code-based inverted index, while two-tower matching caches a matrix of item embeddings or other index structures depending on the ANN settings.

The Deezer dataset contains 100,000 fully anonymous users and 50,000 music tracks. We follow [1], using 70,000 active warm users for training and the remaining 20,000 and 10,000 cold users for validation and testing. The dataset provides two pretrained embedding types named "TT-SVD" and "UT-ALS" to represent users, user features and songs.
Movielens-10M dataset is a classic recommendation dataset, consisting of 71,567 users and 65,133 items and over 10,000,000 user interactions.It contains abundant sequential user interactions and is widely evaluated for sequential item retrieval.We split all users into training, validation and test sets by the ratio of 8:1:1.

Competitors.
To evaluate the results, our proposed method is compared with several powerful baselines in recent literature.For simplicity, we abbreviate the "Two-Tower Model" as "TTM" for all the tables and figures.
For Deezer, we adopt a popularity-based method called "Popularity" and three models specialized in cold-start recommendation: DropoutNet [31], MeLU [16], and DeezerNet [1]. In addition, we compare SparCode with two two-tower models, DSSM [7] and SA [12], where DSSM is a well-known DNN-based two-tower model and SA utilizes a Self-Attention module as the encoder for feature encoding. We denote DSSM as "TTM w/ DNN" and SA as "TTM w/ SA". For SparCode, we choose DNN and Self-Attention modules as its tokenizer, denoted as "SparCode w/ DNN" and "SparCode w/ SA" respectively.
For Movielens-10M, we choose three of the most commonly used baselines: DSSM [7], GRU4Rec [4] and SASRec [12]. The sequential modeling module of each method is treated as the user tower. Correspondingly, we choose DNN, GRU, and Self-Attention modules as the tokenizer for SparCode on this dataset. We set the length of the behavior sequence to 20 for all methods.
For both recommendation datasets, we choose a vanilla All-to-All Self-Attention model as a strong baseline, denoted as "All-to-All SA". Without consideration of inference speed, the user features/histories and item features are fed into a multi-layer SA-based model simultaneously for better feature interaction. The performance gap between SparCode and All-to-All SA reveals how much effectiveness SparCode sacrifices for better efficiency.
4.1.3 Implementation Details. We choose Adam [15] as the optimizer with a learning rate of 0.001. We set the $\ell_2$ regularization factor for the embedding table to $10^{-6}$ and the dropout rate to 0.1. Uniformly, the hidden units of the DNNs are [256, 256, 256], and the number of Self-Attention layers is 3. For SparCode, we search for the best query token number $K_q$ from {1, 2, 4, 6, 8, 10}. The $K_q$ query token embeddings are obtained from SENet [5] or linear layers applied to the output of the Tokenizer. The $K_i$ item token embeddings are encoded by different linear layers. The hyperparameters of the codebooks are of vital importance, including the codebook number $M$, the codebook capacity $N$, and the size of each embedding. We grid search $M$ from {1, 2} and $N$ from {64, 128, 256, 512, 1024}. The item embedding size is set to 128 and 256 for "TT-SVD" and "UT-ALS" respectively on Deezer, and to 64 on Movielens-10M. $\beta$ in Eq. 14 is chosen from {0.25, 1}. To ensure the reliability of the results, we repeat each experiment five times with different random seeds. Our implementation is partially based on FuxiCTR [40], a library for CTR prediction.

Performance Analysis
A comparison of the performance of SparCode and the baselines is shown in Tables 1 and 2. SparCode performs well on both datasets, far better than the two-tower models and even close to the "oracle", the All-to-All SA model. Specifically, SparCode w/ DNN achieves a 71.15% relative improvement over DSSM for Precision@50 on Deezer. Compared with other state-of-the-art retrieval models such as MeLU, DropoutNet, and DeezerNet, SparCode also yields a significant effectiveness gain. For Movielens-10M, SparCode outperforms all two-tower models with the same sequential encoder on various metrics. For SparCode, we found that the self-attention module is more suitable as the tokenizer for encoding interacted feature fields or sequential user behaviors. Since the SA-based module performs best on both datasets, the All-to-All SA model acts like an "oracle" that shows the performance upper bound of SparCode. Specifically, we remove the PQ and sparse indexing modules from SparCode with SA for both datasets. The results are shown as "All-to-All SA" for Deezer and "All-to-All SASRec" for Movielens-10M. For Deezer with TT-SVD embeddings, the performance gap between SparCode with SA and All-to-All SA is only 0.4% for Precision@50. For Deezer with UT-ALS embeddings and for Movielens-10M, the performance gap is slightly larger but still acceptable. For instance, the relative gap on Recall@50 between SparCode w/ SA and All-to-All SA is 11%. This indicates that the use of PQ and a sparse index does not significantly degrade performance. SparCode achieves a good trade-off between effectiveness and efficiency.

Table 3: Effect of Interaction Type on Deezer with "TT-SVD" Embeddings. R: Recall, N: NDCG.
Model | Interaction Type | R@20 | R@50 | N@20 | N@50
TTM w/ DNN | Dot Product | 1.58% | …

Ablation Study
4.3.1 Interaction Type. We explore the impact of different types of feature interactions on SparCode. Considering that we model the query and the item as multiple token representations, we give a generic form for different feature interactions by rewriting Eq. 9 as follows.
Among these interaction types, Dot Product and MaxSim (from ColBERT [13]) are parameterless methods, while the other three have optimizable parameters. The number of CrossNet layers is set to 3. InnerPDNN is the interaction type shown in Eq. 9, which is the combination of Inner Product and DNN. Table 3 shows the experimental results of the various interaction types on Deezer with "TT-SVD" embeddings. "TTM w/ DNN" is the same as DSSM.
According to Table 3, complex interaction approaches (e.g., DNN, MaxSim) achieve significant performance gains compared with the simple dot product. Specifically, for DNN-based models, SparCode with MaxSim improves Recall@50 from 3.39% to 5.44% and NDCG@50 from 9.39% to 15.68% compared with SparCode with Dot Product. A similar phenomenon can be observed for the SA-based models, with 45.69% and 49.61% relative improvements in Recall@50 and NDCG@50 respectively. These results indicate that the dot product may not be sufficient to model the interaction between query and item. The DNN interaction performs significantly better than MaxSim and CrossNet. For SparCode with SA, the DNN interaction improves Recall@50 from MaxSim's 5.42% and CrossNet's 5.55% to 6.63%, which is a remarkable relative improvement. Since the Deezer dataset is rich in content semantics rather than restricted to sparse ID features, the cross feature network may not be as good as a DNN at modeling such implicit query-item interactions.
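The relative gains implied by the absolute numbers quoted above can be checked with a few lines of Python. The helper below is only a convenience for this arithmetic; the input values are the percentages stated in the text.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (baseline, improved) pairs quoted in the text, all in percent.
quoted = {
    "Recall@50, Dot Product -> MaxSim (DNN tokenizer)": (3.39, 5.44),
    "NDCG@50, Dot Product -> MaxSim (DNN tokenizer)": (9.39, 15.68),
    "Recall@50, MaxSim -> DNN interaction (SA tokenizer)": (5.42, 6.63),
    "Recall@50, CrossNet -> DNN interaction (SA tokenizer)": (5.55, 6.63),
}
for name, (base, new) in quoted.items():
    print(f"{name}: {relative_improvement(base, new):.2f}%")
```

For the DNN tokenizer, replacing the dot product with MaxSim corresponds to a relative Recall@50 gain of roughly 60%, and for the SA tokenizer the DNN interaction is roughly 20% above MaxSim and CrossNet.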
Another interesting observation is that the performance of SparCode with dot product interaction is always slightly lower than that of the two-tower models, suggesting that replacing query vectors with codebook vectors sacrifices some performance for efficiency. We leave reducing this performance gap as a future research topic.

4.3.2 Codebook Structure. Tuning codebooks is a key step in PQ. We explore the impact of hyperparameters related to the number and capacity of codebooks. We choose SparCode w/ DNN on Deezer and SparCode w/ SA on Movielens-10M for the experiments and select Recall@50 and NDCG@50 as evaluation metrics. According to the results in Figure 4, we have the following observations. Firstly, increasing the capacity $N$ of a single codebook is helpful for better performance. On Deezer, as the codebook size increases from 64 to 1024, both Recall and NDCG rise. When there is only one codebook ($M$=1), Recall@50 and NDCG@50 improve by 16.92% and 13.29% when $N$ increases from 64 to 1024. When $M$ is 2, the improvement is 6.53% and 5.84%. This is because a larger $M$ brings more codeword combinations, which already enhances the representation capacity of PQ, so the relative performance boost from a larger $N$ is less significant. To conclude, we recommend increasing the codebook capacity for better performance, especially when there is only one codebook.
Secondly, the number of codebooks $M$ has a substantial effect on the final performance. As shown in Figure 4(a)(b), comparing $M$=1, $N$=1024 with $M$=2, $N$=64, the latter is already more effective than the former. We believe the shared information between codebooks contributes to the increase. When $M$>1, different queries may share the same codeword in a specific codebook, so the learned information is shared through this codeword. We believe such representation sharing potentially prevents overfitting. However, it is worth noting that increasing $M$ raises the computational cost. Choosing a proper $M$ is therefore also a trade-off between efficiency and effectiveness.

4.3.3 Effect of K. The tokenizer of SparCode encodes the query and the item into K token vectors for later interaction. We utilize learnable linear layers or SENet [8] to transform various inputs into a fixed number of tokens. We investigate the effect of the hyperparameter K. The results are shown in Figure 5. For a sequential recommendation dataset such as MovieLens-10M, increasing K improves performance on various metrics, which indicates that the encoded tokens represent multiple user interests and are beneficial for better interest matching. For a non-sequential recommendation dataset such as Deezer, the improvement is less clear. Thus we can set a small K (e.g., K = 2) for better efficiency.
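The linear variant of the tokenizer can be sketched as follows. This is only an illustration, assuming that the input features have already been concatenated into a single vector and that each of the K tokens has its own linear projection; the shapes, the names, and the random initialization are hypothetical and not taken from the SparCode implementation.

```python
import numpy as np


def linear_tokenizer(features, weights, biases):
    """Project one feature vector into K token vectors.

    features: (d_in,)            concatenated input features
    weights:  (K, d_in, d_token) one linear projection per token
    biases:   (K, d_token)
    returns:  (K, d_token)
    """
    return np.einsum("i,kio->ko", features, weights) + biases


rng = np.random.default_rng(0)
d_in, d_token, K = 16, 8, 2  # illustrative sizes only

features = rng.normal(size=d_in)
weights = rng.normal(size=(K, d_in, d_token)) * 0.1
biases = np.zeros((K, d_token))

tokens = linear_tokenizer(features, weights, biases)
print(tokens.shape)  # (2, 8)
```

A larger K produces more token vectors per query or item, which increases the cost of the subsequent interaction module; this is the efficiency side of the trade-off discussed above.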

Sparsity, Performance and Speed
The sparse indexing mechanism plays a key role in SparCode deployment. In this section, we explore the impact of cache sparsity on model performance and efficiency. To be more intuitive, we provide two evaluation metrics for sparsity, "Sparsity" and "Average Items". Specifically, "Sparsity" represents the ratio of not-cached code-item scores to the total scores, and "Average Items" represents the average number of cached scores per code. These two metrics are formulated as follows:

\[
\text{Sparsity} = \frac{\#\,\text{not-cached code-item scores}}{\#\,\text{all code-item scores}}, \qquad
\text{Average Items} = \frac{\#\,\text{cached code-item scores}}{\#\,\text{codes}}
\]

Effect of Sparsity on Performance. We control sparsity by adjusting the bias b in Eq. 15. The experimental results on the relationship between sparsity and performance are shown in Figure 6(a) and Figure 6(b), and they are surprising. For Deezer, given two codebooks of capacity 256 each, the setting with 99% sparsity achieves performance comparable to that with 0% sparsity on Recall and NDCG. This result also validates our observation that each code is highly correlated with only a small number of candidates. For memory-limited scenarios, the compressed model with 99.947% sparsity can still exceed the performance of two-tower models (refer to Table 1). For Movielens-10M, the conclusion still holds: caching only 1% of the scores is enough to support a significant accuracy improvement. These experimental results indicate that SparCode can be deployed online even under extreme conditions.
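The two sparsity metrics defined at the start of this section can be computed directly from a mask that records which code-item scores are cached. The sketch below assumes that the total number of scores equals the number of codes times the number of items; the function name and the toy mask are hypothetical.

```python
import numpy as np


def sparsity_metrics(cached_mask):
    """Return (Sparsity, Average Items) for a boolean (codes x items) mask.

    cached_mask[c, i] is True when the score of code c and item i is cached.
    Assumption: the total number of scores is num_codes * num_items.
    """
    num_codes, num_items = cached_mask.shape
    num_cached = int(cached_mask.sum())
    sparsity = 1.0 - num_cached / (num_codes * num_items)
    average_items = num_cached / num_codes
    return sparsity, average_items


# Toy example: 4 codes, 10 items, 4 cached scores in total.
mask = np.zeros((4, 10), dtype=bool)
mask[0, [1, 2]] = True
mask[1, 3] = True
mask[3, 0] = True

sparsity, average_items = sparsity_metrics(mask)
print(f"Sparsity = {sparsity:.3f}, Average Items = {average_items:.2f}")
```

In this toy case 36 of the 40 scores are not cached, so Sparsity is 0.9 and each code keeps one item on average.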
Effect of Sparsity on Inference Speed. With performance guaranteed, another concern is the inference speed. Here, we mainly consider the time taken to process each retrieval request, excluding pre-calculation and pre-loading time. We simulate the processing of a request in a real scenario. Given the resource constraints, query tower inference runs on the GPU, while top-k ANN retrieval runs on the CPU. In order to respond to requests quickly, only one or a small number of requests are processed per inference. We set the batch size to 1 for comparison.
Figure 6(c) compares the inference speed of SparCode and two-tower models (TTMs). As shown in Figure 6(c), SparCode has a query time similar to that of TTMs with a FlatIP index, but yields significantly better performance. For comparison, we also apply PQ methods such as the IVFPQ index to two-tower models, which lowers the query time at the cost of a slight performance loss. It is worth noting that SparCode is implemented entirely in Python under our experimental settings. Since many open-source search engines support much faster inverted-index search, SparCode could use these engines to achieve a better query time, similar to that of PQ-enhanced TTMs, with little performance loss.
Figure 6(d) shows the effect of different sparsity levels on query time. As the sparsity increases, the query time decreases accordingly. The evaluation performance also decreases, as discussed above.
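The online lookup discussed in this section can be sketched as a pure-Python inverted-index query. This is a minimal illustration, not the authors' implementation: it assumes that each code owns a posting list of cached item scores and that the scores of a query's codes are aggregated by summation; the aggregation rule, the index layout, and all identifiers are assumptions.

```python
import heapq
from collections import defaultdict


def retrieve_top_k(query_codes, inverted_index, k):
    """Accumulate cached scores over the query's codes and return the top-k items.

    inverted_index maps a code to {item_id: cached_score}. Items that were
    pruned by sparsification are simply absent from the posting list.
    """
    totals = defaultdict(float)
    for code in query_codes:
        for item_id, score in inverted_index.get(code, {}).items():
            totals[item_id] += score  # assumption: sum over codes
    return heapq.nlargest(k, totals.items(), key=lambda pair: pair[1])


# Toy index: a code is identified by (codebook index, codeword index).
inverted_index = {
    (0, 17): {"item_a": 1.0, "item_b": 0.5},
    (1, 3): {"item_b": 0.75, "item_c": 0.25},
}

top = retrieve_top_k([(0, 17), (1, 3)], inverted_index, k=2)
print(top)
```

The work per request is proportional to the total length of the posting lists that are touched, which is consistent with the observation that a sparser cache lowers the query time.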

Comparison with TDM.
TDM [38] is a representative tree-based matching model which, like SparCode, supports advanced interaction between user and item. In this section, we compare SparCode and TDM in terms of model design and performance. We trained two YoutubeDNN models on Deezer (TT-SVD), one with dot-product interaction and one with MLP interaction, and then trained and updated the tree index on top of the trained YoutubeDNN models following the guidance of TDM. The results of the SparCode and TDM experiments are shown in Table 4. According to Table 4, we have the following observations: (a) TDM-DNN with MLPs is better than TDM-DNN with dot product, with R@20 and R@50 improving from 1.53% to 2.31% and from 3.71% to 5.09%, respectively. This improvement mainly comes from the interaction method, which again validates the limited feature interaction capability of the dot product. (b) Compared to TDM-DNN with MLPs, SparCode-DNN achieves an average relative improvement of 35.07% on the Recall metrics, which reflects the accuracy advantage of SparCode's model architecture.
Although both support advanced interaction methods, SparCode and TDM differ in several respects. First, SparCode employs a sparse inverted index with $O(1)$ inference complexity, while TDM uses a tree-structured index with $O(\log N)$ inference complexity in the number of items $N$. That is, SparCode only needs to perform a top-k retrieval once, while TDM needs to perform it $O(\log N)$ times. Secondly, SparCode is trained end-to-end, while TDM trains the model and the index separately. In addition, the size of SparCode's inverted index is easy to control, whereas TDM usually needs to keep a full binary tree index.
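The difference in the number of top-k operations can be made concrete with a rough count. The sketch assumes a full binary tree with one leaf per item and one top-k selection per tree level for the tree-based index, versus a single top-k selection for the inverted index; the corpus sizes are illustrative and the function names are hypothetical.

```python
def tree_topk_rounds(num_items: int) -> int:
    """Levels of a full binary tree with at least num_items leaves.

    Assumption: a tree-based retriever performs one top-k selection per level.
    """
    return (num_items - 1).bit_length()


def inverted_index_topk_rounds(num_items: int) -> int:
    """A sparse inverted index needs a single top-k selection per request."""
    return 1


for n in (1024, 1_000_000):
    print(n, tree_topk_rounds(n), inverted_index_topk_rounds(n))
```

Under these assumptions a corpus of one million items requires about 20 selection rounds for the tree index, while the inverted index always requires one.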

CONCLUSION
In this paper, we summarize two limitations of the two-tower model: limited feature interaction capability and reduced accuracy in online serving. To address these two limitations, we propose a new matching paradigm, SparCode, for improving both recommendation accuracy and inference efficiency. By linking vector quantization and sparse inverted indexing, SparCode introduces an all-to-all interaction module to achieve fine-grained interaction between user and item features while maintaining efficient retrieval with $O(1)$ complexity. In addition, we design the sparse fraction function to control the size of the index structure and thereby reduce the storage pressure. Extensive experimental results on two public datasets show that SparCode achieves substantially better accuracy than two-tower matching with comparable efficiency. In the future, we will further explore the application of SparCode to other tasks.

Figure 1: Two-tower model and all-to-all interaction model.

(ANN) search (e.g., Faiss [11]), two-tower matching has proven to be a good industrial practice in real-world applications [6, 33, 35]. However, two-tower models have the following limitations: (a) Limitation 1: Limited feature interaction capability. There is only one dot-product interaction between the two towers, resulting in a limited ability to model fine-grained feature interactions between queries and items. Although the dot product is a good feature interaction method [27], it is not always optimal, especially when rich content features are available. (b) Limitation 2: Reduced accuracy in online serving. In many industrial scenarios, the candidate item pool is so large that exhaustively comparing against all candidate items becomes impractical due to inference time constraints. ANN search methods (e.g., HNSW [21]) have been used to build indexes and speed up retrieval, but unfortunately this may reduce the recall rate, since the model and the index are usually not trained end-to-end.

To address the above limitations, we propose a new matching paradigm, i.e., SparCode, for improving recommendation accuracy and retrieval efficiency. SparCode consists of two key designs: all-to-all interaction and a sparse inverted index. To enhance feature interaction capability, the all-to-all interaction module utilizes a single expressive encoder to capture fine-grained interactions between all query features and all item features, as shown in Figure 1(b). It allows different forms of encoder structures (e.g., Self-Attention [30], MLPs, CrossNet [32]), and can learn more fine-grained and predictive feature interactions between the query and the item (i.e., early fusion) than the simple dot product in two-tower models (i.e., late fusion). Fine-grained feature interactions improve model capacity and thus result in improved accuracy, as reported in previous research [20, 25].

However, compared with two-tower models, the nature of query-item encoding makes it hard to use existing ANN approaches for efficient retrieval, making it impractical. To speed up model inference, a straightforward solution is to pre-compute the scores of user-item pairs and cache them. Thus, SparCode introduces the sparse inverted index that is widely used in sparse retrieval. However, building an index over users raises two problems: (1) The number of users is too large to enumerate exhaustively, resulting in an unacceptably expensive precomputation. (2) Each user (index) corresponds to a large number of items, which brings great storage pressure. In response to these problems, SparCode leverages vector quantization (VQ) as a bridge between all-to-all interaction and sparse inverted indexing, i.e., the query is quantized into a series of discrete codes and corresponding code representations, where the code replaces the query as the index and the code representation is used for all-to-all interaction. Since the number of codes is controllable and much smaller than the number of queries, the problem of the large number of indexes is alleviated. Further, SparCode designs a controlled sparse

Figure 4: The number and capacity of codebook(s).
Comparison of query time under different sparsity levels. The statistics above the points denote Recall@50.

Figure 6: Effects of sparsity on performance and query time.

Table 1: Performance comparisons on Deezer. The second-best results are in bold and the best results are underlined.

Table 2: Performance comparisons on Movielens-10M. The second-best results are in bold and the best results are underlined.

Table 4: Performance comparison of SparCode and TDM.

In Table 4, SparCode-DNN denotes SparCode with YoutubeDNN, using the same parameters as TDM.