RAT: Retrieval-Augmented Transformer for Click-Through Rate Prediction

Predicting click-through rates (CTR) is a fundamental task for Web applications, where a key issue is to devise effective models for feature interactions. Current methodologies predominantly concentrate on modeling feature interactions within an individual sample, while overlooking the potential cross-sample relationships that can serve as a reference context to enhance the prediction. To remedy this deficiency, this paper develops a Retrieval-Augmented Transformer (RAT), aiming to acquire fine-grained feature interactions within and across samples. By retrieving similar samples, we construct an augmented input for each target sample. We then build Transformer layers with cascaded attention to capture both intra- and cross-sample feature interactions, facilitating comprehensive reasoning for improved CTR prediction while retaining efficiency. Extensive experiments on real-world datasets substantiate the effectiveness of RAT and suggest its advantage in long-tail scenarios. The code has been open-sourced at \url{https://github.com/YushenLi807/WWW24-RAT}.


INTRODUCTION
Click-through rate (CTR) prediction is a binary classification task that aims to forecast whether a user will click on a given item. It is broadly applicable in commercial fields such as advertising placement and recommender systems [4,9,11,17]. Feature interaction modeling plays an essential role in CTR prediction. As shown in Figure 1, traditional methods [2,10,13,15] primarily focus on feature interactions within each sample, but seldom consider cross-sample information that can serve as a reference context to enhance the prediction. Since features and their interactions are usually sparse, CTR models are forced to capture and memorize all interaction patterns, posing challenges in robustness and scalability.
Recently, retrieval-augmented (RA) learning has proven effective in natural language processing [5] and computer vision [1]; its typical idea is to retrieve similar samples and enhance model prediction with these external demonstrations. Inspired by its success in relieving long-tail problems [12], we believe it is a promising paradigm for relieving the aforementioned issue in CTR prediction. In this direction, RIM [14], DERT [19], and PET [3] are three preliminary works on RA CTR prediction. However, they each compromise either intra- or cross-sample feature interaction, which remains a sub-optimal practice. Specifically, RIM simply aggregates retrieved samples on each feature field, which sacrifices fine-grained cross-sample feature interactions.

To remedy the shortcomings of previous works, we propose a unified framework termed Retrieval-Augmented Transformer (RAT) to enhance fine-grained intra- and cross-sample feature interactions for CTR prediction. Given a target sample, we retrieve similar samples from a reference pool (e.g., historical logs) using a sparse retrieval algorithm. We then develop a Transformer-based model to acquire fine-grained feature interactions within and across samples. In particular, we find that intra-cross cascaded attention not only improves the efficiency beyond joint modeling but also enhances the robustness of RAT. Without bells and whistles, we condense the semantic information into one token representation, which is fed to a binary classifier to make the final prediction.
We conduct extensive experiments on three real-world datasets: ML-Tag, KKBox, and Tmall, demonstrating the promise of retrieval-augmented approaches and the further improvement brought by RAT. We also show that RAT can enhance long-tail sample prediction, which suggests its capacity to tackle feature sparsity and cold-start issues.
To summarize, we make the following contributions: ♣ We propose the Retrieval-Augmented Transformer (RAT) for CTR prediction, which enhances fine-grained intra- and cross-sample feature interactions in a unified model. ♣ We find that intra-cross cascaded attention not only improves efficiency but also enhances the robustness of RAT (§3.3.1). ♣ Extensive experiments on real-world datasets validate RAT's efficacy and suggest its advantage against feature sparsity and cold-start issues.

THE PROPOSED METHOD
We design the Retrieval-Augmented Transformer (RAT), considering both intra- and cross-sample interactions for CTR prediction. Figure 2 briefly illustrates the framework of RAT.

Retrieve Similar Samples as Context
Given the $F$-field record $x_t = [x_t^1; \ldots; x_t^F]$ of a target sample, we search for similar samples as the reference context from a reserved sample pool $\mathcal{P}$. We use BM25 [16] for retrieval because of its training-free nature, which also aligns with previous works [3,14]. Specifically, the relevance score between the query $q = x_t$ and the key $k$ of a candidate sample $(k, v) \in \mathcal{P}$ is defined as
$$\mathrm{score}(q, k) = \sum_{f=1}^{F} \log \frac{|\mathcal{P}| - n_{\mathcal{P}}(q^f) + 0.5}{n_{\mathcal{P}}(q^f) + 0.5} \cdot \mathbb{I}\{q^f \in k\},$$
where $\mathbb{I}\{\cdot\}$ is an indicator function, $|\mathcal{P}|$ is the number of samples in $\mathcal{P}$, and $n_{\mathcal{P}}(q^f)$ denotes the number of samples in $\mathcal{P}$ that contain the feature $q^f$. Different from Du et al. [3] and Qin et al. [14], which implemented the retrieval with Elasticsearch on CPUs, we provide an efficient GPU-based implementation to enable faster retrieval.
Finally, we retrieve the $K$ samples $\{(x_i, y_i)\}_{i=1}^{K}$ from $\mathcal{P}$ with the highest scores.

Notes on Avoiding Information Leakage. We sort the samples in chronological order when timestamp information is available and restrict a query to retrieve only samples that occur earlier than it.
For validation and testing, we take the whole training set as the reference pool. This strategy still satisfies the restriction and is safe, because the testing set and the validation set are the latest and second-latest parts of the whole dataset in our experiments.
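To make the retrieval step concrete, the sketch below shows one way the BM25 scoring over categorical fields could be run on a GPU with PyTorch. It is a minimal illustration under our own assumptions rather than the released implementation: the function name `bm25_topk`, the tensor layout, and the assumption that each field holds a single categorical value (so term frequency is binary and matching can be done field-wise) are ours, and the chronological masking for leakage avoidance is omitted for brevity.

```python
import torch

def bm25_topk(pool_feats, query_feats, doc_freq, k=5):
    """Hypothetical GPU sketch of BM25-style scoring over categorical fields.

    pool_feats:  (P, F) long tensor of feature ids in the reference pool.
    query_feats: (B, F) long tensor of feature ids of the target records.
    doc_freq:    (num_features,) tensor; doc_freq[j] = n_P(j), the number of
                 pool samples that contain feature j.
    Returns the indices of the top-k pool samples for each query.
    """
    num_pool = pool_feats.size(0)                              # |P|

    # IDF weight of each query feature: log((|P| - n + 0.5) / (n + 0.5)).
    n = doc_freq[query_feats].float()                          # (B, F)
    idf = torch.log((num_pool - n + 0.5) / (n + 0.5))          # (B, F)

    # With one categorical value per field, the BM25 term frequency is 0/1,
    # so the score is just the sum of IDF weights over exactly matched fields.
    match = (query_feats.unsqueeze(1) == pool_feats.unsqueeze(0)).float()  # (B, P, F)
    scores = (match * idf.unsqueeze(1)).sum(dim=-1)            # (B, P)

    # NOTE: masking out pool samples that occur later than the query
    # (to avoid information leakage) is omitted here for brevity.
    return scores.topk(k, dim=-1).indices                      # (B, k)
```

Because the score is a sum of per-field IDF weights over exact matches, the whole computation reduces to broadcast comparisons followed by a top-k, which is what makes a GPU implementation straightforward.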

Construct Retrieval-augmented Input
We build an embedding layer to transform discrete features into $d$-dimensional embedding vectors. In particular, we treat the labels of retrieved samples as special features and also build an embedding table for that label field. For a retrieved sample $(x_i, y_i)$, we look up the feature embeddings and the label embedding from the embedding tables and obtain a representation $\mathbf{E}_i \in \mathbb{R}^{(F+1) \times d}$. Finally, the retrieval-augmented input $\tilde{\mathbf{E}} \in \mathbb{R}^{(K+1) \times (F+1) \times d}$ for the target record $x_t$ is obtained by stacking the target embeddings $\mathbf{E}_t$ with $\{\mathbf{E}_i\}_{i=1}^{K}$.
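As an illustration of how the retrieval-augmented input could be assembled, the PyTorch sketch below stacks the target record with its $K$ retrieved samples into a $(K+1) \times (F+1) \times d$ tensor. The module name `RAInput`, the single shared feature embedding table, and the placeholder label id used to fill the target record's label slot are assumptions made for this sketch; the paper only states that retrieved labels are embedded as a special field.

```python
import torch
import torch.nn as nn

class RAInput(nn.Module):
    """Sketch of building the retrieval-augmented input of shape (B, K+1, F+1, d).

    A shared feature embedding table and a placeholder label id for the
    target record are assumptions made for illustration only.
    """
    def __init__(self, num_features, dim, num_labels=2):
        super().__init__()
        self.feat_emb = nn.Embedding(num_features, dim)      # all feature fields
        self.label_emb = nn.Embedding(num_labels + 1, dim)   # labels + placeholder
        self.placeholder_id = num_labels                      # target's label slot

    def forward(self, target_x, retrieved_x, retrieved_y):
        # target_x:    (B, F)     feature ids of the target record
        # retrieved_x: (B, K, F)  feature ids of the K retrieved samples
        # retrieved_y: (B, K)     labels of the K retrieved samples
        B = target_x.size(0)

        tgt = self.feat_emb(target_x)                                   # (B, F, d)
        pad = torch.full((B, 1), self.placeholder_id,
                         dtype=torch.long, device=target_x.device)
        tgt = torch.cat([tgt, self.label_emb(pad)], dim=1)              # (B, F+1, d)

        ret = self.feat_emb(retrieved_x)                                # (B, K, F, d)
        ret_label = self.label_emb(retrieved_y).unsqueeze(2)            # (B, K, 1, d)
        ret = torch.cat([ret, ret_label], dim=2)                        # (B, K, F+1, d)

        # Stack the target and its retrieved samples along the sample axis.
        return torch.cat([tgt.unsqueeze(1), ret], dim=1)                # (B, K+1, F+1, d)
```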

Intra- and Cross-sample Feature Interaction
To integrate intra- and cross-sample feature interactions for enhanced CTR prediction, a naïve idea is to unstack all retrieved samples, append them to the target record as extra feature fields, and use joint attention to model full feature interactions. However, this exhibits an efficiency issue: the complexity of each joint self-attention is $O((K+1)^2 \cdot (F+1)^2)$. We also find it inferior in performance (Table 4), possibly due to the influence of noisy feature interactions. To address these issues, we decompose the joint attention. As shown in Figure 2, each RAT block comprises the cascade of an intra-block, a cross-block, and a multi-layer perceptron (MLP). The forward process of the $\ell$-th RAT block is formulated as
$$\begin{aligned}
\mathbf{Z}_\ell &= \mathrm{ISA}(\mathrm{LN}(\mathbf{H}_\ell)) + \mathbf{H}_\ell, \\
\mathbf{Z}'_\ell &= \mathrm{CSA}(\mathrm{LN}(\mathbf{Z}_\ell)) + \mathbf{Z}_\ell, \\
\mathbf{H}_{\ell+1} &= \mathrm{MLP}(\mathrm{LN}(\mathbf{Z}'_\ell)) + \mathbf{Z}'_\ell,
\end{aligned}$$
where $\mathbf{H}_\ell$ denotes the input of the $\ell$-th RAT block with $\mathbf{H}_0 = \tilde{\mathbf{E}}$, and $\mathbf{Z}_\ell$ and $\mathbf{Z}'_\ell$ denote the hidden states. $\mathrm{LN}(\cdot)$ is the layer norm operation. $\mathrm{ISA}(\cdot)$ and $\mathrm{CSA}(\cdot)$ represent the intra-sample and cross-sample attention modules, which perform multi-head self-attention along the field axis and the sample axis, respectively.
Compared with the vanilla joint attention, our design of cascaded attention favorably reduces the complexity to $O((K+1)^2 + (F+1)^2)$, which helps keep RAT efficient. We will also investigate more block designs in the experiments (§3.3.1) and demonstrate the advantages of cascaded attention for RAT.
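The following PyTorch sketch illustrates how one RAT block with the cascaded attention described above could be written. The pre-LayerNorm residual wiring, the head count, and the MLP expansion ratio are assumptions made for this illustration; the essential point is that intra-sample attention runs along the field axis within each sample, while cross-sample attention runs along the sample axis for each field.

```python
import torch
import torch.nn as nn

class RATBlock(nn.Module):
    """Sketch of a RAT block with cascaded intra-/cross-sample attention.

    Input h has shape (B, S, F, D) with S = K+1 samples and F = F+1 fields.
    """
    def __init__(self, dim, num_heads=2, mlp_ratio=2):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.ln3 = nn.LayerNorm(dim)
        self.isa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.csa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.ReLU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, h):
        B, S, F, D = h.shape

        # Intra-sample attention (ISA): attend over fields within each sample.
        x = self.ln1(h).reshape(B * S, F, D)
        x, _ = self.isa(x, x, x, need_weights=False)
        h = h + x.reshape(B, S, F, D)

        # Cross-sample attention (CSA): attend over samples for each field.
        x = self.ln2(h).permute(0, 2, 1, 3).reshape(B * F, S, D)
        x, _ = self.csa(x, x, x, need_weights=False)
        h = h + x.reshape(B, F, S, D).permute(0, 2, 1, 3)

        # Position-wise MLP.
        return h + self.mlp(self.ln3(h))
```

To get a feel for the complexity gap, with the default $K=5$ retrieved samples and an assumed $F=30$ feature fields, joint attention scales with $(K+1)^2 (F+1)^2 = 36 \times 961 \approx 3.5 \times 10^4$, whereas the cascaded form scales with $(K+1)^2 + (F+1)^2 = 997$. Stacking several such blocks over the retrieval-augmented input and feeding a condensed token representation to a binary classifier would complete the model, matching the high-level description in the introduction.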

EXPERIMENTS

Implementation Details.
For fair comparisons, we implement all methods with FuxiCTR [21] and follow the BARS [20] benchmark settings. For retrieval-augmented models, we set $K$ to 5 by default.

Comparison with State-of-the-arts
3.2.1 Overall Performance. We report the model performances in Table 2. Note that for CTR prediction, a ‰-level AUC increase is considered acceptable, as even such a minor improvement, if statistically significant, can lead to substantial gains in revenue [6,11]. According to the results in Table 2, retrieval-augmented approaches show clear promise, and RAT achieves further improvement over them.

3.3.1 Designs of RAT Block. In addition to the proposed design, we further explore more variants: (i) Intra-Cross Joint Modeling (JM): the vanilla joint attention modeling for all intra- and cross-sample feature interactions. (ii) Intra-Cross Cascaded Encoder (CE): rather than cascading the attentions within a single block, it separates the intra- and cross-sample attention into two cascaded Transformer blocks, yielding a doubled block number compared to other variants. (iii) Intra-Cross Parallel Attention (PA): it designs the intra- and cross-sample attentions as two parallel branches; the hidden dimension of each branch is halved, and the outputs of the two branches are concatenated. We report the experimental results on model performance and efficiency in Tables 4 and 5, respectively. From Table 5 we can see that decomposed modeling designs are indeed more efficient than joint modeling (i.e., JM).
From Table 4 we can learn that joint modeling is not the best practice, possibly due to the influence of noisy feature interactions. In contrast, attention decomposition provides an inductive bias for model robustness and leads to better performance in general. Comparing RAT and RAT-CE, block-level decomposition does not bring extra gain beyond attention-level decomposition. Comparing RAT and RAT-PA, cascaded attention performs slightly better, suggesting that the successive order benefits feature interaction modeling.

CONCLUSION
In this paper, we present the Retrieval-Augmented Transformer (RAT) for CTR prediction. It highlights the importance of cross-sample information and explores Transformer-based architectures to capture fine-grained feature interactions within and between samples, effectively remedying the shortcomings of existing works. By decomposing the intra- and cross-sample interaction modeling, RAT enjoys better efficiency while further enhancing robustness. We conduct comprehensive experiments to validate the effectiveness of RAT and show its advantage in tackling long-tail data.

Figure 2: The overall framework of RAT.

Table 1: Statistics of the datasets.

Table 2: CTR prediction performance comparison. Δ indicates the averaged AUC improvement compared to xDeepFM.

Table 3: Performance w.r.t. long-tail users on the ML-Tag subset.

Table 4: Performance comparison of different RAT designs.

Table 5: Efficiency comparison of different RAT designs.