Linear Recurrent Units for Sequential Recommendation

State-of-the-art sequential recommendation relies heavily on self-attention-based recommender models. Yet such models are computationally expensive and often too slow for real-time recommendation. Furthermore, the self-attention operation is performed at a sequence-level, thereby making low-cost incremental inference challenging. Inspired by recent advances in efficient language modeling, we propose linear recurrent units for sequential recommendation (LRURec). Similar to recurrent neural networks, LRURec offers rapid inference and can achieve incremental inference on sequential inputs. By decomposing the linear recurrence operation and designing recursive parallelization in our framework, LRURec provides the additional benefits of reduced model size and parallelizable training. Moreover, we optimize the architecture of LRURec by implementing a series of modifications to address the lack of non-linearity and improve training dynamics. To validate the effectiveness of our proposed LRURec, we conduct extensive experiments on multiple real-world datasets and compare its performance against state-of-the-art sequential recommenders. Experimental results demonstrate the effectiveness of LRURec, which consistently outperforms baselines by a significant margin. Results also highlight the efficiency of LRURec with our parallelized training paradigm and fast inference on long sequences, showing its potential to further enhance user experience in sequential recommendation.


INTRODUCTION
Sequential recommender systems play a crucial role in personalization platforms, such as Netflix [28] and Amazon [19], by capturing sequential patterns from user action history (e.g., user clicks) to accurately predict the next action. Over time, these recommenders have undergone significant advances, from traditional approaches like Markov chains [9,27] to deep neural networks like convolutional neural networks (CNNs) [32,35] and recurrent neural networks (RNNs) [12,13,18]. Recently, self-attentive recommenders (SARs), inspired by transformers and their variants [4,34], have further improved both training efficiency and recommendation accuracy, and thus represent the current state-of-the-art [11,15,30]. Despite their superior performance and training efficiency, SARs face criticism from the recommendation community due to their inference efficiency issues [24]. Specifically, SARs [11,15,30] heavily rely on computing item-to-item weights (referred to as "self-attention") at the sequence level to encode user representations over time. That is, each time a new user action occurs, the system appends the action to the user history and then recomputes all item-to-item weights to incrementally update the user representation. This process incurs significant time and space costs. In contrast, RNNs [12,13,18] differ in the following respects: (1) In terms of inference, RNNs perform efficiently by storing a fixed-length hidden-state vector as the user representation; they can retrieve and update this vector stepwise whenever a new user action occurs. (2) RNNs face challenges in training efficiency: the recurrent non-linear units in RNNs prevent parallelizable training, leading to cumbersome training-sequence augmentation and slow convergence. (3) Moreover, empirical studies have shown that existing RNNs fail to match the recommendation performance of SARs [11,15,30].
In this work, we propose a novel sequential recommender model, linear recurrent units for sequential recommendation (LRURec), which not only outperforms SARs in recommendation accuracy and training efficiency with parallelizable training (see Figure 1), but also matches the inference efficiency of RNNs by maintaining only a fixed-length user vector for incremental updates. Our motivation stems from two key observations: (1) the effectiveness of linear models in handling long sequences, as demonstrated in natural language processing (NLP) tasks and models [23,25,31]; and (2) our finding that linear RNN recommenders can achieve recommendation performance comparable to vanilla non-linear RNN recommenders. To begin, we introduce LRU [5,23], a linear version of the traditional recurrent unit, which supports parallelizable training with the proposed recursive parallelization. This enables more efficient training compared to its non-linear counterparts. Furthermore, we incorporate a series of sequential modeling techniques from self-attentive architectures, such as layer normalization [1], residual connections [8], and position-wise feed-forward networks [34]. These further improve recommendation accuracy to outperform existing SARs, while retaining RNN-like inference efficiency because the self-attention computation is replaced with an efficient RNN-like recurrence operation.
We summarize the contributions of our work as follows:
• We introduce a linear recurrent unit-based architecture, LRURec, into sequential recommendation, which addresses the dilemma between training efficiency, inference efficiency, and recommendation performance.

RELATED WORK

Sequential Recommendation
Sequential recommendation has the primary objective of predicting the user's next item of interest by modeling their past actions in chronological order [13,15,27,32]. The central focus of sequential recommendation lies in the design of a model architecture that is both effective and efficient. First, CNN- and RNN-based models were proposed to leverage the expressiveness of deep neural networks for modeling user sequences [13,18,32], outperforming linear sequential models such as FMC and FPMC [27]. Nevertheless, RNN-based models faced limitations in training efficiency due to the lack of parallel training. Subsequently, transformer-based models [11,15,30] emerged as a solution, offering accelerated training and improved recommendation accuracy through the parallelizable self-attention operation [34]. With efficient training and accurate item-to-item relevance via self-attention, transformer-based models stand as the state-of-the-art for sequential recommendation. Additionally, MLP-based sequential recommenders [20,21] like FMLP-Rec [38] exhibit comparable recommendation performance while relying solely on the MLP architecture. However, the inference efficiency of transformer- and MLP-based models lags behind that of RNN-based models [24]. Specifically, whenever a new user action appears, RNN-based models only need to update the latest hidden state, whereas both transformer- and MLP-based models must recompute sequence-level relevance for inference.
In this work, we aim to propose a new model architecture for sequential recommendation based on linear recurrent units [23], which enjoys both the training efficiency of transformer-based models and inference efficiency of RNN-based models.

Efficient Language Modeling
The difficulty of building an NLP model that simultaneously achieves (1) training efficiency, (2) inference efficiency, and (3) strong model performance (see Table 1) is known as the "impossible triangle" [31]. Various approaches have been proposed to achieve these goals from different angles. Attention-free transformer (AFT) and RWKV [25,37] simplify token-to-token attention weights to element-wise operations and move the softmax operation to the key vectors, showing performance comparable to a vanilla transformer model [34] after scaling up [25]. S4 [5] builds on state space models (SSMs) with continuous-time memorization [6], and demonstrates efficiency and effectiveness in long-sequence modeling (e.g., the long-range arena benchmark [33]). Linear recurrent units [23] match the long-sequence performance of SSMs with an improved initialization strategy and systematic ablations. Most recently, RetNet [31] proposes attention-like retention with equivalent recurrent and parallel formulations; RetNet can thus be trained in parallel while achieving low inference costs like RNN models.
In our work, we study the efficient modeling problem in the context of sequential recommendation. The differences are: (1) all of the above methods are primarily studied and evaluated in NLP, where existing solutions (e.g., RWKV) are not specifically designed for recommendation and can lead to overfitting and latency issues; and (2) the majority of such methods are optimized for long-range modeling tasks, which may cause performance issues on recommendation tasks that prioritize short sequences of user actions. Different from existing approaches, we propose a novel sequential recommender based on linear recurrence. With the carefully designed LRU block and recursive parallelization, our model performs well regardless of sequence length and achieves both training and inference efficiency for sequential recommendation.

METHODOLOGY
As shown in previous works [5,23,31], efficient NLP models are capable of capturing long-term dependencies in a wide range of sequence-to-sequence tasks. Additionally, such models demonstrate significantly improved efficiency thanks to parallelizable training and incremental inference. Motivated by these observations, the proposed linear recurrent units for sequential recommendation (LRURec) incorporate the advantages of both RNN-based and transformer-based models, while requiring significantly reduced computing power via efficient incremental inference and rapid convergence via parallelized training. We introduce the key components and the overall model of LRURec in the following.

Setup
Data: Let input x ∈ X represent the user action history [v_1, v_2, ..., v_t] of length t in chronological order, where the elements are represented in the item space I (i.e., v_i ∈ I). The ground truth y is the next user action v_{t+1} ∈ I (i.e., y = v_{t+1}).
Model: The recommender model is denoted with function f, with d being the hidden dimension in f. Upon input x, f generates prediction scores over I. Ideally, f predicts the ground-truth item y with the highest score for data pair (x, y), namely y = arg max f(x).
Learning: The optimization of the recommender model corresponds to maximizing the likelihood of the ground-truth item y given input x. In other words, the learning objective is to minimize the negative log-likelihood loss L over distribution X:

min_f E_{(x,y)∼X} [ L(f(x), y) ].  (1)
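The objective above can be illustrated with a minimal sketch: for a single training pair, the negative log-likelihood of the target item under a softmax over the model's scores. The function name and toy scores below are hypothetical, not from the paper.

```python
import math

def neg_log_likelihood(scores, target):
    """NLL of the target item given unnormalized scores over the item space I."""
    m = max(scores)                                  # stabilize the softmax
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]                    # -log softmax(scores)[target]

# Toy check: the loss is smaller when the target item scores higher.
scores = [0.1, 2.0, -0.5]
assert neg_log_likelihood(scores, 1) < neg_log_likelihood(scores, 0)
```

Minimizing this loss over (x, y) pairs drawn from X is exactly the objective in Equation (1).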

Linear Recurrent Unit
Decomposing Linear Recurrence. We first introduce a simplified form of linear recurrence. For input x_t at time step t, we compute the hidden representation h_t and output y_t with learnable matrices A, B, C and D, where the input and output dimensions are denoted with d_in and d_out (i.e., the embedding size), and the hidden dimension size with d_N:

h_t = A h_{t-1} + B x_t,   y_t = C h_t + D x_t.  (2)

Different from RNN models (i.e., h_t = σ(A h_{t-1} + B x_t)), we discard the non-linearity σ to enable the serialization of h_t. By unrolling h_t = A h_{t-1} + B x_t along the time steps, h_t can be written in closed form w.r.t. the matrices A, B and the input elements x_1, x_2, ..., x_t:

h_t = Σ_{k=1}^{t} A^{t-k} B x_k.  (3)

However, the repeated matrix multiplication is inefficient and may lead to numerical issues as t increases (e.g., overflow). To this end, we leverage matrix diagonalization (i.e., eigendecomposition) and introduce eigenvalues and eigenvectors to (1) reduce matrix-level computation and (2) control the numerical stability of h_t. In particular, we decompose A as A = P Λ P^{-1}, in which Λ is diagonal and consists of the eigenvalues λ_1, λ_2, ..., λ_{d_N}, and P is an invertible matrix of size d_N × d_N. Consequently, the computation of A^k reduces to P Λ^k P^{-1}, thereby significantly improving computational efficiency.
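The unrolling step can be checked numerically. The following sketch (toy sizes, with a diagonal transition standing in for Λ and an element-wise stand-in for B, both hypothetical values) confirms that the step-by-step recurrence and the closed form of Equation (3) agree:

```python
# With a diagonal transition, the recurrence h_t = lam * h_{t-1} + b * x_t
# unrolls to the closed form h_t = sum_k lam^(t-k) * b * x_k, and powers of
# the diagonal are element-wise -- the point of the eigendecomposition.
lam = [0.9, 0.5]          # diagonal of Lambda (eigenvalues of A)
b   = [1.0, 2.0]          # element-wise stand-in for B
xs  = [0.3, -1.0, 0.7]    # scalar inputs x_1..x_t

# Step-by-step recurrence.
h = [0.0, 0.0]
for x in xs:
    h = [lam[j] * h[j] + b[j] * x for j in range(2)]

# Closed form: h_t[j] = sum over k of lam[j]^(t-k) * b[j] * x_k.
t = len(xs)
closed = [sum(lam[j] ** (t - k) * b[j] * xs[k - 1] for k in range(1, t + 1))
          for j in range(2)]

assert all(abs(h[j] - closed[j]) < 1e-12 for j in range(2))
```

Note how |lam| < 1 keeps the powers bounded, which previews the stability condition discussed next.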
Representation in Complex Space. Despite the eigendecomposition A = P Λ P^{-1}, the eigenvalues and eigenvectors of A do not necessarily lie in the real space R. Therefore, we extend P and Λ to the complex space with P ∈ C^{d_N×d_N} and Λ = diag(λ_1, λ_2, ..., λ_{d_N}) ∈ C^{d_N×d_N}. Similarly, we extend h and B, C to the complex space C. As such, the computation of Σ_{k=1}^{t} A^{t-k} B x_k can now be written as Σ_{k=1}^{t} P Λ^{t-k} P^{-1} B x_k. We further write h̄ = P^{-1} h, B̄ = P^{-1} B and C̄ = C P, which simplifies the formulation of h̄_t and the output y_t to:

h̄_t = Σ_{k=1}^{t} Λ^{t-k} B̄ x_k,   y_t = ℜ(C̄ h̄_t) + D x_t,  (4)

where ℜ(C̄ h̄_t) is the real part of the complex vector C̄ h̄_t. To improve multiplication efficiency in the complex space, we represent Λ in polar form with absolute value r and argument θ (i.e., λ = r e^{iθ} = r (cos(θ) + i sin(θ))), where i stands for the imaginary unit. Letting ν = −log(r), we have λ = r e^{iθ} = e^{−ν+iθ}, where ν, θ ∈ R^{d_N}; thus the involved computation reduces to exp(−ν + iθ). Another advantage of the polar form is that the parameters ν and θ can now be optimized in the real space R. As a result, the scale of Λ can be computed with e^{−ν}, while the exponentiation of matrix A in Equation (3) is replaced with the efficient exponential of the diagonal matrix Λ.

Parameterization. To avoid numerical instability, a simple condition is to let the elements of Λ satisfy |λ_j| < 1 for j = 1, 2, ..., d_N; this condition is equivalent to e^{−ν} < 1 in the polar form. Therefore, we use another exponential, λ = exp(−exp(ν_log) + iθ) with ν > 0 [5,23]. More specifically, we use log parameters ν_log, θ_log ∈ R^{d_N} and introduce a normalization factor γ_log ∈ R^{d_N}, with ν_log = log(ν) and θ_log = log(θ).
γ_log normalizes the input element-wise and is initialized with γ_log = log(√(1 − |λ|²)). The remaining log parameters are initialized following the ring approach with radius in [0.8, 0.99] [23], while the remaining matrices are initialized via truncated normal initialization [30]. Using h̄, B̄ and C̄ to represent h, B and C in the complex space C, we summarize the above parameterization and formulate the final linear recurrent unit as follows:

λ = exp(−exp(ν_log) + i exp(θ_log)),   h̄_t = λ ⊙ h̄_{t-1} + exp(γ_log) ⊙ (B̄ x_t),   y_t = ℜ(h̄_t) + x_t.  (5)

In our implementation, we double the size of d_N in Λ to improve the modeling of recurrence. We also adopt the identity matrix I as C̄ in the output function, as d_in = d_out. Discarding D additionally reduces the learnable parameters, while the direct addition of x_t constructs a residual connection between the input and output. In the following, we use LRU to denote the proposed linear recurrence operation in Equation (5); the effectiveness of our design is demonstrated in Section 5.4.
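A minimal sketch of this parameterization in plain Python follows. It is not the paper's implementation: B̄ is taken as an element-wise identity, the toy parameter values are hypothetical, and only the sequential (non-parallelized) scan is shown.

```python
import cmath, math

def lru_scan(x, nu_log, theta_log):
    """One hedged LRU pass in the spirit of Eq. (5):
    lam = exp(-exp(nu_log) + i*exp(theta_log)), gamma = sqrt(1 - |lam|^2)."""
    lam = [cmath.exp(complex(-math.exp(n), math.exp(t)))
           for n, t in zip(nu_log, theta_log)]
    gamma = [math.sqrt(1.0 - abs(l) ** 2) for l in lam]   # input normalization
    h = [0j] * len(lam)
    ys = []
    for xt in x:                                          # sequential recurrence
        h = [l * hj + g * xt for l, hj, g in zip(lam, h, gamma)]
        ys.append([hj.real + xt for hj in h])             # real part + residual
    return ys

ys = lru_scan([1.0, -0.5, 2.0],
              nu_log=[math.log(0.2)] * 2, theta_log=[math.log(0.3)] * 2)
assert abs(cmath.exp(complex(-0.2, 0.3))) < 1.0   # |lam| = e^-0.2 < 1: stable
assert len(ys) == 3 and len(ys[0]) == 2
```

Because ν_log is exponentiated twice, ν = exp(ν_log) is positive for any real ν_log, so |λ| = e^{−ν} < 1 holds by construction.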

Recursive Parallelization
Although matrix diagonalization and the exponential parameterization improve the computational efficiency of linear recurrence, the forward pass in Equation (5) proceeds element by element over the time steps, resulting in linear time complexity w.r.t. sequence length. Inspired by parallel scan algorithms [2,3,17,29], we develop recursive parallelization designed specifically for LRURec. In particular, long input sequences are recursively split into subsequences to enable parallel processing, followed by scaled addition of the hidden features from each subsequence, thereby improving the forward pass of LRURec to logarithmic time complexity.
For simplicity, we formulate the recursive parallelization based on h̄_t = Σ_{k=1}^{t} Λ^{t-k} B̄ x_k in Equation (4). Given an input sequence x of length t and a time step m with m < t, the final output h̄_t can be formulated as:

h̄_t = Σ_{k=1}^{m} Λ^{t-k} B̄ x_k + Σ_{k=m+1}^{t} Λ^{t-k} B̄ x_k = Λ^{t-m} h̄_m + Σ_{k=m+1}^{t} Λ^{t-k} B̄ x_k.

The result suggests that it is possible to split the sequence x = [x_1, x_2, ..., x_t] into [x_1, x_2, ..., x_m] and [x_{m+1}, x_{m+2}, ..., x_t] for parallel processing on the subsequences. Then, h̄_t can be obtained by summing the outputs from the subsequences, where the last-step hidden feature from the first subsequence (i.e., h̄_m) is additionally multiplied by Λ^{t-m} to correct the time steps.
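This split-and-correct identity can be verified with a toy scalar (i.e., diagonal) recurrence. The sketch below uses hypothetical values and folds B̄ into the inputs for brevity:

```python
# Split identity: scanning [x_1..x_t] equals scanning the two halves
# independently and rescaling the first half's last state by lam^(t - m).
lam, xs, m = 0.8, [0.5, -1.0, 2.0, 0.25], 2

def scan(seq, h0=0.0):
    """Plain sequential recurrence h_k = lam * h_{k-1} + x_k; returns all states."""
    h, hist = h0, []
    for x in seq:
        h = lam * h + x
        hist.append(h)
    return hist

full = scan(xs)[-1]                       # h_t over the whole sequence
h_m = scan(xs[:m])[-1]                    # first subsequence
h_tail = scan(xs[m:])[-1]                 # second subsequence, started from zero
combined = lam ** (len(xs) - m) * h_m + h_tail

assert abs(full - combined) < 1e-12
```

Since the two halves never depend on each other until the final correction, they can be computed concurrently, which is the basis of the recursive scheme below.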
Based on these observations, we propose recursive parallelization to accelerate the forward feeding of sequential input. The recursive parallelization process is illustrated in Figure 2. Specifically, we perform the following steps:

1. Sequence Padding: Given an input sequence x of length t, we first pad the sequence length to a power of two (i.e., 2², 2³, ...). The reason for such padding is to enable recursive splits of the input sequence until reaching the shortest length of two (i.e., [x_1, x_2], [x_3, x_4], [x_5, x_6], ...), which maximizes the performance of our recursive parallelization algorithm.

2. Recursive Split: With the padded sequence of length 2^s, we multiply x with B̄ and perform s forward passes. In the k-th pass, the input sequence is split into 2^{s-k} subsequences, each of length 2^k (i.e., [x_1, x_2, ..., x_{2^k}], [x_{2^k+1}, x_{2^k+2}, ..., x_{2·2^k}], ...). The subsequences are processed in parallel to perform linear recurrence (see next step), followed by restoring the original sequence. Consequently, the time complexity reduces with no additional space required. Recursive parallelization can be performed regardless of input length and hidden dimension, and is designed for full-length training and inference. We provide an implementation of recursive parallelization with PyTorch-like pseudocode in Algorithm 1, with x, B and La representing the input x and the parameters B̄ and Λ. The above steps follow the divide-and-conquer principle to break down long sequences into subsequences recursively, such that linear recurrence can be computed on the subsequences in parallel. By applying recursive parallelization, we reduce the number of forward passes to log₂(t) for an input sequence of length t (e.g., three passes for an eight-element sequence, see Figure 2), which significantly improves time efficiency for both training and inference in LRURec.
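To make the divide-and-conquer structure concrete, here is a minimal recursive sketch for the scalar (diagonal) case, checked against a plain sequential scan. It is illustrative only: real implementations batch the subsequences on an accelerator rather than recursing, and the values are hypothetical.

```python
def recursive_scan(xs, lam):
    """Hedged sketch of recursive parallelization for a scalar LRU: split the
    power-of-two input in half, scan both halves (independently, so they could
    run in parallel), then correct the right half by lam^(k+1) times the left
    half's final state. Returns all prefix states h_1..h_t."""
    n = len(xs)
    if n == 1:
        return [xs[0]]
    left = recursive_scan(xs[: n // 2], lam)
    right = recursive_scan(xs[n // 2:], lam)
    carry = left[-1]
    # Correct each time step of the right half for the carried-in state.
    right = [h + lam ** (k + 1) * carry for k, h in enumerate(right)]
    return left + right

lam, xs = 0.9, [1.0, 0.5, -2.0, 0.25]     # length padded to a power of two

# Reference: plain sequential recurrence h_t = lam * h_{t-1} + x_t.
ref, h = [], 0.0
for x in xs:
    h = lam * h + x
    ref.append(h)

par = recursive_scan(xs, lam)
assert all(abs(a - b) < 1e-12 for a, b in zip(ref, par))
```

The recursion depth is log₂(t), matching the log₂(t) forward passes described above, while every level's subsequences are mutually independent.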

Overall LRURec Model
In this section, we introduce the overall architecture of the proposed LRURec. The model comprises: (1) an embedding module; (2) LRU blocks with position-wise feed-forward networks (PFFN); and (3) a prediction layer. The overall model is illustrated in Figure 3; we describe the details of each module in the following.
Embedding Module. Similar to existing methods, we use a learnable matrix E ∈ R^{|I|×d_in} to transform the discrete item IDs in I from the input sequence to the high-dimensional embedding space R^{d_in}. For sequences of different lengths, we perform left padding to a power of two for parallelization, as described in Section 3.3. Layer normalization (LayerNorm) is performed after the embedding retrieval. Hence, for an input sequence x = [v_1, v_2, ..., v_t], we denote the embedding module with Embed:

Embed(x) = LayerNorm([E_{v_1}, E_{v_2}, ..., E_{v_t}]),   LayerNorm(h) = (h − μ) / √(σ² + ε) ⊙ α + β,

where α and β are learnable rescaling factors, μ and σ represent the mean and standard deviation of the input, and ε is added to the denominator in LayerNorm for numerical stability.
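The LayerNorm used here can be sketched in a few lines (α and β fixed to their usual initializations of 1 and 0 for illustration):

```python
import math

def layer_norm(h, alpha=1.0, beta=0.0, eps=1e-5):
    """Element-wise layer normalization over one embedding vector (sketch)."""
    mu = sum(h) / len(h)
    var = sum((v - mu) ** 2 for v in h) / len(h)
    return [alpha * (v - mu) / math.sqrt(var + eps) + beta for v in h]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
assert abs(sum(out) / len(out)) < 1e-6     # normalized output has ~zero mean
```

Normalizing each embedding vector independently keeps the activation scale stable regardless of sequence length or padding.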
LRU Block. As in Section 3.2, we use LRU to denote the linear recurrence operation in LRURec. We additionally apply layer normalization on the output hidden features of LRU and denote the combined process with LRUNorm. Nevertheless, LRU sacrifices non-linearity in the recurrence operation for improved performance and efficiency. To compensate for the absence of non-linearity, we leverage a position-wise feed-forward network (PFFN) to improve the modeling of user actions in the hidden dimension. In particular, we use PFFN to describe the two-layer MLP network:

PFFN(h) = GELU(W^(2) GELU(W^(1) h + b^(1)) + b^(2)),

where W^(1) ∈ R^{4d×d}, W^(2) ∈ R^{d×4d}, b^(1) ∈ R^{4d} and b^(2) ∈ R^d are the parameters of the two-layer MLP, and GELU refers to the GELU activation. We additionally introduce a sublayer connection (SubLayer) with both layer normalization and residual connection around the PFFN to improve recommendation performance and training dynamics. In short, we leverage the linear recurrent unit to efficiently process sequential input; PFFN, layer normalization and residual connections are additionally introduced to improve the modeling of non-linear transition patterns in LRURec:

LRUBlock(x) = SubLayer(LRUNorm(x)),   SubLayer(h) = h + PFFN(LayerNorm(h)).

In our experiments, we stack two LRU blocks following [15,30].
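A plain-Python sketch of the PFFN above, with GELU computed exactly via the error function. The toy weights are hypothetical (d = 1, so the inner width is 4):

```python
import math

def gelu(v):
    # Exact GELU via the Gaussian error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def pffn(h, W1, b1, W2, b2):
    """Two-layer position-wise feed-forward: d -> 4d -> d, GELU after each
    layer as in the formulation above. Plain-list matrix-vector products."""
    z = [gelu(sum(w * v for w, v in zip(row, h)) + b) for row, b in zip(W1, b1)]
    return [gelu(sum(w * v for w, v in zip(row, z)) + b) for row, b in zip(W2, b2)]

h = [0.5]                                          # toy input, d = 1
W1, b1 = [[1.0], [-1.0], [0.5], [2.0]], [0.0] * 4  # illustrative weights
W2, b2 = [[0.25, 0.25, 0.25, 0.25]], [0.1]
out = pffn(h, W1, b1, W2, b2)
assert len(out) == len(h)                          # position-wise: d in, d out
```

Being position-wise, the same MLP is applied at every time step, so it adds non-linearity without touching the recurrence itself.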
Prediction Layer. Given the hidden features h_t at time step t from the last LRU block, we compute the scores over I for next-item prediction via the Pred function:

Pred(h_t) = E h_t + b,

where E is the embedding matrix from the embedding module and b ∈ R^{|I|} is an additional bias term. Thanks to the non-linearity introduced in the LRU block, the dot product between the embedding features and h_t can capture non-trivial patterns despite utilizing the shared item features E. Additionally, the shared E significantly reduces the model size, while effectively alleviating overfitting in LRURec.
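The tied-embedding scoring can be sketched directly (toy embeddings, hypothetical values):

```python
def predict_scores(h_t, E, bias):
    """Score every item by a dot product with the shared embedding matrix E
    (|I| x d) plus a per-item bias -- a sketch of the tied prediction layer."""
    return [sum(e * v for e, v in zip(row, h_t)) + b for row, b in zip(E, bias)]

E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy item embeddings, |I| = 3, d = 2
scores = predict_scores([0.2, 0.9], E, bias=[0.0, 0.0, 0.0])
top_item = max(range(len(scores)), key=scores.__getitem__)
assert top_item == 2                        # [1, 1] aligns best with h_t
```

Reusing E for both input embeddings and output scoring is what keeps the parameter count at O(|I|d) rather than doubling it.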

DISCUSSION
Why does linear recurrence perform well in sequential recommendation? We have two explanations for the improvements of LRURec: (1) Linear recurrence by design assigns higher weights to recent items; thus it is effective for modeling recommendation data that emphasizes recent interactions.

EXPERIMENTS

Experimental Setup
Datasets. Our model is evaluated on the following datasets:
• MovieLens: A benchmark dataset for movie recommendation; we select the widely used ML-1M [7].
• Amazon: A series of datasets with product reviews from Amazon; we select Beauty, Video and Sports [10,22].
• Steam: A video game review dataset crawled from Steam, a large online video game distribution platform [15].
• XLong: An Alibaba dataset known for long interaction histories, used to evaluate lifelong sequential models [26].
For preprocessing, we follow [11,15,36] to construct input sequences and exclude users and items with fewer than 5 interactions.
For the maximum sequence length, we adopt 200 for ML-1M, 1000 for XLong and 50 for the remaining datasets. We follow the leave-last-out strategy for dataset splitting and use the most recent item as the test set, the second most recent item as the validation set, and the remaining items in the sequences as the training set. During testing, we include both training and validation items as input. We report the dataset statistics after preprocessing in Table 2.
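The leave-last-out split described above is simple enough to state in code (function name is ours, not from the paper):

```python
def leave_last_out(seq):
    """Leave-last-out split: last item -> test, second-to-last -> validation,
    the rest -> training. Assumes len(seq) >= 3 after preprocessing."""
    return seq[:-2], seq[-2], seq[-1]

train, val, test = leave_last_out([10, 11, 12, 13, 14])
assert train == [10, 11, 12] and val == 13 and test == 14
```

At test time the model's input is then `train + [val]`, matching the statement that training and validation items are both included as input.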
Baseline Methods. We compare our LRURec against multiple baseline methods, including classic factorization and Markov chain-based methods (e.g., MF, FISM, FPMC), RNN-based models (e.g., GRU4Rec, NARM), as well as transformer- and MLP-based state-of-the-art models (e.g., SASRec, BERT4Rec and FMLP-Rec):
• MF: A vanilla factorization model that learns user and item latent representations for next-item prediction [16].
• FISM: FISM does not explicitly model user preference and predicts the next interaction via item-to-item similarity [14].
• FPMC: FPMC is a matrix factorization model that uses Markov chains to capture user transition patterns [27].
• GRU4Rec: A classic sequential recommender that models user-item interactions using a GRU-based model [12,13].
• NARM: Also an RNN-based sequential recommender, with local and global user modeling for next-item prediction [18].
• SASRec: The first unidirectional transformer-based sequential recommender; SASRec leverages unidirectional self-attention to capture user-item transition patterns [15].
• BERT4Rec: A bidirectional transformer encoder architecture for sequential recommendation; BERT4Rec is trained by predicting a random proportion of masked items [30].
• FMLP-Rec: A filter-based all-MLP model that learns in the complex domain using the fast Fourier transform [38].
Evaluation. For evaluation, we select the models with the best validation Recall@10 scores during training to perform prediction on the test sets. The models are evaluated using Recall@k and NDCG@k metrics with k ∈ {10, 20}. The predicted items are ranked against all items in the dataset to compute the final scores.
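With a single ground-truth item per test case (as in the leave-last-out protocol), both metrics reduce to simple expressions. A sketch, with function names ours:

```python
import math

def recall_at_k(ranked, target, k):
    """1 if the single ground-truth item appears in the top-k, else 0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """With one relevant item, NDCG@k reduces to 1/log2(rank + 1) inside
    the top-k (rank is 1-indexed), and 0 otherwise."""
    if target in ranked[:k]:
        return 1.0 / math.log2(ranked.index(target) + 2)
    return 0.0

ranked = [7, 3, 9, 1]                 # items sorted by predicted score
assert recall_at_k(ranked, 9, 10) == 1.0
assert abs(ndcg_at_k(ranked, 9, 10) - 1.0 / math.log2(4)) < 1e-12
```

The log-discount in NDCG is what rewards placing the true item near the top, which is where LRURec's gains are most pronounced.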
Implementation Details. For the baseline methods, we follow the original papers for implementation. We train all models using cross-entropy loss with the AdamW optimizer and a fixed learning rate of 1e-3. During training, we use a batch size of 128 (32 for XLong) and set the maximum number of epochs to 500; validation is performed every 500 to 2000 iterations depending on the data size. Early stopping is triggered if validation Recall@10 does not improve over 10 consecutive validation rounds. We perform grid search over hyperparameters, with weight decay in [0, 1e-6, 1e-4, 1e-2] and dropout rate in [0.2, 0.4, 0.6, 0.8]. Hyperparameters not mentioned above follow the original implementations.

Overall Performance
Our main performance results are reported in Table 3. Rows represent the dataset and metric, and columns represent each of the methods; we mark the best results in bold and underline the second-best results. We also compute the relative improvement of LRURec compared to the best-performing baseline method (i.e., Improv.). We observe: (1) LRURec consistently outperforms baseline methods across all metrics and datasets, with an average performance improvement of 4.39% over the second-best method. Despite its efficient and lightweight design, the performance gains of LRURec can exceed 10% depending on the data distribution (e.g., 17.43% on NDCG@10 for Sports). (2) The performance gains of LRURec are more pronounced on sparse datasets, while being comparatively modest on dense datasets. For example, LRURec achieves 1.32% average improvement on the relatively dense ML-1M. The improvement is much more significant on the sparse Sports dataset, with average gains of 10.48%, suggesting the substantial benefits of LRURec on sparse data. (3) Compared to recall, LRURec demonstrates better ranking performance. For instance, there is a noteworthy increase of 6.73% in the average NDCG@10 scores with LRURec, while the relative improvement on Recall@10 is slightly lower at 3.08%. Overall, we find that LRURec performs particularly well on sparse data and shows significantly improved ranking performance compared to the baseline methods. LRURec also achieves consistent performance improvements in all scenarios, suggesting its effectiveness regardless of data domain. (Note that our preprocessing yields sparser distributions; as such, the reported numbers should not be directly compared with those found in works that employ k-core preprocessing.)

Long-Range Modeling Performance
To examine the performance of LRURec on long-range dependencies, we additionally experiment on XLong: a large-scale dataset with ∼1k sequence length. Due to scalability issues, we only experiment on selected state-of-the-art baselines (SASRec and BERT4Rec) along with LRURec. We also reduce the hyperparameter search to [0, 1e-2] for weight decay and [0.2, 0.4] for dropout rate. We improve the training efficiency on XLong by randomly sampling 100 negative items to compute the loss and update the model during training. For evaluation, we randomly sample 10k negative items to compute the metrics; the experiment results are reported in Table 4. Analogous to the main results, LRURec outperforms baseline methods by a considerable margin, indicating the effectiveness of LRURec even for long-range dependencies. For example, LRURec achieves 8.53% average improvements across all metrics compared to the best-performing baseline SASRec. In addition, we evaluate the weights of history information in LRURec by computing the average |λ| in each of the LRU blocks. The |λ| values for all datasets are reported in Table 5. As expected, we observe high |λ| values for long sequences (e.g., ∼0.8 for XLong), whereas for short sequences, the |λ| values are significantly lower (e.g., ∼0.3 for Beauty and Video). Interestingly, we observe relatively high |λ| on Sports despite its short sequence length, which may explain the significant performance improvement of LRURec (up to 17.43%) in the main results.

Ablation Studies
We perform a series of ablations to demonstrate the effectiveness of the proposed components in LRURec. In particular, we remove layer normalization, the residual connection and the PFFN respectively, and evaluate the performance changes. We additionally reduce the number of LRU blocks in LRURec and adopt two efficient sequence modeling methods, S4 and RWKV, to replace LRU [5,25]. The ablation results are reported in Table 6, with rows representing the ablation variants and columns representing the datasets. We observe the following on the ablation of LRURec components: (1) All components contribute to the overall performance of LRURec. For example, removing PFFN results in an average performance drop of 11.24% on ML-1M. (2) The components contribute differently depending on the dataset. For instance, layer normalization and the residual connection contribute significantly to performance on sparse datasets (e.g., with 82.07% and over 100% average gains on Beauty). In contrast, PFFN improves the modeling of non-linear transition patterns, and thereby further enhances performance on dense datasets like ML-1M. By additionally switching the backbone of LRURec, we notice: (1) The overall best-performing variant is still our reduced one-layer LRURec, demonstrating the effectiveness of the proposed architecture. On average, the one-layer LRURec outperforms the second-best backbone variant by 7.70% on Recall@10. (2) Surprisingly, the S4 variant performs best on the Steam dataset, which may be attributed to the combination of the linear design of S4 and the imbalanced item popularity in Steam.
Overall, the ablation results suggest that all proposed components and the carefully designed architecture in LRURec are effective for sequential recommendation across various data scenarios.

Model Efficiency
To further demonstrate the advantages of our design, we study the efficiency of LRURec against two representative baselines: the RNN-based GRU4Rec and the transformer-based SASRec. We illustrate the validation Recall@10 curves during training in Figure 1 (left), with the horizontal axis representing training steps and the vertical axis representing Recall@10 scores. Owing to the linear recurrence design and improved training dynamics, LRURec converges significantly faster and triggers early stopping at ∼23k training steps, compared to over 50k for SASRec and over 80k for GRU4Rec. Aside from training, LRURec also demonstrates substantial advantages with incremental inference, which drastically reduces latency for real-time recommendation, as illustrated in Figure 1 (right). Here, we perform batched inference and keep extending the input length after each prediction step; we visualize the cumulative prediction time for a total of 2048 steps. Given the hidden states and current input elements, we observe an almost linear correlation between cumulative computing time and input steps for both GRU4Rec and LRURec. Moreover, the throughput capacity of LRURec is independent of the sequence length in incremental inference. For example, LRURec can process 7.3× more input examples compared to SASRec with a maximum sequence length of 50. In summary, the results suggest that LRURec offers training parallelization and performance like transformer models, while retaining the advantage of incremental inference from RNNs.
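Incremental inference amounts to a constant-time (in sequence length) update of the cached hidden state when one new action arrives. A minimal diagonal-case sketch, with hypothetical values, checked against a full recompute:

```python
def incremental_step(h, x_new, lam):
    """Per-step update of the cached diagonal hidden state when one new user
    action arrives -- no recomputation over the full history (sketch)."""
    return [l * hj + x_new for l, hj in zip(lam, h)]

lam = [0.9, 0.7]
h = [0.0, 0.0]
history = [1.0, -0.5, 2.0]
for x in history:                      # stream actions one at a time
    h = incremental_step(h, x, lam)

# The streamed state matches a full recompute over the stored history.
full = [0.0, 0.0]
for x in history:
    full = [l * hj + x for l, hj in zip(lam, full)]
assert h == full
```

This is precisely what a self-attentive model cannot do: its sequence-level weights must be recomputed each time the history grows.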

Hyperparameter Sensitivity
We evaluate the hyperparameter sensitivity of LRURec. In particular, we vary weight decay and dropout values and evaluate the trained models with the best validation performance. Figure 4 compares the performance under different hyperparameters, where the x-axis stands for the varying values and the y-axis for Recall@10 scores. For weight decay, we observe minor changes with increasing penalty strength; the performance remains robust until the weight decay is increased to 1. For varying dropout rates, we observe different performance changes depending on the dataset. For dense datasets (e.g., ML-1M), the best performance is achieved at 0.2 and then consistently decreases with increasing dropout rates. Unlike on dense datasets, Recall@10 peaks at a 0.6 dropout rate on sparse datasets like Video and Beauty. Overall, the performance of LRURec is quite robust to varying weight decay values, while dropout rates should be carefully selected for optimal performance.

Figure 1 :
Figure 1: Training and inference efficiency of GRU4Rec, SASRec and LRURec on ML-1M. The proposed LRURec converges significantly faster than SASRec and GRU4Rec, while outperforming both models on Recall@10 scores and achieving O(1) complexity with incremental inference.

Figure 2 :
Figure 2: Recursive parallelization for LRURec; we illustrate the recursive split and parallel forward pass.

Figure 3 :
Figure 3: The overall architecture of the proposed LRURec.

• We propose a series of improvements in LRURec to address the lack of non-linearity and improve training dynamics. We further propose recursive parallelization, which significantly accelerates both training and inference.
• We empirically demonstrate the effectiveness and efficiency of LRURec in comparison to state-of-the-art methods on multiple real-world datasets, where LRURec consistently outperforms baseline methods by a large margin.
• Our results challenge the necessity of the core self-attention module in existing SARs while highlighting the importance of other techniques in SARs like layer normalization, which provides deeper understanding and new opportunities for the architecture design of sequential recommenders.

Table 1 :
Comparison among the proposed LRURec and representative sequential recommenders. LRURec is both training-efficient (like SARs [15,30]) and inference-efficient (like RNNs [13,18] or FPMC [27]). Additionally, LRURec matches SARs and FMLP-Rec [38] on recommendation performance.

Table 2:
Dataset statistics after preprocessing.

To examine this, we inspect the average |λ| values (i.e., e^{−ν}) for different sequence lengths. Note that higher |λ| values suggest the inclusion of more history information. With ∼200 sequence length (i.e., ML-1M), we observe |λ| ≈ 0.4 for both LRU blocks. The values reduce to 0.25 and 0.31 for ∼10 sequence length in Beauty (see Section 5.3). As |λ| is initialized close to 1, the low values indicate a high emphasis on recent items, thus providing a solid justification for the recurrence design in LRURec. (2) PFFN and hierarchical LRU blocks further improve the modeling of non-trivial transition patterns in long-range dependencies and dense datasets. For example, PFFN and stacking LRU blocks yield up to 11.24% and 3.43% performance improvements on long sequences (ML-1M), while contributing less than 1% improvement on the sparse Beauty data (see Section 5.4). Hence, the combination of additional non-linearity and hierarchical linear recurrence plays a vital role in the overall performance of LRURec.

How does the complexity of LRURec compare to other models? We use d and t to represent the model hidden dimension and input sequence length, and I and U to represent the item and user sets. The number of learnable parameters in the proposed LRURec is comparable to RNN-based sequential recommenders, which have O(|I|d + d²) space complexity. In comparison, factorization models have O((|I| + |U|)d) and transformer models have O((t + |I|)d + d²) space complexity. Given the low hidden dimension size (d = 64 in our experiments), the space complexity of LRURec is small and does not grow with increasing numbers of users. Moreover, we use the identity matrix as C̄ and decompose A = PΛP^{-1} (P is integrated into the B̄ and C̄ matrices), which leads to additional parameter reduction in LRURec. For time complexity, LRURec achieves O(log(t)d²) for a full-length input sequence as a result of our recursive parallelization design. In contrast, typical RNN-based recommenders require O(td²) time and the original transformer-based recommenders require O(t²d + td²). Like RNNs, LRURec additionally supports incremental inference with only O(d²) complexity. We demonstrate the advantages of LRURec on both performance and efficiency in Section 5.

Table 3 :
Main performance results, best results are marked in bold, second best results underlined.

Table 4 :
Long-range modeling performance results, best results are marked in bold, second best results underlined.

Table 5 :
Average |λ| values of each LRU block; higher |λ| indicates the incorporation of more history information.

Table 6 :
Ablation results, best results are marked in bold, second best results underlined.