Enhancing Sequential Recommendation via LLM-based Semantic Embedding Learning

Sequential recommendation systems (SRS) are crucial in various applications as they enable users to discover relevant items based on their past interactions. Recent advancements involving large language models (LLMs) have shown significant promise in addressing intricate recommendation challenges. However, these efforts exhibit certain limitations. Specifically, directly extracting representations from an LLM based on items' textual features and feeding them into a sequential model holds no guarantee that the semantic information of the texts is preserved in these representations. Additionally, concatenating the textual descriptions of all items in an item sequence into a long text and feeding it into an LLM for recommendation results in lengthy token sequences, which largely diminishes practical efficiency. In this paper, we introduce SAID, a framework that utilizes LLMs to explicitly learn Semantically Aligned item ID embeddings based on texts. For each item, SAID employs a projector module to transform the item ID into an embedding vector, which is fed into an LLM to elicit the exact descriptive text tokens accompanying the item. The item embeddings are thereby forced to preserve fine-grained semantic information of the textual descriptions. Further, the learned embeddings can be integrated with lightweight downstream sequential models for practical recommendation. In this way, SAID circumvents the lengthy token sequences of previous works, reducing the resources required in industrial scenarios while also achieving superior recommendation performance. Experiments on six public datasets demonstrate that SAID outperforms baselines by about 5% to 15% in terms of NDCG@10. Moreover, SAID has been deployed on Alipay's online advertising platform, achieving a 3.07% relative improvement in cost per mille (CPM) over baselines, with an online response time of under 20 milliseconds.


INTRODUCTION
Sequential recommendation systems (SRS) are widely utilized in various applications, enabling users to efficiently discover pertinent and tailored items by leveraging their historically interacted item sequences [29,30]. Numerous techniques have been proposed to enhance the efficacy of SRS, including early matrix factorization-based approaches [21] as well as more recent advancements involving RNN-based and Transformer-based models, which have demonstrated substantial improvements in SRS performance [8,12,23,36].
In light of the impressive capabilities exhibited by large language models (LLMs) [19,26], it becomes reasonable and practical to enhance the performance of conventional sequential recommendation models [8,12] and tackle challenging recommendation issues by leveraging LLMs, due to the generalization ability and common knowledge within these large models [22]. The utilization of LLMs in SRS can be broadly categorized into two paradigms: LLM-augmented methods and LLM-centric methods. In the LLM-augmented paradigm, as depicted in the upper left of Figure 2, embeddings of items' textual descriptions are extracted from LLMs and treated as features for the items. These features are subsequently integrated with other recommendation models, such as GRU or Transformer [4,34]. LLM-centric methods, as sketched in the lower left of Figure 2, transform items into textual representations and concatenate them into a long text sequence to feed into an LLM. Afterward, the LLM can either directly generate an item description as the prediction or extract sequence features to discover similar items [15,22]. Considering that text can serve as a versatile modality connecting knowledge from distinct domains, the adoption of such text-based sequential modeling paradigms holds unprecedented promise in addressing previously intricate challenges, such as the cold-start and cross-domain transfer problems in sequential recommendation systems [15,30].
Despite the promise of LLMs in the realm of sequential recommendation, current works integrating LLMs with SRS exhibit certain limitations. Firstly, for the LLM-augmented methods, the textual embeddings obtained through LLMs are typically coarse-grained, making it challenging to capture an item's subtle word-level attributes that represent user preferences [22]. For instance, there may be only a small difference between the representations of 'Apple iPhone 15 White 256GB' and 'Apple iPhone 5 White 256GB', making it difficult to distinguish users who prefer 'iPhone 15' over 'iPhone 5'. In other words, there is no guarantee that the textual embeddings extracted from an LLM preserve the fine-grained item text information. Secondly, LLM-centric methods encounter difficulties in handling lengthy token sequences and suffer efficiency bottlenecks caused by the notorious computational complexity of LLMs [19]. In recent SRS literature, there is a tendency to employ significantly long item sequences, such as thousands of items, to improve the modeling of user preferences [20,28]. However, each item typically consists of dozens or even hundreds of tokens. Consequently, the total number of tokens within an item sequence can become excessively large, leading to low efficiency and high expenses. In practical industrial applications, there are strict requirements on the response time for user queries, e.g., within ten or dozens of milliseconds [3,6,7,35]. Due to the heavy inference cost of LLMs, existing LLM-centric methods can hardly meet these efficiency standards. Several works have been dedicated to enhancing the efficiency of LLMs through prompt compression [10,11], acceleration of attention layers [2,17,31], and so on. Despite these efforts, such approaches exhibit compromised performance or limitations in leveraging off-the-shelf LLMs.
In view of the aforementioned limitations, this paper aims to exploit the capability of LLMs in SRS in an efficient and effective manner. The main idea of SAID is to learn item embeddings that are accurately aligned with the textual descriptions of items within the embedding space of an LLM, and that can be effectively utilized with readily available lightweight sequential models. To achieve this, SAID adopts a two-stage training scheme. Note that in recommendation scenarios, an item is typically represented by a numerical ID, accompanied by several textual descriptions such as brand, category, and so on. In the first stage, inspired by LLM-oriented alignment learning [14,18,37], SAID employs a projector module to transform an item ID into an embedding and feeds it into an LLM to explicitly elicit the item's textual token sequence from the LLM. In this way, SAID explicitly preserves the fine-grained semantic meaning of an item's textual description in the embedding, i.e., a semantically aligned embedding. Only the projector undergoes training while the LLM remains fixed, with gradients propagating through it. In the second stage, the learned item embeddings are exploited by a downstream sequential model such as GRU or Transformer to extract the entire sequence's representation for recommendation. In this stage, the sequential model is trained from scratch and the embeddings learned in the first stage are fine-tuned. After training, the downstream sequence model and the fine-tuned item embeddings are adopted for practical inference. Since the LLM is not engaged in the second stage and the downstream sequence model can be lightweight, SAID achieves superior inference efficiency, e.g., less than 5 milliseconds for a single inference and less than 20 milliseconds for an overall online response. Moreover, thanks to the LLM-based alignment learning, the learned item embeddings significantly improve SRS performance over the randomly initialized embeddings utilized in previous models [8,13].
We perform extensive experiments to evaluate the SAID framework from various perspectives. Results on public datasets suggest that SAID outperforms baselines by about 5% to 15% on NDCG@10, and by about 3% to 14% on Recall@10. In Alipay's online advertising deployment, SAID achieves 2.98% and 3.07% improvements on click-through rate (CTR) and cost per mille (CPM), respectively, over baselines. We summarize our contributions as follows:
• We propose a sequential recommendation framework that learns semantically meaningful item embeddings based on LLMs. Different from randomly initialized item embeddings or representations directly extracted from LLMs, the proposed framework preserves fine-grained item textual information in the learned embeddings, improving the performance of SRS.
• We propose an alignment learning scheme that employs a projector module to learn item embeddings within the embedding space of an LLM. The fixed LLM, alongside lightweight downstream sequential models, simplifies the training and inference process and enhances its practicality in industrial scenarios.
• We conduct experiments on various datasets to verify the effectiveness and efficiency of SAID. We also conduct comprehensive ablation studies and in-depth comparisons to investigate the effect of semantic item embedding learning and other components of the framework.

METHODOLOGY
In this section, we illustrate the problem formulation of SRS and elaborate on the proposed SAID framework.

Problem Formulation
In sequential recommendation modeling, a predefined item set I is considered, and a sequence of items S = {v_1, v_2, ..., v_n} denotes a user's historical interactions in chronological order, where each item v_i is associated with a numerical ID and a textual description t_i. The goal of SRS is to predict the next item the user will interact with. For example, in Figure 1, a user has interacted with three items. Item 23 has a textual dictionary indicating its brand, category, and detailed description. To predict the correct next item, we aim to learn for each item a specific embedding that contains the semantic meaning of its textual representation t_i, and to extract the entire sequence's representation based on the item embeddings and a sequential model. In the first stage, SAID learns to generate an embedding for each item by leveraging the projector module and an off-the-shelf LLM. The size of the learned embedding equates to the embedding size of a single token of the specific LLM. In the second stage, the embeddings acquired during the first stage are utilized as the initial features of the items, which are then fed into a downstream model (such as an RNN or Transformer) for sequential recommendation. It is important to note that SAID is agnostic to the specific choices of downstream models employed in the recommendation process, imparting the framework with significant adaptability and flexibility. In the subsequent sections, we provide detailed elaborations of the two aforementioned stages respectively.

Semantically aligned embedding learning.
Let f_θ denote the projector module with parameter set θ; then item i's embedding x_i can be represented as x_i = f_θ(v_i). As stated above, the objective of training the projector in SAID is to preserve the textual information t_i within the projected representation x_i, thereby producing semantically aligned embeddings within the embedding space of an LLM. Specifically, as shown in Stage 1 of Figure 2, we formulate projector learning as a conditional text generation task performed by an LLM, where the goal is to elicit the text sequence t_i from the LLM given the projected embedding x_i as its input. For instance, for item 23 in Stage 1 of Figure 2, its projected semantic embedding x_23 is fed into the LLM, which is expected to output the first token 'Brand' of the item's textual description. Subsequently, x_23 and the word embedding of 'Brand' are taken as inputs, which are expected to elicit 'BrandA' from the LLM. The errors from all output tokens of the LLM are back-propagated to adjust the projector's parameters.
In the following, we mathematically formulate the training process in this stage. We omit the subscript i for ease of presentation and use v_0 to denote an item ID. Let P_φ denote an LLM with parameter set φ; the optimization objective for the projector is:

arg max_θ log P_φ(t | f_θ(v_0)).

We follow an auto-regressive approach to generate the text sequence t, based on which log P_φ(t | f_θ(v_0)) is defined as:

log P_φ(t | f_θ(v_0)) = Σ_{j=1}^{|t|} log P_φ(t_j | f_θ(v_0), t_{<j}),   (3)

where |t| is the number of tokens contained in t. The training of f_θ follows a gradient descent strategy, while gradients pass through the fixed LLM P_φ, i.e.,

θ ← θ + η ∇_θ log P_φ(t | f_θ(v_0)),

where η is the learning rate.
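The per-token objective in Eq. (3) can be sketched in a few lines. The following toy illustration is ours, not the paper's code: the frozen LLM is stubbed by precomputed per-step vocabulary logits, each assumed to be conditioned on the projected item embedding f_θ(v_0) and the preceding ground-truth caption tokens (teacher forcing), and we compute the corresponding negative log-likelihood.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of vocabulary logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def caption_nll(step_logits, caption_token_ids):
    """Negative of Eq. (3): sum over caption tokens of
    -log P(t_j | f_theta(v_0), t_<j). `step_logits[j]` stands for the frozen
    LLM's logits at step j (a stub for illustration)."""
    nll = 0.0
    for logits, tok in zip(step_logits, caption_token_ids):
        nll -= log_softmax(logits)[tok]
    return nll

# Toy example: vocabulary of 4 tokens, caption mapped to token ids [2, 0].
step_logits = [[0.1, 0.2, 3.0, -1.0],   # step 1: most mass on token 2
               [2.5, 0.0, 0.3, 0.1]]    # step 2: most mass on token 0
loss = caption_nll(step_logits, [2, 0])
```

Minimizing this loss with respect to the projector parameters (the LLM stays frozen) is exactly the gradient-descent update described above.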
In the practical implementation of SAID, the output dimension of f_θ(v_0) should match the token embedding size of P_φ, i.e., f_θ(v_0) resembles one token for the LLM. We instantiate f_θ as an embedding lookup table, as we find this instantiation possesses extreme efficiency and achieves considerable performance in our exploratory experiments.
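A minimal sketch of the lookup-table instantiation follows; class and variable names are ours, and the dimensionality is shrunk for brevity (Llama2-7B's actual token embedding size is 4096).

```python
import random

class LookupProjector:
    """Sketch of the projector f_theta instantiated as an embedding lookup
    table: one trainable vector per item ID, with the same dimensionality
    as a single LLM token embedding."""
    def __init__(self, num_items, llm_token_dim, seed=0):
        rng = random.Random(seed)
        # Small random initialization, as is conventional for embeddings.
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(llm_token_dim)]
                      for _ in range(num_items)]

    def __call__(self, item_id):
        # f_theta(v_0): the projected embedding fed to the LLM as one token.
        return self.table[item_id]

proj = LookupProjector(num_items=100, llm_token_dim=8)
x_23 = proj(23)
```

Because the projector is just a table, a forward pass is a single index lookup, which is what makes this instantiation so efficient.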

Model-agnostic sequential recommender training.
After the completion of projector training in the first stage, we can obtain each item's semantically aligned embedding x_i. As depicted in Stage 2 of Figure 2, these embeddings from the projector can be seamlessly integrated with a downstream sequential model for recommendation. This characteristic of SAID renders it agnostic to the specific downstream recommender models employed.
Depending on the downstream sequential model, the embedding x_i may also be added with a position embedding p_i to explicitly identify item i's position in the sequence. This practice is particularly relevant when employing Transformer-like models for sequential recommendation. RNN-like models do not require positional embeddings, as they inherently incorporate sequential information within their model architectures. We denote the overall representation of item i fed into the sequence model as e_i, i.e., e_i = x_i + p_i. Let g_Φ denote a downstream model parameterized by Φ. The sequence representation obtained from g_Φ is denoted as h_s, which signifies the transformed sequence: h_s = g_Φ(e_1, e_2, ..., e_n). It is important to note that we employ x_i as an individual item's representation instead of passing it through the sequence model g_Φ, in order to further improve training and inference efficiency. We expect the correlation between h_s and the ground-truth item's representation to be learned automatically. After obtaining the sequence representation h_s, we predict the next item using the cosine similarity between h_s and the representations of all items in I. The similarity score for an item v is calculated as s_v = sim(h_s, x_v).
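The scoring procedure above can be sketched as follows. Mean pooling stands in for the actual GRU/Transformer g_Φ here, and all names and toy dimensions are our illustrative assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def score_items(item_embs, pos_embs, seq_ids, all_item_embs):
    """e_i = x_i + p_i, then h_s = g_Phi(e_1..e_n) (mean pooling stands in
    for the sequential model), then cosine similarity of h_s with every
    item's embedding x_v."""
    e = [[x + p for x, p in zip(item_embs[i], pos_embs[j])]
         for j, i in enumerate(seq_ids)]
    dim = len(e[0])
    h_s = [sum(step[d] for step in e) / len(e) for d in range(dim)]
    return [cosine(h_s, x_v) for x_v in all_item_embs]

# Toy table of three item embeddings and two positional embeddings.
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pos = [[0.1, 0.0], [0.0, 0.1]]
scores = score_items(items, pos, seq_ids=[0, 2], all_item_embs=items)
pred = max(range(len(scores)), key=scores.__getitem__)
```

The arg-max over `scores` corresponds to selecting the candidate item whose embedding is most similar to the sequence representation.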

To predict the next item, we select the one with the highest similarity to the sequence as the prediction: î = arg max_{v ∈ I} sim(h_s, x_v). In the training process of this stage, we employ an item-item contrastive learning objective, where negative items come from the entire item set I. Note that only the sequential model g_Φ and the projector f_θ are trained, while the LLM P_φ does not participate in this stage. The optimization objective adopted for item-item contrastive learning is formulated as follows:

L = - log [ exp(sim(h_s, x^+)) / (exp(sim(h_s, x^+)) + Σ_{x^-} exp(sim(h_s, x^-))) ],

where x^+ is the representation of the ground-truth next item, x^- is the representation of a negative item in I, and sim denotes a specific similarity metric.
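A minimal sketch of this contrastive objective, assuming the similarity scores are already computed. The log-sum-exp shift is for numerical stability; no temperature term is shown because the paper leaves the similarity metric generic.

```python
import math

def contrastive_loss(sim_pos, sim_negs):
    """Item-item contrastive objective: negative log of the softmax
    probability assigned to the ground-truth next item's similarity,
    against negatives drawn from the item set I."""
    m = max([sim_pos] + sim_negs)  # shift for numerical stability
    denom = sum(math.exp(s - m) for s in [sim_pos] + sim_negs)
    return -(sim_pos - m - math.log(denom))

# A well-separated positive yields a lower loss than a dominated one.
loss_good = contrastive_loss(0.9, [-0.2, 0.1, 0.0])
loss_bad = contrastive_loss(0.0, [0.9, 0.8, 0.7])
```

Minimizing this loss pushes h_s toward the ground-truth item's embedding and away from the negatives.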

EXPERIMENTS

Experimental setup
Datasets. To evaluate the performance of our method, we select six sub-category datasets from the Amazon review corpus, i.e., "Industrial and Scientific", "Musical Instruments", "Arts, Crafts and Sewing", "Office Products", "Video Games", and "Pet Supplies". The statistics of the datasets after preprocessing are shown in Table 1.
For training and testing, we remove the items without title information in the metadata, as the title will be used for item identification, symbolization, and modeling. Then we group each user's interacted items in chronological order for sequence construction. Following previous work [15], we select three item attributes, i.e., title, category, and brand, to construct a caption.
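A sketch of how such a caption might be assembled from the three selected attributes. The field names, ordering, and "Key: Value" formatting are our assumptions for illustration, not the paper's exact template.

```python
def build_caption(item):
    """Flatten the selected attributes of an item's metadata dictionary
    into a single textual caption string."""
    parts = []
    for key in ("brand", "category", "title"):
        value = item.get(key)
        if value:  # items lacking a title were removed during preprocessing
            parts.append(f"{key.capitalize()}: {value}")
    return " ".join(parts)

caption = build_caption({"brand": "BrandA", "category": "Clothing",
                         "title": "Lined Borg-collar Denim Jacket"})
```

The resulting caption is the text sequence t that the projector's embedding is trained to elicit from the LLM.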

Baselines.
Considering that the primary contribution of SAID is the semantically aligned item embedding learning, we compare it with several commonly employed item embedding initialization schemes from previous works. i) Random, which is the most widely adopted item embedding initialization scheme in the SRS literature. ii) Last hidden state (LH), which extracts an LLM's last hidden state by inputting an item's textual caption into it. We choose LH for comparison since it is a representative LLM-augmented method for SRS. To ensure a fair comparison, we utilize the same LLM model for LH when comparing it with our method.
As for the downstream sequential model, we adopt the two most representative methods, i.e., GRU4Rec [8] with an RNN-based architecture and SASRec [13] with a Transformer-based architecture.

Evaluation Settings.
To evaluate the performance of sequential recommendation, we adopt three widely used metrics, i.e., NDCG@N, Recall@N, and MRR, where N is set to 10. Moreover, we evaluate the capability of different LLMs with the generation accuracy, which calculates a word-level accuracy when generating an item's caption. To perform train-validate-test splitting, we employ the leave-one-out strategy. The most recent item in an item sequence is reserved for testing, the penultimate item is used for validation, and the remaining items are used for training. We rank the ground-truth item of each sequence among all items and report the averaged results over all item sequences.
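Under leave-one-out evaluation, all three metrics reduce to functions of the 1-based rank of each held-out ground-truth item among all candidates. A minimal sketch (function and variable names are ours):

```python
import math

def rank_metrics(ranks, n=10):
    """Compute NDCG@N, Recall@N, and MRR from 1-based ranks of the
    held-out ground-truth item for each test sequence. With a single
    relevant item per sequence, per-sequence NDCG is 1/log2(rank+1)
    when the item appears in the top N, and 0 otherwise."""
    ndcg = sum(1.0 / math.log2(r + 1) for r in ranks if r <= n) / len(ranks)
    recall = sum(1 for r in ranks if r <= n) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return ndcg, recall, mrr

# Ranks of the ground-truth item for four toy test sequences.
ndcg10, recall10, mrr = rank_metrics([1, 3, 12, 5])
```

Note that MRR is computed over the full ranking (no cutoff), while NDCG@10 and Recall@10 only credit hits within the top 10.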

Implementation Details.
By default, we utilize Llama2-7B in the first stage of SAID for performance comparison. Moreover, we also compare the capability of distinct LLMs, including Bloomz-560M, Bloomz-1B7, Bloomz-7B, and Llama2-13B. All baselines utilize the same sequential models, batch size, and optimizer settings, differing only in the initial item embeddings fed into the sequential models. The embedding size and hidden dimension of the sequential models are both set to 256. The batch size is set to 256. A single layer is employed for all sequential models. For SASRec, one attention head is adopted. The maximum item sequence length is set to 50. We adopt the AdamW [16] optimizer and a learning rate of 5e-4, along with early stopping with a patience of 5 epochs.
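The early-stopping rule with a patience of 5 epochs can be sketched as follows. This is a minimal illustration; the exact monitored validation metric and bookkeeping are our assumptions.

```python
def should_stop(val_metrics, patience=5):
    """Early stopping: stop once the validation metric has not improved
    for `patience` consecutive epochs. `val_metrics` is the per-epoch
    history of a higher-is-better metric (e.g., validation NDCG@10)."""
    best, since_best = float("-inf"), 0
    for m in val_metrics:
        if m > best:
            best, since_best = m, 0  # new best: reset the counter
        else:
            since_best += 1
        if since_best >= patience:
            return True
    return False

# Five consecutive epochs without improvement over the best (0.12) -> stop.
stop = should_stop([0.10, 0.12, 0.12, 0.11, 0.12, 0.11, 0.10], patience=5)
```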

Further Analysis
Effect of different LLMs. We first investigate the capability of different LLMs on semantically aligned embedding learning. The results on the Scientific dataset are shown in Figure 3, where the y-axis indicates the LLMs' token-level generation accuracy. It is evident that as the scale of LLMs increases, the generation accuracy also improves. While both Llama2-7B and Llama2-13B demonstrate near-perfect generation performance, the larger Llama2-13B exhibits faster convergence in the initial stage of training.
Based on the findings in Figure 3, we choose the item embeddings trained from the top three LLMs for further comparison of SRS performance, as depicted in Figure 4. The results indicate that Llama2-7B consistently outperforms Bloomz-7B across both metrics and downstream models. Moreover, Llama2-13B demonstrates slightly better performance than Llama2-7B. Considering these outcomes, we select Llama2-7B as the default LLM in SAID, maintaining a balance between efficiency and efficacy.

Effects of embedding alignment quality. In this part, we investigate the correlation between the quality of embedding alignment and the performance of downstream recommendation. We employ the GRU4Rec backbone for illustration. The results are illustrated in Figure 5. The x-axis indicates the generation accuracy of the adopted LLM during stage 1 of SAID, and the y-axis signals the performance of downstream models. Note that the distribution of points along the x-axis is not uniform, because the practical generation accuracy of alignment cannot be guaranteed to be uniformly distributed in advance. The grey line represents the performance of GRU4Rec with randomly initialized item embeddings, and the red line shows the performance of GRU4Rec with item embeddings learned in stage 1 of our SAID. From Figure 5, we can conclude that as the generation accuracy increases, the performance of the downstream model also improves. This consistent improvement is observed in both the NDCG@10 and Recall@10 metrics, confirming that item embedding alignment plays a significant role in enhancing SRS performance.

Effects of freezing item embeddings. In this part, we investigate the impact of freezing item embeddings during the training of downstream models. The results, depicted in Figure 6 using the Scientific dataset and GRU4Rec as the downstream model, demonstrate that the without-freezing scheme outperforms the with-freezing counterpart in all cases. This outcome is expected since the without-freezing scheme can be considered a further fine-tuning of the item embeddings. However, even though it is not as effective as the without-freezing scheme, the with-freezing scheme still achieves better performance than the random baseline. This finding confirms the effectiveness of the semantically aligned embeddings in initializing item embeddings.

Efficiency of SAID. To assess the efficiency of different sequence recommendation paradigms, we conduct comparisons on a machine equipped with an A100 80G GPU, using Llama2-7B with a maximum token length of 1024. We utilize PyTorch and conduct inference with half-precision. Table 3 displays the experimental outcomes. We find that SAID demonstrates significant efficiency superiority, requiring only 2.2% of the time taken by the LLM-centric method. Additionally, in Alipay's online advertising deployment, SAID achieves an overall query response time of less than 20 milliseconds. In industrial scenarios, it is often necessary for the response time to be below 20 milliseconds, a standard that LLM-centric methods typically struggle to meet.

Visualization
Visualization of learned item embeddings. We visualize the learned item embeddings on six datasets using t-SNE [27] in Figure 7. The embeddings clearly exhibit clustering within each dataset and separability between different datasets. Given the distinctness of item textual descriptions across datasets, the results in Figure 7 provide evidence that the item embeddings successfully capture the semantic meaning of their textual descriptions. Visualization of items with similar embeddings. To examine the captions of items that share similar learned embeddings, we perform clustering on the item embeddings extracted from the Scientific dataset. The clustering is conducted using the K-means algorithm with 100 clusters and the cosine similarity metric. Afterward, we select two clusters and randomly pick two items within each cluster. The ground-truth text descriptions of these items are listed in Table 4. Table 4 reveals the presence of intra-cluster similarity and inter-cluster distinction, e.g., cluster A is about scientific measurement and cluster B relates to industrial adhesives. This result indicates that the learned item embeddings are capable of capturing the semantic meaning of their original textual descriptions, which is consistent with the results in Figure 7.
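K-means under the cosine similarity metric amounts to spherical k-means: unit-normalizing the embeddings makes dot products equal to cosine similarities. A minimal sketch with naive initialization on toy 2-D data (the paper uses 100 clusters on the real embeddings, presumably with a standard library implementation):

```python
import math

def normalize(v):
    # Scale a vector to unit length (guard against the zero vector).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine_kmeans(vectors, k, iters=10):
    """Spherical k-means sketch: on unit vectors, assigning each point to
    the centroid with the largest dot product is assignment by cosine
    similarity."""
    vecs = [normalize(v) for v in vectors]
    centroids = vecs[:k]  # naive initialization: first k points
    for _ in range(iters):
        assign = [max(range(k),
                      key=lambda c: sum(a * b for a, b in zip(v, centroids[c])))
                  for v in vecs]
        for c in range(k):
            members = [v for v, a in zip(vecs, assign) if a == c]
            if members:  # re-normalize the mean back onto the unit sphere
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assign

# Two obvious directions: x-axis-like vs. y-axis-like toy embeddings.
labels = cosine_kmeans([[1, 0.1], [0.9, 0], [0.1, 1], [0, 1.1]], k=2)
```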
Visualization of generated item descriptions. Although a capable LLM can achieve virtually perfect (100%) generation accuracy as indicated in Figure 3, erroneous generations still occur in some cases. To concretely examine the generation results of specific LLMs, we illustrate LLM-generated texts for two items with imperfect generation accuracy. The results are listed in Table 5. We highlight the differing words in green in the ground-truth texts and in red in the generated texts. We find that erroneous words are rare and generally have a minor influence on the overall semantic meaning. This finding confirms the effectiveness of the alignment learning with a proper LLM in our SAID.

ONLINE DEPLOYMENT
We deployed the proposed SAID in Alipay's advertising system and conducted an A/B test for one week. The online experiment demonstrates that the SAID framework significantly improved the click-through rate (CTR) by 2.98% and the cost per mille (CPM) by 3.07% compared with the LH baseline. The online response time is less than 20 milliseconds. This outcome verifies the efficacy of SAID in real-world recommendation scenarios.

RELATED WORK

Sequential Recommendation
Early sequential recommendation models mainly focus on modeling users' sequential behaviors as a Markov chain, in which an item-to-item transition matrix is learned for next-item prediction [21].
Owing to the extraordinary capability of deep learning in modeling complicated item sequence patterns, a surge of deep neural network-based methods has been proposed to enhance the performance of sequential recommendation models; e.g., RNN-based methods [8] and CNN-based methods [24] have proven effective for sequential recommendation. However, these solutions often fail to capture long-term dependencies between arbitrary items in a sequence. In light of this, Transformer-based solutions, e.g., SASRec [12] and BERT4Rec [23], have been widely applied and achieved promising results in sequential recommendation. Moreover, sophisticated methods such as contrastive learning [1,32] have also been adopted to relieve data sparsity in sequential recommendation scenarios.

LLM based recommendation
Motivated by the notable achievements of LLMs, several works try to employ these large models in recommendation [5,9,15,25,33]. MoRec [33] investigates the viability of utilizing textual representations, derived from language models, as an alternative to ID embeddings for recommendation. ZESRec [5] explores the zero-shot capability of a sequential recommendation model trained on a specific domain. It employs a pre-trained language model to generate text-based representations of items from their associated information. The derived representations are then utilized for the training of sequential models, which are deployed for personalized recommendation in diverse domains. Following ZESRec, UniSRec [9] leverages multi-domain data to pre-train the sequential model, and subsequently employs domain-specific data for fine-tuning. Furthermore, Recformer [15] pre-trains and fine-tunes a language model in a holistic approach for item text encoding and sequential recommendation. However, the integration of LLMs and textual representations in these studies results in prolonged token sequences that significantly impede efficiency. Besides, these works overlook the explicit alignment between item embeddings and the semantic meaning of the corresponding textual descriptions, thereby offering no assurance that the textual information of the items is accurately preserved in the generated embeddings.

CONCLUSION
In this paper, we propose SAID, a framework that utilizes LLMs to project item IDs into semantically meaningful embeddings within the embedding space of these LLMs. The learned semantic embeddings are leveraged by downstream sequential models for concrete recommendation tasks. In this way, SAID efficiently eliminates the lengthy-token-sequence challenge of directly utilizing LLMs in recommendation while preserving superior performance by leveraging the capability of LLMs. Experiments conducted on six public datasets and Alipay's online advertising deployment justify the efficiency and efficacy of SAID. Additionally, we observe that the choice of LLM has a substantial impact on the quality of learned item embeddings, consequently influencing the performance of downstream models. In practical industrial applications, items are often associated with multi-modal features. In the future, we hope to unify the integration of such multi-modal data in our framework for further improvement.

Figure 1 :
Figure 1: An item sequence with IDs and textual captions, which is fed as inputs into an SRS.

2.2.1 Overall framework. The right part of Figure 2 depicts the architecture of SAID. SAID adopts a two-stage training process, namely (1) semantically aligned embedding learning and (2) model-agnostic sequential recommender training.

Figure 2 :
Figure 2: Upper left: architecture of LLM-augmented SRS systems. Lower left: architecture of LLM-centric SRS systems. Right: the proposed SAID framework, which consists of two stages, i.e., (1) semantically aligned embedding learning and (2) model-agnostic sequential recommender training. Note that in stage 1 of SAID, all item embeddings are learned in parallel.

Figure 3 :
Figure 3: Accuracy of LLMs in the semantically aligned embedding learning stage on the Scientific dataset.

Figure 4 :
Figure 4: Performance of downstream models using items' semantic embeddings learned from the top three LLMs (ranked by their generation capability) on the Scientific dataset.

Figure 5 :
Figure 5: Performance of downstream models under different generation accuracy of the adopted LLM on the Scientific dataset.

Figure 6 :
Figure 6: Performance comparison among the without freezing scheme, with freezing scheme, and the random baseline.The experiment is conducted on the Scientific dataset.

Figure 7 :
Figure 7: Visualization of learned item embedding from six datasets using t-SNE with the scikit-learn's default settings.


Table 1 :
Statistics of the datasets after preprocessing. Avg. n denotes the average length of item sequences.

Table 2 :
Performance comparison of different methods.The best results of all methods are highlighted in bold, and the best performance of baselines is underlined.The improvement is calculated over the second-best result.

Table 3 :
Inference efficiency of our SAID with both sequential models versus the Llama2-7B-based SRS model.

Table 4 :
Item descriptions from the same clusters based on their learned embeddings.

Table 5 :
Comparison between ground-truth captions and generated captions for items with imperfect generation accuracy. The discrepancies are highlighted.