Contrastive Learning based Item Representation with Asymmetric Augmentation for Sequential Recommendation

Contrastive learning has been widely applied in sequential recommendation to improve recommendation performance. Existing contrastive learning methods focus on adjusting the number of views of positive and negative samples to enhance item representations via data-level augmentation (e.g., MMInfoRec). However, they generally ignore that the sampled items still follow a long-tail distribution: a few popular items occur with high frequency in contrastive learning, while the majority of unpopular items occur with low frequency. This imbalance in sample extraction leads to insufficient contrastive learning between popular and unpopular items, resulting in sub-optimal item representations. In this paper, we propose CA4Rec, a Contrastive learning based item representation with Asymmetric augmentation for Sequential Recommendation, which adopts an asymmetrically augmented multi-instance contrastive learning strategy to enhance item representations. Specifically, we first use a popularity-aware method to divide the entire item set into popular and unpopular items; we then generate k augmented views for popular items and k + n augmented views for unpopular items. Finally, we use our proposed asymmetric multi-instance noise contrastive estimation (AMINCE) loss to perform the contrastive learning computation, ensuring sufficient contrast between popular and unpopular items to mitigate the imbalance. We implement the improvements on a state-of-the-art model and conduct extensive experiments on three benchmark datasets. The experimental results demonstrate that CA4Rec outperforms state-of-the-art baselines.


INTRODUCTION
The primary objective of sequential recommendation is to learn the sequential patterns in user behavior. Typically, sequential recommender models rely on the next item as the basis for supervised training. Recent advancements have introduced auxiliary tasks to enhance the training process. For instance, some recent methods incorporate a manually masked content prediction task as an additional training signal, such as masked item prediction [16] and masked attribute prediction [22]. Moreover, recent methods have utilized contrastive learning to enable self-supervised learning on sequence representations, allowing the model to learn from various supervision signals [9,12,18,21,22].
Although these methods have made progress in sequential recommendation, training is still influenced by the long-tail distribution of items, resulting in suboptimal performance. In the dataset, items follow a long-tail distribution: popular items appear frequently, while unpopular items appear rarely. This is manifested in user interaction sequences, where users tend to interact more with popular items and less with obscure unpopular items. The long-tail distribution of items is similar to the problem of sample imbalance in the image domain [7]. When conducting contrastive learning, the contrastive samples are typically extracted from user sequences. This leads to sufficient contrastive learning among popular items, while contrastive learning between popular and unpopular items remains insufficient.
In this paper, we aim to develop an effective solution that alleviates the insufficient representation training in contrastive learning caused by the long-tail distribution of items. We propose a contrastive learning based item representation with asymmetric augmentation for sequential recommendation, called CA4Rec. Specifically, building on the MMInfoRec [12] encoding framework, we first use a popularity-aware method to divide the entire item set into popular and unpopular items; we then generate different numbers of augmented views for popular and unpopular items. Finally, an asymmetric multi-instance variant of Noise Contrastive Estimation (NCE) is applied to ensure the effective execution of contrastive learning. The contributions of this paper are summarized as follows: (1) We generate different numbers of augmented views for popular items and unpopular items to ensure effective contrastive learning. (2) Building on this strategy, we derive an asymmetric multi-instance contrastive loss, denoted AMINCE, to ensure adequate training of item representations. (3) Extensive experiments conducted on three datasets demonstrate the superiority of the CA4Rec model over state-of-the-art baselines.

Contrastive Learning
Contrastive learning has recently been applied in sequential recommendation methods [9,18,19,21,22]. Several approaches have been employed for masked content prediction, including item and attribute masking in S3Rec [22] and segment encoding in CL4SRec [18]. These methods assume that all desired information is injected into the ID representation through these pretext tasks. CoSeRec [8] constructs view pairs by considering item relevance and further proposes two information-rich data augmentation methods. MMInfoRec [12] aligns pretext tasks with the recommendation task, utilizing memory-enhanced multi-instance contrastive predictive coding for training and injecting the desired information into the ID representation without an additional fine-tuning step. ICLRec [1] uses clustering to extract users' intent distributions from their behavior sequences, then integrates the captured intent into the sequential model using a contrastive self-supervised learning (SSL) loss.

METHOD

3.1 Problem Definition and Notation
In sequential recommendation, there is usually a user set U and an item set I, where u ∈ U represents a user and v ∈ I represents an item. |U| and |I| denote the numbers of users and items, respectively. Arranged in chronological order, user u has a historical interaction sequence over multiple items: [v_1, v_2, ..., v_t], where t is the number of interactions. Additionally, there is an attribute set A that includes all attributes of all items in the dataset; |A| denotes the number of attributes appearing in the dataset. Each item v has associated attributes A_v = {a_1, a_2, ..., a_m}. Typically, different types of items have different attributes. The goal of a sequential recommender system is to predict the next item v_{t+1} that a user will interact with, based on the interaction sequence [v_1, v_2, ..., v_t] and the attributes A_v of each item, where t represents the current time step.

Learning Framework
Embedding Layer: To transform IDs and attributes into dense vectors, we utilize two embedding matrices: the item embedding matrix Emb_I ∈ R^{|I|×d} and the attribute embedding matrix Emb_A ∈ R^{|A|×d}, where d represents the embedding size. For instance, given an item v_t with attributes {a_1, a_2}, we can obtain the corresponding dense vectors using a lookup function.

Attribute Encoder: The goal of the attribute encoder is to combine all the information of an item, including its ID information and side information, into a latent representation. Given the item embedding x and its attribute embeddings {a_*}, the attribute encoder function f_enc computes the latent representation z of the item as z = f_enc(x, {a_*}), where x ∈ R^{1×d}, a_* ∈ R^{1×d}, and z ∈ R^{1×d}. The function f_enc can be implemented using a self-attention mechanism applied to the item embedding and all attribute embeddings.

Temporal Aggregation: The function f_ta aggregates the temporal information of the items up to a certain time step: c_t = f_ta(z_1, z_2, ..., z_t), where c_t ∈ R^{1×d}. Attention mechanisms are commonly employed in sequential recommender models to perform this temporal aggregation.

Memory Module: To enhance the representation ability of the model, a memory module is incorporated to generate predictive outputs for each step based on the context vector. The memory module includes a memory bank M ∈ R^{m×d} with m memory slots. Memory addressing reads from M with weights produced by a multi-layer perceptron (MLP) over the context vector, and the result is added back to the context vector; this residual style retains the original prediction and improves gradient flow during training.

Recommendation: During validation and testing, the recommendation process ranks the scores between the sequence representation and all items in the item set. The score s is computed by taking the dot product between the sequence representation at the current time step and the latent representations z of all items.
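To make the pipeline concrete, below is a minimal PyTorch sketch of these components, assuming MMInfoRec-style design choices (single-head self-attention for f_enc, a small Transformer for f_ta, and softmax addressing over the memory bank); all class names, layer configurations, and the exact residual read are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CA4RecEncoder(nn.Module):
    """Minimal sketch of the encoding pipeline described above (illustrative)."""

    def __init__(self, n_items, n_attrs, d=64, n_slots=64):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d)   # Emb_I in R^{|I| x d}
        self.attr_emb = nn.Embedding(n_attrs, d)   # Emb_A in R^{|A| x d}
        # f_enc: self-attention over the item token and its attribute tokens
        self.attr_encoder = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        # f_ta: a small Transformer aggregates z_1..z_t into c_t
        layer = nn.TransformerEncoderLayer(d, nhead=1, dropout=0.5, batch_first=True)
        self.temporal_agg = nn.TransformerEncoder(layer, num_layers=2)
        # Memory bank M in R^{m x d}, addressed by an MLP over the context vector
        self.memory = nn.Parameter(torch.randn(n_slots, d))
        self.addresser = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, n_slots))

    def encode_item(self, item_id, attr_ids):
        # z = f_enc(x, {a_*}): attend over [item; attributes] and keep the item token
        x = self.item_emb(item_id).unsqueeze(1)      # (B, 1, d)
        a = self.attr_emb(attr_ids)                  # (B, n_attr, d)
        tokens = torch.cat([x, a], dim=1)
        out, _ = self.attr_encoder(tokens, tokens, tokens)
        return out[:, 0]                             # (B, d)

    def forward(self, z_seq):
        # c_t = f_ta(z_1..z_t), then a residual memory read
        c = self.temporal_agg(z_seq)[:, -1]          # (B, d)
        w = torch.softmax(self.addresser(c), dim=-1) # (B, m) addressing weights
        return c + w @ self.memory                   # residual style

    def scores(self, c, all_z):
        # Recommendation: dot product between sequence repr. and all item reprs.
        return c @ all_z.t()                         # (B, |I|)
```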

Asymmetric Multi-Instance Augmentation

The vanilla NCE [3] (Noise Contrastive Estimation) and MINCE [12], through single-instance and multi-instance augmentation respectively, aim to enhance item representations through contrastive learning. However, both overlook the insufficient contrastive learning between popular and unpopular items caused by the long-tail distribution of items. To address this, we divide the item set I into popular items I_p (top 20%) and unpopular items I_u (tail 80%) according to the distribution. Then, different amounts of augmentation are applied to popular and unpopular items. For popular items, k different random Dropout functions are used when embedding them with the encoder, generating a set of k different latent representations of the same item. For unpopular items, k + n different random Dropout functions are used, generating a set of k + n different latent representations of the same item, as shown on the right-hand side of Fig. 1. Contrastive learning is then performed based on this asymmetric augmentation, allowing popular items to undergo more comprehensive contrastive learning against a larger number of unpopular-item views, alleviating the aforementioned problem. The numbers of terms included in the numerator and denominator therefore differ between popular and unpopular items. The resulting loss, AMINCE, is defined with a temperature coefficient τ and a negative sample set N_v for item v; the sets N_add and P_add denote the additional negative and positive sample sets, respectively, introduced by the extra augmented views of the unpopular items.
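A hedged reconstruction of AMINCE, assuming it keeps the multi-instance InfoNCE form of MMInfoRec and simply enlarges the positive and negative sets with the extra views of unpopular items (P_v denotes the k base views of the target item v, and ĉ the predicted representation; the authors' exact expression may differ):

```latex
\mathcal{L}_{\text{AMINCE}}
  = -\log \frac{\sum_{\mathbf{z}^{+}\in \mathcal{P}_v \cup \mathcal{P}_{\text{add}}} \exp\!\left(\hat{\mathbf{c}}^{\top}\mathbf{z}^{+}/\tau\right)}
               {\sum_{\mathbf{z}^{+}\in \mathcal{P}_v \cup \mathcal{P}_{\text{add}}} \exp\!\left(\hat{\mathbf{c}}^{\top}\mathbf{z}^{+}/\tau\right)
                + \sum_{\mathbf{z}^{-}\in \mathcal{N}_v \cup \mathcal{N}_{\text{add}}} \exp\!\left(\hat{\mathbf{c}}^{\top}\mathbf{z}^{-}/\tau\right)}
```

A minimal PyTorch sketch under the same assumption; the function names and the popularity-split helper are illustrative:

```python
import torch

def split_by_popularity(item_counts, top_ratio=0.2):
    """Popularity-aware split: the top 20% of items by frequency are 'popular'."""
    ranked = sorted(item_counts, key=item_counts.get, reverse=True)
    cut = int(len(ranked) * top_ratio)
    return set(ranked[:cut]), set(ranked[cut:])

def amince_loss(pred, pos_views, neg_views, tau=0.3):
    """Asymmetric multi-instance NCE (sketch).

    pred:      (B, d)    prediction c_hat for the next item
    pos_views: (B, P, d) augmented views of the target item; P = k for a popular
                         target, P = k + n for an unpopular one (extras = P_add)
    neg_views: (B, N, d) views of negative items, including the extra views of
                         unpopular negatives (the set N_add)
    """
    pos = torch.einsum('bd,bpd->bp', pred, pos_views) / tau   # (B, P)
    neg = torch.einsum('bd,bnd->bn', pred, neg_views) / tau   # (B, N)
    logits = torch.cat([pos, neg], dim=1)
    # multi-instance NCE: all positive terms share the numerator
    log_ratio = torch.logsumexp(pos, dim=1) - torch.logsumexp(logits, dim=1)
    return -log_ratio.mean()
```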

EXPERIMENTS

4.1 Experimental Settings
4.1.1 Datasets. To evaluate the proposed CA4Rec model, we use three datasets, summarized in Table 1: Beauty, Sports, and Toys [10]. Following prior works [9,15,22], we select three subcategories from the Amazon dataset, where the fine-grained categories and product brands serve as attributes.
4.1.2 Preprocessing. We organize each user's interactions into a sequence ordered by timestamp. Following [15,20,22], items appearing fewer than 5 times are filtered out, and sequences shorter than 5 are also removed. We use the second-to-last item of each sequence for validation and the last item for testing; the rest are used for training.
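A minimal sketch of this preprocessing, assuming interactions are given as a mapping from user to (item, timestamp) pairs; note that the text describes a single filtering pass, whereas strict k-core filtering would iterate until convergence:

```python
from collections import Counter

def preprocess(interactions):
    """Frequency filtering and leave-one-out split, as described above."""
    # Drop items that appear fewer than 5 times across all users
    counts = Counter(i for seq in interactions.values() for i, _ in seq)
    seqs = {}
    for user, events in interactions.items():
        items = [i for i, t in sorted(events, key=lambda e: e[1]) if counts[i] >= 5]
        if len(items) >= 5:                     # drop sequences shorter than 5
            seqs[user] = items
    # Leave-one-out: last item -> test, second-to-last -> validation
    return {u: {'train': s[:-2], 'valid': s[-2], 'test': s[-1]}
            for u, s in seqs.items()}
```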
4.1.3 Metrics. We adopt two widely used evaluation metrics for all sequential recommendation methods: top-k Hit Ratio (HR@k) and top-k Normalized Discounted Cumulative Gain (NDCG@k), with k selected from {5, 10}, as in previous studies [6,9,12,15,22]; a minimal computation sketch is given after the baseline list.

4.1.4 Baselines. Our baseline selection follows recent research [12,22], encompassing both popular and advanced methods. We compare the proposed CA4Rec against the following baselines:

• GRU4Rec [4] utilizes the GRU architecture to model user sequences. Each sequence is taken as input, with the output hidden state serving as the sequence representation.
• Caser [17] employs CNN-based techniques to capture high-order Markov chains through horizontal and vertical convolutions, distinguishing its network structure from conventional sequence models.
• SASRec [5] is a strong unidirectional sequential recommendation model that employs the multi-head attention mechanism to predict the next item. Its potency stems from the influential role of attention mechanisms in sequence modeling.
• S3Rec [22] applies masked contrastive pre-training, using masking on segments, attributes, and single items. It is a strong baseline for sequential recommendation utilizing the side information of items. The recent Seq2SeqRec [9] method uses a similar next-sequence prediction approach, which is a special case of S3Rec.
• MMInfoRec [12] is a sequential recommendation framework that uses memory-augmented multi-instance contrastive predictive coding, without requiring an additional fine-tuning step.
The complete ranking results of these baseline models are taken from the updated results provided by MMInfoRec [12] and are presented in Table 2.
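For reference, with a single ground-truth test item per user, HR@k and NDCG@k reduce to the following standard formulation, where rank is the 0-based position of the ground-truth item in the full ranking over all items:

```python
import math

def hr_ndcg_at_k(rank: int, k: int):
    """HR@k is 1 if the ground-truth item is in the top-k; NDCG@k discounts
    the hit by the log of its (1-based) position."""
    if rank < k:
        return 1.0, 1.0 / math.log2(rank + 2)
    return 0.0, 0.0
```

Both metrics are averaged over all users; with one relevant item, the ideal DCG is 1, so no extra normalization term is needed.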
4.1.5 Implementation Details. The embedding size is fixed at 64, with all linear mapping functions sharing the same hidden size. The Transformer has 2 layers, each with 1 head. A Dropout function with a ratio of 0.5 is applied to both the input of the f_ta function and the Transformer module inside f_ta.
The training batch size is set to 256, and we use the Adam optimizer with a learning rate of 0.001. The number of memory slots, m, is set to 64. The default number of predictive steps is 1. A temperature of 0.3 is adopted. Furthermore, L2 regularization is incorporated with a weight of 10e-4. For the hyper-parameters of our model, the number n of additional augmented views for unpopular items is selected from {0, 1, ..., 10}.
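Gathered as a single configuration for reference, these are the values reported above; the dictionary keys are our own naming, not the authors':

```python
config = {
    "embedding_size": 64,
    "transformer_layers": 2,
    "attention_heads": 1,
    "dropout": 0.5,
    "batch_size": 256,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "memory_slots": 64,                 # m
    "predictive_steps": 1,
    "temperature": 0.3,                 # tau
    "l2_weight": 10e-4,                 # as reported in the text
    "extra_views": list(range(0, 11)),  # n, searched over {0, 1, ..., 10}
}
```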

Comparative Experimental Results
Based on the results presented in Table 2, the following conclusions can be drawn: (1) The CA4Rec model outperforms all baseline methods, achieving the best performance, with significant improvement percentages across all datasets. (2) GRU4Rec and Caser use a GRU and a CNN, respectively, for sequence encoding and representation; however, these backbones lack the strength to effectively enhance model performance. SASRec significantly outperforms them thanks to its attention mechanisms, which learn item correlations within sequences, underscoring the effectiveness of the Transformer structure as a powerful backbone for sequential recommendation. (3) S3Rec and MMInfoRec, which use contrastive learning, outperform the traditional baselines and show promising results in sequential recommendation tasks, suggesting that contrastive learning is a valuable technique for enhancing representations. (4) Compared to S3Rec, MMInfoRec performs better by using self-attention to integrate item IDs with attribute information and by aligning the contrastive task with the recommendation task to obtain strong representations. However, our model outperforms MMInfoRec, demonstrating its superior representation capability.

Model Performance for Different Hyper-Parameters
In this experiment, we follow the experimental setup of the previous work MMInfoRec, generating k = 4 augmented views for each item. On this basis, we evaluate the impact of the asymmetric multi-instance augmentation strategy. Impact of the additional number n of augmented views for unpopular items: The results in Fig. 2 show the following. (1) Compared to n = 0, setting n = 1 activates the asymmetric multi-instance augmentation strategy, improving the performance of the CA4Rec model on the Sports and Toys datasets, while the improvement on the Beauty dataset is not significant. (2) As n increases, the overall performance improves significantly, demonstrating the effectiveness of our strategy. (3) There is a positive correlation between the number n of additional augmented views for unpopular items and the overall performance; the CA4Rec model may achieve even better performance on the Toys dataset when n > 10. (4) The experimental results on the various datasets indicate that the ideal number n of additional views for unpopular items varies per dataset for optimal overall performance.

CONCLUSION
In this paper, we tackle the issue of insufficient contrastive learning in sequence recommendation caused by the long-tail distribution of items.To address this, we propose an asymmetrically augmented contrastive learning strategy.We generate more augmented views for unpopular items compared to popular items, achieving a balanced proportion during contrastive learning.This enables sufficient contrastive learning between both types of items.In our experiments, we compare CA4Rec with state-of-the-art baselines and show its superior performance.

Figure 1: The overall process transforms the IDs and attribute IDs of all items in the sequence into feature vectors through two embedding layers. For each item, its features are encoded into a latent representation z by the attribute encoder f_enc. The temporal aggregation module f_ta encodes the sequence information into c_t, which is then utilized by the memory enhancement module f_m to enhance the generation of the latent representation ẑ_{t+1}. Finally, ẑ_{t+1} undergoes adequate contrastive learning with the positive samples z_{t+1} and the negative samples, where these positive and negative samples have been asymmetrically augmented.

Table 1: Statistics of the datasets after preprocessing (shown transposed in the original).
Dataset | #Users | #Items | #Actions | #Attributes | Avg. Actions/User | Avg. Actions/Item | Avg. Attributes/Item | Sparsity

Table 2: We compare the performance of various methods. The highest score in each row is indicated in bold, while the second-highest score is underlined.