Aligned Side Information Fusion Method for Sequential Recommendation

Combining contextual information (i.e., side information) of items beyond IDs has become an important way to improve performance in recommender systems. Existing self-attention-based side information fusion methods can be categorized into early, late, and hybrid fusion. In practice, naive early fusion may interfere with the representation of IDs, resulting in negative effects, while late fusion misses effective interactions between IDs and side information. Some hybrid methods have been proposed to address these issues, but they utilize side information only in calculating attention scores, which may lead to information loss. To harness the full potential of side information without noisy interference, we propose an Aligned Side Information Fusion (ASIF) method for sequential recommendation, consisting of two parts: Fused Attention with Untied Positions and Representation Alignment. Specifically, we first decouple the positions to exclude noisy interference from the attention scores. Secondly, we adopt a contrastive objective to maintain semantic consistency between IDs and side information and then employ orthogonal decomposition to extract the homogeneous parts. By aligning the representations and fusing them together, ASIF makes full use of the side information without interfering with IDs. Offline experimental results on four datasets demonstrate the superiority of ASIF. Additionally, we successfully deployed the model in Alipay's advertising system and achieved improvements of 1.09% in clicks and 1.86% in Cost Per Mille (CPM).


INTRODUCTION
Sequential recommendation plays an important role in industrial scenarios such as e-commerce, advertising, and search systems. Its main goal is to model the user's historical behavior to predict the next item that may be of interest to the user. Among various solutions [6,16], attention-based models [10,15,20,21] are gradually becoming the mainstream due to their excellent performance.
Early self-attention-based models like BERT4Rec [15] and SASRec [7] only consider item IDs, lacking the ability to capture item attributes beyond IDs. This limitation becomes apparent when IDs change frequently. For example, in a typical recommendation scenario on the Alipay membership page, users are shown items that can be redeemed using points and money. The product pool is frequently updated with advertising programs, causing rapid changes in item IDs. Attributes such as categories and brands offer a more stable representation of a user's long-term preferences. Thus, we aim to incorporate side information into the recommendation model to boost performance.
Based on the varying fusion locations, the existing self-attention-based side information fusion methods can be categorized into three types: early, late, and hybrid fusion. Early fusion combines IDs with side information before feeding them into the attention block. In contrast, late fusion applies separate self-attention blocks to item-level and feature-level sequences and fuses them only at the final stage. It has been pointed out that early fusion may not always improve performance but may instead impair the representation of the IDs, causing a phenomenon known as information invasion [11]. Late fusion, on the other hand, lacks interaction between IDs and side information and loses some prior information. Consequently, some hybrid-fusion methods have emerged recently. They avoid information invasion by incorporating side information only in the attention score calculation and explore interesting structures for attention correlations. Despite the remarkable improvements, these approaches still suffer from two limitations: (1) Correlations between IDs and attributes can vary, with some being strong and others weak, making it difficult to eliminate interference and learn meaningful correlations effectively. (2) Methods that completely exclude side information from the final representation to prevent information invasion may inadvertently discard crucial information contained in the side information itself.
In this work, we try to enhance the utilization of side information by mitigating noise interference. Inspired by [8], we expand the fusion form of the attention scores of the early-fusion variant of SASRec. As shown in Fig. 1, IDs have a strong relationship with attributes, while the correlations between the position encoding and the other terms are relatively weak. This indicates that the common way of fusing position as an ordinary type of side information may introduce noise into the attention scores. We also examine the representation spaces of IDs and side information in this early-fusion model on the Yelp dataset to provide an explanation for information invasion. From a macroscopic perspective, we can observe a significant dissimilarity between the two distributions (see Fig. 2(a)), indicating that the representation space after fusion will deviate considerably from the original ID space. From a microscopic perspective, by projecting both the ID and side information embeddings onto a coordinate system, we uncover that if the directions of the two are opposite on certain axes, these segments of the vectors may cancel each other out, leading to a loss of information (see Fig. 2(b)).
To address the above issues, we propose a novel method called Aligned Side Information Fusion (ASIF). First, we introduce Fused Attention with Untied Positions, which separates the ID-attribute correlations from the position encoding during attention score calculation, eliminating noise interference while preserving the strong correlation. Second, we propose Representation Alignment, consisting of two steps: Representation Space Alignment (RSA) and Homogeneous Information Extraction (HIE). RSA employs a contrastive objective for each paired ID and attribute at the interaction granularity within the sequence to ensure their semantic consistency. Although this operation brings the two distributions closer together, it still cannot avoid the existence of heterogeneous parts. Therefore, HIE performs orthogonal decomposition on IDs and side information to extract the homogeneous parts, thus avoiding information invasion. Our main contributions can be summarized as follows:
• We meticulously design the ASIF framework, based on Fused Attention with Untied Positions and Representation Alignment, to enhance recommendation performance by leveraging side information.
• Within Representation Alignment, we propose RSA and HIE. By employing a contrastive loss and orthogonal decomposition, we align the representation spaces of IDs and side information in both macroscopic and microscopic aspects, effectively preventing the problem of information invasion.
• Offline and online experiments demonstrate the effectiveness of our proposed method.

RELATED WORK

Sequential Recommendation
Sequential recommendation aims to predict the next item that is most likely to be interacted with based on the user's historical behaviors. With the development of deep learning techniques in recent years, many neural-network-based methods, such as Convolutional Neural Networks (CNNs) [16,19], Recurrent Neural Networks (RNNs) [13], Graph Neural Networks (GNNs) [3], and attention-based models, have started to emerge. Among them, the self-attention-based methods have made significant progress. SASRec [7] introduces self-attention into the sequential recommendation (SR) model to capture long-range dependencies. BERT4Rec [15] adopts the Cloze objective and improves performance with a bidirectional self-attention mechanism. Recent SR methods also use contrastive learning to augment the data, including CL4SRec [17] and DuoRec [12]. These works utilize only item IDs, ignoring other attributes associated with the item, which may potentially help to extract comprehensive sequence patterns.

Side Information Fusion for Sequential Recommendation
Instead of using only item IDs as the above solutions do, side information, such as other item attributes and ratings, is taken into consideration to capture meaningful supervision signals. S³-Rec [24] notices the important information contained in the attributes and devises four auxiliary self-supervised tasks to learn the intrinsic relationships. Besides utilizing side information in auxiliary tasks, end-to-end fusion approaches have begun to be explored. Following the classification system of multi-modal fusion [1,2], we categorize the self-attention-based side information fusion methods into three types: early, late, and hybrid fusion. In early fusion, IDs and side information are combined at the shallow layers of the model and then fed into the network to generate outputs. For example, the early-fusion variant of SASRec [24] combines IDs and attributes and feeds them into the self-attention block as input (see Fig. 3(a)). In late fusion, the networks for IDs and side information are independent, and fusion takes place just before the prediction layer. FDSA [22] is a late-fusion method that applies separate self-attention blocks to item-level and feature-level sequences and concatenates their hidden states only at the final stage (see Fig. 3(b)).
Both early and late fusion have their own limitations. The former cannot exclude noisy interference and may result in information invasion, while the latter lacks effective interaction between IDs and attributes. Hybrid fusion lies between them, allowing IDs and side information to interact in the middle layers. NOVA [11] first defines the information invasion problem caused by naive early fusion and proposes to incorporate attributes only in the calculation of attention scores to mitigate it (see Fig. 3(c)). However, it regards position as an ordinary attribute, introducing noise into the mixed attention. Furthermore, DIF-SR [18] decouples the attention scores for IDs and side information, allowing higher-rank attention matrices and flexible gradients (see Fig. 3(d)). Unfortunately, it abandons the implicit cross-relationships between IDs and attributes. Both methods utilize side information only in the attention scores, completely discarding it in the value matrices, which may result in a loss of information. Our work aims to fill these gaps, reducing noisy interference while enhancing the utilization of side information.

METHODOLOGY
The overall framework of ASIF is shown in Fig. 4, and the details will be introduced next.

Problem Formulation
In sequential recommendation with side information, let $\mathcal{U}$, $\mathcal{V}$, $\mathcal{X}$, and $\mathcal{A}_k$ denote the sets of users, items, item IDs, and the $k$-th type of attributes, respectively. Let $\mathcal{S}_u = \big[v^{(u)}_1, v^{(u)}_2, \ldots, v^{(u)}_n\big]$ denote the historical sequence of interactions in chronological order for user $u \in \mathcal{U}$, where $v^{(u)}_i \in \mathcal{V}$ is the $i$-th item in the user's interaction sequence and $n$ is the maximum length of the sequence. Suppose we have $m$ types of side information; then $x_i \in \mathcal{X}$ is the item ID of the $i$-th interaction, and $a^{(u)}_{i,k} \in \mathcal{A}_k$ represents the $k$-th type of attribute of the $i$-th interaction. Given the interaction history $\mathcal{S}_u$, the goal of sequential recommendation is to predict the next item that user $u$ may be interested in. It can be formalized as modeling the probability over all candidate items for user $u$:
$$p\big(v^{(u)}_{n+1} = v \mid \mathcal{S}_u\big).$$

Fused Attention with Untied Positions
For attention-based models, the naive way to incorporate side information is to fuse it with IDs and input the result into the attention block (see Fig. 3(a)). NOVA follows this structure but excludes side information from the value matrix (see Fig. 3(c)), while DIF-SR advocates decoupled attention calculation for the various side information and ID representations (see Fig. 3(d)), ensuring flexible gradients. However, according to Fig. 1, IDs have a strong relationship with attributes, whereas the position encoding has only a weak relationship with both IDs and attributes, and may therefore not be suitable to be fused with the others. Therefore, we propose Fused Attention (FA) with Untied Positions (UP) (see Fig. 3(e)).
Let $\mathbf{X}$ and $\mathbf{A}_k$ represent the embedding matrices of item IDs and attributes. We first fuse them and compute the correlation matrix as
$$\mathbf{R}_{XA} = \big(F(\mathbf{X}, \mathbf{A})\,\mathbf{W}_{Q,1}\big)\big(F(\mathbf{X}, \mathbf{A})\,\mathbf{W}_{K,1}\big)^{\top},$$
where $F$ denotes the fusion function, e.g., $F_{sum}(\mathbf{X}, \mathbf{A}) = \mathbf{X} + \sum_{k=1}^{m}\mathbf{A}_k$. Next, we compute the correlation matrix of the position encoding as
$$\mathbf{R}_{P} = \big(\mathbf{P}\,\mathbf{W}_{Q,2}\big)\big(\mathbf{P}\,\mathbf{W}_{K,2}\big)^{\top},$$
where $\mathbf{P}$ denotes the absolute position embedding matrix, $\mathbf{W}_{Q,2} \in \mathbb{R}^{d \times d_h}$, and $\mathbf{W}_{K,2} \in \mathbb{R}^{d \times d_h}$. Then we fuse the two correlation matrices and obtain the final attention formula as follows:
$$\mathbf{h}_X = \mathrm{Softmax}\!\left(\frac{\mathbf{R}_{XA} + \mathbf{R}_{P}}{\sqrt{d_h}}\right)\mathbf{X}\,\mathbf{W}_{V,1},$$
where $\mathbf{W}_{V,1} \in \mathbb{R}^{d \times d}$ and $\mathbf{h}_X$ denotes the hidden state of the item IDs. Finally, we consider the side information important enough to be fully learned, so we also pass and update it between different Transformer layers as follows:
$$\mathbf{h}_A = \mathrm{Softmax}\!\left(\frac{\mathbf{R}_{XA} + \mathbf{R}_{P}}{\sqrt{d_h}}\right)\mathbf{A}\,\mathbf{W}_{V,2}, \qquad \mathbf{h}_P = \mathrm{Softmax}\!\left(\frac{\mathbf{R}_{XA} + \mathbf{R}_{P}}{\sqrt{d_h}}\right)\mathbf{P}\,\mathbf{W}_{V,3},$$
where $\mathbf{W}_{V,2}, \mathbf{W}_{V,3} \in \mathbb{R}^{d \times d}$, and $\mathbf{h}_A$ and $\mathbf{h}_P$ denote the hidden states of the attributes and the positions, respectively.
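To make the untied-position attention concrete, the following is a minimal single-head NumPy sketch of one Fused Attention with Untied Positions step. It is an illustration under our own naming rather than the authors' implementation: the weight matrices are random stand-ins for learned parameters, sum fusion is used for $F$, and the per-layer updates of the attribute and position streams are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fused_attention_untied_pos(X, A_list, P, d_h, rng):
    """One single-head step of Fused Attention with Untied Positions.

    X: (n, d) item-ID embeddings; A_list: list of (n, d) attribute
    embedding matrices; P: (n, d) absolute position embeddings.
    """
    n, d = X.shape
    F = X + sum(A_list)                        # sum fusion F(X, A)
    Wq1, Wk1 = rng.standard_normal((2, d, d_h)) * 0.1
    Wq2, Wk2 = rng.standard_normal((2, d, d_h)) * 0.1
    Wv1 = rng.standard_normal((d, d)) * 0.1
    R_xa = (F @ Wq1) @ (F @ Wk1).T             # fused ID/attribute correlations
    R_p = (P @ Wq2) @ (P @ Wk2).T              # position correlations, kept untied
    attn = softmax((R_xa + R_p) / np.sqrt(d_h))
    return attn @ (X @ Wv1)                    # hidden state of the item IDs
```

Because the position term enters only through its own correlation matrix, the weak position/attribute cross-terms never contaminate the fused ID-attribute scores.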

Representation Alignment
From the macroscopic and microscopic perspectives, the invasion phenomenon may be due to excessive distribution deviation and vector offset, respectively. To address this, we propose Representation Space Alignment (RSA) and Homogeneous Information Extraction (HIE) to align the representations of IDs and attributes. The goal of the former is to narrow the gap between the representation spaces of item IDs and attributes, improving their semantic consistency at the interaction granularity (see Fig. 5(a)). The latter extracts the information in the attributes that is homogeneous with the item IDs and fuses it into the ID representation (see Fig. 5(b)).

Representation Space Alignment (RSA).
Taking inspiration from CLIP's alignment operation [14], we leverage a contrastive loss to align the embedding spaces of item IDs and attributes, intending to bring the two distributions closer (see Fig. 5(a)). However, unlike CLIP, our alignment occurs at the interaction granularity within a sequence rather than at the sample granularity. Specifically, let $\mathbf{X} = \big[\mathbf{x}^{(1)}; \mathbf{x}^{(2)}; \ldots; \mathbf{x}^{(n)}\big]$ and $\mathbf{A} = \big[\mathbf{a}^{(1)}; \mathbf{a}^{(2)}; \ldots; \mathbf{a}^{(n)}\big]$ represent the embedding matrices of item IDs and attributes, where $\mathbf{x}^{(i)}, \mathbf{a}^{(i)} \in \mathbb{R}^{1 \times d}$ denote the embeddings of the item ID and the attribute of the $i$-th interaction in the sequence, and $\mathbf{X}, \mathbf{A} \in \mathbb{R}^{n \times d}$. Next, we calculate the cosine similarity between the two sets of embeddings to get the final matching scores as follows:
$$\mathbf{M} = \mathrm{Softmax}\big(\tilde{\mathbf{X}}\tilde{\mathbf{A}}^{\top} \cdot e^{\tau}\big), \qquad \tilde{\mathbf{X}} = \big[\mathbf{x}^{(i)}/\|\mathbf{x}^{(i)}\|\big]_{i=1}^{n}, \quad \tilde{\mathbf{A}} = \big[\mathbf{a}^{(i)}/\|\mathbf{a}^{(i)}\|\big]_{i=1}^{n},$$
where $\mathrm{Softmax}(\cdot)$ is executed for each row of the similarity matrix and $\tau$ denotes the learnable temperature coefficient. Finally, we calculate the contrastive loss in the following form:
$$\mathcal{L}_{cl} = -\frac{1}{B}\sum_{i=1}^{B}\mathrm{mean}\big(\mathbf{Y}_i \odot \log \mathbf{M}_i\big),$$
where $\odot$ is the element-wise product, $B$ is the sample size, and $\mathbf{Y}_i$ is the ground truth of the $i$-th sample, which is an identity matrix, meaning that only the paired item IDs and attributes are positive examples.
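The RSA objective can be sketched in a few lines of NumPy. This is a hedged illustration, not the paper's implementation: the function name is ours, the temperature is a fixed scalar rather than a learnable parameter, and the loss is computed over one sequence with the identity matrix as the ground truth.

```python
import numpy as np

def rsa_contrastive_loss(X, A, tau=0.07):
    """Interaction-level contrastive loss between ID embeddings X and
    attribute embeddings A, both of shape (n, d), for one sequence."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)    # row-normalise
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    logits = (Xn @ An.T) / tau                 # cosine similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    n = X.shape[0]
    # ground truth is the identity: only the paired (x_i, a_i) are positives
    return float(-np.log(probs[np.arange(n), np.arange(n)]).mean())
```

Paired IDs and attributes are pulled together while the mismatched pairs within the same sequence act as negatives, which is what drags the two embedding distributions toward each other.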

Homogeneous Information Extraction (HIE).
The space alignment brings the two distributions closer but still cannot avoid the existence of heterogeneous parts. Therefore, we propose performing orthogonal decomposition on each layer's hidden states to extract the homogeneous parts (see Fig. 5(b)). Intuitively, if an attribute's representation points in the same direction as the ID's, it should be maximally preserved; otherwise, there may be a conflict, and it should be discarded. Thus we need an orthogonal coordinate system as the comparison granularity, one that can fully accommodate all the IDs' representations in a user's interaction sequence. Specifically, we first perform a QR decomposition of the IDs' hidden state: $\mathbf{h}_X^{\top} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$ is an orthogonal matrix and $\mathbf{R}$ is an upper triangular matrix. Then, we map both hidden states onto $\mathbf{Q}$ to get the coordinate matrices:
$$\mathrm{Proj}(\mathbf{h}_X) = \mathbf{h}_X\mathbf{Q}, \qquad \mathrm{Proj}(\mathbf{h}_A) = \mathbf{h}_A\mathbf{Q},$$
where $\mathrm{Proj}(\mathbf{h}_X), \mathrm{Proj}(\mathbf{h}_A) \in \mathbb{R}^{n \times n}$. Thus we can obtain the homogeneous part $\mathbf{h}^{*}_A \in \mathbb{R}^{n \times d}$ as
$$\mathbf{h}^{*}_A = \Big(\mathbb{1}\big(\mathrm{Proj}(\mathbf{h}_X) \odot \mathrm{Proj}(\mathbf{h}_A)\big) \odot \mathrm{Proj}(\mathbf{h}_A)\Big)\mathbf{Q}^{\top},$$
where $\odot$ is the element-wise product and $\mathbb{1}(\cdot)$ is the indicator function, which outputs 1 if the value is greater than 0 and 0 otherwise. Since $\mathbf{h}^{*}_A$ is homogeneous with $\mathbf{h}_X$, we can directly fuse it into the item representation, and Eq. 3 can be updated accordingly. Since the average sequence length of users is often lower than $n$, we can reduce the dimension of $\mathbf{h}_X$ as $\mathbf{h}_X\mathbf{W}_b$, where $\mathbf{W}_b \in \mathbb{R}^{d \times b}$ and $b$ is the number of orthogonal bases, before performing the QR decomposition, reducing the computational complexity as well as the redundancy of parameters.
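The orthogonal-decomposition step can be sketched as follows, assuming the $n$ ID hidden states are linearly independent and $n \le d$; the reduced-dimension variant is omitted. The function name and the use of `np.linalg.qr` are our own choices for illustration.

```python
import numpy as np

def extract_homogeneous(h_X, h_A):
    """Homogeneous Information Extraction (sketch).

    h_X, h_A: (n, d) hidden states of item IDs and attributes.  Builds
    an orthonormal basis spanning the ID states, projects both onto it,
    and keeps only the attribute components whose sign agrees with the
    corresponding ID component.
    """
    Q, _ = np.linalg.qr(h_X.T)              # h_X^T = Q R, Q: (d, n) orthonormal
    cX = h_X @ Q                            # coordinates of the IDs, (n, n)
    cA = h_A @ Q                            # coordinates of the attributes, (n, n)
    keep = (cX * cA > 0).astype(h_A.dtype)  # indicator: same direction on an axis
    return (keep * cA) @ Q.T                # homogeneous part, back in (n, d)
```

When the attribute states equal the ID states, every coordinate agrees in sign and the whole attribute representation survives; components pointing against the ID direction are zeroed out before mapping back.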

Model Prediction and Learning
After $L$ layers of the Transformer structure, we get the final hidden state $\mathbf{h}^{L}_X$ of the item IDs and calculate the prediction scores as
$$\hat{\mathbf{y}} = \mathrm{Softmax}\big(\mathbf{h}^{L}_X\mathbf{V}^{\top}\big),$$
where $\mathbf{V} \in \mathbb{R}^{|\mathcal{V}| \times d}$ is the candidate item matrix. For the sequential recommendation task, we adopt the cross-entropy loss
$$\mathcal{L}_{rec} = -\sum_{i} y_i \log \hat{y}_i,$$
where $y_i$ and $\hat{y}_i$ denote the ground truth and the predicted probability of the $i$-th sample. Finally, combining the contrastive loss in RSA, we define the overall loss with the balance coefficient $\lambda$:
$$\mathcal{L} = \mathcal{L}_{rec} + \lambda\,\mathcal{L}_{cl}.$$
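A minimal sketch of the scoring and joint objective for one sample. The helper name and `lam` (standing in for the balance coefficient, whose symbol is lost in the extraction) are ours, and `l_cl` is assumed to be a precomputed RSA loss value.

```python
import numpy as np

def prediction_and_loss(h_last, V, target, l_cl, lam=0.1):
    """h_last: (d,) final ID hidden state at the last step; V: (|V|, d)
    candidate item matrix; target: index of the ground-truth next item."""
    logits = V @ h_last                    # scores over all candidate items
    logits = logits - logits.max()         # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    l_rec = float(-np.log(probs[target]))  # cross-entropy on the next item
    return l_rec + lam * l_cl              # L = L_rec + lam * L_cl
```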

OFFLINE EXPERIMENTS
In this section, offline experiments are designed to evaluate the performance and effectiveness of ASIF.

Datasets and Settings
Dataset. We conduct experiments on three publicly available datasets and one industrial dataset:
• Yelp is a well-known business recommendation dataset. The category of the business and the position are regarded as side information.
• Amazon Beauty is collected from the Amazon review datasets. The category of the goods and the position information are the supplementary attributes.
• AliEC is a Taobao display advertising dataset provided by Alibaba. We utilize category and position as side information.
• Industrial dataset is collected from a scenario in the commercial advertising system of Alipay. It is desensitized and encrypted and does not contain any Personally Identifiable Information (PII). The position and the item's entities, such as category and brand, are utilized as side information.
Table 2: Overall performance (HR and NDCG) on the public datasets. The best results are boldfaced, while the second-best results are underlined. We pick the best model with the highest NDCG@20 on the validation set. Impr. (%) is the performance gain of ASIF over the best baseline method.
Following the same data pre-processing steps as in [7,18,24], we remove all items and users that occur fewer than five times in the public datasets. For the industrial dataset, we retain all users and items that have appeared, due to the frequent updating of item IDs. The statistics of all processed datasets are summarized in Tab. 1.

Baseline Methods
We compare our model with the following state-of-the-art sequential recommendation methods.

Evaluation Metrics.
Following previous works [7,18], the leave-one-out strategy is used for evaluation. For each user sequence, we use the last item for testing, the second-to-last item for validation, and the remaining items for training. Models are evaluated in a full-ranking manner as in [5,11,18] rather than with negative sampling, which is often criticized for bias [4,9]. Two widely used metrics are employed: top-K Hit Rate (HR@K) and top-K Normalized Discounted Cumulative Gain (NDCG@K) with K = {10, 20}.
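For completeness, HR@K and NDCG@K under full ranking for a single user can be computed as below. This is our own helper (not from the paper), with ties broken in favour of the target item; averaging over all users gives the reported metrics.

```python
import numpy as np

def hr_ndcg_at_k(scores, target, k):
    """scores: (|V|,) predicted scores over all items; target: index of
    the held-out test item.  Returns (HR@K, NDCG@K) for this user."""
    rank = int((scores > scores[target]).sum())     # 0-based full rank of target
    hit = 1.0 if rank < k else 0.0                  # HR@K: target in the top-K?
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0
    return hit, ndcg
```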

Implementation Details.
We run all the models on the open-source recommendation framework RecBole [23] and evaluate them under the same setting. We set the maximum sequence length to 50 and the embedding size to 256 for all datasets. All the networks have 3 layers and 4 heads, and the Adam optimizer is adopted for 200 epochs with batch size 2048 and learning rate 1e-4. The fusion functions for the side information fusion methods are searched among sum, concat, and gating. For the other hyperparameters, we follow the best settings mentioned in the previous papers.

All four ablated versions of ASIF are significantly better than SASRec with naively fused side information in Tab. 4. RSA and HIE are the most effective components of ASIF, proving that there is indeed valid information in the side information that should be carefully incorporated into the item representation.

Table 4: Ablation results (HR@20 and NDCG@20) on the three public datasets. Each row removes a single component from the model, except the last row.

Influence of the number of orthogonal bases $b$. ASIF's performance with a varying number of bases $b \in \{4, 8, 12, 16, 20, 24\}$ on the three public datasets is reported in Fig. 6. A larger number of bases usually means a finer granularity of decomposition. However, finer granularity does not always mean better performance: as we can see, the optimal number of orthogonal bases for the three public datasets lies between 16 and 24.
Impact of the fusion function $F$. We compare the performance of three different fusion functions: Sum, Concat, and Gate. Fig. 7 illustrates the results, showing that ASIF with all three fusion functions outperforms the state-of-the-art baselines mentioned in this paper. This highlights the robustness and superiority of ASIF.

ONLINE DEPLOYMENT
In an online advertising system, the Click-Through Rate (CTR) prediction task is an important component, responsible for predicting the probability of users clicking on candidate items. Xlight is a traffic platform in the Alipay app that provides advertisement services for small-program merchants and others. To further verify the effectiveness of the proposed model ASIF, we deploy it into the advertising system shown in Fig. 10. In Alipay's membership scenario, most of the ads are real goods, which are sold to users in the form of points plus money. For ASIF to fully exploit its advantage in side information fusion, we select the category and brand of the goods as side information for the items recommended in this scenario. For offline training, ASIF collects the click samples from the past 7 days as the training dataset. For online serving, when a user visits the membership page, the system initiates a request for the user's historical behavior from an online feature service platform, and the behavior sequence is truncated to a length of 50. ASIF then estimates the pCTR for the ads retrieved from the ads pool. In Xlight's Real-Time Bidding and Ranking system, each advertisement is ranked based on its Effective Cost Per Mille (eCPM), which is estimated from the pCTR and the bid; therefore, accurate estimation of CTR is pivotal for the Xlight platform. Due to industrial constraints, it was not feasible to compare all baseline models in the online system, so we selected SASRec as the baseline model for comparison. After conducting a two-week online A/B test, our model improved clicks by 1.09% and delivered a significant 1.86% increase in Cost Per Mille (CPM). Meanwhile, it enhanced the multi-day online AUC by 0.97% with negligible additional computational cost (p99 latency 2 ms). In conclusion, combined with the offline evaluation, ASIF demonstrates strong performance in real-world industrial scenarios.

CONCLUSION
In this paper, we present ASIF, a novel method for side information fusion in sequential recommendation. Our method addresses the challenges of noisy interference and information invasion in the mixed embedding space. Specifically, we first introduce Fused Attention with Untied Positions, which calculates position correlations separately to avoid noisy interference in the mixed attention scores. Secondly, we propose Representation Alignment, consisting of RSA and HIE, to solve the information invasion problem. RSA aligns the embedding spaces of IDs and attributes using a contrastive objective to improve their semantic consistency at the interaction level. HIE employs orthogonal decomposition to extract the homogeneous part of the attributes and then integrates it into the item representation, further enhancing the utilization of side information. Through extensive experiments, we have demonstrated that our proposed method surpasses previous approaches to side information fusion, and the visualization and ablation experiments confirm its rationale. The online A/B test on Alipay's advertising system showed that ASIF obtains a 1.09% improvement in clicks and 1.86% in CPM. In future research, we aim to further improve the denoising techniques and explore automatic methods to enhance the utilization of side information.

Figure 1: Visualization of attention scores in SASRec with fused side information on the Yelp dataset.

Figure 3: Single-layer structure comparison of existing self-attention-based side information fusion approaches: SASRec with fused side information is early fusion, FDSA is late fusion, while NOVA, DIF-SR, and ASIF are hybrid fusion.

Figure 4: An overview of ASIF. The model is stacked with Fused Attention with Untied Positions blocks, which decouple the computation of position and focus on effective interaction between IDs and attributes. Through Representation Space Alignment (RSA) and Homogeneous Information Extraction (HIE), the representations of item IDs and attributes are aligned and the homogeneous parts are accurately captured.

Figure 5: The two steps of Representation Alignment.

Overall Performance. Tab. 2 and Tab. 3 report the overall performance on the three public datasets and the industrial dataset. We can make the following observations from four aspects: (1) In line with intuition, some fusion methods perform better than those using only IDs, revealing that side information can improve the model's performance by capturing better sequence patterns; this emphasizes the importance of work on side information fusion. (2) On the contrary, under the vanilla self-attention framework, SASRec with naively fused side information considers more kinds of side information but suffers a significant decrease compared with SASRec on all datasets, indicating that information invasion does exist with self-attention-based naive early-fusion methods. (3) NOVA and DIF-SR are carefully designed to alleviate the invasion phenomenon and thus achieve better results than SASRec. At the same time, we note that, due to the lack of interaction caused by separating IDs and features into two channels, the effect of FDSA is not significantly better than that of NOVA and DIF-SR. (4) ASIF achieves significantly better results than the SOTA baseline methods on all datasets. These results demonstrate the efficiency and validity of ASIF in eliminating noisy interference and solving the information invasion problem in side information fusion.

Ablation Study. We analyze the effectiveness of each component of ASIF via an ablation study. Tab. 4 shows the performance of ASIF and its ablation versions on the three public datasets.
• w/o Representation Space Alignment (RSA). We disable the contrastive loss to verify the effectiveness of RSA. The significant decline implies that appropriately bringing the two spaces closer can help alleviate the invasion phenomenon and thus improve performance.
• w/o Homogeneous Information Extraction (HIE). Without the HIE component, attribute and position information can only participate in the calculation of attention scores instead of being directly integrated into the hidden state of the item representation. In this case, metrics drop on all datasets.
• w/o Untied Positions (UP). This version removes the independent position channel and treats position as a common attribute. It can be observed that the interactions between the position encoding and the other terms increase the noise and lead to a decrease in performance.
• w/o Fused Attention (FA). We decouple the correlation calculation of IDs and attributes, i.e., each learns its own correlation matrix. The results show a decrease in most metrics, meaning that it is necessary to retain the intersectionality between IDs and attributes.

Figure 6: Influence of the balance parameter and the number of orthogonal bases.

Figure 7: Impact of the fusion function F.

Figure 8: Visualization of attention correlations in ASIF.

Table 1: Statistics of the datasets.

Table 3: Performance on the industrial dataset.

Table 5: Online performance on the membership scene in Alipay.