Adaptive Disentangled Transformer for Sequential Recommendation

Sequential recommendation aims at mining time-aware user interests through modeling sequential behaviors. Transformer, as an effective architecture designed to process sequential input data, has shown its superiority in capturing sequential relations for recommendation. Nevertheless, existing Transformer architectures lack explicit regularization for layer-wise disentanglement, failing to exploit disentangled representations for recommendation and leading to suboptimal performance. In this paper, we study the problem of layer-wise disentanglement for Transformer architectures and propose the Adaptive Disentangled Transformer (ADT) framework, which is able to adaptively determine the optimal degree of disentanglement of attention heads within different layers. Concretely, we encourage disentanglement by imposing an independence constraint via mutual information estimation over attention heads and employing auxiliary objectives to prevent the information from collapsing into useless noise. We further propose a progressive scheduler that adaptively adjusts the weights controlling the degree of disentanglement via an evolutionary process. Extensive experiments on various real-world datasets demonstrate the effectiveness of our proposed ADT framework.


INTRODUCTION
Sequential recommender systems, which aim to accurately recommend the next item to a target user by modeling the sequential patterns of user interests, play an important role in facilitating our online experiences [23,28,48]. The major task is to find an appropriate way to represent sequential user behaviors such as clicks, ratings, favorites, etc. Transformer [50], as one of the most widely adopted architectures, has shown its power in representing sequential input data and is thus capable of capturing dynamic relations for sequential recommendation [48].
On the one hand, users nowadays have diverse and constantly changing interests in terms of various aspects over time, so accurately capturing information about these aspects can significantly boost recommendation performance. On the other hand, capturing information about various aspects is one major advantage of disentangled representation learning, which is known to be very important for making accurate recommendations [54]. However, it has been reported that existing Transformer architectures fail to guarantee the assumption that the widely adopted multi-head attention within different layers captures disentangled information related to different aspects of sequence data [10,35,45], and they lack explicit regularization for layer-wise disentanglement, which may forgo the advantages of disentangled representation in recommendation and lead to suboptimal performance.
To deal with this issue, in this paper we investigate the discovery of the optimal degree of disentanglement for Transformer architectures that best serves sequential recommendation, where the attention heads in different layers of the Transformer are disentangled to capture various aspects of user behaviors in an adaptive manner. Nevertheless, learning to discover such a disentangled Transformer architecture adaptively for a given recommendation dataset is largely unexplored in the literature, posing the following two challenges: (1) Transformer architectures or backbones may have different numbers of layers with different numbers of attention heads designed with various functions. Therefore, it is challenging to determine which part of the encoder and decoder layers in the backbone should be disentangled, and how we can encourage the disentanglement of attention heads to capture various aspects of user interests given a Transformer architecture.
(2) Different datasets may correspond to different recommendation tasks that impose user interests with various aspects, and the capability of disentanglement differs among layers, which requires an adequate Transformer architecture to adaptively adjust the degree of disentanglement for a given recommendation task. Thus, it remains another challenge to adaptively discover the optimal degree of disentanglement for attention heads in different layers to achieve the best recommendation performance on various datasets.
To tackle these challenges, we propose the Adaptive Disentangled Transformer (ADT) framework, which simultaneously i) disentangles attention heads within different layers of a given Transformer architecture, and ii) adaptively determines the optimal degree of disentanglement for attention heads in different layers in order to achieve the best recommendation performance. Particularly, to disentangle the attention heads, our ADT framework imposes an independence constraint via mutual information estimation over attention heads, utilizing an auxiliary classification objective with generated pseudo labels to encourage disentanglement. To further prevent the learned disentangled information from collapsing into noisy signals, the proposed ADT framework employs another auxiliary objective with a decoder to reconstruct the input data. Moreover, we propose a progressive scheduler which adaptively adjusts the auxiliary weights controlling the degree of disentanglement of different layers via an evolutionary process. We also design a supernet to accelerate the search for the optimal weights of the different auxiliary objectives. Extensive experiments including ablation studies demonstrate that our proposed ADT framework can be applied to different Transformer architectures to significantly outperform state-of-the-art baselines, and verify that Transformer architectures need different degrees of disentanglement in different layers to best serve the task of sequential recommendation. The code is available at https://github.com/defineZYP/ADT. Our main contributions are summarized as follows:

• We point out the importance of disentangling attention and investigate adaptive disentanglement of attention heads in Transformer architectures for sequential recommendation, to the best of our knowledge, for the first time.
• We propose the Adaptive Disentangled Transformer (ADT) framework capable of simultaneously disentangling attention heads in a given Transformer, as well as adaptively determining the optimal degree of disentanglement for attention heads in different layers such that the best recommendation performance can be achieved.

RELATED WORK
Sequential Recommendation. The purpose of sequential recommendation is to use historical data from user interactions to predict the next item. Early work on sequential recommendation tasks is generally based on Markov chains. FPMC [44] combines Markov chains and matrix factorization: it generates a transition matrix for each user, allowing the model to capture both temporal information and long-term user preference information. However, FPMC only makes use of first-order Markov chains. Fossil [19] extends this idea to high-order Markov chains to consider more items. Since Markov chain-based methods fail to model union-level sequential patterns and do not allow skip behaviors, Caser [49] introduces convolutional neural networks into sequential recommendation: it treats the embedding matrix of items as an image and learns transitions using convolution operations. As better sequence models [9,25,50] emerged, later works adopted RNNs [11,22,23,30,33,37,39] and Transformers [14,28,48] to model user behavior sequences. GRU4Rec [23] first applies Gated Recurrent Units to sequential recommendation. Unlike Markov chain and RNN-based methods, the multi-head self-attention mechanism is able to capture information on different aspects from all item-item pairs in the sequence; SASRec [28] and Bert4Rec [48] have achieved great success in sequential recommendation.
Attention Mechanism and Transformer. The attention mechanism has shown great potential for many tasks, e.g., visual recognition [12,57], image generation [36,40], machine translation [50,52], hyper-parameter optimization [56], and neural architecture search [16,62]. The attention mechanism draws on the human attentional mindset, with the core goal of selecting the information that is most critical to the current task. The multi-head attention mechanism further allows the model to capture information from different representation sub-spaces at different positions. Several studies [3,10,27,35,45,51,60] have analyzed the attention mechanism. Both [51] and [35] find that there is redundancy among the heads of the multi-head attention mechanism. Bian et al. [3] find a high similarity of attention patterns between different heads in the vanilla Transformer and obtain similar performance after pruning some attention heads. Clark et al. [10] calculate the Jensen-Shannon divergence between the attention distributions of different heads. The results indicate that at shallow layers, the distributions of representations learned by different attention heads differ widely, meaning the mechanism is indeed able to capture information on different aspects; however, as the layers become deeper, these distributions gradually become consistent. These studies suggest that the multi-head attention mechanism may contradict its original expectations.
Disentangled Representation Learning. Disentangled representation learning aims to learn different aspects of the data, capturing the interpretable representations behind different latent factors [2,53]. The variational autoencoder (VAE) [29] is one of the representative works: it captures the information of different latent factors via variational inference and an encoder-decoder architecture. β-VAE [24] balances the representational ability and the disentanglement ability of the model on top of VAE. Due to the diversity of users' purchase interests, multi-interest methods [5,8] and disentangled representation learning have also been applied to recommendation tasks [34,54,55,58,61] with great success. However, to the best of our knowledge, there is still a lack of research that explicitly introduces layer-wise disentangled representation into the Transformer for adaptive sequential recommendation.
Neural Architecture Search. Neural architecture search (NAS) automatically searches for neural network architectures without excessive expert knowledge. Existing NAS methods consist of three main components: a search space, a search strategy, and an evaluation method. Depending on the search strategy, NAS can be divided into three categories: reinforcement learning based methods [1,15,63,64], evolutionary algorithm based methods [41,46], and gradient based methods [7,31]. However, many of these methods require a large amount of time to find excellent architectures. For this reason, ENAS [38] proposes the supernet, which shares parameters between identical options in the search space to accelerate the search. However, parameter sharing couples different candidate architectures, preventing NAS from accurately evaluating their performance and thus affecting the search result. One-shot methods [4,17] are therefore proposed, which decouple the search process into a training stage and a search stage to alleviate these problems.

METHODOLOGY
In this section, we present the proposed Adaptive Disentangled Transformer. We first give the problem formulation and introduce the general approach to using Transformer for sequential recommendation in the preliminaries. Then, we give a brief overview of the main parts of our method and describe the auxiliary objectives in detail in Section 3.2, and we detail our adaptive training strategy in Section 3.3.

Preliminaries
In sequential recommendation, we have a user set $\mathcal{U}$, an item set $\mathcal{V}$, and the interactions between the users and items. For each user $u \in \mathcal{U}$, we can organize the behaviors of that user into a sequence in chronological order as $S^u = (v^u_1, v^u_2, \ldots, v^u_T)$, where $v^u_t \in \mathcal{V}$ is the item that user $u$ clicked at time step $t$ and $T$ is the maximum length of the sequence. The purpose of sequential recommendation is to infer the item that the user will most likely interact with at time step $T+1$ based on the past $T$ historical items in the sequence. Next, we introduce a general approach to the use of Transformer for sequential recommendation.
First, each item $v^u_t \in S^u$ needs to be embedded from its index into the representation space. Transformer has an item embedding matrix $\mathbf{Z} \in \mathbb{R}^{|\mathcal{V}| \times d}$, where each item can be represented as a $d$-dimensional vector $\mathbf{z}_{v^u_t}$. Then, to model the position information of the sequence, Transformer usually has a positional embedding matrix $\mathbf{P} \in \mathbb{R}^{T \times d}$ that maps the item at time step $t$ in the sequence to $\mathbf{p}_t$. To represent both item attributes and temporal information, a function $f(\cdot)$ is used to fuse the item embedding and the position embedding, and the item sequence of user $u$ is then represented as a hidden representation

$$\mathbf{h}^0 = \mathrm{Embed}(S^u) = \left[ f(\mathbf{z}_{v^u_1}, \mathbf{p}_1), \ldots, f(\mathbf{z}_{v^u_T}, \mathbf{p}_T) \right]. \tag{1}$$

The form of $f(\cdot)$ is not consistent across Transformers: the easiest way is to add the two embeddings, or to concatenate them and feed the result into the following modules, and other fusion methods also exist. For a unified description, we denote the whole embedding process as $\mathrm{Embed}(\cdot)$ in Eq. (1).

For sequential recommendation, the Transformer can be seen as a stack of $L$ encoder layers. For the $l$-th encoder layer, the representation $\mathbf{h}^{l-1}$ is first fed into the attention module. The most commonly used multi-head attention mechanism can be represented as

$$\mathrm{Attn}_i(\mathbf{h}^{l-1}) = \mathrm{softmax}\left( \frac{(\mathbf{h}^{l-1} \mathbf{W}^{Q}_i)(\mathbf{h}^{l-1} \mathbf{W}^{K}_i)^{\top}}{\sqrt{\Delta}} \right) \mathbf{h}^{l-1} \mathbf{W}^{V}_i, \tag{2}$$

where $\mathbf{W}^{Q}_i, \mathbf{W}^{K}_i, \mathbf{W}^{V}_i \in \mathbb{R}^{d \times \Delta}$ are the matrices for queries, keys, and values in the attention module of the $i$-th head of the $l$-th encoder layer, $d = \Delta \times H$, and $H$ is the number of heads. The representations captured by different heads, $[\mathrm{Attn}_i(\mathbf{h}^{l-1})]_{i=1}^{H}$, represent different aspects of the purchase intentions of that user, which to some extent can be regarded as a kind of disentangled information. After that, we concatenate the outputs of all heads as

$$\mathrm{Attn}(\mathbf{h}^{l-1}) = \left[ \mathrm{Attn}_i(\mathbf{h}^{l-1}) \right]_{i=1}^{H}. \tag{3}$$

For computational stability and to accelerate convergence, a skip connection and layer normalization (LN) are used as

$$\tilde{\mathbf{h}}^{l} = \mathrm{LN}\left( \mathbf{h}^{l-1} + \mathrm{Attn}(\mathbf{h}^{l-1}) \right). \tag{4}$$

After the attention module, a feed-forward network is introduced to model the nonlinear relations as

$$\mathbf{h}^{l} = \mathrm{LN}\left( \tilde{\mathbf{h}}^{l} + \mathrm{FFN}(\tilde{\mathbf{h}}^{l}) \right). \tag{5}$$

After $L$ encoder layers, we obtain $\mathbf{h}^{L}$, which contains the purchase preference information of user $u$. A classic way to use this information for prediction is matrix factorization, $\mathbf{r} = \mathbf{h}^{L}_{T} \mathbf{Z}^{\top}$, with negative sampling and the binary cross-entropy loss, where $\mathbf{h}^{L}_{T}$ is the last embedding in the sequence $\mathbf{h}^{L}$ and $\mathbf{r}$ is a vector scoring the relevance of user $u$ to each item; a high score means a high probability that the user will interact with that item. However, for different Transformer backbones, the prediction method and the training objective differ, and we denote by $\mathcal{L}_{O}$ the original objective of the selected backbone. Using $\mathcal{L}_{O}$ to train the whole Transformer backbone until convergence results in the final Transformer-based recommender model.
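To make the preliminaries concrete, below is a minimal PyTorch sketch of the encoder computation in Eqs. (1) to (5); the module names, the additive fusion $f(\mathbf{z}, \mathbf{p}) = \mathbf{z} + \mathbf{p}$, and the hyper-parameters are illustrative assumptions rather than the implementation of any particular backbone.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer following Eqs. (2)-(5)."""
    def __init__(self, d: int, num_heads: int):
        super().__init__()
        # nn.MultiheadAttention holds the per-head projections W^Q_i, W^K_i, W^V_i
        # and returns the concatenation over heads (Eqs. (2)-(3)).
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, h):                           # h: (batch, T, d)
        attn_out, _ = self.attn(h, h, h)
        h = self.ln1(h + attn_out)                  # skip connection + LN, Eq. (4)
        return self.ln2(h + self.ffn(h))            # feed-forward + LN, Eq. (5)

class SeqRecEncoder(nn.Module):
    """Item/positional embedding (Eq. (1)) followed by L encoder layers."""
    def __init__(self, num_items: int, T: int, d: int, num_heads: int, L: int):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)   # Z
        self.pos_emb = nn.Embedding(T, d)                               # P
        self.layers = nn.ModuleList(EncoderLayer(d, num_heads) for _ in range(L))

    def forward(self, seq):                         # seq: (batch, T) item indices
        pos = torch.arange(seq.size(1), device=seq.device)
        h = self.item_emb(seq) + self.pos_emb(pos)  # additive fusion f(z, p) = z + p
        for layer in self.layers:
            h = layer(h)
        return h                                    # h^L: (batch, T, d)
```

Given the output `h`, the relevance scores described above would be obtained as `h[:, -1] @ model.item_emb.weight.T`, i.e., $\mathbf{r} = \mathbf{h}^{L}_{T} \mathbf{Z}^{\top}$.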
Note that in traditional Transformer-based recommender models, the representations learned by different heads, $[\mathrm{Attn}_i(\mathbf{h}^{l-1})]_{i=1}^{H}$, are expected to capture different aspects of the user's preferences, sharing a similar idea with disentanglement. However, they may fail to achieve disentanglement without explicit regularization. Additionally, previous works [10,35,45] indicate that the disentanglement characteristic of the Transformer is layer-specific, i.e., only shallow layers need disentangled representations while deep layers need task-specific representations, and the layers that need to be disentangled may vary with tasks and datasets. Deciding in which layers we should add the disentangled regularization and how to discover the optimal degree of disentanglement is another challenging problem. To solve these two problems, we present our proposed Adaptive Disentangled Transformer in the next subsection.

Adaptive Disentangled Transformer
The overall framework of our Adaptive Disentangled Transformer is shown in Figure 1. To encourage the disentanglement of the attention heads in different layers, we propose two auxiliary objectives which respectively control the ability to disentangle the intentions and the capability to capture the user's comprehensive and diverse interests. To discover the optimal layer-wise disentanglement, we design an adaptive training strategy for the Transformer to automatically adjust the weights of these two auxiliary objectives based on the backbone architecture and the dataset. Next, we describe the two auxiliary objectives and the adaptive training strategy.
Objective 1: The Independence Auxiliary Objective.
We apply this objective to the encoder to guarantee the independence between different heads and thus make the representation more disentangled. The structure of the Transformer-based encoder is consistent with the backbone: it usually has at least one embedding module and an attention module, as shown in Section 3.1. In our method, the behavior of the encoder is the same as the backbone Transformer, and the general process is shown in Eqs. (1) to (5).
To distinguish the representations of each head, we reshape $\mathrm{Attn}(\mathbf{h}^{l-1})$ to $\mathbf{e}^{l} = \{\mathbf{e}^{l}_{i,t}\}$, where $\mathbf{e}^{l}_{i,t} \in \mathbb{R}^{\Delta}$ is the hidden representation of head $i$ at time step $t$.
We expect each head to capture different latent factors of the user's purchase intentions, independent of each other. However, since there is no label of intention in the training data, we do not know exactly which kind of information is captured by each head. To solve this problem, we can enhance the relationship between a latent factor $k$ and the learned embedding $\mathbf{e}^{l}_{i,t}$ by maximizing the mutual information between them. Although we do not know the exact meaning of the latent factors, each latent factor is independent of the others. Therefore, if the representation learned by each head $i$ can be made to depend on a latent factor $k$, it is possible to make the representations learned by different heads independent.
According to [26,59], the maximization of mutual information can be converted into the following form. Given that the representation $\mathbf{e}^{l}_{i,t}$ is expected to belong to latent factor $k$, the regularizer $q(k \mid \mathbf{e}^{l}_{i,t})$ estimates the probability that $\mathbf{e}^{l}_{i,t}$ belongs to the $k$-th latent factor as

$$q(k \mid \mathbf{e}^{l}_{i,t}) = \mathrm{softmax}\left( \mathbf{W}^{l} \mathbf{e}^{l}_{i,t} \right)_{k}, \tag{6}$$

where $\mathbf{W}^{l} \in \mathbb{R}^{H \times \Delta}$. For all $\mathbf{e}^{l}_{i,t} \in \mathbf{e}^{l}$, we calculate $q(k \mid \mathbf{e}^{l}_{i,t})$ and obtain a vector $q(\mathbf{e}^{l})$, so we can use a classification task to optimize this regularizer. First, we generate an auxiliary label $y \in \{1, 2, \ldots, H\}$ for every representation learned by different heads according to the position of the head, i.e., for the representation output by the $i$-th head, we assign the auxiliary label $y = i$. We thus obtain an auxiliary label vector $\mathbf{y}^{l} = \{y_{1,1}, y_{1,2}, \ldots, y_{H,T}\}$, where $y_{i,t} = i$ is the auxiliary label of head $i$ at time step $t$. This regularization is an approximation of mutual information: as the loss of the auxiliary objective becomes small, the representations learned by different heads become more independent. The auxiliary objective of the $l$-th layer can be formulated as

$$\mathcal{L}^{l}_{ind} = \mathrm{CrossEntropy}\left( q(\mathbf{e}^{l}), \mathbf{y}^{l} \right). \tag{7}$$
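Below is a minimal sketch of this independence objective, assuming the per-head reshape above and a single linear classifier per layer as the $\mathbf{W}^{l}$ of Eq. (6); tensor layouts and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def independence_loss(attn_out: torch.Tensor, num_heads: int,
                      head_classifier: nn.Linear) -> torch.Tensor:
    """Eq. (7): classify each head's output into its own head index.

    attn_out:        (batch, T, d) concatenated multi-head output, d = H * delta
    head_classifier: nn.Linear(delta, num_heads), the W^l of Eq. (6)
    """
    b, t, d = attn_out.shape
    delta = d // num_heads
    # Reshape to per-head representations e^l_{i,t} of dimension delta.
    e = attn_out.view(b, t, num_heads, delta)        # (b, T, H, delta)
    logits = head_classifier(e)                      # (b, T, H, H), Eq. (6)
    # Pseudo labels y_{i,t} = i: the i-th head should be classified as class i.
    labels = torch.arange(num_heads, device=attn_out.device).expand(b, t, num_heads)
    return F.cross_entropy(logits.reshape(-1, num_heads), labels.reshape(-1))
```

In training, one such loss would be computed per layer and scaled by that layer's auxiliary weight.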
Objective 2: The Reconstruction Auxiliary Objective.

After disentangling the representations of the different heads of the Transformer, we need to ensure that the disentangled representations learned by the different heads are meaningful and contain the user's comprehensive and diverse interests. Inspired by VAE [29], which makes the representation contain rich input information by reconstructing the input data, we introduce a reconstruction objective.
To achieve this, a decoder is used to decode the learned representation. A decoder, as opposed to an encoder, converts a representation back into a specific sequence; thus, each of its layers corresponds to a layer of the encoder. We reconstruct between the input of layer $l$ of the encoder, $\mathbf{h}^{l-1}$, and the output of layer $L-l+1$ of the decoder, $\mathbf{h}^{L-l+1}_{D}$, to guarantee that the representations learned by the $l$-th layer of the encoder still contain rich information.
The structure of the decoder is similar to [50]: it consists of two attention modules and shares the embedding module with the encoder, so the input of the first layer of the decoder can be embedded as $\mathbf{h}^{0}_{D} = \mathrm{Embed}(S^{u}_{D})$, where $S^{u}_{D}$ is the input sequence of the decoder. Since the decoder is originally used for generation tasks, its final output contains the information of the sequence $S^{u}_{D}$ and of the item at the next time step. Considering the need for reconstruction, we want the information output by the decoder at the last layer to be similar to the information obtained by the encoder at the first layer. Therefore, we take the first $T-1$ time steps of the input sequence of the encoder as $S^{u}_{D}$ and pad it to length $T$. The token we use for padding depends on the backbone Transformer. For example, in the SASRec backbone, we use token 0 for padding since there is no special need for it; but in the Bert4Rec backbone, we use the token [MASK] for padding, since Bert4Rec treats sequential recommendation as a Cloze objective problem and uses the token [MASK] to replace the item to be predicted.
To adapt to the encoder of the selected backbone, the decoder attention modules have the same structure as the encoder. The first attention module computes the self-attention of the decoder input as

$$\tilde{\mathbf{h}}^{l}_{D} = \mathrm{LN}\left( \mathbf{h}^{l-1}_{D} + \mathrm{Attn}(\mathbf{h}^{l-1}_{D}) \right), \tag{8}$$

where $\mathbf{h}^{l-1}_{D}$ is the input of the $l$-th layer of the decoder. The second attention module computes the attention between the output of the encoder and the decoder as

$$\mathbf{h}^{l}_{D} = \mathrm{LN}\left( \tilde{\mathbf{h}}^{l}_{D} + \mathrm{FFN}\left( \mathrm{softmax}\left( \frac{(\tilde{\mathbf{h}}^{l}_{D} \mathbf{W}^{Q_2})(\mathbf{h}^{L} \mathbf{W}^{K_2})^{\top}}{\sqrt{d}} \right) \mathbf{h}^{L} \mathbf{W}^{V_2} \right) \right), \tag{9}$$

where $\mathbf{W}^{Q_2}, \mathbf{W}^{K_2}, \mathbf{W}^{V_2} \in \mathbb{R}^{d \times d}$ are the matrices of the second attention module, $\mathbf{h}^{L}$ is the output of the encoder, and FFN is a feed-forward network. The auxiliary objective of the $l$-th layer can then be formulated as

$$\mathcal{L}^{l}_{rec} = \left\| \mathbf{h}^{l-1} - \mathbf{h}^{L-l+1}_{D} \right\|^{2}_{2}. \tag{10}$$

This reconstruction auxiliary objective makes the learned disentangled representation contain rich input information, i.e., the user's sequential behaviors, which reflect his or her comprehensive and diverse interests. Additionally, this regularization can be regarded as an auxiliary task for recommendation, which can improve representation generalization [6,32], suitable for the recommendation task where data is often sparse.
Training Objective. With the two proposed auxiliary objectives, our final training objective is

$$\mathcal{L} = \mathcal{L}_{O} + \sum_{l=1}^{L} \left( \lambda^{l}_{ind} \mathcal{L}^{l}_{ind} + \lambda^{l}_{rec} \mathcal{L}^{l}_{rec} \right), \tag{11}$$

where $\lambda^{l}_{ind}$ and $\lambda^{l}_{rec}$ control how disentangled the representation is and how much input information we want to keep in each layer. Since there are in total $2L$ weight hyper-parameters that are hard to decide, we propose the adaptive training strategy, which automatically obtains the optimal values of these parameters while training the whole Transformer model.
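As a concrete illustration, the following is a minimal sketch of combining the per-layer reconstruction term and the overall objective of Eq. (11) in PyTorch; the MSE form of Eq. (10) and all function names are assumptions for illustration.

```python
import torch.nn.functional as F

def reconstruction_loss(h_enc_in, h_dec_out):
    """Per-layer reconstruction term (Eq. (10), assumed MSE form): match the
    encoder layer-l input h^{l-1} against the decoder layer-(L-l+1) output."""
    return F.mse_loss(h_dec_out, h_enc_in)

def total_loss(loss_backbone, ind_losses, rec_losses, lam_ind, lam_rec):
    """Eq. (11): backbone objective plus weighted per-layer auxiliary terms.

    ind_losses, rec_losses: per-layer scalar losses (length L)
    lam_ind, lam_rec:       per-layer auxiliary weights lambda^l (length L)
    """
    loss = loss_backbone
    for l_ind, l_rec, w_ind, w_rec in zip(ind_losses, rec_losses, lam_ind, lam_rec):
        loss = loss + w_ind * l_ind + w_rec * l_rec
    return loss
```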

The Adaptive Training Strategy
We introduce the adaptive training strategy, inspired by neural architecture search (NAS), which trains the model and simultaneously obtains the best weights of each layer. To further accelerate convergence, we propose a supernet that shares the model parameters and only updates part of the specific weights after each training epoch.
For each auxiliary objective, we set the search space of its weight to a continuous interval and divide it into $N$ sub-intervals regarding the order of magnitude, representing the importance of the auxiliary objective at a certain level, as $[\mu_0, \mu_1), [\mu_1, \mu_2), \ldots, [\mu_{N-1}, \mu_N]$. For each sub-interval, we share the parameters in the supernet. Since we have two auxiliary objectives, each layer in the supernet contains $N \times N$ identical modules that cover the entire search space like a grid.
To make it easier to search for auxiliary weights with different ranges of values, we use a vector $\mathbf{c} = [c^{1}_{ind}, c^{1}_{rec}, \ldots, c^{L}_{ind}, c^{L}_{rec}]$ to represent the sampled sub-network candidate, where $c^{l}_{ind}, c^{l}_{rec} \in [0, 1]$ denote the normalized auxiliary weights of the $l$-th layer and $L$ is the number of layers. Let $\tilde{\mu}_{j} = j/N$ denote the normalized endpoints of the sub-intervals. For $c^{l}_{ind} \in [\tilde{\mu}_{j}, \tilde{\mu}_{j+1})$, where $j \in \{0, 1, 2, \ldots, N-1\}$, we use linear interpolation to find its true auxiliary weight as

$$\lambda^{l}_{ind} = \mu_{j} + \frac{c^{l}_{ind} - \tilde{\mu}_{j}}{\tilde{\mu}_{j+1} - \tilde{\mu}_{j}} \left( \mu_{j+1} - \mu_{j} \right), \tag{12}$$

and we do the same for $c^{l}_{rec} \in [\tilde{\mu}_{j}, \tilde{\mu}_{j+1})$ to calculate the real auxiliary weight as

$$\lambda^{l}_{rec} = \mu_{j} + \frac{c^{l}_{rec} - \tilde{\mu}_{j}}{\tilde{\mu}_{j+1} - \tilde{\mu}_{j}} \left( \mu_{j+1} - \mu_{j} \right). \tag{13}$$

In the forward phase of the supernet, we use bi-linear interpolation to decide which modules of the grid to use for the computation. We let $\mathbf{h}^{l}_{m,n}$ represent the hidden representation output by the module shared by the reconstruction auxiliary objective whose weight lies in the $m$-th interval and the independence auxiliary objective whose weight lies in the $n$-th interval in the $l$-th layer. The hidden representation of the $l$-th layer of the supernet, $\mathbf{h}^{l}$, with $\lambda^{l}_{rec} \in [\mu_{m}, \mu_{m+1})$ and $\lambda^{l}_{ind} \in [\mu_{n}, \mu_{n+1})$, can be formulated as

$$\mathbf{h}^{l} = \sum_{a \in \{m, m+1\}} \sum_{b \in \{n, n+1\}} w_{a} w_{b}\, \mathbf{h}^{l}_{a,b}, \qquad w_{m+1} = \frac{\lambda^{l}_{rec} - \mu_{m}}{\mu_{m+1} - \mu_{m}},\; w_{m} = 1 - w_{m+1}, \tag{14}$$

with $w_{n}$ and $w_{n+1}$ defined analogously from $\lambda^{l}_{ind}$. This is like a single-path method [17]: although we need to instantiate $N \times N$ modules, only a constant number of modules is used in each computation, so there is no out-of-memory problem. We use the differential evolution algorithm [47] to find the best candidate $\mathbf{c}$.
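As an illustration of Eqs. (12) and (13), the sketch below decodes a candidate entry into its real auxiliary weight; the magnitude grid `mu` is a hypothetical example, not the grid used in our experiments.

```python
import numpy as np

def decode_weight(c: float, mu: np.ndarray) -> float:
    """Eqs. (12)-(13): map a candidate entry c in [0, 1] to a real auxiliary
    weight by linear interpolation inside its magnitude sub-interval.

    mu: interval endpoints [mu_0, ..., mu_N], e.g. spaced by order of magnitude.
    """
    n = len(mu) - 1                    # number of sub-intervals N
    j = min(int(c * n), n - 1)         # index of the sub-interval containing c
    frac = c * n - j                   # relative position within the sub-interval
    return mu[j] + frac * (mu[j + 1] - mu[j])

# Hypothetical magnitude grid covering weights from 0 to 1:
mu = np.array([0.0, 1e-3, 1e-2, 1e-1, 1.0])
lam_ind = decode_weight(0.37, mu)      # weight for L_ind of one layer
lam_rec = decode_weight(0.81, mu)      # weight for L_rec of one layer
```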
According to [17], we divide the search process into two phases: the warmup phase and the search phase. The pseudo-code for this process is shown in Section A.2 of the appendix. Similar to usual evolutionary algorithms, the differential evolution algorithm optimizes the candidates by crossover and mutation. The additional required arguments are listed, with descriptions, in the inputs of the pseudo-code; these arguments affect the size of the search space and the search speed.
During the warmup phase, to let all shared parameters learn information about the data, we randomly generate a candidate at the beginning of each epoch. At the end of the search process, we obtain the best candidate $\mathbf{c}$, from which we can derive the optimal auxiliary weights according to Eqs. (12) and (13). After that, we retrain or fine-tune the model and evaluate its performance.
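For reference, below is a schematic NumPy sketch of one differential-evolution iteration over candidate vectors; the greedy selection rule and the fitness interface are assumptions, while the differential weight and the crossover/mutation probability follow the settings reported in Section A.4.

```python
import numpy as np

def evolve_candidates(pop: np.ndarray, fitness, F: float = 0.5, cr: float = 0.1,
                      rng=None) -> np.ndarray:
    """One differential-evolution step over candidates c in [0, 1]^(2L).

    pop:     (P, 2L) current population of candidate vectors
    fitness: callable evaluating a decoded candidate on held-out data
             (higher is better); assumed to query the trained supernet
    F, cr:   differential weight and crossover/mutation probability (A.4)
    """
    rng = rng or np.random.default_rng()
    new_pop = pop.copy()
    for i in range(len(pop)):
        # Mutation: combine three randomly chosen candidates.
        a, b, c = pop[rng.choice(len(pop), 3, replace=False)]
        mutant = np.clip(a + F * (b - c), 0.0, 1.0)
        # Crossover: mix mutant entries into the current candidate.
        mask = rng.random(pop.shape[1]) < cr
        trial = np.where(mask, mutant, pop[i])
        # Greedy selection: keep the trial if it evaluates better.
        if fitness(trial) > fitness(pop[i]):
            new_pop[i] = trial
    return new_pop
```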
So far, we have completed the disentangled Transformer for recommendation, which makes the representations output by the multi-head attention disentangled via two auxiliary objectives. Additionally, to fit the layer-specific disentanglement characteristics of the Transformer, we propose the adaptive training strategy to automatically find the optimal auxiliary weights for each layer. Our proposed method can be applied to various Transformer-based recommendation models and generally improves the performance of the backbone model.

EXPERIMENTS
In this section, we empirically evaluate the performance of the proposed Adaptive Disentangled Transformer and analyze how it works. Next, we describe the backbone models, the evaluation metrics, and the datasets we adopt.
Backbones. To better illustrate the adaptability of our method to different situations, we adopt three different Transformer-based sequential recommendation backbones and evaluate how much improvement these models can achieve through the proposed adaptive disentanglement.
• SASRec [28]: This backbone makes full use of the self-attention mechanism and is one of the pioneers in the use of Transformers for sequential recommendation tasks.
• Bert4Rec [48]: This backbone adapts the Cloze objective to sequential recommendation and predicts the masked item by jointly using the left and right context.
• STOSA [14]: This backbone treats the embedding of each item as a stochastic Gaussian distribution to better capture the similarity between different items, and proposes a module using the Wasserstein distance to measure the relationship between items.
The Evaluation Protocols. To evaluate the effectiveness of the method, we choose Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) as the evaluation metrics, as in the original papers of the aforementioned backbones. For a fair comparison, we keep the other evaluation details the same as in the original papers. However, the original papers of SASRec and Bert4Rec use popularity sampling to draw 100 negative items when computing these metrics, which is easily biased, as later work indicates [42]. Therefore, on these two backbones, we add the Area Under the Curve (AUC) metric, which is consistent across different sampling methods. For the STOSA backbone, we add the Mean Reciprocal Rank (MRR) as an evaluation metric to more comprehensively show the superiority of our proposed method.
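For clarity, the sketch below shows how HR@k and NDCG@k can be computed from the rank of the single ground-truth item among its candidate list; this follows the standard definition of these metrics rather than the backbones' original evaluation code.

```python
import numpy as np

def hr_ndcg_at_k(ranks: np.ndarray, k: int):
    """HR@k and NDCG@k given the 0-based rank of each ground-truth item
    among its candidate list (one test item per user)."""
    hits = ranks < k
    hr = hits.mean()
    # With a single relevant item, IDCG = 1, so NDCG = 1 / log2(rank + 2) on a hit.
    ndcg = np.where(hits, 1.0 / np.log2(ranks + 2), 0.0).mean()
    return hr, ndcg

# Example: ground-truth items ranked 0th, 3rd, and 12th among candidates.
print(hr_ndcg_at_k(np.array([0, 3, 12]), k=5))   # HR@5 = 2/3
```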
Datasets.We compare the methods on publicly available datasets from real-world applications.
• Amazon 1: The Amazon dataset records user reviews of Amazon.com products and is a classic dataset for recommender systems. In our experiments, we adopt the "Beauty", "Home and Kitchen", "Tools and Home Improvement", "Toys and Games", and "Office" sub-datasets.
• Steam 2: This dataset contains a large number of user reviews crawled from the Steam gaming platform.
• MovieLens 3: This dataset is one of the most famous datasets for recommender systems and contains multiple user reviews for multiple movies. In our experiments, we choose ML-1M and ML-20M, which are widely used in recommendation research.

In the experiments, to be as consistent as possible with the treatment in the original papers of the backbone Transformers, we perform popularity sampling on the experimental data for the SASRec and Bert4Rec backbones, while experiments on the STOSA backbone do not use any sampling method.
For the popularity sampling method, we follow previous work and match each ground-truth item in the test set with 100 negative items with which the user has not interacted. We use item popularity as the sampling weight and randomly sample these 100 negative items. For the non-sampling method, we take all items that the current user has not interacted with as negative samples.
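A minimal sketch of this popularity-weighted sampling protocol is given below; the function name and data layout are illustrative assumptions.

```python
import numpy as np

def sample_popularity_negatives(interacted: set, item_counts: np.ndarray,
                                n: int = 100, rng=None) -> np.ndarray:
    """Sample n negative items weighted by popularity, excluding items the
    user has interacted with. Assumes at least n eligible candidates exist.

    item_counts: per-item interaction counts over the dataset (the popularity)
    """
    rng = rng or np.random.default_rng()
    candidates = np.setdiff1d(np.arange(len(item_counts)),
                              np.fromiter(interacted, dtype=int))
    probs = item_counts[candidates].astype(float)
    probs /= probs.sum()
    return rng.choice(candidates, size=n, replace=False, p=probs)
```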

Main Results
We report the performance of the different methods in Table 1 and Table 2, and present the basic information of the baselines in Section A.1. Note that besides the aforementioned three backbones, we also compare with other baselines that the original papers compare with; their details are described in the appendix. From the results, we make the following observations.
Sequential models such as GRU4Rec and Caser are superior to non-sequential models such as BPR-MF and LightGCN. This indicates that non-sequential models only take into account user behavior information and ignore temporal information, thus not making full use of the data. Among the non-sequential models, LightGCN achieves the most promising results, which suggests that introducing graph data helps capture the interaction behavior of users. Transformer-based methods such as SASRec and Bert4Rec not only take advantage of temporal information but also capture the different intents behind user behaviors, thus providing better performance.
For all the backbones, our method brings improvement and outperforms all the other baselines. The relative improvement of our method on the NDCG@5 and HR@5 metrics ranges from 1.48% to 46.24%, illustrating its superiority. We believe these improvements come from the following aspects: (1) compared with non-Transformer models, thanks to the multi-head attention mechanism, the Transformer is indeed able to extract the different intentions behind users purchasing different items and thus improve performance; (2) compared with Transformer-based models, our method automatically selects the weights of the auxiliary objectives using an adaptive approach that takes full account of the different models, datasets, and optimization objectives.
For relatively small datasets such as Beauty and ML-1M in Table 1, Bert4Rec outperforms SASRec on the NDCG@5 and HR@5 metrics, but both have similar AUC. On the other hand, for larger datasets, Bert4Rec is not as good as SASRec. This suggests that the recommendation capability of SASRec is no weaker, and sometimes stronger, than that of Bert4Rec, as also shown by the results in Table 2. This is most likely due to the Cloze objective in Bert4Rec, which causes an inconsistency between model training and testing and thus hurts performance. Nevertheless, our method consistently brings performance improvements for both backbones, showing that it can be applied to different architectures, while the proposed auxiliary objectives improve the generalization ability of the model and thus alleviate this inconsistency.
We note that the improvement brought by our method differs across datasets. Taking the experimental results of the STOSA backbone as an example, for the MRR metric the relative improvement of our method is 40% on the Home dataset but 3.78% on the Office dataset. We believe this is because the Home dataset is sparser, so it is more difficult for the recommender system to capture the commonality of behaviors among users; our method enables the learned representations to extract this information through auxiliary objectives that allow a deeper understanding of the sequence data. Similar phenomena can be found in our experiments on the SASRec and Bert4Rec backbones. In particular, for Bert4Rec, there is almost no improvement on the ML-1M dataset. This indicates that on a simple dataset like ML-1M, improving the multi-head self-attention mechanism can no longer dig deeper into the information of user behaviors; further improvement may require introducing additional user behavior signals by other means, such as explicitly introducing graph data.

Ablation Studies
We further conduct ablation studies to demonstrate the effectiveness of different components as follows.
• We illustrate that the improvement brought by our proposed ADT framework is not due to the increase in the number of parameters.
• We show the necessity of adaptively finding the optimal degree of disentanglement by exploring the effect of the proposed auxiliary objectives.
Number of model parameters. Since our method adds a decoder, it increases the parameters of the backbone model. To show that the improvement brought by our method is not due to the increased parameter count, we double the layers of the backbone model to eliminate this effect; the results can be found in Table 3. From the results, we observe that for most datasets, the performance of the model decreases after doubling the number of layers of the backbone model. For the STOSA backbone, the model fails to achieve good performance after doubling the number of layers on the Toys dataset. One plausible reason is that Toys is a relatively simple dataset, and the extra parameters lead to overfitting. For the Home dataset, however, the performance improves because this dataset is more complex. Our method can handle both situations and obtain better results by adaptively considering the relationship between different models and different datasets.
Auxiliary Objectives. We conduct experiments on the Beauty, ML-1M, and Office datasets to illustrate the effect of our auxiliary objectives. We fix the structure of the model that obtained the optimal results in the previous experiments, i.e., the number of layers, the number of heads, the hidden size, etc. For each layer of the model, we give only one of the auxiliary objectives a weight and set the weights of all auxiliary objectives in the other layers to 0. The results are shown in Figure 2.
We find that for different datasets, model architectures, and layers, the performance of the model changes with these two auxiliary weights, and this change is beneficial in most cases when the values of these weights lie in a suitable range. Besides, the value at which the best performance appears varies across models and datasets. This means that the two auxiliary objectives we propose are indeed effective, and that adaptively finding the optimal weights for different layers is essential for the backbone Transformer to extract more positive information. We also note that for the independence auxiliary objective, the best model performance always occurs at a shallow layer, and overall it is better to apply this objective to shallow layers than to deep layers. This is consistent with previous works [10,35,45] finding that shallow layers need more disentangled representations. For the reconstruction auxiliary objective, the best model performance always occurs at a deep layer except in Figure 2(j), where reconstruction can be regarded as an auxiliary task improving the generalization ability of the whole model instead of only the shallow part.
To conclude, the experiments verify that our ADT framework is able to encourage the disentanglement of attention heads to capture various aspects of user interests, and adaptively discover the optimal degree of disentanglement for attention heads in different layers to achieve the best recommendation performance.

CONCLUSION
In this paper, we propose a novel Adaptive Disentangled Transformer (ADT) framework for sequential recommendation, which is able to disentangle representations learned by different heads within different layers and, simultaneously, adaptively determine the optimal degree of disentanglement for recommendation. Extensive experiments show that our proposed framework can significantly improve the performance of various Transformer-based recommendation models. We believe the adaptive disentangled Transformer deserves further investigation in more general settings; for example, it will be interesting to design an adaptive Transformer capable of handling more tasks, including machine translation, image classification, and object detection.

A.4 Implementation Details
For the SASRec and Bert4Rec backbones, we train the models on a machine with 8 NVIDIA GeForce GTX TITAN X GPUs, and for the STOSA backbone, we train the model on a machine with 2 NVIDIA GeForce RTX 3090 GPUs.
In the search process, for all backbones, we use the Adam optimizer with a learning rate of 1e-3, $\beta_1 = 0.9$, $\beta_2 = 0.98$, and an $\ell_2$ weight decay of 1e-4, and the gradient is clipped when its norm exceeds 5. We set the population size to 50, the number of candidates obtained from both crossover and mutation in each iteration to 20, the differential weight $F$ to 0.5, and the probability of both crossover and mutation to 0.1. During the retraining process, we choose different hyper-parameters depending on the backbone.

A.5 Statistics of Datasets
The statistics of the datasets used with SASRec and Bert4Rec are presented in Table 4, and the statistics of those used with STOSA are presented in Table 5.

Figure 1: Existing Backbone Transformer vs. the Proposed Adaptive Disentangled Transformer Framework. (a) An existing backbone Transformer for sequential recommendation. (b) Our proposed Adaptive Disentangled Transformer with the same encoder as the backbone Transformer, which consists of four key parts. (1) Transformer-based encoder: encodes the sequence of items, where the multi-head attention mechanism captures information about different aspects of the items in the behavior sequence; in this module, we add the first auxiliary independence objective $\mathcal{L}^{l}_{ind}$ of each layer $l$ to enhance the independence of the representations learned by different heads. (2) Transformer-based decoder: decodes the learned representation; in this module, we add the reconstruction auxiliary objective $\mathcal{L}^{l}_{rec}$ of each layer $l$ to make the disentangled representation contain rich information about the input sequence, so that we can capture the user's comprehensive and diverse interests. (3) Downstream recommendation task processing module: conducts the final prediction. (4) Weight scheduler: the core component of the adaptive training strategy, which adaptively adjusts the weights of the auxiliary objectives according to different datasets and backbones.

Figure 2: Effect of the auxiliary objectives on different layers of the model.

Figure 3: t-SNE of the representations on the ML-1M dataset.

Table 1: Overall performance comparison with the SASRec and Bert4Rec backbones. The best and second-best results are in bold and underlined, respectively. The "-ADT" suffix means that our method is used.

Table 2: Overall performance comparison with the STOSA backbone. The best and second-best results are in bold and underlined, respectively. The "-ADT" suffix means that our method is used, and "OOM" denotes an out-of-memory error.

Table 3: Performance comparison with doubled layers. The "-Double" suffix means that we double the number of layers of that backbone model.

Table 4: Dataset statistics for the SASRec and Bert4Rec backbones.

Table 5: Dataset statistics for the STOSA backbone.