Exploring Adapter-based Transfer Learning for Recommender Systems: Empirical Studies and Practical Insights

Adapters, plug-in neural network modules with a small number of tunable parameters, have emerged as a parameter-efficient transfer learning technique for adapting pre-trained models to downstream tasks, especially in the natural language processing (NLP) and computer vision (CV) fields. Meanwhile, learning recommendation models directly from raw item modality features (e.g., texts in NLP and images in CV) can enable effective and transferable recommender systems (called TransRec). In view of this, a natural question arises: can adapter-based learning techniques achieve parameter-efficient TransRec with good performance? To this end, we perform empirical studies to address several key sub-questions. First, we ask whether adapter-based TransRec performs comparably to TransRec based on standard full-parameter fine-tuning, and whether this holds for recommendation with different item modalities, e.g., textual RS and visual RS. Second, if so, we benchmark existing adapters, which have been shown to be effective in NLP and CV tasks, on item recommendation tasks. Third, we carefully study several key factors for adapter-based TransRec in terms of where and how to insert these adapters. Finally, we look at the effects of adapter-based TransRec by either scaling up its source training data or scaling down its target training data. Our paper provides key insights and practical guidance on unified & transferable recommendation, a less studied recommendation scenario. We release our code and other materials at: https://github.com/westlake-repl/Adapter4Rec/.


INTRODUCTION
Recently, large foundation models [2] have attracted considerable attention in the entire AI community. BERT [8], GPT-3 [3], CLIP [48] and various Vision Transformers (ViT) [10,35] have demonstrated impressive transfer learning capabilities on a range of benchmark tasks, and are now reshaping the paradigm of the natural language processing (NLP) and computer vision (CV) communities. Inspired by this enormous success, research on pre-trained & transferable recommender system (TransRec) models is becoming increasingly popular as well [6,51,58,71,72,73,75]. TransRec offers a natural solution to the challenges of sparsity and insufficient data in recommender systems (RS) through the application of transfer learning. Specifically, TransRec leverages the knowledge acquired from larger data sources, including user/item representations and their corresponding matching relationships, and transfers this knowledge to the current RS, which might have limited data availability.
In fact, prior to the era of large-scale foundation models, significant research efforts were dedicated to studying TransRec, particularly for cross-domain or cold-start recommendations [29,42,43]. For example, PeterRec [72], ShopperBERT [51], Conure [73], and STAR [50] applied modern deep neural networks to transfer user- or item-level preference across different recommendation platforms. However, these works are mainly ID-based collaborative filtering (IDRec), which relies heavily on overlapped userID or itemID data when transferring knowledge. This strict overlapping assumption hardly holds in practice [74,75]; e.g., TikTok is unlikely to share its userIDs or itemIDs with YouTube, and vice versa.

Figure 1: Large industrial RS platforms often have a main recommendation channel and various vertical channels, e.g., sports, education, fashion, etc. With standard fine-tuning, one has to maintain a separate copy of the entire model for each channel; by contrast, only a small set of parameters needs to be maintained per channel with adapter tuning.
To realize more general transfer learning, the common practice is to represent items with their raw modality features (e.g., text or images) and users with the interacted item sequence rather than userIDs and itemIDs [6,9,32,52,57,74,75]. By replacing ID embeddings with powerful item modality encoders (ME), such as BERT and ViT, TransRec has shown state-of-the-art results on many downstream tasks.
According to the above works, the modern TransRec framework typically consists of two modules, i.e., a user encoder and one or multiple item ME. TransRec models are usually first pre-trained on extensive upstream recommendation data and subsequently fine-tuned to cater to various downstream recommendation tasks. Here, we argue that the commonly adopted full-parameter fine-tuning in TransRec has several key issues: (1) Standard fine-tuning often involves updating the entire pre-trained model. Thereby, in scenarios where the RS provides services for multiple vertical channels as described in Figure 1, TransRec has to maintain a copy of the fine-tuned model for every channel. This largely hinders parameter-sharing across domains and brings additional costs in model updates, maintenance, and storage. (2) A large foundation model can quickly overfit when fully fine-tuned on a small-scale downstream dataset. Fine-tuning the last (few) layers provides an alternative solution. However, it requires many manual attempts to determine the number of layers to be tuned, as it highly depends on the pre-trained model.

These issues of standard fine-tuning motivate us to explore parameter-efficient transfer learning techniques for TransRec. Recent work [4,5,12,22,26,44,45,55,59] in NLP and CV suggests that by adding several plug-in networks, i.e., adapter blocks, to Transformer-style backbones, one can achieve results comparable to full-parameter fine-tuning by optimizing only these adapters. Figure 2 illustrates the difference between fine-tuning all parameters (FTA) and adapter tuning (AdaT). Since the number of parameters in these adapters is extremely small compared with the backbone model, they can achieve parameter-efficient transfer learning. For example, by applying the classic Houlsby adapter [22], the number of trainable parameters can be reduced to less than 3% of the entire backbone network. Another advantage of the adapter-based approach is that it enables modular design and easily decouples task-specific parameters from the large backbone network. This mitigates the model maintenance and inconsistent update issues mentioned above. In addition, it can introduce more robustness and improved stability during transfer learning, as indicated in [16].
Nevertheless, in the recommender systems field, little work has investigated adapter techniques. A closely related work is PeterRec [72]. However, PeterRec adopts IDRec as the backbone, where the largest share of parameters is in the ID embedding layer rather than in the middle layers where adapters are usually inserted. In fact, so far, it remains completely unknown whether adapter-based transfer learning is well-suited to TransRec models that learn item representations from raw modality features. To this end, we ask the following sub-questions:

Q(i) Does adapter-based TransRec perform comparably to typical fine-tuning based TransRec? Does this hold for items with different modalities? To answer this, we conduct a rigorous comparative study of adapter-based and fine-tuning based TransRec on two item modalities (i.e., texts and images) with two popular recommendation architectures (i.e., SASRec [25] and CPC [41,57]) and four powerful ME (i.e., BERT, RoBERTa [33], ViT and MAE [17]).

Q(ii) If Q(i) is true or partially true, how do the cleverly designed adapters developed in other communities perform on TransRec problems? To answer this, we benchmark four adapters widely adopted in the NLP and CV literature. We also add the results of LoRA, prompt tuning, and layer-normalization tuning for a comprehensive comparison.

Q(iii) Are there any factors that affect the performance of these adapter-based TransRec models? We report performance comparisons with different strategies regarding how and where to insert the adapters and whether to tune the corresponding normalization layers.
Finally, we look at the data scaling effect of TransRec in the source and target domains to examine whether adapter tuning is beneficial when pre-training TransRec with larger datasets.

PRELIMINARY

Overview of TransRec
Let a recommendation dataset be $\mathcal{D} = \{\mathcal{U}, \mathcal{V}, \mathcal{I}\}$, where $\mathcal{U}$, $\mathcal{V}$, and $\mathcal{I}$ denote the set of users, the set of items, and the set of interaction sequences, respectively. As in the typical recommendation task, we aim to predict the next item that a user $u \in \mathcal{U}$ will interact with by exploiting his/her past behaviors $I_u$. In TransRec, each item $i \in \mathcal{V}$ is associated with its raw modality features $m_i$. By feeding $m_i$ into an item ME $E_{item}$ (e.g., BERT for text or ViT for images), we obtain the vector representation of item $i$:

$$e_i = E_{item}(m_i).$$

A basic dimension transformation layer (DTL) is added to ensure the consistency of the output dimensions of the item ME $E_{item}$ and the input dimensions of the user encoder $E_{user}$:

$$\hat{e}_i = \mathrm{DTL}(e_i).$$

Then, the representation of $u$ is obtained through the user encoder $E_{user}$ (e.g., a Transformer backbone), which takes the interaction sequence $I_u = [i_1, \dots, i_n]$ as input:

$$z_u = E_{user}(\hat{e}_{i_1}, \dots, \hat{e}_{i_n}).$$

Finally, the next item to be interacted with by user $u$ can be retrieved from $\mathcal{V}$ by the dot-product similarity between $z_u$ and the item representations $\hat{e}_i$, $i \in \mathcal{V}$.

As mentioned, we aim to study the transfer learning problem by transferring the knowledge learned from the source domain $S$ to the target domain $T$. To be specific, a TransRec model is first trained on the source data $\mathcal{D}_S$ and then adapted to the target domain $T$, usually by fine-tuning the model using target data $\mathcal{D}_T$. Note that $\mathcal{D}_S$ and $\mathcal{D}_T$ do not necessarily contain overlapped items & users. In this paper, we focus on parameter-efficient transfer learning from $S$ to $T$ by injecting task-specific adapters into $E_{item}$ and $E_{user}$.
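The pipeline described above (item ME, dimension transformation, user encoder, dot-product retrieval) can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions, not the implementation used in our experiments: the item ME is replaced by a fixed random linear map, the user encoder by mean pooling, and the names `item_me`, `dtl`, `user_encoder`, and `recommend` as well as all dimensions are hypothetical.

```python
import random

random.seed(0)

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Assumed toy dimensions: raw features, ME output, user-encoder input.
D_RAW, D_ME, D_USER = 6, 4, 3

W_ME = rand_matrix(D_ME, D_RAW)    # stand-in for BERT/ViT (assumption)
W_DTL = rand_matrix(D_USER, D_ME)  # dimension transformation layer

def item_me(m_i):
    """Item modality encoder: raw features -> item vector."""
    return matvec(W_ME, m_i)

def dtl(e_i):
    """Map the ME output dimension to the user-encoder dimension."""
    return matvec(W_DTL, e_i)

def user_encoder(seq):
    """Stand-in for the Transformer user encoder: mean pooling."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]

def recommend(history, catalog, k=2):
    """history: list of item ids; catalog: dict item id -> raw features."""
    reps = {i: dtl(item_me(m)) for i, m in catalog.items()}
    z_u = user_encoder([reps[i] for i in history])
    # Score every item in the catalog by dot product with the user vector.
    scores = {i: sum(a * b for a, b in zip(z_u, r)) for i, r in reps.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

catalog = {i: [random.random() for _ in range(D_RAW)] for i in range(5)}
top = recommend([0, 1, 2], catalog, k=2)
```

Because neither userIDs nor itemIDs enter the computation, the same functions can be applied to a catalog from a different platform, which is the property that makes the modality-based setting transferable.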

Adapters for TransRec
Adapters overview. Adapters are task-specific neural modules inserted into a pre-trained model. Houlsby et al. [22] proposed to use a bottleneck network with a few parameters to project the original features to a lower dimension and then project them back after applying a non-linearity. With a residual connection, it can be written as:

$$\mathrm{Adapter}(x) = \mathrm{FC}_{up}\left(\sigma\left(\mathrm{FC}_{down}(x)\right)\right) + x,$$

where $\mathrm{FC}_{up}$ and $\mathrm{FC}_{down}$ represent fully-connected layers that project the input dimensions up and down, respectively, and $\sigma$ is the non-linearity.
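A minimal sketch of such a bottleneck adapter is given below in plain Python. It is illustrative only: the class name `BottleneckAdapter`, the choice of ReLU as the non-linearity, and the zero initialization of the up-projection (so that the adapter starts as an identity map) are assumptions made for the example, not details taken from our setup.

```python
import random

def linear(W, b, x):
    """Fully-connected layer: W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

class BottleneckAdapter:
    """Down-project, apply a non-linearity, up-project, add the residual."""

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        self.W_down = [[rng.gauss(0.0, 0.01) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        self.b_down = [0.0] * d_bottleneck
        # Assumption: zero-initialized up-projection, so the adapter
        # initially leaves the backbone's features unchanged.
        self.W_up = [[0.0] * d_bottleneck for _ in range(d_model)]
        self.b_up = [0.0] * d_model

    def __call__(self, x):
        h = relu(linear(self.W_down, self.b_down, x))
        up = linear(self.W_up, self.b_up, h)
        return [u + xi for u, xi in zip(up, x)]  # residual connection

    def num_params(self):
        d, m = len(self.W_up), len(self.W_down)
        return 2 * d * m + d + m  # two weight matrices plus two biases

adapter = BottleneckAdapter(d_model=8, d_bottleneck=2)
x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
y = adapter(x)
```

The parameter count grows linearly in the bottleneck size, which is the knob varied when the adapter capacity is changed.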
Adopting adapters in TransRec. The TransRec architecture contains two sub-modules, namely, the item encoder $E_{item}$ and the user encoder $E_{user}$, both of which are based on Transformer blocks.
The architecture of adapter-based TransRec is illustrated in Figure 3. For an item encoder $E_{item}$ with textual modality (e.g., BERT), we follow the insertion strategy in [22], where two adapter blocks are inserted into each Transformer block, with one after the multi-head self-attention layer and the other after the feedforward network (FFN) layer. For $E_{item}$ with visual modality (e.g., ViT), the network structure remains the same, except for the position of LayerNorm. The user encoder $E_{user}$ also uses the same Transformer architecture (see footnote 2); the only difference is that $E_{user}$ is unidirectional here. Apart from this, it adopts the same insertion method as the item encoder.
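As a rough sanity check on the parameter budget implied by this insertion strategy, the arithmetic below counts adapter parameters for a BERT-base-sized item ME. The numbers are assumptions chosen for illustration (12 Transformer blocks, hidden size 768, roughly 110M backbone parameters, bottleneck size 64), the function names are hypothetical, and the count ignores the adapters in the much smaller user encoder.

```python
def adapter_params(d_model, d_bottleneck):
    """Weights and biases of one bottleneck adapter (down + up projection)."""
    return 2 * d_model * d_bottleneck + d_model + d_bottleneck

def total_adapter_params(n_blocks, adapters_per_block, d_model, d_bottleneck):
    return n_blocks * adapters_per_block * adapter_params(d_model, d_bottleneck)

BACKBONE_PARAMS = 110_000_000  # approximate size of a BERT-base encoder (assumption)

# Two adapters per block: one after self-attention, one after the FFN.
two_per_block = total_adapter_params(12, 2, 768, 64)
# One adapter per block, for comparison.
one_per_block = total_adapter_params(12, 1, 768, 64)

fraction = two_per_block / BACKBONE_PARAMS
print(f"{two_per_block:,} adapter parameters, {fraction:.2%} of the backbone")
```

Under these assumptions the two-adapter configuration stays below 3% of the backbone size, and dropping one adapter per block halves the count.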
Training objectives. In TransRec, the user encoder $E_{user}$ takes the interaction sequence of user $u$ (denoted as $I_u$, with length $n$) as input, and outputs the hidden vectors of the corresponding input elements, i.e.:

$$[z_1, z_2, \dots, z_n] = E_{user}(e_1, e_2, \dots, e_n),$$

where $e_t$ is the representation of the $t$-th item in $I_u$ produced by the item ME (after the dimension transformation). We use the SASRec [25] and CPC [41,57] frameworks to train TransRec. In SASRec, $E_{user}$ is expected to predict the next item for every element in $I_u$, whereas in the CPC framework we only aim to predict the $(n+1)$-th item given the entire sequence. Note that SASRec, in general, outperforms CPC in terms of accuracy, but CPC is essentially a more flexible two-tower DSSM-based method [57] that is able to incorporate various user and item features. Following [25,57], we apply the binary cross-entropy (BCE) loss for both recommendation frameworks:

$$\mathcal{L}_{SASRec} = -\sum_{u \in \mathcal{U}} \sum_{t \in T} \left[ \log \sigma(z_t \cdot e_{t+1}) + \log\left(1 - \sigma(z_t \cdot e_j)\right) \right],$$

$$\mathcal{L}_{CPC} = -\sum_{u \in \mathcal{U}} \left[ \log \sigma(z_n \cdot e_{n+1}) + \log\left(1 - \sigma(z_n \cdot e_j)\right) \right],$$

where $\sigma$ is the sigmoid function, $e_j$ denotes the embedding of a randomly sampled negative item from $\mathcal{V}$ with $j \notin I_u$, and $T$ represents the set $[1, 2, \dots, n]$ (see Figure 3).

Footnote 2: One might wonder whether other networks can be used as the TransRec backbone. In fact, TransRec that learns recommendation models directly from raw item modality features (vs. ID features [73], vs. pre-extracted fixed features [21] from ME) is still at a very early stage. The existing literature [32,49,52,57,75] is all based on the Transformer-style backbone, the most well-known SOTA sequential encoder. In practice, the Transformer backbone can be easily replaced with other sequential networks. Second, can CTR (click-through rate prediction) models be used as TransRec backbones? Unfortunately, the classical one-tower CTR models (e.g., DeepFM [37] & MMOE [37]) cannot be directly used as a pre-training backbone for TransRec, since some domain-specific features are not transferable or easily decoupled when adapting to other datasets. Instead, the two-tower DSSM model [62] can often be used to pre-train TransRec, as shown below.
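The difference between the two training objectives described in this section can be made concrete with a small sketch for a single user. It assumes one sampled negative per prediction and a dot-product scorer; the function names (`bce_pair`, `sasrec_loss`, `cpc_loss`) and the toy vectors are hypothetical and only serve the illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bce_pair(z, e_pos, e_neg):
    """BCE for one positive and one sampled negative item.

    Uses the identity log(1 - sigmoid(s)) = log(sigmoid(-s)).
    """
    return -(log_sigmoid(dot(z, e_pos)) + log_sigmoid(-dot(z, e_neg)))

def sasrec_loss(Z, E, E_neg):
    """Predict the next item at every position.

    Z: hidden vectors z_1..z_n, E: item vectors e_1..e_{n+1},
    E_neg: one negative item vector per position.
    """
    return sum(bce_pair(Z[t], E[t + 1], E_neg[t]) for t in range(len(Z)))

def cpc_loss(Z, E, e_neg):
    """Predict only the (n+1)-th item from the last hidden vector."""
    return bce_pair(Z[-1], E[-1], e_neg)

# Toy sequence of length n = 2 with 2-dimensional vectors.
Z = [[1.0, 0.0], [0.0, 1.0]]
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
E_neg = [[0.0, 0.0], [0.0, 0.0]]

loss_sasrec = sasrec_loss(Z, E, E_neg)
loss_cpc = cpc_loss(Z, E, E_neg[-1])
```

In this sketch the SASRec-style loss contains the CPC-style term for the last position plus one additional term per earlier position, which is one way to see why it extracts more training signal from the same sequence.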

EXPERIMENT SETUP
Datasets. We evaluate adapter-based TransRec with two modalities. For items with textual features, we use the MIND [66] English news recommendation dataset as the source domain and Adressa [15], a Norwegian news recommendation dataset, as the target domain. For the visual modality, the Amazon review dataset for clothing & shoes recommendation [18,39] is used as the target domain, and the H&M personalized fashion recommendation dataset is used as the source domain. We select the latest 20 clicked news items to construct interaction sequences for the text recommendation tasks. Due to GPU memory constraints, the sequence length for fashion recommendation is limited to 10. The details of the datasets after preprocessing are shown in Table 1.
Evaluations. Following previous works [20], we adopt the leave-one-out strategy to split the datasets: the last item in the interaction sequence is used for evaluation, the item before the last is used for validation, and the rest are used for training. HR@10 (hit ratio) and NDCG@10 (normalized discounted cumulative gain) [72] are used as the evaluation metrics. Unless otherwise mentioned, all results are for the testing set. Note that we rank the predicted item against all items in the item set [27].
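The evaluation protocol can be sketched as follows. With a single held-out item per user, the ideal DCG is 1, so NDCG@k reduces to $1/\log_2(\mathrm{rank}+1)$ when the target is ranked within the top k and 0 otherwise. The helper names and the pessimistic tie-breaking rule are assumptions made for this illustration.

```python
import math

def leave_one_out(seq):
    """Split one interaction sequence into (train, validation item, test item)."""
    return seq[:-2], seq[-2], seq[-1]

def rank_of_target(scores, target):
    """1-based rank of the target among ALL items (no sampled candidates).

    Ties are broken pessimistically: items with an equal score rank ahead.
    """
    s = scores[target]
    return 1 + sum(1 for i, v in scores.items() if i != target and v >= s)

def hr_at_k(rank, k=10):
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k=10):
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

def evaluate(all_scores, targets, k=10):
    """Average HR@k and NDCG@k over users."""
    ranks = [rank_of_target(s, t) for s, t in zip(all_scores, targets)]
    n = len(ranks)
    hr = sum(hr_at_k(r, k) for r in ranks) / n
    ndcg = sum(ndcg_at_k(r, k) for r in ranks) / n
    return hr, ndcg

train, valid, test = leave_one_out([11, 12, 13, 14])
scores = {0: 0.9, 1: 0.5, 2: 0.1}   # toy scores for a 3-item catalog
hr, ndcg = evaluate([scores], [1], k=10)
```

Ranking against the full item set, as done here, avoids the bias that sampled-candidate evaluation can introduce.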

EFFECTIVENESS OF ADAPTERS IN TRANSREC (Q(I))
In this section, we evaluate the effectiveness of adapter tuning (AdaT) in TransRec, since it is unknown whether AdaT works for recommendation models. Specifically, we run experiments on eight combinations: {SASRec+BERT, CPC+BERT, SASRec+RoBERTa, CPC+RoBERTa, SASRec+ViT, CPC+ViT, SASRec+MAE, CPC+MAE}, where BERT, RoBERTa, ViT, and MAE are among the most popular and widely accepted state-of-the-art (SOTA) ME in the NLP and CV fields. The results of the most prevalent AdaT, i.e., the Houlsby adapter [22], are presented in Table 2. Note that other adapter results are reported in the next benchmark section. As can be seen, TransRec with the SASRec objective consistently outperforms its CPC version. This is perhaps because SASRec is more powerful at modeling the item transition pattern in the user sequence and can thus alleviate the insufficient training data issue. For the text recommendation task, AdaT yields results comparable to fine-tuning all parameters (FTA) across the evaluated frameworks (SASRec/CPC+BERT/RoBERTa) with a parameter reduction rate of over 97%. However, for image recommendation, the performance gap between FTA and AdaT is relatively large, regardless of the training strategy used (SASRec/CPC+ViT/MAE). This result is somewhat expected, as the Houlsby adapter was primarily designed for NLP data and scenarios, which may make it suboptimal for visual tasks.
To understand the impact of trainable parameters on domain adaptation, we test AdaT with different adapter sizes and compare its performance with top-n layer fine-tuning (FTN) for text recommendation. Since most of the trainable parameters come from the item ME, we focus on the adapters in the item ME and keep the user encoder (UE) with the original settings. The results are shown in Figure 4, where the x-axis denotes the number of trainable parameters, which is changed by gradually increasing/decreasing the hidden dimension of the adapter module (for AdaT) or by tuning more/fewer top Transformer layers (for FTN). Clearly, both FTN and AdaT improve performance with more trainable parameters. Furthermore, AdaT achieves competitive results with nearly two orders of magnitude fewer parameters than FTN.
(Answer for Q(i)) Overall, when learning items with textual content, TransRec with the SOTA AdaT realizes the parameter-efficient transfer from the source to the target domain, yielding comparable performance to FTA.On the other hand, AdaT improves parameter efficiency at the cost of some performance drops for visual item recommendation but still could be an option in special cases where sufficient storage resources are unavailable.Therefore, how to design a specific adapter for various image-based recommendation models is a key research question.

BENCHMARKING PARAMETER-EFFICIENT METHODS (Q(II))
In this section, we go one step further and benchmark four popular adapters in NLP and CV literature for applications in recommender systems.To be specific, we choose the Houlsby adapter [22], the K-Adapter [59], the Pfeiffer Adapter [45], and Compacter [26] for evaluation.For a comprehensive comparison, we also include the results using prompt tuning [24], LayerNorm tuning [5], and LoRA (Low-rank Adaptation) [23].We report the results in Table 3.
The structure of the Houlsby adapter is illustrated in Figure 2. The Pfeiffer architecture inserts only one adapter block into each Transformer block, saving about half of the parameters of Houlsby. We adopt the implementation of the Pfeiffer adapter in [26]. The Compacter is built upon low-rank optimization and parameterized hypercomplex multiplication layers. The K-Adapter adds adapters with Transformer structures to the backbone model in parallel. After a structure search, we insert two adapters each for the item and user encoders. We also evaluate a popular prompt tuning [28] technique as a baseline method, where the newly inserted token embeddings are added to the word embedding layer of the BERT model. For visual recommendation, VPT [24] is used as the prompt tuning method, which adds the new token patches to the positional patch embedding of the ViT model. Compared to prompt tuning for text, VPT needs to update the task-specific head. Following the original setup, we update the dimension transformation layer (DTL) module as the task-specific head.
The Houlsby adapter, among all methods, yields the best results under all settings with less than 3% of the trainable parameters of full fine-tuning. Following Houlsby, Pfeiffer achieves close performance with only around half of the parameters. This is because their adapter architectures are similar. The key difference is that Pfeiffer removes the adapters after the FFN. The reason why Pfeiffer performs relatively worse will be further discussed in Section 6. The conclusion is that the position of adapter blocks does affect the overall performance.
Table 3: Benchmark of popular parameter-efficient tuning techniques. Houlsby, K-Adapter, Pfeiffer, and Compacter adapters, along with LoRA, LayerNorm (LN) tuning, and prompt tuning, are presented. The best approach for each architecture is marked in bold. The "Architecture" is the combination of the user and item encoder. All results of HR@10 and NDCG@10 in this table are given in percentage (%). We represent the trainable parameters of each method as a percentage of full fine-tuning. We omit the results of RoBERTa and MAE as ME, which are consistent with BERT & ViT.

Compacter, with a special focus on parameter compression through low-rank factorization, exhibits a significant decrease in recommendation accuracy, especially in image-based tasks. This is most likely due to the extremely low capacity of the trainable modules in Compacter. The same can be observed for LoRA. Figure 4 also shows that reducing the number of parameters by a large amount can lead to very poor results. Therefore, the number of trainable parameters (TP) matters within a certain range in the recommendation task.
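To make the capacity argument concrete, the sketch below shows the standard LoRA formulation, in which a frozen weight matrix $W$ is augmented with a trainable low-rank product $BA$ scaled by $\alpha / r$. The class name `LoRALinear`, the rank, the scaling, and the initialization (Gaussian $A$, zero $B$) follow the common description of LoRA and are assumptions here, not the configuration used in our benchmark.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update (alpha / r) * B A."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.W = W                        # frozen pre-trained weights
        d_out, d_in = len(W), len(W[0])
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]   # trainable, rank x d_in
        self.B = [[0.0] * rank for _ in range(d_out)]  # trainable, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * l for b, l in zip(base, low_rank)]

    def trainable_params(self):
        rank, d_in, d_out = len(self.A), len(self.A[0]), len(self.B)
        return rank * (d_in + d_out)

W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 0.0],
     [0.0, 0.0, 0.0, 4.0]]
layer = LoRALinear(W, rank=1)
out = layer([1.0, 1.0, 1.0, 1.0])
```

With a small rank, the trainable update has far fewer parameters than the frozen matrix (8 versus 16 in this toy case), which illustrates why very aggressive compression can leave too little capacity for adaptation.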
One exception here is the K-Adapter, which adopts a Transformer layer within the adapter module and requires many more parameters to train than its counterparts. Surprisingly, its performance drops severely. We conjecture that the Transformer architecture within the K-Adapter is not suitable for domain adaptation since it was originally designed for knowledge injection rather than for parameter efficiency. The K-Adapter does not inject its information into the pre-trained model. Instead, it only receives knowledge from the pre-trained model. This direction of information flow makes its working mechanism very different.
The last two columns in Table 3 show the results for prompt tuning and LayerNorm tuning. Prompt tuning offers a flexible way to utilize a big pre-trained model in various downstream tasks, mainly in the NLP domain. However, it fails to give results competitive with the adapters in TransRec. LayerNorm tuning, i.e., only updating the LayerNorm parameters during adaptation, also suffers from severe performance degradation. These results again suggest that the number of TP is important for recommendation, although many NLP tasks can be performed well even with far fewer TP.
(Answer for Q(ii)) Overall, the Houlsby adapter obtains the best results in TransRec under all experimental settings, while the Pfeiffer adapter achieves slightly worse performance with half the number of parameters. In the domain of NLP, Compacter yields significantly better performance than the popular Houlsby and Pfeiffer adapters, even with an extremely small number of trainable parameters [26]. However, it fails to achieve decent results in the modality-based recommendation task. LayerNorm tuning routinely performs worse than full fine-tuning and adapter techniques in both CV and NLP, with an accuracy drop of about 10% to 20% [19,22]; however, it is only half as accurate as the best Houlsby method in the recommendation scenarios. One key finding is that the adapter's trainable parameter size, insertion positions, and information flow directions are all key factors for the recommendation task.

ANALYSIS OF MORE FACTORS (Q(III))
Since existing adapters are mainly derived from the NLP literature, a natural challenge is to effectively apply them in the recommendation scenario. Specifically, we ask: where and how should the adapters be inserted for TransRec? Regarding the question of where, we aim to check whether the two modules in TransRec, the item encoder $E_{item}$ and the user encoder $E_{user}$, are equally important for domain transfer, as this is specific to the recommendation task. Regarding the question of how, we evaluate two insertion strategies (serial vs. parallel) of the adapter networks and explore the effect of LayerNorm in the recommendation task.
We first evaluate the effect of adapters inserted into different modules in TransRec. There are three ways to place adapters: into both the user and item encoders (denoted here as AdaT-both), only into the item encoder (AdaT-item), and only into the user encoder (AdaT-user), with all other parameters fixed. From Table 4, first, we can clearly see that AdaT-item outperforms AdaT-user by a large margin in all experimental settings. This indicates that the item encoder plays a more important role in the recommendation task and requires more re-adaptation on the new datasets. Second, AdaT-item achieves results comparable to AdaT-both in textual RS, suggesting that the knowledge stored in the user encoder can be largely re-used with the adapted item encoder. However, there is still a significant gap between AdaT-item and AdaT-both for the visual task, indicating that the parameter adaptation of the user encoder is also important. This again shows that image-based recommendation is a more difficult task than text recommendation.
From Section 5, we know that the Pfeiffer adapter shows a performance gap from Houlsby. The only difference is that Houlsby inserts adapter blocks after both the FFN and the MHA, whereas Pfeiffer only inserts after the MHA. We further test variants of this adapter to verify the impact of the insertion position. The results are in Table 5. The accuracy drop may come from two sources: 1) no tunable adapter after the FFN; 2) fewer TP because one adapter is removed. To verify 1), we changed the position of the adapter (i.e., inserting it after the FFN), which yielded the same results. To verify 2), we doubled the number of TP, which still performs similarly, indicating that adapters should be inserted for both the FFN and the MHA for RS.

We then compare two adapter insertion strategies: serial and parallel. The parallel approach is also adopted in [22,78]. In Table 6, it can be seen that the two insertion methods, in general, perform very similarly. The other observation is that tuning or not tuning the LayerNorm layers has almost no obvious influence on the recommendation accuracy. This is very different from other fields [22,26], where optimizing both the adapter and the LayerNorm layers is strongly suggested for obtaining optimal results. Thereby, for practical RS tasks, we only need to save the adapter modules, which is more efficient and convenient.
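The two insertion strategies can be contrasted with a minimal sketch. It assumes the usual reading of the terms: in the serial case the adapter consumes the output of the sublayer (MHA or FFN), whereas in the parallel case it consumes the sublayer's input and its output is added to the sublayer's output. The toy sublayer, the adapter update function, and all names are hypothetical; the block's own residual connection and LayerNorm are omitted.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def serial_insertion(sublayer, adapter_delta, x):
    """Adapter applied to the sublayer output: h + f(h), with h = sublayer(x)."""
    h = sublayer(x)
    return add(h, adapter_delta(h))

def parallel_insertion(sublayer, adapter_delta, x):
    """Adapter applied to the sublayer input: sublayer(x) + f(x)."""
    return add(sublayer(x), adapter_delta(x))

# Toy stand-ins (assumptions): a "sublayer" that doubles its input and an
# adapter update f(v) = 0.1 * relu(v) in place of the bottleneck network.
def toy_sublayer(x):
    return [2.0 * v for v in x]

def toy_adapter_delta(x):
    return [0.1 * max(0.0, v) for v in x]

x = [1.0, -2.0]
y_serial = serial_insertion(toy_sublayer, toy_adapter_delta, x)
y_parallel = parallel_insertion(toy_sublayer, toy_adapter_delta, x)
```

Both variants add the same number of trainable parameters; they differ only in which activations the adapter sees, which is consistent with the small accuracy differences observed between them.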
(Answer for Q(iii)) We draw the following conclusions: (1) TransRec should place adapters in both the user and item encoders to obtain the best results, in particular for visual RS, whose performance drops significantly if the parameters of either the item encoder or the user encoder are completely fixed; (2) the insertion position within the Transformer layers is also important: both the FFN and the MHA require a separate adapter module; (3) other factors, such as the insertion style (serial or parallel) and LayerNorm optimization, do not matter much for the recommendation task, although they are often considered for NLP and CV tasks; (4) again, the number of trainable parameters is a key factor for the accuracy of TransRec within a certain range, as described in the previous section.

SCALING EFFECTS
To better understand the role of training data during pre-training and downstream adaptation, we conduct experiments by scaling data in both the source domain ($\mathcal{D}_S$) and the target domain ($\mathcal{D}_T$), and present the results in Figure 5. According to the performance curves, we make the following observations: (1) Despite some exceptions, HR@10 shows a clear trend of improvement for FTA and AdaT on the two modalities as the upstream pre-training dataset $\mathcal{D}_S$ increases. This observation has important implications: for industrial recommender systems, one can expect greater performance gains with more pre-training data from the source domain.
(2) AdaT shows poor results under the NoPT setting, where only the item ME is pre-trained (on some NLP and CV data, e.g., ImageNet), and the user encoder is randomly initialized. This suggests that the lightweight adapter network performs (and can only perform) parameter adaptation: it is expected to fail or perform worse when the parameters (of the user encoder) are randomly initialized.
(3) There are some other observations consistent with the previous description.For example, AdaT achieves comparable results to FTA for the text-based recommendation, while it lags behind FTA for image-based recommendation regardless of the size of the source and target datasets.We omit such repetitive descriptions here.
Imagine a practical scenario where we have a large number of user-item interactions in some industrial platform.The pre-trained knowledge (i.e., parameters) of TransRec in this platform can be effectively transferred to serve many other recommendation systems or channels through adapter tuning.

RELATED WORK
Parameter-efficient transfer learning (PETL). Researchers have been working on PETL for years to reduce the gigantic number of trainable parameters in large-scale pre-trained models. The principal way is to introduce adapter tuning techniques [38,46]. In NLP, the first adapter was proposed in [22], where the authors showed that training only the newly inserted adapter blocks, without any modification of the pre-trained parameters, could achieve results competitive with full-parameter fine-tuning. Pfeiffer et al. [44] proposed the AdapterHub framework to facilitate, simplify, and speed up transfer learning across a variety of languages and tasks. [59] proposed to inject multiple kinds of knowledge into large pre-trained models with the K-Adapter. Recently, [11,76] studied adapter tuning for LLM (Large Language Models). He et al. [19] explored PETL techniques in ViT-based computer vision tasks. ViT-Adapter [5] achieved state-of-the-art performance on many dense prediction tasks in CV. LoRA is another PETL technique that is similar to adapters but avoids the inference latency [23]. Concurrent work [1,14] used adapters/LoRA to adapt LLM for textual RS, unlike our TransRec problem, which transfers from a source to a target domain.

Prompt tuning [28,30,31,36,60,68,70,77] is another popular PETL paradigm. It shows that optimizing only the embeddings of a few prompt tokens achieves performance similar to full model fine-tuning. Recently, P5 [13] presented a "pretrain, personalized prompt & predict paradigm" that can learn multiple recommendation-related tasks together by formulating them as prompt-based natural language tasks. M6-Rec [7] showed that prompt tuning outperformed fine-tuning with a negligible 1% of task-specific parameters. However, in this paper, with the two popular architectures, we found that standard prompt tuning is still unsatisfactory compared to adapter tuning or fine-tuning. [67] and [68] utilized prompt tuning to study selective fairness and cold-start recommendation, but they are ID-based methods, which differ from our modality setting.

Modality-based TransRec. Inspired by the success of foundation models in the NLP and CV fields, modality-based/only recommendation (MoRec) has attracted rising attention recently [6,32,34,40,47,54,57,58,61,63,74,75]. Typically, these works use a foundation model such as BERT, RoBERTa, or GPT [3] as the text encoder, or ViT or ResNet as the image encoder. The user encoders still follow the traditional IDRec architectures, e.g., SASRec [25] and BERT4Rec [53].
A key advantage of MoRec models is that they are naturally transferable because item modality representation is universal regardless of platforms and systems. For example, ZESRec [9] proposed a zero-shot predictor by leveraging the natural language representation extracted from BERT. Similar work also includes TransRec [57] by Wang et al., UniSRec [21], and ShopperBERT [51], which all leveraged textual features to realize transferable recommendation. However, so far, the existing TransRec literature (especially image TransRec) mostly utilizes off-the-shelf features pre-extracted from ME, which has efficiency advantages over fine-tuning heavy ME. Recently, [32,57,74] started to perform joint training of the user encoder and item ME in both the pre-training and fine-tuning stages (i.e., our FTA baseline), which showed significantly improved performance compared to pre-extracted fixed features. Therefore, in this paper, we study TransRec as a comparison baseline in a more powerful end-to-end (or joint) learning manner.

To the best of our knowledge, few studies have investigated adapter tuning techniques for modality-based TransRec, especially for inserting adapters into the item ME. PeterRec [72] proposed the first adapter tuning technique for the recommendation task, but it relies heavily on overlapped userIDs when performing transfer learning. Moreover, the majority of parameters in PeterRec are in the ID embedding layer rather than the middle layers. Therefore, adapter tuning is still new for modality-based TransRec models.

CONCLUSION AND FUTURE WORK
In this paper, we conducted an extensive empirical study examining the performance of popular adapter tuning (AdaT) techniques for modality-based TransRec models. We identified two facts: (1) the SOTA AdaT achieves results competitive with fine-tuning all parameters (FTA) for text recommendation; (2) AdaT works but lags behind FTA for image recommendation. We then benchmarked four well-known AdaT approaches and found that their behavior was somewhat idiosyncratic compared to NLP and CV tasks. We studied in depth several key factors that may influence AdaT results for recommendation tasks. Finally, we found that TransRec with AdaT meets our expectations regarding the data scaling effect: TransRec benefits when upscaling the source domain data or downscaling the target domain data. Our work provides important guidelines for parameter-efficient transfer learning for modality-based recommendation models. It also has important practical implications for foundation models [2] in the RS community, with the grand goal of 'one model for all' [50,57,72,73].
There are several interesting future directions.The first one is to develop more advanced AdaT TransRec for visual item recommendation.Then, we are also interested in investigating the effects of AdaT for multimodal (i.e., both text and image) TransRec.Third, given that most typical AdaT does not help to speed up the training process in practice (nor for NLP and CV tasks), it is important to explore effective optimization techniques to reduce the computational cost and time for TransRec through end-to-end training of item modality encoders.

Figure 3: The adapter-based TransRec framework. TransRec consists of a user encoder (UE) and multiple item encoders, separated by the dotted line. BERT and ViT are used as examples of the text encoder and image encoder, respectively. SASRec and CPC (a DSSM variant) are used to train UE. The Z vectors are generated by UE, and the item vectors are generated by ME. Adapters are injected into UE in the same way as into the item encoder.
5 platform are used as the text and image encoders, respectively. The dimension of the hidden representations of the user encoder is searched in {32, 64, 128} and set to 64. The number of Transformer blocks and the number of attention heads are both fixed to 2. We use Adam as the optimizer without weight decay throughout the experiments and extensively search the learning rate from 1e-6 to 1e-2 while keeping the dropout probability at 0.1. We set the batch size to 64 for textual datasets and 32 for visual datasets due to GPU memory limits. When adapting to the target domain, we set the batch size to 32 for both modalities. The hidden dimension of the adapter networks is carefully searched in {8, 16, 32, 48, 64, 96, 128, 192, 384, 768}, and the number of prompt-tuning tokens in {5, 10, 20, 30, 40, 50}. Note that the hyper-parameters of the parameter-efficient modules are searched only in the SASRec-based architectures and directly transferred to the CPC-based methods. All hyper-parameters are determined according to performance on the validation data.
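The search setup above can be summarized as a small configuration sketch. This is only an illustration, not the released code: the dictionary keys, the helper names `adapter_grid` and `select_best`, and the log-spaced learning-rate grid are assumptions (the text states only the range 1e-6 to 1e-2), and the batch size of 64/32 is assumed to apply to source-domain training.

```python
from itertools import product

# Settings that the text reports as fixed (key names are illustrative).
FIXED = {
    "optimizer": "Adam",
    "weight_decay": 0.0,
    "dropout": 0.1,
    "user_encoder_hidden_dim": 64,  # selected from {32, 64, 128}
    "transformer_blocks": 2,
    "attention_heads": 2,
    "pretrain_batch_size": {"text": 64, "image": 32},  # assumed: source-domain stage
    "target_batch_size": 32,
}

# Search spaces stated in the text.
ADAPTER_HIDDEN_DIMS = [8, 16, 32, 48, 64, 96, 128, 192, 384, 768]
PROMPT_TOKENS = [5, 10, 20, 30, 40, 50]

# Assumption: only the range 1e-6 to 1e-2 is given; this grid is a placeholder.
LEARNING_RATES = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]


def adapter_grid():
    """Yield one candidate configuration per (learning rate, adapter dim) pair."""
    for lr, dim in product(LEARNING_RATES, ADAPTER_HIDDEN_DIMS):
        yield {**FIXED, "learning_rate": lr, "adapter_hidden_dim": dim}


def select_best(configs, validation_score):
    """Pick the configuration with the highest validation score.

    `validation_score` is a hypothetical callable that trains a model with the
    given configuration and returns a validation metric such as HR@10.
    """
    return max(configs, key=validation_score)
```

In this sketch, the selection would be run once on the SASRec-based architecture and the chosen adapter settings reused for the CPC-based one, mirroring the note above.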

Figure 5: Scaling effects of fine-tuning and adapter-based TransRec using the SASRec objective. The x-axis represents the size of the pre-training data. NoPT refers to TransRec that was not pre-trained on the source domain dataset.

Table 2: Fine-tuning and adapter tuning comparison. FTA and AdaT represent "Fine-tune All" and "Adapter Tuning", respectively. TP stands for trainable parameters. All HR@10 and NDCG@10 results in this table are given as percentages (%). T and V represent textual and visual recommendation. The difference between FTA and AdaT is denoted by Diff.
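For readers unfamiliar with the two metrics, the following is a minimal sketch of how HR@10 and NDCG@10 are commonly computed. It assumes a single held-out target item per user, which is a common protocol in sequential recommendation but is not confirmed by this section; all function names are illustrative.

```python
import math


def rank_of_target(scores, target):
    """Return the 1-based rank of `target` when items are sorted by descending score."""
    target_score = scores[target]
    higher = sum(1 for item, s in scores.items() if item != target and s > target_score)
    return 1 + higher


def hr_at_k(rank, k=10):
    """Hit ratio for one user: 1 if the target is within the top k, else 0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k=10):
    """NDCG for one user with a single relevant item (the ideal DCG is 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


def evaluate(ranks, k=10):
    """Average both metrics over users and report them as percentages."""
    n = len(ranks)
    hr = 100.0 * sum(hr_at_k(r, k) for r in ranks) / n
    ndcg = 100.0 * sum(ndcg_at_k(r, k) for r in ranks) / n
    return hr, ndcg
```

With a single relevant item, NDCG@10 rewards placing the target near the top of the list, whereas HR@10 only checks whether it appears in the top ten.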

Table 4: Comparison of full adapter-based TransRec with variants that add adapters only to the item encoder or only to the user encoder. The two subscripted Adapter variants denote adding adapters only to the item encoder and only to the user encoder, respectively. TP stands for trainable parameters. The remaining subscripts denote ViT and BERT.

Table 5: Impact of the adapter position inside a Transformer block. We present HR@10 for text and image recommendation with the SASRec+BERT and SASRec+ViT architectures. The two Adapter variants insert the adapter block after the FFN and after the MHA, respectively. Their "++" counterparts use the same architectures as these two but with 2x the parameters.
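The two insertion positions compared here can be illustrated with a simplified sketch. It is not the implementation used in the experiments: the bias-free bottleneck with a ReLU and a zero-initialized up-projection is an assumed design, LayerNorm is omitted, `mha` and `ffn` are placeholder callables, and all names are hypothetical.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class BottleneckAdapter:
    """Down-projection, ReLU, up-projection, plus a skip connection."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_hidden)]
        # Assumption: zero-initialized up-projection, so the adapter starts as identity.
        self.up = [[0.0] * d_hidden for _ in range(d_model)]

    def num_parameters(self):
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])

    def __call__(self, h):
        z = [max(0.0, v) for v in matvec(self.down, h)]
        return [a + b for a, b in zip(h, matvec(self.up, z))]


def transformer_block(x, mha, ffn, adapter=None, position="ffn"):
    """Simplified block: the adapter transforms one sub-layer output before its residual add."""
    if position not in ("mha", "ffn"):
        raise ValueError("position must be 'mha' or 'ffn'")
    h = mha(x)
    if adapter is not None and position == "mha":
        h = adapter(h)
    x = [a + b for a, b in zip(x, h)]
    h = ffn(x)
    if adapter is not None and position == "ffn":
        h = adapter(h)
    return [a + b for a, b in zip(x, h)]
```

In this sketch, doubling `d_hidden` doubles `num_parameters()`, which is one possible way to obtain a variant with 2x the parameters; the section does not state how the "++" variants were actually constructed.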

Table 6: Performance of the adapter insertion methods. The best HR@10 in each column is marked in bold. "w/" and "w/o" denote with and without updating the LayerNorm.