Cross-domain Recommendation with Behavioral Importance Perception

Cross-domain recommendation (CDR) aims to leverage source domain information to provide better recommendations for the target domain, and is widely adopted in recommender systems to alleviate the data sparsity and cold-start problems. However, existing CDR methods mostly focus on designing effective model architectures to transfer source domain knowledge, ignoring the behavior-level effect during loss optimization, where behaviors regarding different aspects in the source domain may have different importance for CDR model optimization. Ignoring the behavior-level effect causes the carefully designed model architectures to end up with sub-optimal parameters, which limits recommendation performance. To tackle this problem, we propose a generic behavioral importance-aware optimization framework for cross-domain recommendation (BIAO). Specifically, we propose a behavioral perceptron which predicts the importance of each source behavior according to the corresponding item's global impact and its local user-specific impact. The joint optimization of the CDR model and the behavioral perceptron is formulated as a bi-level optimization problem. In the lower optimization, only the CDR model is updated with the weighted source behavior loss and the target domain loss, while in the upper optimization, the behavioral perceptron is updated with the implicit gradient from a developing dataset obtained through the proposed reorder-and-reuse strategy. Extensive experiments show that our proposed optimization framework consistently improves the performance of different cross-domain recommendation models in 7 cross-domain scenarios, demonstrating that our method can serve as a generic and powerful tool for cross-domain recommendation.


INTRODUCTION
Recommender systems aim to provide personalized recommendations to users according to their historical behaviors, and have played an important role in various applications [6,28]. However, existing recommender systems often suffer from the data sparsity and cold-start problems [8,9,28,29,31], where the data sparsity problem is caused by insufficient user-item interaction records and the cold-start problem is generally caused by new users with few or no interactions.
To address these problems, cross-domain recommendation, which improves the recommendation accuracy in the target domain by utilizing the user's behaviors in the source domain, has achieved great success recently [8,12,20,27].
Existing cross-domain recommendation (CDR) methods mainly focus on designing effective model architectures to transfer knowledge from the source domain to the target domain. In particular, EMCDR-based methods [18,33] aim to share knowledge through the user's embedding: in each domain the user's embedding is optimized with the source or target recommendation loss respectively, and then mapping functions like a global nonlinear function [18] or a user-specific function [33] are learned to align the user's embeddings in the two domains. However, the transfer mechanism of EMCDR simply focuses on user embedding mapping, limiting its recommendation performance. More advanced architectures [8,12,13,17,20] have recently been proposed for more precise cross-domain recommendation. CoNet [8] adopts the cross-stitch structure from multi-task learning to transfer source features to the target domain. MiNet [20] and DASL [12] design different kinds of attention mechanisms to select the target-domain-related interest from the source behaviors. These methods [8,12,13,17,20] share not only the user embedding but also deep model parameters by utilizing the joint loss of both the source and target domain to optimize the whole model, and thus achieve promising target recommendation performance.
Despite the effectiveness of the existing methods, they neglect the fact that behaviors regarding different aspects in the source domain may have different importance for CDR model optimization. They lack behavior-level consideration in the joint loss to be optimized, which leads to sub-optimal parameters for the carefully designed CDR models. Specifically, the joint loss adopted by existing works [8,12,13,17,20,26,30] is generally obtained as follows: (1) average the losses of all the behaviors in the target/source domain to obtain the target/source loss, (2) weight the source loss with a hyper-parameter to balance information from the two domains, and (3) add the two loss terms together to obtain the joint loss. However, an appropriate behavior-level consideration requires that not only the losses between the two domains be balanced, but also the loss of each individual behavior in the source domain. For example, if we want to utilize the user's behavior in the book domain to help recommendation in the target toy domain, interactions with books related to "children" or "toy stories" can provide beneficial information to the target domain, but interactions with "love story" books provide little helpful information. Additionally, as indicated by Man et al. [18], using the losses of users that have only a few interactions in the source domain tends to harm the target domain recommendation accuracy. Therefore, existing works that average the source behavior losses without behavior-level consideration fail to filter out the harmful information, and thus only obtain sub-optimal parameters for the carefully designed model architectures.
To address this problem, we propose to consider behavior-level importance by assigning an individual importance weight to each source behavior loss, which faces two challenges.
• If each weight is regarded as a hyper-parameter, assigning proper behavior-level importance means deciding one hyper-parameter for each of the millions of source behaviors, which brings explosive computational cost.
• Even if we can reduce the number of required hyper-parameters to an affordable level, how to optimize the reduced hyper-parameters remains a problem.
To tackle these challenges, we propose a generic behavioral importance-aware optimization framework (BIAO) for cross-domain recommendation, which involves a novel behavioral perceptron for the first challenge and a tailored bi-level optimization algorithm for the second. The proposed behavioral perceptron learns the importance of each source behavior (i.e., a user-item pair) according to the item's global impact on the target domain as well as its local impact on a specific user, where the former is modeled by a global MLP and the latter is modeled with self-attention followed by a local MLP. This reduces the required hyper-parameters from millions to thousands. The tailored bi-level optimization algorithm jointly optimizes the recommendation model parameters and the behavioral perceptron parameters. In the lower optimization, we utilize the target loss together with the importance-weighted source loss to optimize the recommendation model parameters, while in the upper optimization, we update the perceptron parameters with the implicit gradient from a developing dataset obtained through the designed reorder-and-reuse strategy. This strategy makes full use of all the target behaviors in both the upper and lower optimization, alleviating the potential bias and information loss in previous bi-level optimization works [2,15]. The proposed behavioral perceptron is learned in a bi-level, data-driven manner, and is thus generic enough to automatically fit different recommendation models and datasets. Extensive experiments and analysis show that our proposed BIAO framework consistently improves the performance of different cross-domain recommendation models in seven cross-domain scenarios. Our contributions are summarized as follows:
• To the best of our knowledge, we are the first to consider the behavior-level effect by assigning an individual importance weight to each source domain behavior loss for cross-domain recommendation.
• We propose a generic behavioral importance-aware optimization (BIAO) framework for cross-domain recommendation, which includes a novel behavioral perceptron and a tailored, effective bi-level optimization algorithm.
• We conduct extensive experiments with different models on different datasets. Empirical results show that our proposed BIAO framework brings consistent performance improvement, demonstrating its ability to serve as a generic and powerful tool for cross-domain recommendation.

RELATED WORK
In this section, we review related work for cross-domain recommendation and bi-level optimization.
Cross-Domain Recommendation. Cross-domain recommendation aims to utilize the source domain information to provide more precise recommendations to users in the target domain. A line of typical methods is based on EMCDR [1,10,18,32,33]. The original EMCDR [18] utilizes a latent factor model to learn the user's embedding in the source domain and the target domain respectively; a non-linear mapping from the source user embedding to the target domain is then learned to transfer the knowledge. [32] proposes a task-oriented loss that utilizes the mapped embedding to predict the target behavior instead of the target user embedding. [33] further proposes a personalized mapping function for each user, and [10] gives a more reasonable metric learning formulation for EMCDR. However, the EMCDR-based methods only consider user embedding mapping, limiting their recommendation performance. More advanced cross-domain recommendation models have been proposed recently [8,12,17,20,26]. CoNet [8] adopts the cross-stitch structure from multi-task learning to share features between the source and target domain. π-Net [17], MiNet [20] and DASL [12] consider sequential cross-domain recommendation, where attention mechanisms are adopted to select useful source information. Despite the effectiveness of these models, they simply utilize a linear combination of the source loss and the target loss to optimize the model, ignoring the different impacts of behaviors in the source domain. To further exploit the potential of these models, we propose to conduct behavioral importance-aware optimization for cross-domain recommendation.
Bi-level Optimization. The bi-level optimization problem arises in many scenarios of deep learning, like meta learning [5,21], neural architecture search [14], and auxiliary learning [2,3,19]. To conduct the upper optimization, some works adopt unrolled differentiation to calculate the upper-level gradient [4,23]. However, several steps of unrolling are memory-consuming [15], and the efficient one-step unrolling suffers from short horizons [25]. The implicit gradient is another widely adopted strategy for bi-level optimization. However, fully obtaining the implicit gradient requires the inverse of the Hessian matrix, which is computationally prohibitive for deep models. Several works approximate the inverse of the Hessian, using the identity matrix [16], conjugate gradient [22], or a truncated Neumann series [15]. All these methods generally split the training dataset into two disjoint subsets for the lower and upper optimization, which easily causes bias in the upper optimization and information loss in the lower optimization. In this paper, we adopt [15] to optimize the proposed behavioral perceptron, but propose a reorder-and-reuse strategy to alleviate the potential bias and information loss problems.

THE PROPOSED METHOD
In this section, we present preliminaries, the proposed behavioral perceptron and the effective optimization strategy.

Preliminaries
Assume that there exists a set of users $\mathcal{U}$ that have interactions with items in both the source domain and the target domain. The schematic diagram of a typical CDR model is shown in Figure 1. The CDR dataset includes a source behavior set $\mathcal{S}$ and a target behavior set $\mathcal{T}$, which contain the user-item behavior pairs for training. The CDR model generally takes the user's profile $u$ (like user ID or age), the candidate item $v_s$ in the source domain, the candidate item $v_t$ in the target domain, and some other source/target features $f_s$/$f_t$ as input, where $(u, v_s)$ is a behavior from the source behavior set $\mathcal{S}$ and $(u, v_t)$ is from the target behavior set $\mathcal{T}$. The other features $f_s$/$f_t$ in the source/target domain could be historical sequential features for sequential recommendation. The CDR model parameters $\theta$ generally include the target-only/source-only parameters $\theta_t$/$\theta_s$, like the final predictor of each domain in MiNet [20] or CoNet [8], and the parameters $\theta_{sh}$ shared by both domains, like the user profile embedding. Given the input data and the CDR model, the loss $\ell_{u,v_t}$ of the target behavior $(u, v_t)$ and the loss $\ell_{u,v_s}$ of the source behavior $(u, v_s)$ are calculated. Finally, the losses of all the source behaviors and all the target behaviors are added together with a balance factor $\lambda$, and the summed loss is used to optimize the CDR model parameters $\theta$:
$$\theta^* = \arg\min_{\theta} \frac{1}{|\mathcal{T}|}\sum_{(u,v_t)\in\mathcal{T}} \ell_{u,v_t}(\theta) \;+\; \lambda\,\frac{1}{|\mathcal{S}|}\sum_{(u,v_s)\in\mathcal{S}} \ell_{u,v_s}(\theta). \quad (1)$$
Current methods rely only on the global hyper-parameter $\lambda$ to balance the source and target information. However, the different effects of different behaviors $(u, v_s) \in \mathcal{S}$ on the target domain recommendation are ignored, leading to a sub-optimal $\theta$. Note: although the source and target losses are optimized together, the CDR model only cares about the performance in the target domain.
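To make Eq. (1) concrete, here is a minimal PyTorch sketch of the conventional joint loss, assuming the per-behavior losses have already been computed as tensors (function and argument names are ours, for illustration):

```python
import torch

def joint_loss(tgt_losses: torch.Tensor, src_losses: torch.Tensor,
               lam: float) -> torch.Tensor:
    """Conventional joint CDR loss (Eq. 1): per-behavior losses are
    averaged within each domain and balanced by a single scalar lambda."""
    return tgt_losses.mean() + lam * src_losses.mean()
```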

The Behavioral Perceptron
To conduct the behavioral importance-aware optimization, our proposed optimization objective is as follows:
$$\theta^* = \arg\min_{\theta} \frac{1}{|\mathcal{T}|}\sum_{(u,v_t)\in\mathcal{T}} \ell_{u,v_t}(\theta) \;+\; \frac{1}{|\mathcal{S}|}\sum_{(u,v_s)\in\mathcal{S}} w_{u,v_s}\,\ell_{u,v_s}(\theta), \quad (2)$$
where each source behavior loss is given an individual weight $w_{u,v_s}$ so that beneficial behaviors can be preserved while harmful ones are discarded. The key problem now is how to decide the weights for the source-domain behaviors. We propose a behavioral perceptron to perceive the importance of each behavior, whose structure is shown in Figure 2.
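Contrasting with the sketch of Eq. (1) above, Eq. (2) replaces the single global scalar with one weight per source behavior (again a minimal sketch with illustrative names):

```python
import torch

def weighted_joint_loss(tgt_losses: torch.Tensor, src_losses: torch.Tensor,
                        src_weights: torch.Tensor) -> torch.Tensor:
    """Behavioral importance-aware objective (Eq. 2): the single lambda is
    replaced by one perceptron-produced weight per source behavior, so
    harmful behaviors can be down-weighted individually."""
    return tgt_losses.mean() + (src_weights * src_losses).mean()
```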
Specifically, for a behavior $(u, v_s)$ in the source domain, the behavioral perceptron judges its importance from the global importance of $v_s$ to the target domain and the local importance of $v_s$ to user $u$, as follows.
Global item importance. Since different kinds of items in the source domain have different impacts on the target domain, e.g., "cartoon" books are more beneficial than "love story" books when the target domain is toys, we adopt a multi-layer perceptron (MLP) to map the features of $v_s$ to its importance weight. We concatenate all fields of its features, like its ID, category, etc., into an embedding $e_{v_s}$, and then the global item weight is obtained as follows:
$$w^g_{v_s} = \mathrm{MLP}_g(e_{v_s}; \phi_g), \quad (3)$$
where $\phi_g$ is the learnable parameters of the global MLP. Since this weight captures the importance of item $v_s$ to the whole target domain, we call it the global importance.
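A minimal PyTorch sketch of Eq. (3); the hidden size is our assumption, as the text does not specify layer widths:

```python
import torch
import torch.nn as nn

class GlobalItemImportance(nn.Module):
    """Global importance of a source item (Eq. 3): an MLP maps the
    concatenated feature-field embedding e_{v_s} to a scalar weight."""
    def __init__(self, emb_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, item_emb: torch.Tensor) -> torch.Tensor:
        # item_emb: (batch, emb_dim), all feature fields concatenated
        return self.mlp(item_emb).squeeze(-1)  # (batch,)
```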
User-specific item importance. Besides the item information, we also judge the importance of $(u, v_s)$ from the user's historical interactions: if $v_s$ is more related to the user's recent interactions, the behavior $(u, v_s)$ should be weighted more highly. Specifically, the recent interactions of user $u$ in the source domain are $[v_s^1, v_s^2, \cdots, v_s^m]$, and those in the target domain are $[v_t^1, v_t^2, \cdots, v_t^n]$. We first map each of the historical items to its embedding, obtaining $[e_s^1, e_s^2, \cdots, e_s^m]$ and $[e_t^1, e_t^2, \cdots, e_t^n]$. To make the perceptron find the interest that is related to $v_s$, we concatenate the embedding $e_{v_s}$ to each item embedding in the two historical sequences, obtaining $\tilde{e}_s^i = [e_{v_s}; e_s^i]$ for $i \in \{1, 2, \cdots, m\}$ and $\tilde{e}_t^j = [e_{v_s}; e_t^j]$ for $j \in \{1, 2, \cdots, n\}$. After obtaining the item-aware historical embeddings, we use two Multi-head Attention (MHA) [24] modules to extract the user's recent interest about $v_s$ in the source and target domain respectively:
$$h_s = \mathrm{MHA}([\tilde{e}_s^1, \tilde{e}_s^2, \cdots, \tilde{e}_s^m]; \phi_s), \quad (4)$$
$$h_t = \mathrm{MHA}([\tilde{e}_t^1, \tilde{e}_t^2, \cdots, \tilde{e}_t^n]; \phi_t), \quad (5)$$
where $\phi_s$ and $\phi_t$ are the parameters of the two MHA modules. After the MHA process, we use mean pooling to obtain the recent source/target interest about $v_s$:
$$\bar{h}_s = \mathrm{MeanPool}(h_s), \quad \bar{h}_t = \mathrm{MeanPool}(h_t). \quad (6)$$
Then we concatenate them together and use an MLP to obtain the local importance:
$$w^l_{u,v_s} = \mathrm{MLP}_l([\bar{h}_s; \bar{h}_t]; \phi_l), \quad (7)$$
where $\phi_l$ is the parameters of the local MLP. This importance considers the information from the specific user $u$, and we call it the local user-specific importance.
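A sketch of Eqs. (4)-(7) in PyTorch; the 4 attention heads match the implementation section, while the MLP sizes are our assumptions:

```python
import torch
import torch.nn as nn

class UserSpecificImportance(nn.Module):
    """Local user-specific importance (Eqs. 4-7): concatenate the candidate
    source item embedding to every historical item embedding, run one
    multi-head attention module per domain, mean-pool, and score with a
    local MLP."""
    def __init__(self, emb_dim: int, n_heads: int = 4):
        super().__init__()
        d = 2 * emb_dim  # item-aware embedding: [e_{v_s}; e_hist]
        self.mha_src = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mha_tgt = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.local_mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(),
                                       nn.Linear(d, 1))

    def forward(self, cand_emb, src_hist, tgt_hist):
        # cand_emb: (B, E); src_hist: (B, m, E); tgt_hist: (B, n, E)
        s = torch.cat([cand_emb.unsqueeze(1).expand_as(src_hist), src_hist], dim=-1)
        t = torch.cat([cand_emb.unsqueeze(1).expand_as(tgt_hist), tgt_hist], dim=-1)
        h_s, _ = self.mha_src(s, s, s)  # Eq. (4): self-attention, source history
        h_t, _ = self.mha_tgt(t, t, t)  # Eq. (5): self-attention, target history
        pooled = torch.cat([h_s.mean(dim=1), h_t.mean(dim=1)], dim=-1)  # Eq. (6)
        return self.local_mlp(pooled).squeeze(-1)  # Eq. (7): (B,)
```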
Finally, we multiply the two importance weights to obtain the final importance of $(u, v_s)$:
$$w_{u,v_s} = \lambda_{\phi} \cdot \sigma\big(w^g_{v_s} \cdot w^l_{u,v_s}\big), \quad (8)$$
where $\sigma$ is the normalization function, $\lambda_{\phi}$ is a learnable global scalar that balances the information of the two domains, and $w_{u,v_s}$ is a function of $\phi = \{\phi_g, \phi_s, \phi_t, \phi_l, \lambda_{\phi}\}$, which contains all the behavioral perceptron parameters. Compared to directly assigning each source behavior a weight, which requires millions of parameters, the behavioral perceptron only involves several fully connected layers. Assuming that the embedding dimension of the item is $d$ (typically 32 or 64), the behavioral perceptron only requires $O(d^2)$ parameters, which is on the thousand level. However, how to optimize $\phi$ still remains a problem: directly using Eq. (2) to optimize $\phi$ will easily drive the weights to zero, a trivial solution that cannot utilize the source loss information. Next, we present our bi-level optimization framework that jointly optimizes $\theta$ and $\phi$.
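Before turning to the optimization, here is a combined sketch of Eq. (8), built on the two sketches above; we assume a sigmoid as the normalization $\sigma$, which the text does not pin down:

```python
import torch
import torch.nn as nn

class BehavioralPerceptron(nn.Module):
    """Final behavior weight (Eq. 8): product of global and local
    importance, normalized and scaled by a learnable balance scalar.
    Uses GlobalItemImportance and UserSpecificImportance from above."""
    def __init__(self, emb_dim: int):
        super().__init__()
        self.global_imp = GlobalItemImportance(emb_dim)
        self.local_imp = UserSpecificImportance(emb_dim)
        self.lam = nn.Parameter(torch.ones(1))  # learnable global scalar

    def forward(self, cand_emb, src_hist, tgt_hist):
        w = self.global_imp(cand_emb) * self.local_imp(cand_emb, src_hist, tgt_hist)
        return self.lam * torch.sigmoid(w)  # one weight per source behavior
```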

Overall Bi-level Optimization Formulation
The proposed behavioral importance-aware optimization framework is shown in Figure 2. Note that our ultimate goal is to obtain the optimal behavioral perceptron parameters $\phi$ which select the most beneficial source behaviors for optimizing the CDR model parameters $\theta$, so that the CDR model performs best in the target domain. This goal can be formulated as a bi-level optimization problem:
$$\phi^* = \arg\min_{\phi} \mathcal{L}_{dev}\big(\theta^*(\phi)\big), \quad \mathrm{s.t.} \quad \theta^*(\phi) = \arg\min_{\theta} \mathcal{L}(\theta; \phi), \quad (9)$$
where $\mathcal{L}(\theta; \phi)$ is the weighted objective in Eq. (2) with the weights $w_{u,v_s}$ obtained through Eq. (8). The lower optimization aims to find the optimal $\theta^*(\phi)$ that minimizes $\mathcal{L}(\theta; \phi)$, i.e., it optimizes the CDR model parameters while the behavioral perceptron is fixed. Note that if $\phi$ changes, $\theta^*$ will also change, so $\theta^*$ is an implicit function of $\phi$, and we denote it as $\theta^*(\phi)$. $\mathcal{L}_{dev}(\theta^*(\phi))$ is the loss of the CDR model on a new developing dataset $\mathcal{T}'$ in the target domain. Assuming that we can obtain such an additional dataset in the target domain, our final optimization goal is that the CDR model $\theta^*(\phi)$ achieves the best performance on this developing target dataset, which functions just like a validation dataset. Later we will explain how existing works obtain the additional developing target dataset and how we do; for now, we focus on how to conduct the lower and the upper optimization in Eq. (9).
Lower Optimization. The lower optimization is straightforward. With the behavioral perceptron parameters $\phi$ fixed, we can use any optimizer like SGD or Adam [11] to optimize $\theta$ so that $\mathcal{L}(\theta; \phi)$ is minimized.
Upper Optimization. The upper optimization is more complex, since $\mathcal{L}_{dev}(\theta^*(\phi))$ is a loss on the target domain that directly depends on $\theta$ rather than $\phi$, so autograd tools cannot directly calculate $\nabla_{\phi} \mathcal{L}_{dev}(\theta^*(\phi))$. Therefore, considering that $\theta^*(\phi)$ is an implicit function of $\phi$, we utilize the chain rule to obtain the implicit gradient:
$$\nabla_{\phi} \mathcal{L}_{dev}\big(\theta^*(\phi)\big) = \nabla_{\theta} \mathcal{L}_{dev}\big(\theta^*(\phi)\big)\, \nabla_{\phi} \theta^*(\phi), \quad (10)$$
where $\nabla_{\theta} \mathcal{L}_{dev}(\theta^*(\phi))$ is easily obtained using autograd tools, and our target now is to obtain $\nabla_{\phi} \theta^*(\phi)$. Note that $\theta^*(\phi)$ is the minimal point of $\mathcal{L}(\theta; \phi)$, so we have:
$$\nabla_{\theta} \mathcal{L}\big(\theta^*(\phi); \phi\big) = 0. \quad (11)$$
Further taking the gradient with respect to $\phi$ on both sides of Eq. (11), we obtain:
$$\nabla^2_{\theta} \mathcal{L}\big(\theta^*(\phi); \phi\big)\, \nabla_{\phi} \theta^*(\phi) + \nabla_{\phi} \nabla_{\theta} \mathcal{L}\big(\theta^*(\phi); \phi\big) = 0. \quad (12)$$
Therefore, the gradient $\nabla_{\phi} \theta^*(\phi)$ can be obtained as follows (we omit the arguments of $\mathcal{L}$ for brevity):
$$\nabla_{\phi} \theta^* = -\big(\nabla^2_{\theta} \mathcal{L}\big)^{-1} \nabla_{\phi} \nabla_{\theta} \mathcal{L}. \quad (13)$$
However, $(\nabla^2_{\theta} \mathcal{L})^{-1}$, the inverse Hessian of the CDR model (a neural network), is usually intractable. We adopt the $K$-truncated Neumann series to approximate this inverse, i.e., $(\nabla^2_{\theta} \mathcal{L})^{-1} \approx \sum_{k=0}^{K} (I - \nabla^2_{\theta} \mathcal{L})^k$. With these derivations, we obtain the upper implicit gradient $\nabla_{\phi} \mathcal{L}_{dev}$ as follows:
$$\nabla_{\phi} \mathcal{L}_{dev} = -\nabla_{\theta} \mathcal{L}_{dev} \Big[\sum_{k=0}^{K} \big(I - \nabla^2_{\theta} \mathcal{L}\big)^k\Big] \nabla_{\phi} \nabla_{\theta} \mathcal{L}, \quad (14)$$
which can be efficiently calculated by vector-Jacobian products [15]. Now we can jointly optimize the CDR model parameters $\theta$ and the behavioral perceptron parameters $\phi$. Specifically, in the lower optimization, with $\phi$ fixed, we update the CDR parameters $\theta$ with a popular optimizer like SGD or Adam until convergence and obtain the optimal $\theta^*(\phi)$. When the optimal $\theta^*(\phi)$ is reached, we switch to the upper optimization and use the gradient in Eq. (14) to update the perceptron parameters $\phi$. The lower and upper optimization are conducted alternately until convergence.
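To make the vector-Jacobian computation concrete, here is a condensed PyTorch sketch of the $K$-truncated Neumann hypergradient. The scaling factor `alpha` is our assumption for numerical stability (the exact scaling in [15] may differ), and `theta`/`phi` are lists of CDR-model and perceptron parameters:

```python
import torch

def neumann_hypergradient(dev_loss, train_loss, theta, phi, K=3, alpha=0.01):
    """Approximate the upper implicit gradient of Eq. (14) with a
    K-truncated Neumann series, following [15]."""
    # v_0 = dL_dev/dtheta
    v = list(torch.autograd.grad(dev_loss, theta, retain_graph=True))
    # dL/dtheta with a graph, so second derivatives can be taken
    d_train = torch.autograd.grad(train_loss, theta, create_graph=True)
    p = [vi.clone() for vi in v]  # accumulates sum_k (I - alpha*H)^k v_0
    for _ in range(K):
        # Hessian-vector product H v, computed as a vector-Jacobian product
        hvp = torch.autograd.grad(d_train, theta, grad_outputs=v,
                                  retain_graph=True)
        v = [vi - alpha * hi for vi, hi in zip(v, hvp)]
        p = [pi + vi for pi, vi in zip(p, v)]
    # mixed term: p^T * d(dL/dtheta)/dphi, again a vector-Jacobian product
    g = torch.autograd.grad(d_train, phi, grad_outputs=p)
    return [-gi for gi in g]  # Eq. (14) carries a leading minus sign
```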

The Practical Optimization Algorithm
Note that the optimization strategy discussed so far is still conducted on the whole dataset, which is impractical in real recommendation scenarios. Additionally, conducting the lower optimization until convergence in each loop is inefficient, and how to obtain the additional developing target dataset $\mathcal{T}'$ still remains unresolved.
To tackle these problems, we present a practically efficient and effective version of our optimization algorithm in Algorithm 1. Three key points make the algorithm more practical and efficient than the theoretical procedure above.
Batch Optimization. In both the lower and upper optimization, we fetch batches from the whole dataset to calculate the loss and the parameter gradients. Since we generally cannot compute the gradient over the whole dataset due to memory limits, this kind of batch optimization is widely adopted in deep learning and is also effective for bi-level optimization.
Interval rounds of lower optimization instead of convergence. Theoretically, with $\phi$ fixed, we need to train $\theta$ to its optimal point $\theta^*(\phi)$ before we can conduct the upper optimization to update $\phi$. However, this is quite time-consuming because each update of $\phi$ would require a new complete lower training process. To make the algorithm more efficient, we only conduct a fixed number $T$ of rounds of lower optimization as an approximation, which has been found effective in previous works [2,15,19].
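A minimal sketch of this alternating schedule, where `lower_step` and `upper_step` are hypothetical callables that each perform one batch update (their contents correspond to the loss and hypergradient sketches above):

```python
def train_biao(lower_step, upper_step, T, n_loops):
    """Alternating schedule of the practical algorithm: T rounds of lower
    optimization (updating theta), then one upper round (updating phi)."""
    for _ in range(n_loops):
        for _ in range(T):
            lower_step()   # weighted source loss + target loss -> update theta
        upper_step()       # implicit gradient on a developing batch -> update phi
```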
The reorder-and-reuse strategy. As mentioned before, we need an additional developing target dataset to calculate $\mathcal{L}_{dev}(\theta^*(\phi))$. Previous works [2,15,19] usually split a small dataset $\mathcal{T}'$ from the target dataset $\mathcal{T}$, use the remaining set $\mathcal{T} - \mathcal{T}'$ for the lower optimization, and use $\mathcal{T}'$ for the upper optimization. However, this kind of split easily causes bias in the upper optimization and information loss in the lower optimization. Specifically, in the upper optimization, we expect $\theta^*(\phi)$ to perform best on the developing dataset; if this developing dataset is only a small split of the target dataset, it may only contain information about part of the users or items and is thus easily biased. In the lower optimization in Figure 1, note that the target-only parameters $\theta_t$ are only optimized with the target loss; if we only use $\mathcal{T} - \mathcal{T}'$ in the lower optimization, $\theta_t$ will easily become sub-optimal because of the information lost in $\mathcal{T}'$. To tackle this problem, we propose the reorder-and-reuse strategy, which is enabled by batch optimization. We reorder the target dataset $\mathcal{T}$ to obtain the dataset $\mathcal{T}'$ and use this reordered dataset to calculate $\mathcal{L}_{dev}(\theta^*(\phi))$. As a whole, the reordered dataset is the same as $\mathcal{T}$ and could not serve as a validation set if the whole dataset were used for optimization; but since we adopt batch optimization, the batch $B$ from $\mathcal{T}$ used in the lower optimization differs from the batch $B'$ from the reordered $\mathcal{T}'$ used in the upper optimization, so $B'$ can be regarded as a validation batch used to tune $\phi$. With this strategy, we can reuse $\mathcal{T}$ to effectively conduct the bi-level optimization without requiring additional data in the target domain. The superiority of the reorder-and-reuse strategy compared to previous methods is validated in the experiments. A schematic diagram of the reorder-and-reuse strategy and previous methods is in Appendix C.
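One simple way to realize reorder-and-reuse is with two independently shuffled loaders over the same target dataset (a sketch; `target_dataset` and the batch size are our assumptions):

```python
from torch.utils.data import DataLoader

# Both loaders draw from the SAME target dataset; independent shuffles make
# the lower batch B and the upper batch B' differ at any given step, so B'
# serves as a per-step validation batch without splitting off any data.
lower_loader = DataLoader(target_dataset, batch_size=256, shuffle=True)
upper_loader = DataLoader(target_dataset, batch_size=256, shuffle=True)

for tgt_batch, dev_batch in zip(lower_loader, upper_loader):
    # tgt_batch -> target part of the lower loss L(theta; phi)
    # dev_batch -> L_dev(theta*(phi)) for the upper step
    ...
```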

EXPERIMENTAL RESULTS
In this section, we empirically assess the efficacy of our proposed method on various datasets with different base models. Additionally, we provide ablations to show how our proposed method works.

Dataset
We conduct our experiments on the Amazon datasets [7], which contain users with their behaviors in different domains like books, movies and clothing. Specifically, to validate the generalization ability of our proposed method, we choose 7 domains in total from the Amazon datasets: Books, Movies, CDs, Cloth, Electronics (Elec), Home&Kitchen (Kitchen) and Toys. Based on these selected domains, we create 7 source-target cross-domain scenarios, i.e., Books-Movies, Books-CDs, Books-Elec, Books-Toys, CDs-Cloth, CDs-Kitchen, and Elec-Cloth. These scenarios include both intuitively highly-correlated domains like Books-Movies and intuitively less correlated domains like Elec-Cloth. The behaviors of the source and target domains are filtered by the common users between domains. Detailed statistics of the filtered datasets are presented in the appendix. The target domain behavior numbers also differ greatly across scenarios, e.g., the target behavior number of Books-Movies is 792,319 while that of CDs-Cloth is only 36,319. The dataset split for the target domain is the same as that of [8,20], where the test set is composed of the last behavior of each user, the validation set is composed of the second-to-last behavior, and the rest of the behaviors belong to the training set. The feature adopted for each user is its ID, and the features for each item contain both its ID and category. Note that, unlike [20], we keep the users with fewer than 5 ratings and the items without metadata, and thus our setting is closer to real scenarios.
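A minimal sketch of this leave-one-out split, assuming behaviors are already sorted chronologically per user (`user_histories` is a hypothetical mapping from user id to ordered items):

```python
def leave_one_out_split(user_histories: dict):
    """Per-user split following [8, 20]: last behavior -> test,
    second-to-last -> validation, the rest -> training."""
    train, val, test = {}, {}, {}
    for user, items in user_histories.items():
        if len(items) < 3:          # too short to split; keep all for training
            train[user] = items
            continue
        train[user], val[user], test[user] = items[:-2], [items[-2]], [items[-1]]
    return train, val, test
```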

Competitors and Evaluation Metrics
We choose CoNet [8] and MiNet [20], two typical models for cross-domain recommendation, as our base models, to which we apply our proposed method. Additionally, we investigate variants of the two base models to better present how the source domain information influences model performance, such as single-domain variants (suffix "-S") trained without source information and variants (suffix "-0") trained without the source loss. We adopt AUC and RelaImpr, the same metrics as [20], to evaluate the models; higher values indicate better performance.
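For completeness, RelaImpr measures the relative improvement over a base model with respect to a random guesser (AUC = 0.5); the definition below follows the convention used in [20]:

$$\mathrm{RelaImpr} = \left(\frac{\mathrm{AUC}(\text{measured model}) - 0.5}{\mathrm{AUC}(\text{base model}) - 0.5} - 1\right) \times 100\%.$$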

Implementation
We implement all the methods with PyTorch. We optimize all the base models with the Adam [11] optimizer, whose learning rate is searched from {1e-3, 5e-3, 1e-2} to fit different datasets, and the batch size is the same as in the original papers. As for the hyper-parameters in the upper optimization, the truncation number $K$ is fixed to 3 as in previous works [2,15], the interval $T$ for conducting the upper optimization is searched from {20, 100, 500}, the length of the historical sequence used in the behavioral perceptron is 20 for MiNet and 50 for CoNet, the head number of the multi-head attention is 4, and the upper optimizer is SGD with learning rate 1e-2 for all the scenarios. Note that the embedding table used in the perceptron is the same as that of the CDR model, but we stop its gradient in the perceptron so that the embeddings can be regarded as inputs rather than learnable parameters of the perceptron.
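The gradient-stopping detail can be sketched as follows (`cdr_model` and `perceptron` are the hypothetical module names used in the earlier sketches):

```python
# Share the CDR model's embedding table with the perceptron while
# detaching it, so no gradient flows into the embeddings from the
# perceptron's weights:
cand_emb = cdr_model.item_embedding(src_item_ids).detach()
src_hist_emb = cdr_model.item_embedding(src_hist_ids).detach()
tgt_hist_emb = cdr_model.item_embedding(tgt_hist_ids).detach()
w = perceptron(cand_emb, src_hist_emb, tgt_hist_emb)  # embeddings act as inputs
```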
Time Complexity Analysis. During optimization, regard the optimization time of MiNet/CoNet as unit "1", and note that the additional computation brought by our method mainly comes from the backward passes for the upper gradient. In one lower-upper loop, the model conducts $T$ rounds of lower optimization and 1 round of upper optimization. The original MiNet/CoNet only needs $T$ lower backward passes. Our method adds the upper optimization, which requires $(K+2)$ backward passes for the Jacobian calculation, where $K$ is the truncation number, so it needs $T + K + 2$ backward passes in total, resulting in $O((T+K+2)/T)$ complexity relative to the unit. Since $K$ is fixed to 3 in our experiments, our method needs about $O(1 + 5/T)$ complexity. During inference, we do not change the model structure but only the parameters, so the inference complexity is the same as that of the original model.
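As a quick sanity check of the overhead at the searched values of $T$:

```python
# Relative training cost (T + K + 2) / T = 1 + (K + 2) / T versus the base:
K = 3
for T in (20, 100, 500):
    print(f"T={T}: {1 + (K + 2) / T:.2f}x")  # -> 1.25x, 1.05x, 1.01x
```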

Recommendation Performance
The overall recommendation performance of different models is presented in Table 1. We have the following observations:
• Our proposed BIAO method brings consistent improvements. Whether utilizing MiNet or CoNet as the base model, our proposed method brings further improvement on all the datasets. Especially for MiNet on Books-Toys and Elec-Cloth, and for CoNet on CDs-Cloth and Elec-Cloth, our BIAO method brings more than 10% RelaImpr improvement without changing model structures.
• Our method has greater potential in tackling less relevant source-target transfer and cold-start scenarios. It is worth noting that our proposed method achieves about 0.3% absolute AUC improvement in the Books-Movies scenario, but achieves more significant improvement in other scenarios like Books-Toys and CDs-Cloth. In the Books-Movies setting, the source and target domain information is quite similar, so the optimal weights for different behaviors do not differ much (which is also validated in Section 4.5), making the improvement brought by our method comparatively small. However, in the Books-Toys setting, where only specific categories of books have an intuitive influence on toy recommendation, our method brings significant improvement. Additionally, the RelaImpr of our method on CDs-Cloth and Elec-Cloth is quite significant compared to the other settings; the target behaviors in these two settings are quite inadequate, demonstrating the potential of our method to tackle the cold-start problem.
• Our method can exploit the beneficial information and discard the harmful information in the source domain. Note that under the Books-Elec, CDs-Kitchen and Elec-Cloth settings, both MiNet-0 and MiNet perform worse than the single-domain MiNet-S. This means that, on average, both the features and the loss from the source domain are harmful to the target domain. However, our method still brings improvement over the MiNet-S baseline, indicating its strong ability to discover beneficial information from the on-average harmful source domain behaviors.

The Learned Behavioral Weight
We record the learned weight of each source behavior loss on the Books-Movies and Books-Toys datasets with MiNet, and analyze their statistical characteristics. Figure 3 shows the average weight of items within each category. We have the following observations:
• The behavioral perceptron selects highly-related behaviors from the source domain. For example, on the Books-Toys dataset, the books with topics about "Children, Education, Humor, Computer" are highly weighted, indicating that buying these books has a greater influence on the user's behavior in the target toy domain. These highly weighted categories indeed have higher correlation with toys by human intuition. Additionally, the books about "Rental" and "Law" are regarded as less important to the user's interest in toys.
Besides the item categories, we also examine whether the lengths of a user's historical behaviors in the source/target domain influence the learned weights. The x-axis of Figure 4 is the difference between the user's source behavior length and target behavior length (e.g., if a user has 5 interactions in the source domain and 10 interactions in the target domain, the difference is 5-10=-5). The y-axis is the learned average weight for each difference value. We can see that a larger difference tends to lead to larger weights. This phenomenon is intuitive: if a user has more interactions in the source domain and fewer in the target domain, the source behaviors of this user have a higher probability of being valuable, which is consistent with [18]. A case analysis of the learned user-specific importance is given in Appendix B.

Behavioral Perceptron Effectiveness
We conduct an ablation study to validate the effectiveness of the designed global item importance and user-specific item importance. Table 2 reports the performance of the model when either of the two components is removed, where w/o global refers to the variant that removes the global item importance and w/o user refers to the variant that removes the user-specific item importance. The results show that in almost all the settings, both components are effective for cross-domain recommendation.

Reorder-and-Reuse Strategy Effectiveness
Previous works [2,19] that utilize bi-level optimization split a small dataset from the training set and use it for the upper optimization. In contrast, our proposed method reuses the whole training set and relies on batch optimization to keep the data in the lower and upper optimization different during training. We compare our proposed method with the previous methods under different split ratios: specifically, we split 0.01, 0.05 and 0.1 of the whole target set for the upper optimization, and the remaining 0.99, 0.95 and 0.9 for the lower optimization, respectively. Results in Table 3 show that splitting a small dataset from the whole training set is not as effective as our proposed reorder-and-reuse method: splitting too few samples makes the upper optimization biased, while splitting too many samples harms the lower optimization. In most cases, splitting 0.1 of the data from the training set results in the worst performance, indicating that the benefits of weighting the source behaviors cannot compensate for the degradation caused by worse lower optimization of the target-only parameters $\theta_t$. The proposed simple but effective reorder-and-reuse strategy can also be applied to other bi-level optimization problems.

Hyper-parameter Sensitivity
Almost all the hyper-parameters involved in the optimization framework are fixed, except for the interval $T$ between two upper optimizations, which is searched from {20, 100, 500}. Figure 5 shows the impact of $T$ on Books-Toys and Books-Elec with both MiNet and CoNet, with $T$ set to {20, 100, 200, 300, 400, 500}. A larger $T$ makes $\phi$ update more slowly in the upper optimization, but the approximation error of the implicit gradient will be smaller. Although the best $T$ varies with the base model and the dataset, our optimization brings consistent improvement over the original MiNet or CoNet for all values of $T$ (the dotted lines named "dataset w/o ours" in the figure). Therefore, our method brings little HPO (hyper-parameter optimization) burden for the performance improvement and can be easily adopted.

CONCLUSION
In this paper, we propose a behavioral importance-aware optimization framework for cross-domain recommendation, which automatically selects the most beneficial behaviors from the source domain to improve the target recommendation performance. The proposed framework involves the behavioral perceptron and the bi-level optimization based strategy, whose effectiveness has been validated through extensive experiments. Our proposed method can be combined with various cross-domain recommendation methods that jointly optimize the source and target loss, serving as a powerful tool for cross-domain recommendation. Exploring more effective behavioral perceptron designs is an interesting direction for future work.

Figure 6: Case study of user-specific importance weights in Books-Toys. Each sub-figure gives 4 aspects of a source behavior, i.e., the final learned user-specific weight, the item in the source behavior, the user's historical sequence in the source domain, and the user's historical sequence in the target domain. Each item is represented in the form of category (title) if it exists. Case 3 and case 4 assign high weights to the source behavior: in case 4 the source candidate item is related to the target domain and the user's recent behaviors, while in case 3 the 'Health&Fitness' source item is less related to the target domain in a global sense but fits the user's specific interests in both domains.

A DATASET STATISTICS
We provide the dataset statistics of our experiments in Table 4. We also conduct several experiments on a more recent backbone, DASL [12], and the results are shown in Table 5. These results further show that our proposed optimization method can be applied to various current CDR models.

B USER-SPECIFIC IMPORTANCE CASE STUDY
In Section 4.5, we analyzed the statistical characteristics of the learned weights, where the learned average weights per category show the role of the global item importance. In this section, we further provide some case studies in the Books-Toys scenario to analyze the user-specific importance weights in Figure 6. In case 1 and case 2, the two behaviors are assigned low user-specific weights because the source item has very low correlation with the source and target historical behaviors. For example, in case 2, the candidate source item is about photography, but the source sequence is almost all about novels and the target sequence is about kids' toys, making this source behavior less worth learning. In case 3 and case 4, the source candidate item has high correlation with both the source and target historical sequences (highlighted in red), and is thus assigned a higher weight to learn. Particularly, in case 3, the "Health&Fitness" book in the source behavior would intuitively have little importance to the target toy domain. However, due to the user's specific interest in "health" and "sports" shown in his historical behaviors, it is assigned a high weight. This phenomenon also indicates the significance of considering the local user-specific importance.

C SCHEMATIC OF THE DEVELOPING DATASET CONSTRUCTION

Figure 7 presents how previous bi-level optimization methods obtain the developing target dataset and how our proposed reorder-and-reuse strategy obtains it. Previous methods split the original target dataset into two disjoint datasets for the lower and upper optimization, respectively. This kind of split easily causes bias and information loss, as analyzed in the paper, while our reorder-and-reuse strategy reuses the whole target dataset in both the lower and upper optimization.