Intent-aware Ranking Ensemble for Personalized Recommendation

Ranking ensemble is a critical component in real recommender systems. When a user visits a platform, the system prepares several item lists, each generally produced by a recommendation model trained for a single behavior objective. As multiple behavior intents, e.g., both clicking and buying within some specific item category, commonly co-occur in a user visit, it is necessary to integrate multiple single-objective ranking lists into one. However, previous work on rank aggregation mainly focused on fusing homogeneous item lists with the same objective while ignoring the ensemble of heterogeneous lists ranked with different objectives under various user intents. In this paper, we treat a user's possible behaviors and the potentially interacted item categories as the user's intent, and we study how to fuse candidate item lists generated for different objectives in an intent-aware manner. To address this task, we propose an Intent-aware ranking Ensemble Learning (IntEL) model that fuses multiple single-objective item lists according to user intents, in which item-level personalized weights are learned. Furthermore, we theoretically prove the effectiveness of IntEL with point-wise, pair-wise, and list-wise loss functions via error-ambiguity decomposition. Experiments on two large-scale real-world datasets also show significant improvements of IntEL over previous ranking ensemble models on multiple behavior objectives simultaneously.


INTRODUCTION
Users typically have various intents when using recommender systems. For instance, when shopping online, users may intend to buy snacks or browse clothes. Generally, we call the users' behaviors their behavior intents and the categories of items they interact with their item category intents. Multiple behavior intents may co-occur in a visit, and users need distinct items under different item category intents. Therefore, user intents are essential for recommendation list generation. In this paper, we follow the definition of user intents by Chen et al. [10] as a combination of user behavior and item category, such as booking an item in the hotel category or clicking an item in the phone category.
From the system's viewpoint, since users usually have diverse intents, multiple item lists are generated when a user visits the platform. These lists generally come from recommendation models optimized with different behavior objectives, such as clicking, consuming, or browsing duration. Existing research has made promising achievements on single objectives, such as predicting Click-Through Rate (CTR) [4,20,48] and Conversion Rate (CVR) [14,32]. However, as multiple intents of a user may appear in one visit, it is crucial to aggregate multiple heterogeneous single-objective ranking lists aware of the user's current intents.
An example of intent-aware ranking ensemble on an online shopping platform is shown in Figure 1. A user needs to buy a phone charger, and she also wants to browse new products related to phones and headsets. The system has two single-objective ranking lists ready when she visits the platform, produced by two recommendation models optimized with users' consumption and clicking histories, respectively. To satisfy the user's diverse intents at once, an intent-aware ranking ensemble model aggregates the two ranking lists for the final display, where items are reordered according to both the basic ranking lists and the user's intents. Thus, charger 1, charger 2, phone 1, and headset 1 are placed at the front of the final list, satisfying the user's preferences better than either single-objective ranking list. Therefore, intent-aware ranking ensemble is important for promoting recommendation performance.
However, there have been few attempts to combine heterogeneous single-objective ranking lists (hereinafter referred to as basic lists) considering user intents. In industry, a common strategy is simply summing basic lists with pre-defined list-level weights, which ignores users' personalized preferences. In academia, existing studies are not adequate for ranking ensemble in personalized recommendation. Widely explored unsupervised rank aggregation methods [3,21,23] are mostly studied in information retrieval tasks rather than recommendation scenarios. Recently, supervised methods [1,2,30] have been proposed to combine different item lists in recommendation. Nevertheless, these studies focused on combining homogeneous item lists optimized for the same behavior, not heterogeneous ranking lists for different objectives. Users' intents are also overlooked in the ranking ensemble stage.
To aggregate basic lists aware of user intents, we aim to learn different weights for different basic lists and item categories when summing basic-list scores. This is challenging since numerous weights must be assigned, one for every item in every basic list, which may be hard to learn. Therefore, we first prove its effectiveness theoretically. Unlike previous studies, we assign ensemble weights at the item level rather than the list level. We prove the effectiveness of this form of ranking ensemble and verify that the loss of the ensemble list can be smaller than the loss of any basic model with point-wise, pair-wise, and list-wise loss functions. An ambiguity term is derived from the proof and used in the optimization loss.
With theoretical guarantees, another practical challenge is to infer users' intents and integrate them into the ranking ensemble of heterogeneous basic lists. To address this challenge, we propose an Intent-aware ranking Ensemble Learning (IntEL) method for personalized, adaptive ensemble of multiple single-objective ranking lists. A sequential model is adopted to predict users' intents, and a ranking ensemble module is designed to integrate basic-list scores, item categories, and user intents. Thus, the learnable ranking ensemble model can adaptively adjust the integration of multiple heterogeneous lists with user intents.
We conducted experiments on a publicly available online shopping recommendation dataset and a local life service dataset. Our method, IntEL, is compared with various ensemble learning baselines and shows significant improvements. The main contributions of this work are as follows:
• To our knowledge, this is the first work that generalizes ranking ensemble learning with item-level weights over multiple heterogeneous item lists. We theoretically prove the effectiveness of ranking ensemble in this new setting.
• A novel intent-aware ranking ensemble learning model, IntEL, is proposed to adaptively fuse multiple single-objective recommendation lists aware of user intents. In the model, ambiguity loss, ranking loss, and intent loss are proposed and integrated.
• Experiments on two large-scale real-world recommendation datasets indicate that IntEL is significantly superior to previous ranking ensemble models on multiple objectives.

RELATED WORK

Ranking Ensemble
Ranking ensemble, i.e., fusing multiple ranking lists into a consensus list, has long been discussed in the IR scenario [17,19] and has been proven NP-hard even for small collections of basic lists [16].
In general, rank aggregation includes unsupervised and supervised methods. Unsupervised methods rely only on the rankings. For instance, Borda Count [5] computes the sum of all ranks, and MRA [18] adopts the median of rankings. Comparisons among basic ranks are also used, such as pair-wise similarity in Outrank [19] and distance from null ranks in RRA [21]. Recently, researchers have paid attention to supervised rank aggregation methods. For example, Evolutionary Rank Aggregation (ERA) [30] is optimized with genetic programming, and the Differential Evolution algorithm [1,2] and reinforcement learning [47] have also been adopted for rank aggregation. However, these methods utilize only the ranks or scores of items in basic lists, without considering item contents and users in recommendation.
Another view on the fusion problem comes from ensemble learning, a traditional topic in machine learning [35] that has been successfully applied to various tasks [29,39,42]. A basic theory in ensemble learning is error-ambiguity (EA) decomposition [22], which shows that better performance can be achieved by aggregating accurate and diverse basic models. It has been proved for classification and regression with diverse loss functions [6,45]. Liu et al. [24] generalized EA decomposition to model-level weights in ranking ensemble with a list-wise loss, where different items in a list share the same weight.
The differences between previous studies and our method are mainly twofold. First, rather than calculating a single weight for each basic model, we assign item-level weights considering item categories and user behavior intents, and we theoretically prove the effectiveness of this extension. Second, we combine heterogeneous lists generated for different behavior objectives and simultaneously improve performance on multiple objectives.

Multi-Intent Recommendation
Since we aggregate ranking lists aware of users' multiple intents, we briefly introduce recent methods on multi-intent and multi-interest recommendation. Existing studies focus on capturing dynamic intents in sequential recommendation [10,25,26,40,43]. For instance, AIR [10] predicts intents and their migration in users' historical interactions. Wang et al. [40] model users' dynamic implicit intentions at the item level to capture item relationships. MIND [34] and ComiRec [9] adopt dynamic routing over historical interactions to capture users' multiple intents and diversity. TimiRec [41] distills target user interest from a predicted distribution over the user's multiple interests. With the development of contrastive learning, implicit intent representations have also been applied as constraints on contrastive losses [11,15].
Previous studies usually mix "intent" and "interest" and pay attention to intents on item contents in single-behavior scenarios. In contrast, we follow [10] to consider both behavior intents and item category intents. Moreover, instead of learning user preference for each intent, we utilize intents as guidance for fusing user preferences across different behavior objectives.

Multi-Objective Recommendation
Another branch of related but different work is multi-objective recommendation, which mainly contains two groups of studies. One group provides multiple recommendation lists for different objectives with information shared among objectives, such as MMOE [28] and PLE [38], where different lists are evaluated on their corresponding objectives separately. The other group tries to promote model performance on a target behavior objective with the help of other objectives, such as MB-STR [46], which predicts users' click preferences. In contrast, instead of generating multiple lists or specifying a target behavior, we fuse a single list on which multiple objectives are evaluated simultaneously.
Some studies that jointly optimize ranking accuracy and other goals, such as fairness [44] and diversity [8], are also called multi-objective recommendation. They seek to promote other metrics while maintaining utility on some behavior, whereas we aim to concurrently promote performance on multiple behavior objectives by aggregating various recommendation lists.

PRELIMINARIES

Ranking Ensemble Learning Definition
Let F = {f_1, f_2, ..., f_M} be M basic models trained for M different objectives (e.g., click, buy, and favorite), and let I(u, c) = {v_1, v_2, ..., v_n} be the union set of the M recommended basic item lists for user u in session environment context c (e.g., time and location). The ensemble score of each item is a weighted sum of the basic-model scores:

s_i^ens = Σ_{m=1}^{M} w_im · s_im, (1)

where s_im is the score of item v_i given by basic model f_m, and w_im(u, c) ∈ R denotes the weight of the m-th basic model for item v_i. The weights are learnable with the help of side information, e.g., user intents and item categories. The items in I are sorted according to s_i^ens and compared with a ground-truth ranking y_{u,c} = {y_1, y_2, ..., y_n}, which is sorted according to users' interactions with a pre-defined priority of user feedback, e.g., Buy > Click > Examine. The priority can be defined by business realities and does not influence the model learning strategy. The definition of ranking ensemble learning is similar to previous work [1,24,31], except that we conduct the ensemble on heterogeneous basic models with different objectives, which makes the problem more difficult. The main notations are shown in Table 1.
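The weighted sum in Eq. 1 can be sketched as follows; the toy scores and weights are illustrative, not learned:

```python
import numpy as np

def ensemble_scores(basic_scores, item_weights):
    """Fuse M basic-model score lists into one score per item (Eq. 1).

    basic_scores: (M, n) array, basic_scores[m, i] = s_im.
    item_weights: (M, n) array of item-level weights w_im.
    Returns length-n ensemble scores s_i^ens = sum_m w_im * s_im.
    """
    return (item_weights * basic_scores).sum(axis=0)

# Toy example: M = 2 basic models (click, buy) over n = 3 items.
scores = np.array([[0.9, 0.2, 0.5],
                   [0.1, 0.8, 0.4]])
# Per-item weights, normalized over models (sum_m w_im = 1 for each item).
weights = np.array([[0.7, 0.3, 0.5],
                    [0.3, 0.7, 0.5]])
fused = ensemble_scores(scores, weights)  # array([0.66, 0.62, 0.45])
```

Item-level weights let the ensemble favor the buy model for one item and the click model for another, which list-level weights cannot do.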

User Intent Definition
When aggregating basic models optimized with different objectives, users' intents about both behaviors and item categories are essential. Therefore, we define a user's intent in a visit as a joint probability distribution over item categories and behaviors:

I(u, c) = P(C, B | u, c), (2)

where C and B indicate the item category intents and behavior intents, respectively. The types of categories and behaviors vary with recommendation scenarios. For instance, in online shopping, C can be product classes, and B may include clicking and buying.
In music recommender systems, C may be music genres, while B can contain listening and purchasing albums. In experiments, user intents I are predicted from users' historical interactions and environment context.
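The empirical intent distribution over (behavior, category) pairs, as later used for the ground truth in Section 6.1.2, can be sketched as follows; the category and behavior names are made up for illustration:

```python
from collections import Counter

def intent_distribution(interactions, categories, behaviors):
    """Empirical intent distribution over (behavior, category) pairs
    computed from a session's positive interactions.

    interactions: list of (behavior, category) tuples.
    Returns a dict mapping each (behavior, category) pair to its
    probability; probabilities sum to 1 over all pairs.
    """
    counts = Counter(interactions)
    total = sum(counts.values())
    return {(b, c): counts.get((b, c), 0) / total
            for b in behaviors for c in categories}

# Hypothetical session with four positive interactions.
session = [("click", "phone"), ("click", "charger"),
           ("buy", "charger"), ("click", "phone")]
p = intent_distribution(session, ["phone", "charger"], ["click", "buy"])
```

The resulting dictionary is a flattened view of the |B| × |C| matrix I(u, c).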

Ranking Losses
Three representative losses are generally leveraged in the recommendation scenario, namely point-wise, pair-wise, and list-wise loss.
We will theoretically and empirically illustrate the effectiveness of ranking ensemble with these three losses in the following sections. For a given user u under session context c with multi-level ground-truth ranking y_{u,c} = {y_1, y_2, ..., y_n} on an item set I = {v_1, v_2, ..., v_n}, the loss of a score list S(u, c) = {s_1, s_2, ..., s_n} is defined as follows (u and c are omitted):

• Point-wise Loss. As y is multi-level feedback based on a group of user feedback types, the Mean Squared Error (MSE) is utilized as a representative point-wise loss:

L_pt(y, S) = (1/n) Σ_{i=1}^{n} (s_i - y_i)^2. (3)

• Pair-wise Loss. We leverage the Bayesian Personalized Ranking (BPR) loss [33]. Following the negative sampling strategy for multi-level recommendation [27], a random item from one level lower is paired with each positive item at each level:

L_pr(y, S) = -(1/n^+) Σ_{l=2}^{T} Σ_{i ∈ P_l, j ∈ N_l} log σ(s_i - s_j), (4)

where T is the number of interaction levels (e.g., buy, click, and exposure), n^+ is the number of positive items over all levels, P_l and N_l are the positive and one-level-lower negative item sets for level l, and σ is the sigmoid function.
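A minimal sketch of the one-level-lower negative sampling for the BPR loss (item names and level coding are illustrative, not from the paper):

```python
import numpy as np

def multilevel_bpr_loss(scores, levels, rng):
    """BPR loss with one-level-lower negative sampling.

    scores: dict item -> predicted score.
    levels: dict item -> feedback level (e.g., 2 = buy, 1 = click,
            0 = exposure).
    For each positive item at level l > 0, one item from level l - 1
    is sampled and -log sigmoid(s_pos - s_neg) is accumulated.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    loss, n_pairs = 0.0, 0
    for pos, l in levels.items():
        if l == 0:
            continue
        lower = [i for i, li in levels.items() if li == l - 1]
        if not lower:
            continue
        neg = lower[rng.integers(len(lower))]
        loss += -np.log(sigmoid(scores[pos] - scores[neg]))
        n_pairs += 1
    return loss / max(n_pairs, 1)
```

Correctly ordered scores yield a lower loss than reversed ones, which is what the optimization exploits.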
• List-wise Loss. Following [24], we adopt the Plackett-Luce (P-L) model as the likelihood function of ranking predictions:

P(y | S) = Π_{j=1}^{n} exp(s_{π_j}) / Σ_{k=j}^{n} exp(s_{π_k}), (5)

where π_j indicates the j-th item sorted by the ground truth y.
The corresponding list-wise loss function is

L_ls(y, S) = -log P(y | S) = Σ_{j=1}^{n} [log Σ_{k=j}^{n} exp(s_{π_k}) - s_{π_j}]. (6)
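The P-L loss can be sketched with a numerically stable log-sum-exp; this is an illustrative implementation, not the paper's code:

```python
import numpy as np

def plackett_luce_loss(scores, gt_order):
    """Negative log Plackett-Luce likelihood of the ground-truth order.

    scores: predicted scores; gt_order: item indices sorted by ground
    truth (best first). At each position j the loss adds
    logsumexp(s_j, ..., s_n) - s_j, i.e., -log softmax over the suffix.
    """
    s = np.asarray(scores, dtype=float)[list(gt_order)]
    loss = 0.0
    for j in range(len(s)):
        tail = s[j:]
        m = tail.max()                       # shift for numerical stability
        loss += m + np.log(np.exp(tail - m).sum()) - tail[0]
    return loss
```

Ranking the highest-scored items first minimizes the loss; the last position always contributes zero.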

THEORETICAL EFFECTIVENESS OF RANKING ENSEMBLE LEARNING
To prove the effectiveness of our proposed item-level ranking ensemble learning in Eq. 1, we aim to show that the loss of the ensemble scores S^ens = {s_i^ens} can be smaller than the loss of any basic-model scores S^m = {s_im} for the point-wise, pair-wise, and list-wise losses, i.e.,

L(y, S^ens) ≤ Σ_{m=1}^{M} w_m L(y, S^m), ∀ w_m, L ∈ {L_pt, L_pr, L_ls}.

In this way, we can claim that there exist combinations of weights w_im that achieve results better than all basic models.
Inspired by previous studies in ensemble learning, error-ambiguity (EA) decomposition [22] provides an upper bound for the ensemble loss L(y, S^ens), which helps conduct the above proof. For basic lists with losses {L(y, S^m)}, EA decomposition splits the ensemble loss L(y, S^ens) into a weighted sum of basic-model losses Σ_m w_m L(y, S^m) minus a non-negative ambiguity term A of the basic models, so that the upper bound of L(y, S^ens) is controlled by both the basic-model losses and the ambiguity. It was recently proved for ranking tasks with a single weight per basic list (i.e., w_im = w_jm, ∀ i ≠ j) [24]. However, different weights should be assigned to different items in our setting, so we need to verify whether EA decomposition still holds. To summarize, in the following we prove that the loss functions can be rewritten as L(y, S^ens) ≤ Σ_m Σ_i w_im L(y_i, s_im) - A for the point-wise, pair-wise, and list-wise losses.

Point-wise Loss

For the point-wise MSE loss, the ensemble loss can be decomposed as

L_pt(y, S^ens) = Σ_{m=1}^{M} Σ_{i=1}^{n} (w_im / n)(s_im - y_i)^2 - A_pt, (7)

with the ambiguity term

A_pt = Σ_{m=1}^{M} Σ_{i=1}^{n} (w_im / n)(s_im - s_i^ens)^2. (8)

Proof. Expand each basic-model loss around the ensemble scores with Taylor's theorem, where S̃ is an interpolation point between S^ens and S^m. Defining A_pt as in Eq. 8, we take the weighted sum of the losses of all basic models. The first equality follows from Σ_{m=1}^{M} w_im = 1 and ∂²L_pt / ∂(s_i)² = 2, and the second from Eq. 1. Proof done. □

Since the ambiguity A_pt in Eq. 8 is non-negative and w_im ≥ 0, Eq. 7 follows the form of EA decomposition. Ranking ensemble with item-level weights is therefore effective in theory for the point-wise loss: the ensemble loss is no larger than the weighted sum of basic-model losses for any w_im satisfying w_im ≥ 0 and Σ_{m=1}^{M} w_im = 1. For brevity, we omit restatements of the score lists and ensemble formulas in the following theorems.
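The point-wise error-ambiguity identity (Eqs. 7 and 8) can be checked numerically; below is a quick sketch with random scores and normalized per-item weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 3, 5                        # 3 basic models, 5 items
y = rng.normal(size=n)             # ground-truth targets
s = rng.normal(size=(M, n))        # basic-model scores s_im
w = rng.random(size=(M, n))
w /= w.sum(axis=0)                 # per-item weights, sum_m w_im = 1

s_ens = (w * s).sum(axis=0)                          # Eq. 1
lhs = ((s_ens - y) ** 2).sum() / n                   # ensemble MSE
weighted = (w * (s - y[None, :]) ** 2).sum() / n     # weighted basic losses
ambiguity = (w * (s - s_ens[None, :]) ** 2).sum() / n  # A_pt >= 0
# Error-ambiguity identity: ensemble error = weighted error - ambiguity.
assert np.isclose(lhs, weighted - ambiguity)
```

For the MSE loss the decomposition holds with equality for any convex combination, so the ensemble can only improve on the weighted basic-model losses.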

Pair-wise Loss
Summing both sides of Eq. 14 with the weights w_im completes the proof. □ The bounded range of the sigmoid makes the relevant factor at most 1. Moreover, in the pair-wise loss, the order rather than the values of the scores matters, so the residual second term in Eq. 12 can be made arbitrarily small. Meanwhile, the ambiguity A_pr is non-negative. Therefore, Eq. 12 follows the form of EA decomposition, and our ranking ensemble method with the pair-wise loss is effective theoretically.

List-wise Loss
where w_max denotes the maximum of all weights in the list, A_ls,j is the ambiguity at position j, and Δ*_{jk} = s*_j - s*_k denotes the difference between scores.

Due to space limitations, we only show the key steps of the proof. Proof. We define the score differences d_j := [d_{j,j+1}, ..., d_{j,n}] = [s_j - s_{j+1}, ..., s_j - s_n] and a logarithmic pseudo-sigmoid function. Applying the resulting bound to each basic model at every position of the list yields Eq. 22; summing from j = 1 to j = n completes the proof. □ Because in list-wise optimization the order rather than the values of the scores matters, the residual term can be made arbitrarily small. Meanwhile, the ambiguity term A_ls is non-negative. Therefore, Eq. 17 conforms to the EA decomposition theory, and our ranking ensemble method with the list-wise loss is effective.

Ensemble Loss for Model Training
The above theorems provide theoretical guarantees for our proposed ranking ensemble learning method under three representative loss functions. With EA decomposition, we prove that the loss of the ensemble list is bounded by any weighted combination of the losses of the basic lists:

L_er(y, S^ens) ≤ Σ_{m=1}^{M} w_m L_er(y, S^m) - A + Δ, ∀ w_im ≥ 0, Σ_{m=1}^{M} w_im = 1,

where A is a non-negative ambiguity term and Δ is arbitrarily small. Therefore, the ensemble loss L_er(y, S^ens) (i.e., the difference between the ensemble list and the ground truth) can be smaller than any basic-list loss with suitable weights {w_im}, and a larger ambiguity A leads to a smaller bound on the ensemble loss. Thus, the approach is effective for our ranking ensemble task.
In practice, since the basic lists are fixed (so the L_er(y, S^m) are constants), we aim to minimize the ensemble ranking loss L_er(y, S^ens) and maximize the ambiguity A. Therefore, the loss function for ranking ensemble learning, L_ens, is defined as

L_ens = L_er(y, S^ens) - λ · A, (24)

where L_er(y, S^ens) can be any of L_pt(y, S^ens), L_pr(y, S^ens), and L_ls(y, S^ens), and A indicates the corresponding ambiguity term. For the BPR and P-L losses, the ambiguity involves an interpolation point S̃ = S^ens + ε(S^m - S^ens). To simplify the calculation, we let ε → 0 without loss of generality.
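The training objective in Eq. 24 can be sketched with the point-wise loss as follows; λ and the toy data are placeholders:

```python
import numpy as np

def ensemble_training_loss(s_basic, w, y, lam=1e-5):
    """L_ens = L_pt(y, S_ens) - lam * A_pt (point-wise variant of Eq. 24).

    s_basic: (M, n) basic-model scores; w: (M, n) item-level weights
    with each column summing to 1; y: (n,) multi-level ground truth;
    lam: ambiguity loss weight.
    """
    s_ens = (w * s_basic).sum(axis=0)
    rank_loss = np.mean((s_ens - y) ** 2)
    ambiguity = np.mean((w * (s_basic - s_ens) ** 2).sum(axis=0))
    return rank_loss - lam * ambiguity
```

Since the ambiguity is non-negative, any lam > 0 can only lower the objective relative to the plain ranking loss, rewarding diverse basic-model usage.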

INTENT-AWARE RANKING ENSEMBLE METHOD 5.1 Overall Framework
Having proved the effectiveness of item-level weights {w_im} for ranking ensemble with three different loss functions, we need to design a neural network to learn the weights w_im. As shown in Section 3.2, users' intents about behaviors and item categories help aggregate the basic lists, but intents are not available in advance. Therefore, an intent predictor and an intent-aware ranking ensemble module are designed for our method.
The main framework of our Intent-aware ranking Ensemble Learning (IntEL) method is shown in Figure 2. For a user u at time t, user intents I are predicted by an intent predictor from her historical interactions and current environment context. Then, with n candidate items generated from M basic models, an intent-aware ranking ensemble module integrates the basic-list scores, item categories, and predicted user intents. The output of the ensemble module is an item-level weight w_im for each item v_i and basic model f_m. Eventually, the weighted sum of all basic-list scores constructs the ensemble scores {s_i^ens} for the final list. Since we focus on the ranking ensemble learning problem, a straightforward sequential model is used for intent prediction in Section 5.2, and we pay more attention to the design of the ensemble module in Section 5.3. IntEL is optimized with a combination of the ranking loss L_er(y, S^ens), the ambiguity loss A, and the intent prediction loss L_I. Details of the model learning strategy are discussed in Section 5.4.

User Intent Predictor
As defined in Section 3.2, user intent describes a multi-dimensional probability distribution I over different item categories and behaviors at each user visit. The goal of the intent predictor is to generate an intent probability distribution for each user visit. We predict intents with users' historical interactions and environment context, as both historical habits and current situations influence users' intents.
For a user u at time t, her historical interactions from t - T_w to t - 1 and the environment context (such as timestamp and location) at t are adopted to predict her intent at t, where T_w is a pre-defined time window. The environment context is encoded into an embedding e(u, c) with a linear context encoder. Two sequential encoders are utilized to model historical interactions at the user-visit (i.e., session) level and the item level. Session-level history helps learn users' habits about past intents, while item-level interactions express preferences about item categories in detail. At the session level, the intents and context of each historical session are embedded with two linear encoders, respectively; the two embeddings are then concatenated and encoded with a sequential encoder to form an embedding h_s(u, c). At the item level, the "intent" of each positive historical interaction can also be represented by its behavior type and item category. The item-level "intent"s are embedded with the same intent encoder as the session level and fed into a sequential encoder to form the item-level history h_i(u, c). The sequential encoder can be any sequential model, such as GRU [12] or Transformer [37]. Finally, the context embedding e(u, c), session-level history h_s(u, c), and item-level history h_i(u, c) are concatenated and fed to a linear layer to predict the intent Î(u, c) (u and c are omitted):

Î = W_p [e; h_s; h_i] + b_p,

where W_p and b_p are linear mapping parameters.
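A simplified sketch of this predictor follows; the GRU/Transformer sequential encoders are replaced by mean pooling for brevity, all matrices are random stand-ins for learned parameters, and a softmax is added so the output is a distribution (an assumption, since the paper only states a linear output layer):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class IntentPredictor:
    """Simplified intent predictor: context encoder plus pooled
    session-level and item-level histories, then a linear output."""

    def __init__(self, d_ctx, d_int, d_hidden, n_intents, seed=0):
        rng = np.random.default_rng(seed)
        self.W_ctx = rng.normal(scale=0.1, size=(d_hidden, d_ctx))
        self.W_int = rng.normal(scale=0.1, size=(d_hidden, d_int))
        self.W_p = rng.normal(scale=0.1, size=(n_intents, 3 * d_hidden))
        self.b_p = np.zeros(n_intents)

    def predict(self, ctx, session_intents, item_intents):
        e_ctx = self.W_ctx @ ctx                             # e(u, c)
        h_s = (self.W_int @ session_intents.T).mean(axis=1)  # session level
        h_i = (self.W_int @ item_intents.T).mean(axis=1)     # item level
        z = np.concatenate([e_ctx, h_s, h_i])
        return softmax(self.W_p @ z + self.b_p)              # I_hat(u, c)
```

In the full model the two mean-pooling steps would be replaced by trained sequential encoders over the T_w-session window.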

Design of Ensemble Module
The structure of the intent-aware ranking ensemble module is shown in Figure 3(a). Since the final weights {w_im} should be learned from both behaviors and item categories aware of user intents, the predicted intents, the single-behavior-objective basic-list scores, and the categories of items in the basic lists are adopted as inputs.
Firstly, the lists of item scores {s_im | i ∈ {1, 2, ..., n}, m ∈ {1, 2, ..., M}} are fed into a self-attention layer to represent the relationships among item scores in the same basic list. Item categories {f_i | i ∈ {1, 2, ..., n}} are also encoded with a self-attention layer to capture the intra-list category distributions. The self-attention structure consists of a linear layer that embeds the scores {s_im} (or categories {f_i}) into d_s-dimensional representations S ∈ R^{n×d_s} (or C ∈ R^{n×d_s}) and L layers of multi-head attention, which follow the cross-relation attention layer proposed by Wang et al. [40], as shown in Figure 3(b).
Secondly, the user intent I is embedded into d_I dimensions with a linear projection e_I = W_I · I ∈ R^{d_I}. Then the influence of the user intent on the representations of scores and categories is obtained with cross-attention layers, where the projection matrix W_Q ∈ R^{d_I × d_s} is shared between the two intent-aware attention modules. Since behavior intents and category intents are associated when users interact with recommenders, we use the holistic user intent to guide the aggregation of both basic-list scores and item categories rather than splitting the intent into two parts. Finally, the weights {w_im} are generated from all the information: the intent-aware score embeddings A_S, the intent-aware item category embeddings A_C, and the intent embedding e_I are concatenated and projected into R^M to obtain the weight matrix, where W_O ∈ R^{M × (2·d_s + d_I)} is a trainable projection matrix. The output matrix W = {w_im} is used as the weights for summing the basic-model scores as in Eq. 1.
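The module can be sketched with single-head attention as follows; this simplifies the cross-relation attention of [40], assumes d_s = d_I for brevity, and letting each item attend over the items plus an intent position is our reading of the cross-attention, not necessarily the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ensemble_weights(S_emb, C_emb, e_I, W_q, W_o):
    """Sketch of the intent-aware ensemble module.

    S_emb: (n, d) self-attended score embeddings; C_emb: (n, d)
    self-attended category embeddings; e_I: (d,) intent embedding;
    W_q: (d, d) shared intent projection; W_o: (M, 3d) output projection.
    Each item attends over the items plus a projected-intent position;
    the concatenation is projected to M weights per item, normalized
    over models so that sum_m w_im = 1 for every item.
    """
    q = (W_q @ e_I)[None, :]                                   # intent query
    A_S = attention(S_emb, np.vstack([S_emb, q]), np.vstack([S_emb, q]))
    A_C = attention(C_emb, np.vstack([C_emb, q]), np.vstack([C_emb, q]))
    Z = np.hstack([A_S, A_C, np.repeat(q, len(S_emb), axis=0)])
    return softmax(Z @ W_o.T, axis=1).T                        # (M, n) weights
```

Normalizing over models per item keeps the weights inside the simplex required by the EA-decomposition theorems.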

Model Learning Strategy
Since an end-to-end framework is used to train the intent predictor and the intent-aware ranking ensemble module, the two modules are jointly optimized.
To optimize the ranking ensemble according to the EA-decomposition theorems in Section 4, the ensemble learning loss L_ens consists of L_er(y, S^ens) and A, as in Eq. 24. Meanwhile, accurate user intents guide the ranking ensemble, so an intent prediction loss is also used for model training. Since user intents are described by multi-dimensional distributions, a KL-divergence [13] loss L_I is adopted to measure the distance between the true intents I and the predicted intents Î. The final recommendation loss L_rec is a weighted sum:

L_rec = L_er(y, S^ens) - λ · A + μ · L_I,

where L_er(y, S^ens) is the ranking ensemble loss, A is the ambiguity term, and L_I is the intent prediction loss. λ and μ are hyper-parameters that adjust the weights of the ambiguity and intent losses, respectively.
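The combined objective can be sketched as follows; the λ, μ values and the distributions below are placeholders:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between true and predicted intent distributions.
    A small eps guards against log(0) for empty intent bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float((p * np.log(p / q)).sum())

def recommendation_loss(rank_loss, ambiguity, intent_true, intent_pred,
                        lam=1e-5, mu=0.1):
    """L_rec = L_er - lam * A + mu * L_I (mu = 0.1 is a placeholder)."""
    return rank_loss - lam * ambiguity + mu * kl_divergence(intent_true, intent_pred)
```

When the predicted intent distribution matches the true one, the KL term vanishes and L_rec reduces to the ensemble learning loss alone.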
EXPERIMENTS

Experimental Setup

6.1.1 Datasets. Tmall is a multi-behavior dataset from the IJCAI-15 competition, which contains half a year of user logs with Click, Add-to-favorite (Fav.), and Buy interactions on the Tmall online shopping platform. We employ the data from September for ensemble learning and exclude items and users with fewer than 3 positive interactions. Following the data processing strategy of Shen et al. [36], we treat a user's interactions within a day as a session. The three weeks of interactions before the ensemble dataset are used for generating basic-model scores, as discussed in Section 6.1.2.
LifeData comes from a local life service App, where the recommender provides users with nearby businesses such as restaurants or hotels.Users may click or buy items on the platform.One-month interactions of a group of anonymous users are sampled, and users and items with less than 3 positive interactions are filtered.A user's each visit (i.e., entering the App) is defined as a session, and sessions with positive interactions are retained.
Basic models are optimized for each behavior, as introduced in Section 6.1.2. Ranking ensemble is conducted at the session level, and interactions in at most T_w = 20 historical sessions are considered in the intent predictor. Detailed statistics are shown in Table 2, which covers the dataset for ensemble learning only, excluding the data used for basic-model generation. Moreover, the training data for the basic models have no overlap with the ensemble learning data.
6.1.2 Basic-model Score and Intent Generation. In IntEL, basic scores are pre-generated and fixed during ranking ensemble. For Tmall, we adopted DeepFM [20] as the basic model architecture and trained three models with the Click, Fav., and Buy objectives separately. In each session, we select the top 30 items predicted by each basic model to construct three item sets and take their union, plus the positive interactions of the session, to form the basic item lists for re-ranking. Please refer to our public repository for details of the basic-model training strategy. For LifeData, two basic score lists are used for ranking ensemble, sorted by the predicted clicking probability and buying probability provided by the platform, respectively.
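The Tmall candidate-set construction described above can be sketched as follows (k = 30 in the paper; the toy data below uses k = 2 and made-up item names):

```python
def build_candidate_set(model_scores, positives, k=30):
    """Union of each basic model's top-k items plus the session's
    positive interactions, following the Tmall preprocessing above.

    model_scores: one dict of item -> score per basic model.
    positives: items with positive interactions in the session.
    """
    candidates = set(positives)
    for scores in model_scores:
        topk = sorted(scores, key=scores.get, reverse=True)[:k]
        candidates.update(topk)
    return candidates

cands = build_candidate_set(
    [{"a": 3, "b": 2, "c": 1}, {"c": 5, "d": 4, "a": 0}],
    positives={"e"}, k=2)
# cands == {"a", "b", "c", "d", "e"}
```

Adding the session's positives guarantees every ground-truth item appears in the lists to be re-ranked.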
As for intents, in Tmall, |B| = 3, and we merge categories with fewer than 50 items, resulting in |C| = 357. In LifeData, |C| = 6 and |B| = 2. Hence, the dimension of the intent I is 357 × 3 = 1071 for Tmall and 6 × 2 = 12 for LifeData. The ground-truth intent probability I is calculated from all positive interactions in each session.
6.1.3 Baseline Methods. We compare IntEL against the basic models and several ranking ensemble baselines as follows. 1. Single-X: Uses one of the basic models' scores to rank the item list, where X indicates Click, Fav., or Buy.
2. RRA [21]: An unsupervised ensemble method, where items are sorted with their significance in basic-model lists.
3. Borda [5]: An unsupervised ensemble method that takes the average ranks over all basic models as the final ranking. 4. Rank [7]: A gradient-based optimization method for the learning-to-rank task. We regard items as documents, basic-model scores and item categories as document features, and an MLP as the backbone model. 5. ERA [30]: An evolutionary method that aggregates basic-list features with a Genetic Algorithm (GA), where the fitness function is calculated on the validation set.
6. aWELv [24]: A supervised ensemble learning method that learns list-level weights for fusing basic lists. 7. aWELv+Int/IntEL: Two variants of aWELv considering user intents. Intents are predicted and used as an additional feature for aWELv+Int, and the IntEL module is used to predict list-level weights for aWELv+IntEL.
Our methods are shown as IntEL-MSE, IntEL-BPR, and IntEL-PL with three different kinds of loss functions.
6.1.4 Experimental Settings. We split both datasets along time: the last week is the test set, and the last three days of the training set form the validation set. The priorities for the multi-level ground truth y are Buy > Favorite > Click > Examine for Tmall and Buy > Click > Examine for LifeData. For evaluation, we adopt NDCG@3, 5, and 10 to evaluate the ensemble list S^ens on the multi-level ground truth y (i.e., all) and on each behavior objective.
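NDCG on a multi-level ground truth can be sketched as follows, with gains read from the feedback priority (the gain coding, e.g., Buy = 3, Favorite = 2, Click = 1, Examine = 0, is an illustrative assumption):

```python
import numpy as np

def ndcg_at_k(ranked_gains, ideal_gains, k):
    """NDCG@k with multi-level gains.

    ranked_gains: gains of items in the predicted order.
    ideal_gains: the same multiset of gains; sorted descending
    internally to form the ideal ranking.
    """
    def dcg(gains):
        g = np.asarray(gains[:k], dtype=float)
        return float((g / np.log2(np.arange(2, len(g) + 2))).sum())

    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0
```

A perfect ordering scores 1.0; pushing high-level feedback (e.g., Buy) down the list lowers the metric.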
We implement the IntEL model in PyTorch, and the code of IntEL and all baselines is released. Each experiment is repeated with 5 different random seeds, and average results are reported. All models are trained with Adam until convergence, with a maximum of 100 epochs. For a fair comparison, the batch size is set to 512 for all models. We tuned the parameters of all methods on the validation set, where the learning rate is tuned in the range [1e-4, 1e-2] and all embedding sizes are tuned in {16, 32, 64}. Specifically, for IntEL, we found stable performance when a GRU [12] with embedding size 128 is used for the intent predictor, with self-attention depth L = 2 for Tmall and L = 1 for LifeData. The ambiguity loss weight λ is set to 1e-5, 1e-5, and 1e-4 for IntEL-MSE, IntEL-BPR, and IntEL-PL, respectively. Hyper-parameter details are also released.

Overall Performance
The overall performance on Tmall and LifeData is shown in Table 4 and Table 5, respectively. We divide all models into four parts: the first part evaluates each single-objective basic model's scores; the second contains unpersonalized baseline ensemble models; the third contains personalized baselines, i.e., aWELv and its two variants with user intents; and the last part shows our method IntEL with three loss functions. From the results, we have several observations. First, our proposed IntEL achieves the best performance on all behavior objectives in both datasets. IntEL with the three loss functions, i.e., IntEL-MSE, IntEL-BPR, and IntEL-PL, outperforms the best baselines on most metrics significantly. Although the two datasets are quite different, as shown in Table 3, IntEL has stable, strong ensemble results on both. Second, IntEL with different loss functions shows different performance on the two datasets. On Tmall, IntEL-MSE is better than IntEL-BPR and IntEL-PL. This is because the four-level ground truth y (three behaviors) makes ranking such diverse item lists close to rating prediction, so IntEL-MSE, which directly optimizes the ensemble scores, performs better than IntEL-BPR and IntEL-PL, which optimize comparisons between rankings. On LifeData, IntEL-PL and IntEL-BPR perform better, since LifeData has shorter sessions with fewer positive interactions (as in Table 3), so the comparison-based BPR and P-L losses achieve better performance.
Third, comparing the baselines, we find that the supervised methods (Rank, ERA, and aWELv) greatly outperform the unsupervised RRA and Borda on Tmall. This is because the heterogeneous single-behavior objective models (Single-X) have diverse performance, making rank aggregation difficult for unsupervised methods.
Lastly, aWELv and its variants perform well on LifeData but not on Tmall, since session lists are generally longer for Tmall (Table 3), and the list-level weights of aWELv miss useful intra-list information. So item-level weights that consider item category intents are necessary. Nevertheless, aWELv is better than the basic models on both datasets, which is consistent with the theory. Moreover, aWELv+Int/IntEL outperform aWELv on most metrics, indicating that user intents contribute to ranking ensemble learning.
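The contrast in this last observation can be made concrete: aWELv assigns one scalar weight per basic list, while IntEL learns a weight per (list, item) pair, which is what allows intra-list, category-dependent adjustment. A minimal sketch of the two fusion schemes (the weights here are illustrative inputs, not learned):

```python
def fuse_list_level(score_lists, list_weights):
    """aWELv-style fusion: one scalar weight per basic list."""
    n = len(score_lists[0])
    return [sum(w * s[i] for w, s in zip(list_weights, score_lists))
            for i in range(n)]

def fuse_item_level(score_lists, item_weights):
    """IntEL-style fusion: a weight per (list, item) pair, so items in
    the same list can be re-weighted differently, e.g. by category
    intent. item_weights[k][i] weights item i of basic list k."""
    n = len(score_lists[0])
    return [sum(w[i] * s[i] for w, s in zip(item_weights, score_lists))
            for i in range(n)]
```

With list-level weights, every item in a list is scaled identically; item-level weights can, for example, trust the purchase-objective list only for items in categories the user intends to buy.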

Further Analysis
To further explore the performance of our ranking ensemble learning method, we conduct an ablation study, an analysis of user intents, and a hyper-parameter analysis on the best model for each dataset, i.e., IntEL-MSE for Tmall and IntEL-PL for LifeData.

Ablation Study. The main contributions of our proposed IntEL include adopting user intents for heterogeneous ranking ensemble and integrating basic-list scores, item categories, and user intents. We compare IntEL with five variants. Three exclude one of the inputs: -Int (without intents), -I (without item categories), and -S (without basic-list scores). Two replace main components: -Cross, which removes the intent-aware cross-attention layer, and -Self, which replaces the self-attention layer with a direct connection.
NDCG@3 on the general multi-level ranking list for the variants and IntEL is shown in Figure 4. Ranking performance drops on all five variants, indicating that all inputs and both attention layers contribute to the performance improvement of IntEL.
Hyper-parameter Analysis. Results with different ambiguity loss weights are shown in Figure 5. They illustrate that a too large or too small ambiguity loss weight both decrease ranking ensemble performance. In particular, when the weight is too large, the model focuses on maximizing the basic-model ambiguity to minimize the overall ensemble learning loss, while the ranking error term is less optimized. As for the intent loss weight, performance on Tmall fluctuates with it, while performance on LifeData is relatively stable. This is because the difficulty of intent prediction differs between the two datasets: Tmall contains 1071 types of intents, which are hard to predict accurately, so a proper intent loss weight is essential for predictor optimization, while LifeData has only 12 intents, which are easier to capture and model.
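The failure mode with a large ambiguity weight follows directly from the loss shape. A sketch, assuming the error-ambiguity form where the ensemble learning loss is the ranking error minus a weighted ambiguity term (the function and variable names here are placeholders, not the paper's notation):

```python
def ensemble_learning_loss(error, ambiguity, weight):
    """L_ens = L_err - weight * ambiguity: minimizing L_ens trades off
    lowering the ensemble's ranking error against raising the
    disagreement (ambiguity) among basic models."""
    return error - weight * ambiguity

def preferred(configs, weight):
    """Among (error, ambiguity) candidates, the one minimizing L_ens."""
    return min(configs, key=lambda c: ensemble_learning_loss(*c, weight))
```

With a tiny weight (e.g. the 1e-5 used for IntEL-MSE) the low-error solution wins; with a large weight, a high-ambiguity solution wins even at much higher error, which is exactly the degradation described above.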

CONCLUSION
In this paper, we propose a novel ranking ensemble method, IntEL, for intent-aware aggregation of single-objective ranking lists. To our knowledge, we are the first to generalize ranking ensemble learning with item-level weights on heterogeneous item lists, and the first to integrate user intents into rank aggregation in recommendation. We generalize the ranking ensemble with item-level weights and prove its effectiveness with three representative loss functions via error-ambiguity decomposition theory. Based on the proof, we design an ensemble learning loss that jointly minimizes the ranking ensemble error and maximizes the ambiguity among basic models. We then design an intent-aware ranking ensemble learning model, IntEL, to learn weights for the ensemble of heterogeneous lists. In IntEL, a sequential intent predictor and a two-layer attention intent-aware ensemble module are adopted to learn personalized and adaptive ensemble weights with user intents. Experiments on two large-scale datasets show that IntEL gains significant improvements on multiple optimization objectives simultaneously. This study still has some limitations. For basic list generation, we only applied one classical method, DeepFM, to different behaviors separately. However, multi-behavior methods could also generate multiple basic lists simultaneously, which may lead to different performance for IntEL. Also, a straightforward method was adopted to predict intents and incorporate the intent prediction loss. In the future, we will investigate integrating more heterogeneous basic lists for other objectives in recommendation with IntEL. As we find that more accurate user intents lead to better ranking ensemble performance, we will also try to design more sophisticated intent predictors to achieve better results.

Figure 1: An example of fusing two single-objective item lists into a final list aware of multiple concurrent user intents.

Figure 3: Structure of the intent-aware ranking ensemble module.

Figure 5: Ranking ensemble results of IntEL with different hyper-parameters.

Table 2: Dataset statistics in the ranking ensemble experiments.

Table 3: Main differences between the two datasets. Pos. indicates positive interactions.

Table 4: Results of IntEL with three different loss functions and baseline methods on Tmall. Boldface shows the best result. Underline indicates the best baseline. Notation **/* indicates significantly better than the best baseline with p<0.01/0.05, respectively.

Table 5: Results of IntEL with three different loss functions and baseline methods on LifeData. Boldface shows the best result. Underline indicates the best baseline. Notation **/* indicates significantly better than the best baseline with p<0.01/0.05, respectively.