Bundle MCR: Towards Conversational Bundle Recommendation

Bundle recommender systems recommend sets of items (e.g., pants, shirt, and shoes) to users, but they often suffer from two issues: significant interaction sparsity and a large output space. In this work, we extend multi-round conversational recommendation (MCR) to alleviate these issues. MCR, which uses a conversational paradigm to elicit user interests by asking about user preferences on tags (e.g., categories or attributes) and handling user feedback across multiple rounds, is an emerging recommendation setting for acquiring user feedback and narrowing down the output space, but it has not been explored in the context of bundle recommendation. Concretely, we propose a novel recommendation task named Bundle MCR. We first propose a new framework that formulates Bundle MCR as Markov Decision Processes (MDPs) with multiple agents, for user modeling, consultation and feedback handling in bundle contexts. Under this framework, we propose a model architecture, called Bundle BERT (Bunt), to (1) recommend items, (2) post questions and (3) manage conversations based on bundle-aware conversation states. Moreover, to train Bunt effectively, we propose a two-stage training strategy. In an offline pre-training stage, Bunt is trained using multiple cloze tasks to mimic bundle interactions in conversations. Then, in an online fine-tuning stage, Bunt agents are enhanced by user interactions. Our experiments on multiple offline datasets, as well as a human evaluation, show the value of extending MCR frameworks to bundle settings and the effectiveness of our Bunt design.

(1) Interaction sparsity: user-bundle interactions are significantly sparser than user-item interactions, leading to the difficulty of modeling user preferences accurately; (2) Output space complexity: predicting a correct bundle (i.e., multiple items) from all item combinations is more challenging than traditional individual item recommendation.
Currently, two approaches are proposed in bundle recommendation to circumvent these issues. The first line [4,6,35] presents discriminative methods, i.e., ranking existing bundles, which avoids the complexity issue by treating each bundle as a generalized individual 'item'. The application scenarios of these methods are usually narrow (e.g., for pre-defined bundle sales). The second line [2,12,20] uses generative methods, i.e., generating (perhaps new) bundles, which is more flexible but still suffers from limited accuracy. In these works, bundle recommenders are one-shot, i.e., they recommend a complete bundle in a single attempt. As the traditional bundle recommendation in Figure 1a shows, the user receives a complete bundle (shirt, shoes and pants) and reacts to it (picking the shirt but ignoring the others), then the recommendation ends. Clearly, such a one-shot setting does not allow the model to collect continuous user feedback and provide an enhanced bundle with higher accuracy. Considering these limitations, we present a new multi-round and interactive way to recommend bundles, i.e., allowing the user and the system to "discuss" the bundle composition together. We call this multi-round conversational bundle recommendation task Bundle MCR.
The core idea of Bundle MCR is to extend one of the representative conversational recommendation mechanisms, multi-round conversational recommendation (MCR) [14,26,27,53], to bundle contexts, in which the system can acquire user feedback on item tags and narrow down the output space during conversations for more accurate bundle recommendation. Although many recent conversational recommendation works [11,14,26,27,51,53], especially MCR frameworks [14,26,27,53], have proven effective at eliciting user preferences for individual item recommendation, designing a new MCR framework for bundle recommendation is still non-trivial: existing MCR frameworks target recommending an individual item only (named Individual MCR); they cannot directly work in bundle settings for several reasons: (1) they do not consider user-bundle interactions or item-item relationships in user preference modeling; (2) they recommend top-K individual items instead of multiple items as a bundle (or partial bundle); (3) they handle user feedback and post questions on tags related to a single target item, without considering feedback or questions for different items within a bundle. We illustrate the gap between Individual MCR and Bundle MCR in Figure 1a. Individual MCR updates user feedback on tags (e.g., attributes like sport-style or blue) to narrow down the candidate item pool effectively, but cannot directly post questions about, or model feedback to, multiple items (i.e., it is not bundle-aware). Instead, Bundle MCR aims to complete a bundle with the user by generating multiple items as a bundle or partial bundle, and by handling questions about multiple items (e.g., sport-style pants and white shoes).
Methodologically, we formulate Bundle MCR as a Markov Decision Process (MDP) problem with multiple agents. Then, we propose a new model architecture, Bundle BERT (Bunt), to conduct these functions in a unified self-attentive [23,39,43] architecture. Furthermore, to train Bunt efficiently, a two-stage training strategy is proposed: we pre-train Bunt with multiple cloze tasks, to learn the basic knowledge of how to infer correct items and tags, and when to ask or recommend, based on conversation contexts mimicked by offline user-bundle interactions. Then, we introduce a user simulator to create a simulated online environment, and fine-tune the Bunt agents with reinforcement learning on conversational bundle interactions with users. We summarize the contributions of this work as follows:
• We propose the Bundle MCR setting, where users and the system complete a bundle together. To our knowledge, this is the first work that considers a conversational mechanism in bundle recommendation; it also alleviates the bundle recommendation issues of interaction sparsity and output space complexity.
• We present an MDP framework with multiple agents for Bundle MCR. Under this framework, we propose Bundle BERT (Bunt) to conduct multiple Bundle MCR functions in a unified self-attentive architecture. We also design a two-stage (pre-training and fine-tuning) strategy for Bunt learning.
• We evaluate conversational bundle recommendations on four offline bundle datasets and conduct a human evaluation, to show the effectiveness of Bunt and the potential of conversational bundle recommendation.

Bundle Recommendation
Bundle (or set, basket) recommendation offers multiple items as a set to users. Traditional bundle recommendation adopts Integer Programming [31,47,54] or Association Analysis [9,16]; most of these methods have no personalization. Some works [45,49] apply constraint solvers. Recently, more works are learning-based and can be divided into two categories: (1) discriminative methods: Bundle BPR (BBPR) [35] extends BPR [37] to personalized bundle ranking (BBPR also designs a heuristic generative algorithm, but it is time-consuming); DAM [6] and BGCN [3] enhance user representations with factorized attention networks or graph neural networks. (2) generative methods: an encoder-decoder framework is used in BGN [2] (RNN [10]-based) and PoG [8] (Transformer [43]-based) to generate multiple items as a personalized bundle. BGN decomposes bundle recommendation into quality/diversity via determinantal point processes. BYOB [12] treats bundle generation as a sequential decision-making problem solved with reinforcement learning. In our work, the bundle recommender is designed for interactive (conversational) settings. As Table 1 shows, existing bundle recommenders are one-shot, so they focus on user modeling and bundle generation only. Our model (Bunt) also considers how to handle user feedback, post questions and manage conversations in a unified architecture.

Conversational Recommendation
A conversational recommender system (CRS) converses with users actively. CRSs ask questions (e.g., 'which one do you prefer?') to establish user preferences efficiently or to explain recommendations. Existing CRS methods can be classified by their question spaces: (1) Asking free text: these methods generate human-like responses in natural language [7,22,29]. For example, [29] collects ReDial, a natural-language conversational recommendation dataset, and builds a hierarchical RNN framework on it. KBRD [7] further incorporates knowledge-grounded information to unify recommender systems with dialog systems. (2) Asking about items [11,48]: for example, [11] designs absolute question templates (i.e., 'want item A?') or relative ones (i.e., 'item A or B?') and evaluates several question strategies such as Greedy, UCB [1] or Thompson Sampling [5]. (3) Asking about tags: the system asks questions about user preferences over different tags associated with items. For example, CRM [40] integrates conversation and recommender systems into a unified deep reinforcement learning framework to ask about facets (e.g., color, brand) and recommend items. SAUR [52] proposes a System Ask-User Respond paradigm to ask pre-defined questions about item attributes in the right order and provide ranked lists to users. The multi-round conversational recommendation (MCR) setting [14,26,27,53] also belongs to category (3).

Multi-Round Conversational Recommendation (MCR)
In our work, we focus on the MCR setting for the following reasons: (1) Completing a bundle is naturally a multi-round process, in which more user feedback on item tags is collected to make more accurate recommendations and put more items into the potential bundle. (2) MCR is arguably the most realistic setting available [14,26,27,53] and is widely used in recent conversational recommenders. For example, EAR [26] proposes an Estimation-Action-Reflection framework to ask about attributes and model users' online feedback. Furthermore, SCPR [27] incorporates an item-attribute graph to provide explainable conversational recommendations. UNICORN [14] proposes a unified reinforcement learning framework based on a dynamic weighted graph for MCR. To make individual MCR more realistic, MIMCR [53] allows users to select multiple choices per question, and models user preferences with multi-interest encoders. However, existing MCR frameworks are proposed for individual item recommendation (i.e., Individual MCR). Thus the entire model architecture (e.g., FM [36]) and question strategy design are not compatible with bundle contexts. As Table 1 shows, our work uses a similar conversation management idea as existing Individual MCRs, but we design model architectures for bundle-aware user modeling, question asking, feedback handling and bundle generation.
Different from Individual MCR, we propose a new concept for Bundle MCR: the slot, i.e., a placeholder for a consulted item. For example, an outfit (1: shoes, 2: pants, 3: shirt) has three slots X = {1, 2, 3}. Ideally, Bundle MCR (1) determines the number of slots and (2) fills target items into the slots during conversations.
Bundle MCR is formulated as follows: given the set of users U and items I, we collect tags corresponding to items, such as the set of attributes P (e.g., "dark color") and categories Q (e.g., "shoes"). As illustrated in Figure 1b, for a user u ∈ U: (1) The conversation starts from a state S_u^(1), which encodes the user's historical interactions {B_1, B_2, . . .}, where B_* represents a bundle of multiple items. Setting the conversational round t = 1, the system creates multiple slots X^(t).
(2) Then, the system decides to recommend or ask, i.e., (i) recommending |X^(t)| items as a (partial) bundle to fill the proposed slots, denoted as B_u^(t) = { î_x | x ∈ X^(t) }; or (ii) asking for user preferences per slot on attributes A_u^(t). Here, in each slot x, î_x ∈ I, â_x ∈ P and ĉ_x ∈ Q.
(3) Next, user u provides feedback (i.e., accept, ignore, or reject) on the proposed partial bundle B_u^(t), per slot x ∈ X^(t). (4) After that, the system updates the user feedback into a new state S_u^(t+1), records all accepted items into a set denoted B̌, and updates the slots of interest as X^(t+1) by creating new slots and removing the slots x in which the user has accepted the recommended item î_x.
After multiple rounds of steps (2)-(4), the system collects rich contextual information and creates the bundle B̌ for user u.
The conversation terminates when u is satisfied with the current bundle (i.e., B̌ equals the target bundle B*_u) or the conversation reaches the maximum number of rounds T.
In Bundle MCR, we identify several interesting questions: (1) how to encode user feedback into the bundle-aware state S_u^(t+1)? (2) how to accurately predict bundle-aware items or tags? (3) how to effectively train models in Bundle MCR? (4) how to decide the size of the slot set X^(t) per round? In this work, we focus on (1)-(3), and use a simple strategy for (4), i.e., keeping the slot size fixed at a number K. Though the slot size per round is fixed, the final bundle sizes are diverse due to different user feedback and conversation rounds. We leave more flexible slot strategies for future work.
Note that we use the attribute set P, the category set Q and related models for all baselines and proposed methods. For ease of description, we only take the attribute set P as the example of tags in the following methodology sections.
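The formulation above can be sketched as a simulation loop. The snippet below is a minimal, hypothetical sketch (class and function names are ours, not the paper's): items are proposed randomly in place of a learned policy, and the user simply accepts target items and rejects others, illustrating steps (1)-(4) and the termination conditions.

```python
import random

class ToyUser:
    """Hypothetical simulated user who has a fixed target bundle in mind."""
    def __init__(self, target):
        self.target = set(target)

    def react_items(self, items):
        # Accept items in the target bundle, reject the rest
        # (real users may also ignore; we simplify here).
        return ["accept" if i in self.target else "reject" for i in items]

    def satisfied(self, accepted):
        return accepted == self.target

def run_conversation(user, catalog, max_rounds=10, slot_size=2, seed=0):
    """One Bundle MCR episode following steps (1)-(4): open K slots, propose
    items (randomly here, in place of a learned policy), collect per-slot
    feedback, update the accepted bundle and candidate pool, and stop on
    success or after max_rounds."""
    rng = random.Random(seed)
    pool = set(catalog)          # per-conversation candidate pool
    accepted = set()             # the growing bundle (B-check)
    for _ in range(max_rounds):
        if not pool:
            break
        slots = rng.sample(sorted(pool), min(slot_size, len(pool)))
        for item, fb in zip(slots, user.react_items(slots)):
            pool.discard(item)   # proposed items leave the pool either way
            if fb == "accept":
                accepted.add(item)
        if user.satisfied(accepted):
            break
    return accepted
```

With a four-item catalog and slot size K = 2, every item is proposed within two rounds, so the target bundle is always completed.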

GENERAL FRAMEWORK
We formulate Bundle MCR as a two-step Markov Decision Process (MDP) problem with multiple agents, since (1) the system makes two-step decisions: first whether to recommend or ask (i.e., conversation management), then what to recommend or ask; (2) multiple agents are responsible for different decisions: a conversation management agent (using policy π_c), a bundle agent (using π_b) that decides which items to recommend to compose a bundle, and an attribute agent (using π_a) that considers which attributes to ask about. The goal of our framework is to maximize the expected cumulative rewards to learn the policy networks π_c*, π_b*, π_a*. We divide a conversation round into user modeling, consultation, and feedback handling as in [53], then describe our state, policy, action and transition design under this framework in the related stages.

States: Bundle-Aware User Modeling
We first introduce the shared conversation state S_u^(t) for all agents. S_u^(t) is encoded (the specific encoder is introduced in Section 5) from the conversational information s_u^(t) at conversational round t, which is defined as:
• Long-term preference, represented by the set of user u's historical bundle interactions {B_1, B_2, . . .}.
• Short-term contexts, which collect the accepted items and attributes in the conversation before round t. X^(≤t) is the set of slots up to round t, i.e., X^(≤t) = ∪_{t'=1}^{t} X^(t'). In slot x at round t, we record the tuple (ǐ_x^(t), Ǎ_x^(t)), where ǐ_x^(t) denotes the item id accepted by the user (if no item has been accepted in slot x, ǐ_x^(t) is set to a mask token [MASK]) and Ǎ_x^(t) is the set of accepted attributes in slot x. For example, the initial short-term context is a K-sized set of ([MASK], ∅) tuples, meaning we know nothing about accepted items or attributes.
• Candidate pools, which contain the item and attribute candidates per slot at round t (this is not space-costly, since pools can be stored as black lists). They are initialized as the complete pools I and P, and updated with user u's feedback, as described in Section 4.3.
Second, we introduce additional conversation information s̃_u^(t) (encoded as an additional state S̃_u^(t) in Section 5.1) for the conversation management agent. s̃_u^(t) records the result ids of the previous t-1 rounds as a list, such as [rec_fail, ask_fail, ...]. This is a commonly used state representation for conversation management agents [14,26,27]. We follow the result id settings of [26], but apart from the "rec_suc" id for successfully recommending a single item, we further introduce a "bundle_suc" id to record the result of successfully recommending the entire bundle.

Policies and Actions: Bundle-Aware Consultation
The system moves to the consultation stage after obtaining conversation states in the user modeling stage. Now, the system makes a two-step decision: (1) whether to recommend or ask (using policy π_c); (2) what to recommend (using policy π_b) or what to ask (using policy π_a). We define these policies as:
• π_b, bundle consultation: if recommending, the agent uses S_u^(t) as input to generate |X^(t)| (i.e., K) items as B_u^(t), where î_x is the action corresponding to slot x and the action space is I_x^(t).
• π_a, attribute consultation: if asking, the agent uses S_u^(t) as input to generate |X^(t)| (i.e., K) attributes as A_u^(t), where â_x is the action corresponding to slot x and the action space is P_x^(t).

Transitions: Bundle-Aware Feedback Handling
The system handles user feedback in a transition step. The user u reacts to the K proposed items or attributes with acceptance, rejection or ignoring. Generally, in our transition step, "acceptance" is mainly used to update the short-term contexts, "rejection" is used to update the candidate pools, and nothing changes for "ignoring".
• Update S_u^(t+1): the long-term preference is fixed; we update the short-term contexts and candidate pools as follows. (1) Feedback to items: for each consulted item î_x, the item candidate pools I_{x'}^(t+1), x' ∈ X^(t), remove î_x upon rejection, while acceptance records î_x in the short-term context of slot x. (2) Feedback to attributes is handled analogously for the attribute pools and the accepted attribute sets.
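The transition rule above can be sketched as a small helper. This is a hypothetical illustration (the function and the dict-based state layout are ours, not the paper's code): acceptance writes into the slot's short-term context, rejection shrinks the slot's candidate pools, and any other feedback is a no-op.

```python
def handle_feedback(short_term, pools, slot, feedback, payload):
    """Transition sketch: update the short-term context tuple
    (accepted item, accepted attribute set) and the per-slot
    candidate pools from one piece of user feedback."""
    item, attrs = short_term[slot]
    if feedback == "accept_item":
        short_term[slot] = (payload, attrs)       # replace the [MASK] item
    elif feedback == "accept_attr":
        short_term[slot] = (item, attrs | {payload})
    elif feedback == "reject_item":
        pools[slot]["items"].discard(payload)     # narrow the item pool
    elif feedback == "reject_attr":
        pools[slot]["attrs"].discard(payload)     # narrow the attribute pool
    # any other feedback (e.g. "ignore") changes nothing
    return short_term, pools

# Usage: one slot, initialized as ([MASK], empty attribute set).
short_term = {1: ("[MASK]", set())}
pools = {1: {"items": {"i1", "i2"}, "attrs": {"blue", "red"}}}
handle_feedback(short_term, pools, 1, "reject_item", "i2")
handle_feedback(short_term, pools, 1, "accept_attr", "blue")
handle_feedback(short_term, pools, 1, "accept_item", "i1")
```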

Rewards: Two-Level Reward Definitions
We define two-level rewards for these multiple agents.
(1) Low-level rewards are for π_b and π_a, i.e., to make item recommendation and question posting more accurate online. At round t, for each slot x, the reward r_{b,x}^(t) = 1 if π_b hits the target item, otherwise 0. The reward r_{a,x}^(t) for π_a is defined similarly.
(2) High-level rewards are for the conversation management agent π_c, reflecting the quality of a whole conversation. The reward r_c^(t) is 0 unless the conversation ends, where we calculate r_c^(t) using one of the final bundle metrics (e.g., F1 score, accuracy).
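The two reward levels can be written down directly. Below is a minimal sketch (function names are ours): per-slot 0/1 hit rewards for the low level, and a terminal bundle metric for the high level, using F1 as one of the options the text names.

```python
def low_level_rewards(proposed, targets):
    """Per-slot reward for the item/attribute agents:
    1 if the slot's prediction hits a target, else 0."""
    return [1.0 if p in targets else 0.0 for p in proposed]

def high_level_reward(done, final_bundle, target_bundle):
    """Conversation-level reward for the manager: 0 until the
    episode ends, then a final bundle metric (F1 here)."""
    if not done:
        return 0.0
    hit = len(set(final_bundle) & set(target_bundle))
    if hit == 0:
        return 0.0
    prec, rec = hit / len(final_bundle), hit / len(target_bundle)
    return 2 * prec * rec / (prec + rec)
```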

MODEL ARCHITECTURE
Under this framework, we propose a unified model, Bundle BERT (Bunt).In this section, we first describe the architecture of Bunt, then we describe how to train Bunt with offline pre-training and online fine-tuning.
Bunt is an encoder-decoder framework with multi-type inputs and multi-type outputs to handle user modeling, consultation, and feedback handling. The encoder-decoder framework is commonly used in traditional bundle recommendation tasks [2,8,20]. We use a self-attentive architecture for three reasons: (1) self-attentive models have already proven to be effective representation encoders and accurate decoders in recommendation tasks [8,19,23,28,39]; (2) RNN [10]-based model inputs have to be "ordered", while a self-attentive model discards unnecessary order information, reflecting the unordered property of bundles; (3) a self-attentive model can be naturally used in cloze tasks (e.g., BERT [15]), which is suitable for predicting unknown items or attributes in slots.

Bunt for Bundle-Aware User Modeling
5.1.1 Long-Term Preference Representation. We encode user historical interactions {B_1, B_2, . . .} as the user's long-term preference E_u using hierarchical transformer (TRM) [43] encoders. TRM_bundle is a transformer encoder over the set of bundle-level representations {B_1, B_2, . . .}; its output E_u ∈ R^{n_u × d} represents the user's long-term preference, where n_u is the number of historical bundles and d is the hidden size of the TRM_bundle model. Each bundle representation B_i ∈ R^{1×d} is extracted by another transformer encoder, TRM_item, over the set of item embeddings in that bundle; the set of output embeddings from TRM_item is then aggregated into B_i by average pooling (AVG). Our two-level transformers contain no positional embeddings, since the input representations are unordered.
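As a simplified PyTorch sketch of this hierarchy (dimensions, layer counts and class names are illustrative assumptions, and standard self-attention encoder layers stand in for the paper's exact TRM implementation): TRM_item encodes each bundle's item embeddings and average-pools them into one vector per bundle, then TRM_bundle attends over the set of bundle vectors; no positional embeddings are added anywhere.

```python
import torch
import torch.nn as nn

class LongTermEncoder(nn.Module):
    """Hierarchical long-term preference encoder sketch: item-level
    transformer + average pooling, then bundle-level transformer.
    No positional embeddings, since item sets and bundle sets are
    unordered."""
    def __init__(self, num_items, d=32, heads=2, layers=1):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d)
        make_layer = lambda: nn.TransformerEncoderLayer(
            d, heads, dim_feedforward=4 * d, batch_first=True)
        self.trm_item = nn.TransformerEncoder(make_layer(), layers)
        self.trm_bundle = nn.TransformerEncoder(make_layer(), layers)

    def forward(self, bundles):
        # bundles: (n_u, bundle_len) item ids of the user's history
        x = self.item_emb(bundles)          # (n_u, bundle_len, d)
        b = self.trm_item(x).mean(dim=1)    # AVG pooling -> one vector/bundle
        return self.trm_bundle(b.unsqueeze(0)).squeeze(0)  # E_u: (n_u, d)

enc = LongTermEncoder(num_items=100)
hist = torch.randint(0, 100, (3, 4))        # 3 historical bundles of 4 items
out = enc(hist)                             # E_u with shape (3, 32)
```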
5.1.2 Short-Term Contexts Representation. We describe how to represent the short-term contexts {(ǐ_x^(t), Ǎ_x^(t)) | x ∈ X^(≤t)}. We feed the contexts into a special embedding layer EMB and obtain two sets of embeddings for items and attributes, E_I^(t), E_A^(t) ∈ R^{|X^(≤t)| × d}. For items, we retrieve the embeddings of the accepted item ids (or the [MASK] id). For attributes, we retrieve the embeddings corresponding to the accepted attribute ids. The slot representations are then refined by a stack of transformer layers: TRM_l is the l-th transformer layer with cross attention [43], W^{l-1} ∈ R^{d×d} is a learnable projection matrix at layer l-1 for the attribute representation, ⊕ is element-wise addition and LN denotes LayerNorm [42] for training stabilization. We incorporate the attribute feature E_A^(t) before each transformer layer in order to incorporate multiple resolution levels, which is effective in transformer-based recommender models [30]. Thus the output representation O_x contains contextual information from the slots in the conversation contexts. We treat O_x and the candidate pools I_x^(t), P_x^(t) for all slots x ∈ X^(≤t) as the encoded state S_u^(t). Moreover, for the additional conversation records s̃_u^(t) introduced in Section 4.1, we encode them as a vector S̃_u^(t) using result id embeddings and average pooling.
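The per-layer attribute injection can be sketched as follows in PyTorch. This is an illustrative simplification under stated assumptions: self-attention layers replace the paper's cross-attention TRM_l, attribute sets are mean-pooled, and all names and sizes are ours.

```python
import torch
import torch.nn as nn

class ShortTermEncoder(nn.Module):
    """Slot encoder sketch: item embeddings (with a [MASK] id for unknown
    slots) pass through L transformer layers; before each layer, the pooled
    attribute embedding of the slot is projected (W^l), added element-wise,
    and normalized with LayerNorm, so attribute evidence enters at every
    resolution level."""
    def __init__(self, num_items, num_attrs, d=32, heads=2, layers=2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, d)   # last id = [MASK]
        self.attr_emb = nn.Embedding(num_attrs + 1, d, padding_idx=0)
        self.proj = nn.ModuleList(nn.Linear(d, d, bias=False)
                                  for _ in range(layers))
        self.norm = nn.ModuleList(nn.LayerNorm(d) for _ in range(layers))
        self.trm = nn.ModuleList(
            nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
            for _ in range(layers))

    def forward(self, items, attrs):
        # items: (B, S) slot item ids; attrs: (B, S, A) padded attribute ids
        o = self.item_emb(items)                 # O^(0)
        e_a = self.attr_emb(attrs).mean(dim=2)   # pooled attr feature/slot
        for proj, norm, trm in zip(self.proj, self.norm, self.trm):
            o = trm(norm(o + proj(e_a)))         # inject attrs, then attend
        return o                                 # O: (B, S, d)

enc = ShortTermEncoder(num_items=50, num_attrs=20)
# One batch, two slots: slot 1 is [MASK] (id 50), slot 2 holds item 3.
o = enc(torch.tensor([[50, 3]]), torch.zeros(1, 2, 4, dtype=torch.long))
```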

Bunt for Bundle-Aware Consultation
For the consultation step, we feed the encoded state into multiple policy networks to obtain outputs for each slot x ∈ X^(t). The policies π_b and π_a are MLP [18] models with ReLU [34] activations and a softmax layer; we use π_b or π_a to infer the masked items or attributes in slot x. The conversation management policy π_c is predicted by an MLP with a sigmoid function, using the concatenation of S̃_u^(t) and O_x as input. At inference time, we take the actions with the highest probability to decide whether to recommend or ask, and to construct the consulted (partial) bundle B_u^(t) or questions on attributes A_u^(t). Compared with individual-item MCR models, the contextual information stored in different slots matters in bundle recommendation, so it is natural to share the state encoded from the different slots for both recommendation and question prediction in a unified self-attentive architecture. (Strictly, the slot output should be written O_x^(t,L); we omit some notation for simplicity of the decoder description below.)

The offline pre-training procedure (Algorithm 1) mimics conversation contexts from offline data: sample k ∈ [1, K] and mask k items in the target bundle, setting the masked positions as slots X; retrieve the attributes of all items in the target bundle and mask attributes with probability ρ to mimic short-term contexts; predict the distributions of masked items, attributes and conversation management per slot x ∈ X via Equation (5); and compute the loss L_offline with Equations (6) to (8), updating Θ with a gradient-based optimizer (e.g., Adam [25]).
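The three policy heads can be sketched as below in PyTorch. This is a hypothetical layout (class name, sizes and wiring are our assumptions): per-slot item and attribute heads are ReLU MLPs with a softmax over their action spaces, and the manager head applies a sigmoid to the concatenated conversation state and slot representation.

```python
import torch
import torch.nn as nn

class PolicyHeads(nn.Module):
    """Sketch of the consultation heads: pi_b (items) and pi_a
    (attributes) as MLP + softmax per slot; pi_c (recommend-or-ask)
    as MLP + sigmoid on [conversation state ; slot representation]."""
    def __init__(self, d, num_items, num_attrs):
        super().__init__()
        mlp = lambda out: nn.Sequential(
            nn.Linear(d, d), nn.ReLU(), nn.Linear(d, out))
        self.item_head = mlp(num_items)      # pi_b
        self.attr_head = mlp(num_attrs)      # pi_a
        self.manager = nn.Sequential(        # pi_c
            nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1), nn.Sigmoid())

    def forward(self, slot_repr, conv_state):
        # slot_repr: (S, d) per-slot outputs O_x; conv_state: (1, d)
        p_item = self.item_head(slot_repr).softmax(dim=-1)
        p_attr = self.attr_head(slot_repr).softmax(dim=-1)
        state = conv_state.expand_as(slot_repr)
        p_rec = self.manager(torch.cat([state, slot_repr], dim=-1))
        return p_item, p_attr, p_rec

heads = PolicyHeads(d=32, num_items=100, num_attrs=20)
p_item, p_attr, p_rec = heads(torch.randn(2, 32), torch.randn(1, 32))
```

At inference, taking the argmax of `p_item` or `p_attr` per slot yields the consulted items or attributes, matching the highest-probability action selection described above.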

Offline Pre-Training
Due to the large action spaces of items and attributes, it is difficult to directly train agents from scratch.Thus, we first pre-train the Bunt model on collected offline user-bundle interactions.The core idea of pre-training is to mimic model inputs and outputs in the process of Bundle MCR, which can be treated as multiple cloze (i.e., "fill the slot") tasks given a few accepted items and attributes to infer the masked items and attributes.

Multi-Task Loss.
Bunt offline training is based on a multi-task loss for recommendation and question asking simultaneously, i.e., L_offline = L_rec + λ L_ask, where λ is a trade-off hyper-parameter balancing the importance of the two losses in offline pre-training. We treat item prediction as a multi-class classification task over the masked slots X^(t), with a cross-entropy loss in which y_i is the binary label (0 or 1) for item i. Meanwhile, attribute predictions are formulated as multi-label classification tasks. We use a weighted cross-entropy loss that accounts for label imbalance, to prevent the model from only predicting popular attributes: in the attribute loss, w_p is a balance weight for attribute p following [24], and note that multiple y_p can be 1 in multi-label classification.
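A minimal sketch of this multi-task loss in PyTorch follows (function name, shapes and the default λ are illustrative, and only the attribute term of L_ask is shown for brevity): multi-class cross-entropy for masked items, plus a per-attribute weighted binary cross-entropy for the multi-label attribute targets.

```python
import torch
import torch.nn.functional as F

def offline_loss(item_logits, item_targets, attr_logits, attr_labels,
                 attr_weights, lam=0.1):
    """Sketch of L_offline = L_rec + lambda * L_ask over the masked
    slots: item prediction as multi-class CE, attribute prediction as
    weighted multi-label BCE so popular attributes do not dominate."""
    l_rec = F.cross_entropy(item_logits, item_targets)
    l_attr = F.binary_cross_entropy_with_logits(
        attr_logits, attr_labels, weight=attr_weights)
    return l_rec + lam * l_attr

# Toy shapes: 4 masked slots, 100 items, 20 attributes.
item_logits = torch.randn(4, 100)
item_targets = torch.randint(0, 100, (4,))
attr_logits = torch.randn(4, 20)
attr_labels = (torch.rand(4, 20) > 0.8).float()  # multiple labels can be 1
attr_weights = torch.ones(4, 20)                 # balance weights (w_p)
loss = offline_loss(item_logits, item_targets,
                    attr_logits, attr_labels, attr_weights)
```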
Furthermore, we pre-train part of the conversation manager π_c to decide whether to recommend or ask: for slot x, as long as the item agent π_b hits the target item, the label y_x is set to 1; otherwise, if the attribute agent hits the target, y_x is 0; y_x is set to -1 when no agent makes a successful prediction. We denote L_ask = L_cate + L_attr + L_conv.

Training Details.
Figure 2b illustrates Bunt offline training.We pre-train Bunt on offline user-bundle interactions, to obtain the basic knowledge to predict the following items or attributes given historical bundle interactions and conversational information.The training details are in Algorithm 1.

Online Fine-Tuning
Figure 2c shows the online fine-tuning environment, in which Bunt interacts with users to update the related parameters and improve accuracy. The online fine-tuning details are in Algorithm 2: the agents collect conversation experience into memories M, update the parameters Θ with RL methods (e.g., DQN [33], PPO [38]), and then reset the memories. We omit the details of the RL value networks, as in [27].

Evaluation Protocol and Metrics
Following [12,23,27], we conduct a leave-one-out data split (i.e., for each user, randomly select n_u - 1 bundles for offline training, with the last bundle used for online training, validation and testing, respectively, in a ratio of 6:2:2). We choose the multi-label precision, recall, F1, and accuracy defined in [50] to measure the quality of the generated bundle.
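For one generated bundle, a common reading of these multi-label metrics can be sketched as below (the exact definitions and averaging in [50] may differ; the Jaccard-style accuracy here is one standard variant):

```python
def bundle_metrics(pred, target):
    """Multi-label precision/recall/F1/accuracy for one generated bundle,
    comparing the predicted item set against the target bundle."""
    pred, target = set(pred), set(target)
    hit = len(pred & target)
    prec = hit / len(pred) if pred else 0.0
    rec = hit / len(target) if target else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    acc = hit / len(pred | target) if pred | target else 0.0  # Jaccard-style
    return {"precision": prec, "recall": rec, "f1": f1, "accuracy": acc}
```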

Datasets
We extend four datasets (see statistics in Table 2) for Bundle MCR. (1) Steam: this dataset, collected in [35] from the Steam platform, contains user interactions with game bundles. We use item tags as attributes and item genres as categories in Bundle MCR, and discard users with fewer than two bundles according to our evaluation protocol. (2) MovieLens: this is a benchmark dataset [17] for collaborative filtering tasks. We use the ML-10M version, treating movies rated at the same timestamp (second-level granularity) as a bundle. We treat the provided genres as categories and tags as attributes in Bundle MCR.
(3) Clothing: this dataset was collected in [32] from the Amazon e-commerce platform; we use the Clothing subcategory. We treat co-purchased items, grouped by timestamp, as a bundle. We use the item categories in the metadata as categories in Bundle MCR, and the style field in item reviews (style is a dictionary in the product metadata, e.g., "format" is "hardcover", from which we use "hardcover") as attributes. For MovieLens and Clothing, bundles are grouped by timestamp and thus noisy; to improve data quality, we filter out users and items that appear no more than three times. (4) iFashion: this is an outfit dataset with user interactions [8]. Similar to [44], we pre-process iFashion as a 10-core dataset to ensure data quality. We use the category features from the iFashion metadata, and tokenize titles as attributes in Bundle MCR.

Baselines
We introduce three groups of recommendation baselines to evaluate Bundle MCR and Bunt (we refer to the full model proposed in Section 5 as Bunt-Learn). More technical details of the baselines can be found in Section 2.

Traditional bundle recommenders.
Freq uses the most frequent bundle as the predicted bundle. BBPR [35]: considering the infeasible time cost of cold bundle generation in BBPR, we use BBPR to rank existing bundles. BGN [2] adopts an encoder-decoder [41] architecture to encode user historical interactions and generate a sequence of items as a bundle; we use the top-1 bundle in BGN's generated bundle list as the result. PoG [8] is a transformer-based [43] encoder-decoder model for generating personalized outfits; we use it for general bundle recommendation. BYOB [12] is the most recent bundle generator, using reinforcement learning methods.
6.3.2 Adopted individual recommenders for Bundle MCR. FM-All is an FM [36] variant used in MCR frameworks [26,27]; "All" means this model in Bundle MCR only recommends top-K items per round without asking any questions. FM-Learn follows the item predictions of FM-All, but uses the other pre-trained agents in Bunt for conversation management and question posting. EAR [26] and SCPR [27] are popular Individual MCR frameworks based on FM. We keep the core ideas of estimation-action-reflection in our EAR and of asking attributes by path reasoning in our SCPR, and rename them EAR* and SCPR* because some components are re-implemented to adapt to Bundle MCR.
We do not use the recent UNICORN [14] or MIMCR [53], because the unified action space in UNICORN is incompatible with generating multiple items or attributes per round in Bundle MCR, and the main contributions of MIMCR rely on the multiple-choice question setting, which is also incompatible with Bundle MCR.
For online fine-tuning, we implement Algorithm 2 using the OpenAI Stable-Baselines RL training code. We reuse Proximal Policy Optimization (PPO) [38] in Stable-Baselines to train the four agents (π_c, π_b, π_a, π_q) jointly (π_q is the category policy, similar to π_a) using the Adam optimizer with a learning rate of 1e-3. Other hyper-parameters follow the default settings in Stable-Baselines. We re-run all experiments three times with different random seeds and report the average performance with the related standard errors.
6.4.2 User Simulator Setup. Due to the difficulty and cost of interacting with real users, we mainly evaluate our frameworks with user simulators, similar to previous works [14,26,27,53]. We simulate a user with a target bundle B* in mind, sampled from our online dataset. To mimic real user behavior, the user simulator accepts system-provided items that agree with the target bundle B*, and accepts categories and attributes that agree with the potential target items in the current slot. The user simulator only explicitly rejects categories or attributes that are not associated with any items in B*. In other cases, the user simulator ignores the items, categories and attributes provided by the system. The user simulator terminates conversations when all items in B* have been recommended; otherwise, the system ends the conversation after t = T conversation rounds. We set the maximum number of conversation rounds T to 10 in our experiments.
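These simulator rules can be sketched as a small class (a hypothetical implementation; the class and method names are ours): accept target items, accept tags matching a still-missing target item, reject tags matching no target item, and ignore everything else.

```python
class SimulatedUser:
    """Sketch of the user simulator with a target bundle B* in mind."""
    def __init__(self, target_items, item_tags):
        self.target = set(target_items)
        self.item_tags = item_tags          # item id -> set of tags
        self.accepted = set()

    def react_item(self, item):
        # Accept items in B*; ignore others (items are never rejected).
        if item in self.target:
            self.accepted.add(item)
            return "accept"
        return "ignore"

    def react_tag(self, tag):
        # Accept tags agreeing with a not-yet-recommended target item;
        # reject tags associated with no item in B*; otherwise ignore.
        remaining = self.target - self.accepted
        if any(tag in self.item_tags[i] for i in remaining):
            return "accept"
        if all(tag not in self.item_tags[i] for i in self.target):
            return "reject"
        return "ignore"

    def done(self):
        # Terminate once every item in B* has been recommended.
        return self.accepted == self.target
```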

Main Performance of Bundle MCR and Bunt-Learn
Table 3 and Figure 3 show the main performance of our proposed framework and model architecture compared with other conversational recommendation baselines. We make the following observations.

Fig. 3. Bunt performance in one-shot setting, and cumulative accuracy curves.
6.5.1 Bunt Backbone Performance. Though we propose Bunt for the Bundle MCR task, we first show that Bunt is competitive in traditional one-shot bundle recommendation. Figure 3a shows Bunt outperforms classic bundle recommenders (i.e., BBPR) markedly, and is comparable to or sometimes better than recent bundle generators (e.g., BGN, PoG). In this regard, the Bunt backbone is shown to learn basic "bundle recommendation" knowledge much like other models.
6.5.2 Effectiveness of Bundle MCR. We show the effectiveness of Bundle MCR by comparing models (e) and (i) in Table 3. For example, the accuracy on the MovieLens data improves from 0.061 to 0.181. This indicates that, even given the same backbone model, introducing a conversational mechanism (Bundle MCR) can collect immediate feedback and improve recommendation performance. Also, we observe that the relative improvement on the other three datasets is higher than on Steam; for example, the relative improvement in accuracy is 61.56% on Steam compared to 196.72% on MovieLens. This shows that challenging datasets (e.g., sparser ones or those with a larger item space) gain more benefit from Bundle MCR, since it allows users to provide feedback during conversations.
6.5.3 Effectiveness of Bunt-Learn. We adapted several Individual MCR recommenders, (a)-(d) in Table 3, to bundle settings, in which the backbone model (FM [36]) recommends top-K items without considering bundle contexts. Compared with these Individual MCR recommenders, Bunt-Learn achieves the best performance. For example, compared with model (b), where we only replace the Bunt backbone with FM, ours improves accuracy from 0.239 to 0.727 on Steam. This shows that directly applying existing Individual MCR recommenders to Bundle MCR is not optimal, and also shows the benefits of our Bunt design. Moreover, compared with bundle recommenders that only recommend items (models (f)-(h)), ours introduces question asking and improves recommendation performance consistently, except for the recall score on iFashion. This is because some recommending rounds are replaced with asking rounds, so recall may drop given fewer recommendation rounds (the F1-Score is still improved).

6.5.4 Accuracy Curve with Conversation Rounds. The cumulative accuracy curves in Figure 3b show that Bunt-All achieves the best results in the early conversation rounds, but is then outperformed by Bunt-Learn. This is because Bunt-Learn requires several rounds to ask questions and elicit preferences; thus, in later rounds, Bunt-Learn can recommend more accurately and surpasses the baselines. For example, on MovieLens, Bunt-Learn outperforms the baselines after t = 6. This also indicates that the proposed multiple cloze pre-training tasks are suitable for training Bundle MCR models effectively.

Human Evaluation on Conversation Trajectories
Considering the cost of deploying real interactive Bundle MCR systems, similar to [21,46], we conduct a human evaluation by letting real users rate the conversation trajectories generated in Section 6. The results show the superiority of Bunt-Learn. Interestingly, the performance gap in the human evaluation is not as large as in the simulator results (e.g., on Steam, Bunt-Learn accuracy is 3x that of FM-Learn).

CONCLUSION AND FUTURE WORK

Fig. 2. (a) Bunt architecture illustration. Bunt is a BERT-like model which encodes long-term preferences and short-term contexts to infer the masked items, categories, and attributes for each slot k ∈ X^(t); slots whose related items are still unknown carry [MASK]. We define long-term preferences and short-term contexts in Section 4.1. (b) Bunt offline pre-training diagram, where ① denotes user modeling, ② mimics the consultation step, and ③ mimics the feedback handling step; instead of updating the conversation state at the next round, offline training simply re-masks the target bundle to generate the next masked bundle as Bunt input. (c) Bunt online training diagram, where ① is user modeling, ② is the consultation step that generates a partial bundle or attributes and categories, and ③ is the feedback handling step that updates the short-term contexts. We describe steps ①-③ in Sections 4 and 5, where, for ease of description, we show one policy network and omit the similar other policy.
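The re-masking step in (b) can be sketched as follows; this is our simplified illustration (the function name and list-of-strings bundle representation are hypothetical), not the paper's implementation:

```python
import random

MASK = "[MASK]"

def mask_bundle(target_bundle, ratio=0.5, rng=None):
    """Replace a random subset of the target bundle's slots with [MASK],
    producing the model input for one cloze-style prediction step."""
    rng = rng or random.Random(0)
    masked = list(target_bundle)
    n_mask = max(1, int(len(masked) * ratio))
    for idx in rng.sample(range(len(masked)), n_mask):
        masked[idx] = MASK
    return masked

bundle = ["shirt", "shoes", "pants", "hat"]
masked = mask_bundle(bundle, ratio=0.5)
print(masked.count(MASK))  # 2
```

In offline pre-training, the model predicts the masked slots, the target bundle is re-masked, and the process repeats, mimicking successive conversation rounds.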

6.3.3 Simple bundle recommenders for Bundle MCR. Bunt-One-Shot uses Bunt for traditional bundle recommendation, following the inference procedure of PoG [8]. The {BYOB, BGN, Bunt}-All models are simple bundle recommender implementations in Bundle MCR, which only recommend top-K items per round without asking any questions.

6.4 Experimental Setup

6.4.1 Training Details. Our training has two stages: (1) In offline pre-training, we follow Algorithm 1 to implement and train our Bunt model with PyTorch. The numbers of transformer layers and heads are searched from {1, 2, 4}; other hyperparameters are set to 32, 2, and 0.1, and the masking ratio is 0.5. We use an Adam [25] optimizer with an initial learning rate of 1e-3 for all datasets and a batch size of 32. The maximum bundle size is set to 20. (2) In online fine-tuning, we fine-tune the Bunt agents through interactions. Bunt performance is also compared with other bundle recommenders in one-shot bundle recommendation.
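For concreteness, the offline pre-training hyperparameters stated above can be collected in a single configuration; the dictionary keys are our own labels for the values given in the text:

```python
# Offline pre-training hyperparameters as stated in Section 6.4.1.
# Key names are our own; only the values come from the text.
PRETRAIN_CONFIG = {
    "layers_search_space": [1, 2, 4],   # transformer layers searched
    "heads_search_space": [1, 2, 4],    # attention heads searched
    "mask_ratio": 0.5,                  # cloze masking ratio
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "batch_size": 32,
    "max_bundle_size": 20,
}
```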

6.5.5 Bunt-Learn Component Analysis. Compared with (a), models (b)-(d) show the effectiveness of long-term preference and short-term context encoding; (e) indicates the importance of using bundle-aware models; (f)-(i) show the benefit of online fine-tuning, which helps the conversation-management policy most, because conversation management is hard to mimic in offline datasets.

From the Steam and MovieLens datasets, we sample 1000 pairs (in total) of conversation trajectories from <Bunt-Learn, SCPR*> or <Bunt-Learn, FM-Learn> (SCPR* and FM-Learn are the best baselines using individual item recommenders). Each pair of conversation trajectories is posted to collect 5 answers from MTurk workers, who are asked to measure subjective quality by browsing the conversations and selecting the better model from the given pair. We use the answers from high-quality workers (those who spend more than 30 seconds and whose LifeTimeAcceptanceRate is 100%) and count the majority vote per pair. In total, we collected 388 valid results: the <Bunt-Learn, SCPR*> votes are 121:88, and the <Bunt-Learn, FM-Learn> votes are 110:69.
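The worker-quality filter and per-pair majority vote described above can be sketched as follows (the record field names are hypothetical):

```python
from collections import Counter

def is_valid(worker: dict) -> bool:
    """Quality filter from the text: >30s spent and 100% acceptance rate."""
    return worker["seconds"] > 30 and worker["lifetime_acceptance_rate"] == 1.0

def majority_vote(answers: list) -> str:
    """Majority label among valid worker answers for one trajectory pair."""
    (label, _), = Counter(answers).most_common(1)
    return label

workers = [
    {"answer": "Bunt-Learn", "seconds": 45, "lifetime_acceptance_rate": 1.0},
    {"answer": "SCPR*", "seconds": 12, "lifetime_acceptance_rate": 1.0},  # too fast
    {"answer": "Bunt-Learn", "seconds": 60, "lifetime_acceptance_rate": 1.0},
]
valid = [w["answer"] for w in workers if is_valid(w)]
print(majority_vote(valid))  # Bunt-Learn
```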

Table 1. Functionality requirements in Bundle MCR model design and comparisons with individual MCR and bundle recommendation.
• Bundle generation: if recommending, the agent uses the conversation state to generate a (partial) bundle for the current round.
• Conversation management: the agent uses the conversation states to predict a binary action (recommending or asking).

Feedback handling updates the candidate pools and short-term contexts as follows: (i) a recommended item î is deleted from the candidate pools because it has already been recommended; (ii) if î is accepted, the short-term context of its slot assigns î as the confirmed item, and if an attribute â is accepted, the accepted-attribute set in the short-term context is updated by Ǎ^(t) ∪ {â}; (iii) if â is explicitly rejected, which happens only when the user strongly dislikes this attribute, â is removed from all attribute candidate pools, and the items associated with â are removed from all item candidate pools as well.

• Update slots X^(t+1): as Section 3 describes, if items are accepted, we remove the corresponding slots from X^(t) and create new slots to keep the slot size as K. For a new slot, the short-term context is ([MASK], ∅), and its candidate pools are the unions of the previous candidate pools, excluding the items or attributes that the user strongly dislikes.

Bunt backbone (fixed): we apply average pooling (AVG) on the embeddings to obtain the pooled representations.

5.1.3 Long- and Short-Term Representation Fusion. We feed the user's long-term preferences and short-term contexts into an L-layer transformer and obtain fused representations. On top of the fused representations, three policy networks are defined: the conversation-management policy predicts a binary action a ∈ {0, 1} (recommend or ask); the bundle-generation policy outputs a probability distribution over the item candidate pool; and the question-asking policy outputs a probability distribution over the attribute/category candidate pool. The management policy network linearly combines two MLP sub-models with a gating weight over the two state representations.

Algorithm 1: Bunt Offline Pre-Training. Input: historical user bundle interactions D, masking ratio, Bunt parameters Θ, slot size K. Output: Bunt parameters Θ after pre-training. While the training termination criterion is not met: sample a user u ∈ U, retrieve the user's historical bundles from D, and sample one historical bundle as the target bundle for the cloze-style masking and prediction steps.

Algorithm 2: Online Bunt Fine-Tuning. Input: trainable parameters of the three policy networks and three empty replay buffers. Output: fine-tuned policy network parameters. Our core idea is to fix the Bunt backbone parameters and fine-tune only the three agents during interactions with (real or simulated) users. For each episode, the transition tuples (state, next_state, action, reward) are pushed into the corresponding buffers, with rewards calculated via Section 4.4; an episode terminates when the target bundle is completed or the maximum round is reached. When a buffer meets a pre-defined training criterion (e.g., buffer size), the corresponding policy is updated.
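The feedback-handling rules (i)-(iii) can be sketched with a toy slot structure; this is our simplification (the dictionary layout and function names are hypothetical), assuming per-slot item and attribute candidate pools:

```python
MASK = "[MASK]"

def handle_item_feedback(slot: dict, item: str, accepted: bool) -> dict:
    """(i) never recommend the same item twice; (ii) fix accepted items in place."""
    slot["item_pool"].discard(item)   # (i) remove the recommended item from the pool
    if accepted:                      # (ii) assign the accepted item to this slot
        slot["context"] = item
    return slot

def handle_attribute_rejection(slots: list, attr: str, item_attrs: dict) -> list:
    """(iii) a strongly disliked attribute is purged from every candidate pool,
    together with all items carrying that attribute."""
    for slot in slots:
        slot["attr_pool"].discard(attr)
        slot["item_pool"] = {i for i in slot["item_pool"]
                             if attr not in item_attrs[i]}
    return slots

slot = {"context": MASK, "item_pool": {"i1", "i2"}, "attr_pool": {"a1", "a2"}}
handle_item_feedback(slot, "i1", accepted=True)
print(slot["context"], "i1" in slot["item_pool"])  # i1 False
```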

Table 2. Data statistics, where # denotes quantity, U denotes users, I denotes items, B denotes bundles, C denotes categories, and A denotes attributes. B/U represents the number of bundles per user; B size represents the average number of items per bundle.

Table 3. Bunt and other individual conversational recommendation methods adapted to bundle settings. The best results are in bold.

Table 4. Ablation studies (F1-score) to evaluate model architecture, fine-tuning (FT), and pre-training (PT); for instance, one partial "w/o PT" variant scores .815±.018 and .171±.002, while "(m) w/o PT All" drops to .003±.001 and .001±.001. The conversation-management policy, with only a binary action space, is easier for online learning than the other policies; (j)-(m) show pre-training is necessary, especially for the item policy, because bundle generation is challenging to learn directly from online interactions with RL.
In this work, we first extend existing Multi-Round Conversational Recommendation (MCR) settings to bundle recommendation scenarios, which we formulate as an MDP problem with multiple agents. Then, we propose a model architecture, Bunt, to handle bundle contexts in conversations. Lastly, to let Bunt learn bundle knowledge from offline datasets and an online environment, we propose a two-stage strategy that trains our Bunt model with multiple cloze tasks and multi-agent reinforcement learning, respectively. We show the effectiveness of our model and training strategy on four offline bundle datasets and through a human evaluation. Since ours is the first work to consider conversation mechanisms in bundle recommendation, many research directions can be explored in the future. In Bunt, the question spaces cover categories and attributes, so how to use free text in conversational bundle recommendation remains an open question. Meanwhile, how to explicitly incorporate item relationships (e.g., substitutes, complements) in conversational bundle recommendation is another interesting and challenging task. Moreover, since an individual item can be treated as a special bundle, it would be interesting to unify existing individual conversational recommenders into conversational bundle recommendation, i.e., augmenting conversational agents' abilities without extra cost.