Masked and Swapped Sequence Modeling for Next Novel Basket Recommendation in Grocery Shopping

Next basket recommendation (NBR) is the task of predicting the next set of items based on a sequence of already purchased baskets. It is a recommendation task that has been widely studied, especially in the context of grocery shopping. In next basket recommendation (NBR), it is useful to distinguish between repeat items, i.e., items that a user has consumed before, and explore items, i.e., items that a user has not consumed before. Most NBR work either ignores this distinction or focuses on repeat items. We formulate the next novel basket recommendation (NNBR) task, i.e., the task of recommending a basket that only consists of novel items, which is valuable for both real-world application and NBR evaluation. We evaluate how existing NBR methods perform on the NNBR task and find that, so far, limited progress has been made w.r.t. the NNBR task. To address the NNBR task, we propose a simple bi-directional transformer basket recommendation model (BTBR), which is focused on directly modeling item-to-item correlations within and across baskets instead of learning complex basket representations. To properly train BTBR, we propose and investigate several masking strategies and training objectives: (i) item-level random masking, (ii) item-level select masking, (iii) basket-level all masking, (iv) basket-level explore masking, and (v) joint masking. In addition, an item-basket swapping strategy is proposed to enrich the item interactions within the same baskets. We conduct extensive experiments on three open datasets with various characteristics. The results demonstrate the effectiveness of BTBR and our masking and swapping strategies for the NNBR task. BTBR with a properly selected masking and swapping strategy can substantially improve NNBR performance.


INTRODUCTION
Next basket recommendation is a type of sequential recommendation that aims to recommend the next basket, i.e., set of items, to users given their historical basket sequences. Recommendation in a grocery shopping scenario is one of the main use cases of the NBR task, where users usually purchase a set of items instead of a single item to satisfy their diverse needs. Many methods, based on a broad range of underlying techniques (e.g., RNNs [3,16,21,27,43], self-attention [4,32,45], and denoising via contrastive learning [27]), have been proposed for, and achieve good performance on, the NBR task.
Next novel basket recommendation. A recent study [24] offers a new evaluation perspective on the next basket recommendation (NBR) task by distinguishing between repetition (i.e., recommending items that users have purchased before) and exploration (i.e., recommending items that are new to the user) tasks in NBR, and points out the imbalance in difficulty between the two tasks. According to the analysis of existing methods in [24], the performance of many existing NBR methods mainly comes from being biased towards (i.e., giving more resources to) the repetition task and sacrificing the ability of exploration. Building on these insights, recent work on NBR has seen a specific focus on the pure repetition task [e.g., 20] as well as the introduction of specific methods for the repetition task [1,20].
Novelty and serendipity are two important objectives when evaluating recommendation performance [13,18]. People might simply get tired of repurchasing the same set of items. Even when they engage in a considerable amount of repetition, there is still a large proportion of users who would like to try something new when shopping for groceries [24].
This phenomenon is especially noticeable for users with fewer transactions in their purchase history [1]. Therefore, one of the key roles of recommender systems is to assist users in discovering potential novel items that align with their interests. However, in contrast to the pure repetition task, the pure exploration task in NBR remains under-explored.
Besides, due to the difference in difficulty between the two tasks, many online e-commerce and grocery shopping platforms have started to design a "buy it again" service to isolate repeat items from the general recommendation (see footnotes 1 and 2). Motivated by the research gaps and real-world demands, we formulate the next novel basket recommendation (NNBR) task, which focuses on recommending a novel basket, i.e., a set of items that are new to the user, given the user's historical basket sequence. Different from the repetition task, which predicts the probability of repurchase from a relatively small set of items, the NNBR task needs to predict possible items from many thousands of candidates by modeling item-item correlations, which is more complex and difficult [24]. NNBR is especially relevant to the "Try Something New" concept in the grocery shopping scenario. Table 1 shows differences between three types of basket recommendation and positions our work.
From NBR to NNBR. The NNBR task can be seen as a sub-task of the conventional NBR task, in which NBR methods are designed to find all possible items (both repeat items and novel items) in the next basket. Therefore, it is possible to generate a novel basket by selecting only the top-$K$ novel items predicted by NBR methods. To modify NBR methods for the NNBR task, an intuitive solution is to remove the repeat items from the ground-truth labels and train models only on the novel items in the ground-truth labels. Given this obvious strategy and given that many methods have already been proposed for NBR, an important question is: If we already have an NBR model, do we need to train another model specifically for the NNBR task? Surprisingly, we find that training specifically for exploration does not always lead to better performance in the NNBR task, and might even reduce performance in some cases.
Footnote 1: After login, users may see a "buy it again" page on e-commerce platforms (see, e.g., Amazon, https://amazon.com) and grocery shopping platforms (see, e.g., Picnic, https://picnic.app), where the platform collects repeat items. Similarly, in the grocery shopping scenario, "Try Something New" services also exist, where only novel items are recommended to the user.
Footnote 2: See, e.g., http://community.apg.org.uk/fileUploads/2007/Sainsburys.pdf for an example of the "Try Something New" concept in offline retail, and the Weekly New Recipe service at https://ah.nl/allerhande/wat-eten-we-vandaag/weekmenu for an example in online retail.
BTBR: A bi-directional transformer basket recommendation method. In NNBR, item-to-item correlations are especially important, since we need to infer the utility of new items based on previously purchased items. Besides, a single basket is likely to address diverse needs of a user [37]. For example, what a user would like to drink is more likely to depend on what they drank before than on the toothpaste they previously purchased. However, most existing NBR approaches [16,21,27,43,45] are two-stage methods, which first generate a basket-level representation [35] and then learn a temporal model based on basket-level representations; this leads to information loss w.r.t. item-to-item correlations [21,32,45]. Some methods [21,32,45] learn partial item-to-item correlations based on the co-occurrence within the same or adjacent baskets as auxiliary information beyond basket-level correlation learning. Instead of learning or exploiting complex basket representations, we learn item-to-item correlations from direct interactions among different items across different baskets. To do so, we propose a bi-directional transformer basket recommendation model (BTBR) that adopts a bi-directional transformer [36] and uses a shared basket position embedding to indicate items' temporal information.
Masking and training. To properly train BTBR, we propose and investigate several masking strategies and training objectives at different levels and tasks, as follows: (i) item-level random masking: a cloze-task loss [8,34], in which we randomly mask the historical sequence at the item level; (ii) item-level select masking: a cloze-task loss designed for exploration, in which we first select the items we need to mask and then mask all the occurrences of the selected item; (iii) basket-level all masking: a general basket recommendation task loss, in which we mask and predict the complete last basket at the end of the historical sequence; (iv) basket-level explore masking: an explore-specific basket recommendation task loss, in which we remove the repeat items and only mask the novel items in the last basket of the historical sequence; and (v) joint masking: a loss that follows the pre-train-then-fine-tune paradigm, in which we first adopt item-level masking for the cloze task, then fine-tune the model using basket-level masking.
In addition, conventional sequential item recommendation usually assumes that the items in a sequence are strictly ordered and sequentially dependent. However, recent work [e.g., 5,26,39,42] argues that the items may occur in any order, i.e., the order is flexible, and ignoring flexible orders might lead to information loss. Similarly, it is unclear whether the items that are being purchased across baskets have a strict order in the grocery shopping scenario. Thus, we propose an item swapping strategy that allows us to randomly move an item to another basket according to a certain ratio, which can enrich item interactions within the same basket.
We conduct extensive experiments on three publicly available grocery datasets to understand the effectiveness of the BTBR model and the proposed strategies on datasets with various repeat ratios and characteristics.
Our contributions. The main contributions of this paper are:
• To the best of our knowledge, we are the first to formulate and investigate the next novel basket recommendation (NNBR) task, which aims to recommend a set of novel items that meets a user's preferences in the next basket.
• We investigate the performance of several representative NBR methods w.r.t. the NNBR task and find (i) that training specifically for the exploration task does not always lead to better performance, and (ii) that limited progress has been made w.r.t. the NNBR task.
• We propose a simple bi-directional transformer basket recommendation (BTBR) model that learns item-to-item correlations across baskets.
• We propose several types of masking and item swapping strategies for optimizing BTBR for the NNBR task. Extensive experiments are done on three open grocery shopping datasets to assess the effectiveness of the proposed strategies. BTBR with a proper masking and swapping strategy is the new state-of-the-art method w.r.t. the NNBR task.

RELATED WORK
In this section, we describe two lines of research in the recommender systems literature that are related to our work: sequential recommendation and next basket recommendation.
However, BERT4Rec and follow-up work only focus on item sequential recommendation with only random masking during training [44]. We extend BERT4Rec to the basket sequence setting and propose several types of masking strategies and training objectives that are specifically designed for the NNBR task. Furthermore, in this work we study the next novel basket recommendation task, where both historical interactions and the predicted target are baskets (sets of items). None of the sequential recommendation models listed above have been designed to handle a sequence of baskets.
Next basket recommendation. Next basket recommendation is a sequential recommendation task that addresses the sequence of baskets in the grocery shopping scenario. Existing methods can be classified into three types: frequency neighbor-based methods [9,17], Markov chain (MC)-based methods [28], and deep learning-based methods [1, 3, 4, 16, 20-22, 27, 32, 37, 38, 40, 43, 45]. Recently, Li et al. [24] have evaluated and assessed NBR performance from a new repetition and exploration perspective. They find that the repetition task, i.e., recommending repeat items, is much easier than the exploration task, i.e., recommending explore items (a.k.a. novel items in this paper), and that the improvements of many recent methods come from the performance of the repetition task rather than from better capturing correlations among items. Inspired by this finding, an NBR method [1] that only models the repetition behavior has been proposed, and an NBRR task [20] that only focuses on recommending repeat items has been formulated.
In this paper, we propose and formulate the next novel basket recommendation task that focuses on recommending novel items to the user, whereas all of the NBR methods mentioned above focus on the conventional NBR task, and their performance when generalized to the NNBR task remains unknown.

TASK FORMULATION
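In this section, we describe and formalize the next novel basket recommendation task, which is the focus of this paper. Formally, we are given a set of users $U = \{u_1, u_2, \ldots, u_n\}$ and a set of items $I$. For each user $u$, the goal is to predict, from the user's historical basket sequence, the next novel basket $P_u$, where $P_u$ denotes a recommended item list that only consists of the novel items $I^{novel}_u$ of user $u$.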

OUR METHOD
In this section, we first describe the base bi-directional transformer basket recommendation model (BTBR) we use, then introduce several types of masking strategies for the NNBR task, and finally describe an item swapping strategy.

Bi-directional transformer basket recommendation model
Learning basket representations [35] and modeling temporal dependencies across baskets are two key components in almost all neural-based NBR methods. Many NBR methods introduce complex architectures to learn representations for baskets in grocery shopping [4,21,27,32,45]. Instead of proposing more complex architectures to learn better basket representations and temporal dependencies, we want to simplify the model and only focus on the item-level correlations across different baskets, which helps us to infer novel items from users' historical items.
As a widely used method to model temporal dependencies, a recurrent neural network (RNN) [6,10] requires passing information sequentially according to the temporal order, whereas there is no temporal order for items within the same basket, and basket-level representations at each timestamp are required [16,21,27,43]. An alternative is the self-attention mechanism (a.k.a. transformer) [36], which is capable of learning the representations of every position by exchanging information across all positions. Therefore, we adopt the bi-directional transformer [8,36] as the backbone of our BTBR model, which not only allows us to learn item-to-item correlations from the direct interactions among items across different baskets but is also able to handle basket sequence information in grocery shopping. The overall architecture of BTBR is shown in Figure 1.
Embedding layer. In order to use transformers [36] for NNBR, we first transfer a basket sequence to an item sequence via a "flatten" operation. It has been shown that the positions of items are informative in the sequential recommendation scenario [19,31]. Different from solutions in conventional item sequential recommendation, where each item is combined with a unique position embedding w.r.t. its position in the item sequence, we use a learnable position embedding for every basket, and items within the same basket share the same position embedding. For example, given a basket sequence $S = [\{i_1, i_2\}, \{i_1, i_3, i_4\}, \{i_4, i_5\}]$, we first flatten $S$ and get a sequence of item embeddings $[e_{i_1}, e_{i_2}, e_{i_1}, e_{i_3}, e_{i_4}, e_{i_4}, e_{i_5}]$ and a position embedding sequence $[p_1, p_1, p_2, p_2, p_2, p_3, p_3]$. Finally, the input sequence of the transformer layer is $[e_{i_1} + p_1, e_{i_2} + p_1, e_{i_1} + p_2, e_{i_3} + p_2, e_{i_4} + p_2, e_{i_4} + p_3, e_{i_5} + p_3]$. Note that padding and truncating operations are also employed to handle sequences of various lengths.
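To make the flatten operation and the shared basket position concrete, the following is a minimal sketch rather than the authors' implementation. The function name `flatten_baskets`, the padding id `PAD = 0`, left-padding, and truncation to the most recent items are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

PAD = 0  # hypothetical padding id for both items and positions (assumption)


def flatten_baskets(
    baskets: Sequence[Sequence[int]], max_len: int
) -> Tuple[List[int], List[int]]:
    """Flatten a basket sequence into item ids and shared basket position ids."""
    items: List[int] = []
    positions: List[int] = []
    for basket_idx, basket in enumerate(baskets, start=1):
        for item in basket:
            items.append(item)
            # Every item in the same basket gets the same position id.
            positions.append(basket_idx)
    # Truncate by keeping the most recent items (assumed policy).
    items, positions = items[-max_len:], positions[-max_len:]
    # Left-pad short sequences up to max_len (assumed policy).
    pad = max_len - len(items)
    return [PAD] * pad + items, [PAD] * pad + positions


item_ids, position_ids = flatten_baskets([[1, 2], [1, 3, 4], [4, 5]], max_len=8)
print(item_ids)      # [0, 1, 2, 1, 3, 4, 4, 5]
print(position_ids)  # [0, 1, 1, 2, 2, 2, 3, 3]
```

In a model, the two id lists would index an item embedding table and a position embedding table, and the looked-up vectors would be added element-wise to form the transformer input.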
Bi-directional transformer layer. The transformer architecture contains two sub-layers:
(1) Multi-head attention layer, which adopts the popular attention mechanism [36] and aggregates all items' embeddings across different baskets with adaptive weights.
(2) Point-wise feed-forward layer, which aims to endow nonlinearity and interactions between different latent dimensions.
We use stacked transformer layers to learn more complex item-to-item correlations, that is:
$$H^{l} = \mathrm{Trm}(H^{l-1}), \quad l = 1, \ldots, L,$$
where $\mathrm{Trm}$ denotes the bi-directional transformer layer, $H^{0}$ is the input embedding sequence, $H^{L} = [h^{L}_{1}, h^{L}_{2}, \ldots, h^{L}_{N}]$ denotes the representation sequence derived from the last transformer layer, and $N$ denotes the maximum length of the input sequence. Besides, residual connections [11], dropout [30], layer normalization [2], and GELU activation [12] are adopted to enhance the ability of representation learning. For more details about the bi-directional transformer layer, we refer to [19,31,36].
Prediction layer. After hierarchically exchanging information of all items across baskets using the transformer, we get $H^{L} \in \mathbb{R}^{N \times d}$, where $d$ is the embedding size, which contains the corresponding representations $h$ for all items in the input sequence. Following [19,31], we use the same item embedding $E \in \mathbb{R}^{|I| \times d}$ as the input layer to reduce the model size and alleviate the overfitting problem. For a masked position (item), we get its learned representation $h \in \mathbb{R}^{d}$ and compute the interaction probability distribution $p$ of candidate items by:
$$p = \mathrm{softmax}(h E^{\top} + b), \tag{3}$$
where $E$ is the embedding matrix for candidate items and $b$ denotes a bias term.

Masking strategy
Since there are repetition signals in the basket sequence, it is unclear whether these signals are merely noise/shortcuts or contain valuable information for the task of recommending novel items. After constructing the base model (BTBR), the challenging problem that needs to be addressed is how to properly train the model to improve its ability to find novel items that meet users' interests. In this section, we propose four types of alternative masking strategies for the next novel basket recommendation task by considering different tasks and levels, as well as the repetition-exploration signals. Figure 2 shows examples of four types of masking strategies and Table 2 shows the characteristics of different training strategies.
Fig. 2. The original basket sequence (at the top) and four types of masking strategies.
Cloze task. The first type of training objective is a cloze task [34], i.e., the "masked language model" in [8]. Specifically, we mask a proportion of items in the input sequence, i.e., replace each of them with a "mask token," and then try to predict the original items based on their contexts. We call this masking "item-level." Two main advantages of this item-level masking and training strategy are that (i) it allows us to generate more item-level training samples by breaking the definition of "basket," and (ii) it learns from both sides' information via the bi-directional transformer, which might allow the model to better capture item-to-item correlations. We first introduce two item-level masking strategies as follows:
(1) Random: This is a conventional masking strategy, which has been adopted in BERT4Rec [31]. Specifically, given a flattened item sequence, we randomly select several positions of the sequence according to the mask ratio and mask the items at the selected positions to form the input.
(2) Select: One potential issue w.r.t. the above Random masking is that the masked items (prediction targets) might still exist in the non-masked positions, so the model might mainly predict the masked item via its repetition information rather than inferring new items based on item-to-item correlations. Therefore, we propose the select masking strategy, which is specifically targeted at the exploration demand of the NNBR task. Specifically, given a flattened item sequence, we first derive the set of distinct items in this sequence, then randomly select several of these items according to the mask ratio, and finally mask all the occurrences of each selected item in the sequence. Since there is no repetition information available, the model can only infer the targeted items, i.e., novel items, via learning the item-to-item correlations.
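The difference between the two item-level strategies can be illustrated with a small sketch. This is not the authors' code: the names `random_mask`, `select_mask`, and `MASK`, and the choice to mask a fixed count of `round(n * ratio)` positions or items (rather than, say, masking each one independently with probability `ratio`) are assumptions.

```python
import random
from typing import List, Optional, Tuple

MASK = -1  # hypothetical id for the mask token (assumption)

Masked = Tuple[List[int], List[Optional[int]]]


def random_mask(seq: List[int], ratio: float, rng: random.Random) -> Masked:
    """Item-level random masking: mask a share of the positions."""
    n_mask = max(1, round(len(seq) * ratio))
    chosen = set(rng.sample(range(len(seq)), n_mask))
    inputs = [MASK if pos in chosen else item for pos, item in enumerate(seq)]
    # Labels hold the original item at masked positions and None elsewhere.
    labels = [item if pos in chosen else None for pos, item in enumerate(seq)]
    return inputs, labels


def select_mask(seq: List[int], ratio: float, rng: random.Random) -> Masked:
    """Item-level select masking: mask every occurrence of the chosen items."""
    distinct = sorted(set(seq))
    n_mask = max(1, round(len(distinct) * ratio))
    chosen = set(rng.sample(distinct, n_mask))
    inputs = [MASK if item in chosen else item for item in seq]
    labels = [item if item in chosen else None for item in seq]
    return inputs, labels


rng = random.Random(0)
sequence = [1, 2, 1, 3, 4, 4, 5]  # flattened basket sequence
print(random_mask(sequence, 0.3, rng))
print(select_mask(sequence, 0.3, rng))
```

With random masking, a masked item such as 1 or 4 can still appear unmasked elsewhere in the input. With select masking, no target item remains visible, so the targets are novel with respect to the remaining sequence.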
Basket recommendation task. Using the cloze task as the learning objective has limitations: (i) it is not able to fully respect the temporal dependencies of a sequence, since we can only use the historical information (left-side context) when we make the recommendation; and (ii) it is not specifically designed for the basket recommendation task and a mismatch might exist. Therefore, the second type of training objective we consider is the basket recommendation task, which masks the input sequence at the basket level instead of the item level. Specifically, we mask the last basket and try to predict the items in this basket only based on the historical items (left-side information). Similarly, we propose another two basket-level masking strategies as follows:
(3) All: This masking strategy can be regarded as optimizing the model for the NBR task. Given a flattened item sequence, we find and mask all items, i.e., both novel items and repeat items, in the last basket.
(4) Explore: This is an NNBR-specific masking strategy. Given a flattened item sequence, we find the items in the last basket and, instead of masking all of them, we only mask the novel items $i \in I^{novel}$ and remove the repeat items $i \in I^{rep}$. The model is then optimized only for finding all novel items in the future based on the historical basket sequence.
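The two basket-level strategies differ only in which items of the last basket become prediction targets. The sketch below illustrates this under stated assumptions: the function name `mask_last_basket`, the `MASK` id, and the use of one mask token per target item (all sharing the last basket's position id) are illustrative choices, not details confirmed by the text.

```python
from typing import List, Sequence, Tuple

MASK = -1  # hypothetical id for the mask token (assumption)


def mask_last_basket(
    baskets: Sequence[Sequence[int]], explore_only: bool
) -> Tuple[List[int], List[int], List[int]]:
    """Return (input items, basket position ids, target items)."""
    history, last = baskets[:-1], baskets[-1]
    seen = {item for basket in history for item in basket}
    # "All" keeps every item of the last basket as a target;
    # "Explore" drops repeat items and keeps only novel ones.
    targets = [item for item in last if not (explore_only and item in seen)]
    items = [item for basket in history for item in basket]
    positions = [idx for idx, basket in enumerate(history, start=1) for _ in basket]
    # One mask token per target, all sharing the last basket's position id.
    items += [MASK] * len(targets)
    positions += [len(baskets)] * len(targets)
    return items, positions, targets


sequence = [[1, 2], [1, 3, 4], [4, 5]]
print(mask_last_basket(sequence, explore_only=False))
# ([1, 2, 1, 3, 4, -1, -1], [1, 1, 2, 2, 2, 3, 3], [4, 5])
print(mask_last_basket(sequence, explore_only=True))
# ([1, 2, 1, 3, 4, -1], [1, 1, 2, 2, 2, 3], [5])
```

In the example, item 4 already appears in the second basket, so Explore masking removes it from the last basket and keeps only item 5 as a target.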
Joint task. The pretrain-then-finetune paradigm has been widely adopted in NLP tasks. It is worth noting that item-level masking (the cloze task) and basket-level masking (the basket recommendation task) can also be combined into a joint masking strategy that applies the pretrain-then-finetune paradigm to NNBR: it first uses an item-level masking strategy (i.e., a self-supervised task) to learn item correlations in the pre-training stage and then employs a basket-level masking strategy (i.e., a supervised task) to fine-tune the model for the basket recommendation task.
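Loss. Following [31], we minimize the negative log-likelihood as the training objective:
$$\mathcal{L} = \frac{1}{|I_m|} \sum_{i \in I_m} -\log P(i \mid S_m, j),$$
where $I_m$ is the masked item set, $S_m$ is the masked input sequence, and $P(i \mid S_m, j)$ is the predicted probability of item $i$ at position $j$.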
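As a worked illustration of this objective, the sketch below averages the negative log-likelihood over masked positions only. It is a plain-Python sketch under assumptions: the names `softmax` and `masked_nll` are hypothetical, labels are item indices into the score vector, and averaging over the masked positions follows the BERT4Rec-style convention rather than a detail stated in the text.

```python
import math
from typing import List, Optional, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over candidate item scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_nll(
    scores_per_position: Sequence[Sequence[float]],
    labels: Sequence[Optional[int]],
) -> float:
    """Mean negative log-likelihood over masked positions.

    labels[j] is the index of the true item at position j, or None if
    position j was not masked (such positions do not contribute).
    """
    losses = []
    for scores, label in zip(scores_per_position, labels):
        if label is None:
            continue
        losses.append(-math.log(softmax(scores)[label]))
    return sum(losses) / len(losses)


# Two positions, four candidate items; only the first position is masked.
loss = masked_nll([[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]], [2, None])
print(loss)  # log(4) for a uniform prediction over four items
```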
Test and prediction. To predict a future basket (a set of items), we only need to add one masked token at the end of the user's item sequence, since items within the same basket share the same position embedding. In the NNBR task, the candidate items are the novel items $I^{new}$ that the user has not bought before; thus, we use the embedding matrix w.r.t. the novel items of every user to compute the probabilities according to Eq. 3. Finally, we select the top-$K$ novel items with the highest scores as the recommendation list of the next novel basket.
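The prediction step can be sketched as scoring only the items the user has not bought and keeping the $K$ best. This is an illustrative sketch: `recommend_novel`, the dictionary-based embedding table, and the tie-breaking by item id are assumptions. Because softmax is monotonic, ranking by the raw scores gives the same order as ranking by the probabilities.

```python
from typing import Dict, Iterable, List, Sequence


def recommend_novel(
    h: Sequence[float],
    item_emb: Dict[int, Sequence[float]],
    bias: Dict[int, float],
    history_items: Iterable[int],
    k: int,
) -> List[int]:
    """Rank items the user has never bought by the score h . e_i + b_i."""
    seen = set(history_items)
    scores = {}
    for item, emb in item_emb.items():
        if item in seen:
            continue  # repeat items are excluded from the candidate set
        scores[item] = sum(a * b for a, b in zip(h, emb)) + bias.get(item, 0.0)
    # Highest score first; ties broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


embeddings = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0], 4: [0.5, 0.5]}
# h stands in for the representation of the single masked token.
print(recommend_novel([1.0, 2.0], embeddings, {}, history_items=[3], k=2))  # [2, 4]
```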

Swapping strategy
In sequential recommendation, some work [5,26,39,42] argues that the items in a sequence may not be sequentially dependent and that different item orders may actually correspond to the same user intent. Ignoring flexible orders in sequential recommendation might lead to less accurate recommendations for scenarios where many items are not sequentially dependent [26,39,44]. In grocery shopping, the items purchased across different baskets might not follow a rigid order. To understand whether considering flexible orders among items could further improve performance w.r.t. the NNBR task, we propose an item swapping strategy to create augmentations for BTBR.
Specifically, as illustrated in Figure 3, we randomly select items according to a swap ratio and then move them to another basket to enrich the items' interactions within the same basket. Besides, we introduce a hyper-parameter, the swap hop, to control the basket distance of the swapping strategy. Note that we only perform the local swap strategy when using item-level masking (the cloze task) to train the model, since basket-level masking (the basket recommendation task) is designed to respect the sequential order and predict the future basket based on historical information.
Fig. 3. An example of the item swapping strategy.
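One possible reading of the swapping strategy is sketched below. Several details are assumptions, since the text does not specify them: each item is moved independently with probability `swap_ratio`, the destination is drawn uniformly from baskets at most `swap_hop` positions away, an item is not duplicated if the destination already contains it, and baskets that become empty are dropped. The function name `swap_items` is hypothetical.

```python
import random
from typing import List, Sequence


def swap_items(
    baskets: Sequence[Sequence[int]],
    swap_ratio: float,
    swap_hop: int,
    rng: random.Random,
) -> List[List[int]]:
    """Move randomly selected items to a basket within swap_hop positions."""
    new_baskets = [list(basket) for basket in baskets]
    n = len(baskets)
    for src, basket in enumerate(baskets):
        for item in basket:
            if rng.random() >= swap_ratio:
                continue  # this item stays where it is
            lo, hi = max(0, src - swap_hop), min(n - 1, src + swap_hop)
            candidates = [j for j in range(lo, hi + 1) if j != src]
            if not candidates:
                continue  # no other basket within reach
            dst = rng.choice(candidates)
            new_baskets[src].remove(item)
            if item not in new_baskets[dst]:
                new_baskets[dst].append(item)
    # Drop baskets that became empty (assumed policy).
    return [basket for basket in new_baskets if basket]


rng = random.Random(0)
print(swap_items([[1, 2], [1, 3, 4], [4, 5]], swap_ratio=0.3, swap_hop=1, rng=rng))
```

The augmented sequence would then be flattened and masked at the item level, in line with the note that swapping is only applied for the cloze task.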

Research questions
To understand the next novel basket recommendation task, and evaluate the performance of BTBR with different strategies, we conduct experiments to answer the following questions:
RQ1 How do existing NBR models perform w.r.t. the NNBR task? Does training specifically for the NNBR task lead to better performance?
RQ2 How does BTBR with different masking strategies perform compared to the state-of-the-art models?
RQ3 Does the swapping strategy contribute to the improvements?
RQ4 How do the hyper-parameters influence the models' performance, and how do different masking strategies affect the training dynamics?
RQ5 Is the joint masking strategy more robust than using the single masking strategy?

Experimental setup
Datasets. We evaluate the NNBR task on three publicly available grocery shopping datasets (TaFeng, Dunnhumby, and Instacart), which vary in their repetition and exploration ratios. Following [24], we sample users whose basket length is between 3 and 50, and remove the least frequent items in each dataset. We also focus on the fixed size (10 or 20) next novel basket recommendation problem. In our experiments, we split each dataset across users, with 80% for training and 20% for testing, and hold out 10% of the training users as the validation set. We repeat the splitting and experiments five times and report the average performance. The statistics of the processed datasets are shown in Table 3.
Baselines. We investigate the performance of six NBR baselines, which we select based on their performance on our chosen datasets in the analysis performed in [24]. Importantly, for a fair comparison, we do not include methods that leverage additional information [3,4,32].
• G-TopFreq: A simple and effective method that recommends the top-$K$ most popular items in the dataset as the next basket for users.
• TIFUKNN: A state-of-the-art method that models the temporal dynamics of users' past baskets by using a KNN-based approach based on personalized item frequency (PIF) information [17].
• Dream: An RNN-based method that obtains basket representations using a pooling strategy and employs an RNN to model sequential behavior [43].
• Beacon: An RNN-based method that uses an RNN to capture sequential behavior and a correlation-sensitive basket encoder to consider intra-basket item correlations [21].
• DNNTSP: A state-of-the-art method that utilizes a graph neural network (GNN) and self-attention mechanisms to encode item-item relations across baskets and capture temporal dependencies [45].
• CLEA: A state-of-the-art method that uses contrastive learning and a GRU-based encoder to denoise and automatically extract items relevant to the target item [27].
Note that for the above baseline models (except G-TopFreq), we have two versions with different training methods, i.e., using all items in the last basket as training labels (Train-all), and only using novel items in the last basket as training labels (Train-explore).
Configurations.For the training-based baseline methods and TIFUKNN, we strictly follow the hyper-parameter setting and tuning strategy of their respective original papers.The embedding size is tuned on {16, 32, 64, 128} for all training-based methods based on the validation set to achieve their best performance.
We use PyTorch to implement our model and train it using a TITAN X GPU with 12 GB of memory. For BTBR, we set the number of self-attention layers to 2 and the number of heads to 8, and tune the embedding size on {16, 32, 64, 128}. The Adam optimizer with a learning rate of 0.001 is used to update parameters. We set the batch size to 128 for the Tafeng and Dunnhumby datasets, and 64 for the Instacart dataset; we sweep the mask ratio in {0.1, 0.3, 0.5, 0.7, 0.9}, the local swap ratio in {0, 0.1, 0.3, 0.5, 0.7, 0.9}, and the swap hop in {1, 3, 5, 7, 9}.
Metrics. Two widely used metrics for the NBR problem are Recall@$K$ and NDCG@$K$. In the NNBR task, Recall measures the ability to find all novel items that a user will purchase in the next basket; NDCG is a ranking metric that also considers the order of these novel items, i.e.,
$$\mathrm{Recall@}K = \frac{1}{|N|} \sum_{u \in N} \frac{|P_u \cap T^{novel}_u|}{|T^{novel}_u|},$$
$$\mathrm{NDCG@}K = \frac{1}{|N|} \sum_{u \in N} \frac{\sum_{k=1}^{K} p_k / \log_2(k+1)}{\sum_{k=1}^{\min(K, |T^{novel}_u|)} 1 / \log_2(k+1)},$$
where $N$ is the set of users who will purchase novel items in their next basket, $T^{novel}_u$ is the set of ground-truth novel items of user $u$, and $p_k$ equals 1 if $P^k_u \in T^{novel}_u$ and 0 otherwise; $P^k_u$ denotes the $k$-th item in the predicted basket $P_u$. Note that some methods might assign high scores to repeat items [24]. To generate a novel basket, we fully remove the repeat items and then rank and select only the top-$K$ novel items as the recommended basket $P_u$. This ensures a fair comparison, since every recommended basket consists only of the top-$K$ novel items.
To answer RQ1, we employ two training strategies for each baseline method: (i) Train-all: we keep both repeat items and explore items as part of the ground-truth labels during training, which means that the model is trained to find all possible items in the next basket; and (ii) Train-explore: we remove the repeat items and only keep novel items in the ground-truth labels during training, which means the model is specifically trained to find novel items in the next basket. For the NNBR performance evaluation, we assess the models' ability to find novel items, which means the recommended novel basket consists of the top-$K$ novel items and contains no repeat items. We report the experimental results of different baseline methods in Table 4. We have three main findings.
First, we see that no method consistently outperforms all other methods across all datasets. On the Tafeng dataset, several NN-based methods (Dream-all, Dream-explore, Beacon-all, Beacon-explore, DNNTSP-all, CLEA-explore) fall in the top-tier group with quite good performance. On the Dunnhumby dataset, Beacon-explore achieves the best performance w.r.t. all metrics. On the Instacart dataset, TIFUKNN-explore is among the best-performing methods, which means that well-tuned neighbor-based models may outperform complex neural-based methods on some datasets w.r.t. the NNBR task [7,24]. The performance of G-TopFreq is clearly the worst on the Tafeng and Dunnhumby datasets; however, it is quite competitive on the Instacart dataset, which indicates that popularity information is very important w.r.t. the NNBR task in a scenario with a high repeat ratio.
Second, the improvements that recent methods achieve on the NBR task do not always generalize to the NNBR task.
Recently proposed methods (TIFUKNN, CLEA, DNNTSP) have surpassed the previous classic baselines (i.e., G-TopFreq, Dream, Beacon) by a large margin on the conventional NBR task [17,24,27,45], whereas the improvements are relatively small or even missing on some datasets when handling the NNBR task. This indicates that the recently proposed methods make limited progress on finding novel items for the user and that their improvements mainly come from repeat recommendation, which is consistent with the findings in [24].
Third, the change in NNBR performance when switching from Train-all to Train-explore differs across methods.
Training and tuning existing NBR methods specifically for the NNBR task leads to significant or mild improvements in most cases, since the models do not need to deal with the repetition task and are more targeted at finding novel items that meet users' preferences. Surprisingly, we find that DNNTSP-explore's performance is much worse than DNNTSP-all on the Tafeng and Dunnhumby datasets. We suspect that the underlying reason for this deterioration is that the repeat items (labels) contain useful item-to-item correlation signals that can be captured by DNNTSP. Since various NBR methods have distinct architectures, certain methods may gain more from tailored training for exploration, while others can grasp item-item correlations from repeat labels. Consequently, it is unwise to indiscriminately eliminate repeat labels during training.
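For reference, the per-user Recall@K and NDCG@K computations used in this evaluation can be sketched as follows. This is a minimal sketch, not the evaluation code behind the reported tables: the function names are hypothetical, the predicted list is assumed to already contain only novel items, and the per-user values would still be averaged over the users who purchase novel items in their next basket.

```python
import math
from typing import Sequence, Set


def recall_at_k(predicted: Sequence[int], truth: Set[int], k: int) -> float:
    """Share of ground-truth novel items found in the top-k prediction."""
    hits = sum(1 for item in predicted[:k] if item in truth)
    return hits / len(truth)


def ndcg_at_k(predicted: Sequence[int], truth: Set[int], k: int) -> float:
    """Discounted gain of the hits, normalized by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(predicted[:k], start=1)
        if item in truth
    )
    ideal_hits = min(k, len(truth))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


predicted_basket = [5, 7, 9]  # ranked novel items for one user
true_novel_items = {7, 8}     # novel items actually bought next
print(recall_at_k(predicted_basket, true_novel_items, k=3))  # 0.5
print(ndcg_at_k(predicted_basket, true_novel_items, k=3))
```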

Effectiveness of BTBR (RQ2)
In this experiment, we evaluate the overall NNBR task performance of BTBR with different masking strategies, i.e., item-level random masking (item-random), item-level select masking (item-select), basket-level all masking (basket-all), and basket-level explore masking (basket-explore). The results of the comparison with the best baseline performances are shown in Table 5. Based on the results, we have several observations. First, BTBR with the basket-all masking strategy (i.e., the conventional next basket recommendation task) significantly outperforms the best baselines on the Tafeng and Instacart datasets and achieves comparable performance on the Dunnhumby dataset. This result indicates that it may not be necessary to introduce basket representations, because modeling only item-to-item correlations is already effective for the NNBR task.
Second, there is no consistently best masking strategy across all datasets. On the Tafeng dataset, basket-level masking clearly outperforms item-level masking: basket-all outperforms, and basket-explore matches, the best existing performance w.r.t. each metric, whereas item-level masking leads to significant deterioration. On the Dunnhumby and Instacart datasets, BTBR with item-level masking strategies outperforms the best baseline performance by a large margin and is superior to BTBR with basket-level masking strategies. These results suggest that the sequential order of items or baskets might be stricter on the Tafeng dataset than on the Dunnhumby and Instacart datasets, so item-level masking, which does not fully respect the sequential order, performs poorly on the Tafeng dataset. Third, item-select masking achieves better performance than item-random masking w.r.t. all metrics across all datasets (paired t-test, p < 0.05), with improvements ranging from 4.1% to 9.0%, which demonstrates the effectiveness of our item-select masking strategy, designed specifically for the NNBR task. In a sequence with many recurring items, conventional random masking cannot ensure that no occurrence of a masked item remains at other positions of the sequence, so the model might learn to predict the masked item from its remaining occurrences, i.e., item self-relations. In contrast, the proposed item-select masking strategy removes all occurrences of the same item, which ensures that the masked items are novel w.r.t. the remaining masked sequence, so the model has to infer the masked novel item by learning its relations with other items.
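The difference between the two item-level strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the `MASK` token, the function names, and the choice to select each distinct item independently with probability `mask_ratio` are assumptions made for illustration, and baskets are represented as plain lists of item identifiers.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def item_random_mask(baskets, mask_ratio, rng):
    """Mask each item occurrence independently with probability mask_ratio.

    Other occurrences of a masked item may stay visible in the sequence.
    """
    masked, labels = [], []
    for basket in baskets:
        new_basket = []
        for item in basket:
            if rng.random() < mask_ratio:
                new_basket.append(MASK)
                labels.append(item)
            else:
                new_basket.append(item)
        masked.append(new_basket)
    return masked, labels


def item_select_mask(baskets, mask_ratio, rng):
    """Select distinct items with probability mask_ratio and mask every
    occurrence of each selected item, so that no copy stays visible."""
    distinct = sorted({item for basket in baskets for item in basket})
    selected = {item for item in distinct if rng.random() < mask_ratio}
    masked = [
        [MASK if item in selected else item for item in basket]
        for basket in baskets
    ]
    return masked, selected


if __name__ == "__main__":
    history = [["milk", "bread"], ["milk", "eggs"], ["milk", "tea"]]
    print(item_random_mask(history, 0.5, random.Random(0)))
    print(item_select_mask(history, 0.5, random.Random(0)))
```

With item-select masking, a selected item such as a frequently repurchased one disappears from every basket, so it can only be recovered from its relations with other items, which mirrors the exploration setting.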
Finally, basket-explore masking, which is specifically targeted at the NNBR task, does not lead to any improvements on the Tafeng and Dunnhumby datasets and results in a decrease in performance on the Instacart dataset, compared with basket-all masking. This result again verifies the findings in Section 5.3 and indicates that masking and training BTBR specifically for the NNBR task may be suboptimal, since the repeat item labels may also be helpful for modeling item-to-item correlations.
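To make the contrast between the two basket-level strategies concrete, the following sketch builds the training labels for one user sequence. It assumes, for illustration only, that basket-level masking hides the last basket and that basket-explore keeps only the labels that do not occur earlier in the sequence; the function name and the `explore_only` flag are hypothetical.

```python
def basket_level_example(baskets, explore_only):
    """Hide the last basket and return (history, labels).

    explore_only=False corresponds to basket-all: every item of the hidden
    basket is a label. explore_only=True corresponds to basket-explore:
    items already seen in the history are dropped from the labels.
    """
    history, target = baskets[:-1], baskets[-1]
    if not explore_only:
        return history, set(target)
    seen = set().union(*history)  # all items purchased before the target
    return history, set(target) - seen


if __name__ == "__main__":
    sequence = [["milk", "bread"], ["milk", "eggs"], ["milk", "tea", "bread"]]
    print(basket_level_example(sequence, explore_only=False))
    print(basket_level_example(sequence, explore_only=True))
```

In this toy sequence, basket-explore keeps a single label where basket-all keeps three, which illustrates why removing repeat labels reduces the amount of training signal when the repeat ratio is high.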

Effectiveness of the item swapping strategy (RQ3)
In this section, we conduct experiments to verify the effectiveness of the swapping strategy; the results are shown in Table 5. We find that adding a swapping strategy on top of item-random and item-select leads to a decrease in performance on the Tafeng dataset. At the same time, adding a swapping strategy on top of item-random and item-select leads to better performance on the Dunnhumby and Instacart datasets (paired t-test, p < 0.05). These results are not surprising, since the swapping strategy not only enriches the item interactions within the basket but also risks introducing noise w.r.t. the temporal information. The sequential order is relatively strict on the Tafeng dataset (see Section 5.4), so the model cannot benefit from the swapping strategy. We further investigate the influence of the hyper-parameters of the swapping strategy, i.e., swap ratio and swap hop.
Figure 4 shows a heatmap of Recall@10 on the different datasets when the swap ratio takes values in [0.1, 0.3, 0.5, 0.7, 0.9] and the swap hop takes values in [1, 3, 5, 7, 9]. We observe that training with both a high swap ratio and a high swap hop (the upper-right of the heatmap) leads to poor performance on the Tafeng and Dunnhumby datasets. On the Instacart dataset, better performance is achieved with a high swap hop. The repeat ratio on the Instacart dataset is high, which means that users' interests are relatively stable and swapping across adjacent baskets will not help, so a higher swap hop is preferred to enrich item interactions within the basket on this dataset.
Given the above findings, there is a trade-off between enriching the item interactions within baskets and respecting the original temporal order information, so it is reasonable to search for the optimal swap hyper-parameters to get the highest performance on different datasets in practice.
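The roles of the two hyper-parameters can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that each item is picked for swapping with probability `swap_ratio`, that its partner basket is drawn uniformly from the baskets at most `swap_hop` positions away, and that the item is exchanged with a random item of that basket. The function name and these sampling details are assumptions.

```python
import random


def swap_items(baskets, swap_ratio, swap_hop, rng):
    """Return a copy of the basket sequence with items swapped across baskets.

    Basket sizes are preserved; a swap may place an item into a basket that
    already contains it, which this sketch does not try to prevent.
    """
    swapped = [list(basket) for basket in baskets]
    n = len(swapped)
    for i in range(n):
        for pos in range(len(swapped[i])):
            if rng.random() >= swap_ratio:
                continue  # this item is not selected for swapping
            lo, hi = max(0, i - swap_hop), min(n - 1, i + swap_hop)
            candidates = [j for j in range(lo, hi + 1) if j != i and swapped[j]]
            if not candidates:
                continue
            j = rng.choice(candidates)
            other = rng.randrange(len(swapped[j]))
            swapped[i][pos], swapped[j][other] = swapped[j][other], swapped[i][pos]
    return swapped


if __name__ == "__main__":
    sequence = [["milk", "bread"], ["eggs"], ["tea", "jam"], ["rice"]]
    print(swap_items(sequence, swap_ratio=0.5, swap_hop=1, rng=random.Random(0)))
```

Under these assumptions, a larger `swap_hop` lets items from temporally distant baskets meet in the same basket, while a larger `swap_ratio` perturbs more of the original order, which is the trade-off discussed above.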

Effect of mask ratio and training dynamics (RQ4)
We investigate the effect of the mask ratio and analyze how performance changes as training proceeds, to further understand the properties of the different masking strategies.
Mask ratio. The mask ratio used with item-level masking is a hyper-parameter that is worth discussing. Figure 5 shows Recall@10 when the mask ratio takes values in [0.1, 0.3, 0.5, 0.7, 0.9]. We observe that item-select outperforms item-random at the same mask ratio in most cases. We also see that the optimal mask ratio is 0.1 for item-random and item-select on the Tafeng dataset, whereas it is much higher (0.5, 0.7) on the Dunnhumby and Instacart datasets. We suspect that a higher mask ratio is preferred in the NNBR task when the dataset has long interaction records for the users.
Training dynamics. Figure 6 shows how Recall@10 evolves as training proceeds under the different masking strategies. First, basket-level masking reaches its best performance very quickly and then degrades much earlier than item-level masking. This is because the training labels of basket-level masking are static, which can easily lead to overfitting, while the training labels of item-level masking are dynamic, which alleviates overfitting. Second, compared to basket-all masking, basket-explore masking further aggravates the overfitting problem by removing the repeat items (labels), which might lead to a performance decrease, especially in scenarios with a high repeat ratio. Finally, the performance of item-random and item-select evolves similarly on the Tafeng dataset, since its repeat ratio is small. On the Dunnhumby and Instacart datasets, item-random masking overfits earlier than item-select masking, since the masked item might still exist at other positions of the masked sequence and the model will rely more on repeat item prediction than on inferring novel items, as the repetition prediction task is relatively easier [24].

Effectiveness of joint masking (RQ5)
So far, we have built a comprehensive understanding of the different masking strategies and found that no single masking strategy is optimal in all cases, due to the diverse characteristics of the datasets. We now conduct experiments to evaluate the effectiveness of joint masking (training), i.e., pre-training the model using item-select masking and then fine-tuning it using basket-all masking. The results are also shown in Table 5. We find that BTBR with joint masking consistently outperforms the best performance obtained by existing baselines across datasets; the improvements range from 1.3% to 7.6% on the Tafeng dataset, from 9.2% to 12.5% on the Dunnhumby dataset, and from 19.5% to 22.4% on the Instacart dataset. In most cases, joint masking does not lead to further improvements over the single optimal strategy, i.e., basket-all on the Tafeng dataset and item-select with swap on the Dunnhumby and Instacart datasets. Nevertheless, the joint masking strategy under the pretrain-then-finetune paradigm is still valuable due to its robustness w.r.t. the NNBR task (i.e., it consistently outperforms the best baselines) on various datasets with different characteristics.

CONCLUSION
We have formulated the next novel basket recommendation task, i.e., the task of recommending novel items to users given historical interactions. The task has practical applications and helps us to evaluate an NBR model's ability to find novel items for a given user. To understand the performance of existing NBR methods on the NNBR task, we have evaluated several NBR models with two training methods, i.e., Train-all and Train-explore. To address the NNBR task, we have proposed a bi-directional transformer basket recommendation model (BTBR), which uses a bi-directional transformer to directly model item-to-item correlations across different baskets. To train BTBR, we have designed five types of masking strategies and training objectives at different levels: (i) item-level random masking, (ii) item-level select masking, (iii) basket-level all masking, (iv) basket-level explore masking, and (v) joint masking. To further improve BTBR's performance, we have also proposed an item swapping strategy to enrich item interactions.
We have conducted extensive experiments on three datasets. Concerning existing NBR methods, we have found that: (i) the performance on the NNBR task differs widely between existing NBR methods; (ii) the performance of existing methods on the NNBR task leaves considerable room for improvement, and the top-performing methods on the NNBR task are different from the top performers on the NBR task; and (iii) training specifically for the NNBR task by removing repeat items from the ground truth labels does not lead to consistent improvements in performance. Concerning our newly proposed BTBR method, we have found that: (i) BTBR with a properly selected masking and swapping strategy can substantially improve the NNBR performance; (ii) there is no consistent best masking level for BTBR across all datasets; (iii) the proposed item-select masking strategy outperforms the conventional item-random masking strategy on the NNBR task; (iv) the item-basket swapping strategy can further improve NNBR performance; and (v) a joint masking strategy is robust on various datasets but does not lead to further improvements compared to a single-level masking strategy.
A broader implication of our work is that blindly training specifically for the proposed recommendation task might lead to sub-optimal performance and it is necessary to consider various training objectives on diverse datasets.Another implication is that it is important to consider the differences between repetition behavior and exploration behavior when designing recommendation models for the grocery shopping scenario.
One limitation of this paper is that we only focus on the grocery shopping scenario. An obvious avenue for future work, therefore, is to extend the proposed item-select masking strategy to sequential item recommendation scenarios and to investigate whether it can outperform the widely used item-random masking strategy w.r.t. finding novel items.

REPRODUCIBILITY
We share both our processed dataset and the source code used to produce the results in this paper at https://github.com/liming-7/Mask-Swap-NNBR.
This research was (partially) funded by the China Scholarship Council (grant #20190607154), the Hybrid Intelligence Center, a 10-year program funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research (NWO), https://hybrid-intelligence-centre.nl, and project LESSEN with project number NWA.1389.20.183 of the research program NWA ORC 2020/21, which is (partly) financed by the Dutch Research Council (NWO). All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

Table 1. Three types of basket recommendation.
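$S_u = [B_u^1, B_u^2, \ldots, B_u^t]$ represents the historical interaction sequence for user $u$, where $B_u^t$ represents a set of items $i \in I$ that user $u$ purchased at time step $t$. For a user $u$, the repeat item $i_u^{\text{rep}}$ is an item that user $u$ has purchased before, i.e., $i_u^{\text{rep}} \in I_u^{\text{rep}} = B_u^1 \cup B_u^2 \cup \ldots \cup B_u^t$, and the novel item $i_u^{\text{novel}}$ is an item that user $u$ has not purchased before, i.e., $i_u^{\text{novel}} \in I_u^{\text{novel}} = I - I_u^{\text{rep}}$. The goal of the next novel basket recommendation task is to predict the following novel basket, which consists only of novel items $i_u^{\text{novel}}$ that the user would probably like, based on the user's past interactions $S_u$.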
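As a small worked illustration of this formulation, the sketch below derives a user's repeat item set as the union of the historical baskets, restricts a scored candidate list to novel items, and computes a recall value over the novel ground-truth items. The `scores` dictionary stands in for the output of any recommendation model, and the function names and the exact recall definition are assumptions rather than the evaluation code used in the paper.

```python
def novel_basket(scores, history, k):
    """Return the k highest-scoring items the user has never purchased."""
    repeat_items = set().union(*history)  # union of all historical baskets
    novel = [item for item in scores if item not in repeat_items]
    novel.sort(key=lambda item: scores[item], reverse=True)
    return novel[:k]


def recall_at_k(recommended, true_novel_items):
    """Fraction of the ground-truth novel items found in the recommendation."""
    if not true_novel_items:
        return 0.0
    hits = len(set(recommended) & set(true_novel_items))
    return hits / len(true_novel_items)


if __name__ == "__main__":
    history = [["milk", "bread"], ["milk", "eggs"]]
    scores = {"milk": 0.9, "tea": 0.5, "jam": 0.7, "eggs": 0.8, "rice": 0.1}
    basket = novel_basket(scores, history, k=2)
    print(basket, recall_at_k(basket, {"tea", "rice"}))
```

Here the highest-scoring items overall are repeat items, so they are excluded before ranking; only the remaining candidates can enter the novel basket.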

Table 2. Comparison of four types of masking strategies from three aspects, i.e., temporal order, explore-specific, and amount of training signals.

Table 3. Statistics of the processed datasets.

Table 4. Results of methods trained to find novel items, i.e., Train-explore, compared against methods trained to find all items, i.e., Train-all. Boldface and underline indicate the best and the second-best performance w.r.t. the NNBR task, respectively. Significant improvements and deteriorations of Train-explore over the corresponding Train-all baseline results are marked with ↑ and ↓, respectively (paired t-test, p < 0.05).