To Copy, or not to Copy; That is a Critical Issue of the Output Softmax Layer in Neural Sequential Recommenders

Recent studies suggest that existing neural models have difficulty handling repeated items in sequential recommendation tasks. However, our understanding of this difficulty is still limited. In this study, we substantially advance this field by identifying a major source of the problem: the single hidden state embedding and static item embeddings in the output softmax layer. Specifically, the similarity structure of the global item embeddings in the softmax layer sometimes forces the single hidden state embedding to be close to new items when copying is a better choice, while sometimes forcing the hidden state to be close to the items from the input inappropriately. To alleviate the problem, we adapt recently proposed softmax alternatives such as softmax-CPR to sequential recommendation tasks and demonstrate that the new softmax architectures unleash the capability of the neural encoder to learn when to copy and when to exclude the items from the input sequence. By making only simple modifications to the output softmax layer of SASRec and GRU4Rec, softmax-CPR achieves consistent improvement in 12 datasets. With almost the same model size, our best method not only improves the average NDCG@10 of GRU4Rec in 5 datasets with duplicated items by 10% (4%-17% individually) but also improves it in 7 datasets without duplicated items by 24% (8%-39%)!

Figure 1: The output softmax layer used in most neural networks prevents the recommender from modeling the ideal next-item distribution, and the softmax bottleneck also makes the recommender unable to learn the correct copying behavior from the training data. (The figure's annotation notes that the single hidden state and static item embeddings can wrongly force the model to copy or not to copy, yielding a non-ideal distribution.)

INTRODUCTION
Many recommendation tasks on the internet can be formulated as a sequential recommendation problem [27], whose goal is to recommend the next item to each user based on the historical sequential interactions (e.g., click stream, purchasing record, and exercise practicing sequence) between the user and items [15]. In a sequential recommendation application, a good recommender often needs to capture the compositional meaning of multiple items in the input sequence, and many researchers have demonstrated that neural networks are able to model the complex interactions of the input items well and achieve state-of-the-art performance [40].
As shown in Figure 1, a sequential recommender takes the item history of a user as input and outputs a probability distribution over the next item. A list of the items with the highest predicted probabilities is recommended to the user. The recommender can assign the highest probabilities to the items in the input history, with which the user interacted before; this repetition behavior is like copying the input items to the recommendation list. The recommender can also choose not to copy and instead encourage the user to explore new items. Li et al. [22] found that modern neural recommenders still cannot properly learn to copy or exclude the items from the history in many situations. Motivated by this practical need, we first identify that the output softmax layer, which is adopted by most state-of-the-art neural recommenders, is a major source of the problem, and we demonstrate a substantial performance improvement after alleviating it.
The softmax layer can be viewed as a matrix factorization layer [32,42]. Instead of using a fixed user embedding as in a classic collaborative filtering method, neural recommenders use an RNN (recurrent neural network) or transformer architecture to encode the historical input item sequence as a user embedding. Then, as shown in Figure 2 (a), the cross entropy loss and softmax layer encourage a high dot product between the generated user embedding and the embeddings of the possible next items.
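The dot-product-and-softmax view above can be sketched in a few lines of NumPy. This is a minimal illustration of the output layer only, not the paper's implementation; the embedding sizes and random data are illustrative.

```python
import numpy as np

def next_item_probs(user_emb, item_embs):
    """Softmax layer: the logit of each item is the dot product between the
    user (hidden state) embedding and that item's static output embedding."""
    logits = item_embs @ user_emb          # shape: (num_items,)
    logits -= logits.max()                 # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
item_embs = rng.normal(size=(1000, 64))   # global, static item embeddings
user_emb = rng.normal(size=64)            # encoder output for one input sequence
probs = next_item_probs(user_emb, item_embs)
top10 = np.argsort(-probs)[:10]           # the recommendation list
```

Because every item's probability is determined by one dot product against the same hidden state, the entire distribution is controlled by where that single point sits in the embedding space, which is exactly the limitation discussed below.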
As in collaborative filtering, the softmax / matrix factorization layer has several benefits: it encourages similar items to have similar item embeddings and similar input sequences to be encoded as similar user embeddings. In many cases, the similarity structure in the embedding space boosts the generalization capability of the system. We illustrate an example in Figure 2 (a): assume that in our training data, (i) many users like to buy a big bottle of drink while buying party supplies and (ii) users like to keep buying Pepsi. Then, when a neural recommender sees a new user who bought some Pepsi cans and party supplies before, it can output a user embedding that is between the Pepsi cans and the average of 2-liter drinks. The user embedding would be close to the embedding of a 2-liter Pepsi bottle because the 2-liter Pepsi bottle is similar to both other Pepsi products and other 2-liter drinks.
Despite the effectiveness of the softmax / matrix factorization layer, its item similarity structure is global and sometimes prevents the neural recommender from outputting the desired distribution. As shown in Figure 2 (b), if one user repeatedly bought a diaper product and an apple juice product, the next item is very likely to be either the diaper product or the apple juice product. In order to output this distribution, the neural recommender needs to output a user embedding that is between the diaper embedding and the apple juice embedding, which might accidentally be close to an embedding of apple-flavored baby food and/or an embedding of apple juice for children. In this case, the single hidden state embedding and static item embeddings in the softmax layer cause difficulty in properly copying the items in the historical item sequence. This problem becomes more serious when the embedding space is very crowded (e.g., when the number of items is large).
On the other hand, the similarity structure could also force the neural recommender to improperly copy the previous items. As shown in the example in Figure 2 (c), if one user just bought DVDs of the Avengers movies, the user may want to watch the other Marvel movies such as Iron Man or Captain America (i.e., something like Avengers but not Avengers themselves). However, a user embedding close to all the other Marvel movies would unavoidably be close to the embeddings of Avengers because the Avengers movies are related to all the other Marvel movies. This could force the neural recommender to keep recommending the movies that the user has seen before. In this case, the softmax layer causes difficulty in properly excluding the items in the historical item sequence.
To alleviate the issues caused by the output softmax layer, we adopt softmax-CPR [6], which was originally proposed to reduce the hallucinations of language generation models. From Figure 2 (b), we can see that the main issue comes from the single hidden state in the item embedding space, so softmax-CPR uses different hidden states to compute the probabilities of different partitions of items. For example, we can use one hidden state for only the items in the input sequence (e.g., the apple juice and diaper) and another hidden state for the rest of the items (e.g., baby foods). Then, the former hidden state can be placed between the apple juice and the diaper without being interfered with by the other baby food items.
We test our methods in RecBole [41,44]. Compared to the softmax layer used in most neural sequential recommenders, softmax-CPR improves the geometrically averaged NDCG@10 [36] in 12 datasets by 19%, and the improvement gaps are similar in SASRec [15] and GRU4Rec [13]. Our experiments also identify that the source of improvement of RepeatNet [30] is alleviating the softmax issue rather than its self-attention mechanism. By identifying the source of the problem better, softmax-CPR achieves larger improvements using a simpler model and can easily be combined with any neural encoder.

Main Contributions
• We found that the single hidden state and static item embeddings in the output softmax layer prevent neural sequential recommenders from learning the copying/excluding behavior of users. This perspective explains the improvement of RepeatNet [30] and of simple post-processing [22] in the datasets without duplicated items.
• We adapt softmax-CPR, which was recently proposed in Chang* et al. [6] for NLP problems, to sequential recommendation tasks and implement it in RecBole. Our experiments in 12 datasets compare, in a unified way, the various softmax alternatives that might solve the softmax bottleneck problems. We conduct detailed ablation studies to attribute the improvement to each modification we make in the softmax layer.

PROBLEMS
In this section, we formally introduce the problems caused by the single hidden state and static item embeddings in the output softmax layer.

Duplicated Items
In many applications, duplicated items appear frequently in the item sequence [1,3], and the repetition behavior could change across domains, across experimental setups, or even at different times within a sequence. For example, when a user buys a dress, the user often first explores many options (i.e., few repetitions) and then repeatedly clicks a set of options to compare them. Once the user buys a dress, it is unlikely that they will buy the same dress again, which means the repetition patterns for predicting the next click and the next purchase are very different.
To show the diversity of the copying patterns, we plot the relationship between the repetition probability of the last item and the sequence length in Figure 3. The blue curves being much higher than the orange curves implies that the copying probability drastically increases once the user starts to interact with items repeatedly. Moreover, the very different curves in different datasets suggest that it is very hard to manually design a general copy strategy that is applicable to various domains.
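The curves in Figure 3 can be estimated from raw interaction logs as follows. This is a hedged sketch of the statistic described above, not the paper's analysis code; `repetition_curve` and its input format are our own illustrative choices.

```python
from collections import defaultdict

def repetition_curve(sequences):
    """For each prefix length t, estimate P(next item is already in the prefix),
    computed separately for prefixes that already contain a repetition
    (the blue curves in Figure 3) and prefixes whose items are all unique
    (the orange curves)."""
    hits = defaultdict(lambda: [0, 0])   # (t, has_repeat) -> [repeats, total]
    for seq in sequences:
        seen = set()
        for t, item in enumerate(seq[:-1]):
            seen.add(item)
            has_repeat = len(seen) < t + 1   # prefix already contains a duplicate
            key = (t + 1, has_repeat)
            hits[key][1] += 1
            if seq[t + 1] in seen:
                hits[key][0] += 1
    return {k: r / n for k, (r, n) in hits.items() if n > 0}

curve = repetition_curve([[1, 2, 1, 2], [3, 4, 5, 4], [6, 7, 8, 9]])
```

Plotting `curve` against the prefix length for the two `has_repeat` groups reproduces the blue/orange comparison in Figure 3.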

Softmax Bottleneck
The output probability from a softmax layer can be written as
$$P(x_{t+1} = i \mid x_1^t) = \frac{\exp(h_t^\top w_i)}{\sum_j \exp(h_t^\top w_j)}, \qquad (1)$$
where $x_1^t$ is the input item sequence, $h_t$ is the hidden state produced by the neural encoder, and $w_i$ is the output item embedding of item $i$. This implicit matrix factorization process is illustrated in Figure 2 (a).
One main limitation of the softmax layer and matrix factorization is its static item embeddings $w_i$. The meaning of each item is different for different users, but the item embeddings in the softmax layer are global and independent of the input item sequence. For example, if most users often buy items A and B together, their global embeddings tend to be similar. However, if one kind of user keeps buying only A because he/she likes A but does not like B, these two items are not similar at all for this kind of user. Even though such users have repeatedly demonstrated a low probability of buying B, the recommender might be forced to keep recommending B by the similar global embeddings. As illustrated in Figure 2 (b) and (c), this discrepancy is especially serious when the user has a strong preference for copying or excluding the historical items.
Chang and McCallum [5] extend the concept of the softmax bottleneck [42] and theoretically show that the structure of the embeddings can create a multi-modal ideal distribution that cannot be modeled by the single hidden state in the softmax layer (e.g., the bi-modal distribution in Figure 2 (b) and the donut-shaped distribution in (c)). Due to latency and generalization concerns, an industrial recommender sometimes shares input and output embeddings and usually cannot use a very large hidden state size. Furthermore, we cannot observe all possible item distributions in the training data, since every user could like a different product combination. For these reasons, it is hard for the softmax layer to reduce the multi-modality by moving the item embeddings, which highlights the seriousness of the softmax bottleneck in many sequential recommendation tasks.
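The diaper/apple-juice failure mode can be reproduced numerically. Below is a toy example with hypothetical 2-d embeddings mirroring Figure 2 (b); the coordinates are invented purely for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical 2-d item embeddings mirroring Figure 2 (b):
items = {"diaper": np.array([1.0, 0.0]),
         "apple_juice": np.array([0.0, 1.0]),
         "apple_baby_food": np.array([0.75, 0.75])}  # "between" the two
names = list(items)
W = np.stack([items[n] for n in names])

# To give the diaper and the apple juice equally high probability, the single
# hidden state must sit between their embeddings ...
h = (items["diaper"] + items["apple_juice"]) / 2     # h = [0.5, 0.5]
probs = softmax(W @ h)
# ... but then the in-between item gets the highest logit instead:
# diaper and apple_juice both score 0.5, apple_baby_food scores 0.75.
winner = names[int(np.argmax(probs))]
```

With a single hidden state there is no position that ranks both input items above the item lying between them, which is exactly the bottleneck described above.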

SOLUTIONS
In this section, we introduce several potential methods that could alleviate the softmax bottleneck problems.

Post Processing
In industry, the repetition issues are often alleviated by business insights and heuristic rules. For example, a product manager might notice that most users won't watch the same movie twice and that many movies in the recommendation list have been watched, so we could simply add a post-processing step to remove the watched movies from our recommendation candidate list [22]. Nevertheless, the simple statistics in Figure 3 suggest that finding generally applicable rules is difficult. Moreover, in some domains, the rules might quickly become too complicated to manage. For example, a user might want a food delivery app to recommend something new but sometimes prefers to buy food from the restaurants he/she has tried more than a certain number of times (to exclude the restaurants that are too bad to try again). In these cases, learning the copy patterns from a large training set should be an easier and more effective approach than manually designed heuristic rules.
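The post-processing rule of Li et al. [22] amounts to filtering the ranked list; a minimal sketch (the function name and list format are our own):

```python
def remove_seen(ranked_items, history):
    """Post-processing in the spirit of Li et al. [22]: drop items the user has
    already interacted with from the ranked candidate list, preserving order."""
    seen = set(history)
    return [i for i in ranked_items if i not in seen]

# A user who watched items 3 and 9 should not be recommended them again.
filtered = remove_seen([5, 3, 9, 3, 7], history=[3, 9])  # -> [5, 7]
```

The rule is trivial to apply but, as argued above, it hard-codes one copying policy for all users and all domains.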

Softmax-CPR
In this section, we briefly review softmax-CPR, proposed by Chang* et al. [6]. For more details on its motivation and formulation, please refer to Chang* et al. [6] or Appendix C.1. Softmax-CPR combines three methods, context partition, pointer network, and reranker partition, to improve the output softmax layer.
First, we introduce the context partition, which makes a small change to the logit computation in the softmax layer. In Equation (1), the softmax layer lets $\text{Logit}(i) = h_t^\top w_i$. In the context partition, the logit of an item $i$ that appears in the input sequence $x_1^t$ is instead computed with a separately projected hidden state, $f_C(h_t)^\top w_i$, where $f_C(h_t) = W_C h_t + b_C$ (Equation (2)). The context partition allows the recommender to learn when to copy the input items and when to exclude them. In the pointer network, the logit of an input item is instead computed from the average of linearly projected hidden state embeddings corresponding to that item (see Equation (5) in the appendix for its formula). This means that the pointer network allows the recommender to predict a local, context-dependent embedding for the input items.
In the context partition and the pointer network, we only compute the logits of the input items separately using projected hidden states. However, some unexplored items are also likely to appear next, and they might also encounter the softmax bottleneck. To alleviate this issue, the reranker partition computes the logits of the top-$k$ most likely items separately using another new hidden state. As illustrated in Figure 4, softmax-CPR combines all the above methods (Equation (3)). All the new hidden states $f_\cdot(h_t)$ are linear projections of $h_t$, so the limited dimension of $h_t$ forces them to depend linearly on each other and prevents them from moving freely in the item embedding space. To solve this issue, Chang and McCallum [5] concatenate multiple input hidden states and project them into a new, higher-dimensional hidden state. We combine their method with softmax-CPR in Figure 4 by replacing $h_t$ with this expanded hidden state in Equation (3).
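The partitioned logit computation can be sketched as follows. This is a simplified sketch in the spirit of softmax-CPR, assuming one context partition, one reranker partition, and a default partition; the pointer network is omitted, and the projection layout and function names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def linear(W, b, h):
    """A linear projection f(h) = W h + b, one per partition."""
    return W @ h + b

def cpr_logits(h, item_emb, input_items, k, P_c, P_r, P_n):
    """Partitioned logits: items in the input sequence (context partition) and
    the top-k candidates (reranker partition) get logits from their own
    projected hidden states; all remaining items use a third projection."""
    logits = item_emb @ linear(*P_n, h)            # default partition
    topk = np.argsort(-logits)[:k]                 # reranker partition (R)
    logits[topk] = item_emb[topk] @ linear(*P_r, h)
    idx = np.array(sorted(set(input_items)))       # context partition (C)
    logits[idx] = item_emb[idx] @ linear(*P_c, h)  # applied last: input items
    return logits                                  # belong to C even if in top-k

rng = np.random.default_rng(0)
d, n = 16, 100
item_emb = rng.normal(size=(n, d))
h = rng.normal(size=d)
P = [(rng.normal(size=(d, d)), rng.normal(size=d)) for _ in range(3)]
logits = cpr_logits(h, item_emb, input_items=[3, 7, 3], k=10,
                    P_c=P[0], P_r=P[1], P_n=P[2])
```

Because the context-partition hidden state only scores the input items, it can sit between them in the embedding space without dragging probability onto nearby unrelated items.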

Mixture of Softmax (MoS)
In Figure 2 (b), one possible solution is to put one hidden state near the diaper and another hidden state near the apple juice to model the multi-modal distribution. The mixture of softmax (MoS) was proposed to achieve this goal [5,23,42]. MoS and softmax-CPR both use multiple hidden states, but the role of each hidden state differs. In MoS, we need to compute the dot product between every hidden state and all the item embeddings; in softmax-CPR, we partition the item set and each hidden state only determines the logits/probabilities of the items in one partition (e.g., only the input items in the context partition). Compared to softmax-CPR, MoS is more computationally expensive and does not explicitly model users' repetition behavior.
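For contrast with the partitioned computation above, a minimal MoS sketch (component count and shapes are illustrative): every component hidden state is compared against all item embeddings, and the resulting distributions are mixed.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mos_probs(hidden_states, mix_logits, item_embs):
    """Mixture of softmax: each of K component hidden states scores ALL n items
    (K full dot-product passes), then the K distributions are averaged with
    learned mixture weights.
    hidden_states: (K, d), mix_logits: (K,), item_embs: (n, d)."""
    comp = softmax(hidden_states @ item_embs.T, axis=-1)   # (K, n)
    weights = softmax(mix_logits)                          # (K,)
    return weights @ comp                                  # (n,)

rng = np.random.default_rng(1)
probs = mos_probs(rng.normal(size=(3, 8)),   # K = 3 softmax components
                  rng.normal(size=3),
                  rng.normal(size=(50, 8)))  # n = 50 items
```

Note that the `(K, n)` matrix makes the K-fold cost over the full item set explicit, which is the computational drawback mentioned above.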

RepeatNet
Inspired by CopyNet [10] and the pointer network, RepeatNet [30] explicitly models the probability of copying the items from the input. In their paper, the authors do not test its performance on datasets without duplicated items. Compared to softmax-CPR, RepeatNet has several disadvantages. First, RepeatNet introduces many extra parameters to GRU4Rec, which increases the computational overhead and the difficulty of identifying the source of the improvement. Second, when computing the probabilities of copying the items, RepeatNet does not leverage the global item similarity structure, which might hurt the generality of the model (see Figure 2 (a) for an example). Third, RepeatNet does not solve the softmax bottleneck problem for the items that are not in the input sequence.
Table 3: The test performance (%) in 5 datasets with duplicated items. The notations are the same as Table 2.

EXPERIMENTS
All the experiments are done in RecBole [41,44], a library that provides various recommendation models and datasets. We select 12 datasets from RecBole that are large enough and widely used in previous work, making the results more representative and less sensitive to the hyperparameter setup and random seeds [9]. The datasets come from various domains and have various sizes. We report their statistics in Table 1.

Models and Baselines
We implement the following softmax alternatives by modifying the model code of SASRec [15] and GRU4Rec [13]. We choose SASRec and GRU4Rec for several reasons: (i) both are state-of-the-art and widely used encoders [18,38]; (ii) RepeatNet is based on GRU4Rec; (iii) we want to compare the improvements over transformer-based and RNN-based encoders.
• Softmax: The performance of the SASRec and GRU4Rec.
• Softmax + Mi: Computing the probabilities using multiple input hidden states (Mi). Please see Section 3.3 for more details. Here, Mi uses the hidden states corresponding to the last three input items and all the layers of the neural encoders (i.e., 1 layer for GRU4Rec and 2 layers for SASRec).
• Softmax + C: Use the context partition in Equation (2).
• Softmax + CP: Use the context partition and pointer network.
• Softmax + CPR:20,100,500 + Mi: Use softmax-CPR in Equation (3) and multiple input hidden states as in Figure 4.
• Mixture of Softmax (MoS): The baseline similar to Lin [23], Yang et al. [42]. We set the number of softmaxes to 3.
• Softmax w/o Duplication [22]: Set the probability of the repeated items to 0 via post-processing to improve the performance on the datasets without duplicated items.
In addition, we also compare the softmax alternatives on top of SASRec/GRU4Rec with RepeatNet [30].

Setup
We report three metrics: NDCG@10 (normalized discounted cumulative gain) [14], HR@10 (hit rate) [43], and MRR@10 (mean reciprocal rank) [28]. To be closer to the real-world setup, we do not subsample negative examples when reporting the testing performance, so the scores might look small in some datasets with a large number of items. Since different datasets can have very different performance ranges, we report the geometric mean over all datasets to summarize the performance of every method [36].
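With a single relevant item per test case (the actual next item), the three cutoff-10 metrics reduce to simple functions of the true item's rank. A minimal sketch, assuming `rank` is the 1-based position of the true next item in the ranked list (or `None` if it is absent):

```python
import math

def ndcg_at_10(rank):
    """With one relevant item, ideal DCG is 1, so NDCG@10 = 1/log2(rank+1)
    when the item is ranked within the top 10, else 0."""
    return 1.0 / math.log2(rank + 1) if rank and rank <= 10 else 0.0

def hr_at_10(rank):
    """Hit rate: 1 if the true item appears in the top 10."""
    return 1.0 if rank and rank <= 10 else 0.0

def mrr_at_10(rank):
    """Reciprocal rank, truncated at 10."""
    return 1.0 / rank if rank and rank <= 10 else 0.0
```

For example, a true item ranked 1st gives NDCG@10 = 1.0, ranked 3rd gives 1/log2(4) = 0.5, and ranked 11th contributes 0 to all three metrics.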
We follow the default evaluation protocol and model setup in RecBole (e.g., input and output item embeddings are shared in GRU4Rec and SASRec). We found that the default hyperparameters in RecBole generally work well, except that a smaller dropout rate for SASRec yields much better performance. Overall, our performance improvement is not sensitive to the hyperparameters, but we still tried our best to tune them under the constraints of our computational resources.
For the smaller datasets (i.e., Amazon Beauty, Games, MovieLens 1m, and Steam), we perform a grid search on the hyperparameters using their NDCG@10 scores on the validation sets. In the grid search, we use learning rates [5e-4, 1e-3, 2e-3] and batch sizes [64, 128]. For GRU4Rec, the dropout rates are [0, 0.5]. For SASRec, the hidden state dropout rates are [0, 0.1]. For all the other 8 larger datasets, the learning rate is 1e-3 and the dropout rate is 0. All the hyperparameter values and search ranges in RepeatNet are the same as in GRU4Rec.
To fit the models into our GPU memory, we adjust the hidden state sizes and training batch sizes as shown in Table 1.
Using the grid search results, we can analyze the hyperparameter sensitivity of the learning rates, dropouts, and batch sizes. To assess the sensitivity to the hidden state sizes, we conduct another grid search using hidden state sizes [16, 32, 64, 128] and batch sizes [64, 128]. The learning rate and dropout are set according to the previous grid search results to optimize the testing performance of the Softmax + Mi baseline in each of the 4 smaller datasets.
All experiments are done on an Nvidia Tesla M40. The time is measured by computing the probability of all the items without using any nearest neighbor search. The model code in RecBole is often not optimized for running time, so the time comparison is more meaningful given the same neural encoder. Thus, we do not report the time of RepeatNet to avoid unfair comparisons.

Results
The results are presented in Table 2, Table 3, and Table 4. We can see that the context partition (Softmax + C) substantially improves over Softmax. After adding the pointer network (P), reranker partition (R), and multiple input hidden states (Mi), Softmax + CPR:100 + Mi achieves the best overall performance in Table 4. Unlike its counterpart in language models [6], using multiple reranker partitions, Softmax + CPR:20,100,500 + Mi, does not result in better performance in recommendation models.
The improvement on GRU4Rec is slightly larger than that on SASRec, which shows that there is only a small overlap between the benefits of softmax-CPR and the benefits of self-attention in the neural encoder. Note that the performances of SASRec and GRU4Rec are not directly comparable because we only coarsely tune the hyperparameters. MoS [23,42] performs almost the same as Softmax while requiring much longer training and inference time. Although Softmax w/o Duplication performs poorly in the 5 datasets with duplications, it significantly outperforms Softmax in the 7 datasets without duplications, as found in Li et al. [22]. Nevertheless, allowing the models to easily exclude the duplications during training (e.g., Softmax + C) still slightly outperforms the post-processing baseline in the datasets without duplications.
We discover that, although designed for datasets with duplicated items, RepeatNet can also substantially improve the datasets without any duplicated items. In many datasets, the performance of RepeatNet is very similar to Softmax + C on top of GRU4Rec. This suggests that the main source of improvement from RepeatNet is computing the probabilities of the repeated items separately, as in Softmax + C, rather than its self-attention mechanism or its extra parameters. Furthermore, after identifying this source of improvement, Softmax + C can achieve a similar improvement while only needing one-third of RepeatNet's model size, Softmax + CPR:100 + Mi can further expand the improvement by better overcoming the softmax bottleneck, and we can apply the softmax alternatives to any neural encoder of interest (e.g., they lead to similar improvements in SASRec).
Table 4 shows that the extra parameters introduced by our softmax alternatives are negligible. Theoretically speaking, the extra computations in Softmax + CPR:100 + Mi are also very small compared to the original softmax layer, which computes the dot product between the hidden state and every item embedding. However, we still see some increases in training and testing time, which might be caused by the constraints of PyTorch's built-in functions. Thus, in applications requiring low latency, we recommend using Softmax + C and/or writing CUDA code to minimize the extra overhead.
The hyperparameter analyses in Figure 5 show that none of the methods is very sensitive to the particular values of the hyperparameters. The lines are fairly flat in the batch size and learning rate figures. For the dropout rates in GRU4Rec, a dropout of 0 uniformly degrades the performance on smaller datasets such as Amazon Video Games. For the hidden size, performance starts to degrade when the size is smaller than 64. In RepeatNet, the item embedding size is twice the hidden state size, so its hidden state size of 16 is comparable to a size of 32 in the other methods.

RELATED WORK
Due to the commonness of repetition in sequential recommendation, many studies propose methods to improve the accuracy of recommending repeated items. For example, Bhagat et al. [3] and Wang et al. [37] propose probabilistic models to find the proper time to recommend items the user bought before. Ariannezhad et al. [1] and Ma et al. [25] propose special neural network architectures to explicitly model periodic user behavior. These methods usually design complicated models based on business insights from specific domains, which might limit their generalization ability and applicability to other domains.
Li et al. [21,22] systematically analyze the repetition and exploration behavior of existing sequential recommenders, and Li et al. [22] observe that sharing the input and output embeddings could intensify the improper copying issue. Our work provides an explanation for this empirical observation (see Section 2.2).
Chang and McCallum [5] and Yang et al. [42] introduce the concept of the softmax bottleneck, the limitation of single embeddings, and the solution of using multiple embeddings in MoS to improve language models. Recently, multiple embeddings have been applied to information retrieval [16,19,24] and recommendation [23,39]. Nevertheless, as we show in our experiments, the improvement from using multiple embeddings in MoS is often limited and inconsistent in sequential recommendation tasks.
Recently, Chang* et al. [6] proposed softmax-CPR to improve the next-word distribution and the factuality of generated text. Our work focuses on studying its meaning and effectiveness in sequential recommendation tasks, which were previously unknown.
In our work, softmax-CPR uses different embeddings for repetition intents and exploration intents. The idea is similar to multiple-intent recommendation or recommendation diversification. To diversify the recommendations, Kula [20] ensembles multiple LSTMs, Kim et al. [17] cluster the items, and Chen et al. [7] formulate sequential recommendation as a sequence-to-sequence task. Nevertheless, the diversity improvement often comes with much more complicated models specialized for specific datasets, limited recommendation accuracy improvement, and/or significantly increased computational overhead.

CONCLUSION
In the last decade, various neural sequential recommenders have been proposed, and the output softmax layer with a single hidden state is used in almost all of them. These studies often focus on developing a new neural encoder architecture for specific applications and show its superior performance on a few datasets. In our study, we show that the choice of the output softmax layer is also very important in all 12 datasets we tried. Under our experimental setup, it is even more important than the choice of the neural encoder.
In the 7 datasets without any duplication, the similar performances of Softmax + C, RepeatNet, and Softmax w/o Duplication [22] reveal that breaking the softmax bottleneck is the main source of the significant improvements from RepeatNet and from removing the duplications in post-processing in these datasets.
Finally, we recommend setting softmax-CPR as the default method for computing the next item probability in sequential recommendation tasks due to its simplicity and universal improvement on the datasets with or without duplications.

ETHICAL CONSIDERATIONS
Modeling the repetition behavior better might sometimes intensify the filter bubble [33] on a recommendation-based web platform.For example, if one user keeps watching a set of videos talking about conspiracy theories, predicting the next item to be a video from this set might increase the system's accuracy, but further strengthen the intellectual isolation and polarization of the society.We believe that how to break the bubble is still an open problem and out of the scope of this paper.
To understand the sensitivity of the hyperparameters, we first fix the value of a target hyperparameter and compute the geometric mean of NDCG@10 in the 4 smaller datasets for each configuration of the other hyperparameters. Then, we average the scores over all configurations of the other hyperparameters for each target hyperparameter value. When analyzing the sensitivity to the hidden state size, GRU4Rec's item embedding size is the same as the hidden state size.
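The aggregation above can be sketched as follows. This is a hedged illustration of the described procedure; the `results` dictionary layout (target value, other-config key) is our own assumption, not the paper's code.

```python
import math

def geo_mean(scores):
    """Geometric mean of NDCG@10 scores across datasets or configurations;
    less dominated by large-score datasets than the arithmetic mean."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

def sensitivity(results, target_value):
    """Average the scores of all configurations that share one value of the
    target hyperparameter. results: {(target_value, other_config): score}."""
    scores = [s for (v, _), s in results.items() if v == target_value]
    return geo_mean(scores)

# Hypothetical grid-search results: (learning rate, other config) -> NDCG@10.
results = {(1e-3, "a"): 0.04, (1e-3, "b"): 0.01, (2e-3, "a"): 0.09}
score_at_1e3 = sensitivity(results, 1e-3)  # geo_mean([0.04, 0.01]) = 0.02
```

Each point on a curve in Figure 5 corresponds to one such aggregated score for one target hyperparameter value.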
In the repetition statistics figures, we count the total number of the sequences with repetitions (blue) or without repetitions (orange) given a sequence length and sum over all the sequence lengths.We report the count in the legend of the figures.

E REAL EXAMPLES AND ANALYSIS
We visualize the top 3 recommendations from different methods on the Steam dataset [15]. In this dataset, the items are video games and the interactions are user reviews, each representing the user's interest at that time. A user could leave multiple reviews for a game at different times.
In Figure 6a, we can see that the user reviewed three games, Really Big Sky, Robocraft, and Portal Stories: Mel, and reviewed Robocraft twice. Softmax + CPR:100 + Mi predicts that the user is very likely to review the last two games again and less likely to review the first game again, which makes sense because Really Big Sky is not a well-known game. RepeatNet only predicts that the user might repeat the last game, and Softmax + Mi does not predict any repeated item in its top 3 recommendation list.
In Figure 6b, the user has broad interests and did not review the same game twice. RepeatNet still predicts that the user might repeat the last game, and Softmax + Mi copies Saints Row: The Third, the first game the user reviewed. The other two recommendations from Softmax + Mi are similar to Saints Row: The Third. The top three predictions from Softmax + CPR:100 + Mi are three popular free-to-play games: Trove, Warfare Online, and A.V.A. Alliance of Valiant Arms. These recommendations are reasonable because the last two games the user reviewed, Warframe and Spiral Knights, are also popular free-to-play games.

Figure 2 :
Figure 2: The benefits and problems of the output softmax layer in a neural sequential recommender. (a) The output softmax layer implicitly factorizes the interaction matrix into the global item embeddings and the hidden states of the neural encoder. The similarity structure in the item embedding space helps the recommender's generalization capability in this example. (b) In a dataset with many duplicated items, the recommender often needs to copy the items from the shopping history, but the item similarity structure in the embedding space does not allow the recommender to output the desired distribution. (c) In a dataset with few or no duplicated items, the model needs to learn not to recommend the items the users have already interacted with but to recommend something similar to them instead. The ideal distribution would form a donut shape in the item embedding space, which cannot be modeled by the single hidden state and static item embeddings in the softmax layer.

Figure 3 :
Figure 3: The probability of observing a repeated next item (i.e., the next item is already in the input sequence) at a given sequence length (x-axis). The blue curves show the probability when the input sequence already contains duplicated item(s), while the orange curves show the probability when every item in the input sequence is unique.

Figure 4 :
Figure 4: The architecture of softmax-CPR and multiple input hidden states (Mi).
(a) GRU encoder and $|h_t|$ (b) GRU encoder and learning rate (c) GRU encoder and dropout (d) GRU encoder and batch size (e) Transformer encoder and $|h_t|$ (f) Transformer encoder and learning rate (g) Transformer encoder and dropout (h) Transformer encoder and batch size

Figure 5 :
Figure 5: Hyperparameter analyses using the geometric mean of the Amazon Beauty, Games, MovieLens 1m, and Steam datasets.
$x_1^t$ is the input item sequence with length $t$ from user $u$, $h_t$ is the hidden state for the sequence encoded by a neural encoder, and $w_i$ is the output item embedding for item/product $i$. During training, maximum likelihood estimation increases the probability of observing the actual next product $x_{t+1}$, $P(x_{t+1} \mid x_1^t)$, by maximizing $h_t^\top w_{x_{t+1}}$ and minimizing $h_t^\top w_j$ for the other items $j$. That is, the hidden state $h_t$ is pulled closer to $w_{x_{t+1}}$ and pushed away from the other item embeddings. The new hidden states are linear projections of the hidden state (e.g., $f_C(h_t) = W_C h_t + b_C$), and each linear projection layer learns different parameters (weights $W_\cdot$ and bias $b_\cdot$) during training.
In Figure 2 (b), the recommender can place $f_C(h_t)$ between the apple juice and the diaper without being interfered with by the baby foods, because $f_C(h_t)$ is the hidden state used only for the input items. Similarly, in Figure 2 (c), the recommender can learn to output a very small value of $f_C(h_t)^\top w_i$ to exclude all the previously seen movies, while placing $h_t$ at the center of the movies it should recommend. The context partition is related to a pointer network. The main difference is that the pointer network computes the logit of each input item from the projected hidden states at the positions where that item appears in the input sequence.

Table 1 :
Here $S(k_3)$ is the set of top $k_3$ items with the highest logits, $S(k_2)$ is the set of top $k_2$ items with the highest logits, and similarly for $S(k_1)$. Note that this method can easily be combined with maximum inner product search [4]: we can use $h_t$ to search for the possible next items and use the other hidden states to adjust the logits of these possible next items and of the input items. Since the number of input items $|x_1^t|$ and $k_3$ are much smaller than the total number of items, the computational overhead should be relatively small.
Table 1: Dataset statistics. The dataset sizes are reported in thousands (k). We adjust the hidden state size ($|h_t|$) and batch size (bsz) accordingly under our GPU memory constraint.
All the new hidden states $f_\cdot(h_t)$ are projections of $h_t$; the limited dimension of $h_t$ forces them to depend linearly on each other.
3.3 Multiple Input Hidden States (Mi)

Table 2 :
We compare the test performance (%) of NDCG@10 and HR@10 in 7 datasets without duplicated items. C, P, and R mean context partition, pointer network, and reranker partition, respectively. 20,100,500 refers to $k_1 = 20$, $k_2 = 100$, and $k_3 = 500$; Mi means the multiple input hidden state enhancement. The best values given the same neural encoder are highlighted.

Table 4 :
• Softmax + CPR:100 + Mi: Use softmax-CPR and multiple input hidden states.
Table 4: We report the model size and the average time of training/testing the models on Amazon-2014 Books for 1 epoch. The other notations are the same as in the previous tables.