Curse of "Low" Dimensionality in Recommender Systems

Beyond accuracy, there are a variety of aspects to the quality of recommender systems, such as diversity, fairness, and robustness. We argue that many of the prevalent problems in recommender systems are partly due to low-dimensionality of user and item embeddings, particularly when dot-product models, such as matrix factorization, are used. In this study, we showcase empirical evidence suggesting the necessity of sufficient dimensionality for user/item embeddings to achieve diverse, fair, and robust recommendation. We then present theoretical analyses of the expressive power of dot-product models. Our theoretical results demonstrate that the number of possible rankings expressible under dot-product models is exponentially bounded by the dimension of item factors. We empirically found that the low-dimensionality contributes to a popularity bias, widening the gap between the rank positions of popular and long-tail items; we also give a theoretical justification for this phenomenon.


INTRODUCTION
Matrix factorization (MF) [24] is a standard tool in recommender systems. MF can learn compact, retrieval-efficient representations and is thus easy to use, particularly in large-scale applications. On the other hand, owing to the advancement of deep learning, complex nonlinear models have also been adopted to enhance recommendation quality [54]. However, despite the advances in model architecture, most models share a common structure, i.e., dot-product models [47]. Dot-product models are a class of models that estimate the preference for a user-item pair by computing a dot product (inner product) between the user and item embeddings; MF is the simplest model in this class. This structure is essential for large-scale applications because it enables computationally efficient retrieval through vector search algorithms [25,34,52].
The dimensionality of user/item embeddings characterizes dot-product models. One interpretation of dimensionality is that it refers to the ranks of the user/item embedding matrices that minimize the errors between true and estimated preferences. In the extreme case where the dimensionality is one, the embedding of each user and item degenerates into a scalar. Notice here that the rankings estimated from a feedback matrix also degenerate into two unique rankings, namely, the popularity ranking and its reverse; for ranking prediction, a one-dimensional user embedding determines only the sign of the preference scores. Generalizing this, we arrive at a curious question: when the dimensionality of a dot-product model is low or high, what do the rankings look like?
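The one-dimensional case can be verified in a few lines; a toy sketch with hypothetical scalar factors, showing that every user's ranking collapses to one of exactly two orders:

```python
# With 1-dimensional embeddings, every user's ranking is either the ranking
# induced by the scalar item factors or its exact reverse (depending only on
# the sign of the user's scalar factor). All numbers below are hypothetical.
item_factors = [0.9, 0.3, 2.1, 1.4]      # scalar "embeddings" of 4 items
user_factors = [1.7, -0.2, 0.05, -3.0]   # scalar "embeddings" of 4 users

def ranking(u, items):
    # Sort item indices by descending score u * v_i.
    return tuple(sorted(range(len(items)), key=lambda i: -u * items[i]))

rankings = {ranking(u, item_factors) for u in user_factors}
print(rankings)  # only two distinct rankings, one per sign of u
```

Running this yields exactly two tuples: the popularity-style order (2, 3, 0, 1) for positive user factors and its reverse for negative ones.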
Previous studies [29,45,70] have reported the effectiveness of large dimensionalities in rating prediction tasks. Rendle et al. [49] also recently showed that high-dimensional models can achieve very high ranking accuracy under appropriate regularization. Furthermore, successful models in top-K item recommendation, such as variational autoencoder (VAE)-based models [32], often use large dimensions. These observations are somewhat counterintuitive since user feedback data are typically sparse and may thus lead to overfitting under large dimensionality. On the other hand, conventional studies of state-of-the-art methods often omit the investigation of dimensionality in models (e.g., [19,32,50,51,65]). Although exhaustive tuning of hyperparameters for model sizes (e.g., the number of hidden layers and the size of each layer) is unrealistic due to the experimental burden, the dimensionality of user and item embeddings receives rather little attention compared with other hyperparameters, such as learning rates and regularization weights. This further stimulates our interest above.
In this study, we investigate the effect of dimensionality in recommender systems based on dot-product models. We first present empirical observations from various viewpoints on recommendation quality, i.e., personalization, diversity, fairness, and robustness. Our results reveal a hidden side effect of low dimensionality: low-dimensional models incur a low model capacity with respect to these quality requirements even when the ranking quality seems to be maintained. In the convention of machine learning, we can often avoid model overfitting by using low-dimensional models. However, such models would suffer from potential long-term negative effects, namely, overfitting toward popularity bias. Consequently, low dimensionality leads to nondiverse, unfair recommendation results and thus insufficient data collection for producing models that can properly delineate users' individual tastes. Furthermore, we theoretically explain the cause of the observed phenomenon, the curse of low dimensionality. Our theoretical results, which apply to all dot-product models, provide evidence that increasing the dimensionality "exponentially" improves the expressive power of dot-product models in terms of the number of expressible rankings, and that we may not completely circumvent popularity bias. Finally, we discuss possible research directions based on the lessons learned.

PRELIMINARY: DOT-PRODUCT MODELS
In this section, we briefly describe dot-product models. Many practical models are classified as dot-product models [47] (also known as two-tower models [66]), which estimate the preference r̂_{u,i} ∈ R of a user u ∈ U for an item i ∈ V by an inner product between the embeddings of u and i as follows: r̂_{u,i} = ⟨φ(u), ψ(i)⟩, where φ(u) ∈ R^d and ψ(i) ∈ R^d are the embeddings of u and i, respectively. The design of the feature mappings φ : U → R^d and ψ : V → R^d depends on the overall model architecture; φ and ψ can be arbitrary models, such as MF and neural networks.
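As a concrete illustration, the scoring rule above can be sketched as follows; the embeddings are hypothetical toy values, and `top_k` stands in for the vector-search retrieval step:

```python
def dot(a, b):
    # Inner product <a, b> of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def top_k(user_emb, item_embs, k):
    """Rank items for one user by the dot product <phi(u), psi(i)>."""
    scores = [(dot(user_emb, v), i) for i, v in enumerate(item_embs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Hypothetical d = 3 embeddings for one user and four items.
user = [0.2, -1.0, 0.5]
items = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, -0.5, 0.5]]
print(top_k(user, items, 2))  # [3, 2]
```

In production, the exhaustive scoring loop is replaced by an approximate maximum inner-product search index, but the ranking semantics are the same.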
(V)AEs can also be interpreted as dot-product models [32,51]. Most AE-based models take the partial user feedback (a |V|-dimensional multi-hot vector) corresponding to one user as input and have a fully-connected layer that makes the final score prediction for all |V| items. Denoting the d-dimensional intermediate representation of the partial user feedback simply by q ∈ R^d, the |V|-dimensional structured prediction z ∈ R^|V| can be expressed as follows: z = f(Wq + b), where W ∈ R^{|V|×d} and b ∈ R^|V| are the weight matrix and the bias in the fully-connected layer, respectively. Here, f : R^|V| → R^|V| is the activation function (e.g., softmax), which is order-preserving 1.
Given an auxiliary vector φ(u) = [q; 1] ∈ R^{d+1} with an additional dimension of 1 and an auxiliary matrix W′ = [W, b] with an additional column of b, the predicted ranking is derived from the order of W′φ(u) ∈ R^|V| because f is order-preserving and does not affect the ranking prediction. Therefore, each row of W′ can be viewed as the corresponding item embedding ψ(i), and thus the prediction for u is also derived from the order of the dot products {⟨φ(u), ψ(i)⟩ | i ∈ V}. This point is often unregarded in empirical evaluation; for example, Liang et al. [32] used a large dimensionality of 600 for VAE-based models on the Million Song Dataset and Netflix Prize, whereas the MF-based baseline [24] employed a dimensionality of only 200.
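This reduction can be checked numerically; a small sketch with hypothetical weights (the names W, b, and q follow the text; the bias is absorbed into an extra embedding dimension):

```python
# Sketch: the final layer z = W q + b of an AE induces the same ranking as a
# pure dot-product model with item embeddings [W_i; b_i] and query [q; 1].
W = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]   # |V| = 3 items, d = 2 (toy values)
b = [0.1, -0.5, 0.2]
q = [0.7, 0.9]

# Affine scores W q + b.
scores_affine = [sum(w * x for w, x in zip(W[i], q)) + b[i] for i in range(3)]

# Dot-product scores with augmented query [q; 1] and rows of W' = [W, b].
q_aug = q + [1.0]
items_aug = [W[i] + [b[i]] for i in range(3)]
scores_dot = [sum(w * x for w, x in zip(items_aug[i], q_aug)) for i in range(3)]

assert scores_affine == scores_dot  # identical scores, hence identical ranking
```

Since the order-preserving activation f cannot change the ranking either, the AE's top-K prediction is exactly a dot-product retrieval in d+1 dimensions.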

EMPIRICAL OBSERVATION
We first present empirical observations that establish the motivation behind our theoretical analysis and discuss the possible effects of dimensionality on recommendation quality in terms of various aspects, such as personalization, diversity, item fairness, and robustness to biased feedback.

Experimental Setting
For our experiments, we use implicit alternating least squares (iALS) [24,28,49], which is widely used in practical applications and implemented in distributed frameworks, such as Apache Spark [39].
1 Formally, f is order-preserving if, for any z ∈ R^|V| and any indices i, j, z_i > z_j implies f(z)_i > f(z)_j.
Because iALS slows down in high-dimensional settings 2, we use a recently developed block coordinate descent solver [48].
We conduct experiments on three real-world datasets from various applications, i.e., MovieLens 20M (ML-20M) [18], Million Song Dataset (MSD) [8], and Epinions [35]. To create implicit feedback datasets, we binarize explicit data by keeping observed user-item pairs with ratings larger than four for ML-20M and Epinions. We utilize all user-item pairs for MSD. For Epinions, we keep the users and items with more than 20 interactions, as in conventional studies [1,2]. We also strictly follow the evaluation protocol of Liang et al. [32] based on a strong generalization setting.
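For concreteness, the following is a minimal, self-contained sketch of a weighted-ALS update in the spirit of iALS (the classic Hu et al. formulation); it is not the block coordinate descent solver of [48], and all sizes and hyperparameters are illustrative:

```python
import numpy as np

def ials(R, d=8, alpha=10.0, reg=0.1, iters=10, seed=0):
    """Minimal iALS-style weighted MF sketch on a binary |U| x |V| matrix R.
    Observed entries get confidence 1 + alpha; unobserved entries get
    confidence 1 with target 0, which yields the closed-form update below."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, d))
    V = rng.normal(scale=0.1, size=(n_items, d))
    for _ in range(iters):
        for X, Y, M in ((U, V, R), (V, U, R.T)):
            G = Y.T @ Y + reg * np.eye(d)          # Gramian over all rows of Y
            for i in range(X.shape[0]):
                idx = np.flatnonzero(M[i])          # observed interactions
                Yi = Y[idx]
                A = G + alpha * Yi.T @ Yi           # Y^T C Y + reg I
                rhs = (1.0 + alpha) * Yi.sum(axis=0)  # Y^T C r
                X[i] = np.linalg.solve(A, rhs)
    return U, V

# Toy run on random implicit feedback.
R = (np.random.default_rng(1).random((20, 15)) < 0.3).astype(float)
U, V = ials(R, d=4, iters=5)
scores = U @ V.T
print(scores.shape)  # (20, 15)
```

Each user (item) update solves a d x d linear system, which is where the cubic dependency on d discussed in Section 5 comes from.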

Personalization and Popularity Bias
Personalization is the primary aim of a recommender system, which requires a system to adapt its predictions to users according to their individual tastes. By contrast, the most-popular recommender is known to be an empirically strong yet anti-personalized baseline; that is, it recommends an identical ranking of items to all users [16]. Therefore, the degree of popularity bias in the predicted rankings can be regarded as an (inverse) indicator of personalization [1,2,12,68].
Figure 1 shows the personalization measure (i.e., Recall@K) and the average recommendation popularity (ARP@K) [68] for iALS models with various embedding sizes. Here, ARP@K is the average item popularity (i.e., empirical frequency in the training split) in a top-K list. We tune d ∈ {64, 128, …, 4,096} for ML-20M and MSD and d ∈ {8, 16, …, 512} for Epinions; because the numbers of items after preprocessing are 20,108 for ML-20M, 41,140 for MSD, and 3,423 for Epinions, we tune d in different ranges to ensure the low-rank condition of MF. It can be observed that, for all settings, the models with small dimensionalities exhibit extremely large ARP@K values (see blue lines). Particularly at the top of the rankings, the popularity bias in the prediction is severe (shown by the leftmost figures in the top and bottom rows). These results suggest that low-dimensional models recommend many popular items and therefore suffer from anti-personalization. Furthermore, insufficient dimensionality leads to low achievable ranking quality.
On the other hand, the trends exhibited by high-dimensional models differ between ML-20M on the one hand and MSD and Epinions on the other. Although the ranking quality saturates at a relatively low dimensionality of 1,024 on ML-20M for all K, the quality on MSD and Epinions can be further improved with higher dimensionality. Moreover, the popularity bias on ML-20M is lowest with d = 512, which is not optimal in terms of ranking quality, whereas the best dimensionality in terms of ranking quality for MSD also performs best in terms of popularity bias. This is possibly because the popularity bias is more severe in ML-20M than in MSD; thus, reconciling high quality and low bias is difficult for ML-20M. To confirm this, Figure 2 illustrates the distribution of item popularity in ML-20M, MSD, and Epinions. The y-axis represents the normalized frequency of each item (i.e., relative item popularity) in the training dataset. The popularity bias in ML-20M is more intense than that in MSD; there are many "long-tail" items in ML-20M. Remarkably, although the results differ between ML-20M and MSD, we can draw the same conclusion: sufficient dimensionality is necessary to alleviate the existing popularity bias in ML-20M or to represent highly personalized recommendation results for MSD and Epinions.

[Figure 3: Effect of the dimensionality of iALS on catalog coverage and item fairness on ML-20M and Epinions (K = 20). Each line indicates the Pareto frontier of R@K vs. Cov@K (top row) or R@K vs. Negative Gini@K (bottom row) of models with fixed dimensionality.]

Diversity and Item Fairness
Catalog coverage is one of the concepts related to the diversity of recommendation results (i.e., aggregate diversity) [3-5, 21] and refers to the proportion of items that are recommended. It can be viewed as the capacity of the effective item catalog under a recommender system. On the other hand, item fairness is an emerging concept similar to catalog coverage, yet it applies to different situations and requirements [10]. When each item belongs to a certain user, the recommendation opportunity for the items is part of the user utility; for instance, in an online dating application where each user corresponds to an item, users (as items) obtain more chances of success when they are recommended more frequently.
Figure 3 shows the effect of dimensionality on catalog coverage and item fairness for ML-20M and Epinions. Each curve indicates the Pareto frontier of an iALS model with a fixed dimensionality under various hyperparameter settings; we omitted this experiment for the large-scale MSD because of the experimental burden. We use Coverage@K (Cov@K) and Negative Gini@K as the measures of catalog coverage and item fairness, respectively. Cov@K is the proportion of items that appear in the top-K results at least once [21], and Negative Gini@K is the negative Gini index [6] of items' frequencies in the top-K results 3. A clear trend can be observed for all settings: models with larger dimensions achieve higher capacities in terms of both catalog coverage and item fairness. This result implies that low-dimensional models cannot produce diverse or fair recommendation results. Notably, there exist pairs of models that are almost equivalent in terms of ranking quality (i.e., R@K) but substantially different in catalog coverage or item fairness. This suggests a serious potential pitfall in practice. When developers evaluate and select models based only on ranking quality, a reasonable choice may be a low-dimensional model due to its space efficiency. However, such a model can lead to extremely nondiverse and unfair recommendation results. Even when developers select models based on both ranking quality and diversity, the versatility of models is severely limited if the dimensionality parameter is tuned within a narrow range of values owing to computational resource constraints.
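Both measures can be computed directly from the top-K lists; a sketch on hypothetical data (the Gini helper uses the standard sorted-frequency formula, with sign flipped for Negative Gini@K so that larger is fairer):

```python
from collections import Counter

top_k_lists = {"u1": ["a", "c"], "u2": ["a", "c"], "u3": ["a", "b"]}
catalog = ["a", "b", "c", "d"]

def coverage_at_k(top_k_lists, catalog):
    # Fraction of the catalog recommended to at least one user.
    recommended = set(i for items in top_k_lists.values() for i in items)
    return len(recommended) / len(catalog)

def gini_at_k(top_k_lists, catalog):
    # Gini index of items' exposure frequencies in the top-K lists
    # (0 = perfectly equal exposure, values near 1 = concentrated exposure).
    freq = Counter(i for items in top_k_lists.values() for i in items)
    xs = sorted(freq.get(i, 0) for i in catalog)
    n, total = len(xs), sum(xs)
    cum = sum((2 * (k + 1) - n - 1) * x for k, x in enumerate(xs))
    return cum / (n * total)

print(coverage_at_k(top_k_lists, catalog))   # 0.75
print(-gini_at_k(top_k_lists, catalog))      # Negative Gini@K
```

Here item "d" is never recommended, so coverage is 3/4 and the exposure distribution is noticeably unequal.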

Self-Biased Feedback Loop
To capture the dynamic nature of user interests, a system usually retrains a model after observing data within a certain time interval. By contrast, hyperparameters are often fixed across retraining because hyperparameter tuning is a costly process, particularly when models are frequently updated. Hence, the robustness of deployed models (including their hyperparameters) to dynamic changes in user behavior is critical. This is also related to unbiased recommendation [13] and batch learning from logged bandit feedback [57]. Because user feedback is collected under the currently deployed system, item popularity is formed in part by the system itself. Therefore, when a system narrows down its effective item catalog, as demonstrated in Section 3.3, the data observed in the future concentrate on items that are frequently recommended by the system. This phenomenon accelerates the popularity bias in the data and further increases the number of cold-start items.
To observe the effect of dimensionality on data collection in a training and observation loop, we repeatedly train and evaluate iALS models with different dimensionalities on ML-20M and MSD. Following a weak generalization setting, we first observe 50% of the feedback for each user in the original dataset. We then train a model using all samples in the observed dataset and predict the top-50 rankings for all users, removing the observed user-item pairs from the rankings. Subsequently, we observe the positive pairs in the predicted rankings as an additional dataset for the next model training. In the evaluation phase, we compute the proportion of observed samples for each user; we simply call this measure recall for users. Furthermore, we also compute this recall measure for items to determine the degree to which the system can collect data for each item.
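The loop's bookkeeping can be sketched as follows; a stand-in "model" (random scores) replaces iALS here, and all sizes are hypothetical, since the point is the observe/train/recommend/fold-back cycle rather than the learner:

```python
import random

random.seed(0)
n_users, n_items, K = 5, 12, 3
# Hypothetical ground-truth positives; initially observe 3 of 6 per user (50%).
true_positives = {u: set(random.sample(range(n_items), 6)) for u in range(n_users)}
observed = {u: set(random.sample(sorted(s), 3)) for u, s in true_positives.items()}

def recommend(u, observed_u):
    # Stand-in model: random scores over the not-yet-observed items.
    scores = [(random.random(), i) for i in range(n_items) if i not in observed_u]
    return [i for _, i in sorted(scores, reverse=True)[:K]]

for epoch in range(3):
    # A real system would (re)train the model on `observed` here.
    for u in range(n_users):
        for i in recommend(u, observed[u]):
            if i in true_positives[u]:     # exposed item turns into feedback
                observed[u].add(i)

recall = sum(len(observed[u]) / len(true_positives[u])
             for u in range(n_users)) / n_users
print(round(recall, 3))  # "recall for users" after three feedback epochs
```

With an actual recommender plugged in, items the model never surfaces can never enter `observed`, which is precisely how a narrow effective catalog starves future training data.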
Figure 4 shows the evolution of the recall measures for users and items. For both ML-20M and MSD, models with higher dimensionalities achieve more efficient data collection for both users and items. Remarkably, the difference is substantial in the data collection for items; in the figures in the second and fourth columns, a much larger efficiency gap can be observed between the high- and low-dimensional models. Furthermore, this trend is emphasized for MSD, which has a larger item catalog than ML-20M (as shown by the figure in the fourth column of the bottom row). The results with d = 64 for MSD in terms of median recall for users and items are remarkable, as shown by the red lines in the third and fourth columns of the bottom row. The model with d = 64 can collect data from users to some extent (third column), whereas the data come from a small proportion of items (fourth column). Here, Figure 5 illustrates the performance gap (mean recall for users) between the model with d = 64 and those with d = 128 and 256. Although the gap in the first epoch is relatively small for each setting, it grows over the next few epochs owing to the different efficiency of data collection; interestingly, on ML-20M, the models with d = 128 and 256 underperform the model with d = 64 only in the first epoch. The performance gain is emphasized particularly with larger d values. These results imply that the gap between low- and high-dimensional models may become tangible particularly in a training and observation loop, whereas the evaluation protocol in academic research simulates only the first epoch.
In summary, the results obtained in this part of the study are in good agreement with those presented in the previous sections. Dimensionality determines the model capacity in terms of various aspects of recommendation quality beyond accuracy. As demonstrated earlier, deterioration in diversity, item fairness, and data collection ultimately affects long-term accuracy.

Summary of Empirical Results
Here, we summarize the empirical results obtained thus far. We first evaluated MF models (i.e., the simplest dot-product models) in the standard setting of item recommendation in Section 3.2. We observed that high-dimensional models tend to achieve high ranking quality and low popularity bias in their predicted rankings. To examine this phenomenon in depth, Section 3.3 investigated the relationship between dimensionality and diversity/fairness. We found a clear trend showing that low dimensionality limits the versatility of models in terms of diversity and item fairness rather than ranking quality; even when the other hyperparameters are tuned, the achievable performance in terms of diversity/fairness stays within a narrow range. In Section 3.4, we further investigated the effect of such a limited model space on data collection and long-term accuracy. We simulated a practical situation wherein a system retrains models by using incrementally observed data under its own recommendation policy. The results suggest that the data collected under low-dimensional models are severely biased by the model itself and thus impede future accuracy improvements. In the following section, we dive into the mechanism of this "curse of low dimensionality" phenomenon.

THEORETICAL INVESTIGATION
We present (a summary of) theoretical analyses of the expressive power of dot-product models to support the empirical results provided in Section 3; the formal statements and proofs are deferred to Appendix A. Our results are twofold and highly specific to dot-product models.

[Figure 6: Illustration of the proof idea of Theorem 4.1, which characterizes representable rankings of size K by regions in hyperplane arrangements. There are four item vectors v_1, v_2, v_3, v_4. Each dashed line connects a pair of the four vectors; each bold line is orthogonal to some dashed line and intersects the origin. These bold lines generate twelve regions in total, each of which corresponds to a distinct ranking.]
Bounding the Number of Representable Rankings (cf. Appendix A.1). First, we investigate how many different rankings can be expressed over a fixed list of item vectors. Our hypothesis is that we suffer from popularity bias and/or cannot achieve diversity and fairness satisfactorily owing to low expressive power. In particular, we are interested in the number of representable rankings parameterized by the number of items n, the dimensionality d, and the size of rankings K.
Slightly formally, we say that a ranking σ of size K is representable over n item vectors v_1, …, v_n in R^d if there exists a query vector q ∈ R^d (e.g., a user embedding) such that σ is consistent with the ranking uniquely obtained by arranging the items in descending order of ⟨q, v_i⟩. We devise upper and lower bounds on the number of representable rankings, informally stated as:
Theorem 4.1 (informal; see Theorems A.1 and A.2). The following holds:
• Upper bound: For every n item vectors in R^d, the number of representable rankings of size K over them is at most n^{min{K, 2d}}.
• Lower bound: There exist n item vectors in R^d such that the number of representable rankings of size K over them is n^{Θ(d)}.
Our upper and lower bounds imply that the maximum possible number of representable rankings of size K is essentially n^{Θ(min{K, d})}, offering the following insight: increasing the dimensionality d "exponentially" improves the expressive power of dot-product models. Figures 6 and 7 illustrate the proof ideas of our upper and lower bounds, which characterize representable rankings by a hyperplane arrangement and by the facets of a polyhedron, respectively.
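The dependence on d can be probed numerically; the following sketch counts distinct full rankings reachable by random query vectors for the same (hypothetical) items embedded in d = 1 versus d = 2 (random sampling only finds a subset of the representable rankings, which suffices for the comparison):

```python
import random

random.seed(0)

def distinct_rankings(item_vecs, trials=2000):
    # Sample random Gaussian queries and record the induced rankings.
    d = len(item_vecs[0])
    seen = set()
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d)]
        scores = [sum(a * b for a, b in zip(q, v)) for v in item_vecs]
        seen.add(tuple(sorted(range(len(item_vecs)), key=lambda i: -scores[i])))
    return seen

items_2d = [(1.0, 0.2), (0.3, 1.1), (-0.7, 0.9), (0.8, -0.6)]
items_1d = [(v[0],) for v in items_2d]   # project onto one dimension

print(len(distinct_rankings(items_1d)))  # exactly 2: one order and its reverse
print(len(distinct_rankings(items_2d)))  # strictly more rankings with d = 2
```

With d = 1 only the sign of the scalar query matters, matching the degenerate case discussed in the introduction, while d = 2 already unlocks many more regions of the corresponding hyperplane arrangement.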

Mechanism behind Popularity Bias (cf. Appendix A.2).
We then study the mechanism behind popularity bias and its effect on the space of representable rankings. Consider first that there exists a pair of sets consisting of popular and long-tail items. Usually, many users prefer popular items to long-tail ones, which turns out to restrict the space from which user embeddings are chosen; we can thus easily infer that some similar-to-popular items rank higher than the long-tail items as well. Slightly formally, given a pair of sets, P = {p_1, …, p_{|P|}} and L = {l_1, …, l_{|L|}}, consisting of popular and long-tail items, respectively, we force a query vector q (e.g., a user embedding) to always rank the items of P higher than the items of L. Under this setting, we establish a structural characterization of the "near-popular" items that are ranked higher than L, informally stated as: Theorem 4.2 (informal; see Theorems A.8 and A.9). Suppose that a query vector q ranks all items of P higher than all items of L. Then, if an item vector s is included in a particular convex cone that contains the convex hull of P, s ranks higher than every item of L (i.e., s is popular).
Moreover, this cone becomes bigger as more popular or long-tail items are added (i.e., as |P| or |L| increases).
Example 4.3. Figure 8 shows an illustration of Theorem 4.2. We are given three popular items P = {p_1, p_2, p_3} and two long-tail items l_1, l_2. Given that a query vector q ranks the three popular items higher than l_1 only, q ranks another item s higher than l_1 whenever s is in S(P, {l_1}), which is the convex cone denoted by northeast blue lines. Similarly, if q ranks P higher than l_2, any s in S(P, {l_2}), denoted by northwest red lines, ranks higher than l_2.
Consider now the case where the popular items of P rank higher than both l_1 and l_2. Then, an item s ranks higher than l_1 and l_2 if s is included in S(P, {l_1, l_2}), which is the convex cone delimited by the two arrows. This convex cone is larger than S(P, {l_1}) and S(P, {l_2}).
The above theorem suggests that the existence of a small number of popular and long-tail items makes other near-popular items superior to long-tail ones, reducing the number of representable rankings, and thus we may not completely avoid popularity bias.
In conclusion, our theoretical results justify the empirical observations: the limited catalog coverage with low dimensionality in Section 3.3 is due to the limited number of representable rankings shown in Theorem 4.1, and the popularity bias (a large value of ARP@K) observed in Section 3.2 is partially explained by Theorem 4.2 and the discussion in Appendix A.2. Furthermore, as our theoretical methodology only assumes that the underlying architecture follows dot-product models, the counter-intuitive phenomena of low dimensionality apply not only to iALS but also to any dot-product model.

[Figure 8: Let P = {p_1, p_2, p_3} be a set of three popular items and L = {l_1, l_2} be a set of two long-tail items. Any item in S(P, {l_1}) (resp. S(P, {l_2})), denoted by northeast blue (resp. northwest red) lines, ranks higher than l_1 (resp. l_2) whenever all popular items of P rank higher than l_1 (resp. l_2). The intersection of S(P, {l_1}) and S(P, {l_2}) is crosshatched. S(P, {l_1, l_2}) is a convex cone defined by two arrows, which includes S(P, {l_1}) ∩ S(P, {l_2}). Any item in S(P, {l_1, l_2}) ranks higher than both l_1 and l_2 whenever all items of P rank higher than all items of L.]

DISCUSSION AND FUTURE DIRECTION
Efficient Solvers for High Dimensionality. High-dimensional models are often computationally costly. Even in the most efficient methods based on MF, the optimization involves a cubic dependency on d and thus does not scale well for high-dimensional models. Motivated by this, previous studies have explored efficient methods for high-dimensional models [7,20,48]. Because the traditional ALS solver for MF is in the class of block coordinate descent (CD), conventional methods can be derived by designing the choice of blocks [48]. It is hence interesting to design block CD methods for high-dimensional models considering efficiency on modern computer architectures (e.g., CPUs with SIMD, GPUs, and TPUs [37]). Developing solvers for more complex models, such as factorization machines [7,46], is useful for leveraging side information. Since deep learning-based models are also employed, efficient solvers for such non-linear models [66] will be beneficial for enhancing ranking quality. The extension of conventional algorithms to stochastic optimization, which uses a subset of data samples in a single update, is also important for using memory-intensive complex models and large-scale data. To reduce the memory footprint of embeddings, sparse representations may also be promising [42]. As sparsity constraints introduce another difficulty in optimization, techniques such as the alternating direction method of multipliers (ADMM) [9] would be required; ADMM is a prevalent approach to enabling parallel and scalable optimization under constraints [55,56,59,69]. Improving the serving cost of high-dimensional models through maximum inner-product search (MIPS) [25,34,52] is also essential.
Efficient Methods for Diversity and Item Fairness. Our empirical results suggest that diversity and item fairness may also be beneficial for efficient data collection and long-term accuracy. We can also infer from our results that directly optimizing diversity and item fairness is a promising approach to enhancing long-term recommendation quality. From a practical point of view, however, diverse and fair item recommendation is often computationally costly because of combinatorial explosions in large-scale settings. Hence, it is an important direction to develop efficient diversity-aware recommendation [14,15,61,63,67]. In the same spirit as dot-product models, efficient sampling techniques based on MIPS are essential for real-time retrieval [17,22]. On the other hand, fairness-aware item recommendation is a challenging task in terms of computational efficiency. Because fairness requires controlling item exposure while considering the rankings of all users, it is relatively complex in both the optimization and prediction phases compared with fairness-agnostic top-K ranking problems. There exists a variety of approaches to implementing fair recommender systems based on constrained optimization [11,38,44,53,59]. Although we focus on offline collaborative filtering, online recommendation methods based on bandit algorithms [26,27,31,33,43,60] are also promising for directly integrating accuracy optimization and data collection.
Further Theoretical Analysis on Dot-product Models. Our theoretical results might help us gain a deeper understanding of the expressive power of dot-product models. Besides the open problems described in Section 4, there is still much room for further investigation of this expressive power. We may consider a fine-grained analysis of representable K-permutations for general K, e.g., establishing the exact upper or lower bounds of nperm_K. One major limitation of Theorem 4.1 is that it does not promise that every ranking is representable (over some d-dimensional item vectors); i.e., we cannot rule out the existence of a small set of rankings that cannot be expressed under dot-product models. Thus, a possible direction is to analyze the representability of an input set Π of rankings: can we construct item vectors over which each ranking of Π is representable? This question may be thought of as an inverse problem to that discussed in Appendix A.1.

CONCLUSION
In this paper, we presented empirical results that reveal the necessity of sufficient dimensionality in dot-product models. Our results suggest that low dimensionality leads to overfitting to the existing popularity bias and further amplifies the bias in future data. This phenomenon, referred to as the curse of low dimensionality, partly causes modern problems in recommender systems regarding personalization, diversity, fairness, and robustness to biased feedback. We then theoretically discussed the phenomenon from the viewpoint of representable rankings, which are the rankings that a model can represent. We showed a bound on the number of representable rankings for d-dimensional models. This result suggests that low dimensionality leads to an exponentially small number of representable rankings. We also explained the effect of popular items on representable rankings. Finally, we established a structural characterization of near-popular items, suggesting a mechanism behind popularity bias under dot-product models.

A FORMAL STATEMENTS AND PROOFS

A.1 Number of Representable Rankings
We devise lower and upper bounds on the number of representable rankings parameterized by the number of items, the dimensionality, and the size of rankings. Hereafter, we identify the set V of n items with [n] ≜ {1, 2, …, n}. For nonnegative integers n ∈ N and K ∈ [n], let S_n denote the set of all permutations over [n], and let S_{n,K} denote the set of all K-permutations of [n] (also known as partial permutations). We now define the representability of K-permutations. For n items, let v_1, …, v_n be vectors in R^d representing their embeddings. Without much loss of generality, we assume that they are in general position; i.e., no d+1 vectors lie on a common hyperplane. Given a query vector q ∈ R^d (e.g., a user embedding), we generally produce a ranking obtained by arranging K items in descending order of ⟨q, v_i⟩. We thus say that q over v_1, …, v_n represents a K-permutation σ ∈ S_{n,K} if ⟨q, v_{σ(i)}⟩ > ⟨q, v_{σ(j)}⟩ for all 1 ≤ i < j ≤ K and ⟨q, v_{σ(K)}⟩ > ⟨q, v_i⟩ for every item i not appearing in σ, and that σ is representable if such a q exists. Here, we emphasize that "ties" are not allowed; i.e., q does not represent σ whenever ⟨q, v_i⟩ = ⟨q, v_j⟩ for some i ≠ j. We let nperm_K(v_1, …, v_n) denote the number of representable K-permutations over v_1, …, v_n. By definition, nperm_K(v_1, …, v_n) ≤ |S_{n,K}| ≤ n^K. Our first result (Theorem A.1) is an upper bound on nperm_K.
The proof uses a characterization of representable permutations by hyperplane arrangements, which is illustrated in Figure 6.
Subsequently, we provide a lower bound on nperm_K in terms of the number of facets of a polyhedron. Let P ≜ conv({v_1, …, v_n}) ⊂ R^d be the convex hull of v_1, …, v_n; every vertex of P corresponds to some v_i. For a vector a ∈ R^d and a scalar b ∈ R, a linear inequality ⟨a, x⟩ ≤ b is said to be valid if ⟨a, x⟩ ≤ b for all x ∈ P. A subset F of P is called a face of P if F = P ∩ {x : ⟨a, x⟩ = b} for some valid linear inequality ⟨a, x⟩ ≤ b. In particular, (d−1)-dimensional faces are called facets. Every facet includes exactly d vertices of P by definition (whenever the n vectors are in general position). Our second result is as follows.
Theorem A.2. For any n vectors v_1, …, v_n in R^d in general position, nperm_d(v_1, …, v_n) is at least the number of facets of P. Moreover, there exist n vectors v_1, …, v_n in R^d in general position whose convex hull has n^{Θ(d)} facets. By Theorems A.1 and A.2, the maximum possible number of representable K-permutations over n vectors in R^d is n^{Θ(min{K, d})} for all K ∈ [n] and d = O(1).
It remains to prove Theorems A.1 and A.2.
Proof of Theorem A.1. For each 1 ≤ i < j ≤ n, we introduce a pairwise preference π_{i,j} ∈ {±1} between i and j. We wish for a query vector q ∈ R^d to ensure that item i (resp. j) ranks higher than item j (resp. i) if π_{i,j} is +1 (resp. −1). This requirement is equivalent to ⟨q, v_i − v_j⟩ · π_{i,j} > 0. Thus, if the following system of linear inequalities is feasible, any of its solutions q represents the unique permutation consistent with the π_{i,j}'s:

⟨q, v_i − v_j⟩ · π_{i,j} > 0 for all 1 ≤ i < j ≤ n. (1)

Because H_{i,j}, the open half-space of q's satisfying the (i, j)-th inequality of Eq. (1), is obtained by cutting R^d with the unique hyperplane that is orthogonal to v_i − v_j and passes through the origin, the number of π_{i,j}'s for which Eq. (1) is feasible is equal to the number of regions generated by this hyperplane arrangement, a quantity that has been investigated in geometric combinatorics. By [23, 64], the number of regions generated by m ≜ n(n−1)/2 hyperplanes of dimension d−1 that have a common point is at most 2 Σ_{0 ≤ i ≤ d−1} (m−1 choose i), thereby completing the proof. □

Remark A.3. Figure 6 illustrates the equivalence between representable permutations and regions of hyperplane arrangements. There are four vectors v_1, v_2, v_3, v_4 in R^2. Each dashed line connects a pair of the four vectors; each bold line is orthogonal to some dashed line and passes through the origin. These hyperplanes generate twelve regions, each of which expresses a distinct permutation. The number "12" is tight because the bound above evaluates to 2((5 choose 0) + (5 choose 1)) = 12 for m = 6 and d = 2.
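The count in Remark A.3 can be reproduced empirically: sampling many query vectors visits every region of the arrangement, and for n = 4 items in general position in R^2 exactly 12 of the 4! = 24 permutations are representable. The coordinates below are hypothetical, since Figure 6's are not given.

```python
import numpy as np
from math import comb

# Hypothetical item embeddings in general position: n = 4 items, d = 2.
V = np.array([(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (1.0, 3.0)])
n, d = V.shape
m = comb(n, 2)                                  # number of hyperplanes

# Every query vector lies in one region of the arrangement of the m
# hyperplanes orthogonal to the differences v_i - v_j; all queries in a
# region induce the same permutation, so dense sampling enumerates them.
rng = np.random.default_rng(0)
Q = rng.standard_normal((200_000, d))
order = np.argsort(-(Q @ V.T), axis=1)          # descending item rankings
perms = set(map(tuple, order))

bound = 2 * sum(comb(m - 1, i) for i in range(d))  # region-count bound
assert len(perms) == bound == 12                # 12 of the 24 total orders
```

The sampled count matches the bound exactly here because the six hyperplanes are pairwise distinct, so the bound of [23, 64] is attained.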
Before going into the proof of Theorem A.2, we prove the following claim, which partially characterizes representable k-permutations and is illustrated in Figure 7:

Claim A.4. For any set S ⊆ [n] of k items, if there exists a facet F of P that includes v_i for every i ∈ S, then there is a query vector q ∈ R^d that represents a k-permutation consisting only of the items in S.
Proof of Claim A.4. By the definition of facets, there exist a ∈ R^d and b ∈ R such that ⟨v_i, a⟩ = b for all i ∈ S while ⟨v_i, a⟩ < b for all i ∉ S. Thus, letting q ≜ a (perturbed infinitesimally so as to break the ties among the items of S while keeping them ranked above the rest) completes the proof. □

Remark A.5. Figure 7 gives an illustration of Claim A.4. There are six vectors v_1, . . ., v_6 in R^2. Their convex hull P has v_1, v_2, v_3, v_4 as its vertices; each dashed segment represents a facet of P. Taking an outward normal vector of a facet, drawn as a bold arrow, as a query vector, we obtain a 2-permutation dominated by the two vectors on that facet.
Proof of Theorem A.2. By Claim A.4, the number of representable k-permutations is at least the number of facets of P. By the upper bound theorem due to McMullen [36, 41], the maximum possible number of facets of a convex polytope on n vertices in R^d is Θ(n^⌊d/2⌋). This bound is achieved when P is a cyclic polytope,^5 thus completing the proof. □

Example A.6. Finally, we show by a simple example that facets do not fully characterize representable k-permutations; i.e., Claim A.4 is not tight. Four item vectors v_1, v_2, v_3, v_4 in R^2 are defined as v_1 = (+1, 0), v_2 = (−1, 0), v_3 = (0, +1), v_4 = (0, +1/2). The convex hull P is clearly the triangle formed by {v_1, v_2, v_3}. By Claim A.4, for each facet of P, there is a representable 2-permutation dominated by the vectors on that facet. However, the query vector q = (0, 1) produces a 2-permutation dominated by v_3 and v_4, even though v_4 lies on no facet of P. We leave the complete characterization of representable k-permutations as an open problem.
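Example A.6 can be verified directly with the stated coordinates:

```python
import numpy as np

# The four item vectors of Example A.6 (0-indexed: v1 -> 0, ..., v4 -> 3).
V = np.array([(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, 0.5)])
q = np.array([0.0, 1.0])

scores = V @ q                    # [0.0, 0.0, 1.0, 0.5]
top2 = np.argsort(-scores)[:2]    # the represented 2-permutation
assert list(top2) == [2, 3]       # v3 then v4, yet v4 lies on no facet of P
```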

A.2 Mechanism behind Popularity Bias
Subsequently, we study the mechanism behind popularity bias and its effect on the space of representable rankings. Suppose we wish a query vector q ∈ R^d to rank the embeddings of the |P| popular items, denoted P ≜ {p_1, . . ., p_{|P|}}, above those of the |L| long-tail items, denoted L ≜ {l_1, . . ., l_{|L|}}; let Q(P, L) ≜ {q ∈ R^d : ⟨q, p_i⟩ > ⟨q, l_j⟩, ∀i ∈ [|P|], j ∈ [|L|]} denote the set of such query vectors. We can easily decide whether Q(P, L) is empty or not as follows.
Observation A.7. Q(P, L) ≠ ∅ if and only if there exists a hyperplane that separates P from L.

Now suppose that query vectors are chosen only from Q(P, L). What type of item vector would always rank higher than L? We define S(P, L) as the closure^6 of the set of vectors in R^d that rank higher than all vectors of L whenever q ∈ Q(P, L); namely, S(P, L) ≜ {s ∈ R^d : ⟨q, s⟩ ≥ ⟨q, l_j⟩, ∀j ∈ [|L|], q ∈ Q(P, L)}.

^5 The cyclic polytope is defined as conv({γ(t_1), . . ., γ(t_n)}) for distinct t_i's, where γ(t) ≜ (t, t^2, . . ., t^d) is the moment curve.
^6 We take the closure for the sake of analysis.
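Observation A.7 suggests a simple computational test for the nonemptiness of Q(P, L): since q ∈ Q(P, L) iff ⟨q, p_i − l_j⟩ > 0 for all pairs, a perceptron-style pass over the difference vectors finds such a q whenever one exists. The embeddings below are hypothetical, chosen so that popular items sit farther out along a shared direction.

```python
import numpy as np
from itertools import product

def find_query(P_items, L_items, iters=10_000):
    """Perceptron-style search for q in Q(P, L), i.e., a query scoring every
    popular item strictly above every long-tail item; None if not found."""
    diffs = np.array([p - l for p, l in product(P_items, L_items)])
    q = np.zeros(diffs.shape[1])
    for _ in range(iters):
        violated = diffs[diffs @ q <= 0]   # pairs ranked the wrong way
        if len(violated) == 0:
            return q
        q = q + violated[0]                # classic perceptron update
    return None

# Hypothetical 2-D embeddings: popular items farther out in a shared cone.
P_items = np.array([(2.0, 1.5), (1.8, 2.2)])
L_items = np.array([(0.5, 0.3), (0.2, 0.6)])
q = find_query(P_items, L_items)
assert q is not None
assert (P_items @ q).min() > (L_items @ q).max()   # q lies in Q(P, L)
```

When no separating hyperplane exists, the loop exhausts its iterations and returns None; an exact decision could instead be made with a linear-programming feasibility check.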

Figure 1: Effect of the dimensionality of iALS on popularity bias in recommendation results.

Figure 2: Distributions of relative item popularity.

Figure 4: Effect of the dimensionality on data collection efficiency.

Figure 5: Improvement of models with d = 128, 256 over that with d = 64 in terms of mean recall for users.

Figure 7: Illustration of the proof idea of Theorem 4.1, which partially characterizes representable rankings of size k by facets of a polyhedron. There are six item vectors v_1, . . ., v_6 in R^2, whose convex hull P has v_1, v_2, v_3, v_4 as its vertices; each dashed segment represents a facet of P. For each facet, there exists a ranking of size 2 dominated by the two item vectors on the facet.

On the one hand, for any representable permutation σ ∈ S_n, the system (1) defined by π_{i,j} = +1 if i ranks higher than j and π_{i,j} = −1 if j ranks higher than i must be feasible. On the other hand, two different sets of pairwise preferences derive distinct permutations (if these exist). Hence, nperm_n(v_1, . . ., v_n) is equal to the number of π_{i,j}'s for which Eq. (1) has a solution. Observe now that Eq. (1) is feasible if and only if the intersection of the H_{i,j}'s over all 1 ≤ i < j ≤ n is nonempty, where H_{i,j} is the open half-space defined as H_{i,j} ≜ {q ∈ R^d : ⟨q, v_i − v_j⟩ · π_{i,j} > 0}.