Test-Time Embedding Normalization for Popularity Bias Mitigation

Popularity bias is a widespread problem in the field of recommender systems, where popular items tend to dominate recommendation results. In this work, we propose 'Test Time Embedding Normalization' as a simple yet effective strategy for mitigating popularity bias, which surpasses the performance of previous mitigation approaches by a significant margin. Our approach uses the normalized item embedding during the inference stage to control the influence of embedding magnitude, which is highly correlated with item popularity. Through extensive experiments, we show that our method, combined with the sampled softmax loss, reduces popularity bias more effectively than previous bias mitigation approaches. We further investigate the relationship between user and item embeddings and find that the angular similarity between embeddings distinguishes preferable and non-preferable items regardless of their popularity. This analysis explains the mechanism behind the success of our approach in eliminating the impact of popularity bias. Our code is available at https://github.com/ml-postech/TTEN.


INTRODUCTION
In recent years, recommender systems have been highly successful in providing personalized recommendations by analyzing user history. These systems are widely employed in various domains, such as e-commerce, recruitment, and online content platforms [5,16,17]. However, popularity bias is a common challenge faced by recommender systems. Popularity bias refers to a phenomenon where popular items are overrepresented in the recommendation results, while less popular items receive less exposure than they deserve [1]. This bias arises from the inherent nature of recommendation algorithms: popular items account for a large portion of the training data, which leads them to become more dominant in the recommendation list [2].
In this paper, we introduce a novel strategy called 'Test Time Embedding Normalization (TTEN)' to mitigate popularity bias in recommender systems. We begin by showing that there exists a significant correlation between the popularity of an item and the magnitude of its embedding. We further analyze the cosine similarity between the user and item embeddings. When the model is trained with a proper loss, such as the sampled softmax loss [20], we find that the cosine similarity distinguishes preferable and non-preferable items regardless of their popularity. Interestingly, this observation indicates that well-known models are inherently capable of disentangling popularity and preference without the need for explicit bias mitigation algorithms [11,12,14,18]. The commonly used inner product score function, however, strengthens popularity bias because it multiplies the cosine similarity by the magnitudes of the embeddings. Building upon this observation, we aim to control the effect of popularity by normalizing item embeddings when generating the recommendation results. We conduct extensive experiments and find that the proposed approach outperforms existing state-of-the-art popularity bias mitigation strategies.
Our contributions can be summarized as follows:
• We propose a test-time embedding normalization method with controllable normalization strength.
• We show that normalization combined with the sampled softmax loss can outperform existing approaches aimed at popularity bias mitigation.
• Our analysis shows that the magnitude of item embeddings is highly correlated with item popularity, and that cosine similarity is sufficient to capture the relevance of items.

RELATED WORK
We introduce previous studies that aim to mitigate popularity bias in recommender systems. Liang et al. [12] employ a re-weighting strategy that assigns weights inversely proportional to the item's popularity. Zheng et al. [21] separate user and item embeddings into interest and conformity using a negative sampling strategy based on causality. Ren et al. [14] propose a gradient-based method to address popularity bias. They argue that popularity bias arises due to the dominance of popular items among the positive items, resulting in a significantly larger gradient for positive items than for negative items. Consequently, a gradient-adjusting algorithm is introduced to mitigate the popularity bias. Wu et al. [20] suggest sampled softmax for recommender systems. This work shows that sampled softmax maximizes discounted cumulative gain and alleviates popularity bias, as it samples popular items as negatives more frequently.

Preliminaries
Problem Formulation. Let $\mathcal{U} = \{1, \ldots, M\}$ denote the set of users and $\mathcal{I} = \{1, \ldots, N\}$ the set of items. The interactions between users and items can be represented as a binary matrix $Y \in \mathbb{R}^{M \times N}$, where $y_{ui} = 1$ if user $u$ interacted with item $i$ and $y_{ui} = 0$ otherwise. Embedding-based models in recommender systems aim to learn user embeddings $e_u \in \mathbb{R}^{d}$ and item embeddings $e_i \in \mathbb{R}^{d}$, where $d$ is the embedding dimension [7,8,10]. Once the embeddings are obtained, the inner product between user and item embeddings is used to compute the relevance score $r_{ui} = e_u^{\top} e_i$ between user $u$ and item $i$. The top-$K$ most relevant items are then recommended to each user.
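The scoring and top-K step described above can be illustrated with a minimal sketch. It uses plain Python lists so that it is self-contained; a real system would use a tensor library. The function names, the toy embeddings, and the choice to exclude items the user has already interacted with are assumptions made for illustration, not details taken from the paper.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend_top_k(user_emb, item_embs, seen_items, k):
    """Return the ids of the k highest-scoring items not in `seen_items`.

    Excluding already-seen items is an assumption of this sketch.
    """
    scores = [
        (inner_product(user_emb, item_emb), item_id)
        for item_id, item_emb in enumerate(item_embs)
        if item_id not in seen_items
    ]
    # Sort by descending score; ties are broken by the smaller item id.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scores[:k]]


# Toy example: one user, four items, embedding dimension 2.
user_emb = [1.0, 0.0]
item_embs = [[3.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.0, 2.0]]
top2 = recommend_top_k(user_emb, item_embs, seen_items={0}, k=2)
```

Note that item 0 would have the highest inner product simply because its embedding is long, which is the effect the later sections set out to control.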
Graph Convolution Network. A graph convolution network (GCN) framework has recently emerged as a state-of-the-art embedding-based approach in recommendation [8]. The GCN leverages the user-item interaction matrix to propagate user and item information through their interactions. The embeddings are iteratively updated through graph propagation, leading to enriched representations that capture the relationship between users and items.
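To make the propagation step concrete, the following is a rough sketch in the style of LightGCN [8]: neighbor embeddings are aggregated with symmetric degree normalization, and the final embedding is the uniform average over all layers. The function name `propagate`, the edge-list input format, and the pure-Python implementation are assumptions of this sketch; it also assumes each (user, item) pair appears at most once.

```python
import math


def propagate(user_embs, item_embs, interactions, num_layers):
    """Average the embeddings from layers 0..num_layers of
    degree-normalized neighbor aggregation on the user-item graph."""
    num_users, num_items = len(user_embs), len(item_embs)
    dim = len(user_embs[0])

    # Node degrees in the bipartite user-item graph.
    user_deg = [0] * num_users
    item_deg = [0] * num_items
    for u, i in interactions:
        user_deg[u] += 1
        item_deg[i] += 1

    users = [list(e) for e in user_embs]
    items = [list(e) for e in item_embs]
    user_sum = [list(e) for e in user_embs]  # running sum over layers
    item_sum = [list(e) for e in item_embs]

    for _ in range(num_layers):
        new_users = [[0.0] * dim for _ in range(num_users)]
        new_items = [[0.0] * dim for _ in range(num_items)]
        for u, i in interactions:
            weight = 1.0 / math.sqrt(user_deg[u] * item_deg[i])
            for d in range(dim):
                new_users[u][d] += weight * items[i][d]
                new_items[i][d] += weight * users[u][d]
        users, items = new_users, new_items
        for total, layer in ((user_sum, users), (item_sum, items)):
            for row_total, row in zip(total, layer):
                for d in range(dim):
                    row_total[d] += row[d]

    scale = 1.0 / (num_layers + 1)
    final_users = [[x * scale for x in row] for row in user_sum]
    final_items = [[x * scale for x in row] for row in item_sum]
    return final_users, final_items


# One user connected to one item: after one layer each node holds the
# average of its own embedding and its neighbor's embedding.
final_users, final_items = propagate(
    user_embs=[[1.0, 0.0]],
    item_embs=[[0.0, 1.0]],
    interactions=[(0, 0)],
    num_layers=1,
)
```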
Loss Functions. A proper loss function needs to be defined to learn the user and item embeddings through the GCN. We consider two popular loss functions: the Bayesian Personalized Ranking (BPR) [15] and Sampled Softmax (SSM) [20] losses.
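The BPR loss is a pairwise loss that encourages the model to rank observed user-item interactions higher than unobserved interactions. The SSM loss maximizes the probability of a positive item among sampled negative items through a softmax function. Let $\mathcal{N}_u$ be a set of sampled negative items for user $u$, i.e., $y_{uj} = 0 \;\forall j \in \mathcal{N}_u$. The SSM loss for user $u$ is formulated as follows:
$$\mathcal{L}_{\mathrm{SSM}}(u) = -\sum_{i:\, y_{ui} = 1} \log \frac{\exp\left(s(e_u, e_i)/\tau\right)}{\exp\left(s(e_u, e_i)/\tau\right) + \sum_{j \in \mathcal{N}_u} \exp\left(s(e_u, e_j)/\tau\right)}, \tag{1}$$
where $s(e_u, e_i)$ is the cosine similarity of two embeddings and $\tau$ is the temperature.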
The negative items of each user are taken from the positive items of the other users. In doing so, popular items are sampled more frequently than unpopular items, making them the hard negatives.
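As a rough illustration of the SSM loss for a single (user, positive item) pair, the sketch below computes the negative log-probability of the positive item under a softmax over cosine-similarity logits scaled by a temperature. Including the positive item in the softmax denominator is an assumption of this sketch, as are the function names and toy vectors; embeddings are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ssm_loss(user_emb, pos_emb, neg_embs, temperature):
    """Negative log-softmax probability of the positive item among the
    positive and the sampled negatives."""
    pos_logit = cosine(user_emb, pos_emb) / temperature
    neg_logits = [cosine(user_emb, n) / temperature for n in neg_embs]
    all_logits = [pos_logit] + neg_logits
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(all_logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in all_logits))
    return log_denominator - pos_logit


user = [1.0, 0.0]
# Positive item aligned with the user, negative orthogonal: small loss.
loss_aligned = ssm_loss(user, [2.0, 0.0], [[0.0, 1.0]], temperature=0.1)
# Positive item orthogonal to the user, negative aligned: large loss.
loss_misaligned = ssm_loss(user, [0.0, 1.0], [[2.0, 0.0]], temperature=0.1)
```

Because the logits are cosine similarities, scaling an item embedding does not change the loss; only its direction relative to the user matters during training.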

Test Time Embedding Normalization
In this section, we begin by analyzing the relationship between the popularity of items and the magnitude of their embeddings. Table 3 presents the correlation between item popularity and embedding magnitude, demonstrating a significant correlation between these two factors. Our finding aligns with previous work by Ren et al. [14], who also observe that popular items tend to have larger embedding magnitudes due to positive gradients acquired during model updates.
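A correlation of this kind can be measured with a few lines of code. The sketch below computes the Pearson correlation between the L2 norm of each item embedding and the item's interaction count; the helper names and the toy data are illustrative assumptions and do not reproduce the values in Table 3.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def embedding_norms(item_embs):
    """L2 norm of each item embedding."""
    return [math.sqrt(sum(v * v for v in emb)) for emb in item_embs]


# Toy data in which popularity is exactly proportional to the norm.
item_embs = [[3.0, 4.0], [0.6, 0.8], [1.0, 0.0], [0.0, 2.0]]
popularity = [50, 10, 10, 20]  # number of interactions per item
corr = pearson(embedding_norms(item_embs), popularity)
```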
Based on our findings, we introduce a strategy called Test Time Embedding Normalization (TTEN), which aims to mitigate popularity bias in recommender systems. Our approach controls the impact of item embedding magnitudes ($\ell_2$ norm) during inference, given their strong correlation with item popularity.
To recommend items, the inner product between user and item embeddings is widely used as a relevance score, i.e., $r_{ui} = e_u^{\top} e_i$. The inner product can be decomposed as $e_u^{\top} e_i = \cos(e_u, e_i)\,\|e_u\|\,\|e_i\|$, where $\cos(\cdot)$ is the cosine similarity, so the score is the product of the cosine similarity and the two magnitudes. To mitigate the impact of the item magnitude, which is closely tied to item popularity, we propose TTEN to compute the relevance score as
$$\hat{r}_{ui} = \frac{e_u^{\top} e_i}{\|e_i\|^{\alpha}} = \cos(e_u, e_i)\,\|e_u\|\,\|e_i\|^{1-\alpha}, \tag{2}$$
where $\alpha$ controls the strength of the normalization. If $\alpha = 1$, the ranking for a given user depends only on the cosine similarity between the two embeddings. If $\alpha = 0$, the relevance score reduces to the inner product.
Through different choices of $\alpha$, we can control the influence of the embedding magnitude on the recommendation. Note that the magnitude of the user embedding does not influence the ranking of the final recommendation list for a given user, since it scales the scores of all items by the same factor. The normalization process is computationally efficient and can be easily integrated into existing embedding-based recommender systems, making it a practical solution for mitigating popularity bias in real-world applications. While previous research [13,19] uses normalization during the training phase, the potential of normalization to mitigate popularity bias during the inference stage has been underexplored. Although some works [3,20] compare the results of normalization at training and test time, their tests were not conducted in the context of bias mitigation and therefore miss the importance of normalization in mitigating popularity bias.
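A minimal sketch of test-time scoring with a controllable normalization strength follows. It assumes the score takes the form of the inner product divided by the item norm raised to the power of the strength, which matches the two limiting cases described above (strength 0 gives the inner product, strength 1 gives a cosine-based ranking). The names `tten_score`, `rank_items`, and `strength`, and the toy embeddings, are hypothetical.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def tten_score(user_emb, item_emb, strength):
    """Inner product divided by the item norm raised to `strength`."""
    dot = sum(a * b for a, b in zip(user_emb, item_emb))
    return dot / (l2_norm(item_emb) ** strength)


def rank_items(user_emb, item_embs, strength):
    """Item ids sorted from highest to lowest score."""
    scores = [tten_score(user_emb, e, strength) for e in item_embs]
    return sorted(range(len(item_embs)), key=lambda i: -scores[i])


user_emb = [1.0, 1.0]
item_embs = [
    [4.0, 0.0],  # large norm (popular-like), cosine with user about 0.71
    [0.5, 0.5],  # small norm, cosine with user equal to 1
]
ranking_inner = rank_items(user_emb, item_embs, strength=0.0)
ranking_cosine = rank_items(user_emb, item_embs, strength=1.0)
```

With strength 0 the long embedding wins on magnitude alone, while with strength 1 the better-aligned item is ranked first. Only the scoring at inference changes; the trained embeddings are left untouched.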

EXPERIMENTS
In this section, we evaluate the performance of our proposed method, TTEN. Furthermore, we conduct a comprehensive analysis to investigate the behavior of our approach through a series of experiments. We address the following research questions and provide insights into the proposed approach.
• RQ1: Does the proposed test time embedding normalization outperform existing strategies for mitigating popularity bias in recommender systems?
• RQ2: Why does test time embedding normalization eliminate popularity bias in recommender systems?
• RQ3: How does the scale of the normalization impact the recommendation results of unpopular and popular items?

Experimental Setup
Dataset. We use three publicly available datasets, Gowalla [4], Yelp2018, and ML-10M, which are widely recognized and employed in the field of recommender systems. For a fair comparison, we preprocess the Gowalla and Yelp2018 datasets following the methods described in He et al. [8]. For the ML-10M dataset, we transform the explicit ratings into implicit feedback, assigning a value of 1 if the user rated the item and 0 otherwise, following [18]. The detailed statistics of these datasets are provided in Table 1.
Baselines. We use LightGCN (LGN) [8] trained with the BPR and SSM losses both as a baseline and as the backbone model to which TTEN is applied. We compare our approach with four methods designed to mitigate popularity bias in recommender systems: IPS [12], MACR [18], GRAD [14], and BiGNN [11]. We reproduce all baselines except for GRAD and BiGNN.
Evaluation Protocols. Many approaches that try to mitigate popularity bias [14,18,21] use an unbiased test set to assess the performance of the proposed method. This is because a biased test set, which often follows a long-tail distribution, can yield high performance even when the model produces biased recommendations. Therefore, we follow the evaluation protocols commonly used in related studies [14,18,21] to appropriately assess the impact of removing popularity bias. The unbiased test set is constructed by randomly sampling items from a uniform distribution, ensuring that each item has an equal probability of being selected. We use the train and test sets from MACR [18] for a fair comparison with existing methods and keep 50% of the test set as the validation set for hyperparameter search. The performance of the model is evaluated through Recall@20 and NDCG@20, considering all unobserved items as the negative set.
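For reference, the two metrics can be computed per user as in the sketch below, which assumes binary relevance and a non-empty set of held-out items for the user. The function names and the toy ranking are illustrative; the per-user values would then be averaged over users.

```python
import math


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the user's relevant items that appear in the top k."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)


def ndcg_at_k(ranked_items, relevant_items, k):
    """DCG of the top-k list divided by the DCG of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


# Toy example: item 2 is retrieved at rank 2, item 7 is missed.
ranked = [5, 2, 9, 1]
relevant = {2, 7}
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```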
Implementation Details. The dimensions of the user and item embeddings are both set to 64, and the embeddings are initialized using the Xavier initializer [6]. The model is trained for 300 epochs, with early stopping applied after 50 epochs with a patience of five. For optimization, we use the Adam optimizer [9] with a learning rate of 1e-3 and a batch size of 4096. The $\ell_2$ regularization coefficient is set to 1e-5 for the BPR loss and 1e-7 for the SSM loss. A three-layer LightGCN is used for all experiments. Following the guidelines in Wu et al. [20], we conduct a hyperparameter search as outlined in Wu et al. [19] to find the optimal temperature value. Through this search, we choose a temperature of 0.1 for the Gowalla dataset, 0.12 for the Yelp2018 dataset, and 0.1 for the ML-10M dataset. We set the normalization strength to one, i.e., $\alpha = 1$, unless otherwise noted.

Overall Performance (RQ1)
Table 2 provides an overview of the performance of the baseline models and our proposed method. We observe that TTEN yields better performance with the SSM loss than with the BPR loss. Although the SSM loss has been shown in theory to alleviate popularity bias [20], our results reveal that it still suffers from popularity bias. TTEN with SSM demonstrates significant improvements over all baseline models: it outperforms previous state-of-the-art approaches by a substantial margin of 4.26% to 31.5% in terms of Recall@20. These results highlight the effectiveness of our approach in mitigating popularity bias.

Table 2: Overall performance of our method and baselines.

Figure 1: Frequency and Recall@20 of each item group in the Gowalla dataset for LGN and TTEN. The SSM loss is used with LightGCN (LGN).

Given that TTEN does not have any additional modules during the training process, our approach is faster and more efficient than the other methods. One method related to our approach is GRAD [14], which intervenes in the inference stage by using the accumulated gradient. However, our results demonstrate that using the magnitude of the item embedding is sufficient and more efficient than using the gradient of an item. Fig. 1 shows the frequency and recall of recommended items in each popularity group. We divide the items into five groups according to their popularity, from the most popular (5) to the least popular (1). All item groups are arranged to have the same number of items. Our method achieves fairer recommendation results than LGN, which produces unnormalized output. The recommended frequency of each item group is approximately the same with TTEN, whereas the recommendation list of the unnormalized output is highly biased towards the popular group. Furthermore, TTEN surpasses the unnormalized output in terms of Recall@20 for item groups (1) to (4). The result shows the ability of our model to effectively recommend less popular items, mitigating the adverse effects of popularity bias.
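The group-level analysis above can be sketched as follows: items are sorted by interaction count and split into equal-size groups, and the share of recommendation slots taken by each group is then counted. The function names, the tie-breaking rule, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def assign_popularity_groups(item_counts, num_groups=5):
    """Map each item to a group id in 1..num_groups (1 = least popular),
    with groups as equal in size as possible."""
    # Sort items from least to most popular; ties are broken by item id.
    ordered = sorted(item_counts, key=lambda item: (item_counts[item], item))
    n = len(ordered)
    return {
        item: (position * num_groups) // n + 1
        for position, item in enumerate(ordered)
    }


def group_frequency(recommendation_lists, item_to_group, num_groups=5):
    """Fraction of all recommended slots that go to each popularity group."""
    counts = Counter(
        item_to_group[item] for rec in recommendation_lists for item in rec
    )
    total = sum(counts.values())
    return {g: counts.get(g, 0) / total for g in range(1, num_groups + 1)}


# Toy data: ten items with their interaction counts.
item_counts = {0: 100, 1: 50, 2: 20, 3: 10, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1, 9: 1}
groups = assign_popularity_groups(item_counts)
# Two users, each recommended two items, mostly from the popular end.
freq = group_frequency([[0, 1], [0, 2]], groups)
```

A uniform frequency across groups would correspond to the balanced exposure reported for TTEN, while a distribution concentrated in group 5, as in this toy input, mirrors the unnormalized output.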

Analysis of the Relationship between Popularity and Embeddings (RQ2)
In this section, we analyze the relationship between embeddings and popularity, and the reasons behind the success of our method.
Relationship between Popularity and Magnitude. Table 3 shows the Pearson correlation between popularity and the magnitude of item embeddings.
Relationship between Popularity and Cosine Similarity. As TTEN leverages cosine similarity in the recommendation process, we investigate whether the cosine similarity between the item and user embeddings separates preference from popularity. For the experiment, we categorize the items into four groups for each user according to their popularity and the user's preference. We assign the most popular 20% of items to the popular group and the remaining items to the unpopular group. For preference, we divide items into positive and negative groups based on whether the item is included in the test set. As a result, we obtain the following four groups: positive popular, negative popular, positive unpopular, and negative unpopular. Then, we measure the average cosine similarity between each group and the user.
The distribution of the average cosine similarity in each group is shown in Fig. 2. We observe that cosine similarity effectively distinguishes the positive and negative items regardless of popularity in most cases. These results provide an explanation for the effectiveness of embedding normalization, particularly in the context of the unbiased dataset with the SSM loss. With the BPR loss, the popular item group shows high cosine similarity even when the items are irrelevant to the user, which suggests that TTEN is better suited to models trained with the SSM loss.
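The four-group analysis for a single user can be sketched as below. The helper names, the rounding used to pick the top 20% of items, and the toy embeddings are assumptions for illustration; the toy data are constructed so that cosine similarity separates positives from negatives in both popularity groups.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (
        math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    )


def top_popular(item_counts, fraction=0.2):
    """Ids of the most popular `fraction` of items (at least one item)."""
    n_popular = max(1, int(round(len(item_counts) * fraction)))
    ordered = sorted(range(len(item_counts)), key=lambda i: -item_counts[i])
    return set(ordered[:n_popular])


def group_mean_cosine(user_emb, item_embs, popular_items, positive_items):
    """Average user-item cosine similarity within each of the four groups."""
    sums, counts = {}, {}
    for item_id, emb in enumerate(item_embs):
        key = (
            "positive" if item_id in positive_items else "negative",
            "popular" if item_id in popular_items else "unpopular",
        )
        sums[key] = sums.get(key, 0.0) + cosine(user_emb, emb)
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


# Ten toy items; items 0 and 1 are the most popular 20%.
item_counts = [100, 90, 5, 4, 3, 3, 2, 2, 1, 1]
item_embs = [[2.0, 0.0], [0.0, 5.0], [1.0, 0.0]] + [[0.0, 1.0]] * 7
popular = top_popular(item_counts)
means = group_mean_cosine(
    user_emb=[1.0, 0.0],
    item_embs=item_embs,
    popular_items=popular,
    positive_items={0, 2},  # items held out in the test set for this user
)
```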

Analysis on Scale of Normalization (RQ3)
We conduct experiments with varying values of $\alpha$, which controls the normalization strength as shown in Eq. (2). Fig. 3 shows the recall and recommendation frequency of each item group. We observe that as the normalization strength increases, the recall and frequency of items in unpopular groups increase. Note that popular groups become less dominant and unpopular groups become more dominant as we increase the strength, showing the trade-off between popularity groups. Our results demonstrate the potential of our method to flexibly control the impact of popularity during the inference stage. This can be further used in real-world settings, for example to tailor the user experience to varying levels of popularity bias.
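A sweep over the normalization strength can be sketched as follows. The example reuses the assumed score form (inner product divided by the item norm raised to the strength) and tracks how often a large-norm item is ranked first as the strength grows. All names and toy embeddings are hypothetical, and the numbers are not related to the results in Fig. 3.

```python
import math


def tten_scores(user_emb, item_embs, strength):
    """Score of every item for one user under the assumed TTEN form."""
    scores = []
    for emb in item_embs:
        dot = sum(a * b for a, b in zip(user_emb, emb))
        norm = math.sqrt(sum(x * x for x in emb))
        scores.append(dot / norm ** strength)
    return scores


def top1_share_of(target_items, user_embs, item_embs, strength):
    """Fraction of users whose top-1 item is in `target_items`."""
    hits = 0
    for user_emb in user_embs:
        scores = tten_scores(user_emb, item_embs, strength)
        best = max(range(len(scores)), key=lambda i: scores[i])
        hits += best in target_items
    return hits / len(user_embs)


item_embs = [[3.0, 3.0], [1.0, 0.0], [0.0, 1.0]]  # item 0 has the largest norm
user_embs = [[1.0, 0.2], [0.2, 1.0], [1.0, 1.0]]
shares = {
    s: top1_share_of({0}, user_embs, item_embs, s) for s in (0.0, 0.5, 1.0)
}
```

In this toy setup the large-norm item is every user's top item at strength 0 and remains on top only for the user it is best aligned with at strength 1, which illustrates how intermediate strengths trade exposure between popular and unpopular items.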

CONCLUSION
In this paper, we proposed test time embedding normalization to mitigate popularity bias in recommender systems. Our approach effectively addresses popularity bias by removing the effect of item embedding magnitude, which is highly correlated with popularity. Through extensive experiments, we have demonstrated the effectiveness of the proposed method and analyzed the impact of normalization on model performance and fairness.

Figure 2: Distributions of average cosine similarity between a user and items broken down by item popularity and preference.The top figure depicts the results of the popular item group, and the bottom depicts the result of the unpopular item group.

Figure 3: Changes in Recall@20 and the frequency ratio of each item group with varying normalization strength $\alpha$.

ACKNOWLEDGEMENT
This work was partly supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH)) and a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00217286).

Table 1: Statistics of the datasets.

Table 3: Pearson correlation between item embedding magnitude and popularity.