A Deep Generative Recommendation Method for Unbiased Learning from Implicit Feedback

Variational autoencoders (VAEs) are the state-of-the-art model for recommendation with implicit feedback signals. Unfortunately, implicit feedback suffers from selection bias, e.g., popularity bias, position bias, etc., and as a result, training from such signals produces biased recommendation models. Existing methods for debiasing the learning process have not been applied in a generative setting. We address this gap by introducing an inverse propensity scoring (IPS) based method for training VAEs from implicit feedback data in an unbiased way. Our IPS-based estimator for the VAE training objective, VAE-IPS, is provably unbiased w.r.t. selection bias. Our experimental results show that the proposed VAE-IPS model reaches significantly higher performance than existing baselines. Our contributions enable practitioners to combine state-of-the-art VAE recommendation techniques with the advantages of bias mitigation for implicit feedback.

Implicit feedback data suffers from various forms of bias, and consequently it generally does not reflect true user preferences but is a biased indicator of them [17,18,25,28]. Examples include (i) popularity bias [2,29,37], where some items get more clicks due to their popularity on the platform; (ii) positivity bias [15,29], where users are more likely to provide ratings for items they would rate highly; and (iii) position bias [5,43], where users tend to click items ranked at a higher position on the result page. To mitigate the negative effects of selection bias, prior work has proposed the use of inverse propensity scoring (IPS), a counterfactual estimation technique [35]. IPS counteracts the effect of selection bias by reweighting datapoints inversely to their probability of being observed. Recently, the IPS approach has been extended to optimize matrix factorization (MF) models from implicit feedback [34]. While MF methods [13,14] and more recent neural MF-based methods [12,46] have a long tradition in the recommendation field, the state-of-the-art methods for learning from implicit feedback data use variational autoencoders (VAEs) instead [21,36,41]. VAE-based methods perform well in the low-data regime, when only a few interactions are available for most users [6,21]. Despite the importance of bias mitigation and the strong performance of VAEs for recommendation from implicit feedback, existing work has overlooked the issue of bias in a generative setting [20,34,35]. To the best of our knowledge, previous work has not considered state-of-the-art debiasing combined with VAEs, despite the obvious potential for improved performance.
We address this gap by introducing VAE-IPS: an IPS debiasing method for VAE optimization from implicit feedback. We start by introducing an ideal generative objective for training with implicit feedback data. We then show that naively ignoring selection bias during learning leads to the optimization of a biased estimate of the ideal generative objective. In contrast, we propose VAE-IPS, which uses IPS, and prove that it unbiasedly optimizes the ideal generative objective in expectation. Our experiments show that VAE-IPS reaches higher performance than existing debiasing methods and VAE without debiasing. Thus, VAE-IPS successfully integrates state-of-the-art VAE recommendation models with the advantages of debiasing implicit feedback.

RELATED WORK
Collaborative filtering (CF) is often approached as a matrix completion problem. A general problem with recommendation data is that it is missing-not-at-random [22]. Missing entries should not be treated as negative feedback because their absence is often not due to a lack of preference. Training standard CF-based methods on such biased data leads to suboptimal learning and evaluation [23,24]. A common technique to learn and evaluate in an unbiased manner is based on IPS. It has been applied in multiple feedback settings: (i) explicit CF [15,35], (ii) pointwise implicit [34], and (iii) pairwise implicit [32]. In this work, we extend the application of debiasing methods to deep generative models.
Latent factor models are a popular choice for training recommender systems [3,8,30,45], but they are limited to linear interactions between latent factors. Alternatively, deep neural networks are also used to model recommendation problems [12,21,46]. In particular, VAE-based models show strong and robust CF performance [6,21]. The BiVAE model captures the dyadic nature of user-item interaction data [41], modeling the distribution of user-item scores as a Bernoulli distribution instead of a multinomial distribution over all items for a given user [6,21]. Our work builds on the BiVAE framework by contributing an unbiased generative model of user-item relevance.

BACKGROUND
Selection bias in click feedback. Let $r_{u,i}$ be an indicator variable for the true user-item relevance, and let $c_{u,i}$ be a binary click indicator variable, indicating whether user $u$ clicked item $i$. Lastly, $o_{u,i}$ denotes the observation variable, indicating whether user $u$ observed item $i$. We assume that interaction behavior follows a simple examination model, where the click probability is the product of the probability of observation and the probability of relevance [4,27,32,34,42]:

$P(c_{u,i} = 1) = P(o_{u,i} = 1) \cdot P(r_{u,i} = 1) = \theta_{u,i} \cdot \gamma_{u,i}$,  (1)

where $\theta_{u,i}$ and $\gamma_{u,i}$ are the observation and relevance probability, respectively. Intuitively, the click model implies that clicks only occur on items that are both relevant to a user and observed by them.
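This examination model is easy to verify numerically. The following sketch (illustrative probabilities; not from the paper) samples clicks as the product of independent observation and relevance draws:

```python
import random

def sample_click(theta_ui, gamma_ui, rng):
    """One click under the examination model: a click occurs only if the
    item is both observed and relevant."""
    o = 1 if rng.random() < theta_ui else 0   # observation ~ Bern(theta)
    r = 1 if rng.random() < gamma_ui else 0   # relevance   ~ Bern(gamma)
    return o * r

# Monte-Carlo check: the click rate should approach theta * gamma.
rng = random.Random(0)
theta, gamma = 0.3, 0.8
n = 200_000
ctr = sum(sample_click(theta, gamma, rng) for _ in range(n)) / n
```

With these values the empirical click rate converges to $\theta \cdot \gamma = 0.24$, matching Eq. 1.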
Variational autoencoders for click feedback. The goal of a VAE is to model the generative distribution $p(c \mid z_u, z_i)$ [41], where $c$ is the click signal, and $z_u, z_i$ are $K$-dimensional latent variables for the corresponding user-item pair. Typically, we are interested in the posterior distribution $p(z_u, z_i \mid c)$, which can be used for Bayesian inference after learning. We follow the recently proposed VAE model BiVAE, which models the pointwise generative distribution of the interaction signal.
Next, we define the ideal generative objective for user-item relevance following the BiVAE [41] framework. The generative story for user-item relevance is defined as follows:
• for each user $u$ in the dataset, draw a latent vector $z_u \sim \mathcal{N}(0, I_K)$;
• for each item $i$ in the dataset, draw a latent vector $z_i \sim \mathcal{N}(0, I_K)$;
• for each user-item pair $(u, i)$ in the dataset, the relevance probability is defined as $\gamma_{u,i} = \sigma(z_u \cdot z_i)$, where $\sigma$ is the sigmoid function ($\sigma(x) = \frac{1}{1 + e^{-x}}$); and
• for each user-item pair $(u, i)$, draw the relevance variable $r_{u,i}$ from a Bernoulli distribution: $r_{u,i} \sim \mathrm{Bern}(\gamma_{u,i})$.

The log-likelihood for the relevance as defined previously (Eq. 1) can be expressed as:

$\mathcal{L}^{\mathrm{rel}} = \frac{1}{N} \sum_{(u,i)} \mathcal{L}^{\mathrm{rel}}_{u,i}, \quad \mathcal{L}^{\mathrm{rel}}_{u,i} = \log p(r_{u,i})$,  (2)

where $N$ is the total number of interactions in the dataset, and $\mathcal{L}^{\mathrm{rel}}_{u,i}$ is the log-likelihood for a single $(u, i)$ pair. Henceforth, the equations are defined for a single $(u, i)$ interaction. For defining the ideal relevance generative objective, let the posterior for the user-item latent variables $(z_u, z_i)$ (which we denote as $z_{u,i}$ going forward) be $q_\phi(z_{u,i})$ (with $\phi$ as parameters of the posterior network), and let the conditional likelihood distribution for relevance be $p_\psi(r_{u,i} \mid z_{u,i})$ (with $\psi$ as parameters of the conditional likelihood's network). Then, the likelihood in Eq. 2 is defined as:

$p(r_{u,i}) = \int p_\psi(r_{u,i} \mid z_{u,i}) \, p(z_{u,i}) \, dz_{u,i}$.  (3)

Next, we define $\mathcal{L}^{\mathrm{ideal}}_{u,i}$ as the following lower bound of the log-likelihood:

$\mathcal{L}^{\mathrm{ideal}}_{u,i} = \mathbb{E}_{q_\phi(z_{u,i})}\left[\log p_\psi(r_{u,i} \mid z_{u,i})\right] - \mathrm{KL}\left(q_\phi(z_{u,i}) \,\|\, p(z_{u,i})\right)$,  (4)

also known as the evidence lower bound objective (ELBO) in the autoencoder literature [19,41]. This is the ideal objective, since the ELBO is the quantity that is optimized in VAEs instead of the log-likelihood ($\mathcal{L}^{\mathrm{rel}}_{u,i}$) [19]. The first term is the conditional likelihood, and the second term is the KL-divergence between the posterior and the prior $p(z_{u,i})$, which acts as a regularizer. In general, the following inequality holds: $\mathcal{L}^{\mathrm{rel}}_{u,i} \geq \mathcal{L}^{\mathrm{ideal}}_{u,i}$. Since relevance is a binary random variable (Bernoulli distributed) for implicit feedback data ($r_{u,i} \in \{0, 1\}$), plugging it into Eq. 4, we obtain:

$\mathcal{L}^{\mathrm{ideal}}_{u,i} = \mathbb{E}_{q_\phi(z_{u,i})}\left[ r_{u,i} \log \gamma_\psi(z_{u,i}) + (1 - r_{u,i}) \log\left(1 - \gamma_\psi(z_{u,i})\right) \right] - \mathrm{KL}\left(q_\phi(z_{u,i}) \,\|\, p(z_{u,i})\right)$,  (5)

where $\gamma_\psi(z_{u,i})$ is the probability of relevance for the $(u, i)$ pair. The first part of the objective is similar to the familiar cross-entropy loss used with binary MF [14,31]; in the VAE case, the latent embeddings are sampled from a distribution, as opposed to being deterministic in the case of MF. The second part of the loss is the KL-divergence between the posterior $q_\phi(z_{u,i})$ and a simple normal prior $p(z_{u,i})$, which acts as a regularizer during training [19,41].
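As a sanity check, the per-pair Bernoulli ELBO above can be estimated by Monte-Carlo sampling. The sketch below assumes diagonal-Gaussian posteriors with the standard reparameterization; all parameter values are illustrative, not the paper's:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def elbo_ideal(r_ui, mu_u, lv_u, mu_i, lv_i, n_samples=1000, rng=random):
    """Monte-Carlo estimate of the per-pair Bernoulli ELBO:
    E_q[ r*log(gamma) + (1-r)*log(1-gamma) ] - KL(q || prior),
    with gamma = sigmoid(z_u . z_i) and diagonal-Gaussian posteriors."""
    recon = 0.0
    for _ in range(n_samples):
        # reparameterized samples z = mu + sigma * eps
        z_u = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
               for m, lv in zip(mu_u, lv_u)]
        z_i = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
               for m, lv in zip(mu_i, lv_i)]
        gamma = sigmoid(sum(a * b for a, b in zip(z_u, z_i)))
        recon += r_ui * math.log(gamma) + (1 - r_ui) * math.log(1.0 - gamma)
    kl = kl_diag_gaussian(mu_u, lv_u) + kl_diag_gaussian(mu_i, lv_i)
    return recon / n_samples - kl

example = elbo_ideal(1, [0.5, 0.5], [0.0, 0.0], [0.5, 0.5], [0.0, 0.0],
                     n_samples=200, rng=random.Random(0))
```

The result is always negative: the Bernoulli log-likelihood term is non-positive and the KL term is non-negative.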

METHOD: VAE-IPS ESTIMATOR
In this section, we introduce the naive click-based estimator, followed by a discussion of its bias; finally, we introduce our proposed estimator, which is unbiased in expectation. Naively taking the existing loss (Eq. 4) and replacing relevance with clicks results in the following biased estimate of the loss function:

$\mathcal{L}^{\mathrm{click}}_{u,i} = \mathbb{E}_{q_\phi(z_{u,i})}\left[ c_{u,i} \log \gamma_\psi(z_{u,i}) + (1 - c_{u,i}) \log\left(1 - \gamma_\psi(z_{u,i})\right) \right] - \mathrm{KL}\left(q_\phi(z_{u,i}) \,\|\, p(z_{u,i})\right)$.  (6)

In other words, the click-based objective function $\mathcal{L}^{\mathrm{click}}_{u,i}$ is simply the ideal objective function (Eq. 4), where relevance $r_{u,i}$ has been substituted by the click signal $c_{u,i}$. To prove that this is a biased estimator of the ideal objective, we apply the click model assumption (Eq. 1) and derive the expected value of this estimator with respect to the observation variable $o_{u,i}$, i.e., we consider $\mathbb{E}_o\left[\mathcal{L}^{\mathrm{click}}_{u,i}\right]$:

$\mathbb{E}_o\left[\mathcal{L}^{\mathrm{click}}_{u,i}\right] = \mathbb{E}_{q_\phi(z_{u,i})}\left[ \mathbb{E}_o\left[c_{u,i}\right] \log \gamma_\psi(z_{u,i}) + \left(1 - \mathbb{E}_o\left[c_{u,i}\right]\right) \log\left(1 - \gamma_\psi(z_{u,i})\right) \right] - \mathrm{KL}\left(q_\phi(z_{u,i}) \,\|\, p(z_{u,i})\right)$  (7)

$= \mathbb{E}_{q_\phi(z_{u,i})}\left[ \theta_{u,i}\, r_{u,i} \log \gamma_\psi(z_{u,i}) + \left(1 - \theta_{u,i}\, r_{u,i}\right) \log\left(1 - \gamma_\psi(z_{u,i})\right) \right] - \mathrm{KL}\left(q_\phi(z_{u,i}) \,\|\, p(z_{u,i})\right)$,  (8)

where we use $\mathbb{E}_o\left[c_{u,i}\right] = \theta_{u,i}\, r_{u,i}$, which follows from Eq. 1. Clearly, this is a biased estimate of the ideal loss (Eq. 4), where the propensity term $\theta_{u,i}$ acts as a confounder. We can express the exact bias by the following difference:

$\mathbb{E}_o\left[\mathcal{L}^{\mathrm{click}}_{u,i}\right] - \mathcal{L}^{\mathrm{ideal}}_{u,i} = \mathbb{E}_{q_\phi(z_{u,i})}\left[ \left(\theta_{u,i} - 1\right) r_{u,i} \left( \log \gamma_\psi(z_{u,i}) - \log\left(1 - \gamma_\psi(z_{u,i})\right) \right) \right]$.  (9)

From Eq. 9 it is clear that the click-based estimator is unbiased only if $\theta_{u,i} = 1$ for all $(u, i)$ pairs, which is clearly an infeasible condition given the prevalence of selection bias in interaction data.
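The bias expression can be verified numerically for a single pair; the relevance probability estimate and propensity below are arbitrary illustrative values, and the KL term is omitted since it is identical in both objectives:

```python
import math

def ideal_term(r, g_hat):
    """Relevance-based cross-entropy term of the ideal objective."""
    return r * math.log(g_hat) + (1 - r) * math.log(1.0 - g_hat)

def expected_click_term(r, theta, g_hat):
    """Expected value of the naive click objective over the observation
    variable, using E[c] = theta * r."""
    ec = theta * r
    return ec * math.log(g_hat) + (1 - ec) * math.log(1.0 - g_hat)

r, theta, g_hat = 1, 0.2, 0.7        # illustrative values
bias = expected_click_term(r, theta, g_hat) - ideal_term(r, g_hat)
# the bias expression predicts (theta - 1) * r * (log g - log(1 - g))
predicted = (theta - 1) * r * (math.log(g_hat) - math.log(1.0 - g_hat))
```

The two quantities agree exactly, and the bias vanishes only when theta equals 1.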

Proposed unbiased generative estimator
We propose an unbiased generative estimator, in a similar vein to existing IPS corrections for position bias [43], trust bias [1], and popularity and positivity bias [15]. The IPS-corrected unbiased estimate of the true generative objective is defined as follows:

$\mathcal{L}^{\mathrm{ips}}_{u,i} = \mathbb{E}_{q_\phi(z_{u,i})}\left[ \frac{c_{u,i}}{\theta_{u,i}} \log \gamma_\psi(z_{u,i}) + \left(1 - \frac{c_{u,i}}{\theta_{u,i}}\right) \log\left(1 - \gamma_\psi(z_{u,i})\right) \right] - \mathrm{KL}\left(q_\phi(z_{u,i}) \,\|\, p(z_{u,i})\right)$.  (10)

This estimator is an unbiased estimate of the true relevance-based objective (Eq. 4). To prove this, we derive the expected value of the estimator with respect to the observation variable, again using $\mathbb{E}_o\left[c_{u,i}\right] = \theta_{u,i}\, r_{u,i}$:

$\mathbb{E}_o\left[\mathcal{L}^{\mathrm{ips}}_{u,i}\right] = \mathbb{E}_{q_\phi(z_{u,i})}\left[ \frac{\theta_{u,i}\, r_{u,i}}{\theta_{u,i}} \log \gamma_\psi(z_{u,i}) + \left(1 - \frac{\theta_{u,i}\, r_{u,i}}{\theta_{u,i}}\right) \log\left(1 - \gamma_\psi(z_{u,i})\right) \right] - \mathrm{KL}\left(q_\phi(z_{u,i}) \,\|\, p(z_{u,i})\right)$  (11)

$= \mathbb{E}_{q_\phi(z_{u,i})}\left[ r_{u,i} \log \gamma_\psi(z_{u,i}) + (1 - r_{u,i}) \log\left(1 - \gamma_\psi(z_{u,i})\right) \right] - \mathrm{KL}\left(q_\phi(z_{u,i}) \,\|\, p(z_{u,i})\right) = \mathcal{L}^{\mathrm{ideal}}_{u,i}$.  (12)

Thus, in expectation $\mathcal{L}^{\mathrm{ips}}_{u,i}$ is equal to the ideal relevance-based objective from Eq. 4, and therefore our VAE-IPS estimator is provably unbiased.
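The unbiasedness can likewise be checked by simulation: reweighting sampled clicks by the inverse propensity recovers the relevance-based term in expectation. All values below are illustrative, and the KL part is omitted since it does not depend on the observation variable:

```python
import math
import random

def ips_term(c, theta, g_hat):
    """IPS-weighted reconstruction term of the VAE-IPS objective."""
    w = c / theta
    return w * math.log(g_hat) + (1.0 - w) * math.log(1.0 - g_hat)

r, theta, g_hat = 1, 0.2, 0.7     # true relevance, propensity, model estimate
rng = random.Random(42)
n = 200_000
# clicks c = o * r with o ~ Bern(theta)
est = sum(ips_term((1 if rng.random() < theta else 0) * r, theta, g_hat)
          for _ in range(n)) / n
ideal = r * math.log(g_hat) + (1 - r) * math.log(1.0 - g_hat)
```

The Monte-Carlo average converges to the relevance-based term, even though individual samples are heavily reweighted.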

Variance of the novel VAE-IPS estimator
We now derive the variance of the VAE-IPS estimator with respect to the observation variable. To keep our notation concise, we use $l^1_{u,i} = \log \gamma_\psi(z_{u,i})$ and $l^0_{u,i} = \log\left(1 - \gamma_\psi(z_{u,i})\right)$; note that the KL term does not depend on $o_{u,i}$ and therefore does not contribute to the variance.

Proof. We need to estimate the following quantity:

$\mathrm{Var}_o\left[\mathcal{L}^{\mathrm{ips}}_{u,i}\right] = \mathbb{E}_o\left[\left(\mathcal{L}^{\mathrm{ips}}_{u,i}\right)^2\right] - \mathbb{E}_o\left[\mathcal{L}^{\mathrm{ips}}_{u,i}\right]^2$.  (13)

We start by expanding the first part of Eq. 13:

$\mathbb{E}_o\left[\left(\frac{c_{u,i}}{\theta_{u,i}}\, l^1_{u,i} + \left(1 - \frac{c_{u,i}}{\theta_{u,i}}\right) l^0_{u,i}\right)^2\right] = \frac{r_{u,i}}{\theta_{u,i}} \left(l^1_{u,i} - l^0_{u,i}\right)^2 + 2\, r_{u,i} \left(l^1_{u,i} - l^0_{u,i}\right) l^0_{u,i} + \left(l^0_{u,i}\right)^2$,  (14)

where in the second step we make use of the identity $c^2_{u,i} = c_{u,i}$, and going from the second to the third step we evaluate the inner expectation $\mathbb{E}_o\left[c_{u,i}\right] = \theta_{u,i}\, r_{u,i}$. Next, we expand the second part of Eq. 13 by resolving the inner expectation with the use of Eq. 11:

$\mathbb{E}_o\left[\mathcal{L}^{\mathrm{ips}}_{u,i}\right]^2 = \left(r_{u,i}\, l^1_{u,i} + (1 - r_{u,i})\, l^0_{u,i}\right)^2 = r_{u,i} \left(l^1_{u,i}\right)^2 + (1 - r_{u,i}) \left(l^0_{u,i}\right)^2$,

where we also make use of the identities $r^2_{u,i} = r_{u,i}$ and $(1 - r_{u,i})^2 = 1 - r_{u,i}$. Substituting the first and the second part back into Eq. 13:

$\mathrm{Var}_o\left[\mathcal{L}^{\mathrm{ips}}_{u,i}\right] = \frac{r_{u,i} \left(1 - \theta_{u,i}\right)}{\theta_{u,i}} \left(l^1_{u,i} - l^0_{u,i}\right)^2$.  (15)

We see that the variance depends inversely on the propensity, which suggests that items with lower propensity will have higher variance, and vice versa. In practice, this is not a major issue, since methods like propensity clipping can greatly reduce variance at the cost of a small amount of bias. Eq. 16 shows how we apply this technique in our experimental setup.
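The closed-form variance can be confirmed by simulation for fixed, illustrative values of the two log-terms:

```python
import math
import random

# fixed, illustrative log-terms for one sampled z: l1 = log(gamma), l0 = log(1 - gamma)
l1, l0 = math.log(0.7), math.log(0.3)
r, theta = 1, 0.2                      # relevance and propensity

def ips_recon(c):
    """Reconstruction part of the IPS objective for one observation draw."""
    w = c / theta
    return w * l1 + (1.0 - w) * l0

rng = random.Random(7)
n = 200_000
samples = [ips_recon((1 if rng.random() < theta else 0) * r) for _ in range(n)]
mean = sum(samples) / n
var_mc = sum((s - mean) ** 2 for s in samples) / n
var_formula = r * (1 - theta) / theta * (l1 - l0) ** 2   # closed form
```

With a propensity of 0.2, the variance multiplier (1 - theta)/theta is already 4, illustrating why clipping matters for rarely observed items.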

EXPERIMENTAL SETUP
We assess the performance of VAE-IPS for a relevance prediction task in semi-synthetic, real-world, and fully-synthetic setups.

Baselines and settings
Our comparison includes the following six methods: (i) Binary Matrix Factorization (MF): We use the matrix factorization model for implicit feedback data from [14], where the squared loss is replaced with the cross-entropy loss to account for the clicks being Bernoulli distributed. (ii) Rel-MF: The binary matrix factorization model trained with the IPS-weighted loss from [34]. (iii) MF-DR: A doubly-robust variant of the IPS matrix factorization model, which uses a control variate to reduce the variance of the IPS method [44].
(iv) MF-Dual: The dual unbiased matrix factorization model for implicit feedback data. To the best of our knowledge, it is the current state-of-the-art method for debiasing implicit feedback data [20].
(v) VAE: We use the BiVAE framework developed to model dyadic data [41], which is more suitable for pointwise predictions. This VAE baseline is optimized with the proposed coordinate-descent-style optimization method, where the posteriors for users and items are optimized alternately. The latent variables in this case are $z_u$ and $z_i$, and the user-item prediction score is defined as the dot product between the user and item latent variables, $s(u, i) = z_u \cdot z_i$. This is similar to the prediction score defined in matrix factorization, the key difference being the use of neural networks and variational inference. And (vi) VAE-IPS: This is our proposed method, the BiVAE model optimized with the unbiased VAE-IPS objective (Eq. 10). In practice, with the alternating coordinate-descent optimization, we apply the IPS correction alternately to both the user-based and item-based loss functions in the BiVAE framework.

Variance reduction. The IPS estimator is known to suffer from large variance [40], due to the use of the inverse of the propensity score, which is unbounded. We apply propensity clipping [34,38], which can greatly reduce variance while only introducing a small amount of bias. Formally, we define the clipped propensity as:

$\tilde{\theta}_{u,i} = \max\left(\theta_{u,i}, \tau\right)$,  (16)

where $\tau$ is a hyper-parameter which controls the trade-off between bias and variance. A small $\tau$ can result in high variance but little bias, whereas a high $\tau$ can lead to little variance but high bias.

Propensity estimation. To estimate the propensity of an item, we use its relative click frequency in the training dataset [34]. The intuition behind using training click frequency is that an item is more likely to be exposed to a user if it has historically been clicked more, and vice versa. Formally, we define the propensities as:

$\hat{\theta}_{u,i} = \hat{\theta}_i = \frac{\sum_{u} c_{u,i}}{\max_{i'} \sum_{u} c_{u,i'}}$,  (17)

where we make the assumption that propensity scores are uniform across all users [34].

Implementation details. The source code to reproduce the findings from the paper is available at: https://github.com/shashankg7/VAE-IPS.
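The propensity estimation and clipping steps described above can be sketched as follows; the toy click matrix and the clipping threshold are illustrative:

```python
def estimate_propensities(click_matrix, tau=0.1):
    """Item propensities from relative click frequency (assumed uniform
    across users), clipped from below by hyper-parameter tau."""
    item_clicks = [sum(col) for col in zip(*click_matrix)]  # clicks per item
    max_clicks = max(item_clicks)
    raw = [c / max_clicks for c in item_clicks]             # relative frequency
    return [max(p, tau) for p in raw]                       # propensity clipping

# toy user-by-item click matrix: item 3 is never clicked
clicks = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]
props = estimate_propensities(clicks, tau=0.1)
```

Without clipping, the never-clicked item would get propensity 0 and an unbounded IPS weight; clipping bounds its weight at 1/tau.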

Semi-synthetic experimental setup
To assess the performance of our proposed method, we use the MovieLens-1M dataset [11]. The dataset consists of ∼6K users and ∼3,700 items, with 1 million explicit feedback ratings. To convert the explicit feedback dataset into an implicit feedback dataset, we treat all ratings with a value over 4 as positive interactions and the rest of the interactions as unlabelled instances. We follow the experimental setup from [31]. To evaluate the performance of the methods, we use 50% of the dataset as the test set. To simulate an unbiased test set, we re-sample 30% of the data from the test set with a sampling probability proportional to $1 / \theta_i^{\epsilon}$, where $\theta_i$ is an item's normalized frequency in the training dataset, and $\epsilon$ is used to control the selection bias in the test set. A value of $\epsilon = 1$ ensures the least selection bias, and other values simulate controlled randomization.
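A sketch of this inverse-frequency test re-sampling, with hypothetical item frequencies (the paper's exact sampling protocol may differ in detail):

```python
import random

def debiased_test_sample(test_items, item_freq, eps, frac=0.3, rng=random):
    """Re-sample a fraction of the test set with probability proportional
    to 1 / freq(i)**eps, counteracting popularity bias (eps = 1: least bias)."""
    weights = [1.0 / (item_freq[i] ** eps) for i in test_items]
    k = int(frac * len(test_items))
    return rng.choices(test_items, weights=weights, k=k)

rng = random.Random(3)
items = list(range(5))
freq = {0: 0.5, 1: 0.2, 2: 0.15, 3: 0.1, 4: 0.05}  # hypothetical normalized frequencies
sample = debiased_test_sample(items * 2000, freq, eps=1.0, rng=rng)
```

With eps = 1, rare items are up-weighted exactly in proportion to their under-exposure, so the rarest item dominates the re-sampled set relative to its raw frequency.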
We use NDCG@5 and MAP@5 as evaluation metrics. For calculating the normalizing factor of the NDCG@5, we follow the advice from [7] and base it on the entire dataset. Our evaluation metrics follow the definitions of earlier work on Rel-MF [34].
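NDCG@5 with the normalizing factor computed over the entire set of relevance labels, as described above, can be sketched as follows (toy relevance labels, purely illustrative):

```python
import math

def ndcg_at_k(ranked_rels, all_rels, k=5):
    """NDCG@k where the ideal DCG is computed from all relevance labels
    (the entire dataset), not just the retrieved list."""
    dcg = sum(rel / math.log2(rank + 2)
              for rank, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(all_rels, reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# relevance of the top-5 ranked items vs. all relevance labels in the dataset
score = ndcg_at_k([1, 0, 1, 0, 0], [1, 1, 1, 0, 0, 0, 0], k=5)
```

Basing the ideal DCG on the full dataset, rather than only the ranked list, prevents the metric from rewarding trivially short or truncated result lists.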

Real-world dataset experimental setup
We also evaluate VAE-IPS on a real-world dataset, where the test set interactions are from a truly uniform-random policy. We use the Yahoo! R3 dataset [23], which consists of interactions from a music recommendation service. The randomized test set ensures that it is free from the selection bias present in the training set.
To obtain a validation set, we split the training set of both datasets according to an 80%/20% randomized split. We use the validation set to tune the hyper-parameters of the baselines and VAE-IPS, using the self-normalized importance sampling (SNIPS) version of the DCG@5 metric [35].

Fully-synthetic experimental setup
We also examine a more controlled setting, where we generate a fully-synthetic dataset according to the click model (Eq. 1). The synthetic data generation process samples observation and relevance probabilities from beta distributions: $\theta_{u,i} \sim \mathrm{Beta}(1, 50)$ and $\gamma_{u,i} \sim \mathrm{Beta}(0.5, 0.5)$; additionally, noise variables are sampled from a uniform distribution with parameter $\epsilon_{\max}$: $\epsilon_{u,i} \sim \mathrm{Unif}(0, \epsilon_{\max})$; and clicks are Bernoulli samples from the resulting click probabilities:

$c_{u,i} \sim \mathrm{Bern}\left(\theta_{u,i}\, \gamma_{u,i} + \epsilon_{u,i}\right)$.  (18)

We generate the dataset with total number of users $|\mathcal{U}| = 1000$ and total number of items $|\mathcal{I}| = 100$. We randomly split the relevance matrix into train and test sets, and generate clicks only from the train part of the split. To evaluate the robustness of the methods with respect to click noise, we vary the $\epsilon_{\max}$ parameter. For the sake of brevity, we only compare with MF-Dual, as it is the best-performing baseline in the previous two settings.
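A minimal sketch of this generation process (the click probability is clamped to 1.0 here, an implementation assumption not stated above):

```python
import random

def generate_clicks(n_users, n_items, eps_max, rng):
    """Fully-synthetic clicks: theta ~ Beta(1, 50), gamma ~ Beta(0.5, 0.5),
    noise ~ Unif(0, eps_max); click ~ Bern(theta * gamma + noise)."""
    clicks = [[0] * n_items for _ in range(n_users)]
    for u in range(n_users):
        for i in range(n_items):
            theta = rng.betavariate(1, 50)      # observation probability
            gamma = rng.betavariate(0.5, 0.5)   # relevance probability
            noise = rng.uniform(0.0, eps_max)
            p_click = min(theta * gamma + noise, 1.0)  # clamp: assumption
            clicks[u][i] = 1 if rng.random() < p_click else 0
    return clicks

C = generate_clicks(50, 20, eps_max=0.1, rng=random.Random(11))
```

The Beta(1, 50) prior makes observation probabilities very small on average, so the resulting click matrix is sparse, mimicking real implicit feedback.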

RESULTS AND DISCUSSION

Semi-synthetic experimental results
The results on the unbiased relevance prediction task using the MovieLens-1M dataset are presented in Table 1. We evaluate the performance across different values of $\epsilon$, which controls the simulated selection bias, i.e., under different test sampling distributions. The VAE-IPS method consistently outperforms all methods by a significant margin across all metrics. The results hold for different settings of $\epsilon$, indicating the robustness of VAE-IPS across different degrees of selection bias. Interestingly, Rel-MF and MF-DR perform worse than the vanilla MF model across all settings of $\epsilon$. We speculate that this is due to the biased negative loss in the Rel-MF formulation, as noted in previous work [20], and given that the MF-DR model is primarily aimed at explicit feedback data [44], it fails to perform in a binary feedback setting. VAE outperforms Rel-MF and MF-DR, possibly due to its greater expressiveness as a deep generative model.

Real-world experimental results
For the experimental setup on the Yahoo! R3 dataset, the results are presented in Table 2. Similar to the results on MovieLens-1M, VAE-IPS outperforms all other methods by a significant margin. It is interesting to note that vanilla MF outperforms Rel-MF across all metrics on this dataset, even though the test set is from a uniform-random policy. We speculate that this is due to estimation error in the propensity calculation [44]. Similar findings have been reported in previous work [33], where MF with IPS failed to outperform a vanilla MF method. Consistent with the results on MovieLens-1M, MF-Dual outperforms all MF-based baselines and the VAE without IPS. VAE-IPS clearly provides significantly higher performance than all other tested methods. Therefore, these results show that, in addition to the semi-synthetic setting, the performance advantages of VAE-IPS are clearly observable on real-world data as well.

Fully-synthetic experimental results
To evaluate the robustness of VAE-IPS with respect to noise, we look at our experimental results on the fully-synthetic dataset in Figure 3. We vary the degree of noise, $\epsilon_{\max} \in \{0.1, 0.3, 0.5, 0.7\}$, where higher values of $\epsilon_{\max}$ indicate higher levels of click noise (see Eq. 18); for comparison, we also consider the performance of MF-Dual, the highest-performing baseline method in the previous settings.
Surprisingly, the performance of both MF-Dual and VAE-IPS is consistent across different noise values, with only small differences in performance between the lowest ($\epsilon_{\max} = 0.1$) and highest ($\epsilon_{\max} = 0.7$) noise levels. This observation strongly indicates that both are very robust to noise in the recorded clicks. In line with our other experimental results, VAE-IPS achieves higher performance across all metrics and noise settings. We thus conclude that VAE-IPS is highly robust to noise and appears to outperform MF-Dual regardless of the level of click noise.

CONCLUSION
In this paper we investigated whether state-of-the-art VAE recommendation models could be combined with debiasing techniques in the implicit feedback setting.
First, we studied the effect of bias on a naive VAE training objective based on clicks. Our analysis proved that directly optimizing this objective leads to a biased recommendation system that unfairly favors items that were overrepresented during data logging.
Second, we proposed VAE-IPS, a novel IPS correction for the VAE loss that allows for the combination of VAE recommendation models with the IPS debiasing method, and which is provably unbiased w.r.t. selection bias in clicks. We evaluated VAE-IPS on two public datasets across various metrics and observed that it outperforms all our baselines across all metrics by significant margins. We believe our contribution of VAE-IPS is important to the recommender systems field, because it combines the expressiveness of state-of-the-art VAE-based recommender models with IPS debiasing, and could lead to better-performing recommender systems that are less affected by selection bias in interaction data.
Future work could consider robust methods for propensity estimation for implicit feedback in recommendation: IPS-based bias mitigation methods can be even more effective with more accurate propensity scores. IPS-based methods are known to suffer from the problem of high variance [9,26], which can result in an unsafe policy and potentially lead to a negative user experience when deployed online. To avoid this, future work could consider adding safety regularization to the IPS objective [10], providing theoretical guarantees for safe deployment. Alternatively, future research can explore the application of other bias mitigation methods, such as the doubly-robust method [26,44], to VAE recommendation models.