Inverse Learning with Extremely Sparse Feedback for Recommendation

Modern personalized recommendation services often rely on user feedback, either explicit or implicit, to improve service quality. Explicit feedback refers to behaviors like ratings, while implicit feedback refers to behaviors like user clicks. However, in full-screen video viewing experiences such as TikTok and Reels, the click action is absent, resulting in unclear feedback from users and hence introducing noise into model training. Existing approaches to denoising recommendation mainly focus on positive instances while ignoring the noise in the large amount of sampled negative feedback. In this paper, we propose a meta-learning method to annotate the unlabeled data from the loss and gradient perspectives, which considers the noise in both positive and negative instances. Specifically, we first propose an Inverse Dual Loss (IDL) to boost true-label learning and prevent false-label learning. Then we further propose an Inverse Gradient (IG) method to explore the correct updating gradient and adjust the update based on meta-learning. Finally, we conduct extensive experiments on both benchmark and industrial datasets, where our proposed method significantly improves AUC by 9.25% against state-of-the-art methods. Further analysis verifies that the proposed inverse learning framework is model-agnostic and can improve a variety of recommendation backbones. The source code, along with the best hyper-parameter settings, is available at this link: https://github.com/Guanyu-Lin/InverseLearning.

Recommender systems have been widely deployed in E-commerce and Micro-video platforms [23,26,30]. These systems aim to capture users' preferences based on their historical behaviors, with a focus on either explicit or implicit feedback. Explicit feedback, such as user ratings, provides direct indications of user preferences but is challenging to collect due to the need for active user participation [15,21]. In contrast, implicit feedback, including user clicks, purchases, and views, offers richer information and is more commonly utilized in modern recommender systems [4,20]. In certain contexts like Micro-video platforms, users passively receive recommended items without actively engaging through actions like clicking or rating. Consequently, we encounter a scenario where the labeled feedback is extremely sparse: the data comprises predominantly quick-skip and long-stay videos, plus a considerable number of slow-skip or short-stay videos with unclear feedback. Effectively leveraging this unlabeled feedback poses a significant challenge for recommendation systems.
The challenge of dealing with unclear feedback in recommender systems has led to various approaches that randomly sample unlabeled data and treat it as negative feedback, resulting in inevitable noise [3,13]. Typically, user-clicked data is treated as positive feedback, while unclicked data is sampled as negative feedback [3,13]. However, this sampling strategy may include positive instances from the unlabeled data, leading to false-negative cases. Additionally, some studies have explored hard negative sampling techniques, which reduce false-positive instances but increase false-negative instances [7,8,33]. Nevertheless, these methods often underperform when evaluated on true positive and negative data instead of the sampled negative data alone, as demonstrated in Table 2. Notably, a recent work called DenoisingRec [28] focuses on denoising positive feedback by manipulating the loss of false-positive instances but does not adequately address the issue of noisy negative feedback. Overall, existing approaches tend to concentrate solely on either the positive or the negative perspective, without effectively tackling both aspects.
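To make the failure mode concrete, here is a minimal sketch (function and variable names are ours, purely illustrative) of the conventional strategy: unclicked items are drawn uniformly and labeled negative, so any item the user would actually like but has never seen becomes a false negative.

```python
import random

def sample_negatives(all_items, clicked, n, seed=0):
    """Conventional negative sampling (illustrative sketch): draw unclicked
    items uniformly and label them all 0, even though some of them may be
    items the user would actually like (false negatives)."""
    rng = random.Random(seed)
    pool = [item for item in all_items if item not in clicked]
    return [(item, 0) for item in rng.sample(pool, n)]
```

The sampler has no way to distinguish "unclicked because disliked" from "unclicked because unseen", which is exactly the noise the rest of the paper targets.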
Our analysis of Figure 1 reveals two key observations, which serve as the motivation behind our proposed method:
• Full use of unlabeled data can boost performance. On the ML1M dataset, the introduction of unlabeled data boosts performance considerably. However, performance is harmed on the Micro Video dataset, which means that purely unsupervised learning on unlabeled data is unstable.
• Labeled data can guide the learning on unlabeled data. On both datasets, splitting off part of the labeled data to guide the unsupervised learning on unlabeled data boosts performance. That is to say, guidance on unlabeled data can improve the robustness of unsupervised learning.
To simultaneously tackle the unclear feedback problem from the positive and negative perspectives, we propose a novel learning-based approach that employs an Inverse Dual Loss (IDL) and an Inverse Gradient (IG). Our method automatically annotates the unlabeled data and subsequently adjusts the falsely annotated labels. As illustrated in Figure 1, we introduce IDL for unsupervised training on unlabeled data and leverage IG to guide the unlabeled data.
Specifically, the IDL automatically annotates unlabeled data in an unsupervised learning fashion. The IDL employs a well-designed loss function that leverages both positive and negative feedback. We exploit the property that the loss associated with a false positive/negative instance exceeds that of a true positive/negative instance [28]. By assigning different weights to the positive and negative labels of unlabeled instances, calculated using the inverse dual loss, we effectively utilize true positive/negative instances while mitigating the noise introduced by false positive/negative instances. This approach allows us to fully capitalize on valuable information and enhance the quality of annotation.
In addition, to adjust the falsely annotated labels and improve the robustness of IDL, we further propose an Inverse Gradient (IG) method. Here we build a meta-learning process [10,18] and split the training data into training-train and training-test data. We first exploit the training-train data to pre-train the model. Then we further use the training-test data to validate the correctness of the classification by IDL; in other words, we supervise the proposed unsupervised IDL method via the training-test data. Specifically, we calculate the gradient of the inverse dual loss on sampled instances, as well as the additive inverse of that gradient. The model is optimized by either the direct gradient or its additive inverse, determined by the split training-test data. Experimental results illustrate that the inverse gradient can truly improve the inverse dual loss. In summary, the main contributions of this paper are as follows:
• We take a pioneering step to address the unclear passive feedback in video feed recommendation, which is far more challenging than existing works based on either explicit or implicit active feedback.
• We propose the Inverse Dual Loss (IDL) to annotate the labels for sampled instances in an unsupervised learning manner. Besides, we further propose the Inverse Gradient to guide the unsupervised learning on unlabeled data and improve the robustness of IDL.
• We experiment on two real-world datasets, verifying the superiority of our method compared with state-of-the-art approaches. Further studies support the effectiveness of our proposed method in label annotation and convergence.

PROBLEM DEFINITION
We formulate the problem here. The recommendation task aims to model the relevance score $\hat{y}_{ui} = f(u, i \mid \theta)$ of user $u$ towards item $i$ under parameters $\theta$. The LogLoss function [34,35] used to learn the ideal parameters $\theta^*$ is

$$\theta^* = \arg\min_\theta \mathcal{L}_{\mathcal{D}^*}(\theta), \quad \mathcal{L}_{\mathcal{D}^*}(\theta) = -\sum_{(u,i,y^*_{ui}) \in \mathcal{D}^*} \big[ y^*_{ui} \log \hat{y}_{ui} + (1 - y^*_{ui}) \log (1 - \hat{y}_{ui}) \big],$$

where $\mathcal{D}^* = \{(u, i, y^*_{ui}) \mid u \in \mathcal{U}, i \in \mathcal{I}\}$ is the reliable interaction data between all user-item pairs. Indeed, due to the limited collected feedback, the model training is actually formalized as

$$\bar{\theta} = \arg\min_\theta \mathcal{L}_{\mathcal{D}_l}(\theta) + \mathcal{L}_{\mathcal{D}_u}(\theta),$$

where $\mathcal{D}_l \sim \mathcal{D}^*$ is the collected labeled data, and $\mathcal{D}_u = \{(u, i, \bar{y}_{ui}) \mid u \in \mathcal{U}, i \in \mathcal{I}\}$ is the sampled unlabeled data, for which $\bar{y}_{ui} = 0$ is often assumed in existing recommenders for negative sampling. However, such a strategy inevitably introduces noise because some positive unlabeled instances are included in the sampled data. As a consequence, a model (i.e., $\bar{\theta}$) trained with noisy data tends to exhibit suboptimal performance. Thus, our goal is to construct a denoising recommender approximating the ideal recommender $\theta^*$ as

$$\hat{\theta} = \arg\min_\theta \mathcal{L}_{\mathcal{D}_l}(\theta) + \mathcal{L}^{\text{denoise}}_{\mathcal{D}_u}(\theta),$$

where $\mathcal{L}^{\text{denoise}}_{\mathcal{D}_u}(\theta)$ indicates the loss on unlabeled data with all samples annotated correctly, i.e., denoising sampling.

METHODOLOGY
In this section, we first perform an in-depth analysis of existing solutions and their limitations. Then we propose the inverse dual loss to address the limitations of existing works on easy samples. Finally, we further propose the inverse gradient to address the limitation of the inverse dual loss, making it capable of handling not only easy samples but also hard samples that are misclassified.

Inverse Dual Loss
In this section, we first analyze the characteristics of existing solutions on the sampled unlabeled data. Then we introduce the proposed inverse dual loss to denoise the sampled data.
In practice, existing recommenders often sample from unlabeled data and treat all the sampled data as negative feedback. Such an approach introduces false negatives, which fails to retrieve items that users may be interested in, as shown by the classification boundary in Figure 2 (b). That is, there exists noise in the sampled negative data. However, existing denoising approaches mainly focus on the noise in positive samples (false positives). For example, DenoisingRec [28] attempts to achieve denoising for false-positive instances as

$$\hat{\theta} = \arg\min_\theta \mathcal{L}_{\mathcal{D}_l}(\theta) + \mathcal{L}^{\text{denoise}}_{\mathcal{D}_f}(\theta),$$

where $\mathcal{D}_f = \{(u, i, 1) \mid u \in \mathcal{U}, i \in \mathcal{I}, y^*_{ui} = 0\}$ is the noisy false-positive data they introduce in experiments. For example, R-CE (Reweight Cross-Entropy) of DenoisingRec assigns lower weights to false-positive instances with large loss (Figure 2 (c)), and T-CE (Truncated Cross-Entropy) of DenoisingRec discards those false-positive instances with large loss (Figure 2 (d)). Though achieving denoising for false-positive instances, they ignore false-negative instances and fail to address the noise brought by negative sampling.
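As a rough illustration, the two DenoisingRec strategies can be sketched as below. This is a simplification under our assumptions: the actual method uses dynamic drop-rate and weight schedules during training, which we omit here, and the names are ours.

```python
import numpy as np

def truncated_ce(losses, drop_rate=0.1):
    """T-CE (simplified sketch): discard the fraction of positive instances
    with the largest loss, treating them as likely false positives."""
    keep = int(len(losses) * (1.0 - drop_rate))
    return float(np.sort(losses)[:keep].mean())  # average over kept losses

def reweighted_ce(losses, y_hat, beta=0.5):
    """R-CE (simplified sketch): down-weight large-loss (low-confidence)
    positive instances with a prediction-based weight."""
    weights = y_hat ** beta  # small prediction (large loss) -> small weight
    return float((weights * losses).mean())
```

Both sketches act only on positive instances, which is exactly the limitation discussed above: neither touches the noise in sampled negatives.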
In fact, the collected labeled data is much cleaner than the sampled unlabeled data, and the number of false-positive instances is limited in real-world recommender systems. On the contrary, the noise brought by negative sampling is far more harmful. In other words, the noise level of positive unlabeled data incorrectly sampled as negative feedback is much higher than that of negative samples wrongly regarded as positive feedback.
To sum up, existing solutions either introduce noise or perform incomplete denoising, which motivates us to further propose a denoising solution for unlabeled data from both positive and negative perspectives.

Labeling with Inverse Dual Loss.
As shown by an existing attempt in DenoisingRec, false-positive instances come with a greater loss. This is also an apparent phenomenon in machine learning: given a positive instance and a well-trained model, the loss of classifying it as negative will be greater than that of classifying it as positive; conversely, given a negative instance and a well-trained model, the loss of classifying it as positive will be greater. Hence, we can assume the sampled unlabeled instances are possibly both positive and negative, and exploit this inherent characteristic to automatically weigh the true positive or negative instances more while weighing the false ones less.

Definition 1. (Inverse Dual Loss) The inverse dual loss is defined as a denoising loss to automatically classify the unlabeled data as

$$\mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta) = \sum_{(u,i) \in \mathcal{D}_u} w^+_{ui} \cdot \big(\!-\log \hat{y}_{ui}\big) + w^-_{ui} \cdot \big(\!-\log (1 - \hat{y}_{ui})\big),$$

where $w^+_{ui} = \mathrm{stopgrad}(\ell^-_{ui} / Z)$ and $w^-_{ui} = \mathrm{stopgrad}(\ell^+_{ui} / Z)$ are the weights for the positive loss and the negative loss, respectively, with $\ell^+_{ui} = -\log \hat{y}_{ui}$, $\ell^-_{ui} = -\log(1 - \hat{y}_{ui})$, and $Z = \ell^+_{ui} + \ell^-_{ui}$ the normalization parameter. $\mathcal{D}_u$ is the sampled unlabeled data and $\theta$ denotes the model parameters to be learned. Here $\mathrm{stopgrad}$ is a stop-gradient operation.
The advantages of the inverse dual loss are shown in (e)-(h) of Figure 2: (e) when easy negative instances are sampled, the loss of classifying them as positive is greater than that of classifying them as negative, so the inverse dual loss assigns more weight to the negative loss; (f) by gradually assigning more and more weight to the negative loss, the negative unlabeled instances are eventually classified as negative; (g) likewise, when easy positive instances are sampled, the inverse dual loss assigns more weight to the positive loss; (h) the positive unlabeled instances are eventually classified as positive, approximating the ground-truth.
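A minimal numeric sketch of the inverse dual loss, under our reading of Definition 1 (each label's weight is the normalized loss of the opposite label; a real implementation would wrap the weights in a stop-gradient/detach):

```python
import numpy as np

def inverse_dual_loss(y_hat, eps=1e-8):
    """Inverse Dual Loss (sketch): treat an unlabeled instance as possibly
    positive AND possibly negative; weigh each label by the normalized loss
    of the *opposite* label, so the more plausible (lower-loss) label
    dominates the objective."""
    pos_loss = -np.log(y_hat + eps)        # loss if labeled positive
    neg_loss = -np.log(1.0 - y_hat + eps)  # loss if labeled negative
    z = pos_loss + neg_loss                # normalization parameter
    # in a deep-learning framework these would be stopgrad/detach-ed
    w_pos, w_neg = neg_loss / z, pos_loss / z
    return w_pos * pos_loss + w_neg * neg_loss
```

For a confident prediction (e.g., $\hat{y} = 0.9$), the weight concentrates on the positive label and the combined loss is small; for an uncertain prediction ($\hat{y} = 0.5$), both labels are weighted equally and the loss stays large, so the gradient pushes the model toward whichever label is currently cheaper.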

Limitation of Inverse Dual Loss.
When easy positive or negative instances are sampled, our inverse dual loss can boost learning by correctly labeling them. However, it may become an obstacle when hard positive or negative instances are sampled. As shown in (i)-(l) of Figure 2, given hard positive or negative instances, the classification boundary is prevented from approaching the ground-truth: (i) hard negative instances are sampled; (j) half of the negative unlabeled instances are classified as positive and become noise; (k) hard positive instances are sampled; (l) half of the positive unlabeled instances are classified as negative and become noise. That is to say, our inverse dual loss relies heavily on the current training classification boundary and the difficulty of the sampled data, requiring us to further improve its robustness.

Inverse Gradient
To improve the robustness of the Inverse Dual Loss on hard sampled data with respect to the current training model, in this section we further propose the inverse gradient to adjust the gradient of falsely annotated data, inspired by the meta-learning framework [10,18]. We then analyze the convergence of the proposed Inverse Gradient.
Theorem 1. Given a learning rate $\alpha \in \mathbb{R}$, $\alpha \neq 0$, let $\theta_d = \theta - \alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)$ and $\theta_i = \theta + \alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)$ be the temporal model parameters updated by the direct gradient and the inverse gradient, respectively. Then, the relationship between their losses and that of the model with parameters $\theta$ on data $\mathcal{D}_u$ will be either

$$\mathcal{L}(\theta_d) < \mathcal{L}(\theta) < \mathcal{L}(\theta_i) \quad \text{or} \quad \mathcal{L}(\theta_i) < \mathcal{L}(\theta) < \mathcal{L}(\theta_d).$$

The proof of Theorem 1 is given in Appendix A.1. Based on this theorem, we have the following gradient updating strategies. Generally, we first split the training data into training-train data and training-test data, and pre-train the model on the training-train data. Then we calculate the direct gradient of the inverse dual loss on the sampled unlabeled data, which results in the following three cases:
• When the sampled data is easy, we can exploit the direct gradient to update the model, and it will gain a smaller test loss on the training-test data, as shown in Figure 3 (a);
• When the sampled data is hard, exploiting the inverse gradient to update the model will gain a smaller test loss on the training-test data, and thus we exploit the inverse gradient here, as shown in Figure 3 (b);
• When the model is approximately optimal, either the direct gradient or the inverse gradient will prevent it from approaching the ground-truth, and we discard this batch of unlabeled data, as shown in Figure 3 (c).
We present the procedure of exploring these three cases in Algorithm 1. The algorithm first pre-trains the model using the split training-train data. Then the model is updated with the direct gradient, with the inverse gradient, or not at all, determined by validation on the split training-test data. More specifically, the inputs of our algorithm are the labeled data $\mathcal{D}_l$, the unlabeled data $\mathcal{D}_u$, and learning rates $\alpha, \beta$. The first iteration pre-trains the model using the split training-train data. The second iteration explores the direct gradient and the inverse gradient on the loss for sampled unlabeled data, where three strategies are explored in lines 13-15 of Algorithm 1, with $\theta_d$, $\theta_i$, and $\theta$ as the model parameters updated by the direct gradient, by the inverse gradient, and without any update on the sampled unlabeled loss, respectively. Finally, the explored update direction with the minimal test loss on the training-test data is selected to update the model for this iteration.
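The core choice of the procedure can be sketched as follows. This is a sketch of the three-way selection only, not the full meta-learning loop, and the names are ours: `test_loss` stands for the loss evaluated on the held-out training-test data.

```python
def inverse_gradient_step(theta, grad_idl, test_loss, lr):
    """One Inverse Gradient selection step (sketch): try the direct gradient,
    the inverse (sign-flipped) gradient, and no update on the unlabeled loss,
    then keep whichever yields the smallest training-test loss."""
    candidates = {
        "direct": theta - lr * grad_idl,   # easy samples: follow the gradient
        "inverse": theta + lr * grad_idl,  # hard samples: flip the gradient
        "none": theta,                     # near-optimal model: discard batch
    }
    best = min(candidates, key=lambda name: test_loss(candidates[name]))
    return candidates[best], best
```

The held-out training-test data thus supervises the otherwise unsupervised annotation: a gradient computed from falsely annotated samples is simply inverted (or skipped) instead of corrupting the model.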

Convergence Analysis.
As shown in Figure 3 (d), the first case illustrates that when the unlabeled data is ideally sampled, updating with the direct gradient leads to a smaller loss and better convergence on the training-test data, while the third case, with poorly sampled data, should update with the inverse gradient. However, when the model approximates convergence on the training-test data, updating with either the direct gradient or the inverse gradient may be worse than not updating, as shown in stage 2 of Figure 3 (d).
To avoid the gradient ascent problem in the second case, we can set the learning rate for the inverse dual loss smaller than that for the test loss, i.e., $\alpha < \beta$ in Algorithm 1. In this way, the scale of the update by the gradient of the inverse dual loss stays within the scale of the update by the gradient of the test loss. That is to say, the gradient ascent problem is less likely to occur on the inverse dual loss for unlabeled data than on the loss for labeled data.

EXPERIMENTS
In this section, we perform experiments on two real-world datasets, targeting four research questions (RQs):
• RQ1: How does the proposed method perform compared with state-of-the-art denoising recommenders? What is the effect of the two proposed components, i.e., Inverse Gradient (IG) and Inverse Dual Loss (IDL)?
• RQ2: How does our proposed inverse dual loss identify the unlabeled data?
• RQ3: What is the effect of the inverse gradient on convergence?
• RQ4: What is the optimal ratio between the learning rates for the inverse dual loss and the training-test loss?
We also study RQ5: "How does the proposed method perform compared with the state-of-the-art hard negative sampling recommenders?" in Appendix A.3.

Experimental Setup
4.1.1 Datasets. To practice and verify the effectiveness of our proposed method, we conduct experiments on an industrial Micro Video dataset and the public benchmark ML1M dataset, which is widely used in existing work on recommender systems [6,19]. Micro Video is an extremely sparse dataset where users passively receive feed videos and rarely give active feedback. We introduce their details in Appendix A.4.

4.1.2 Baselines and Evaluation Metrics.
To demonstrate the effectiveness of our proposed inverse learning on unlabeled data, we compare the performance of recommenders trained with our inverse gradient (IG) against recommenders trained with the inverse dual loss (IDL) and with normal training via standard loss or negative sampling (NS) [3,13,25]. Besides, we also compare our inverse learning method with state-of-the-art methods for denoising recommender systems, specifically the two adaptive denoising training strategies, T-CE and R-CE, of DenoisingRec [28]. Following DenoisingRec [28], we select GMF and NeuMF [13], which are neural collaborative filtering models, as backbones. Their details are given in Appendix A.2.

4.1.3 Hyper-parameter Settings. For the two denoising strategies [28], we follow their default settings and verify the effectiveness of our methods under the same conditions. The embedding size and batch size of all models are set to 32 and 1,024, respectively. Besides, we adopt Adam [17] to optimize all model parameters, with the learning rate $\beta$ for labeled data initialized to 0.0001 and 0.00001 on the ML1M and Micro Video datasets, respectively, while the learning rate for sampled data is set to $\alpha = 0.1$. As for the inverse gradient, we split 90% of the training data as training-train data, and the rest is used as training-test data. The sampling rate is set to 1. The provided code includes the best hyper-parameters.
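For concreteness, the settings above can be collected into a single configuration; the key names below are ours, not taken from the released code.

```python
# Hyper-parameter settings described above (key names are illustrative)
CONFIG = {
    "embedding_size": 32,
    "batch_size": 1024,
    "optimizer": "Adam",
    "lr_labeled": {"ML1M": 1e-4, "MicroVideo": 1e-5},  # beta, per dataset
    "lr_sampled": 0.1,          # alpha, for the sampled unlabeled data
    "train_train_split": 0.9,   # 90% training-train, 10% training-test
    "sampling_rate": 1,
}
```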

Overall Performance (RQ1)
The performance comparison is shown in Table 1, from which we have the following observations.
• Our inverse gradient performs best. Our inverse gradient (IG) method achieves the best performance compared with the four baselines and our inverse dual loss (IDL) on three metrics. Specifically, our IG improves the backbone sharply, which shows the ability of our proposed method to classify the unlabeled data well and achieve effective data augmentation to resolve the data sparsity problem of existing recommenders. Note that, apart from IDL and IG, GMF is better than NeuMF in general. But NeuMF

Annotation on Unlabeled Data (RQ2)
To study the ability of our proposed method to annotate the unlabeled data, we visualize the distributions of the weights for the positive loss and the negative loss on the ML1M and Micro Video datasets, respectively, in Figure 4.

Convergence Analysis (RQ3)
To investigate the convergence of our proposed inverse gradient, we plot the loss curves on the training-test data for the ML1M and Micro Video datasets in the upper and bottom parts of Figure 6, respectively. Based on the results, we can discover that:
• Inverse Gradient can promote convergence. For the ML1M dataset, at the early stage, the GMF model is updated with the direct gradient, then with a hybrid of direct and inverse gradients, and finally with the inverse gradient. In the NeuMF model, we can discover more hybrid gradients in the valley of the loss curve. This is because the test gradient is more likely to ascend at the valley, where our Inverse Gradient inverts the gradient of the dual loss to adjust its direction for better convergence.
• A proper learning rate can prevent gradient ascent. For the Micro Video dataset, the models are always updated with the direct gradient. This is because the learning rate is relatively low here (as analyzed in Section 4.5), leading to almost no gradient ascent problem. Most importantly, we can discover that there is no case of a pass gradient in the descent procedures, which supports our analysis in Section 3.2.3 that setting a smaller learning rate $\alpha$ can avoid the gradient ascent problem for the inverse dual loss.
• The deep model converges fast but is prone to overfitting. For the ML1M dataset, the pre-training on training-train data promotes model learning effectively at the early stage for both the GMF and NeuMF models, while NeuMF comes to a fast convergence in the first epoch. However, for the Micro Video dataset, the pre-training on the training-train data conflicts with the test data, which means the deep learning-based model is prone to overfitting [5] on sparse data.

Hyper-parameter Study (RQ4)
As discussed in Section 3.2.3, we can avoid the gradient ascent problem by setting the learning rate for the inverse dual loss smaller than that for the test loss, i.e., $\alpha < \beta$ in Algorithm 1. To experimentally study this convergence analysis and investigate the impact of the learning rate on convergence, we vary the learning rate for the inverse dual loss over 1, 10, 50, and 100 times the learning rate for the test loss. Here we study the loss curve of GMF on Micro Video in Figure 7, where we can discover that:
• A smaller learning rate for the inverse dual loss can avoid gradient ascent. When $\alpha$ is smaller than 10 times $\beta$, the gradients are mostly direct gradients. However, when $\alpha$ grows to 50 times $\beta$, inverse gradients appear. Moreover, when $\alpha$ grows to 100 times $\beta$, pass gradients even appear, which indicates the occurrence of gradient ascent. This is consistent with our analysis in Section 3.2.3 that we can avoid the gradient ascent problem by limiting the learning rate for the inverse dual loss according to the learning rate for the test loss.
• A greater learning rate for the inverse dual loss speeds up convergence but brings more fluctuation. With the growth of the learning rate $\alpha$, the loss converges faster. However, this also results in fluctuation, as more inverse gradients and pass gradients appear. This observation is consistent with our convergence analysis in Section 3.2.3.

RELATED WORK
Implicit Feedback with Negative Sampling. Existing recommenders are generally based on implicit feedback data, where the collected data is often treated as positive feedback and negative sampling [3,13,25] is exploited to balance the lack of negative instances. However, the negative sampling strategy introduces noise because there are some positive unlabeled [1,9,27] instances among the sampled data. To improve existing implicit feedback recommendation, the identification of negative experiences [11,16] has attracted researchers' attention. However, these methods collect either various user feedback (e.g., dwell time [16] and skips [29]) or item characteristics [24], requiring additional feedback and manual labeling; e.g., users are supposed to actively provide their satisfaction. Besides, the evaluation of items relies heavily on manual labeling and professional knowledge [24]. Thus, in practice, these methods are too expensive to implement in real-world recommenders. In addition, hard negative sampling has been adopted to improve negative sampling [7,8,33]. However, while producing fewer false-positive samples, the hard negative instances also bring more false-negative samples. Our meta-learning method elegantly annotates the unlabeled instances based on the sparsely labeled instances.
Denoising Recommender Systems. One intuitive approach to reducing noise is to directly include more accurate feedback [22,31], such as dwell time [32] and skips [29]. However, forcefully requiring additional feedback from users may harm user experiences. To address this problem, DenoisingRec [28] achieves denoising recommendation for implicit feedback without any additional data. More specifically, it performs denoising on false-positive instances by truncating or reweighting the samples with larger loss. However, it only considers the positive feedback without further addressing the noise brought by negative sampling. Our work considers the sampled instances as possibly positive and possibly negative, and then achieves denoising data augmentation from both the positive and negative perspectives.

CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a novel method that automatically annotates the unlabeled data and adjusts the falsely annotated labels. Such exploration not only addresses the unavoidable noise brought by widely used negative sampling but also improves current denoising recommenders. Specifically, we proposed inverse learning from both the loss and gradient perspectives. The first component is the Inverse Dual Loss, which assumes the sampled data to be possibly positive or negative and annotates it automatically: if the positive loss is greater than the negative loss (i.e., it is difficult to label the data as positive), the Inverse Dual Loss inversely assigns more weight to the negative loss, and vice versa. Since the Inverse Dual Loss depends heavily on the current training model and the quality of the sampled data, we further proposed the Inverse Gradient, which makes the Inverse Dual Loss more robust by adjusting the gradient for falsely annotated instances. We designed a meta-learning method with the training data split into training-train data and training-test data. The model is first pre-trained on the training-train data. Then the pre-trained model explores updating with the gradient, with the additive inverse of the gradient, or not at all, determined by the training-test data.
As for future work, we plan to apply our inverse learning with more recommendation models as the backbones to further verify the generalization of our proposed methods.

A APPENDIX A.1 Proof of Theorem 1
Theorem 1. Given a learning rate $\alpha \in \mathbb{R}$, $\alpha \neq 0$, assume the temporal model parameters updated by the direct gradient and the inverse gradient, respectively, are $\theta_d = \theta - \alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)$ and $\theta_i = \theta + \alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)$. Then, for a data instance $(u, i, y^*_{ui}) \in \mathcal{D}_u$, the relationship between their losses and that of the model with parameters $\theta$ will be either of the orderings in Eqn. (5) and Eqn. (6) below.

Proof. To simplify the problem, we argue from a stochastic perspective with one instance $(u, i, y^*_{ui})$. Our target is to satisfy either

$$\mathcal{L}(\theta_i) < \mathcal{L}(\theta) < \mathcal{L}(\theta_d) \quad (5)$$

or

$$\mathcal{L}(\theta_d) < \mathcal{L}(\theta) < \mathcal{L}(\theta_i), \quad (6)$$

where $\mathcal{L}(\theta_d)$, $\mathcal{L}(\theta)$, $\mathcal{L}(\theta_i)$ are the losses of the models with parameters $\theta_d$, $\theta$, $\theta_i$. To simplify the proof procedure, we define the prediction function as the hypothesis function of the well-known logistic regression [14] 1 :

$$\hat{y}_{ui} = \sigma(\theta^\top x_{u,i}) = \frac{1}{1 + e^{-\theta^\top x_{u,i}}},$$

where $x_{u,i}$ is the input feature of the interaction between user $u$ and item $i$. Given the learning rate $\alpha \in \mathbb{R}$, $\alpha \neq 0$, with $\theta_d = \theta - \alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)$ and $\theta_i = \theta + \alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)$, we have

$$\theta_d^\top x_{u,i} = \theta^\top x_{u,i} - \alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)^\top x_{u,i}, \qquad \theta_i^\top x_{u,i} = \theta^\top x_{u,i} + \alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)^\top x_{u,i}.$$

Assume $\alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)^\top x_{u,i} < 0$; then, since $\sigma$ is increasing,

$$\hat{y}_{ui}(\theta_d) > \hat{y}_{ui}(\theta) > \hat{y}_{ui}(\theta_i). \quad (10)$$

Review the loss function from a stochastic perspective with one instance $(u, i, y^*_{ui})$:

$$\mathcal{L}(\theta) = -\big[ y^*_{ui} \log \hat{y}_{ui} + (1 - y^*_{ui}) \log (1 - \hat{y}_{ui}) \big].$$

Suppose $(u, i, y^*_{ui})$ is a positive instance with $y^*_{ui} = 1$; then $\mathcal{L}(\theta) = -\log \hat{y}_{ui}$, which is a decreasing function of the predicted probability $\hat{y}_{ui}$. Based on the ordering of Eqn. (10), we have $\mathcal{L}(\theta_d) < \mathcal{L}(\theta) < \mathcal{L}(\theta_i)$, which satisfies the target of Eqn. (6). Similarly, if $y^*_{ui} = 0$, the loss $\mathcal{L}(\theta) = -\log(1 - \hat{y}_{ui})$ is increasing in $\hat{y}_{ui}$, so $\mathcal{L}(\theta_i) < \mathcal{L}(\theta) < \mathcal{L}(\theta_d)$, which satisfies the target of Eqn. (5). If $\alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)^\top x_{u,i} > 0$, we reach the symmetric conclusion. That is to say, in all cases, either Eqn. (5) or Eqn. (6) will be satisfied.

The case $\alpha \cdot \nabla \mathcal{L}^{\text{IDL}}_{\mathcal{D}_u}(\theta)^\top x_{u,i} = 0$ would correspond to gradient vanishing, and modern machine learning approaches often randomly initialize the features to avoid such a case. □ 1 https://www.coursera.org/learn/machine-learning
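The bracketing property of Theorem 1 can be checked numerically for the logistic model used in the proof. This is an illustrative sketch with our own names; `ordering_holds` verifies that the current loss is always the middle value among the three candidates.

```python
import numpy as np

def ordering_holds(theta, x, y_true, alpha=0.1):
    """Numerical check of Theorem 1 (sketch): with a logistic model, the
    parameters updated by the direct and inverse gradients bracket the
    current loss, i.e., the current loss is always the middle value."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    loss = lambda th: float(-(y_true * np.log(sigmoid(th @ x))
                              + (1 - y_true) * np.log(1.0 - sigmoid(th @ x))))
    grad = (sigmoid(theta @ x) - y_true) * x        # gradient of the log-loss
    theta_d = theta - alpha * grad                  # direct-gradient update
    theta_i = theta + alpha * grad                  # inverse-gradient update
    values = sorted([loss(theta_d), loss(theta), loss(theta_i)])
    return values[1] == loss(theta)
```

Because the logistic loss is monotone in the logit $\theta^\top x$, the ordering holds for any nonzero learning rate, not just small ones, matching the theorem's $\alpha \neq 0$ condition.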

A.2 Baselines
The details of the training strategies are as follows.
• Traditional strategies: standard loss without sampling, and with negative sampling (NS) [3,13,25].
• Denoising strategies: T-CE [28] of DenoisingRec, which truncates the loss for false-positive instances, and R-CE [28] of DenoisingRec, which weighs the loss for false-positive instances less; and our proposed inverse dual loss, which weighs the label with the smaller loss (true-positive or true-negative instances) more.
Besides, we also illustrate the backbones here.
• GMF [13]: A variant of matrix factorization with the element-wise product and a linear neural layer as the interaction function instead of the inner product.
• NeuMF [13]: A combination of GMF and a Multi-Layer Perceptron.

A.3 Hard Negative Sampling Comparison (RQ5)
To compare the performance of our proposed method with SOTA hard negative sampling methods, we adopt two typical baselines as below.
• DNS [33]: Dynamic Negative Sampling (DNS) samples unlabeled instances as negatives and picks the hard instances with high predicted scores.
• SRNS [8]: SRNS further selects true negatives with high variance to improve DNS.
The results of the performance comparison with the hard negative sampling methods are shown in Table 2, where we can discover that the hard negative sampling methods indeed perform worse in our setting, where both ground-truth positive and negative samples are tested. Here the hard negative sampling methods are even outperformed by plain negative sampling and by the method without sampling, which means they indeed increase the false-negative instances, as discussed in Section 1.
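The DNS baseline's selection rule can be sketched as below (an illustrative simplification with our own names; `score_fn` stands for the current model's predicted score).

```python
import random

def dynamic_negative_sampling(score_fn, unlabeled_items, n_candidates=8, seed=0):
    """DNS (sketch): draw a pool of unlabeled items at random, then pick the
    one the current model scores highest as the hard negative. High-scoring
    unlabeled items are informative, but they are also the items most likely
    to be true positives, i.e., false negatives."""
    rng = random.Random(seed)
    pool = rng.sample(unlabeled_items, n_candidates)
    return max(pool, key=score_fn)
```

The comment inside the function is exactly the failure mode Table 2 exposes: picking the highest-scoring unlabeled item preferentially selects the instances most likely to be mislabeled.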

A.4 Datasets and Pre-processing
We introduce the details of these datasets, including the pre-processing steps, as follows.
The public ML1M dataset is published at 2 , and we have also uploaded the processed dataset at the link of the code and the supplementary material. The statistics of our adopted Micro Video dataset and the ML1M dataset are shown in Table 3. We will make the Micro Video dataset public to benefit the community. 4 That is to say, we have extremely limited reliable feedback in this data, which is very challenging in modern industry. We take the active feedback of like and hate to analyze the regular pattern between the playing time and the duration of each video, which are included in each interaction. From Figure 8, we can discover that users' like and hate behaviors are related to the finish rate, i.e., the playing time (the user's playing time on a certain item) divided by the duration (the item's total time): when users like a video, they are more likely to finish watching it, and vice versa. Thus we treat a finish rate greater than 80% as positive feedback and less than 20% as negative feedback. Another way to classify positive and negative feedback is according to the playing time, and we also report the results of treating a playing time over the upper quartile as positive feedback and under the lower quartile as negative feedback. From Table 5, we can observe that the finish rate is more suitable for pattern capturing, and we adopt it for feedback processing in the experimental evaluation. ML1M 3 is a widely used public movie dataset in recommendation. The rating scores in ML1M range from 1 to 5, and we treat rating scores over 3 as positive feedback and under 2 as negative feedback, following DenoisingRec [28]. Besides, we split 60%, 20%, and 20% of the data as training, validation, and test data for both datasets.
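The finish-rate labeling rule described above can be sketched as a small helper (the function name and thresholds-as-parameters are ours):

```python
def label_from_finish_rate(playing_time, duration, pos_th=0.8, neg_th=0.2):
    """Feedback labeling for the Micro Video data (sketch of the described
    pre-processing): finish rate = playing_time / duration; above 80% is
    positive, below 20% is negative, and anything in between stays
    unlabeled (unclear feedback)."""
    rate = playing_time / duration
    if rate > pos_th:
        return 1       # positive feedback
    if rate < neg_th:
        return 0       # negative feedback
    return None        # unclear feedback, left unlabeled
```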

1 INTRODUCTION
Figure 1: Performance comparison of training with three data types on two datasets. Here "labeled data" means training with only labeled data; "labeled data and unlabeled data" means supervised training on labeled data plus unsupervised training on unlabeled data; "labeled data mixed with guided unlabeled data" means splitting off part of the labeled data to guide the unsupervised learning on unlabeled data.

3.1.1 Analysis of Existing Approach. We first explain the data sparsity problem in recommender systems from the perspective of the classification boundary, based on which we introduce the existing solutions. As shown in Figure 2(a), labeled data in recommender systems tends to be extremely sparse compared with the large amount of unlabeled data. A recommendation model is therefore prone to overfitting under such sparse supervision.

Figure 2: Illustrations of existing solutions and of our inverse dual loss's effectiveness and limitation. (a)-(d) illustrate existing solutions: (a) there is a large amount of unlabeled data; (b) the traditional negative sampling approach; (c) the reweighted loss of DenoisingRec; (d) the truncated loss of DenoisingRec applied to a false-negative instance. (e)-(h) illustrate our inverse dual loss's effectiveness with easy sampling: (e) easy negative instances are sampled; (f) the sampled instances are labeled as true negatives; (g) easy positive instances are sampled; (h) the sampled instances are labeled as true positives, approximating the ground truth. (i)-(l) illustrate our inverse dual loss's limitation with hard sampling: (i) hard negative instances are sampled; (j) part of the sampled instances are labeled as false positives; (k) hard positive instances are sampled; (l) part of the sampled instances are labeled as false negatives.
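To make the dual-loss idea above concrete, here is one plausible reading as a minimal sketch: weight the positive-label and negative-label cross-entropy terms on an unlabeled instance by the model's own confidence, so the likely-true label dominates the gradient and the likely-false label is suppressed. The weighting scheme `w_pos = score`, `w_neg = 1 - score` is an illustrative assumption, not necessarily the paper's exact formula.

```python
import math


def dual_loss(score):
    """Confidence-weighted dual BCE on one unlabeled instance (sketch).

    `score` is the model's predicted probability that the instance is
    positive. Both the positive-label and the negative-label binary
    cross-entropy terms are computed, and each is weighted by the model's
    confidence in that label. This weighting is an assumption made for
    illustration only.
    """
    eps = 1e-12                              # numerical safety for log
    pos_term = -math.log(score + eps)        # BCE against label 1
    neg_term = -math.log(1.0 - score + eps)  # BCE against label 0
    w_pos, w_neg = score, 1.0 - score        # confidence weights
    return w_pos * pos_term + w_neg * neg_term
```

Under this sketch a confident prediction (score near 0 or 1) contributes a small loss, while an ambiguous prediction (score near 0.5) contributes the largest loss, matching the intuition that easy samples yield reliable pseudo-labels (panels (e)-(h)) while hard samples remain risky (panels (i)-(l)).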

3.2.1 Learning to Label with Inverse Gradient. In this part, we introduce our solution for tackling the falsely annotated instances of the Inverse Dual Loss.

Definition 2. (Inverse Gradient) We define the gradient and the additive inverse of the gradient calculated by (4), i.e., $\nabla \mathcal{L}_{D_u}(\theta)$ and $-\nabla \mathcal{L}_{D_u}(\theta)$, as the direct gradient and the inverse gradient, respectively, of the loss on the unlabeled data $D_u$.

Theorem 1. Given a learning rate $\eta \in \mathbb{R}$, $\eta \neq 0$, assume the temporal model parameters updated by the direct gradient and the inverse gradient, respectively, are $\theta_d = \theta - \eta \cdot \nabla \mathcal{L}_{D_u}(\theta)$ and $\theta_i = \theta + \eta \cdot \nabla \mathcal{L}_{D_u}(\theta)$.
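Definition 2 suggests a simple meta-learning selection rule: tentatively apply both the direct and the inverse update from the unlabeled-data gradient, and keep whichever temporal parameter gives the lower loss on the labeled guidance data. A minimal sketch on a one-dimensional parameter (the 1-D setting and the function names are illustrative assumptions):

```python
def inverse_gradient_step(theta, grad_unlabeled, lr, labeled_loss):
    """Sketch of one Inverse Gradient meta-step (1-D for clarity).

    Tentatively update with the direct gradient (theta - lr * g) and with
    the inverse gradient (theta + lr * g), then keep whichever temporal
    parameter achieves the lower loss on the labeled guidance data.
    """
    theta_direct = theta - lr * grad_unlabeled   # direct-gradient update
    theta_inverse = theta + lr * grad_unlabeled  # inverse-gradient update
    if labeled_loss(theta_direct) <= labeled_loss(theta_inverse):
        return theta_direct
    return theta_inverse
```

For example, if a noisy pseudo-label produces a gradient pointing away from the labeled-data optimum, the inverse direction is selected, so the wrong annotation cannot push the model off course.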

Figure 4: Positive and negative weight distributions for the dual loss on ML1M at the first (top) and final (bottom) epochs.

Figure 6: The loss on the training and test data with the adapted gradient from the dual loss of unlabeled data. Each point on the loss curve is marked with its update gradient direction.

Figure 7: The loss of the GMF model on the training and test data with different learning rates for the adapted gradient on the Micro Video dataset. Each point on the loss curve is marked with its update gradient direction.

Figure 8: Joint distribution of playing time and duration for like (left) and hate (right) feedback.

Table 1: Performance comparisons with GMF and NeuMF backbones on two datasets. Bold and underline refer to the best and second-best results, respectively. Here IG includes the IDL method.
From Table 1, NeuMF with IG can significantly improve the performance, even surpassing GMF with IG. This is because NeuMF is a deep model that tends to overfit when data is scarce or noisy, so it benefits more from denoising. Besides, IG outperforms the existing negative sampling (NS) method, which indicates that there is indeed a large amount of positive unlabeled data, and directly treating it all as negative feedback confuses the model. Finally, IG also outperforms the existing state-of-the-art denoising methods, T-CE and R-CE, showing the importance of tackling the noise in both positive and negative feedback.
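Since the tables report AUC, a self-contained reference implementation of the metric may help interpret the numbers. This is the standard rank-based AUC (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half), not code from the paper's repository; the O(n^2) pair counting is chosen for clarity over speed.

```python
def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked
    correctly, counting ties as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because both ground-truth positive and negative samples are tested in this setting, AUC directly measures how well a model separates the two, which is why hard negative sampling's false negatives hurt it in Table 2.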

Table 2: Performance comparisons with hard negative sampling using GMF and NeuMF backbones on two datasets. Bold and underline refer to the best and second-best results, respectively.

Table 3: Data statistics for the processed Micro Video and ML1M datasets.

Table 4: Interaction statistics for the Micro Video dataset.

Micro Video. This dataset is collected from one of the largest micro-video platforms in China, where user behaviors such as playing time, like, and hate are recorded. The data is downsampled from September 11 to September 22, 2021. Users passively receive the recommended videos, and there is extremely limited active feedback, such as like and hate, as shown in Table 4.

Table 5: Performance comparison of GMF on the Micro Video dataset based on feedback classification by finish rate and by playing time.

All the models are implemented in Python with the PyTorch framework, based on the DenoisingRec repository. The environment is as follows.