Unbiased Delayed Feedback Label Correction for Conversion Rate Prediction

Conversion rate prediction is critical to many online applications such as digital display advertising. To capture dynamic data distributions, industrial systems often retrain models on recent data daily or weekly. However, the delay of conversion behavior usually leads to incorrect labeling, which is called the delayed feedback problem. Existing work may fail to introduce the correct information about false negative samples due to data sparsity and dynamic data distribution. To directly introduce the correct feedback label information, we propose an Unbiased delayed feedback Label Correction framework (ULC), which uses an auxiliary model to correct labels for observed negative feedback samples. First, we theoretically prove that the label-corrected loss is an unbiased estimate of the oracle loss using true labels. Then, as there is no ready training data for label correction, counterfactual labeling is used to construct artificial training data. Furthermore, since counterfactual labeling utilizes only partial training data, we design an embedding-based alternative training method to enhance performance. Comparative experiments on both public and private datasets and detailed analyses show that our proposed approach effectively alleviates the delayed feedback problem and consistently outperforms the previous state-of-the-art methods.


INTRODUCTION
Predicting the probability of users clicking or converting on ads or items is critical to many online applications, such as digital display advertising and recommender systems. Take online advertising as an example. Generally, ad delivery platforms provide advertisers with several optional billing models, such as Cost Per thousand iMpressions (CPM), Cost Per Click (CPC) and Cost Per Acquisition (CPA), among which CPA is preferred as the conversion is closer to advertisers' profits. For the CPA model, predicting the click rate and conversion rate of users on the placed advertisements is the key to achieving more revenue; these are known as the Click-Through Rate (CTR) and Conversion Rate (CVR) prediction tasks. These two tasks have received increasing attention from industry and academia in recent years [5,30,32].
Model freshness is important for CTR and CVR prediction models as user interests change dynamically. A common strategy in industry to keep models fresh is to retrain the model daily or weekly on all collected data. This simple strategy can be effective for CTR prediction. However, the delay of conversion behavior makes it challenging to ensure the freshness of CVR models, which is called the Delayed Feedback Problem. Unlike clicks, which happen within minutes of an impression, conversions occur much more slowly, after days and sometimes up to weeks [2]. As a result, the ground truth of recently clicked but unconverted samples is unknown, as they may convert in the future.
A vanilla solution is to treat all these unconverted samples as negative feedback, which will cause some positive samples (i.e., real conversions) to be mislabeled, leading to the false negative problem. These mislabeled samples can significantly damage the performance of the CVR prediction model as they are important to model freshness. Another obvious solution is to wait for a long time until the labels are accurate enough. However, this means that the data is old, which conflicts with the purpose of keeping models fresh. Thus, the delayed feedback problem reflects a trade-off between model freshness and label correctness. Therefore, handling fresh unconverted data with unknown labels is an important challenge for CVR prediction.
Existing methods for the delayed feedback problem can be classified into two types based on the problem setting: online training [3,4,7,10,11,24,25,31] and offline training [2,16,27,28]. In the online setting, when new user behaviors are logged, the model is immediately updated on the new data to keep it fresh. In the offline setting, the model is retrained daily or weekly on all collected data to ensure freshness. Industrial systems often choose the appropriate training setting based on their business requirements [4]. The methods under different settings are quite different, and here we focus only on offline training.
To our knowledge, DFM [2] first studied the delayed feedback problem. It proposes to explicitly model the delay time under delay distribution assumptions. However, the actual delay may not obey these assumptions, which leads to suboptimal performance. Recent work [26,27] in offline learning attempts to construct unbiased estimates of the oracle loss that uses true labels to alleviate the delayed feedback problem. However, although these methods are theoretically unbiased, we argue that they may fail to introduce the correct information about false negative samples, as they do not construct the correct samples corresponding to these mislabeled samples.
In this paper, we aim to address the delayed feedback problem in CVR prediction through label correction. We theoretically prove that if the label of each observed unconverted sample is corrected to its probability of being a false negative sample, the label-corrected loss is an unbiased estimate of the oracle loss that uses true labels. Compared to existing unbiased methods for delayed feedback, the advantage of label correction is that it directly complements the information of the correct samples corresponding to the false negative samples. Further, we train a label correction (LC) model to correct the labels of the observed unconverted samples. If the LC model is accurate enough, the delayed feedback problem can be well addressed.
However, training an accurate LC model is non-trivial. Above all, there is no ready training data for the LC model. We use a counterfactual labeling method to construct artificial training data by setting a counterfactual deadline. Nevertheless, counterfactual labeling utilizes only partial training data. To enhance the performance of the LC model, we further design an alternative learning method based on embedding transfer. To demonstrate the effectiveness of our method, we conduct extensive experiments on public and private datasets. Experimental results show that our method can effectively alleviate the delayed feedback problem. Our main contributions can be summarized as follows.
• To the best of our knowledge, this is the first work in the offline training setting to use an unbiased label correction approach to solve the delayed feedback problem of CVR prediction.

RELATED WORK
2.1 CVR Prediction
The CVR prediction task shares many similarities with the widely studied CTR prediction task. They both predict the probability of a user performing a certain behavior on an ad or an item. Besides, their inputs are generally the same. Generally, the model structure designed for the CTR task can also be applied to CVR prediction. Thus, existing research on CVR prediction focuses more on the differences between CVR and CTR. There are three main challenges for CVR prediction. First, the data for the CVR task are often sparser than for the CTR task. Existing research mitigates this problem through multi-task learning [9,14] and pre-training [18]. Second, CVR prediction suffers from selection bias. The CVR prediction model is trained on click samples but is applied to all exposure samples during inference. Differences between the exposure distribution and the click distribution lead to selection bias, which existing work addresses through entire sample space modeling [13,19,22,23], inverse propensity scoring [1,29], and doubly robust methods [6,15]. Third, conversions do not happen as immediately as clicks, with some conversions taking days or even a week. As a result, some positive samples that have not yet converted may be incorrectly treated as negative samples. Existing studies address this by delay time modeling [2,21,28] or importance sampling [4,27], which we detail in Section 2.2. In this work, we focus on the third challenge, the delayed feedback problem, and leave the extension of our method to other problems for future work.

Delayed Feedback
Here we only focus on the delayed feedback problem in the offline setting.
To our knowledge, the delayed feedback problem was first studied by DFM [2]. DFM models the delay time explicitly. It assumes that the delay time obeys an exponential distribution and then maximizes the likelihood of the currently observed data labels. [28] extends this approach further by using a non-parametric approach to model the delay time. A drawback of the above methods is that they try to optimize the observed conversion information instead of directly optimizing the true conversion information.
In contrast to explicitly modeling delay time, recent work [26,27] attempts to address the delayed feedback problem by constructing unbiased estimates of the oracle loss that uses true labels. FSIW [27] leverages importance sampling to construct an unbiased loss. Intuitively, it increases the weight of observed positive samples and decreases the weight of potentially negative samples, as these samples may be mislabeled. Besides, nnDF [26] assumes that the labels of samples before a time window are accurate and then uses these samples to correct the biased loss of the whole training data. A drawback of the above methods is that, despite their theoretical guarantee of unbiasedness, they might fail to introduce information about the correct positive sample for each specific false negative sample. FSIW only reduces the weight of the mislabeled samples but cannot introduce the information of the corrected sample, i.e., the weight of the corresponding correct sample is still zero. nnDF does not process recent samples, and therefore cannot introduce information about the correct samples among them. This problem is worse when the data distribution has changed recently. As the information about the false negative samples may differ from the past observed positive samples, using only the observed positive samples cannot complement the correct information about the fresh false negative samples.
Unlike the above state-of-the-art approaches, we propose to correct the label of each observed negative sample so that the information of the correct sample can be introduced directly. Besides, it can also be proved theoretically that the label-corrected loss is an unbiased estimate of the oracle loss.

PRELIMINARIES
3.1 Notations
In online advertising platforms, the user behaviors for the display ads are logged to train the CVR prediction model. Suppose we collect training data $\mathcal{D}$ at timestamp $T$, i.e., we can obtain all the user behaviors and corresponding features before $T$. Let $\mathcal{D} = \{(x_i, c_i, v_i, t_i, e_i),\ i = 1, 2, \ldots\}$, where $i$ indexes the $i$-th sample. Each sample represents a click record of a user. For the $i$-th sample $(x_i, c_i, v_i, t_i, e_i)$, $x_i$ denotes the feature information of this sample. $c_i$ denotes the click timestamp. $v_i$ is a binary value that denotes whether the clicked ad has a further conversion before the observed timestamp $T$. If $v_i = 1$, $t_i$ records the corresponding conversion timestamp. Otherwise, $t_i$ is empty. $e_i$ denotes the time elapsed from $c_i$ to $T$, i.e., $e_i = T - c_i$.
Let $y_i$ denote whether the $i$-th sample will finally lead to a conversion. Note that we cannot wait forever for the possible conversion to happen. In practice, a long time window $w$ is applied depending on the specific scenario, e.g., one month for Criteo [2]. Only conversions within the time window after clicks are considered valid. In other words, if the elapsed time satisfies $e_i \ge w$, then $y_i = v_i$, where $v_i$ is the observed label. If $e_i < w$, $y_i$ is unknown. Thus, $y_i$ is not included in the training data $\mathcal{D}$. For test data, we can wait enough time to obtain $y_i$ for evaluation.
For easy reading, the notations are summarized in Table 1.

Task Formulation
The conversion rate is defined as the probability of the final conversion for a clicked ad, i.e., $P(y_i = 1 \mid x_i)$, where $x_i$ is the feature vector and $y_i$ the final conversion label of sample $i$. The CVR prediction task under delayed feedback aims to use the training data $\mathcal{D}$ collected at timestamp $T$ to predict the conversion rate for the ads clicked after $T$. Note that training samples clicked before $T - w$ (i.e., with elapsed time $e_i \ge w$, where $w$ is the conversion time window) can be fed directly into the model without any processing as their labels are correct. Since the core issue for delayed feedback is how to handle the fresh data with unknown labels, we omit these data in the rest of this paper for simplicity, which does not influence the correctness of our proof and method.
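The labeling rule from the notation and task formulation can be illustrated with a minimal sketch. The `Click` record, the `observe` helper, and the use of days as the time unit are hypothetical choices made only for illustration; they are not part of the original paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Click:
    click_ts: float            # click timestamp (days, assumed unit)
    conv_ts: Optional[float]   # true conversion timestamp, None if it never converts


def observe(click: Click, collect_ts: float, window: float) -> Tuple[int, float, Optional[int]]:
    """Return (observed label v, elapsed time e, final label y or None if still unknown)."""
    e = collect_ts - click.click_ts
    # v = 1 only if a valid conversion (inside the window) is already logged at collection time
    v = int(
        click.conv_ts is not None
        and click.conv_ts <= collect_ts
        and click.conv_ts - click.click_ts <= window
    )
    # the final label is only known once the full window has elapsed
    y = v if e >= window else None
    return v, e, y


old_click = observe(Click(0.0, 5.0), collect_ts=40.0, window=30.0)
fresh_click = observe(Click(35.0, 45.0), collect_ts=40.0, window=30.0)
```

Here `fresh_click` is observed as a negative even though it converts later, which is exactly the kind of sample whose label is unknown at training time.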

Vanilla and Oracle Loss
Next, we introduce the two basic loss functions in the delayed feedback problem. Note that CVR prediction is essentially a binary classification problem. Generally, the cross-entropy loss is adopted for training the CVR model. Let $f(\cdot; \theta)$ denote the CVR model with trainable parameters $\theta$. Suppose we can now foresee the future and obtain an ideal dataset $\mathcal{D}^*$, which contains the final conversion label $y_i$ for each sample. Then the cross-entropy loss can be written as:
$$\mathcal{L}_{oracle} = -\frac{1}{|\mathcal{D}^*|} \sum_{(x_i, y_i) \in \mathcal{D}^*} \left[ y_i \log f(x_i; \theta) + (1 - y_i) \log\left(1 - f(x_i; \theta)\right) \right] \quad (1)$$
Equation (1) is called the oracle loss $\mathcal{L}_{oracle}$ as we suppose the final conversion label $y_i$ for each click record is available.
However, in practice, we cannot obtain the oracle label $y_i$ for each sample at the data collection timestamp $T$. If we ignore the delayed feedback and replace the oracle label $y_i$ with the observed label $v_i$, we get the vanilla loss $\mathcal{L}_{vanilla}$ for CVR model training:
$$\mathcal{L}_{vanilla} = -\frac{1}{|\mathcal{D}|} \sum_{(x_i, v_i) \in \mathcal{D}} \left[ v_i \log f(x_i; \theta) + (1 - v_i) \log\left(1 - f(x_i; \theta)\right) \right] \quad (2)$$
Note that some samples may convert after the data collection timestamp $T$. Obviously, the vanilla loss will incorrectly treat some positive samples as negative samples, which will damage the performance of the CVR prediction model.
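To make the difference between the two losses concrete, the following minimal sketch evaluates the same predictions against true labels and against observed labels. The toy labels and predictions are invented for illustration, and the mean-normalized cross-entropy is an assumed form of the loss.

```python
import math


def cross_entropy(labels, preds, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Toy data: the second sample converts after the collection time,
# so its observed label is 0 although its final label is 1.
y_true = [1, 1, 0, 0]
v_obs = [1, 0, 0, 0]
preds = [0.8, 0.7, 0.2, 0.1]

oracle_loss = cross_entropy(y_true, preds)   # uses final labels
vanilla_loss = cross_entropy(v_obs, preds)   # uses observed labels
```

In this toy case the vanilla loss penalizes the model for predicting a high conversion probability on the false negative sample, which is the damage described above.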

UNBIASED LABEL CORRECTION FOR DELAYED FEEDBACK PROBLEM
4.1 Overall Framework
We propose an Unbiased delayed feedback Label Correction framework (ULC), which aims to address the delayed feedback problem in CVR prediction through label correction. The key idea is that delayed feedback leads to, and only leads to, incorrect labels. If we are able to identify all the incorrect labels and correct them, we can directly calculate the oracle loss.
Fig. 1 illustrates the overall framework of ULC, which consists of a label correction (LC) model and a CVR prediction model. The LC model is designed to predict the probability that an observed unconverted training sample will finally convert, which is used to calculate our proposed label-corrected loss for CVR model training. We prove in Section 4.2 that if the LC model is accurate enough, the label-corrected loss is an unbiased estimate of the oracle loss.
The next question is how to learn an accurate LC model. As there is no ready training data for the LC model, we leverage counterfactual labeling to generate training data. It constructs the artificial data by imagining a counterfactual data collection time $T' < T$, where $T$ is the actual data collection time; the details are introduced in Section 4.3. However, counterfactual labeling suffers from some problems, such as inadequate utilization of the whole training data, which we analyze in Section 4.4. To mitigate this problem, we further apply alternative training to re-train these two models, enhancing the performance.

Unbiased Loss via Label Correction
Suppose we have a label correction model $g(\cdot; \phi)$ that predicts the probability $p_i$ that an observed non-conversion sample $i$ is a final conversion sample, i.e., $P(y_i = 1 \mid x_i, e_i, v_i = 0)$. Then we can directly correct the labels of the observed negative samples and obtain the following label-corrected (LC) loss:
$$\mathcal{L}_{lc} = -\frac{1}{|\mathcal{D}|} \sum_{i} \left[ v_i \log f(x_i; \theta) + (1 - v_i) \left( p_i \log f(x_i; \theta) + (1 - p_i) \log\left(1 - f(x_i; \theta)\right) \right) \right]$$
The above loss is an unbiased estimate of the oracle loss if we have an ideal label correction model, as shown in the following theorem.
Theorem 1. If the label correction model is ideal, i.e., $p_i = P(y_i = 1 \mid x_i, e_i, v_i = 0)$, then the LC loss is an unbiased estimate of the oracle loss.
The advantage of the LC loss over the previous unbiased losses (e.g., FSIW and nnDF) is that it directly complements the information of the correct sample corresponding to the false negative samples, i.e., the term $p_i (1 - v_i) \log f(x_i; \theta)$, where $p_i$ is the probability predicted by the LC model and $v_i$ is the observed label. The existing unbiased losses complement the corresponding information in indirect ways, which are strongly influenced by data sparsity and data dynamics. For example, in practice, FSIW complements the correct information by increasing the weights of observed positive samples similar to the false negative sample. However, for some fresh false negative samples, there may not exist similar observed positive samples due to the sparsity and dynamics of CVR data. In this case, these indirect methods cannot effectively supplement the corresponding positive sample information, which is important for the delayed feedback problem as false negative samples are often fresh. In contrast, the LC loss can adequately solve this problem as it directly corrects the label and complements the corresponding correct information.
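The theorem's proof is not reproduced in this section. As a hedged sketch of why the statement is plausible, write $f$ for the CVR model, $v_i$ for the observed label, $y_i$ for the final label, $e_i$ for the elapsed time, and $p_i = P(y_i = 1 \mid x_i, e_i, v_i = 0)$ for the ideal correction probability (symbol names assumed here for illustration). Conditioning the per-sample oracle term on what is observed gives the per-sample LC term:

```latex
\begin{align*}
\ell_i^{\mathrm{oracle}} &= -\big[\, y_i \log f(x_i;\theta) + (1-y_i)\log\big(1-f(x_i;\theta)\big) \big], \\
\mathbb{E}\big[\ell_i^{\mathrm{oracle}} \mid x_i, e_i, v_i = 1\big]
  &= -\log f(x_i;\theta)
  \qquad (\text{an observed conversion implies } y_i = 1), \\
\mathbb{E}\big[\ell_i^{\mathrm{oracle}} \mid x_i, e_i, v_i = 0\big]
  &= -\big[\, p_i \log f(x_i;\theta) + (1-p_i)\log\big(1-f(x_i;\theta)\big) \big], \\
\ell_i^{\mathrm{lc}}
  &= -\big[\, v_i \log f(x_i;\theta)
     + (1-v_i)\big( p_i \log f(x_i;\theta) + (1-p_i)\log(1-f(x_i;\theta)) \big) \big] \\
  &= \mathbb{E}\big[\ell_i^{\mathrm{oracle}} \mid x_i, e_i, v_i\big].
\end{align*}
```

Averaging over samples and taking the outer expectation then yields equal expectations for the LC loss and the oracle loss. This is only an outline under the ideal-model assumption, not the paper's full argument.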
The remaining problem is how to train an accurate LC model, which we introduce in the next section.
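The label-corrected loss and its relation to the oracle loss can be checked numerically on a toy example. The sketch below is illustrative only: the function names and toy numbers are assumptions, and the expectation is computed by enumerating every possible outcome of the unobserved labels rather than by any procedure from the paper.

```python
import itertools
import math


def bce_term(label, pred):
    """Cross-entropy of one sample; label may be a soft label in [0, 1]."""
    return -(label * math.log(pred) + (1.0 - label) * math.log(1.0 - pred))


def lc_loss(v_obs, corr_probs, preds):
    """Label-corrected loss: observed negatives get the soft label corr_probs[i]."""
    total = 0.0
    for v, p, f in zip(v_obs, corr_probs, preds):
        soft_label = 1.0 if v == 1 else p
        total += bce_term(soft_label, f)
    return total / len(v_obs)


def expected_oracle_loss(v_obs, corr_probs, preds):
    """Exact expectation of the oracle loss when each observed negative
    converts independently with probability corr_probs[i]."""
    unknown = [i for i, v in enumerate(v_obs) if v == 0]
    expected = 0.0
    for outcome in itertools.product([0, 1], repeat=len(unknown)):
        y = list(v_obs)
        weight = 1.0
        for i, yi in zip(unknown, outcome):
            y[i] = yi
            weight *= corr_probs[i] if yi == 1 else 1.0 - corr_probs[i]
        loss = sum(bce_term(a, b) for a, b in zip(y, preds)) / len(y)
        expected += weight * loss
    return expected


v_obs = [1, 0, 0, 0]
corr_probs = [0.0, 0.6, 0.1, 0.05]   # entry for the observed positive is unused
preds = [0.8, 0.7, 0.2, 0.1]

lc_value = lc_loss(v_obs, corr_probs, preds)
oracle_expectation = expected_oracle_loss(v_obs, corr_probs, preds)
```

The two values agree up to floating-point error, which mirrors the linearity argument behind the unbiasedness claim when the correction probabilities are exact.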

Data Generation with Counterfactual Labeling
For the LC model, there is no ready training data. Note that we need samples with $v_i = 0$ & $y_i = 1$ as positive samples and samples with $v_i = 0$ & $y_i = 0$ as negative samples, where $v_i$ is the observed label and $y_i$ the final label. However, there are only samples with $v_i = 1$ & $y_i = 1$ and samples with $v_i = 0$ in the original data. To train the LC model, we need to construct artificial samples. We leverage a counterfactual method [27] to generate training data for the LC model. First, we imagine that the training data was collected at a counterfactual deadline (CD) before the training data's actual deadline (AD), i.e., the collection timestamp $T$. The time interval between the CD and the AD is a hyperparameter. Second, the samples that are clicked but have not converted before the CD are collected as training data, together with $e'$, the elapsed time of these samples at the CD. Third, we treat the samples with a conversion between CD and AD as positive samples (label 1), and the others as negative samples (label 0). Obviously, samples that convert after the AD are ignored and therefore labeled as negative. Nevertheless, as the time interval increases, the proportion of these samples keeps getting smaller. The subsequent experiments demonstrate that even a relatively short time interval can effectively alleviate the delayed feedback problem.
After data generation, we can train the LC model via the classical binary cross-entropy loss. Then the LC model is frozen and utilized to infer the correction probability used in the LC loss. Note that the elapsed time at the AD is used instead of the elapsed time $e'$ at the CD when inferring for the LC loss. The detailed data generation procedure is shown in Algorithm 1 (lines 2-14).
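The three steps of counterfactual labeling can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the dictionary keys, the function name `counterfactual_labels`, and the handling of boundary timestamps are assumptions.

```python
def counterfactual_labels(samples, actual_deadline, interval):
    """Build LC training rows from logged clicks.

    samples: list of dicts with 'features', 'click_ts', and 'conv_ts'
             (conv_ts is None if no conversion is observed by the actual deadline).
    """
    cd = actual_deadline - interval  # counterfactual deadline
    rows = []
    for s in samples:
        if s["click_ts"] >= cd:
            continue  # not yet clicked at the CD, so not visible there
        conv = s["conv_ts"]
        if conv is not None and conv <= cd:
            continue  # already converted at the CD: an observed positive, not an LC sample
        # positive if the conversion arrives between the CD and the AD
        label = int(conv is not None and cd < conv <= actual_deadline)
        rows.append({
            "features": s["features"],
            "elapsed": cd - s["click_ts"],  # elapsed time measured at the CD
            "label": label,
        })
    return rows


logged = [
    {"features": "a", "click_ts": 0.0, "conv_ts": 2.0},   # converted before the CD
    {"features": "b", "click_ts": 3.0, "conv_ts": 10.0},  # converts between CD and AD
    {"features": "c", "click_ts": 5.0, "conv_ts": None},  # no conversion observed
    {"features": "d", "click_ts": 9.0, "conv_ts": 12.0},  # clicked after the CD
]
lc_rows = counterfactual_labels(logged, actual_deadline=14.0, interval=7.0)
```

When the trained LC model is later applied to the real training data, the elapsed time would be measured at the actual deadline (`actual_deadline - click_ts`) rather than at the counterfactual one, as the text notes.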

Alternative Training
Although the training data required for the LC model can be constructed by counterfactual labeling, this method still has some drawbacks. First, data generation only leverages partial training data, i.e., samples that are clicked before CD and converted after CD, which may result in suboptimal performance of the LC model. Second, the LC model also suffers somewhat from delayed feedback. Some potential positive samples that have a long delay and convert after AD may be mistreated as negative samples. Next, we propose an alternative-learning-based approach to alleviate the first problem. We leave the solution to the second problem to future work.
Note that the CVR prediction model is trained on the whole data, and the conversion rate $P(y = 1 \mid x)$ is similar to the label correction rate $P(y = 1 \mid x, e, v = 0)$. We suppose that the bottom representation (i.e., the embedding layer in Fig. 1) [...]
Algorithm 1 (lines 19-22):
19: Train the CVR model on $\mathcal{D}$ using the LC loss until convergence
20: Transfer the bottom embeddings from the CVR model to the LC model
21: end for
22: return the CVR prediction model
There are some alternatives to alternative training with embedding transfer. For example, joint learning is also a common learning paradigm that enables the LC model to leverage the knowledge of the CVR model. Moreover, in addition to utilizing the learned representation of the CVR model, another natural option is to leverage its prediction. It is possible to mine the mislabeled samples in the training data for the LC model using the CVR prediction model, as these potential positive samples might have a high predicted CVR. We also conduct experiments and compare these alternatives in Section 5.4.
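The control flow of alternative training with embedding transfer can be sketched with placeholder trainers. Everything here is hypothetical scaffolding: `train_lc` and `train_cvr` stand in for real training routines, models are plain dictionaries, and the assumption that `rounds = 0` means a single LC-then-CVR pass is inferred from the text rather than taken from Algorithm 1.

```python
def alternative_training(train_lc, train_cvr, rounds=1):
    """Alternate between LC and CVR training, passing CVR embeddings to the LC model.

    train_lc(init_embedding) -> LC model (init_embedding=None means random init)
    train_cvr(lc_model)      -> CVR model dict holding an 'embedding' entry
    """
    embedding = None
    cvr_model = None
    for _ in range(rounds + 1):
        lc_model = train_lc(init_embedding=embedding)   # LC model on counterfactual data
        cvr_model = train_cvr(lc_model=lc_model)        # CVR model with the LC loss
        embedding = cvr_model["embedding"]              # transfer bottom embeddings
    return cvr_model


# Toy stand-ins that only record the order of calls.
call_log = []


def toy_train_lc(init_embedding):
    call_log.append(("lc", init_embedding))
    return {"init": init_embedding}


def toy_train_cvr(lc_model):
    call_log.append(("cvr", lc_model["init"]))
    n_cvr_runs = sum(1 for kind, _ in call_log if kind == "cvr")
    return {"embedding": "emb_after_%d_cvr_runs" % n_cvr_runs}


final_model = alternative_training(toy_train_lc, toy_train_cvr, rounds=1)
```

With one round, the LC model is trained twice: first from a random initialization, then from the embeddings learned by the first CVR model.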

EXPERIMENTS
To validate the effectiveness of our proposed method, we conduct a series of experiments to answer the following research questions:
- RQ1: How does ULC perform on the CVR prediction task compared to the state-of-the-art methods?
- RQ2: How do the label correctness and data freshness of counterfactual labeling affect the performance of ULC?
- RQ3: In addition to embedding-based alternative training, how do other common schemes such as joint learning perform?
- RQ4: How does the ULC model perform on samples with different delay times?

Dataset and Settings
5.1.1 Datasets. To our knowledge, there exists only one public dataset [2] widely used in research on the delayed feedback problem in the offline setting. Other public CVR datasets do not have noticeably delayed feedback or lack enough temporal information. Following the common setting in previous work [2,27] of using one public and one private dataset, we also introduce a private production dataset.
Criteo dataset. This public dataset contains clicks and the corresponding conversions from Criteo live traffic data. Each sample corresponds to a single click and is described by several categorical and continuous features, with the corresponding conversion information, if any. It also includes the timestamps of the click and the possible conversion behavior. We use the last 23 days of this dataset to conduct our experiments. Following previous work [2,27], three consecutive weeks of data are used as training data, the data of the 22nd day is used for validation, and the last day is used for testing. Note that this dataset tracks conversion behavior for each click sample, so the ground truth label $y_i$ is available for testing. For validation, we assume that $y_i$ is unknown and use the label-corrected loss for parameter selection. The processed dataset includes 6,363,085 click samples with a conversion rate of 0.2294.
Production dataset. This dataset is collected from a real production platform with game advertising. In-game payments are treated as conversions. Specifically, we collected and sampled one month of consecutive user feedback logs. The data format is similar to Criteo, and the last two days are used for validation and testing, respectively. The dataset includes over 2,400,000 click samples with a conversion rate of about 0.005. Statistics are shown in Table 2.
5.1.2 Evaluation Metrics. We adopt three metrics that are widely used in CVR prediction tasks [4,24]. The first metric is the area under the ROC curve (AUC), which measures the pairwise ranking performance of the CVR prediction model. The second is the area under the precision-recall curve (PRAUC) [24], which also measures the ranking performance. The third is the log loss (LL), which measures the accuracy of the absolute value of the CVR prediction.
To further analyze the benefits gained by solving the delayed feedback problem, we calculate the relative improvement (RI) with respect to the maximum gain (i.e., the improvement of the oracle model over the vanilla model) on the above three metrics. For method $f$, the relative improvement on metric $M(\cdot)$ is defined as
$$\mathrm{RI}\text{-}M(f) = \frac{M(f) - M(\mathrm{Vanilla})}{M(\mathrm{Oracle}) - M(\mathrm{Vanilla})} \times 100\%$$
Then we can obtain RI-AUC, RI-PRAUC and RI-LL.
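For readers who want to see what two of these metrics compute, here is a small pure-Python sketch of AUC (as the fraction of correctly ordered positive-negative pairs) and log loss. It is meant for toy inputs only; the quadratic pairwise loop is an illustrative choice, and real experiments would typically rely on an optimized library implementation.

```python
import math


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def log_loss(labels, scores, eps=1e-12):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return -total / len(labels)


toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
toy_auc = auc(toy_labels, toy_scores)
toy_ll = log_loss(toy_labels, toy_scores)
```

AUC depends only on the ordering of the scores, whereas log loss is sensitive to their absolute values, which is why both are reported.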
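The relative improvement is a simple ratio, sketched below. The metric values in the example are invented for illustration and are not results from the paper.

```python
def relative_improvement(method, vanilla, oracle):
    """Share (in percent) of the vanilla-to-oracle gap that a method closes."""
    return (method - vanilla) / (oracle - vanilla) * 100.0


# Hypothetical AUC values (higher is better).
ri_auc = relative_improvement(method=0.83, vanilla=0.80, oracle=0.84)

# Hypothetical log-loss values (lower is better); the signs cancel in the ratio.
ri_ll = relative_improvement(method=0.405, vanilla=0.42, oracle=0.40)
```

Because numerator and denominator share the same sign convention, the same formula applies to metrics where higher is better and to metrics where lower is better.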

Compared Methods.
The following state-of-the-art methods are our baselines for solving delayed feedback in CVR prediction:
• Vanilla: a CVR model trained with the observed conversion label. This is the lower bound of possible improvements.
• Oracle: a CVR model trained with the ground truth label instead of observed labels. This is the upper bound of possible improvements.
• ULC(ours): a CVR model alternately trained with the LC model and the LC loss.
The above methods can be applied to different CVR models. Due to the column anonymity of the Criteo dataset, we cannot use models that rely on user modeling. We consider the following classical models that focus on feature interactions as backbones:
• MLP: the classical fully connected neural network.
• DeepFM [5]: a model combining the factorization machines and deep neural networks.
• AutoInt [17]: a model using multi-head self-attention to learn the high-order feature interactions automatically.
• DCNV2 [20]: a model using deep and cross networks to learn effective explicit and implicit feature crosses.

Implementation Details.
The embedding size is 64 for all the methods. The MLP model in all the backbones is a simple three-layer model with hidden units [256,256,128] and Leaky ReLU activation.
For AutoInt, the layer number is 3, the number of heads is 2, and the attention size is 64. For DCNV2, we use the stacked structure and one cross-layer. Adam [8] is used as the optimizer, and the learning rate is tuned in the range of [1e-3, 5e-4, 1e-4] with L2 regularization tuned in [0, 1e-7, 1e-6, 1e-5, 1e-4]. The batch size is set to 1024 for all the methods except nnDF. Given that the nnDF approach cannot be applied to batch-wise training, we set its batch size to the size of the whole training set. For a fair comparison, we consistently use the MLP as the auxiliary model for all methods that rely on auxiliary models. The additional hyperparameters of the baselines are fine-tuned. Early stopping is applied to obtain the best parameters. We repeat each experiment 5 times with different random seeds, report the average results, and perform statistical tests.

Overall Performance: RQ1
From Table 3, we can observe that our proposed method ULC outperforms all the baselines and achieves state-of-the-art performance on all the backbones. There are some further observations. First, the oracle model works significantly better than the vanilla model, which validates that the delayed feedback problem indeed hurts the performance of the CVR model. Second, FSIW performs significantly better than the DFM and Vanilla models, which is consistent with previous studies [27]. However, nnDF is significantly weaker than the Vanilla method. This is because the nnDF loss requires global dependence computation and can only be optimized using the full training data in each update, which leads to weaker performance than batch-wise optimization methods. Third, compared to the best baseline, our method shows a significant improvement of 0.76% in the AUC metric, 1.02% in the PRAUC metric, and 1.85% in the LL metric on average across the four backbones, which demonstrates the effectiveness of our proposed method.
We further analyze the benefits gained by solving the delayed feedback problem.As shown in Table 3, our method narrows the in the PRAUC metric, and 83.13% in the LL metric on average across the four backbones.Compared to the best baseline, our method shows a significant improvement of 60.3% in the RI-AUC metric, 81.43% in the RI-PRAUC metric, and 27.76% in the RI-LL metric on average across the four backbones.This shows that our method can effectively alleviate the delayed feedback problem.Fig. 2 shows the offline performance on the production dataset.For limited space, we only present the results with MLP as the backbone.It is clearly observed that our proposed method alleviates the delayed feedback problem and outperforms the two best baselines.
To guarantee the reproducibility of our work, and also due to the page limitation, we conduct the following detailed analyses on the publicly available dataset.

Analysis on Counterfactual Labeling: RQ2
In counterfactual labeling, only the samples converted between CD and AD are treated as positive samples (i.e., labeled 1), which leads to some samples converted after AD being mislabeled as negative samples. A long time interval between CD and AD can improve the label correctness of counterfactual labeling but reduces data freshness, as only data clicked before CD are utilized. We further analyze the effect of different time intervals.
Experimental results using different time intervals are shown in Fig. 3. First, increasing the time interval can effectively increase the recall of counterfactual labeling on positive samples. Second, the best time interval on the Criteo dataset is around a week. Besides, a smaller or larger interval reduces the performance of the CVR model. A smaller interval leads to more mislabeled samples in the training data of the LC model, which in turn leads to lower performance of the CVR model. A larger interval, while reducing the mislabeled samples, makes the training data of the LC model older, which limits its ability to correct the false negative samples in the CVR training data, since these false negative samples are relatively fresh.

Effectiveness of Alternative Training: RQ3
In addition to embedding-based alternative training, there are also some other design schemes that enable the LC model to exploit the knowledge of the CVR prediction model. We conduct experiments on these schemes.

Joint Learning Strategy.
An obvious solution is to jointly train the LC model and the CVR model, which also enables the LC model to utilize the information learned by the CVR model. We use a simple shared-bottom structure [12] to validate the effectiveness of this scheme. The joint loss is a linear weighting of the LC model loss and the CVR model loss.

Prediction-based Alternative Training Strategy.
In alternative training, in addition to using the learned representation of the CVR model, another natural option is to leverage its prediction. Note that in Section 4.3, we mention that some potential positive samples that have a long delay and convert after AD may be mislabeled as negative samples during counterfactual labeling. To alleviate this problem, we consider using the prediction of the CVR model to mine these potentially positive samples. Intuitively, samples with high predicted CVR are more likely to be potentially positive samples. Thus, we design three simple strategies to process the training data of the LC model: (i) hard strategy, i.e., negative samples (label 0) with predicted CVR above a predefined threshold are treated as positive samples (label 1); (ii) soft strategy, i.e., using the predicted CVR as the label for each negative sample; (iii) drop strategy, i.e., dropping negative samples with predicted CVR above a predefined threshold from the training data of the LC model, as the labels of these samples are not reliable.
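The three prediction-based strategies can be written as one small relabeling function. This is an illustrative sketch: the `(label, predicted_cvr)` row format, the function name, and the default threshold of 0.5 are assumptions rather than details from the paper.

```python
def relabel(rows, strategy, threshold=0.5):
    """Apply the hard, soft, or drop strategy to LC training rows.

    rows: list of (label, predicted_cvr) pairs, where label is the counterfactual label.
    Returns a list of (new_label, predicted_cvr) pairs.
    """
    out = []
    for label, cvr in rows:
        if label == 1:
            out.append((1.0, cvr))  # positives are never modified
        elif strategy == "hard":
            out.append((1.0 if cvr > threshold else 0.0, cvr))
        elif strategy == "soft":
            out.append((cvr, cvr))  # predicted CVR becomes the soft label
        elif strategy == "drop":
            if cvr <= threshold:
                out.append((0.0, cvr))  # unreliable negatives are removed
        else:
            raise ValueError("unknown strategy: %s" % strategy)
    return out


toy_rows = [(1, 0.9), (0, 0.8), (0, 0.1)]
hard_rows = relabel(toy_rows, "hard")
soft_rows = relabel(toy_rows, "soft")
drop_rows = relabel(toy_rows, "drop")
```

All three variants leave confidently negative rows essentially unchanged and differ only in how they treat negatives with a high predicted CVR.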

Comparisons on Different Strategies.
The results of the above strategies on Criteo with MLP as the backbone are shown in Fig. 4. Results on AUC are similar to those on PRAUC and hence omitted. We have the following observations: (i) using a simple joint learning scheme cannot improve performance. Instead, there is a large loss of performance. The reason is that the inaccurate LC model at the early training stage misleads the CVR model, which in turn affects the subsequent training. (ii) the three prediction-based strategies cannot improve the performance of the CVR model and even cause a slight degradation. The potential positive samples after AD have a delay longer than the time interval between CD and AD, and the number of these samples is very small compared to the number of true negative samples (about a 1:50 ratio). Using only the predicted CVR cannot effectively discover these samples; instead, it introduces noise.
A hyperparameter of ULC controls the number of rounds of alternative training, where zero rounds means no alternative training. We conduct experiments using different numbers of rounds on the Criteo dataset with MLP as the backbone. As shown in Fig. 5, a single round of alternative training can significantly improve the performance of the CVR prediction model, which validates the effectiveness of alternative training. Besides, one round is enough, and more rounds have little impact. This is reasonable since the first round, which changes the initialization of the LC model from random to the embeddings of the CVR model, brings more significant changes than the subsequent rounds. Note that even without alternative learning, the performance of ULC is still significantly better than the best baseline, which reflects the effectiveness of using

Analysis on Different Delay Time: RQ4
The delay time is an important property of delayed feedback. We further analyze the performance of the CVR model and the LC model on samples with different delay times.

CVR performance on different delay time.
For a fair comparison, we divide the positive samples in the test set into five groups in ascending order of delay time. Each group has the same number of positive samples. Then, each group is combined with all the negative samples in the test set to form test sets with different delay times. In this way, the numbers of positive and negative samples are the same across the test sets. Further, since the log loss is sensitive to the conversion rate, we replicate each positive sample five times so that the conversion rate in each test set is consistent with that of the original test set. Experimental results on the Criteo dataset with MLP as the backbone are shown in Fig. 6. We have the following observations: (i) for the Oracle model, which does not suffer from the delayed feedback problem, performance decreases somewhat as the sample delay time increases, which indicates that samples with a long delay time are more likely to be hard samples. (ii) As the sample delay time increases, the Oracle model performs increasingly better than Vanilla, because positive samples are more likely to be false negative samples as the delay time increases. (iii) Our method significantly outperforms the Vanilla model and the best baseline on samples with long delays (e.g., G3, G4, and G5), and our improvement grows as the delay time increases, which reflects the effectiveness of our method.
An interesting phenomenon is that the Vanilla model performs better than the Oracle model on samples with short delays (G1). This may be because samples with short delays account for a higher percentage of the observed positive samples than of the actual positive samples. Further analysis can be found in the Appendix.
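The construction of the delay-grouped test sets can be sketched as follows. The function name `build_delay_test_sets` and the tuple-based sample layout are assumptions for illustration, and the sketch requires the number of positives to be divisible by the number of groups, a simplification the paper does not discuss.

```python
from typing import List, Tuple


def build_delay_test_sets(
    positives: List[Tuple[str, float]],  # (sample id, delay time)
    negatives: List[str],                # sample ids of negative samples
    num_groups: int = 5,
) -> List[List[Tuple[str, int]]]:
    """Build one test set per delay group; each entry is (sample id, label)."""
    if len(positives) % num_groups != 0:
        raise ValueError("positives must split evenly into groups in this sketch")

    # Sort positives by ascending delay so group 0 holds the shortest delays.
    ordered = sorted(positives, key=lambda sample: sample[1])
    size = len(ordered) // num_groups

    test_sets = []
    for g in range(num_groups):
        group = ordered[g * size:(g + 1) * size]
        # Replicate each positive num_groups times to restore the original
        # positive count, and hence the original conversion rate.
        pos_part = [(sid, 1) for sid, _delay in group for _copy in range(num_groups)]
        neg_part = [(sid, 0) for sid in negatives]
        test_sets.append(pos_part + neg_part)
    return test_sets
```

With 10 positives and 40 negatives, for example, each of the five test sets contains 2 distinct positives replicated to 10 entries plus all 40 negatives, so the conversion rate stays at 10/50.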

LC performance on different delay time.
We further analyze the performance of the LC model on samples with different delay times. Similarly, we divide the false negative samples in the training data into five groups in ascending order of delay time. Each group has the same number of false negative samples. Then, as the goal of the LC model is to distinguish between false negative samples and true negative samples, each group is combined with all the true negative samples in the training data to form evaluation data with different delay times. Experimental results on the Criteo dataset are shown in Table 4. We have the following observations: (i) AUC ranges from 0.7811 to 0.8698 across delay times, which shows that the LC model can effectively recognize false negative samples among all negative samples. (ii) The performance of the LC model decreases as the delay time of the false negative samples increases. There are two possible reasons for this. First, samples with longer delays are more likely to be hard samples. Second, in counterfactual labeling, false negative samples with longer delays are more likely to convert after AD and be labeled as true negative samples, which damages the performance of the LC model.
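The per-group evaluation of the LC model can be sketched as below. The helper names `auc` and `auc_by_delay_group` and the input layout are assumptions for illustration; the quadratic pairwise AUC is chosen for clarity, not efficiency, and is not claimed to be the evaluation code used in the paper.

```python
from typing import List, Tuple


def auc(pos_scores: List[float], neg_scores: List[float]) -> float:
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_by_delay_group(
    false_negatives: List[Tuple[float, float]],  # (delay time, LC model score)
    true_negative_scores: List[float],           # LC model scores of true negatives
    num_groups: int = 5,
) -> List[float]:
    """AUC of each delay group of false negatives against all true negatives."""
    ordered = sorted(false_negatives, key=lambda sample: sample[0])
    size = len(ordered) // num_groups
    if size == 0:
        raise ValueError("not enough false negative samples for the requested groups")
    results = []
    for g in range(num_groups):
        group_scores = [score for _delay, score in ordered[g * size:(g + 1) * size]]
        results.append(auc(group_scores, true_negative_scores))
    return results
```

Here the false negative samples play the role of the positive class, since the LC model is asked to separate them from the true negative samples.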

CONCLUSIONS AND FUTURE WORK
In this paper, we propose a framework, ULC, to address the delayed feedback problem in the offline setting via unbiased label correction. The key idea is that delayed feedback leads to, and only leads to, incorrect labels. If the incorrect labels can be effectively corrected, the delayed feedback problem can be well addressed. ULC uses an additional LC model to guide the CVR prediction model for unbiased label correction and enhances the performance through alternative training. We prove theoretically that the label-corrected loss in our method is an unbiased estimate of the oracle loss. Comparative experiments on both public and private datasets and detailed analyses show that ULC effectively alleviates the delayed feedback problem and consistently outperforms the previous state-of-the-art methods.
For future work, we are interested in the following directions. First, using multiple and dynamic counterfactual deadlines is likely to exploit the training data more effectively. Second, given that samples with a long delay time are more likely to be hard samples, we would like to design approaches that enhance model performance on long-delay samples. Third, we are interested in combining our method with approaches that address selection bias in CVR prediction.

Figure 1: Illustration of our proposed framework ULC.

Figure 2: Performance comparisons of the proposed method with the top two baselines on the private dataset. The backbone model is MLP. The red dotted line in the right figure denotes Oracle.

Figure 3: Effect of different time intervals between the counterfactual deadline (CD) and the actual deadline (AD) on the Criteo dataset with MLP as the backbone. The blue line represents the performance of ULC, and the red line represents the recall of counterfactual labeling on positive samples. A larger recall means fewer mislabeled samples in the training data of the LC model.

• We give a theoretical analysis of unbiased label learning and propose an alternative learning framework for CVR prediction to meet the delayed feedback challenge.
• Comparative experimental results on both public and private datasets demonstrate the effectiveness of our proposed framework.

Table 1: Notations and Explanations of Variables
the true conversion label of the i-th sample
the click timestamp of the i-th sample
the conversion timestamp of the i-th sample

…) learned by the CVR model may facilitate the learning of the LC model and alleviate the first problem mentioned above. Therefore, we adopt an alternative learning paradigm. After training the CVR prediction model, the bottom representation of the LC model is initialized with the bottom representation of the CVR prediction model, and then the LC model is retrained. The retrained LC model can in turn be used to retrain the CVR model. The alternative training procedure is shown in Algorithm 1.

Algorithm 1 Alternative training with data generation
Input: training data D, where each sample consists of a feature vector, an observed conversion label, the elapsed time since the click timestamp, the click timestamp, and the conversion timestamp; the timestamp at which the data D are collected; a hyperparameter denoting the time interval between CD and AD; the number of rounds of alternative training.
Output: CVR prediction model
Initialize the CVR prediction model and the LC model
// start to generate the data for LC model training
LC training data = ∅
for each sample in D do
    if click timestamp < collection timestamp − interval and (observed label == 0 or conversion timestamp > collection timestamp − interval) then
        if conversion timestamp > collection timestamp − interval then
            Label the sample as 1
        ...
        Insert the sample (feature vector, elapsed time − interval, label) into the LC training data
    ...
end for
// finish data generation, start alternative training
for round = 1 to (number of rounds) + 1 do
    ...
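The data generation step and the alternation described above can be sketched in Python. This is a reading of the surviving parts of Algorithm 1 under stated assumptions, not the authors' implementation: the names `Sample`, `generate_lc_data`, and `alternate_training` are hypothetical, samples that fail the inner condition are assumed to receive label 0, and the body of the training loop is assumed to retrain the LC model (initialized from the current CVR model when one exists) and then retrain the CVR model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Sample:
    features: Tuple[float, ...]
    observed_label: int              # 1 if a conversion was observed by collection time
    click_ts: float
    conversion_ts: Optional[float]   # None when no conversion was observed


def generate_lc_data(
    data: List[Sample], collect_ts: float, interval: float
) -> List[Tuple[Tuple[float, ...], float, int]]:
    """Counterfactual labeling: view the data as if collected at the CD."""
    cd = collect_ts - interval  # counterfactual deadline
    lc_data = []
    for s in data:
        if s.click_ts >= cd:
            continue  # clicked after the CD, so not observable at the CD
        converted_after_cd = (
            s.observed_label == 1
            and s.conversion_ts is not None
            and s.conversion_ts > cd
        )
        # Keep only samples that look negative at the CD.
        if s.observed_label == 0 or converted_after_cd:
            elapsed_at_cd = cd - s.click_ts  # elapsed time minus the interval
            # Assumed labeling: 1 for false negatives at the CD, 0 otherwise.
            lc_data.append((s.features, elapsed_at_cd, 1 if converted_after_cd else 0))
    return lc_data


def alternate_training(
    train_lc: Callable[[Optional[Any]], Any],  # takes a CVR model (or None) to initialize from
    train_cvr: Callable[[Any], Any],           # takes the LC model used for label correction
    rounds: int,
) -> Any:
    """Run rounds + 1 passes; rounds = 0 means no alternative training."""
    cvr_model = None
    for _ in range(rounds + 1):
        lc_model = train_lc(cvr_model)   # first pass: no CVR model to initialize from
        cvr_model = train_cvr(lc_model)
    return cvr_model
```

Passing the training routines as callables keeps the sketch independent of any particular model architecture or framework.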

Table 2: Statistics of the datasets.

Table 3: Performance comparisons of the proposed model with baseline models on the public Criteo dataset. The best results are in boldface, and the best baselines are underlined. The superscript ** indicates p ≤ 0.01 for the t-test of ULC vs. the best baseline. ↑ means the higher the better, and ↓ means the lower the better.
Figure 6: Performance w.r.t. test samples with different delay times on the Criteo dataset with MLP as the backbone. A larger value on the horizontal axis means a longer delay time. Results on AUC are similar to PRAUC and hence omitted.

Table 4: Label correction performance of ULC w.r.t. samples with different delay times. G5 is the group with the longest delay, and G1 has the shortest delay. ↑ means the higher the better, and ↓ means the lower the better.