Counterfactual Data Augmentation for Debiased Coupon Recommendations Based on Potential Knowledge

In real-world coupon recommendations, the coupon allocation process is influenced both by the recommendation model trained on historical interaction data and by marketing tactics aimed at specific commercial goals. These tactics can cause an imbalance in user-coupon interactions, leading to a deviation from users' natural preferences. We refer to this deviation as the matching bias. Theoretically, unbiased data, which is assumed to be collected via a randomized allocation policy (i.e., without model or tactics intervention), is the ideal training data because it reflects users' natural preferences. However, obtaining unbiased data in real-world scenarios is costly and sometimes infeasible. To address this problem, we propose a novel model-agnostic training paradigm named Counterfactual Data Augmentation for debiased coupon recommendations based on Potential Knowledge (CDAPK) for the marketing scenario that allocates coupons with discounts. We leverage the counterfactual data augmentation technique to answer the following key question: if a user is offered a coupon that they have never seen before, will they use it? By creating counterfactual interaction data and assigning labels based on the potential knowledge of the given scenario, CDAPK shifts the original data distribution toward an unbiased distribution, facilitating model optimization and debiasing. The advantage of CDAPK lies in its ability to approximate the ideal state of the training data without depleting real-world traffic flow. We implement CDAPK on five representative models: FM, DNN, NCF, MaskNet, and DeepFM, and conduct extensive offline and online experiments against SOTA debiasing methods to validate the superiority of CDAPK.


INTRODUCTION
The utilization of recommendation technology has become increasingly prevalent in online marketing campaigns due to the advancement of e-commerce [13,26,58]. For instance, the Alipay platform periodically launches marketing campaigns that allocate online coupons to motivate customer retention and consumption. These campaigns employ User-Item Click-Through-Rate (CTR) models to estimate the conversion probability, i.e., the likelihood of a user using a coupon. However, in real-world coupon recommendations, the coupon allocation process is not solely determined by the recommendation model trained on the historical interaction data; it is also interfered with by marketing tactics [29] aimed at achieving specific commercial goals, such as maximizing daily active users (DAU) or return on investment (ROI).
That is, marketing campaign organizers often employ specific tactics to interfere with the coupon allocation process. These tactics are designed to optimize the allocation of coupons to specific user segments or to motivate particular users to engage with the platform in specific ways. The interference can create an imbalance in the user-coupon interaction data, causing the observational data to deviate from users' natural preferences. We refer to this deviation as the matching bias [13]. As illustrated in Fig. 1, ideally the user-coupon matching is free from any interference other than model scoring, so the user is matched with Coupon 1, which has the highest score; when marketing tactics are employed, however, their interference¹ causes the user to be matched with Coupon 2 instead of Coupon 1.
From the above, it is clear that the biased observational training data can only give a skewed snapshot of user preferences, trapping the coupon recommendation model in sub-optimal results. Theoretically, unbiased data, i.e., data assumed to be collected via a randomized allocation policy, is the ideal training data for recommendation models [4,7,42,43,59] because it reflects user preferences in an unbiased manner. However, acquiring a substantial amount of unbiased data in real-world industrial scenarios is costly and sometimes infeasible, necessitating debiasing techniques to prevent a negative impact on user experience or business revenue. Researchers have proposed a series of debiasing methods [5-7, 33, 39, 54, 59] to mitigate the effects of bias without requiring large amounts of random data.
In this paper, we propose a novel model-agnostic debiasing paradigm named Counterfactual Data Augmentation for Debiased Coupon Recommendations Based on Potential Knowledge (Cdapk) for the marketing scenario of allocating coupons with discounts. We leverage the counterfactual data augmentation technique [38,56,64] to answer the following key question: if a user is offered a coupon that they have never seen before, will they use it? By matching each user with all possible discount coupons, counterfactual data augmentation creates interaction data that do not exist in reality, i.e., counterfactual data. We then utilize the potential knowledge of the given scenario to assign labels to these samples and involve them in model training. Conceptually, Cdapk shifts the original training data distribution toward an unbiased distribution by incorporating counterfactual data, which facilitates the optimization and debiasing of the model. The primary advantage of Cdapk lies in the fact that the model is trained on a large volume of unbiased data to approximate an ideal training state without depleting the traffic flow of real-world scenarios.
In this study, we focus on the Alipay marketing coupon recommendation scenario, where coupons carry different discounts. We can intuitively deduce an explicit prior: for a given user and the same type of coupon, the probability of conversion increases as the coupon discount increases. Additionally, through online experimental exploration, we observed a posterior for the studied scenario, namely that random coupon allocation results in a lower conversion rate than model decisions², i.e., there is a difference in positive-sample proportions between the original training data and the randomly issued data. Our approach leverages these properties, the monotonicity prior and the differential posterior, to help assign labels to the counterfactuals. The details are explained in Section 3. (¹ The predicted model scores of the coupons are multiplied by the interaction likelihoods, i.e., tactics-specific weights, to form the final ranking score.)
Moreover, we utilize the causal graph language proposed by Pearl [36] to elucidate the impact of the matching bias and explain how to eliminate it. As illustrated in Fig. 2(a), by examining the causal relations, we discover that the marketing tactics intervention can impact both the conversion and the user-coupon matching, leading to a backdoor path in the causal graph. This results in the so-called confounding bias in the causal inference literature [17,50] and poses a significant threat to the estimation of conversion probabilities. Fig. 2(b) demonstrates that Cdapk effectively alleviates the matching bias by cutting off the backdoor path. Specifically, as illustrated in the figure, the joint unbiased data distribution, consisting of the augmented counterfactual data represented by node $M^{*}$ and the observational training data represented by node $M$, is not affected by confounding effects, i.e., node $C$ does not point to (affect) the combination of nodes $M^{*}$ and $M$. Consequently, a model trained on such data can accurately capture the causal relationship between matching and conversion without being affected by the matching bias.
To summarize, our main contributions are listed as follows:
• First, we introduce and exemplify the notion of potential knowledge in a real-world marketing coupon recommendation scenario and present a method to leverage it to help facilitate optimization and debiasing.

RELATED WORK
In this study, we investigate how to alleviate the matching bias in marketing coupon recommendations through counterfactual data augmentation that leverages the potential knowledge of the given scenario, which pertains to the areas of bias/debias, data augmentation, and potential knowledge.

Bias and debias in recommendations
Bias is a pervasive problem in recommender systems, attributable to various factors, including user behavior, uneven item presentation, and feedback loops. [5] presents a comprehensive survey of the different types of biases in recommender systems, among which we highlight the following three: selection bias arises when users have the unrestricted choice to interact with items, so the observed interaction data do not accurately represent the user population [32]; popularity bias occurs when popular items are recommended more frequently than their popularity would warrant [2]; and exposure bias arises when users are exposed to only a subset of relevant items, resulting in skewed observed interactions [30]. Common debiasing methods include the following: inverse probability weighting (IPW), also known as inverse propensity scoring (IPS), mitigates bias by shifting the distribution toward a more uniform state [31,41,43,44]; embedding disentanglement separates the interest representation from the bias representation to model a more robust causal relationship [51,62,63]; doubly robust methods integrate imputed errors and propensities to obtain unbiased performance estimation [3,28,52]; and unbiased data can be leveraged to train a new task that yields unbiased representations, to which the original representations are then constrained, making the model more robust [4,27,53].

Augmentation methods
While augmentation methods such as rotating images and replacing text with synonyms are effective for image and text data, tabular data is more commonly used in information retrieval. Unfortunately, existing data augmentation methods employed in CV and NLP are not readily applicable to tabular data. Fortunately, Vime [60], Sdat [14], Fda [7], and Camus [59] have emerged to fill this gap, offering augmentation techniques suitable for recommendation scenarios. However, these previous augmentation methods do not fit our scenario: they augment samples by adding noise and then constrain the labels of the sample before and after augmentation to ensure consistency. Adding noise to samples can compromise the original semantics of the data, which is particularly critical in marketing scenarios, where preserving the original meaning is of utmost importance. For instance, noise-based augmentation methods may produce discounts that did not previously exist, or even negative discounts, which is undesirable. As such, Cdapk matches each user's representation with the representations of all possible discount coupons to augment counterfactual samples.

Potential knowledge
Potential knowledge in machine learning refers to knowledge resources that are currently untapped or underutilized, such as domain-specific knowledge and human expertise that are not explicitly encoded in the data. By effectively managing and converting potential knowledge into explicit knowledge, it can be leveraged to enhance the performance of models. Potential knowledge can manifest either as explicit knowledge [8,10] that is codified and easily accessible, such as rules, algorithms, or data, or as tacit knowledge [15,21] that is difficult to express or verbalize, such as intuition or professional know-how. For instance, an expert in a particular domain may possess tacit knowledge that is not readily expressible but can provide valuable insights to guide machine learning models. In this paper, we leverage the potential knowledge introduced in Subsection 3.3.2 to improve the augmentation process and facilitate debiased recommendations. Specifically, we utilize both prior knowledge [12,45,47,49], which is derived from reason or deduction, and posterior knowledge [23-25, 37], which is gained through experience or observation, to enhance the performance of our recommendation model.

Causal View of the Matching Bias
Causal directed acyclic graphs (DAGs) are influential tools for explicitly formulating the causal relationships among variables; see [36]. A causal graph $G = \{N, E\}$ is composed of nodes and edges such that the graph is acyclic, in which the node set $N$ denotes the variables and the edge set $E$ represents the cause-and-effect relationships between those variables.
Here, we introduce the nodes and edges in Fig. 2, which we use to analyze the bias and debias issues in coupon recommendation scenarios. Capital letters (e.g., $M$), lowercase letters (e.g., $m$), and letters in calligraphic font (e.g., $\mathcal{M}$) stand for random variables, particular values, and sample spaces, respectively. Specifically:
• Nodes $M$ and $M^{*}$ represent the matching result, corresponding to the observational training data and the augmented counterfactual data, respectively.
• Node $C$ represents the confounders, which are typically hidden variables, as they are not explicitly modeled by non-debiased recommendation methods [22,34].
• Node $Y$ represents the interaction label, which indicates the likelihood of a user using a coupon.
• Edge $M, M^{*} \rightarrow Y$ denotes that the conversion probability is determined by the user-coupon matching results.
• Edge $C \rightarrow M$ denotes that the matching results are intervened on by confounders; in this case, the intervention can alter the matching results derived from model scoring.
• Edge $C \rightarrow Y$ denotes that the confounders directly affect the conversion probability; in this case, the intervention can change a user's intention to convert.
Fig. 2(a) illustrates how the confounding factors open a backdoor path between matching and conversion, which produces biased prediction results, i.e., the effect of the matching bias. Formally, due to the existence of the confounder $C$, the conditional probability can be deduced as follows:
$$P(Y \mid M = m) = \sum_{c \in \mathcal{C}} P(Y \mid M = m, C = c)\, P(C = c \mid M = m).$$
The above deduction follows the law of total probability and Bayes' rule. It gives theoretical proof that the confounder $C$ affects not only $Y$ but also the matching result $M$ (since in general $P(C = c \mid M = m) \ne P(C = c)$), i.e., there exists a spurious correlation caused by the matching bias.
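To make the backdoor effect concrete, here is a minimal simulation sketch; the two-coupon setup and all probabilities are illustrative assumptions, not numbers from the paper. When a tactic both pushes Coupon 2 and lifts conversion, the observational estimate of $P(Y=1 \mid M=2)$ is inflated relative to the randomized case.

```python
import random

random.seed(0)

def sample(n=100_000, tactics=True):
    # Simulates C -> M and C -> Y as in Fig. 2; probabilities are illustrative.
    data = []
    for _ in range(n):
        c = random.random() < 0.5                     # confounder: tactic targets this user
        if tactics and c:
            m = 2                                     # tactic forces Coupon 2
        else:
            m = 1 if random.random() < 0.7 else 2     # model-style matching
        base = 0.6 if m == 1 else 0.4                 # effect of the match itself
        y = random.random() < base + (0.2 if c else 0.0)  # C also lifts conversion
        data.append((m, y))
    return data

def p_y_given_m(data, m):
    rows = [y for mi, y in data if mi == m]
    return sum(rows) / len(rows)

biased = sample(tactics=True)     # observational data: backdoor path open
unbiased = sample(tactics=False)  # randomized-style data: backdoor path blocked

# Under tactics, Coupon 2 looks better than it is, because tactic-targeted
# users both receive it more often and convert more.
print(p_y_given_m(biased, 2), p_y_given_m(unbiased, 2))
```

The gap between the two printed estimates is exactly the spurious correlation carried by the backdoor path.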
Fig. 2(b) shows how Cdapk alleviates the effect of the matching bias by incorporating augmented counterfactual samples into the model training to help block the backdoor path, thus removing the spurious association. This approach transforms the latent knowledge of the scenario into explicit knowledge, allowing the model to make more accurate predictions. Theoretically, once the backdoor path is blocked, the conditional probability under Cdapk reduces to $P(Y \mid M = m) = P(Y \mid m, m^{*})$, which no longer depends on the confounder $C$.

Problem Formulation
Suppose we have a coupon recommender system with a collected interaction training dataset $\mathcal{D}_T = \{(u_i, c_j, y_{ij}) \mid 1 \le i \le m, 1 \le j \le n\}$, where $u_i \in \mathbb{R}^{d}$ denotes the $d$-dimensional user feature embedding (e.g., age and income), $c_j \in \mathbb{R}^{h}$ denotes the $h$-dimensional coupon feature embedding (e.g., ID and discount), and $y_{ij} \in \{0, 1\}$ is the label associated with the $(u_i, c_j)$ pair, indicating whether user $u_i$ has interacted with $c_j$ (i.e., the coupon is used) or not. We note that $d(c_j)$ represents the discount value of the coupon and $P_T(u, c)$ denotes the training data distribution. In the Alipay marketing coupon recommendation scenario [13], only one coupon is dispatched to the user per request (e.g., per user click action), i.e., the length of the recommendation list is 1. Therefore, the problem can be framed as a binary classification task, where the goal is to predict the probability of a user using a particular coupon $c_j$.
Formally, let $\delta(\cdot, \cdot)$ denote an arbitrary loss function between the prediction and the ground-truth label. The goal of coupon recommendation is to learn a parametric function $f_{\theta}$ that minimizes the following loss:
$$\mathcal{L}_{ideal}(\theta) = \mathbb{E}_{(u, c) \sim P_I}\left[\delta\left(f_{\theta}(u, c), y\right)\right],$$
where $f_{\theta}$ can be implemented by any recommendation model with learnable parameters $\theta$, and $P_I(u, c)$ denotes the unbiased distribution. However, since the unbiased (ideal) distribution is not accessible in real-world industrial scenarios, the learning procedure is conducted on the collected training dataset $\mathcal{D}_T$ by optimizing the following empirical loss function:
$$\hat{\mathcal{L}}(\theta) = \frac{1}{|\mathcal{D}_T|} \sum_{(u, c, y) \in \mathcal{D}_T} \delta\left(f_{\theta}(u, c), y\right).$$
According to PAC learning theory [19], if the collected training dataset $\mathcal{D}_T$ is unbiased and sufficiently large, the learned model is expected to be approximately optimal. However, in real-world marketing scenarios, marketing tactics intervene in the matching procedure, making the observational training data distribution $P_T$ inconsistent with the unbiased distribution $P_I$, i.e., the effect of the matching bias. As illustrated in Fig. 3, $P_T$ covers only a portion of the regions of $P_I$. Consequently, the training data give only a skewed snapshot of user preferences, trapping the recommendation model in sub-optimal results. To solve this problem, we need to supplement the counterfactuals to transform the data distribution into an unbiased distribution, as described in the next section.
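The empirical objective above can be sketched concretely; the linear-sigmoid scorer, the toy dataset, and the names `f_theta` and `delta` are illustrative stand-ins for the paper's $f_\theta$ and $\delta$, not its actual architecture.

```python
import math

def delta(p, y):
    # Binary cross-entropy between a predicted probability and a 0/1 label.
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def f_theta(theta, u, c):
    # Any scoring model works here; this toy uses a linear layer + sigmoid
    # over the concatenated user and coupon features.
    z = sum(t * x for t, x in zip(theta, u + c))
    return 1.0 / (1.0 + math.exp(-z))

def empirical_loss(theta, D_T):
    # Mean loss over the collected (biased) training set D_T.
    return sum(delta(f_theta(theta, u, c), y) for (u, c, y) in D_T) / len(D_T)

# Toy D_T: (user features, coupon features, used-or-not label).
D_T = [([1.0, 0.0], [0.5], 1), ([0.0, 1.0], [0.2], 0)]
theta = [0.1, -0.2, 0.3]
loss = empirical_loss(theta, D_T)
```

Minimizing this quantity over $\theta$ is exactly the empirical-risk procedure that, on biased $\mathcal{D}_T$, converges to the sub-optimal solution the section describes.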

Augmentation Method.
To avoid consuming business traffic in real-world industrial scenarios, Cdapk augments counterfactual data and incorporates it into the training data to simulate an unbiased distribution. Specifically, the counterfactual data augmentation amounts to matching each user with all possible discount coupons and eliminating the pre-existing user-coupon combinations. This augmentation method allows us to leverage potential knowledge related to user or coupon features to help assign labels.
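A minimal sketch of this cross-join-minus-observed construction; the toy users, coupons, and observed pairs are invented for illustration, and the subsampling with a balance parameter of 0.2 follows the description in the next paragraph.

```python
import itertools
import random

random.seed(0)

# Toy universe: every user crossed with every candidate discount coupon.
users = ["u1", "u2", "u3"]
coupons = ["c_5off", "c_10off", "c_20off"]
observed = {("u1", "c_5off"), ("u2", "c_10off")}  # pairs already present in D_T

# Counterfactual candidates: all pairs minus the pre-existing combinations.
full_counterfactual = [p for p in itertools.product(users, coupons)
                       if p not in observed]

# Subsample a fraction alpha of the (potentially huge) counterfactual space.
alpha = 0.2
k = max(1, int(alpha * len(full_counterfactual)))
D_A = random.sample(full_counterfactual, k)

print(len(full_counterfactual), len(D_A))
```

In production the cross join would be done over embeddings rather than IDs, but the set arithmetic is the same.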
Given the training set $\mathcal{D}_T$, we construct the augmented set $\mathcal{D}_A = \{(u_i, \tilde{c}_j, \tilde{y}_{ij})\}$, where $\tilde{c}_j$ denotes the feature embedding of the augmented coupon and $\tilde{y}_{ij}$ denotes the label assigned to the augmented sample. The size of the counterfactual sample space $|\mathcal{D}_A|$ can be formulated as follows:
$$|\mathcal{D}_A| = \alpha \left(|\mathcal{D}^{*}| - |\mathcal{D}_T|\right), \quad |\mathcal{D}^{*}| = m \times n,$$
where $|\mathcal{D}_T|$ denotes the size of the original training set, $m$ and $n$ denote the sizes of the user and coupon sample spaces, respectively, and $|\mathcal{D}^{*}|$ denotes the size of the unbiasedly distributed (i.e., $P_I$) sample space. The complete counterfactual sample space can be very large, placing high demands on computational performance. We therefore set the balance parameter $\alpha$ to 0.2 to trade off computational efficiency against performance improvement. In Subsection 4.4.2, we conduct a parameter sensitivity analysis to evaluate the impact of $\alpha$. At this point, we have obtained the feature part of the augmented (counterfactual) samples. However, to incorporate them in model training, we need to assign pseudo-labels to them. For this, we resort to the potential knowledge of the scenario.

3.3.2 The Potential Knowledge. In this paper, we focus on a specific Alipay marketing coupon recommendation scenario where coupons carry different discounts. We identify two pieces of potential knowledge that can be leveraged: the monotonicity prior and the differential posterior. It is important to note that these two potential knowledge resources may not hold in other scenarios.
Monotonicity prior. The monotonicity prior (MP for short) can be defined as follows: for a given user and the same type of coupon, the probability of conversion increases as the coupon discount increases. Intuitively, a $10 rebate coupon is more appealing than a $2 rebate coupon with the same conversion requirements.
Differential posterior. The differential posterior (DP for short) can be defined as follows: random issuance results in a lower conversion rate than model decisions. This observation is derived from real-world business scenarios.
Accordingly, the DP can be formally characterized as
$$\mathrm{CVR}(\mathcal{D}_A) = (1 - \beta) \cdot \mathrm{CVR}(\mathcal{D}_T),$$
where $\beta$ measures the difference in the conversion rate between the original samples and the augmented samples. In this paper, we set $\beta$ to 0.9. In Subsection 4.4.2, we conduct a parameter sensitivity analysis to evaluate the impact of $\beta$.

Assigning Pseudo-Labels Procedure. Given an original sample $s_{ij} = (u_i, c_j, y_{ij})$, where $d(c_j)$ is the discount amount and $y_{ij} \in \{0, 1\}$ is the conversion status, we perform the following pseudo-label assignment procedure.
The MP label. Formally, the MP label of an augmented sample $\tilde{s}_{ij} = (u_i, \tilde{c}_j)$ is characterized as follows:
$$\tilde{y}^{\mathrm{MP}}_{ij} = \begin{cases} 1, & y_{ij} = 1 \text{ and } d(\tilde{c}_j) \ge d(c_j), \\ 0, & y_{ij} = 0 \text{ and } d(\tilde{c}_j) < d(c_j), \\ \text{uncertain}, & \text{otherwise}, \end{cases} \quad (7)$$
where $\tilde{c}_j$ indicates the coupon embedding of the augmented sample, and uncertain corresponds to the case in which the MP yields no confident label.
If $y_{ij} = 1$ and the discount amount of $\tilde{c}_j$ is greater than or equal to that of $c_j$, the MP label is set to 1. Conversely, if $y_{ij} = 0$ and the discount amount of the augmented coupon is less than that of the original coupon, the MP label is set to 0. This labeling step ensures that the augmented samples are consistent with the MP. The MP labels are derived from the scenario prior and are independent of other factors, which makes them logically sound and relatively confident. However, as indicated in Eq. (7), MP labels can be generated for only some of the augmented samples. Consequently, we need another labeling strategy for the augmented samples that lack MP labels.
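The MP labeling rule above can be sketched directly; the discount values in the example calls are illustrative.

```python
# Monotonicity-prior (MP) labeling rule from Sec. 3.3: label an augmented
# sample from the original sample's outcome and the two discount amounts.
def mp_label(y_orig, d_orig, d_aug):
    """Return 1/0 when the MP gives a confident label, else 'uncertain'."""
    if y_orig == 1 and d_aug >= d_orig:
        return 1            # used a smaller discount -> would use a bigger one
    if y_orig == 0 and d_aug < d_orig:
        return 0            # ignored a bigger discount -> would ignore a smaller one
    return "uncertain"      # the prior says nothing about the remaining cases

confident_pos = mp_label(1, d_orig=5, d_aug=10)
confident_neg = mp_label(0, d_orig=10, d_aug=5)
no_label = mp_label(1, d_orig=10, d_aug=5)
```

The `uncertain` branch is exactly the set of augmented samples that the DP labeling step must cover.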
The differential posterior label. Formally, the DP label of an augmented sample $\tilde{s}_{ij} = (u_i, \tilde{c}_j)$ is characterized as follows:
$$\tilde{y}^{\mathrm{DP}}_{ij} = \mathbb{I}\left(f(\tilde{s}_{ij}) > \tau\right),$$
where $\mathbb{I}(\cdot)$ is the indicator function that returns 1 if the condition inside the parentheses is true and 0 otherwise, and $f(\cdot)$ refers to the recommendation model. The threshold $\tau$ is derived from the number of positive samples $n^{+}$ in the original training data, the parameter $\beta$, and the predicted scores of the augmented samples $f(\tilde{s}_{ij})$. Specifically, we sort the scores of the augmented samples from largest to smallest and take the score at the sorted position determined by $n^{+}$ and $\beta$ as the threshold, where $\lfloor \cdot \rfloor$ is the floor function. Note that the DP label is obtained from unstable recommendation model scores and the observational scenario posterior, so it suffers from low confidence.
The final pseudo-labels. To thoroughly leverage both labels, we use the MP label as the final pseudo-label for those augmented samples that have one, and the DP label for those that do not. Meanwhile, we employ the degree of similarity (i.e., the proportion of identical labels) between the MP labels and the DP labels as the confidence weight $w$ for the DP label. Intuitively, the MP label is more plausible than the DP label, so a lower proportion of identical labels indicates less confidence in the DP label. Here, $w$ is a self-learning parameter.
The final pseudo-labels can be formally defined as follows:
$$\tilde{y}_{ij} = \begin{cases} \tilde{y}^{\mathrm{MP}}_{ij}, & \text{if } \tilde{y}^{\mathrm{MP}}_{ij} \ne \text{uncertain}, \\ \tilde{y}^{\mathrm{DP}}_{ij}, & \text{otherwise}. \end{cases}$$
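The DP thresholding step can be sketched as follows. The cutoff position `round(n_pos * (1 - beta))` is our illustrative reading of how the threshold is derived from $n^{+}$ and $\beta$, not the paper's exact formula, and the scores are invented.

```python
# Differential-posterior (DP) labeling: sort augmented-sample scores, pick a
# threshold so that far fewer augmented samples go positive than in D_T
# (the DP says random issuance converts less), then label by thresholding.
def dp_labels(aug_scores, n_pos, beta=0.9):
    cutoff = max(1, round(n_pos * (1 - beta)))   # assumed cutoff position
    ranked = sorted(aug_scores, reverse=True)
    tau = ranked[min(cutoff, len(ranked)) - 1]   # threshold score
    return [1 if s >= tau else 0 for s in aug_scores]

scores = [0.9, 0.1, 0.4, 0.8, 0.3]   # illustrative model scores f(s~)
labels = dp_labels(scores, n_pos=20, beta=0.9)
print(labels)
```

With `n_pos=20` and `beta=0.9` only the top two scores are labeled positive, matching the intent that the augmented set's conversion rate sits well below the original one.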

Overall Debiasing Model Training
Formally, we use both the augmented sample set and the original sample set to update the recommendation models, and the parameters $\theta$ of the Cdapk model are optimized by
$$\theta^{*} = \arg\min_{\theta} \sum_{(u, c, y) \in \mathcal{D}_T} \delta\left(f_{\theta}(u, c), y\right) + \sum_{(u, \tilde{c}, \tilde{y}) \in \mathcal{D}_A} w \cdot \delta\left(f_{\theta}(u, \tilde{c}), \tilde{y}\right),$$
where $\mathcal{D}_T$ and $\mathcal{D}_A$ represent the training dataset and the augmented dataset, respectively, $w$ is the confidence weight, and $\delta(\cdot)$ is the loss function, e.g., cross-entropy loss [46] in this case.
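The combined objective can be sketched as follows. The toy predictions and labels are illustrative, and for simplicity the confidence weight `w` is applied to every augmented sample here, whereas the paper attaches it to the DP-labeled ones.

```python
import math

def bce(p, y, eps=1e-12):
    # Cross-entropy between a predicted probability and a pseudo/true label.
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def cdapk_loss(preds_T, labels_T, preds_A, labels_A, w):
    # Unweighted loss on the original set D_T plus w-weighted loss on the
    # augmented set D_A, averaged over all samples.
    loss_T = sum(bce(p, y) for p, y in zip(preds_T, labels_T))
    loss_A = sum(w * bce(p, y) for p, y in zip(preds_A, labels_A))
    return (loss_T + loss_A) / (len(preds_T) + len(preds_A))

total = cdapk_loss([0.8, 0.3], [1, 0], [0.6], [1], w=0.5)
```

Because `w` shrinks toward 0 when the DP labels disagree with the MP labels, low-confidence augmented samples contribute proportionally less to the gradient.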

EXPERIMENTS
We build our model with TensorFlow [1] and evaluate its performance through comprehensive offline and online experiments to answer the following key questions. For FairCo [35], we calculate the error term based on the score list sorted by relevance; for Macr [57], the trade-off parameters are set to 1e-3 and the hyper-parameter is set to 20; for Pda [61], the hyper-parameter is set to 0.1 by grid search; for Dice [63], the hyper-parameters are set to 0.1 and 0.01 as recommended; and for Dmbr [13], the confounding effect of the imbalanced distribution of users/items over each other is eliminated. In addition, we use Fm [40], Dnn [16], Ncf [20], Masknet [55], and DeepFm [18] as the vanilla models.

AUC.
To evaluate debiasing capabilities, we conduct experiments on an unbiased data distribution following existing studies [4,57,63], where the testing instances are sampled (collected) from an unbiased distribution. In the marketing scenario we study, the model provides only one coupon to the user at a time, resulting in a binary classification problem that predicts the likelihood of a user using a coupon. Due to the nature of this scenario, ranking performance metrics commonly used in other recommendation scenarios would be inappropriate. As such, we use the Area Under the Curve (AUC) to measure how effectively the debiasing model alleviates the matching bias. A higher AUC score means the debiasing method is more effective.

SRMI.
In the marketing scenario we study, the MP holds true: higher coupon discounts should yield a higher conversion probability. However, due to the influence of marketing tactics, the overall conversion rate of some high-discount coupons may be lower than that of some low-discount coupons, leading the model to learn this counter-intuitive phenomenon, which has been confirmed by [13]. As a result, we introduce a novel metric named the Scoring Ranking Monotonicity Index (SRMI), which measures the monotonicity of the model's scoring curve across different discount coupons. A higher SRMI score indicates that the debiasing method is less affected by the aforementioned counter-intuitive phenomenon, i.e., that it is more effective. The metric is calculated for individual users and then aggregated:
$$\mathrm{SRMI} = \frac{1}{|\tilde{\mathcal{U}}|} \sum_{u \in \tilde{\mathcal{U}}} \frac{2}{|\tilde{\mathcal{I}}|\left(|\tilde{\mathcal{I}}| - 1\right)} \sum_{i < j} \mathbb{I}\left(\Delta_u(i, j) > 0\right),$$
where $\tilde{\mathcal{U}}$ is the test sample set, $\tilde{\mathcal{I}}$ denotes the set of different discounts, and $i, j$ denote the indices of discounts sorted from small to large. The difference in scores between two different discount coupons is defined as
$$\Delta_u(i, j) = f(u, c_j) - f(u, c_i),$$
where $f(\cdot, \cdot)$ denotes the model prediction score and $c_i, c_j \in \tilde{\mathcal{I}}$ denote the embeddings of different discount coupons.
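A sketch of the SRMI computation as we read it: per user, the fraction of discount pairs whose scores respect monotonicity, averaged over users. The pairwise form is reconstructed from the description and the scores are illustrative.

```python
def srmi(score_rows):
    """score_rows: per-user lists of model scores ordered by ascending discount."""
    total = 0.0
    for scores in score_rows:
        k = len(scores)
        pairs = k * (k - 1) / 2
        # A pair (i, j) with i < j is "monotone" if the bigger discount
        # receives the strictly higher score.
        mono = sum(1 for i in range(k) for j in range(i + 1, k)
                   if scores[j] > scores[i])
        total += mono / pairs
    return total / len(score_rows)

perfect = srmi([[0.1, 0.4, 0.7]])   # scores increase with discount
mixed = srmi([[0.5, 0.2, 0.7]])     # one violated pair out of three
```

A model that has absorbed the tactics-induced counter-intuitive pattern scores low here even if its AUC is competitive.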

Debiasing Performance Comparison
Table 2 and Table 3 present the AUC and SRMI performance on a simulated unbiased distribution, along with a more visual representation of the relative performance improvement over the vanilla model DeepFm. Five-fold cross-validation is conducted, and we record the mean metric value and the standard deviation of each algorithm.
For each data set and evaluation metric, a pairwise t-test at 0.05 significance level on five-fold cross-validation is performed to show whether the performance of Cdapk is significantly different from the compared algorithm.
In addition, the widely used Friedman test [9] is utilized for statistical comparison among multiple methods over a number of datasets (number of comparing methods $k = 8$; number of datasets $N = 30$). The critical value of the Friedman statistic is 2.0549 in light of [9]. Table 4 reports the Friedman statistics over all evaluation metrics at the 0.05 significance level. The computed statistic is greater than the critical value 2.0549, signifying that the null hypothesis of "equal" performance among the comparing approaches should be clearly rejected. Therefore, the Bonferroni-Dunn test [9] is employed as the post-hoc statistical test to analyze the relative performance among the comparing methods. Here, the difference between the average ranks of the control method (i.e., Cdapk) and one comparing method is calibrated with the critical difference (CD), and the performance difference is considered significant if their average ranks differ by at least one CD; here, CD = 1.7013. Fig. 4 illustrates the CD diagram for the AUC and SRMI metrics, treating Cdapk as the control method. The average rank of each method is marked along the axis with lower ranks to the right, and any comparing method whose average rank is within one CD of Cdapk is interconnected with it by a thick line; otherwise, it is considered to have a significantly different performance from Cdapk.
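The reported CD of 1.7013 is consistent with the standard Bonferroni-Dunn formulation, $CD = q_{\alpha}\sqrt{k(k+1)/(6N)}$; a quick check, where $q_{0.05} \approx 2.690$ for $k = 8$ is taken from the standard two-tailed Bonferroni-Dunn table (an external assumption, not stated in the text):

```python
import math

def critical_difference(q_alpha, k, n_datasets):
    # Bonferroni-Dunn critical difference over average ranks for k methods
    # compared against a control on n_datasets datasets.
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

cd = critical_difference(q_alpha=2.690, k=8, n_datasets=30)
print(round(cd, 4))
```

Plugging in $k = 8$ and $N = 30$ reproduces the paper's 1.7013, so the CD diagram in Fig. 4 follows directly from the rank table.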
Based on the reported experimental results, the following observations can be made:
• Macr performs worse than the vanilla models because its counterfactual inference procedure is extremely sensitive to its trade-off parameter, whose optimal value varies case by case, so a simple grid search may not find it.
• Ipw, FairCo, and Dice achieve comparable or marginal improvements over the vanilla models in terms of AUC and SRMI. It can be deduced that simply re-weighting, introducing one error term and group features, or dividing samples into interest and conformity samples cannot precisely estimate the effect of bias in marketing coupon recommendation.
• Dmbr and Pda achieve marginal improvements over the vanilla models.
Fig. 5(b) illustrates how the performance of Cdapk changes as the value of $\beta$ varies on Data_A, with DeepFm as the base model and $\alpha$ set to 0.2, covering both AUC and running time. As shown, when $\beta$ increases from 0.1 to 0.9, the performance of Cdapk improves sharply, and from 0.9 to 1.0 it is relatively stable. This is because an improper $\beta$ can lead to problems with the labeling of the augmented samples, which changes the data distribution and thus harms model training. The training time remains comparable as $\beta$ increases, so we set $\beta$ to 0.9 according to the AUC performance of Cdapk.

Ablation Study.
In this section, we dissect Cdapk to assess the contributions of the MP and the DP. We regard the combination of the two as a key factor in the validity of Cdapk and therefore investigate the effect of each single piece of knowledge to provide additional insights. Specifically, we use only the MP or only the DP to construct labels for the augmented samples, denoted C-M and C-D, respectively, and conduct experiments to compare their effects. Due to the page limit, the results are summarized in Table 2 and Table 3. We observe that C-M performs better than C-D, but both are worse than the combined Cdapk. It is reasonable to speculate that the samples labeled solely with the MP provide additional information for better learning, while the samples labeled solely with the DP are derived from the model itself without introducing additional information and thus do not help the model. However, the MP cannot assign a label to all augmented samples, so the remaining samples are labeled with the DP and weighted by the similarity-based confidence, allowing the model to use both comprehensively and achieve better results.
In the CRI formula, $i$ and $j$ denote the indices of discounts sorted from small to large, $K$ denotes the collection size of discounts, $\mathrm{CVR}_i$ denotes the conversion rate of the discount with index $i$, and $\mathbb{I}(\cdot)$ is the indicator function, returning 1 if its condition is true and 0 otherwise. The larger the CRI, the more consistent the online conversions are with the monotonicity prior, which to a certain extent reflects the model's ability to debias. As shown in Table 6, compared to the baseline, the CRI of Cdapk is 12.62% larger, indicating that the bias in the issuance data determined by Cdapk is more negligible.
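Following the CRI description, the metric can be sketched as the fraction of discount pairs, ordered by increasing discount, whose conversion rates also increase; the pairwise form is reconstructed from the description and the CVR values are illustrative.

```python
def cri(cvr_by_discount):
    """cvr_by_discount: conversion rates ordered by ascending discount."""
    k = len(cvr_by_discount)
    # Count pairs where the larger discount also converts strictly better.
    mono = sum(1 for i in range(k) for j in range(i + 1, k)
               if cvr_by_discount[j] > cvr_by_discount[i])
    return mono / (k * (k - 1) / 2)

baseline_cri = cri([0.10, 0.08, 0.12])   # a higher discount converting worse
debiased_cri = cri([0.08, 0.10, 0.12])   # conversions monotone in discount
```

A debiased policy should push the online CVR curve toward monotonicity in the discount, which is exactly what a higher CRI records.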

CONCLUSION
This study presents Cdapk to mitigate the matching bias in marketing coupon recommendations. Cdapk eliminates confounding bias during model training by leveraging counterfactual data augmentation and the potential knowledge of the given scenario. Specifically, Cdapk creates interaction data that do not exist in reality (the counterfactuals) and assigns labels to them based on the potential knowledge of the given scenario so that they can participate in model training. Results from both offline experiments and online A/B testing demonstrate the efficacy of Cdapk in alleviating the matching bias and improving recommendation accuracy.

Figure 1 :
Figure 1: Illustration of the matching processes with and without marketing tactics intervention. The dashed arrows indicate the model's scores for the user-coupon pairings, and the solid arrows indicate the interaction likelihoods (tactics-specific weights). The marked score is the one used in sorting during the coupon allocation process, and the boldface indicates the final recommendation result.
Block the backdoor path by Cdapk.

Figure 2 :
Figure 2: $C$: the confounding variable. $M$: the matching result between the user and the coupon. $M^{*}$: the generated counterfactual matching result. $Y$: the binary outcome of whether the coupon is used or not.

Figure 3 :
Figure 3: An illustration of the fact that the observational training data distribution $P_T$ covers only part of the ideal data distribution $P_I$; the joint distribution of the observational training data and the augmented counterfactual data is marked with triangles.

Figure 4 :
Figure 4: Comparison of Cdapk (control method) against other methods with Bonferroni-Dunn test.

4.4.2 Parameter Sensitivity Analysis. Fig. 5(a) illustrates how the performance of Cdapk changes on Data_A with DeepFm as the parameter varies (with the other parameter set to 0.9), in terms of both AUC and running time. As shown, when the value increases from 0.01 to 0.2, the performance of Cdapk improves slightly, and when it varies from 0.2 to 1.0, the performance is relatively stable. However, the training time increases dramatically with larger values. Fig. 5(b) reports the results as the other parameter varies from 0.1 to 1.0.

Figure 5 :
Figure 5: The dark-orange line indicates the Cdapk AUC performance, corresponding to the left axis; the royal-blue line indicates the training-time performance, corresponding to the right axis. The shaded regions of the two lines indicate the 95% confidence interval over five runs. To visually display the results under various parameter values, the x-coordinates in (a) are irregularly spaced.

4.5.2 Debiasing Analysis. To demonstrate the benefits of Cdapk, we define a new metric (with reference to SRMI) to measure the impact of matching bias on the real-world discount coupon issuance data: the conversion ranking index (CRI). Concretely, we analyze the online A/B test result and count, over the collection size of discounts (CSD, e.g., 50 in this case) of the overall data, the pairs in which a larger discount coupon achieves a higher conversion rate than a smaller one. Higher CRI values indicate better debiasing performance. Formally, we can formulate the CRI as follows:

CRI = 2 / (K(K-1)) · Σ_{i=1}^{K-1} Σ_{j=i+1}^{K} I(c_j > c_i)
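The CRI can be computed in a few lines of Python. This is a minimal sketch assuming the conversion rates are given as a list already sorted from the smallest to the largest discount, with K inferred as its length; the function name is illustrative.

```python
def cri(conversion_rates):
    """Conversion Ranking Index: the fraction of discount pairs (sorted
    from small to large discount) in which the larger discount achieves a
    strictly higher conversion rate than the smaller one. A CRI of 1.0
    means conversion rates increase monotonically with the discount."""
    k = len(conversion_rates)
    hits = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if conversion_rates[j] > conversion_rates[i]
    )
    return 2.0 * hits / (k * (k - 1))

# Toy example: 4 discounts with one ranking violation (0.05 -> 0.04),
# so 5 of the 6 pairs are correctly ordered and CRI = 5/6.
rates = [0.02, 0.05, 0.04, 0.08]
```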

Table 1 :
Characteristics of the experimental datasets.

Table 2 :
Offline results in terms of AUC (the larger the better), where •/• indicates whether Cdapk is statistically superior/inferior to the comparing algorithm on each dataset (pairwise t-test at 0.05 significance level). "%impro." denotes the relative performance improvement over the vanilla model, and boldface indicates the best performance.
Benchmark Methods. As the backbone model of Cdapk is model-agnostic, we compare our method with six state-of-the-art model-agnostic debiasing methods. In particular, for Ipw [41], we use the reciprocal of the group sample share as the propensity score;
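As a concrete illustration of this Ipw variant, the following sketch computes per-sample weights as the reciprocal of each group's share of the training data; the function name and input format are illustrative, not from the paper.

```python
from collections import Counter

def ipw_weights(group_ids):
    """Inverse-propensity weights where a sample's propensity score is
    its group's share of the training data; the weight is the reciprocal
    of that share, so under-represented groups are up-weighted."""
    n = len(group_ids)
    counts = Counter(group_ids)          # samples per group
    shares = {g: c / n for g, c in counts.items()}  # propensity scores
    return [1.0 / shares[g] for g in group_ids]
```

These weights would typically multiply the per-sample loss during training of the backbone CTR model.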

Table 3 :
Offline results in terms of SRMI (the larger the better), where •/• indicates whether Cdapk is statistically superior/inferior to the comparing algorithm on each dataset (pairwise t-test at 0.05 significance level). "%impro." denotes the relative performance improvement over the vanilla model, and boldface indicates the best performance.

Table 4 :
Friedman statistics in terms of each evaluation metric as well as the critical value at the 0.05 significance level (# comparing methods = 8; # datasets = 30).
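For reference, the Friedman statistic reported in such comparisons is commonly the Iman-Davenport F_F correction of the Friedman chi-square; the following is a minimal sketch computed from the methods' average ranks, following the standard formulas and assuming no rank ties. The function name is illustrative.

```python
def friedman_ff(avg_ranks, k, n):
    """avg_ranks: average rank R_j of each of the k methods over n
    datasets. Returns (chi2_F, F_F): the Friedman chi-square statistic
    chi2_F = 12N/(k(k+1)) * [sum_j R_j^2 - k(k+1)^2/4] and its
    Iman-Davenport correction F_F = (N-1)*chi2_F / (N(k-1) - chi2_F)."""
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0
    )
    ff = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, ff
```

F_F is then compared against the critical value of the F distribution with (k-1) and (k-1)(N-1) degrees of freedom before running the Bonferroni-Dunn post-hoc test shown in Fig. 4.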

Table 5 :
Summary of the training and testing algorithmic complexity of the comparing methods with the vanilla model Dnn.

• Cdapk effectively improves the performance over the vanilla models. As shown in Table 2 and Table 3, the relative improvements of Cdapk over the vanilla models are superior to those of the comparing methods in almost all cases.
• As shown in Fig. 4, it is impressive that Cdapk has the lowest average rank on each evaluation metric. Treating Cdapk as the control method, its performance is statistically comparable to Dmbr and Pda on AUC, and superior to both of them in terms of SRMI. What's more, Cdapk is statistically superior to Macr, Ipw, Dice, and FairCo in all cases.
• The improvement of Cdapk on the SRMI metric is larger than that on the AUC metric. This is because Cdapk transforms the biased training data distribution into a theoretically unbiased data distribution, i.e., one that better satisfies monotonicity; Cdapk learns this property, which favors it on the SRMI metric to a certain extent. However, since the SRMI metrics of all the comparing methods are measured on unbiased test data, SRMI can still reflect the debiasing capability of the methods.

In summary, the above results validate the effectiveness of Cdapk for debiased marketing coupon recommendations.

Complexity and Time Analysis. Table 5 summarizes the algorithmic complexity of the comparing methods with the vanilla model DeepFm with respect to several common factors: the number of training samples, the number of augmented training samples, the number of features, and the number of additional features. In the complexity characterization, three pairs of terms represent the training and testing complexity of the vanilla DeepFm model, of an MLP layer, and of an Fm layer, respectively. We also conducted time-consumption experiments, which revealed that the empirical training and testing times of Cdapk are relatively comparable to those of the other methods.

Table 6 :
Online results between Cdapk and the baseline, where † indicates better performance. Real-World Campaign Results. To evaluate the effectiveness of Cdapk (based on DeepFm) in real-world scenarios, we conducted an online A/B test in an Alipay marketing coupon recommendation campaign, in which two metrics are measured: the number of coupons used and the use rate. Due to budget constraints, we only compared Cdapk with the baseline DeepFm in the online coupon recommendation system, which randomly and evenly divided all candidates into two buckets. The experimental results are summarized in Table 6, revealing that our proposed approach Cdapk achieved a 0.93% increase in use rate and a 1.31% increase in the number of coupons used, demonstrating a significant improvement in the real-world marketing coupon recommendation campaign.