Tree based Progressive Regression Model for Watch-Time Prediction in Short-video Recommendation

Accurate prediction of watch time is of vital importance for enhancing user engagement in video recommender systems. To achieve this, a watch time prediction framework should satisfy four properties. First, although watch time is a continuous value, it is also an ordinal variable, and the relative ordering between its values reflects differences in user preference; these ordinal relations should be reflected in watch time predictions. Second, the conditional dependence between video-watching behaviors should be captured by the model; for instance, one has to watch half of a video before finishing the whole video. Third, modeling watch time with a point estimate ignores the fact that the model might produce results with high uncertainty, which can cause bad cases in recommender systems; the framework should therefore be aware of prediction uncertainty. Fourth, real-life recommender systems suffer from severe bias amplification, so an estimation free of bias amplification is expected. We therefore propose TPM for watch time prediction. Specifically, ordinal ranks of watch time are introduced into TPM, and the problem is decomposed into a series of conditionally dependent classification tasks organized in a tree structure. The expectation of watch time is generated by traversing the tree, and the variance of watch time predictions is explicitly introduced into the objective function as a measure of uncertainty. Moreover, we show that backdoor adjustment can be seamlessly incorporated into TPM, which alleviates bias amplification. Extensive offline evaluations have been conducted on public datasets, and TPM has been deployed in Kuaishou, a real-world video app with over 300 million DAUs. The results indicate that TPM outperforms state-of-the-art approaches and significantly improves video consumption.


INTRODUCTION
Recent years have witnessed the growing popularity of online video services (e.g., YouTube and Hulu) and video-sharing platforms (e.g., TikTok and Kuaishou). The amount of time that users spend watching recommended videos (referred to as watch time) has become a key metric of user engagement. Users who receive recommendations with higher watch time tend to stay longer on the platform, which brings growth in DAU (Daily Active Users).
Despite its importance, watch time prediction has not been widely studied in previous research [3, 19]. We argue that several aspects make watch time modeling special. First, watch time prediction is essentially a regression problem, but the ordinal relation between predictions is also important in recommendation. On one hand, watch time is a continuous random variable, and the recommender system needs an accurate prediction for use in downstream phases; on the other hand, watch time is a metric for comparing videos, so the ordinal relation of predictions matters as well. For example, given two predictions of watch time for a video, ŷ1 = 3.5 and ŷ2 = 4.5, with ground truth y = 4, the two predictions share the same regression error of 0.5 in terms of MAE. However, they lead to very different consequences in a recommender system: since the system tends to recommend videos with higher predicted watch time, the video is much more likely to be recommended under the prediction of 4.5 than under 3.5. Estimating watch time with direct regression therefore fails to model the ordinal relations between watch time values, while ranking losses focus on the ordinal relations but may produce predictions that deviate far from the ground truth. A good formulation of watch time prediction should satisfy both requirements simultaneously.
Second, there is strong conditional dependence among video-watching behaviors. For example, one has to watch half of a video before finishing the whole video. This is similar to click and post-click behaviors (e.g., purchase) on E-commerce platforms [11, 18], and such conditional dependence needs to be considered in watch time prediction. Third, to enable robust prediction, we expect the model to be aware of the uncertainty in its predictions. For most regression models, the objective is an accurate point estimate obtained by minimizing an L1 or L2 loss, so the model might produce predictions with high uncertainty. In real-life recommender systems, this can lead to bad cases where sub-optimal videos are assigned high rankings by the model, causing unsatisfactory user experiences. However, how to model uncertainty in watch time prediction remains under-investigated.
Fourth, most real-life recommender systems suffer from bias amplification (e.g., sample selection bias, popularity bias). Since the training data for models is usually collected from the platform's logs, severe bias amplification can result. According to previous studies [19], the recommendations of video recommender systems can be biased towards videos with longer durations, which verifies the existence of bias amplification.
Given these four issues, we review existing studies on watch time prediction. Although existing methods tackle some important limitations and achieve strong performance on this task, none of them fully consider all four issues. An analysis of two state-of-the-art methods for watch time prediction (WLR [3] and D2Q [19]) can be found in Table 1.
In WLR, training samples are either positive (the video impression was clicked) or negative (the impression was un-clicked), and watch-time prediction is treated as a binary classification problem in which positive samples are weighted by watch time in the cross-entropy loss. The odds learned by the classifier therefore approximately equal the expected watch time. Despite its simplicity and effectiveness, WLR has limitations that prevent its direct application in full-screen video recommender systems [19], where all video impressions are watched. WLR then has to be trained with artificially designated positive and negative samples and weights, which may yield a poor approximation of watch time. Meanwhile, the bias amplification effect may become even more severe in WLR, as more weight is assigned to videos of longer duration. D2Q [19] alleviates duration bias by splitting videos into groups according to their durations and modeling watch time with traditional regression within each group; consequently, the ordinal relationships and conditional dependence among watch time values are neglected. Moreover, both WLR and D2Q treat watch time prediction as a point estimation problem, so the uncertainty of predictions is ignored. Considering the four aforementioned issues, we propose a new framework, TPM, that solves them simultaneously. Specifically, watch time is split into multiple ordinal intervals, and watch time prediction becomes a search problem deciding which interval the predicted watch time belongs to. The search process is modeled as a collection of decision problems organized in a tree structure. Each intermediate node represents a decision problem and is assigned a corresponding classifier, while each leaf node represents one of the ordinal intervals and is assigned an expected value of watch time. Each edge represents a possible decision outcome and leads to the next decision, so the result from the upper level becomes the condition for decisions at the current node. The path from the root to a leaf thus corresponds to a search trajectory consisting of a series of decisions. We present a running example in Fig. 1 to illustrate the framework.
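The weighted-logistic trick behind WLR can be sketched numerically. Below is a minimal illustration, with made-up impression counts and watch times (not from the paper), of why the learned odds approximate expected watch time when clicks are rare; the closed-form optimum of the weighted loss stands in for actual gradient training.

```python
# Hedged sketch of WLR's odds approximation [3]: positive (clicked)
# impressions are weighted by watch time in a logistic loss, negatives get
# weight 1. For a single context, the weighted-logistic optimum is
# p* = sum(positive weights) / (sum(positive weights) + #negatives),
# so odds = p*/(1-p*) = sum(T_i) / #negatives, close to E[T] when clicks
# are rare. All numbers here are illustrative.

watch_times = [10.0, 20.0, 30.0]   # watch times of the clicked impressions
num_negatives = 97                  # unclicked impressions, weight 1 each

p_star = sum(watch_times) / (sum(watch_times) + num_negatives)
odds = p_star / (1 - p_star)        # equals sum(T_i) / num_negatives

# expected watch time over all 100 impressions (unclicked count as 0)
expected_watch_time = sum(watch_times) / (len(watch_times) + num_negatives)
```

Here odds = 60/97 ≈ 0.619 versus E[T] = 0.6; the gap grows as the click rate rises, which is one reason the approximation degrades in full-screen feeds where every impression is "watched".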
We explain how TPM solves the four issues in detail as follows:
• First, we introduce ordinal ranks into the approximation of watch time. The regression task is decomposed into multiple binary classification tasks whose labels are associated with the ordinal ranks. This approximation makes use of both the continuity of watch time and the ordinal relations between the ranks.
• Second, we introduce conditional dependence into TPM. Each task at a child node is dependent on the task at its parent node, so the conditional dependence between the decomposed classification tasks is explicitly modeled. Since there are multiple possible decompositions of the prediction task, arbitrary conditional dependence can be encoded into the model.
• Third, to enable a robust framework for watch time prediction, we introduce model uncertainty into the objective function. Thanks to the splitting of watch time into ranks, the predicted watch time can be seen as a random variable drawn from a multinomial distribution, so the variance of the predicted watch time can be computed explicitly and used as a measure of model uncertainty. It is introduced into the objective function so that TPM produces accurate estimates of watch time with high confidence.
• Fourth, we conduct a causal analysis of the confounding effect of biases and show that performing backdoor adjustment is equivalent to a specific decomposition into multiple classification tasks. This method applies to different kinds of biases, and D2Q [19] can be seen as a special case of TPM.
Although the structure of TPM resembles decision trees, it differs significantly from conventional tree models for regression. First, TPM uses the tree structure to decompose a pure regression problem into a series of classification problems, so the decomposition is conducted on the label space (i.e., the partition of watch time intervals), while tree models partition the feature space into sub-spaces for prediction. Second, the tree-like decomposition in TPM assigns each node a corresponding classifier (such as a neural network) for decision making, while conventional tree models directly learn a feature partition rule for labeling.
The contributions of this paper are summarized as follows:

RELATED WORK 2.1 Watch-time Prediction
Watch-time prediction is one of the most important problems in industrial recommender systems (especially short-video and movie recommendation). However, to the best of our knowledge, only a few papers [3, 19] can be found in this area. The first work [3] focused on video recommendation at YouTube and proposed the Weighted Logistic Regression (WLR) method for watch-time prediction, which has become a state-of-the-art method in related applications. However, WLR cannot be directly applied to full-screen video recommender systems, and it may suffer from severe bias issues due to its weighting mechanism. D2Q [19] alleviates duration bias by performing backdoor adjustment and models watch time with direct watch-time quantile regression.
However, the ordinal relationships and dependencies between quantiles are ignored in this method. Moreover, as both methods model watch time with point estimation, the uncertainty of predictions has not been considered.

Ordinal regression
Ordinal regression is a technique for predicting ordinal labels, i.e., labels whose relative order matters. Its applications include age estimation [12], monocular depth estimation [6], and head-pose estimation [7]. Despite these wide applications, ordinal regression has not been applied to watch time prediction.
Most ordinal regression algorithms are modified from classical classification algorithms. For instance, SVMs have been equipped with multiple thresholds and applied to visual classification [14]; another example is a combination with the online perceptron algorithm [4] used for rating prediction. Moreover, the ordering information in class attributes has been exploited to transform ordinal regression into multiple classification problems [5]; it is worth noting that decision trees are used in that work. However, the binary classification problems in ordinal regression are not conditionally dependent as they are in TPM, which is a fundamental difference.

Tree based neural networks for recommendation
Tree based models and neural networks are powerful in various machine learning applications, especially recommender systems. Tree based methods like LambdaMART [2] are highly competitive in ranking tasks, while neural networks achieve state-of-the-art performance in leveraging sparse and complex features. However, few efforts have been made to combine the advantages of both. An early study [10] combined decision trees with neural networks for search: the two models are combined with ensemble learning techniques (e.g., linear combination and stacking), and the combined model outperforms the single models. Moreover, tree models have been used to enhance embedding models for explainable recommendation [16]. TDM [20, 21] is an example of combining tree-based models and neural networks for recommendation: it organizes candidate retrieval as a search along a tree, so that the most preferred candidates can be retrieved with arbitrarily complex models in logarithmic time. TPM differs from TDM in several important aspects: first, TPM is designed for watch time prediction given users and corresponding videos, while TDM aims to retrieve relevant candidates from a huge corpus; second, TPM uses the tree structure for problem decomposition while TDM uses it for corpus partition; third, TPM traverses the tree to predict expected watch time while TDM uses beam search to find target leaf nodes.

Debiased recommendation
Many efforts have been made to address biases in recommendation. Previous studies on this topic can be roughly divided into three categories: inverse propensity scoring, causal embedding, and causal intervention. Inverse propensity scoring first computes the propensity score of samples based on certain assumptions and then reweights samples accordingly; among causal intervention methods, backdoor adjustment [15, 19] is preferred in practical scenarios.
Causal intervention has been used in watch time prediction to deconfound duration bias [19]. We show that backdoor adjustment can be seamlessly incorporated into TPM and that this method applies to other confounding factors as well.

TREE BASED PROGRESSIVE REGRESSION MODEL
We first provide a general formulation of the Tree based Progressive regression Model and introduce how watch time prediction is decomposed into several conditionally dependent classification problems. Then we present the details of uncertainty modeling in TPM. After that, we show that backdoor adjustment naturally fits into TPM and detail how bias amplification is alleviated. Before going into the details of the formulation, we provide a list of notations in Table 2. Instead of treating the problem as direct regression, we first quantize the watch time scale into ordinal ranks {t_0 ≤ t_1 ≤ ... ≤ t_m ≤ ... ≤ t_M} and then cast watch time prediction as the estimation of the expected ordinal rank. Estimation from ordinal ranks is similar to a search process with iterative comparisons.

Formulation for TPM
For instance, if we conduct a linear search along the ranks, the process is as follows: we first decide whether y ≤ t_0; if so, the predicted ordinal rank is d_0; otherwise we continue by deciding whether y ≤ t_1; if so, the rank is d_1; otherwise the process continues. The search goes on until an interval is finally found.
If we instead conduct a binary search, the process becomes: we first decide whether y ≤ t_{M/2}; if so, y falls into one of the ranks {d_0, d_1, ..., d_{M/2}}; otherwise it belongs to a rank in {d_{M/2+1}, ..., d_M}. The process continues, narrowing the search space down to a single rank.
Since each search process consists of a sequence of decisions, we propose to cast watch time prediction as a search from the root to a leaf node along a tree (see Fig. 2): for the linear search case, the tree is an unbalanced binary tree; for the binary search case, it is a balanced binary tree.
The tree in TPM consists of a set of nodes T = {n ∈ N_T}. Each non-leaf node represents an interval of consecutive ordinal ranks, i.e., n: [t_i, t_j] with j − i > 1. Without loss of generality, the root node covers the full scale, y ∈ [t_0, t_M], and each leaf node is assigned a single ordinal rank, i.e., l_m: [t_m, t_{m+1}]. The interval of a parent node is the union of the intervals of its children. A path from the root to a leaf l_m is denoted as an ordered node set P_{l_m} = {n_{l_m}(0), ..., n_{l_m}(h(l_m))}, where n_{l_m}(i) is the node at level i along path P_{l_m} and h(l_m) is the depth of leaf l_m. In TPM, each non-leaf node is assigned a classifier whose outputs give the conditional probabilities that watch time belongs to the ordinal ranks of its child nodes, given that it belongs to the interval of the node itself. Therefore, given an instance x and a tree T, the predicted watch time y follows a multinomial distribution over the leaves:

P(y ∈ l_m | x, T) = ∏_{i=1}^{h(l_m)} P(y ∈ n_{l_m}(i) | y ∈ n_{l_m}(i − 1), x, T),  (1)

where each factor is parameterized by the classifier M_{n_{l_m}(i−1)} assigned to node n_{l_m}(i − 1). Based on this, the expectation of watch time given a tree T can be computed as:

E(y | x, T) = Σ_{l_m ∈ L_T} P(y ∈ l_m | x, T) · E(y | y ∈ l_m, x, T).  (2)

The term E(y | y ∈ l_m, x, T) can be estimated by any predictive model; in this paper we employ a simple method for its estimation. By building multiple trees for watch time estimation, the expectation can be computed by incorporating the distribution over trees as a prior:

E(y | x) = Σ_T P(T) · E(y | x, T).  (3)

Although TPM allows such a bagging scheme for prediction as in Eqn. 3, we restrict the number of trees to one for simplicity.
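Concretely, inference multiplies the edge probabilities along each root-to-leaf path (Eqn. 1) and then takes an expectation over leaves (Eqn. 2). The sketch below illustrates this on a depth-2 balanced tree; the node names, branch probabilities, and per-leaf expected values are all illustrative assumptions, not values from the paper.

```python
# Minimal TPM inference sketch on a balanced binary tree over 4 ordinal
# ranks (leaves). Each internal-node classifier outputs the probability of
# taking the right branch given that the sample reached the node; the leaf
# probability is the product of edge probabilities along its path.

# right-branch probabilities for the 3 internal nodes (illustrative)
p_right = {"root": 0.6, "left": 0.3, "right": 0.7}

# leaf order [ll, lr, rl, rr]; each leaf carries E[y | y in that rank]
leaf_values = [1.0, 3.0, 6.0, 12.0]

def leaf_probs(p):
    pl, pr = 1 - p["root"], p["root"]
    return [
        pl * (1 - p["left"]),   # path: left, left
        pl * p["left"],         # path: left, right
        pr * (1 - p["right"]),  # path: right, left
        pr * p["right"],        # path: right, right
    ]

probs = leaf_probs(p_right)
expected_watch_time = sum(p * v for p, v in zip(probs, leaf_values))
```

Because every sample ends in exactly one leaf, the leaf probabilities always sum to one, and the expectation is a plain dot product with the per-leaf values.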

Tree Construction
There is no restriction on the type of trees in TPM; the tree structure can be designed according to the task and dataset. In TPM, each tree corresponds to a decomposition of the ordinal ranks, and each non-leaf node corresponds to a classifier. As revealed in previous studies [8, 9], label imbalance adds difficulty to predictive modeling, so we construct the tree so that the label distribution at each node is balanced.
Therefore, we compute the quantiles of watch time and set them as the ordinal ranks for discretization. We then split the ordinal ranks into halves iteratively and assign them to the leaf nodes of a complete binary tree; the tree is constructed by repeatedly merging two child nodes into a parent node. As a result, the classifier at each node faces a roughly balanced binary classification task.
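The quantile-based rank construction above can be sketched as follows; the synthetic exponential watch times and the choice of eight leaves are assumptions for illustration only.

```python
import numpy as np

# Sketch of quantile-based rank construction: split watch time at empirical
# quantiles so each leaf (ordinal rank) receives a roughly equal share of
# training labels, keeping every node's classification task balanced.

watch_times = np.random.default_rng(0).exponential(scale=20.0, size=10_000)

num_leaves = 8  # leaves of a complete binary tree of depth 3
# interior boundaries t_1..t_{M-1}; leaf m covers [t_m, t_{m+1})
boundaries = np.quantile(watch_times, np.linspace(0, 1, num_leaves + 1)[1:-1])

# each sample's leaf index = the quantile bucket its watch time falls into
leaf_index = np.searchsorted(boundaries, watch_times, side="right")

counts = np.bincount(leaf_index, minlength=num_leaves)
# by construction the leaves are (near-)equally populated
```

Building the tree then amounts to pairing adjacent leaves into parents until the root covers the full scale, exactly the bottom-up merge described in the text.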

Uncertainty Modeling
Previous methods [3, 19] model watch time with point estimation, but it is unknown how much confidence should be placed in their predictions. TPM not only models the error of the expected watch time but also attempts to minimize the uncertainty of its predictions. Given a tree for problem decomposition, TPM predicts the probability that watch time belongs to each ordinal rank, so watch time becomes a random variable following the multinomial distribution P(y ∈ l_m | x, T), ∀ l_m ∈ L_T. Predicting watch time as a distribution is very helpful, as it enables an approximate estimate of the watch time variance:

Var(y | x, T) = Σ_{l_m ∈ L_T} P(y ∈ l_m | x, T) · (E(y | y ∈ l_m, x, T) − E(y | x, T))².  (4)

Since P(y ∈ l_m | x, T) can be computed with Eqn. 1, the variance can be computed easily under the assumption y ∼ P(y ∈ l_m | x, T), ∀ l_m ∈ L_T.
A simple example is depicted in Fig. 3: assuming the watch time scale is split into eight ordinal ranks, the predictions of two models M_a and M_b have the same expectation, E(y) = 4.5. However, it is easy to verify that the two predictions have distinct variances, Var_{M_b}(y) > Var_{M_a}(y), indicating that M_b is more uncertain about its prediction. Since a model is expected to estimate watch time correctly with high certainty, we explicitly add the variance of the predicted watch time to the objective function of TPM.
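The comparison above can be reproduced with a few lines; the two eight-rank distributions below mirror the Fig. 3 setup but use made-up probabilities, so the exact numbers are illustrative.

```python
# Two leaf distributions over eight ordinal ranks with the same mean but
# different spread: a concentrated prediction (model A) versus one that
# piles mass on the extremes (model B). Variance under the multinomial
# view serves as the uncertainty measure.

ranks = [1, 2, 3, 4, 5, 6, 7, 8]

p_a = [0.0, 0.0, 0.05, 0.45, 0.45, 0.05, 0.0, 0.0]   # concentrated
p_b = [0.25, 0.05, 0.05, 0.15, 0.15, 0.05, 0.05, 0.25]  # spread out

def mean_var(p, r):
    mean = sum(pi * ri for pi, ri in zip(p, r))
    var = sum(pi * (ri - mean) ** 2 for pi, ri in zip(p, r))
    return mean, var

m_a, v_a = mean_var(p_a, ranks)
m_b, v_b = mean_var(p_b, ranks)
# both means are 4.5, but v_b is much larger than v_a
```

Penalizing the (square root of the) variance in the loss pushes the model towards the concentrated shape while the regression term keeps the mean on target.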

Training with TPM
We now present the training process of TPM. Given a training sample (x, y) and the tree T, we first identify the ordinal rank of the sample and the corresponding leaf node l(y) ∈ L_T. The path from the root to this leaf is then identified, and the sample is associated with the classifiers along the path.
Each classifier takes x as input, and the label is determined by the child node along the path. In this paper, T is a balanced binary tree, and each non-leaf node is assigned a binary classification task. The final objective function is a weighted sum of three components: the classification error along the path, the prediction variance, and the regression error. The training process is illustrated in Alg. 1.
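The three-term objective can be sketched as below. The weights alpha and beta are assumed hyperparameters (the paper's notation for the weighting is not reproduced here), and the closed-form inputs stand in for actual classifier outputs.

```python
import numpy as np

# Hedged sketch of the TPM loss: path classification negative log-likelihood
# + weighted prediction standard deviation + weighted regression error.

def tpm_loss(path_probs, leaf_probs, leaf_values, y, alpha=0.1, beta=0.1):
    # path_probs: probability of taking the correct branch at each internal
    # node along the ground-truth path (their product is Eqn. 1)
    nll = -np.sum(np.log(path_probs))               # classification error
    ey = np.dot(leaf_probs, leaf_values)            # E[y | x, T], Eqn. 2
    std = np.sqrt(np.dot(leaf_probs, (leaf_values - ey) ** 2))  # Var^0.5
    mae = abs(y - ey)                               # regression error
    return nll + alpha * std + beta * mae

loss = tpm_loss(
    path_probs=[0.9, 0.8],
    leaf_probs=np.array([0.1, 0.2, 0.3, 0.4]),
    leaf_values=np.array([1.0, 2.0, 3.0, 4.0]),
    y=3.0,
)
```

Using the standard deviation rather than the raw variance keeps the uncertainty term on the same scale as the regression error, which matches the paper's choice of Var^0.5 in the loss.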

Combined with Backdoor Adjustment
We now present how backdoor adjustment seamlessly adapts to TPM for debiasing recommendation. First, we present the causal graph in Fig. 4 to illustrate the effects of bias on watch time prediction. Denote the confounding factor as z, the feature representations as x, and the watch time as y; the effects between variables are reflected in the edges:
• z → y: confounding factors affect watch time directly. This should be captured by models for an accurate estimation [15, 19].
• z → x: confounding factors affect feature representations implicitly. This should be eliminated so that bias amplification can be avoided.
• x → y: feature representations directly affect watch time, including the effects of user preferences, video content, etc.
Specifically, this indicates that we can perform backdoor adjustment by constructing trees according to the distribution of the confounding factor and training the classifiers on samples split across the corresponding trees. This can be achieved by splitting the scale of the confounding factor into groups and constructing the tree accordingly. Meanwhile, the training data is split according to the groups, and the classifiers in each group are trained with the corresponding split of the data (see Fig. 5 for an example).
Specifically, we inject z into TPM by conditioning the objective on the group, yielding L(x, y, z, T). The training process is illustrated in Alg. 2.
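One way to read the adjustment is the standard backdoor formula P(y | do(x)) = Σ_z P(z) E[y | x, z]: group the confounder, fit one tree per group, and marginalize over the confounder's prior. The sketch below uses hypothetical group boundaries and a stand-in per-group predictor; the paper's exact serving scheme may differ.

```python
import numpy as np

# Sketch of backdoor adjustment over a grouped confounder (e.g. video
# duration): one TPM tree per group, predictions averaged under P(z).

durations = np.random.default_rng(1).uniform(5, 60, size=1000)
num_groups = 4
group_bounds = np.quantile(durations, np.linspace(0, 1, num_groups + 1)[1:-1])

def group_of(duration):
    return int(np.searchsorted(group_bounds, duration, side="right"))

def tree_predict(x, z):
    # hypothetical stand-in for "E[y | x, tree trained on group z]"
    return 10.0 + 2.0 * z

# confounder prior P(z) estimated from data, then marginalized out
prior = np.bincount(
    [group_of(d) for d in durations], minlength=num_groups
) / len(durations)
adjusted = sum(prior[z] * tree_predict(None, z) for z in range(num_groups))
```

With quantile groups the prior is uniform by construction, so the adjusted prediction is simply the average of the per-group expectations.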

Model Architecture
TPM does not restrict the architecture of the classifiers; any binary classifier architecture applies. We adopt a multilayer perceptron as the backbone structure for the classifiers. The architecture is presented in Fig. 6.
Since each non-leaf node in a tree corresponds to a binary classification task, a naive design would build one independently trained classifier per node. However, this would result in a considerably large model and thus does not suit real-life environments. We therefore design a single model for all classification tasks by sharing the parameters of the hidden layers across tasks, while task-specific output layers produce the output for each node.
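The shared-backbone design can be sketched as a single forward pass: one hidden trunk, one sigmoid head per internal node. Layer sizes and the random weights below are illustrative assumptions, not the production architecture.

```python
import numpy as np

# Sketch of the parameter-sharing scheme: a shared hidden layer feeds one
# lightweight task-specific head per internal tree node, so model size
# grows only by one small head per node instead of one full MLP per node.

rng = np.random.default_rng(0)
d_in, d_hid, num_internal_nodes = 16, 32, 7  # 7 internal nodes = 8 leaves

W1, b1 = rng.normal(size=(d_in, d_hid)), np.zeros(d_hid)
heads = [
    (rng.normal(size=(d_hid, 1)), np.zeros(1))
    for _ in range(num_internal_nodes)
]

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)                   # shared ReLU trunk
    logits = [h @ W + b for W, b in heads]             # one logit per node
    return [1.0 / (1.0 + np.exp(-z)) for z in logits]  # p(right branch)

probs = forward(rng.normal(size=(d_in,)))
```

During training, only the heads on the sample's root-to-leaf path receive a label, but gradients from all tasks update the shared trunk.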

EXPERIMENTS
We conduct extensive experiments in both offline and online environments to demonstrate the effectiveness of TPM. Three research questions are investigated:
• First, how does TPM perform in comparison with state-of-the-art methods for watch time prediction in terms of recommendation accuracy?
• Second, how does TPM perform when combined with backdoor adjustment?
• Third, how do tree construction and variance modeling affect the performance of TPM?

Experiment Setup
We now introduce the experiment setup, including the datasets, the methods for comparison, and the evaluation metrics.
• Ordinal Regression [4]: ordinal regression transforms labels into K ranks, and each rank is assigned a classifier predicting whether the prediction exceeds that rank.
As no existing study on watch time prediction has adopted ordinal regression, we build this baseline by applying ordinal regression to watch time prediction directly. For fair comparison, we introduce deconfounding factors into this method in the same way as in D2Q and TPM.
• TPM: the approach proposed in this paper. Since D2Q focuses on duration bias in recommendation, we also set the confounding factor to video duration for comparison.
For fair comparison, the model structures of these approaches are identical except for the output layers and the corresponding loss functions.

Metrics.
As we care about accurate prediction of watch time as well as ranking capability, we adopt two metrics for evaluation:
• MAE (Mean Absolute Error): a typical measurement of regression accuracy. Denoting the prediction as ŷ and the true watch time as y, MAE = (1/N) Σ_i |ŷ_i − y_i|.
• XAUC [19]: this metric evaluates whether the predictions of two samples are in the same order as their true watch times. Pairs are sampled uniformly, and XAUC is the percentage of pairs that are correctly ordered by the predictions.
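The pairwise sampling behind XAUC can be sketched as below; the tie-handling choice (dropping pairs with equal true watch time) is an assumption, since the metric's definition in [19] is only summarized here.

```python
import numpy as np

# Sketch of XAUC: sample pairs uniformly and measure the fraction whose
# predicted ordering agrees with the ordering of the true watch times.

def xauc(y_true, y_pred, num_pairs=10_000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    i = rng.integers(0, len(y_true), size=num_pairs)
    j = rng.integers(0, len(y_true), size=num_pairs)
    keep = y_true[i] != y_true[j]        # ties carry no ordering signal
    concordant = (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j]) > 0
    return concordant[keep].mean()
```

A perfectly order-preserving predictor scores 1.0 and a fully reversed one scores 0.0, independent of the absolute scale of the predictions, which is exactly why XAUC complements MAE.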

Offline Experiments
4.2.1 Comparison with other methods. We compare the performance of the different approaches; the results are listed in Table 3. TPM achieves superior performance over the other methods. We then alter the number of duration groups; the performance of TPM is depicted in Fig. 7. The results indicate that splitting samples by duration indeed helps and that TPM seamlessly accommodates backdoor adjustment. We also run TPM with various numbers of nodes in the tree (Fig. 8). As the figure shows, there is a proper number of nodes for tree construction, which coincides with the intuition that the tree should be constructed according to the task and dataset. Meanwhile, to illustrate the effect of variance-based uncertainty modeling, we alter the weight of the variance term in the loss function (Fig. 9). The results reveal that there is a proper variance weight that leads to lower uncertainty and satisfactory accuracy.

Online Experiments
We also conduct online A/B experiments on a real-world short-video recommender system in the KuaiShou app. As D2Q is a state-of-the-art method for watch time prediction, it is adopted as the baseline. For TPM, the decomposition tree is designed as a complete binary tree with 32 leaf nodes, and the number of duration groups is set to 32.

Experiment Setup.
In the online A/B experiments, the traffic is split uniformly into ten buckets. Two buckets of traffic are assigned to the baseline, and another two are assigned to TPM. As revealed in [19], Kuaishou serves over 320 million users daily, so results collected from 20% of the traffic are highly convincing.
Real-life recommender systems are usually complicated. However, most follow a two-stage framework in which a set of candidate items is retrieved in the first stage and the top-ranking items are selected from the candidates in the ranking stage. Watch time prediction serves as one component of the ranking stage: items are ranked using multiple predictions (including watch time predictions), and those with higher predicted watch time are more likely to be recommended.

Experiment Results.
The experiments ran on the system for 4 days; the results are listed in Table 5. The metrics for the online experiments include accumulated watch time, forward count (forwarding the video to friends), and short-view count (watch time that is short relative to the video duration). In the online experiments, watch time is a core metric while forward count is a constrained metric. TPM outperforms the baseline on the watch time related metrics, which verifies the advantage of TPM in predictive accuracy. Moreover, the number of negative feedbacks is significantly lower with TPM, which coincides with the idea of modeling uncertainty in TPM: producing both accurate and confident predictions. Meanwhile, the differences between TPM and the baseline on the interaction metrics are insignificant and can thus be safely neglected.

CONCLUSION
Watch time prediction is one of the core problems in short-video recommendation, as its accuracy affects the quality of the videos recommended to users and thus user engagement with the platform. We point out four issues that a real-world watch time prediction framework should address: first, the ordinal differences between watch time values should be considered; second, the conditional dependence between video-watching behaviors should be modeled; third, the uncertainty of predictions should be part of the framework; fourth, the framework should take bias amplification into consideration.
To solve these issues simultaneously, we propose TPM (Tree based Progressive regression Model) for watch time prediction. We show that watch time prediction can be decomposed into several conditionally dependent classification problems organized in a tree structure. Meanwhile, the variance of watch time predictions is introduced into the objective function as a measure of model uncertainty, and the bias amplification problem is addressed by seamlessly incorporating backdoor adjustment into TPM.
Extensive offline evaluations and online experiments in real-life recommender systems validate the effectiveness of TPM. Moreover, TPM has been deployed in the Kuaishou app, serving over 300 million users daily.

Figure 2 :
Figure 2: Two examples of decomposition trees in TPM.

We propose a Tree based Progressive Model (TPM) for watch time prediction. The model consists of a tree T for problem decomposition and the corresponding classification models {M_n, n ∈ N_T \ L_T}, where N_T is the set of nodes in T and L_T is the set of leaf nodes in T (see Fig. 2 for an overview).

Figure 3 :
Figure 3: An example of watch time predictions as a distribution

Algorithm 1 Training TPM
1: Input: training data (x_i, y_i), ∀i; a decomposition tree T;
2: Output: the classifiers of the nodes, M_n, ∀n ∈ N_T \ L_T;
3: for each batch do
4: Assign each training sample to a leaf node of T by fitting y_i to the ordinal ranks;
5: Assign (x_i, y_i) to the classifiers along the corresponding path P_i;
6: Compute the log-likelihood of (x_i, y_i) belonging to path P_i as in Eqn. 1;
7: Compute E(y | x, T) and Var(y | x, T) as in Eqn. 2 and Eqn. 4;
8: Compute the final objective function L(x, y, T) as in Eqn. 5;
9: Update M_n, ∀n, by minimizing L(x, y, T);
10: end for

For each classifier, a sample belonging to the right-hand child node is treated as a positive sample. Consider the example in Fig. 2: given a sample (x, y) with y = 0.8, the sample is associated with classifiers M_0 and M_2, and it is a positive sample for both. The objective function of TPM consists of three components:
• Classification error of the classifiers along the path: TPM maximizes the likelihood w.r.t. P(ŷ_T ∈ l(y) | x, T), where ŷ_T is the predicted watch time.
• Prediction variance: Var(ŷ_T | x, T); for easier optimization, we use the standard deviation Var(ŷ_T | x, T)^0.5 in the loss function.
• Regression error: a loss evaluating the difference between the final prediction and the ground truth, |y − E(ŷ_T)|.

Figure 4 :
Figure 4: Causal graph illustrating the confounding effect in watch time prediction. z, x, and y represent the confounding factor, the input features, and the watch time respectively.

Figure 5 :
Figure 5: An example of a decomposition tree in TPM when backdoor adjustment is conducted. Each node is associated with both the watch time (y) and the confounding factor z.

Algorithm 2 Training TPM with Backdoor Adjustment
1: Input: training data (x_i, y_i), ∀i; a confounding factor z; a decomposition tree T;
2: Output: the classifiers of the nodes, M_n, ∀n ∈ N_T \ L_T;
3: for each batch do
4: Assign each training sample to a leaf node of T by matching (y_i, z_i) to the ordinal ranks;
5: Assign (x_i, y_i, z_i) to the classifiers along the corresponding path P_i;
6: Compute the log-likelihood of (x_i, y_i) belonging to path P_i by adding z to Eqn. 1;
7: Compute E(y | x, z, T) and Var(y | x, z, T);
8: Compute the final objective function L(x, y, z, T);
9: Update M_n, ∀n, by minimizing L(x, y, z, T);
10: end for

Figure 6 :
Figure 6: Network architecture of the classifiers in TPM, where o_n is the output for the task assigned to node n.

Figure 7: The performance of TPM with various numbers of groups when the number of leaves is 32.
Figure 8: The performance of TPM with various numbers of nodes in the tree.

Figure 9 :
Figure 9: The performance of TPM with different uncertainty weights in training when the number of leaves is 32.

Table 2 :
Notations:
N_T: the set of nodes in T
L_T: the set of leaf nodes in T
M_n: the classifier assigned to node n
z: the confounding factor in the causal graph
P_l: the path from the root to leaf node l
n_l(i): the node at level i along path P_l
h(l): the depth of leaf node l
t_m, ∀m: the ordinal ranks of watch time

Samples are weighted with inverse propensity scores in the objective function. For example, exposure propensity has been utilized for solving the missing-not-at-random problem [13]. However, the performance of this method is sometimes unstable due to the high variance of the estimated propensity, which can be alleviated by propensity clipping or by doubly-robust learning [17] via data imputation.
• causal embedding: this method [1] decomposes the related embeddings into an unbiased component and a biased component. Both components are used at the training stage, and the biased component is discarded at inference to obtain an unbiased prediction.
• causal intervention: the causes of the biases are introduced into the method, and interventions are conducted to eliminate their effects on recommendation. Randomization and backdoor adjustment are two representative methods for causal intervention. However, it is costly to conduct randomized experiments on real-life recommender systems, so backdoor adjustment is preferred in practice.

Table 3 :
Comparison between TPM and other approaches

Table 4 :
TPM with different components on KuaiRec.

Ablation Studies. We conduct ablation studies on TPM; the results are listed in Table 4. The comparison between TPM with and without the MSE loss indicates that adding the MSE loss improves MAE without sacrificing much on the ranking metric. The comparison between TPM with and without the variance loss indicates that adding the variance constraint improves accuracy. Finally, comparing TPM with and without deconfounding factors indicates that video-watching behaviors are indeed easily affected by these factors; TPM without deconfounding factors remains competitive because it considers the ordinal relationships, the conditional dependencies, and the variances.

Table 5 :
Comparison between TPM and the baseline online; all values are relative improvements of TPM over the baseline. Watch Time and Forward are positive metrics (higher is better); Short View is a negative metric (lower is better). Forward is a constraint metric: an experiment with more than a 1% drop in a constraint metric is not acceptable. For online A/B tests, a 0.1% improvement in watch time is very significant.