Optimizing Adaptive Video Streaming with Human Feedback

Quality of Experience (QoE)-driven adaptive bitrate (ABR) algorithms are typically optimized using QoE models based on the mean opinion score (MOS), a principle that may not account for user heterogeneity on rating scales, resulting in unexpected behaviors. In this paper, we propose Jade, which leverages reinforcement learning with human feedback (RLHF) technologies to better align with users' opinion scores. Jade's rank-based QoE model considers the relative values of user ratings to interpret the subjective perception of video sessions. We implement linear-based and Deep Neural Network (DNN)-based architectures to satisfy both accuracy and generalization ability. We further propose entropy-aware reinforced mechanisms for training policies with the integration of the proposed QoE models. Experimental results demonstrate that Jade performs favorably on conventional metrics, such as quality and stall ratio, and improves QoE by 8.09%-38.13% under different network conditions, emphasizing the importance of user heterogeneity in QoE modeling and the potential of combining linear-based and DNN-based models for performance improvement.


INTRODUCTION
In recent years, video services have become an integral part of people's daily lives, driven by the rapid proliferation of network technologies like 5G and users' increasing demand for content expression. According to The 2022 Global Internet Phenomena Report [57], video traffic accounted for a substantial 65.93% share of total internet traffic in the first half of 2022, a 24% increase compared to the same period in 2021. Among video technologies, adaptive video streaming has become one of the mainstream approaches for network video delivery, accounting for 40.24% of the total traffic.
Unfortunately, our analysis of the SQoE-IV assessment database [16], which includes ratings from 32 discerning users, reveals significant diversity in opinion scores and rating scales. Such findings challenge the effectiveness of MOS-based QoE models and emphasize the subjective nature of user perception (§2.2). We therefore consider a novel methodology to mitigate the bias in human feedback induced by user heterogeneity, aiming to achieve the best possible performance for all users.
We propose Jade, which leverages Reinforcement Learning with Human Feedback (RLHF) and a rank-based QoE model to learn a Neural Network (NN)-based ABR algorithm (§3). Unlike previous approaches, we consider the relative values of user ratings to interpret the subjective perception of video sessions. Based on the perceived users' feedback, Jade solves two key challenges: training a rank-based QoE model (§4), and designing a deep reinforcement learning (DRL)-based method for ABR algorithm generation (§5).
Unlike previous MOS-based QoE methods [29, 65], we propose a rank-based QoE model for video session ranking. Our key idea is to learn a "ranked function" aligned with the rank of each user's opinion scores. Modeled as a Deep Structured Semantic Model (DSSM) [21], we adopt a pairwise training methodology and learn two NN architectures, i.e., linear-based and Deep NN (DNN)-based, to tame the models' generalization ability (§4.1). Results show that the DNN-based QoE model achieves the highest accuracy but exhibits imperfections in generalization and stability, while the linear-based model behaves in the opposite way (§4.2).
We present a DRL-based approach [70] to generate an NN-based ABR algorithm based on the trained QoE models. In detail, we incorporate two entropy-aware mechanisms, i.e., smooth training and online trace selection, to fully leverage the advantages of the QoE models. The smooth training method leverages the linear-based QoE model in the early stage and transitions to the DNN-based QoE model in the later stage, with an entropy-related parameter combining their outputs for seamless integration across training phases (§5.1). The online trace selection scheme adopts online learning techniques to dynamically choose the appropriate network trace for training (§5.2).
We evaluate Jade against recent ABRs on both slow-network and fast-network paths using trace-driven simulation (§6.2). On slow-network paths, Jade outperforms existing algorithms, including RobustMPC and Pensieve, by 22.5%-38.13%. Compared with Comyco, Jade achieves a relative QoE improvement of 8.09%. Jade reaches the Pareto Frontier, indicating optimal trade-offs between different objectives. On fast-network paths, Jade improves QoE by at least 23.07% compared to other QoE-driven ABR algorithms, highlighting the importance of considering both the "thermodynamics" and the "kinetics" of the process in QoE modeling. Further experiments demonstrate the potential risks of relying on "imperfect QoE models" without proper caution, as doing so may result in misguided strategies and undesired performance. We show that, through the synergistic fusion of linear-based and DNN-based QoE models, Jade outperforms approaches that use linear-based or DNN-based models alone (§6.3). We sweep the parameter settings of the QoE model comparison and validate the effectiveness of smooth training and the online trace selector in enhancing performance and accelerating the learning process by selecting useful traces (§6.4).
In general, we summarize our contributions as follows:
• We identify limitations in recent QoE models, particularly the use of MOS, which lacks consideration for user heterogeneity and leads to sub-optimal policies in the learned ABR algorithms (§2).
• We propose Jade, the first RLHF-based ABR system that synthesizes human feedback to develop an NN-based ABR algorithm, including a rank-based QoE model and an entropy-aware DRL-based learning process (§3, §4, §5).
• We implement Jade and evaluate its performance across various network conditions. Results demonstrate its superior QoE performance in diverse scenarios (§6).

BACKGROUND AND MOTIVATION

Adaptive Video Streaming
The conventional adaptive video streaming framework consists of a video player with a limited buffer length (typically 40 to 240 seconds) and an HTTP server or Content Delivery Network (CDN) [10]. On the server side, raw videos are chunked into segments, typically lasting 2 to 10 seconds [44, 69]. These segments are then encoded at different bitrates or quality levels before being stored on the designated storage server [25]. Adaptive Bitrate (ABR) algorithms within streaming frameworks such as HTTP Live Streaming (HLS) [2] and DASH [13] utilize throughput estimation [34] and playback buffer occupancy [32] to determine the most appropriate bitrate level.
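To make the decision process concrete, the following is a minimal sketch of a rule-based bitrate selection combining the two signals above; the bitrate ladder and thresholds are illustrative assumptions, not values from the paper.

```python
# Hypothetical bitrate ladder (kbps); real services use their own encodings.
BITRATES_KBPS = [300, 750, 1200, 1850, 2850, 4300]

def select_bitrate(throughput_est_kbps: float, buffer_s: float) -> int:
    """Pick the highest bitrate level sustainable by the estimated
    throughput, stepping down one level when the playback buffer is low."""
    candidates = [b for b in BITRATES_KBPS if b <= throughput_est_kbps]
    level = BITRATES_KBPS.index(candidates[-1]) if candidates else 0
    if buffer_s < 5.0 and level > 0:   # low buffer: choose conservatively
        level -= 1
    return level
```

Throughput-based and buffer-based ABR algorithms discussed later (§7) each emphasize one of these two inputs; learned approaches such as Jade replace the hand-tuned rule with a trained policy.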

Key findings
Off-the-shelf QoE for adaptive video streaming is calculated as the arithmetic mean of subjective judgments on a 5-point absolute category rating (ACR) quality scale [53], or on a continuous scale ranging from 1 to 100 [33]. However, such schemes fail to consider the effect of user heterogeneity on opinion scores. To this end, we ask: do recent mean opinion score (MOS)-based QoE models work on the right track? If they do, the following conditions should be satisfied: i) each user's opinion score for a given video session should be consistent, and ii) a user should assign similar scores to similar video sessions.
Dataset. Recent publicly available QoE datasets either lack comprehensive feedback information [8, 17] or have not yet been released (e.g., Ruyi [76]). We use the SQoE-IV assessment database, which contains 1,350 subjectively rated streaming video sessions generated from diverse video sources, video codecs, network conditions, ABR algorithms, and viewing devices [16]. Each video session is assessed by a panel of 32 users, who provide their discerning ratings. The dataset thus consists of 43,200 opinion scores in total, which is sufficient for data analysis.
Video-wise analysis. We start by investigating the correlation between videos and opinion scores, where the videos are sessions produced by different ABR algorithms. Figure 1(a) shows the CDF of opinion scores over different videos randomly picked from the SQoE-IV dataset. Notably, diverse opinion scores are observed for the same videos: video 1 exhibits scores ranging from 60 to 100, and other videos (video 2 and video 4) even display scores ranging from 0 to 100, covering the full range from minimum to maximum. These findings highlight that users hold varying opinions about the same video, with a variance that may exceed our initial expectations. Figure 1(b) illustrates the detailed proportion occupied by each range, where the difference metric represents the gap between the maximum and minimum opinion scores for each video. We observe that almost all videos are scored with large differences, as few videos result in a difference below 50. Hence, users' ratings of each session are diverse rather than consistent. Such high-variance results motivate us to further explore the key reasons through a user-level analysis.
User Heterogeneity. We report a box plot of 16 users' feedback in Figure 1(c), where all results are sorted and collected from the same video set. Surprisingly, we find diverse rating scales at the user level. For instance, some users rate opinion scores starting from 0 (user 0) while others start their evaluation at 60 (user E). At the same time, we also observe diverse preferences across users: user F prefers rating higher scores over all sessions, but most users are conservative, as the average score is often lower than the median score over all sessions. Detailing users' opinions in Figure 1(d), we observe a lack of uniformity in the criteria each user applies when assigning the same score: given sessions with the same video quality (i.e., VMAF [52]) and rebuffering time (i.e., the most essential metrics for evaluating QoE [17]), users provide varying feedback. When the video quality reaches 70 and the rebuffering time exceeds 1.0 seconds, the scores provided by users are heterogeneous, spanning from 10 to 80. Thus, we argue that users assign diverse opinion scores to the same session. In other words, the perception of the same score, such as 80, may vary among users.
To sum up, the expected results, i.e., consistent opinion scores from users for each video session and similar scores assigned to similar video sessions, do not match the observed results. Therefore, despite the outstanding performance of recent work, such MOS-based QoE metrics may not be on the right track.

JADE OVERVIEW
In this work, we utilize Reinforcement Learning with Human Feedback (RLHF) [7] to address the problem at hand by leveraging a rank-based QoE model as the reward signal. Our key idea is a ranking-based method that deliberately ignores the absolute values of scores and focuses only on the relative values assigned by each user: if a user rates video session A higher than video session B, we interpret this as a superior experience for A compared to B according to the user's subjective perception. Unlike recent work, we require a rank-based QoE model that accurately generates "virtual values" aligned with the relative ranking of users' ratings. Leveraging the trained QoE model as the reward signal, we can generate NN-based ABR algorithms with state-of-the-art Deep Reinforcement Learning (DRL) [59, 70] methods, as this problem inherently falls within the purview of DRL.
We propose Jade, which, to the best of our knowledge, can be viewed as the first RLHF-based ABR algorithm. The big picture of Jade is presented in the overview figure.

RANK-BASED QOE MODEL
In this section, we propose a practical rank-based QoE model that aligns with users' feedback scores, design a pairwise loss to train the models, and introduce a novel evaluation metric, the Identity Rate, to validate the proposed QoE model.

Model Design
Different from prior work that integrates all scores into the mean opinion score [29, 65], our key idea is to create a "ranked function" that maps each video session to an opinion score, where the function is learned from pairs of samples in queries previously rated by the same user. We now give a formal description.
Definition 4.1. We define a set of queries Q = {q_1, q_2, ..., q_N}. Each query q_i is composed of a set of video sessions S_i = {s_i,1, s_i,2, ..., s_i,M}, where each video session records the playback information of the entire session. For each user u, the session set S_i corresponds to a list of scores O_i^u = {o^u(s_i,1), ..., o^u(s_i,M)}, where o^u(·) is the opinion score judged by user u for session s_i,j.
In this paper, the opinion score o^u ranges from 0 to 100; the higher the score, the higher the user's satisfaction with the video session. The playback information contains several underlying metrics, such as video quality, video bitrate, and rebuffering time, for each video chunk. Given a rating pair {o^u(s_i), o^u(s_j)} from a user u, we further define a binary scalar y_ij^u, which denotes the relative relationship between the two scores.
The detailed explanation is listed in Eq. 1. Following the definition, our goal is to train a QoE model f whose outputs f(s_i), f(s_j) for the two given sessions are as consistent as possible with y_ij^u. In other words, if y_ij^u ≠ 0, we aim to generate a score function f such that the sign of f(s_i) − f(s_j) matches y_ij^u.
Model Architecture. As demonstrated in Figure 3, we propose a pairwise learning approach [49]. Considering generalization ability, we implement a linear-based and a DNN-based model.
DNN-based (QoE_DNN): As shown in Figure 4, our proposed DNN-based QoE model takes x = {V, B, T} as input, where V is the VMAF [52] of the past k chunks, B is the sequence of past picked bitrates, and T is the past k chunks' rebuffering time. Consistent with the setting of SQoE-IV, we set k = 7. The model's output f(x) is a single scalar representing the score of the given video session. The model is simple yet effective, with three feed-forward network (FFN) layers of 128 neurons each and a single scalar output without an activation function. We further discuss the performance of applying different feature numbers in §6.4.
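The DNN-based architecture described above can be sketched as a plain forward pass; this is a minimal NumPy illustration of the input/output shapes (past k = 7 chunks of VMAF, bitrate, and rebuffering time in, one scalar out), with random placeholder weights rather than trained parameters.

```python
import numpy as np

# Shapes follow the paper's description: k = 7 past chunks of three signals,
# three 128-neuron FFN layers, and a linear scalar head (no activation).
rng = np.random.default_rng(0)
K = 7
dims = [3 * K, 128, 128, 128, 1]
params = [(rng.standard_normal((n_in, n_out)) * 0.05, np.zeros(n_out))
          for n_in, n_out in zip(dims[:-1], dims[1:])]

def qoe_dnn(vmaf, bitrate, rebuf):
    """Forward pass sketch of the DNN-based QoE model: ReLU hidden layers,
    linear scalar output."""
    x = np.concatenate([vmaf, bitrate, rebuf]).astype(float)
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)   # ReLU on hidden layers only
    return float(x[0])
```

The actual model in the paper is trained with the pairwise loss of Eq. 4; the sketch only fixes the interface and layer sizes.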
Linear-based (QoE_lin): We choose a linear form widely employed in recent studies [71]. Specifically, the linear-based QoE model, i.e., QoE_lin, can be written as:

QoE_lin = α · q(R_t) − β · T_t + γ · [q(R_{t+1}) − q(R_t)]_+ − δ · [q(R_{t+1}) − q(R_t)]_−,

where the factors are computed from the current video quality and rebuffering time, [q(R_{t+1}) − q(R_t)]_+ is the positive video quality smoothness, [q(R_{t+1}) − q(R_t)]_− denotes the negative quality smoothness, and α, β, γ, δ are the learnable parameters. In practice, we consider a "surrogate scheme": we first construct a lightweight NN with a fully connected layer of 4 neurons, corresponding to the weights of the 4 metrics, outputting a single scalar; note that we do not apply any activation function here. We then train the NN via vanilla gradient descent and finally read the corresponding parameters from the trained network. The linear-based model architecture is also demonstrated in Figure 4.
Loss Function. Inspired by DSSM [21], both the DNN-based and linear-based QoE models are optimized via the pairwise loss function L_f, which is listed in Eq. 4.
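A per-session evaluation of the linear form above can be sketched as follows; the coefficient values here are illustrative defaults, not the trained parameters, and the positive/negative smoothness terms use the convention [x]_+ = max(x, 0), [x]_− = max(−x, 0).

```python
def qoe_lin(vmaf, rebuf, alpha=0.8, beta=1.0, gamma=0.3, delta=0.3):
    """Linear QoE sketch: per-chunk quality minus rebuffering penalty,
    with separate weights for positive and negative quality smoothness.
    Coefficients are placeholders, not the values learned in the paper."""
    score = 0.0
    for t in range(len(vmaf)):
        score += alpha * vmaf[t] - beta * rebuf[t]
        if t + 1 < len(vmaf):
            diff = vmaf[t + 1] - vmaf[t]
            score += gamma * max(diff, 0.0) - delta * max(-diff, 0.0)
    return score
```

The surrogate scheme in the paper learns the four weights by gradient descent on the pairwise loss rather than fixing them by hand as done here.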
Here σ(·) is a Sigmoid function, σ(x) = 1/(1 + e^{−x}), f(·) is the output of the proposed QoE model, and y_ij^u is the relative relationship between the opinion scores o^u(s_i) and o^u(s_j) (see Eq. 1). The function consists of two parts. The first part is the pairwise loss, one of the common methods in the recent RLHF field [49]: given two video sessions s_i and s_j, it drives the NN to output discriminative results in the same direction as y_ij^u. The second part is the align loss, which aligns the predicted results when y_ij^u = 0; the align loss is disabled when y_ij^u ≠ 0, and vice versa. Note that the network parameters of f(s_i) and f(s_j) are shared in our model.
Training Methodologies. Following Figure 4, we summarize the overall training process in Alg. 1, which contains two phases. The training set is generated on the fly during training: we randomly pick K samples from different queries, users, and video sessions (Lines 3-10), and the QoE model is then optimized on the generated training set (Lines 11-13). We discuss the performance of different batch sizes K in §6.4.
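The two-part loss above can be sketched per pair as follows; this assumes a RankNet-style logistic form for the pairwise term and a simple squared-difference align term, which may differ in detail from the paper's Eq. 4.

```python
import math

def pairwise_loss(fi, fj, y):
    """Per-pair loss sketch: a logistic (RankNet-style) ranking term when
    y = +1/-1, and an 'align' term pulling the two predictions together
    when y = 0 (a tie). Exact weighting is an assumption."""
    if y == 0:
        return (fi - fj) ** 2              # align loss: ties should match
    p = 1.0 / (1.0 + math.exp(-(fi - fj)))  # P(session i ranks above j)
    target = 1.0 if y > 0 else 0.0
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))
```

Because the same network f scores both sessions (shared parameters, as in a DSSM), minimizing this loss over sampled pairs pushes f toward the user's ranking rather than toward absolute scores.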

Evaluation Metric. We validate the QoE models with the Identity Rate, where I(·) is a binary indicator that returns 1 if the result of the QoE model and the opinion scores are in the "same direction". In particular, when y_ij^u = 0 (i.e., a tie), we accept the pair if the relative gap between the two QoE scores is less than 5%.
Implementation. The QoE models are constructed with TensorFlow [3], and we use Adam [36] to optimize the NNs. As suggested by recent work [29], we randomly split the SQoE-IV database into two parts: 80% for training (i.e., 5,771,264 pairs) and 20% for testing (i.e., 1,442,941 pairs).
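The Identity Rate described above can be computed as follows; the 5% tie tolerance follows the text, while the exact normalization of the relative gap is an assumption.

```python
def identity_rate(pred_pairs, labels, tie_tol=0.05):
    """Fraction of pairs where the model's ordering matches the user's.
    pred_pairs: list of (f(s_i), f(s_j)) model outputs.
    labels: +1 if the user rated i above j, -1 if below, 0 for a tie.
    Ties count as hits when the relative gap of predictions is under 5%."""
    hits = 0
    for (fi, fj), y in zip(pred_pairs, labels):
        if y == 0:
            denom = max(abs(fi), abs(fj), 1e-8)   # avoid divide-by-zero
            hits += abs(fi - fj) / denom <= tie_tol
        else:
            hits += (1 if fi > fj else -1) == y
    return hits / len(labels)
```

A model that always predicts the user's ordering scores 1.0; the paper reports 75.47% for the DNN-based model on the held-out pairs.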
Results. The comparison results are illustrated in Table 1. The QoE models that do not utilize any VQA technique (i.e., bitrate signal only), such as MPC's [71] and Pensieve's [44], perform relatively poorly. Fugu's QoE model [69] employs SSIM and outperforms Wei et al.'s SSIM-based model [66]. The results of KSQI [15] and Comyco [29] show that VMAF is the best VQA metric. Moreover, we observe that the DNN-based QoE model outperforms all other models, achieving the highest Identity Rate of 75.47%. The linear-based VMAF model also performs well, with a consistency rate of 67.25%, making it the best-performing linear model. MOS-OPT, the offline optimum computed directly from MOS, serves as the upper bound of using MOS. Surprisingly, it performs worse than the DNN-based QoE model, with a decrease of 4.4%. Such findings prove that the vanilla MOS-based approach [29] fails to accurately characterize users' QoE.
Generalization. We visualize the behavior of the linear-based and DNN-based models in Figure 5. As expected, both types of models follow domain knowledge: the normalized score is maximal when VMAF is 100 and no rebuffering occurs. However, the DNN-based QoE model faces challenges in generalization compared with the linear-based model, with a more fluctuating training process that lacks stability. For instance, with VMAF fixed at 10, the DNN-based model may predict higher scores as rebuffering time increases, violating the domain principles of ABR tasks; the linear-based model, in contrast, consistently degrades its predicted scores with increasing rebuffering time. Similar abnormal improvements are observed with VMAF at 80, with noticeable changes occurring at rebuffering times of 0.8 seconds.
In summary, the DNN-based model is "imperfect", showing unstable generalization, while the linear-based model has acceptable generalization abilities but performs worse than the DNN-based model in accuracy.

REINFORCED ABRS WITH QOE MODELS
In this section, we describe how to generate an NN-based ABR algorithm using the trained QoE models. The overview of the training process is shown in Figure 6. We incorporate two entropy-aware mechanisms, a smooth training approach and an online trace selection scheme, to fully leverage the advantages of the QoE models.

Smooth Training
To avoid the effect of imperfect models, we adopt the linear-based QoE model (denoted as QoE lin ) to train the NN policy in the early stage of training, since the model has a stronger generalization ability which can lead the strategy into the "near-optimal region".In the later stage, we use the DNN-based QoE model (denoted as QoE DNN ) to continue training the strategy since it can provide more accurate QoE scoring results in such regions.
Here, an intuitive scheme is a two-stage approach that learns the policy with the two QoE models separately in different training phases. However, due to the differences in the meaning of the two models' outputs, seamless integration may not be straightforward. We therefore propose a weighted-decay approach for smooth training. A parameter α connects the outputs of the linear-based and DNN-based models, so the surrogate QoE metric can be defined as QoE* = α · QoE_lin + (1 − α) · QoE_DNN. In this work, α is controlled by the entropy of the policy π over the trajectory of the picked trace τ_e.
Specifically, α is the policy entropy over the trajectory, normalized by its maximum value log|A|, where s denotes the states, a is the bitrate action, and |A| is the number of bitrate levels. As shown, for any on-policy DRL method [63], the entropy of the strategy π_θ gradually decreases during training, which in turn reduces the weight of QoE_lin and increases the importance of QoE_DNN.
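The weighted-decay combination can be sketched as below; the mapping of α to the entropy normalized by log|A| is an assumption consistent with the text (high entropy early in training favors QoE_lin, low entropy later favors QoE_DNN).

```python
import math

def policy_entropy(probs):
    """Shannon entropy of a bitrate action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def surrogate_qoe(qoe_lin, qoe_dnn, probs, n_actions):
    """Smooth-training sketch: alpha = H(pi) / log|A| in [0, 1], so the
    surrogate reward decays from QoE_lin toward QoE_DNN as the policy
    becomes more deterministic. The exact alpha schedule is an assumption."""
    alpha = policy_entropy(probs) / math.log(n_actions)
    return alpha * qoe_lin + (1.0 - alpha) * qoe_dnn
```

With a uniform (maximum-entropy) policy the reward is exactly QoE_lin; with a deterministic policy it is exactly QoE_DNN, giving the seamless transition the section describes.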

Online Trace Selection
Moreover, we use online learning techniques to prioritize network traces from the training set and optimize ABR strategies without prior knowledge. To avoid extrapolation errors of the QoE_DNN model, we use the policy entropy H_t(π) at epoch t as feedback and select traces with higher entropy outcomes. Higher entropy leads to a larger α on QoE*, which in turn yields a robust and safe QoE metric dominated by QoE_lin. In practice, we model network trace selection as a multi-armed bandit problem and build the module on the discounted UCB (Upper Confidence Bound) algorithm [19], picking the network trace with the higher entropy outcome for training. The key quantities of the module are the discounted pick count N_t(i) = Σ_{s=1}^{t} γ^{t−s} I(a_s = i) and the discounted mean entropy X̄_t(i) = (1/N_t(i)) Σ_{s=1}^{t} γ^{t−s} H_s I(a_s = i), in which γ is the discount factor, and I(·) is set to 1 when trace i has been picked at epoch s (i.e., a_s = i) and 0 otherwise.
Here ξ > 0 is a hyper-parameter that controls the probability of exploration, and the value of network trace i at epoch t is defined as U_t(i) = X̄_t(i) + ξ · sqrt(log(Σ_j N_t(j)) / N_t(i)). For each epoch t, the action is determined as a_t = argmax_i U_t(i). In this paper, we set γ = 0.999 and ξ = 0.2 to balance the trade-off between exploration and exploitation [72].
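The trace selector can be sketched as a discounted UCB bandit over traces; rewards are the observed policy entropies, and old observations are decayed by γ so the selector tracks the moving training state. Class and method names are illustrative, not from the paper.

```python
import math

class DiscountedUCB:
    """Discounted-UCB sketch for online trace selection (gamma, beta follow
    the paper's gamma = 0.999 and xi = 0.2 settings)."""
    def __init__(self, n_traces, gamma=0.999, beta=0.2):
        self.gamma, self.beta = gamma, beta
        self.value = [0.0] * n_traces   # discounted entropy sums
        self.count = [0.0] * n_traces   # discounted pick counts

    def select(self):
        total = sum(self.count)
        if total == 0:
            return 0
        def ucb(i):
            if self.count[i] == 0:
                return float("inf")     # force exploration of unseen traces
            mean = self.value[i] / self.count[i]
            bonus = self.beta * math.sqrt(max(math.log(total), 0.0) / self.count[i])
            return mean + bonus
        return max(range(len(self.count)), key=ucb)

    def update(self, picked, entropy):
        for i in range(len(self.count)):
            self.value[i] *= self.gamma  # decay all traces each epoch
            self.count[i] *= self.gamma
        self.value[picked] += entropy
        self.count[picked] += 1.0
```

Traces that recently produced high policy entropy (i.e., where the policy is still uncertain) get picked more often, matching the selector's goal of steering training toward the safe QoE_lin-dominated regime.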

Learned Policies with DRL
We employ a DRL approach to train an ABR policy that maximizes the score obtained from QoE*. Taking Dual-Clip PPO [70] as the basic training algorithm (denoted as L_PPO), the combined objective function L_A augments L_PPO with the policy entropy, L_A = L_PPO + β · H(π_θ), where β is an adaptive entropy weight [37] that is dynamically adjusted w.r.t. the target entropy H_target during training via β ← β + η(H_target − H(π_θ)), in which η is the learning rate. In this paper, Jade's state s_t incorporates the current buffer occupancy, the past chunk's VMAF, the past 8 chunks' throughputs and download times, and the next chunks' video sizes and VMAF scores. The action a_t is a vector indicating the bitrate selection probabilities. To capture features from diverse input types, Jade's actor-critic network [64] adopts a combination of multiple FFN layers with 128 neurons to extract and combine the underlying features. Jade's learning tools are built with TensorFlow 2.8.1 [3]. We set the learning rate η = 10^−4 and H_target = 0.1.
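The adaptive entropy weight can be sketched as a single update rule; the exact rule from [37] may differ, and the direction here is the natural one (raise β when entropy falls below the target, to push the policy back toward exploration).

```python
def update_entropy_weight(beta, entropy, target_entropy, lr=1e-4):
    """Adaptive entropy-weight sketch: move beta toward encouraging the
    policy entropy to track target_entropy. lr mirrors the paper's
    learning rate; the update form itself is an assumption."""
    return max(0.0, beta + lr * (target_entropy - entropy))
```

Applied once per epoch alongside the policy update, this keeps the entropy (and therefore the smooth-training weight α of §5.1) from collapsing too early.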

Jade vs. Existing ABR algorithms
We train Jade once and employ trace-driven simulation to compare Jade with several existing ABR algorithms across slow-network and fast-network paths.
Slow-network paths. Figure 7 shows the overall performance of Jade and recent works over the HSDPA network traces. Jade stands as the best approach, as its QoE outperforms existing ABR algorithms by 22.56%-38.13% (see Figure 7(d)). Specifically, it performs better than the state-of-the-art heuristic RobustMPC (29.22%) and the RL-based approach Pensieve (28.82%); both are optimized by a linear combination of QoS metrics [71] and thus yield similar performance. Compared with the closest scheme, Comyco, Jade achieves a relative QoE improvement of 8.09%, while Comyco itself outperforms the remaining algorithms. Analyzing the Identity Rate of each QoE model in Table 1, we highlight the importance of using an accurate QoE model as the reward, as its accuracy is positively correlated with the performance of the learned ABR algorithm.
Furthermore, Figure 7(a) demonstrates that almost all ABR schemes perform within the acceptable region, with a stall ratio of less than 5% [47]. Among them, Comyco, BOLA, Jade, and Pensieve achieve the Pareto Frontier, indicating that they provide optimal policies in terms of trade-offs between different objectives, while Jade reaches the best QoE performance and stands as the Top-2 scheme in terms of QoS as well. Figures 7(b) and 7(c) show the detailed behavior in terms of buffer occupancy and quality smoothness. While Jade may not reach the upper-right corner of the region, it consistently makes bitrate decisions within a reasonable range for QoS considerations, ranking at mid-level among all candidates.
Fast-network paths. Figure 8 illustrates the performance of Jade and the baselines on the FCC-18 network dataset, which represents a wide range of network conditions that closely resemble real-world networks. Unsurprisingly, the CDF curve of QoE in Figure 8(d) shows that Jade surpasses the other ABR schemes, achieving significant QoE improvements of at least 23.07%. Impressively, Jade improves the average QoE_DNN by 36.32%, 64.66%, 79.95%, and 1.47× compared to state-of-the-art QoE-driven ABR algorithms such as RobustMPC, Comyco, Fugu, and Pensieve, respectively. Investigating QoS in Figure 8(a), we observe that Jade balances the trade-off between quality and stall ratio well, maintaining the session within an acceptable buffer range (see Figure 8(d)). Moreover, Jade also shows QoE improvements on the fast-network paths, ranging from 1.86% to 2.53%, compared to QoS-driven ABR algorithms such as Rate-based, BBA, and BOLA. At the same time, Figure 8 reveals that Jade performs slightly better than the closest QoE-driven approach, RobustMPC (+0.2% quality, -1.14% stall ratio, -1.9% smoothness), in terms of QoS metrics. However, it significantly outperforms RobustMPC in
terms of QoE, which indicates that the QoS result does not necessarily represent QoE -the duration of buffering and bitrate switching events directly impact users' experience, which cannot be reflected by traditional linear-based QoE models like QoE lin .Thus, applying a sequential QoE model like QoE DNN is reasonable, which can not only consider the "thermodynamics" of the outcome but also appreciate the "kinetics" of the process.

Jade vs. other QoE-driven ABR algorithms
We compare the performance of Jade with existing QoE-driven ABR approaches in Figure 9, where the results are collected from the FCC network dataset. Note that MPC refers to RobustMPC [71]. Here we show three key findings. First, after changing the goal of MPC to maximize QoE_DNN (i.e., MPC-DNN), the overall performance drops significantly, almost doubling the stall ratio compared with MPC optimized by QoE_lin (i.e., MPC-lin). Such results indicate that we should exercise greater caution in using such "imperfect QoE models"; otherwise, unwavering trust in these models may result in misguided strategies and undesired outcomes. As such, the utilization of QoE_DNN to enhance heuristics remains an open question. Second, both Comyco-lin and Jade-lin employ QoE_lin as the reward signal, leading to similar performance in terms of video quality and stall ratio; the only difference between the two is the training methodology. Third, Jade improves QoE_DNN by 18.07% compared with PPO-ptx while reducing the stall ratio by 12.72%, demonstrating the strength of learning from a clean slate. Furthermore, Jade-DNN performs worse than Jade-lin, with a decrease in average QoE_DNN of 21.17%; the key reason is that Jade-DNN incurs too many rebuffering events (stall ratio: 4.1%) and fails to obtain higher video quality, just like MPC-DNN (stall ratio: 11.47%). Finally, through the fusion of linear-based and DNN-based QoE models, Jade not only attains the Pareto Frontier on QoS but also achieves superior QoE outcomes across all considered scenarios, demonstrating its superiority over recent schemes.

Ablation Study
We conduct several experiments to better understand Jade's hyperparameter settings and entropy-aware mechanisms ( §5).
Comparison for QoE models.
Effectiveness of the trace selector. We present the detailed results of utilizing and not utilizing (i.e., randomly sampling) the trace selector in Figure 11. Note that we also summarize the performance over the validation set every 300 epochs. Results show that selecting useful traces not only improves overall performance but also accelerates the learning process. Moreover, the PDF curve of the trace selection demonstrates varying probabilities of selecting different traces during training. Traces with an average bandwidth of 1-3 Mbps, which technically cover the bitrate ladder (§6.1), are frequently chosen, showing the effectiveness of the proposed selector.

RELATED WORK
QoE Metrics. Recent QoE models prove the effectiveness of machine learning techniques in mapping streaming features to mean opinion scores, using regressive models [29, 65], DNNs [14], and random forests [54]. However, such schemes may not accurately reflect users' opinions due to user heterogeneity, which is where rank-based QoE models diverge (§4.2).
QoS-driven ABR approaches primarily employ observed metrics from different perspectives to achieve better QoS performance [31].
Throughput-based ABR algorithms, such as FESTIVE [34] and PANDA [40], use past throughput to forecast future bandwidth and choose the appropriate bitrate. BBA [32], QUETRA [68], and BOLA [61] are buffer-based approaches that primarily use the buffer occupancy to select the bitrate. Zwei [23, 24] uses self-play learning methods to directly meet QoS requirements. However, all of these approaches optimize QoS rather than QoE.
QoE-driven ABR algorithms adjust the bitrate based on the observed network status to fulfill users' QoE. MPC [71] employs control theory to maximize QoE via harmonic throughput prediction, while Fugu [69] periodically retrains a DNN-based throughput predictor. Pensieve [44] and Comyco [27, 29] use DRL and imitation learning to generate DNNs without prior knowledge. MERINA [35] and A2BR [28] quickly adapt to different networks. However, the aforementioned works are optimized by linear-based QoE models, an approach that does not carry over to DNN-based models (like QoE_DNN, see §5).

CONCLUSION
Off-the-shelf QoE-driven ABR algorithms, optimized by MOS-based QoE models, may not account for user heterogeneity. We developed a novel ABR approach called Jade, built on the main concepts of RLHF, with a rank-based QoE model and DRL with a hybrid feedback mechanism. In detail, Jade addresses the challenges of training two rank-based QoE models with different architectures and of generating entropy-aware ABR algorithms via smooth training. Evaluations show that Jade outperforms existing algorithms on both slow- and fast-network paths and achieves the Pareto Frontier. Future research will focus on treating user engagement as the opinion score [75] and applying Jade to live streaming scenarios [69].

Appendices

A ENTROPY-AWARE DEEP REINFORCEMENT LEARNING
We discuss the Dual-Clip Proximal Policy Optimization (Dual-Clip PPO) algorithm and its implementation in the Jade system.The detailed NN architectures are demonstrated in Figure 12.

A.1 Dual-Clip PPO
We consider leveraging a novel DRL algorithm, namely Dual-Clip Proximal Policy Optimization (Dual-Clip PPO) [70]. In detail, the loss function of Jade's actor network is computed as L_Dual (Eq. 12), where I(·) is a binary indicator function and the loss function L_PPO can be regarded as the surrogate loss of vanilla PPO (Eq. 13), which takes the probability ratio of the current policy π_θ and the old policy π_θ_old into account. The advantage function Â (Eq. 14) is learned by bootstrapping from the current estimate of the value function; QoE* is the smoothed QoE metric after smooth training, and the discount factor is set to 0.99. Note that Â can be estimated using various other state-of-the-art methods, such as N-step TD, TD(λ), V-Trace [11], and GAE [58].
In summary, the Dual-Clip PPO algorithm uses a double-clip approach to constrain the step size of the policy iteration and updates the neural network by minimizing the clipped surrogate objective. If the advantage function yields a value lower than zero, the algorithm clips the objective with a lower bound of c · Â. The hyper-parameters ε and c control how the gradient is clipped, and we set them to the default values ε = 0.2 and c = 3 [70].
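The per-sample objective described above can be sketched directly from the clipping rules; this follows the published Dual-Clip PPO formulation, with the variable names (`ratio`, `adv`) chosen here for clarity.

```python
def dual_clip_ppo_loss(ratio, adv, eps=0.2, c=3.0):
    """Per-sample Dual-Clip PPO loss sketch (negated surrogate, to be
    minimized). Standard PPO clipping at 1 +/- eps, plus a lower bound of
    c * adv when adv < 0 so one bad sample cannot yield an unbounded
    gradient. Defaults mirror the paper's eps = 0.2, c = 3."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    surrogate = min(ratio * adv, clipped * adv)
    if adv < 0:
        surrogate = max(surrogate, c * adv)   # dual clip: bound the penalty
    return -surrogate
```

For a positive advantage the extra clip never activates, so the rule reduces to vanilla PPO; the bound only engages on negative-advantage samples with large probability ratios.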
Moreover, the parameters $\theta_v$ of Jade's critic network are updated by minimizing the squared error of $\hat{A}_t$. For brevity, we present the combined loss function $L^{\mathrm{DRL}}$ in Eq. 15:

$$L^{\mathrm{DRL}} = L^{\mathrm{Dual}} + \hat{A}_t^2. \quad (15)$$

Furthermore, we augment the loss function with the entropy of the policy, $\beta H(\pi_\theta(\cdot \mid s_t))$, where $\beta$ is the weight assigned to the entropy. We tune the entropy weight $\beta$ to minimize the difference between the current entropy and the target entropy $H_{\mathrm{target}}$. In our implementation, we set $H_{\mathrm{target}} = 0.1$ [37].
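A minimal sketch of the entropy-weight tuning described above, assuming a simple proportional update toward the target entropy (the step size `lr` and the exact update form are our assumptions, not taken from the paper):

```python
import math

def policy_entropy(probs):
    """Shannon entropy of a discrete policy distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def update_entropy_weight(beta, probs, target_entropy=0.1, lr=0.01):
    """Nudge the entropy weight beta so the policy entropy tracks the
    target: raise beta when entropy falls below the target (encouraging
    exploration), lower it otherwise. Hypothetical proportional rule."""
    gap = target_entropy - policy_entropy(probs)
    return max(0.0, beta + lr * gap)
```

A near-deterministic policy has entropy below the target of 0.1, so `beta` grows; a uniform policy over several actions has entropy well above it, so `beta` shrinks.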

A.2 Overall Learning Process
The overall learning process is listed in the algorithm. In each iteration, we first randomly pick a video from the video set.

Figure 1: Visualizing the Cumulative Distribution Function (CDF) of videos' opinion scores, the relationship between the MOS and its variance across videos, and user heterogeneity over the SQoE-IV dataset [16].

... online trace selection, to fully leverage the advantages of the QoE models. The smooth training method leverages the linear-based QoE model in the early stage and transitions to the DNN-based QoE model in the later stage, with an entropy-related parameter combining their outputs for seamless integration; this allows effective training and integration of the two QoE models across different training phases (§5.1). The online trace selection scheme adopts online learning techniques to dynamically choose the appropriate network trace for training (§5.2). We evaluate Jade against recent ABRs on both slow-network and fast-network paths using trace-driven simulation (§6.2). On slow-network paths, Jade outperforms existing algorithms, including RobustMPC and Pensieve, by 22.5%-38.13%. Compared with Comyco, Jade improves QoE by 8.09%. Jade achieves the Pareto Frontier, indicating optimal trade-offs between different objectives. On fast-network paths, Jade improves QoE by at least 23.07% compared to other QoE-driven ABR algorithms, highlighting the importance of considering both the "thermodynamics" and "kinetics" of the process in QoE modeling. Further experiments demonstrate the potential risks of relying on "imperfect QoE models" without proper caution, as doing so may result in misguided strategies and undesired performance. We show that, through the synergistic fusion of linear-based and DNN-based QoE models, Jade outperforms approaches that use either model alone (§6.3). We sweep the parameter settings of the QoE model comparison and validate the effectiveness of smooth training and the online trace selector in enhancing performance and accelerating the learning process by selecting useful traces (§6.4). In general, we summarize the
contributions as follows:
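The smooth-training method described above (linear-based QoE model early in training, DNN-based later, combined via an entropy-related parameter) can be sketched as follows; the entropy-ratio normalization in `smooth_training_weight` is a hypothetical stand-in for the paper's exact entropy-related rule:

```python
def smooth_training_weight(entropy, max_entropy):
    """Entropy-related blending weight. Normalizing current policy entropy
    by its maximum is an assumed rule: while the policy is still highly
    exploratory (high entropy, early training), the weight stays near 1
    and the linear-based QoE model dominates."""
    return max(0.0, min(1.0, entropy / max_entropy))

def blended_qoe(qoe_lin, qoe_dnn, alpha):
    """Blend the two QoE models' rewards: alpha ~ 1 early in training
    (linear-based model), alpha ~ 0 later (DNN-based model)."""
    return alpha * qoe_lin + (1.0 - alpha) * qoe_dnn
```

As the policy's entropy decays over training, the blend shifts continuously from the generalizable linear model toward the more accurate DNN model, avoiding an abrupt reward switch.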

Figure 2: Jade's system overview. Based on users' feedback in the form of ranking scores, Jade trains rank-based QoE models for learning an NN-based ABR algorithm.

Figure 3: The training workflow of Jade's rank-based QoE model, which utilizes a pairwise loss to align the ranks of the same user's opinion scores over two sessions.

Jade's architecture is illustrated in Figure 2. As shown, Jade is mainly composed of three phases: user rating, learning rank-based QoE models, and generating ABR algorithms with the learned QoE models. Jade's workflow is as follows. ❶ Users rate opinion scores for video sessions with different ABR algorithms, network traces, and video descriptions. In this work, we directly adopt the SQoE-IV dataset, consisting of 1,350 realistic streaming videos generated from various transmitters, channels, and receivers. ❷ In the rank-based QoE model phase (§4), we leverage users' "relative" feedback as input to implement a more effective and accurate QoE model. Unlike previous work, considering both model generalization and precision, we train two types of QoE models: linear-based and DNN-based. ❸ In the training phase, we propose an entropy-aware reinforcement learning method to generate an NN-based ABR algorithm under the guidance of the trained QoE model (§5).
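The pairwise training signal in Figure 3 can be illustrated with a Bradley-Terry-style rank loss, standard in RLHF-style reward modeling; this is a sketch of the idea, and the paper's exact loss form may differ:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_rank_loss(score_preferred, score_other):
    """Pairwise rank loss: penalizes the QoE model when the session the
    user rated higher does not receive a higher predicted score. Only the
    *relative* order of the two predictions matters, which is what makes
    the model robust to each user's individual rating scale."""
    return -math.log(sigmoid(score_preferred - score_other))
```

When the model scores the preferred session well above the other, the loss approaches zero; when it ranks the pair the wrong way, the loss grows, pushing predictions back into the user's preference order.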

Figure 4: The NN architecture overview for the linear-based and DNN-based QoE models. The DNN-based model takes a sequence as input, while the linear-based model observes summary statistics.

Figure 6: ABR algorithms learned with the combination of QoE models and online trace selection.

Figure 7: Comparing Jade with recent ABR algorithms over the HSDPA dataset. Error bars show 95% confidence intervals.

Figure 8: Comparing Jade with existing ABR algorithms using QoE DNN. Results are collected over the FCC-18 dataset.

Figure 9: Performance comparison of Jade and several QoE-driven ABRs. Results are collected over the FCC dataset.

Figure 10: The learning curve of smooth training.

Figure 11:

Figure 12: The NN architecture of Jade's actor network and critic network.

19: Update weight $\alpha$ for smooth training: $\alpha = \mathbb{E}_{(s,a)\sim\pi_\theta}\left[\frac{-\log \pi_\theta(a \mid s)}{\log |A|}\right]$.

B DISCUSSION

B.1 Relation with Previous Paradigms

Relation with recent ABR algorithms. Existing ABR algorithms, typically learning-based schemes, often adopt deep reinforcement learning (DRL) to learn ABR policies without any presumptions. However, they are optimized with linear-based QoE models [18, 29, 30, 35, 44]. By contrast, Jade's ABR strategy is learned with the DNN-based model, which often lacks generalization ability across all sessions. Thus, we propose entropy-aware DRL approaches, including smooth training methodologies and online trace selection schemes, to help tame the complexity of combining linear-based and DNN-based models.
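The online trace selection scheme mentioned above can be illustrated with a simple epsilon-greedy selector; the worst-mean-QoE criterion and `epsilon` are our assumptions for this sketch, as the paper's scheme may use a different online-learning rule:

```python
import random

def select_trace(qoe_history, epsilon=0.1):
    """Hypothetical epsilon-greedy trace selector: with probability
    epsilon, explore a random network trace; otherwise exploit by
    training on the trace whose recent sessions yielded the lowest
    mean QoE, i.e., the trace the current policy handles worst.

    qoe_history: dict mapping trace id -> list of recent QoE values.
    """
    if random.random() < epsilon:
        return random.choice(list(qoe_history))
    return min(qoe_history,
               key=lambda t: sum(qoe_history[t]) / len(qoe_history[t]))
```

Prioritizing the currently-worst trace focuses training where the policy has the most to gain, which matches the stated goal of accelerating learning by selecting useful traces.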

7: Sample pairs $e_i$ and $e_j$ from the sessions of user $u$.

Table 1: Performance Comparison of QoE Models on SQoE-IV.

... exhibit comparable but lower Identity Rates compared with the others. Puffer's

Table 2: Sweeping the parameters for QoE DNN.

... better performance compared with using QoE DNN and QoE lin alone (25% on QoE). Moreover, the evolution curve of $\alpha$ indicates that QoE DNN dominates the majority of the training process, while QoE lin also plays an important role at the beginning, covering approximately 75% of the training process. These results further demonstrate that both QoE models are indispensable.