Quality-Aware Dynamic Resolution Adaptation Framework for Adaptive Video Streaming

Traditional per-title encoding schemes aim to optimize encoding resolutions to deliver the highest perceptual quality for each representation. XPSNR is observed to correlate better with the subjective quality of VVC-coded bitstreams. Towards this realization, we predict the average XPSNR of VVC-coded bitstreams from the spatiotemporal complexity features of the video and the target encoding configuration using an XGBoost-based model. Based on the predicted XPSNR scores, we introduce a Quality-Aware Dynamic Resolution Adaptation (QADRA) framework for adaptive video streaming applications, where we determine the convex hull online. Furthermore, keeping the encoding and decoding times within an acceptable threshold is mandatory for smooth and energy-efficient streaming. Hence, QADRA determines the encoding resolution and quantization parameter (QP) for each target bitrate by maximizing XPSNR while constraining the maximum encoding and/or decoding time below a threshold. QADRA implements a JND-based representation elimination algorithm to remove perceptually redundant representations from the bitrate ladder. QADRA is an open-source Python-based framework published under the GNU GPLv3 license. GitHub: https://github.com/PhoenixVideo/QADRA Online documentation: https://phoenixvideo.github.io/QADRA/


INTRODUCTION
HTTP Adaptive Streaming (HAS) has emerged as the predominant method for delivering video content across a spectrum of internet speeds and device types [1]. Its core concept involves segmenting video content and encoding each segment at various bitrates and resolutions, known as representations, which are then stored on standard HTTP servers. These representations facilitate continuous adaptation of video delivery to the fluctuating network conditions and diverse device capabilities of clients [2]. Typically, online streaming applications employ a predefined bitrate ladder, such as the one found in HTTP Live Streaming (HLS) [3], to ensure smooth and efficient content delivery.
There is a growing interest in per-title encoding techniques aimed at improving the perceived quality of streamed content [7]. This approach dynamically adjusts encoding parameters such as resolution [7][8][9][10], framerate [11], and presets [12], among others, based on content complexity and viewer preferences to optimize visual fidelity. Among these methods, dynamic resolution encoding has been extensively researched in adaptive streaming applications, where encoding resolutions are adjusted dynamically to maximize video quality [13,14]. This approach ensures that perceptual quality remains high in visually intricate segments while conserving bandwidth by lowering resolution in less complex scenes. As demonstrated by the rate-distortion plots in Figure 1, the optimal resolution, which maximizes perceptual quality (measured in terms of XPSNR [6]), varies depending on content complexity. By adapting the resolution per segment, the streaming system allocates resources efficiently, prioritizing high-quality representation where it is most beneficial. Ultimately, dynamic resolution per-title encoding aims to balance perceptual quality and bandwidth efficiency [7].
in data centers require substantial computational resources and energy consumption, especially with new codecs such as Versatile Video Coding (VVC) [15,16]. The streaming industry can reduce its carbon footprint and energy consumption by lowering encoding time [17,18]. Furthermore, reducing decoding time on the client side reduces stall events and buffering time, contributing to a smooth viewing experience [19,20].
The main contributions of this paper are as follows: (1) A quality-aware encoding resolution selection framework to maximize the perceptual quality (in terms of XPSNR) of video segments based on their spatiotemporal complexity, target bitrate, and the encoding and/or decoding time constraint for VVC-based streaming environments. (2) A comprehensive analysis of the proposed framework for various encoding time thresholds regarding compression efficiency and encoding latency.
Outline: The remainder of this paper is organized as follows. Section 2 discusses the related work on dynamic resolution encoding in the context of adaptive video streaming. The proposed QADRA framework is explained in Section 3, while Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

RELATED WORK
Most state-of-the-art dynamic resolution per-title encoding methods are based on choosing a particular resolution that provides better visual quality for a given bitrate range. Katsenou et al. [8] use machine learning to identify the most effective bitrate range for each resolution. The method extracts spatiotemporal features and statistics from sequences at their original resolution. Then, it employs machine learning methods to predict the quantization parameters (QPs) at which the rate-distortion curves across the different resolutions intersect. $(r - 1) \times 2$ encodes must be performed to determine the bitrates at which resolutions should be switched. This content-gnostic approach has been claimed to reduce the number of encodings required by 81% to 94% compared to the brute-force encoding approach. However, it uses constant quantization parameter (CQP) encodes, which are not used in real-time streaming applications. Another method, proposed by Bhat et al. [9], uses machine learning to predict the resolution without requiring multiple encodings. Features from the low-resolution encoding of the first few frames are input to a random forest model to predict the better-performing resolution for a decision period. Similarly, Zabrovskiy et al. [21] used an artificial neural network to predict an optimized bitrate ladder for each scene, optimized based on the YPSNR quality metric. These methods produce latency significantly higher than the accepted latency in live streaming. OPTE [10] uses random forest models to predict the optimized resolution yielding the highest VMAF [22], using spatiotemporal features extracted for each segment. However, OPTE does not consider an encoding latency constraint during the optimized resolution prediction.
To summarize, current related work lacks encoding latency constraints while selecting the optimized encoding resolution, and most state-of-the-art methods need pre-encodings that yield significant latency and energy consumption. Furthermore, discussion on dynamic resolution per-title encoding for VVC-based streaming platforms is limited.

QUALITY-AWARE DYNAMIC RESOLUTION ADAPTATION (QADRA)
Striking the right balance between offering high-quality, high-resolution streams and minimizing encoding and/or decoding time and energy consumption is crucial for adaptive streaming platforms to ensure responsive and uninterrupted playback experiences across various end-user devices and network environments. In line with this perspective, this paper proposes a latency-aware dynamic resolution encoding scheme (QADRA) to maximize the perceived quality of video segments based on the video content complexity, target bitrate, and the encoding time constraint. As shown in Figure 2, QADRA comprises four steps: (1) spatiotemporal complexity feature extraction (Section 3.1), (2) optimized resolution prediction (Section 3.2), (3) optimized QP prediction (Section 3.3), and (4) JND-based representation elimination (Section 3.4).

Spatiotemporal complexity feature extraction
This process involves analyzing the video content in both spatial and temporal dimensions, capturing essential information about object movements, scene changes, and visual details. By extracting relevant spatiotemporal features, such as motion vectors, texture patterns, and frame-to-frame differences, prediction models can capture the complexity and characteristics of the content. QADRA uses seven DCT-energy-based features extracted by Video Complexity Analyzer (VCA) v2.0 [23] as the content complexity features of video segments: the average luma texture energy ($E_{\text{Y}}$), the average gradient of the luma texture energy ($h$), the average luminescence ($L_{\text{Y}}$), the average chroma texture energy of the U and V channels ($E_{\text{U}}$ and $E_{\text{V}}$), and the average chrominescence of the U and V channels ($L_{\text{U}}$ and $L_{\text{V}}$). The features of the videos of the Inter-4K dataset [4] are compiled in the /Dataset/ folder of the repository.

Optimized resolution prediction
The objective of selecting the optimized resolution based on bitrate and video complexity features is decomposed into two parts: (1) designing models to predict the encoding and/or decoding time and the perceptual quality; (2) developing a function to obtain the optimized resolution based on the predicted encoding and/or decoding times and perceptual quality for each available encoding resolution.
Modeling: The perceptual quality (in terms of XPSNR) $x_{(r_t, b_t)}$, encoding time $e_{(r_t, b_t)}$, and decoding time $d_{(r_t, b_t)}$ of the representation $(r_t, b_t)$ depend on the video complexity features $\{E_{\text{Y}}, h, L_{\text{Y}}, E_{\text{U}}, E_{\text{V}}, L_{\text{U}}, L_{\text{V}}\}$, the encoding resolution $r_t$, and the target bitrate $b_t$. Spatiotemporal features encapsulate intricate spatial details and temporal dynamics within the video segment and help assess the video fidelity [24]. Including resolution, bitrate, framerate, and preset parameters in the models acknowledges the interplay between compression efficiency, temporal smoothness, and spatial clarity in shaping perceived quality [25]. A higher resolution or bitrate may improve the quality and increase the file size of the video segment. Similarly, a higher resolution or bitrate can reduce the encoding and decoding speed. Notably, encoding and decoding speeds largely depend on hardware-level parameters such as RAM capacity and the number of CPU threads [11].
Optimization: QADRA optimizes the perceptual quality of encoded video segments while adhering to real-time processing constraints. It predicts the optimized resolution of the $t$-th representation to maximize the compression efficiency while maintaining the encoding time below the threshold $\tau_{\text{L}}$. The optimization function selects $\hat{r}_t = \arg\max_{r \in \mathcal{R}} \hat{x}_{(r, b_t)}$ subject to $\hat{e}_{(r, b_t)} \leq \tau_{\text{L}}$, with an analogous bound on $\hat{d}_{(r, b_t)}$ when a decoding time constraint is set, where $\hat{x}_{(r, b_t)}$, $\hat{e}_{(r, b_t)}$, and $\hat{d}_{(r, b_t)}$ are the predicted XPSNR, encoding time, and decoding time of the representation $(r, b_t)$, and $\mathcal{R}$ is the set of supported resolutions.
Implementation: The select_best_resolution method selects the best resolution from the predefined resolutions based on the predicted encoding time and the target bitrate. It aims to find the resolution that maximizes the XPSNR within the target encoding time. First, the variables that store the predicted XPSNR values (xpsnr) and predicted encoding times (time) are initialized for each resolution in the predefined list. For each resolution in the list, the XPSNR (xpsnr) and encoding time (time) are predicted from the provided features and bitrate using separate methods (predict_xpsnr and predict_enc_time). The highest predicted XPSNR value (highest_xpsnr) that satisfies the target encoding time constraint (tl) is then identified, and the resolution that yields it is selected as the predicted resolution (predicted_resolution). If no resolution satisfies the target encoding time constraint, the predicted resolution remains unchanged (defaulting to the first resolution in the list). Finally, the predicted resolution is adjusted based on the bitrate using the get_resolution_based_on_bitrate method.
We employ a data-driven approach to select the resolution that maximizes video quality within the given encoding time constraint, ultimately optimizing the encoding process for the specified bitrate.
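The selection logic described for select_best_resolution can be sketched as follows. This is a minimal illustration, not the repository code: the function signature, the way the predictors are passed in as callables, and the toy predictors in the usage lines are assumptions, and the final get_resolution_based_on_bitrate adjustment is left out because its behavior is not specified here.

```python
from typing import Callable, Sequence

# Hypothetical predictor signature: (features, resolution, bitrate) -> value
Predictor = Callable[[Sequence[float], int, float], float]


def select_best_resolution(
    features: Sequence[float],
    bitrate: float,
    resolutions: Sequence[int],
    tl: float,
    predict_xpsnr: Predictor,
    predict_enc_time: Predictor,
) -> int:
    """Return the resolution with the highest predicted XPSNR whose
    predicted encoding time does not exceed the threshold tl."""
    # Fallback when no resolution meets the constraint: first list entry.
    predicted_resolution = resolutions[0]
    highest_xpsnr = float("-inf")
    for resolution in resolutions:
        xpsnr = predict_xpsnr(features, resolution, bitrate)
        time = predict_enc_time(features, resolution, bitrate)
        # Keep the best-quality candidate among the feasible ones.
        if time <= tl and xpsnr > highest_xpsnr:
            highest_xpsnr = xpsnr
            predicted_resolution = resolution
    return predicted_resolution


# Toy stand-ins for the trained models (illustrative only).
def toy_xpsnr(features, resolution, bitrate):
    return 30.0 + resolution / 360.0


def toy_enc_time(features, resolution, bitrate):
    return resolution / 10.0


ladder = [360, 720, 1080, 2160]
print(select_best_resolution([], 3000, ladder, 150.0, toy_xpsnr, toy_enc_time))
```

With the toy predictors, quality and encoding time both grow with resolution, so tightening tl pushes the selection toward lower resolutions, which mirrors the behavior the framework reports for stricter time constraints.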

Optimized QP prediction
Predicting the QP helps ensure consistent video quality throughout the stream. It allows the encoder to allocate bits judiciously, preventing under-allocation (resulting in poor quality) or over-allocation (wasting bandwidth) of bits for encoding.
Modeling: The QP $q_{(r_t, b_t)}$ depends on the video complexity features $\{E_{\text{Y}}, h, L_{\text{Y}}, E_{\text{U}}, E_{\text{V}}, L_{\text{U}}, L_{\text{V}}\}$, the encoding resolution $r_t$, and the target bitrate $b_t$. Content with intricate details, textures, or sharp edges demands a lower QP to represent these features accurately in the encoded video. Similarly, segments with fast motion, frequent scene changes, or dynamic content require a lower QP to capture the rapid changes between frames accurately [26].
Optimization: The QP optimization aims to yield a bitrate as close to the target bitrate as possible. A loss function measures the deviation between the target and predicted bitrate, and the objective is to find the QP that minimizes this loss function.
Implementation: The predict_qp function takes the features extracted from the video segment, the resolution (normalized to the range [0, 1]), and the bitrate as input, and concatenates them into a feature vector. The function retrieves two pre-trained machine learning models, one associated with the minimum QP (min_model) and another with the maximum QP (max_model), and passes the feature vector to both to obtain the predicted values b1 and b2. It then uses linear interpolation to compute a predicted QP value (qp_pred) for the given bitrate, based on the equation of the line passing through the two points (x1, b1) and (x2, b2), where x1 and x2 are the predefined QP values (10 and 50) and b1 and b2 are the predictions corresponding to these points. The predicted QP value is constrained to the range from 10 to 50 in our implementation and is returned as an integer.
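The interpolation step of predict_qp can be illustrated with a short sketch. It assumes that b1 and b2 are bitrate estimates at the two anchor QPs (10 and 50) and that the line is fitted in the log-bitrate domain, in keeping with the linear relationship between QP and the logarithm of bitrate stated for the prediction models; the function name interpolate_qp and its arguments are hypothetical, and the two model predictions are supplied directly instead of being produced by min_model and max_model.

```python
import math

QP_MIN, QP_MAX = 10, 50  # anchor QPs (x1 and x2 in the text)


def interpolate_qp(
    target_bitrate: float,
    bitrate_at_qp_min: float,
    bitrate_at_qp_max: float,
    qp_min: int = QP_MIN,
    qp_max: int = QP_MAX,
) -> int:
    """Find the QP on the line through (qp_min, log b1) and (qp_max, log b2)
    whose bitrate matches target_bitrate, clamped to [qp_min, qp_max]."""
    b1 = math.log(bitrate_at_qp_min)  # assumed: bitrate estimate at QP 10
    b2 = math.log(bitrate_at_qp_max)  # assumed: bitrate estimate at QP 50
    if b1 == b2:
        # Degenerate line: bitrate does not vary with QP in this sketch.
        return qp_min
    slope = (b2 - b1) / (qp_max - qp_min)
    qp_pred = qp_min + (math.log(target_bitrate) - b1) / slope
    # Constrain to the supported QP range and return an integer.
    return int(min(max(round(qp_pred), qp_min), qp_max))


# Example: 16 Mbps predicted at QP 10 and 1 Mbps at QP 50 (values in kbps).
print(interpolate_qp(4000, 16000, 1000))
```

In the example, the target of 4000 kbps lies halfway between the two anchors on a logarithmic scale, so the interpolated QP falls at the midpoint of the QP range.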

JND-based representation elimination
QADRA uses the JND-based representation elimination algorithm proposed in our previous work [11]. To avoid perceptual redundancy in the bitrate ladder, if the predicted quality difference between two representations is lower than the JND [27], the higher-bitrate representation of the two is eliminated. Furthermore, when the predicted quality exceeds the threshold above which a representation is deemed perceptually lossless, the corresponding representation is eliminated from the bitrate ladder [28].
Implementation: The jnd_elimination function takes a list of representations with JND features as input and returns the subset of representations that satisfies the JND criterion.
(1) Initialize an empty list of representations to store the selected representations.
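A minimal sketch of the elimination procedure is given below. It follows the step order listed for jnd_elimination, under which the first representation that crosses the quality ceiling is still retained and only later ones are dropped; this is one reading of the steps. The Representation container, its field names, and the standalone function signature (the text refers to self.jnd and self.max_xpsnr) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Representation:
    bitrate: int    # target bitrate (kbps), hypothetical field
    xpsnr: float    # predicted XPSNR (dB), hypothetical field


def jnd_elimination(
    representations: List[Representation], jnd: float, max_xpsnr: float
) -> List[Representation]:
    """Keep only representations that differ from the last kept one by at
    least one JND, stopping once the quality ceiling is exceeded."""
    if jnd == 0 or not representations:
        return list(representations)
    selected = [representations[0]]
    if representations[0].xpsnr > max_xpsnr:
        return selected
    for rep in representations[1:]:
        # Keep the representation only if it is noticeably better.
        if rep.xpsnr - selected[-1].xpsnr >= jnd:
            selected.append(rep)
        # Stop once the current representation exceeds the ceiling.
        if rep.xpsnr > max_xpsnr:
            return selected
    return selected


# Ladder sorted by ascending bitrate (assumed input order).
ladder = [
    Representation(145, 30.0),
    Representation(300, 32.0),
    Representation(600, 33.0),
    Representation(900, 38.0),
    Representation(1600, 46.0),
    Representation(2400, 47.0),
]
kept = jnd_elimination(ladder, jnd=2.0, max_xpsnr=45.0)
print([rep.bitrate for rep in kept])
```

In this example the 600 kbps entry is dropped because it is less than one JND above the 300 kbps entry, and the 2400 kbps entry is dropped because the ladder already exceeds the quality ceiling at 1600 kbps.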

Commandline options
The following command line options are included in the prototype:
(7) jnd: This option allows users to set the threshold for perceptual differences (in terms of XPSNR), ensuring that the process considers only noticeable differences in video quality. Default: 0.
These CLI options provide a range of customization for users, enabling them to tailor the prototype's behavior according to their specific needs and preferences in video streaming.

Prediction models
80% of the videos in the Inter-4K dataset [4] are used to train the prediction models. Encodings are run on a dual-processor server with Intel Xeon Gold 5218R (80 cores, frequency at 2.10 GHz), where each encoding instance uses four CPU threads. The sequences are encoded at 60 fps using VVenC v1.10 [5] with preset 0 (faster) and QPs ranging from $q_{\text{min}}$ to $q_{\text{max}}$. The spatiotemporal features $\{E_{\text{Y}}, h, L_{\text{Y}}, E_{\text{U}}, E_{\text{V}}, L_{\text{U}}, L_{\text{V}}\}$ are extracted using VCA v2.0 [23] running as a pre-processor using four CPU threads with multi-threading and x86 SIMD optimizations. The procedure to generate the dataset for training is illustrated in Algorithm 1.
We trained the XPSNR prediction models using multiple regressors, including extra-trees, XGBoost [29], and random forests [30], and observed that the XGBoost regressor consistently performed the best with our feature set. A grid search was performed to explore different combinations of hyperparameter values, and we selected max_depth=10 and n_estimators=400, which maximized performance. Our predictive modeling framework employs a cascading approach to predict both bitrate and encoding time for a given QP. This method involves training distinct models for the minimum and maximum QP values ($q_{\text{min}}$ and $q_{\text{max}}$, respectively), enabling the prediction of both the maximum and minimum bitrate, as well as the maximum and minimum encoding time. The linear relationship between QP and the logarithm of time, and between QP and the logarithm of bitrate, underpins the success of this approach. Since the points $(q_{\text{min}}, b_{\text{max}})$ and $(q_{\text{max}}, b_{\text{min}})$ are estimated, the optimized QP for a target bitrate $b$ is determined by fitting a line through these two points. Similarly, using the estimations from the encoding time models, we obtain the points $(q_{\text{min}}, t_{\text{max}})$ and $(q_{\text{max}}, t_{\text{min}})$, from which the corresponding encoding time for the QP value is determined.
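Under the stated log-linear relationships, the two-point fit can be written out explicitly. The following is a worked form under assumed notation, not an equation taken from the paper: $q_{\text{min}}$ and $q_{\text{max}}$ are the anchor QPs, $b_{\text{max}}$ and $b_{\text{min}}$ are the bitrates predicted at those anchors, $t_{\text{max}}$ and $t_{\text{min}}$ are the encoding times predicted at those anchors, $b$ is the target bitrate, and $q$ is the resulting QP. The base of the logarithm cancels in the first expression.

```latex
% QP for a target bitrate b, from the line through
% (q_min, log b_max) and (q_max, log b_min):
q = q_{\text{min}} + (q_{\text{max}} - q_{\text{min}})
    \cdot \frac{\log b_{\text{max}} - \log b}{\log b_{\text{max}} - \log b_{\text{min}}}

% Encoding time at that QP, from the line through
% (q_min, log t_max) and (q_max, log t_min):
\log t(q) = \log t_{\text{max}}
    + \frac{q - q_{\text{min}}}{q_{\text{max}} - q_{\text{min}}}
      \left( \log t_{\text{min}} - \log t_{\text{max}} \right)
```

As a sanity check, $b = b_{\text{max}}$ gives $q = q_{\text{min}}$ and $b = b_{\text{min}}$ gives $q = q_{\text{max}}$, while $t(q_{\text{min}}) = t_{\text{max}}$ and $t(q_{\text{max}}) = t_{\text{min}}$.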
We have evaluated their performance across various video content types, resolutions, and bitrates, demonstrating consistent and reliable predictions. Additionally, rigorous cross-validation techniques have been employed to assess model performance and ensure its robustness against overfitting. Despite our confidence in the models, we acknowledge potential scenarios where they might underperform or fail to predict encoding parameters accurately. Some factors that could contribute to underperformance include out-of-distribution data and extreme conditions such as highly complex content. As encoding techniques evolve and new video formats emerge, the models may require periodic updates to maintain their predictive accuracy and generalizability.

Scalability and adaptability
The modular design of QADRA allows easy integration with existing streaming infrastructure and workflow automation systems, facilitating deployment across large-scale streaming platforms. Its predictive models are built to handle a wide range of input data, enabling efficient processing of large datasets and rapid generation of encoding parameter recommendations. QADRA's architecture can be optimized for parallel processing, leveraging distributed computing resources to scale with increasing demand. QADRA is adaptable to evolving technologies, including immersive media streaming formats such as virtual reality (VR) and augmented reality (AR). Its predictive models can be trained on datasets encompassing diverse video formats, resolutions, and encoding techniques, enabling it to adapt to emerging standards and formats in immersive media streaming.

EVALUATION RESULTS
The experimental parameters used to evaluate QADRA are listed in Table 1. The VVenC encoding uses the predicted bitrate-resolution-QP configurations for a given input video segment. Here, b is considered the upper bound of bitrate variability, and q is the QP used for encoding. In VVenC, the QP is specified using the qp option, while the maxrate (easy mode) or MaxBitrate (expert mode) option is used to specify the upper bound of bitrate variability.
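To make the mapping from a predicted configuration to an encoder invocation concrete, the sketch below assembles a VVenC easy-mode command line for one representation. The qp and maxrate options are the ones named above; the remaining flags (-i, -s, -r, --preset, -o), the assumption that maxrate is given in bits per second, and the helper name build_vvenc_command are assumptions that should be checked against the help output of the installed vvencapp. The sketch also assumes the input file has already been scaled to the selected resolution, and it only builds the argument list without running the encoder.

```python
from typing import List


def build_vvenc_command(
    input_path: str,
    output_path: str,
    width: int,
    height: int,
    qp: int,
    maxrate_kbps: int,
    fps: int = 60,
    preset: str = "faster",
) -> List[str]:
    """Assemble a vvencapp argument list for one (bitrate, resolution, QP)
    representation. Flag names other than --qp and --maxrate are assumed."""
    return [
        "vvencapp",
        "-i", input_path,                        # raw input, pre-scaled (assumed)
        "-s", f"{width}x{height}",               # encoding resolution
        "-r", str(fps),                          # frame rate
        "--preset", preset,                      # preset 0 corresponds to faster
        "--qp", str(qp),                         # predicted QP q
        "--maxrate", str(maxrate_kbps * 1000),   # bitrate cap b, assumed in bps
        "-o", output_path,
    ]


cmd = build_vvenc_command("segment_1080p.yuv", "segment_1080p.266", 1920, 1080, 32, 4500)
print(" ".join(cmd))
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without shell quoting issues.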
Prediction latency and accuracy: The time to predict the resolution-QP pair for each representation is 5 ms. The accuracy of the encoding time, QP, and XPSNR prediction models is analyzed in terms of mean absolute error (MAE). The average MAE is 56.69 s, 1.32, and 0.16 dB, respectively. The average standard deviation is 83.98 s, 1.96, and 0.22 dB, respectively.
Resolution prediction: The encoding resolution predictions of QADRA are analyzed. Figure 3 shows that QADRA ($\tau_{\text{E}} = \infty$) generally yields the highest resolutions for a given target bitrate compared to Default and other encodings. The selected encoding resolution for a given target bitrate decreases as $\tau_{\text{E}}$ decreases. If the target latency constraint in QADRA is eliminated, i.e., $\tau_{\text{E}} = \infty$, resolutions yielding the highest XPSNR are selected. Notably, in scenarios where encoding time constraints become more stringent, higher bitrate representations might be omitted in QADRA because these representations cannot be encoded within the allocated time budget, as observed in Figure 4.
Rate-distortion performance: Figure 4 shows the RD curves of the representative video segments in the test dataset. It is observed that the RD curve of QADRA ($\tau_{\text{E}} = \infty$) is consistently higher than Default and other encodings. This means that, for any given target bitrate, QADRA ($\tau_{\text{E}} = \infty$) maintains a higher level of visual quality as measured by XPSNR. Consequently, viewers experience fewer artifacts, such as blocking or blurring, at the same bitrate.
Latency and energy consumption performance: As shown in Figure 4, QADRA ($\tau_{\text{E}} = \infty$) yields the longest encoding time due to higher encoding resolutions optimized for maximum XPSNR. This significantly increased encoding time may impact real-time or low-latency applications. Encoding typically utilizes the processing units (e.g., CPU or GPU) intensively. These processing units operate at a relatively constant power level during encoding. Therefore, the power consumed over time remains reasonably consistent, contributing to the linear relationship between encoding time and energy consumption. Hence, we assume that encoding time savings directly translate to a reduction in encoding energy consumption. Since the encodings are assumed to be carried out concurrently, the total encoding time for each segment is determined to be the highest encoding time among the bitrate ladder representations [31]. Table 2 shows the average encoding time for each segment using the considered encoding schemes. It is observed that the encoding and decoding times of the representations of video segments decrease as $\tau_{\text{E}}$ decreases.

CONCLUSIONS AND FUTURE DIRECTIONS
This paper implemented a quality-aware dynamic resolution adaptation (QADRA) framework for adaptive streaming applications. QADRA implements optimized resolution and QP prediction, which uses XGBoost-based models to estimate bitrate-resolution-QP triples for a given video segment based on its spatiotemporal characteristics. Furthermore, a JND-aware representation elimination algorithm is implemented, removing the perceptual redundancy of the bitrate ladder. One promising avenue for future research is exploring advanced machine-learning models to enhance prediction accuracy. Investigating novel features and metrics that better capture the relationship between encoding time and optimal resolutions is another. Moreover, collaborative frameworks or distributed algorithms for efficient encoding resolution selection across multiple streaming nodes could be explored.

Figure 2: Encoding using QADRA framework envisioned in this paper.

(2) If the JND threshold (self.jnd) is set to 0, return the input list of representations.
(3) Otherwise, iterate through the list of representations.
(4) Add the first representation to the representations list.
(5) If the XPSNR value of the current representation exceeds the maximum XPSNR threshold (self.max_xpsnr), return the representations list.
(6) Iterate through the remaining representations in the list.
(7) If the difference in XPSNR between the current representation and the last selected representation is greater than or equal to the JND threshold, add the current representation to the representations list.
(8) Return the representations list if the XPSNR value of the current representation exceeds the maximum XPSNR threshold.
(9) Return the final representations list.
maxEncTime: This option allows the user to specify the maximum acceptable time for encoding each representation.
This option represents the maximum supported resolution. Users can set this option to define the highest resolution permissible for the video encoding task. It ensures that encoding operates within the specified resolution limits. Default: 2160.
(6) maxQuality: This option allows users to set the maximum acceptable XPSNR score. Setting a maximum threshold ensures that only representations below a certain quality level are encoded. It is mainly used to eliminate perceptually lossless representations. Default: 100.

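Algorithm 1: Training dataset generation.
Inputs:
    $\mathcal{R}$: set of supported resolutions
    $q_{\text{min}}$: minimum QP
    $q_{\text{max}}$: maximum QP
for each training video segment do
    Run VCA and get $\{E_{\text{Y}}, h, L_{\text{Y}}, E_{\text{U}}, E_{\text{V}}, L_{\text{U}}, L_{\text{V}}\}$
    for each $r \in \mathcal{R}$ do
        for each $q \in [q_{\text{min}}, q_{\text{max}}]$ do
            Encode segment at resolution $r$ with QP $q$;
            Record $E_{\text{Y}}, h, L_{\text{Y}}, E_{\text{U}}, E_{\text{V}}, L_{\text{U}}, L_{\text{V}}$, $r$, $q$, achieved bitrate $b'$, XPSNR $x'$, and PSNR;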

Table 1: Experimental parameters used to evaluate QADRA.

Table 2: Average encoding and decoding time results of QADRA compared to the Default bitrate ladder encoding [3].