Energy-efficient Adaptive Video Streaming with Latency-Aware Dynamic Resolution Encoding

Traditional per-title encoding schemes aim to optimize encoding resolutions to deliver the highest perceptual quality for each representation. However, keeping the encoding time within an acceptable threshold for a smooth user experience is important to reduce the carbon footprint and energy consumption on encoding servers in video streaming applications. Toward this realization, we introduce an encoding latency-a ware dynamic resolution encoding scheme (LADRE) for adaptive video streaming applications. LADRE determines the encoding resolution for each target bitrate by utilizing a random forest-based prediction model for every video segment based on spatiotemporal features and the acceptable target latency. Experimental results show that LADRE achieves an overall average quality improvement of 0.58 dB PSNR and 0.43 dB XPSNR while maintaining the same bitrate, compared to the HTTP Live Streaming (HLS) bitrate ladder encoding of 200 s segments using the VVenC encoder, when the encoding latency for each representation is set to remain below the 200 s threshold. This is accompanied by an 84.17 % reduction in overall encoding energy consumption.


INTRODUCTION
HTTP Adaptive Streaming (HAS) has become the de-facto standard in delivering video content for various clients regarding internet speeds and device types.The fundamental idea behind HAS is to divide the video content into segments and encode each segment at various bitrates and resolutions, called representations, stored in plain HTTP servers.These representations continuously adapt the video delivery to the network conditions and device capabilities of the client [1].Conventionally, a fixed bitrate ladder, e.g., HTTP Live Streaming (HLS) bitrate ladder [2], is used in video streaming applications.
Dynamic resolution per-title encoding: In contemporary streaming applications, there is growing interest in per-title encoding methods to enhance the perceived quality of delivered content [3].This innovative approach dynamically adjusts the encoding resolution [3][4][5][6] and the framerate [7], preset [8,9], and other parameters in response to content complexity and viewer preferences to maximize visual fidelity.Dynamic resolution encoding stands out as the most extensively studied per-title encoding scheme in adaptive streaming applications, focusing on adjusting encoding resolutions dynamically to optimize video quality.Adapting the encoding resolution for each segment ensures that the video maintains high perceptual quality in visually intricate segments while efficiently lowering resolution in less complex scenes.As illustrated by the rate-distortion plots of representative segments in Figure 1 (based on the Inter-4K dataset [10]), the optimized resolution, which yields the highest perceptual quality (in terms of XPSNR [11,12]), depends on the content complexity.For sequence 0001, 540p yields the highest XPSNR until 0.8 Mbps, while 1080p yields the highest XPSNR within a bitrate range of 0.8 Mbps and 5.0 Mbps.2160p yields the highest XPSNR for bitrates over 5.0 Mbps.Meanwhile, for sequence 0010, 1080p yields the highest XPSNR until 1.6 Mbps, while 2160p yields the highest XPSNR for bitrates over 1.6 Mbps.The streaming system can allocate resources effectively by tailoring the resolution per segment, prioritizing high-quality representations where it matters most.Ultimately, dynamic resolution per-title encoding strives to balance perceptual quality and bandwidth efficiency, offering viewers an immersive and engaging streaming experience [3].Notably, a convex hull will differ from one codec to the other since the more efficient the codec, the less bitrate it needs to encode at a specific resolution with acceptable quality.For example, High Efficiency Video Coding (HEVC) [13] and Versatile Video Coding (VVC) [14] allow for better use of the higher resolution than the other codecs by providing a wider range of bitrates to stream that resolution [15].
Encoding delay tolerance in video streaming: In live sports or events streaming, where real-time viewing is crucial, encoding delays are typically expected to be very low, usually in a few seconds or less.This minimal delay ensures that viewers receive the action as close to real-time as possible [16].For interactive content like online gaming or live seminars, extremely low encoding delays are crucial in maintaining real-time interactivity.Delays of a few seconds or less are generally acceptable to keep interactions seamless.News broadcasts require relatively low encoding delays to deliver the latest updates.Delays of ten seconds or less are commonly permitted to ensure viewers receive updates as events unfold.Reducing encoding time is also critical in video-on-demand (VoD) streaming applications since it contributes to environmental sustainability.Encoding processes in data centers require substantial computational resources and energy consumption.The streaming industry can reduce its carbon footprint and energy consumption by minimizing encoding time.This is particularly important as environmental consciousness grows and companies seek to adopt greener practices.Lower encoding latency aligns with eco-friendly streaming practices by optimizing resource utilization and minimizing unnecessary energy expenditure.
In this light, encoding time-aware per-title encoding is essential for adaptive streaming applications.Firstly, it enables the adaptation of encoding parameters based on the time constraints of the encoding process.By considering encoding time as a crucial factor, the encoding settings can be optimized dynamically to ensure that content is encoded within specific time limits, meeting the demands of real-time streaming scenarios.Secondly, it enhances resource allocation efficiency by tailoring the encoding parameters to match the available computational resources, thereby avoiding overloading or underutilizing encoding servers.Additionally, it contributes to an overall improvement in streaming efficiency, allowing for smoother content delivery by minimizing encoding delays and ensuring that the streaming service can efficiently handle varying workloads and complexities of encoding tasks.Encoding time-aware per-title encoding is pivotal in optimizing streaming quality while efficiently utilizing computational resources in online adaptive streaming environments.As shown in the rate-encoding time plots of representative segments in Figure 1, the encoding time depends on the encoding resolution chosen for the video content.This relationship is intrinsic to the encoding process, where the number of pixels in each frame significantly impacts the computational workload.Higher encoding resolutions, such as 2160p or 4320p, encompass more pixels, necessitating increased computational resources and time for encoding compared to lower resolutions like 720p or 1080p.The dependency on encoding resolution becomes especially relevant in adaptive streaming, where content is encoded at multiple resolutions to accommodate varying user devices and network conditions.However, state-of-the-art per-title encoding methods do not consider the encoding latency constraints while optimizing the encoding resolution [3,4,6,17].
Contributions: Our main contributions are as follows: (1) An online encoding resolution selection algorithm to maximize the perceived quality of video segments based on their spatiotemporal complexity, target bitrate, and the encoding time constraint.(2) Implementation of the proposed per-title encoding scheme using an open-source VVC-based toolchain.(3) Comprehensive analysis of the proposed resolution selection algorithm for various encoding time thresholds regarding compression efficiency and encoding latency.Since no opensource VVC encoder implementations are available for live encoding, we limit our evaluation to the VoD streaming scenario.
Outline: The remainder of this paper is organized as follows.Section 2 discusses related work on dynamic resolution encoding in the context of adaptive video streaming.The proposed LADRE scheme is introduced and described in detail in Section 3. Section 4 explains the experimental design, while Section 5 presents the experimental results.Finally, Section 6 concludes the paper.

RELATED WORK
Most state-of-the-art dynamic resolution per-title encoding methods are based on choosing a particular resolution that provides better visual quality for a given bitrate range.Table 1 shows the target scenario, the bitrate estimation method, the number of preencodings needed to determine the convex hull, and the encoding type of the state-of-the-art methods.Katsenou et al. [4] uses machine learning to identify the most effective bitrate range for each resolution.The method extracts spatiotemporal features and statistics from sequences at their original resolution.Then, it employs machine learning methods to predict the quantization parameters  (QPs) at which the rate-distortion curves across the different resolutions intersect.( r − 1) × 2 encodes must be performed to determine the bitrates at which resolutions should be switched.This content-gnostic approach has been claimed to reduce the number of encodings required compared to other methods (by 81% -94%) compared to the bruteforce encoding approach.It uses constant quantization parameter (CQP) encodes, which are not used in real-time streaming applications.Another method proposed by Bhat et al. [5] uses machine learning to predict the resolution without requiring multiple encodings.Features from the low-resolution encoding of the first few frames are input to a random forest model to predict better-performing resolution for a decision period.Similarly, Zabrovskiy et al. [17] used an artificial neural network to predict an optimized bitrate ladder for each scene, optimized based on the YPSNR quality metric.These methods produce latency significantly higher than the accepted latency in live streaming.Our previous work OPTE [6] uses random forest models to predict optimized resolution, yielding the highest perceptual quality using spatiotemporal features extracted for each segment.However, OPTE does not consider encoding latency constraint during the optimized resolution prediction.To summarize, current related work lacks considering encoding latency constraints while selecting the optimized encoding resolution, and most state-of-the-art methods need pre-encodings which yield significant latency and energy consumption.

LATENCY-AWARE DYNAMIC RESOLUTION ENCODING
Striking the right balance between offering high-quality, highresolution streams and minimizing encoding time and energy consumption is crucial for adaptive streaming platforms to ensure responsive and uninterrupted playback experiences across various end-user devices and network environments.In line with this perspective, this paper proposes an encoding latency-aware dynamic encoding resolution encoding scheme (LADRE) to maximize the perceived quality of video segments (in terms of XPSNR) based on the video content complexity, target bitrate, and the maximum encoding time constraint.As shown in Figure 2, LADRE is classified into four steps and described in the following subsections: (1) spatiotemporal complexity feature extraction, (2) optimized resolution prediction, (3) optimized rate factor prediction, and (4) constrained variable bitrate (cVBR) encoding using the selected bitrate-resolution-rate factor combinations.
Table 2 summarizes the notations used in LADRE.

Spatiotemporal complexity feature extraction
This process involves analyzing the video content in both spatial and temporal dimensions, capturing essential information about object movements, scene changes, and visual details.Predictive  models can comprehensively understand the content complexity and characteristics by extracting relevant spatiotemporal features, such as motion vectors, texture patterns, and frame-to-frame differences [19].These features serve as valuable inputs for algorithms that dynamically adjust encoding parameters like bitrate resolution, ensuring that the live stream quality while adapting to real-time changing network Intuitively, a higher resolution may be warranted to effectively capture fine details for content with high and relatively low complexity.Conversely, with high temporal complexity may benefit from a lower resolution, as rapid changes between frames can limit the perceptual benefit of higher spatial detail.This paper uses seven DCT-energy-based features [20,21]: { Y , ℎ,  Y ,  U ,  U ,  V ,  V } as the content complexity features of video segments.

Optimized resolution estimation
The objective of selecting the optimized resolution based on bitrate and video complexity features is decomposed into two parts: (1) designing models to predict the encoding time and the perceptual quality in terms of XPSNR; (2) developing a function to obtain the optimized resolution based on the predicted encoding times and perceptual quality for each available encoding resolution.
Modeling: The perceptual quality  (  ,  ) and encoding time  (  ,  ) of the representation (  ,   ) rely on the extracted video complexity features ( Y , ℎ,  Y ,  U ,  U ,  V ,  V ), encoding resolution   , and target bitrate   [7,22]: Spatio-temporal features encapsulate intricate spatial details and temporal dynamics within the video segment and help to assess the video fidelity [23].Including resolution, bitrate, framerate, and preset parameters in the models acknowledges the interplay between compression efficiency, temporal smoothness, and spatial clarity in shaping perceived quality.A higher resolution, bitrate, or framerate may improve the quality and increase the file size of the video segment.A slower preset at the same target bitrate can reduce the file size of the video segment.Similarly, a higher resolution, bitrate, framerate, or a slower preset can increase the encoding duration.
Optimization: LADRE optimizes the perceptual quality of encoded video segments while adhering to real-time processing constraints.It predicts the optimized resolution of the  th representation to maximize the compression efficiency while maintaining the encoding time below the threshold  L .The optimization function is: where v (,  ) and τ(,  ) are the predicted XPSNR and encoding speed of the representation (,   ).This paper considers XPSNR as the perceptual quality measure instead of the popular VMAF metric since the correlation of XPSNR with subjective quality scores is significantly higher than VMAF for VVC-coded bitstreams [24].

Optimized rate factor estimation
Predicting the rate factor helps ensure consistent video quality throughout the stream.It allows the encoder to allocate bits judiciously, preventing underallocation (resulting in poor quality) or over-allocation (wasting bandwidth) of bits for encoding [6].
Modeling: The rate-factor  (  ,  ) relies on the extracted video complexity features, encoding resolution   , and target bitrate   parameters [25]: Content with intricate details, textures, or sharp edges demands a lower rate factor to represent these features accurately in the encoded video.Similarly, segments with fast motion, frequent scene changes, or dynamic content require a lower rate factor to accurately capture the rapid changes between frames.

Optimization:
The mathematical formulation of the rate factor optimization to yield a bitrate as close to the target bitrate as possible can be expressed as follows: A loss function measures the deviation between the target and predicted bitrate.The objective is to find the rate factor that minimizes the loss function.

Constrained variable bitrate encoding
Constrained variable bitrate encoding offers dynamic bitrate allocation while imposing upper limits on the bitrate variability.This method enables flexibility in bitrate adjustment, accommodating the complexity of video content by allowing fluctuations within a predefined range.However, unlike traditional VBR methods, constrained VBR sets constraints on the maximum and minimum bitrates, ensuring that the bitrate variations stay within specific bounds.Doing so balances the trade-off between encoding efficiency and consistent quality.
The encoding uses the predicted bitrate-resolution-rate factor configurations for a given input video segment.b is considered the upper bound of bitrate variability, and ĉ is the rate factor used for encoding.In VVenC [26], the rate factor is specified using the qp option, while the maxrate (easy mode) or MaxBitrate (expert mode) option is used to specify the upper bound of bitrate variability.A reasonable lower bound of the bitrate variability is achieved using the XPSNR-based perceptual QP adaptation in combination with the qp option and does not need to be specified explicitly.

EXPERIMENTAL DESIGN 4.1 Evaluation setup
All experiments are run on a dual-processor server with Intel Xeon Gold 5218R (80 cores, frequency at 2.10 GHz), where each encoding instance uses four CPU threads (i.e.,  = 4) with multi-threading and x86 SIMD [27] optimizations.The Inter-4K dataset [10] is used to validate the performance of the encoding schemes considered in this paper.Inter-4K is a standard benchmark for video superresolution and frame interpolation, containing 1000 UHD videos captured at 60 frames per second (fps).The experimental parameters used to evaluate LADRE are shown in Table 3.The sequences are encoded using VVenC v1.10 [26] using preset 0 (faster).The spatiotemporal features are extracted using VCA v2.0 [21] running as a pre-processor using four CPU threads with multi-threading and x86 optimizations.LADRE uses random forest regression models [28] to predict XPSNR, encoding time, and rate factor trained for each supported resolution in R.

Performance metrics
The resulting overall quality in Peak Signal to Noise Ratio (PSNR) [29] and XPSNR [11] and the achieved bitrate are compared for every representation of each test sequence.Bjøntegaard Delta values [30] BD-PSNR and BD-XPSNR refer to the average increase in PSNR and XPSNR of the representations compared with the reference encoding scheme with the same bitrate, respectively.A positive BD-PSNR and BD-XPSNR indicate a gain in coding efficiency compared to the reference encoding scheme.The relative storage consumption (Δ) of representations is evaluated compared with the reference encoding scheme.The encoding latency of a video segment is calculated from the instant it is fed to the LADRE pipeline until each representation is completely encoded.

Reference schemes
This paper compares LADRE with the following benchmarks: (1) Default: This scheme employs a fixed bitrate ladder, i.e., a fixed set of bitrate-resolution pairs.This paper uses the HLS bitrate ladder specified in the Apple authoring specifications [2] as the fixed bitrate ladder.(2) OPTE [6] This scheme predicts optimized resolution, which yields the highest XPSNR for a given target bitrate.
The other methods mentioned in Table 1 are not considered for evaluation, as they utilize information from pre-encodings to determine convex-hull, which introduces significant latency and energy consumption in the encoding servers.

EVALUATION RESULTS
Prediction latency and accuracy: We evaluate the pre-processing latency ( p ) in encoding introduced by the video complexity feature extraction and the model inference to predict the optimized resolution-rate factor configurations.We extract the features at an average rate of 167 fps over the entire dataset (2160p resolution).This result is critical in future-proofing the system by handling evolving content requirements (e.g., 8K resolution or high framerate content).The time to predict the resolution-rate factor for each representation is 5 ms.As video complexity feature extraction and the optimized resolution and rate factor prediction can execute concurrently in real applications, the overall latency introduced by LADRE is negligible.The accuracy of the encoding time, rate factor, and XPSNR prediction models are analyzed in terms of mean absolute error (MAE).The average MAE is 6.97 s, 1.05, and 0.48 dB, respectively.

Resolution prediction:
We analyze the encoding resolution predictions of LADRE.OPTE generally yields the highest resolutions for a given target bitrate compared to Default and LADRE encodings.The selected encoding resolution for a given target bitrate decreases as  L decreases.If the target latency constraint in LADRE is eliminated, i.e.,  L = ∞, resolutions yielding the highest XPSNR are selected, converging the resolution selection to OPTE.Notably, in scenarios where encoding time constraints become more stringent, higher bitrate representations might be omitted in LADRE due to limitations in encoding these representations within the allocated time budget, as observed in Figure 3.
Rate-distortion performance: Figure 3 shows the RD curves of the representative video segments in the test dataset.It is observed that the RD curve of OPTE is mostly higher than default and LADRE.This means that, for any given target bitrate, OPTE maintains a higher level of visual quality as measured by XPSNR.Consequently, viewers can enjoy a visually pleasing experience with reduced artifacts, such as blocking or blurring, at the same bitrate.Notably, some representations are not encoded in LADRE since it was impossible to select a resolution with encoding time within  L .Hence, the storage needed for the representations is lowered as  L decreases.It is observed in  and BD-XPSNR values decrease as the encoding time constraint is lower.
Latency and energy consumption performance: As shown in Figure 3, OPTE yields the longest encoding time due to higher encoding resolutions optimized for maximum XPSNR.This significantly increased encoding time may impact real-time or low-latency applications.Encoding typically utilizes the processing units (e.g., CPU or GPU) intensively.These processing units operate at a relatively constant power level during encoding.Therefore, the power consumed over time remains reasonably consistent, contributing to the linear relationship between encoding time and energy consumption [7].Hence, we assume that the encoding time savings directly translates to the encoding energy consumption reduction.Since we assume that the encodings are carried out concurrently, the total encoding time for each segment ( L ) is determined to be the highest encoding time yielded among the bitrate ladder representations [31].Table 4 shows the average encoding time for

CONCLUSIONS AND FUTURE DIRECTIONS
This paper proposes an encoding latency-aware dynamic resolution per-title encoding scheme (LADRE) for adaptive streaming applications.LADRE includes an optimized resolution prediction, which uses random forest-based models to estimate bitrate-resolution-rate factor triples for a given video segment based on its spatial and temporal characteristics.The experimental results show that, on average, LADRE, with a target encoding time constraint of 200 s, yields bitrate savings of 10.25 % and 12.03 % to maintain the same PSNR and XPSNR, respectively, compared to the reference HLS bitrate ladder with a negligible additional latency in streaming.This is accompanied by an average decrease of 84.17 % in encoding energy consumption.
One promising avenue for future research is exploring advanced machine-learning models to enhance prediction accuracy.Investigating novel features and metrics that better capture the relationship between encoding time and optimal resolutions might also be a promising avenue.Moreover, delving into collaborative frameworks or distributed algorithms for efficient encoding resolution selection across multiple streaming nodes could be another area of exploration.

Figure 1 :
Figure 1: Rate-distortion (RD) and rate-encoding time curves of representative sequences (segments) of Inter-4K dataset[10] encoded at 540p, 1080p and 2160p resolutions using VVenC at faster preset.Here, XPSNR is used as the quality metric.

Figure 2 :
Figure 2: Encoding using LADRE envisioned in this paper for online streaming applications.

Table 1 :
Comparison of the state-of-the-art dynamic resolution per-title encoding methods with LADRE.

Table 2 :
Notations used in LADRE.b , ĉ predicted resolution, bitrate and rate factor of the  th representation

Table 3 :
Experimental parameters used to evaluate LADRE.
Table 4 that the BD-PSNR

Table 4 :
Average results of the encoding schemes compared to the Default bitrate ladder encoding. L ) using the considered encoding schemes.It is observed that the encoding times of representations of video segments decrease as  L decreases.