Optimal Quality and Efficiency in Adaptive Live Streaming with JND-Aware Low latency Encoding

In HTTP adaptive live streaming applications, video segments are encoded at a fixed set of bitrate-resolution pairs known as a bitrate ladder. Live encoders use the fastest available encoding configuration, referred to as a preset, to ensure the minimum possible latency in video encoding. However, an optimized preset and an optimized number of CPU threads for each encoding instance may result in (i) increased quality and (ii) efficient CPU utilization while encoding. For low latency live encoders, the encoding speed is expected to be greater than or equal to the video framerate. In this light, this paper introduces a Just Noticeable Difference (JND)-Aware Low latency Encoding Scheme (JALE), which uses random forest-based models to jointly determine the optimized encoder preset and thread count for each representation, based on video complexity features, the target encoding speed, the total number of available CPU threads, and the target encoder. Experimental results show that, on average, JALE yields a quality improvement of 1.32 dB PSNR and 5.38 VMAF points at the same bitrate, compared to the fastest preset encoding of the HTTP Live Streaming (HLS) bitrate ladder using the x265 HEVC open-source encoder with eight CPU threads used for each representation. These enhancements are achieved while maintaining the desired encoding speed. Furthermore, on average, JALE results in an overall storage reduction of 72.70 %, a reduction in the total number of CPU threads used by 63.83 %, and a 37.87 % reduction in the overall encoding time, considering a JND of six VMAF points.


INTRODUCTION
The Moving Picture Experts Group (MPEG) has developed a standard called Dynamic Adaptive Streaming over HTTP (MPEG-DASH) [1] to meet the high demand for streaming high-quality video content over the Internet and overcome the associated challenges in HTTP Adaptive Streaming (HAS) [2]. The main idea behind HAS is to divide the video content into segments and to encode each segment at various bitrates and resolutions, called representations. These representations enable a continuous adaptation of the video delivery to the client's network conditions and device capabilities [3]. The increase in computational complexity with codecs such as High Efficiency Video Coding (HEVC) [4] and Versatile Video Coding (VVC) [5], together with improvements in video characteristics such as resolution [6], framerate [7], and bit-depth, raises the need for a large-scale, highly efficient video encoding environment [8]. This is crucial for DASH-based content provisioning, as it requires encoding multiple representations of the same video content in an encoding server.
An encoding speed that is the same as or greater than the video framerate, regardless of the complexity of the video content, is a crucial goal for a low latency live encoder [10]. Although the output video's compression efficiency (in terms of the obtained perceptual quality and bitrate) is an essential metric for the encoder, maintaining the encoding speed takes precedence in live streaming scenarios. This is because a reduction in encoding speed may lead to dropped frames during transmission, which decreases the quality of experience [10]. The encoding speed depends on the video content complexity and on parameters such as (i) target resolution, (ii) target bitrate, (iii) number of CPU threads [11], and (iv) encoder configuration [12].
Optimized resource allocation in encoding servers: In adaptive streaming, allocating an optimized number of CPU threads to each video encoder instance at a cloud server is crucial. Tailoring the CPU thread count to the encoding resolution and target bitrate allows for precise resource allocation and a more efficient encoding process across the diverse video qualities of an adaptive streaming environment. Figure 1 shows the encoding time measurement of an entire HLS HEVC bitrate ladder encoding of the Wood_s000 sequence [13] using the ultrafast preset of x265 [14] with 4, 8, and 16 CPU threads for each of the twelve encoding representations. Since the video segment has a framerate of 30 fps, the target encoding speed is set to 30 fps [10]. However, some thread configurations cannot deliver the desired encoding speed of 30 fps for certain representations. For example, the 11.6 Mbps and 16.8 Mbps representations do not achieve the target encoding speed (30 fps) using 4 or 8 threads. This emphasizes the impact of CPU utilization and thread configuration on the encoding performance.
Optimized encoding preset: Traditional open-source encoders like x264 [15], x265 [14], and VVenC [16] have pre-defined sets of encoding parameters (termed presets), which present a trade-off between the encoding time and compression efficiency [7,17]. The preset for the fastest encoding (ultrafast for x264 and x265) is used as the encoder preset for the entire live content, independent of the video content complexity [10]. Moreover, the streaming service provider arbitrarily chooses the number of CPU threads for each representation, irrespective of the content complexity. Though the conservative technique of fixing the preset and thread count for each encoding instance may achieve the intended result of low latency encoding, the resulting encoding is sub-optimal, especially when the type of content changes dynamically, which is the typical use case for live streams [18]. Furthermore, when the content becomes easier to encode (e.g., slow-moving videos or videos with simpler textures, for which predicting the current frame from a previous frame is simpler and results in smaller residuals), the encoder achieves a higher encoding speed than the target encoding speed. This, in turn, introduces unnecessary CPU idle time as the encoder waits for the video feed. If the encoder preset is configured such that this higher encoding speed is reduced while remaining compatible with the expected live encoding speed, the quality of the encoded content can be improved. Subsequently, when the content becomes complex again, the encoder preset needs to be reconfigured to move back to the faster configuration that achieves live encoding speed [7,19]. In addition, by employing efficient storage techniques and removing unnecessary representations, the energy consumption associated with storing and transmitting redundant data can be minimized [20].
Contributions: This paper proposes a Just Noticeable Difference (JND)-aware low latency encoding scheme (JALE) that jointly and dynamically determines the CPU thread count and encoder preset configuration for each bitrate representation, adapting to the video content to achieve low latency encoding. Content-aware features, i.e., Discrete Cosine Transform (DCT)-energy-based low-complexity spatial and temporal features, are extracted to determine the characteristics of video segments. Random forest-based models use these features to predict the optimized thread count and encoder preset for each representation such that the target encoding speed is maintained. JALE achieves the desired target encoding speed while maximizing compression efficiency and minimizing the total number of CPU threads used. Furthermore, based on JND, JALE removes perceptual redundancy between representations in the bitrate ladder.
Paper outline: The remainder of this paper is organized as follows. Section 2 describes the proposed JALE encoding architecture. In Section 3, the performance of JALE is evaluated, and Section 4 concludes the paper.

JALE ARCHITECTURE
The architecture of JALE for streaming applications is presented in Figure 2. The number of threads and the encoder preset for every segment in each representation of the bitrate ladder are predicted using spatiotemporal features of the input video segment, the target video encoding speed ($s_T$), the set of pre-defined supported thread counts per instance ($\mathcal{C}$), and the set of pre-defined encoder presets ($\mathcal{P}$). The encoding process is carried out with the predicted encoder preset and number of threads for each video segment. JALE comprises three steps: (1) video complexity feature extraction, (2) joint thread count and preset prediction, and (3) perceptually redundant representation elimination.

Video complexity feature extraction
Predictive models can capture the content complexity and characteristics of a video by using relevant spatiotemporal features, such as motion vectors, texture patterns, and frame-to-frame differences [21]. In this paper, three DCT-energy-based features [22], the average luma texture energy ($E_Y$), the average gradient of the luma texture energy ($h$), and the average luminance ($L_Y$), are extracted for each segment using the open-source Video Complexity Analyzer (VCA) [6,22].

Joint thread count and preset prediction
Selection of the optimized thread count-preset pair for each segment per representation based on the video content complexity is decomposed into two parts: (1) train models to predict the encoding speed for each thread count-preset pair, and (2) develop a function to obtain the optimized thread count-preset pair for each representation.
The encoding speed of the $t^{th}$ representation of the input video segment ($\hat{s}_t$) is modeled as a function of the video content complexity features ($E_Y$, $h$, and $L_Y$), the target representation (resolution $r_t$ and bitrate $b_t$) [7], the number of threads $n_t$, and the encoder preset $p_t$, as shown in the equation:

$$\hat{s}_t = f_S(E_Y, h, L_Y, r_t, b_t, n_t, p_t)$$

We use random forest models [23] to predict the encoding speed for each thread count-preset pair. In total, $|\mathcal{P}| \times |\mathcal{C}|$ models are trained, where $|\mathcal{P}|$ and $|\mathcal{C}|$ represent the number of encoding presets and the number of supported thread counts per instance, respectively.
The optimized thread count-preset prediction function has a look-up table of $(n, p)$ pairs. The supported encoder presets are chosen based on the target encoder and the preference of the streaming service provider. For example, presets ranging from ultrafast to veryslow can be chosen for the x264 and x265 encoders. The set of possible thread counts ($\mathcal{C}$) for every encoder instance is input by the streaming service provider based on the encoding server architecture. The priority of the $(n, p)$ pairs is decided based on the following constraints: (1) the achieved encoding speed $\hat{s}_t$ of the $t^{th}$ representation must be greater than or equal to the target encoding speed $s_T$, and (2) the total number of CPU threads used for each representation is minimized. An example look-up table for the x265 encoder is shown in Figure 3, where $\mathcal{C} = \{4, 8, 12, 16, 20, 24\}$ and $\mathcal{P} = \{$medium, fast, faster, veryfast, superfast, ultrafast$\}$. The look-up table is scanned in priority order, and the first $(n, p)$ pair whose predicted encoding speed meets the target $s_T$ is chosen as the optimized thread count-preset pair for the $t^{th}$ representation.
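The look-up table scan described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the priority order (fewest threads first, and the slowest preset first within a thread count), the fallback to the fastest configuration when no pair meets the target, and the function names `build_lookup_table`, `select_pair`, and `toy_speed` are all assumptions, since Figure 3 is not reproduced here. The speed predictor is passed in as a callable that stands in for the trained random forest models.

```python
from typing import Callable, List, Sequence, Tuple

# Presets indexed from fastest (0) to slowest (5), matching the x265 example set.
PRESETS = ["ultrafast", "superfast", "veryfast", "faster", "fast", "medium"]
THREAD_COUNTS = [4, 8, 12, 16, 20, 24]


def build_lookup_table(thread_counts: Sequence[int],
                       num_presets: int) -> List[Tuple[int, int]]:
    """Return (n, p) pairs in an ASSUMED priority order:
    fewest threads first, and for a given thread count the slowest preset first."""
    table = []
    for n in sorted(thread_counts):
        for p in range(num_presets - 1, -1, -1):
            table.append((n, p))
    return table


def select_pair(predict_speed: Callable[[int, int], float],
                target_speed: float,
                table: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Scan the table in priority order and return the first (n, p) pair whose
    predicted encoding speed (fps) meets the target."""
    for n, p in table:
        if predict_speed(n, p) >= target_speed:
            return n, p
    # ASSUMED fallback: most threads with the fastest preset.
    return max(n for n, _ in table), 0


def toy_speed(n: int, p: int) -> float:
    """Hypothetical stand-in for a trained speed model (not from the paper)."""
    return 6.0 * n / (p + 1)


table = build_lookup_table(THREAD_COUNTS, len(PRESETS))
n_opt, p_opt = select_pair(toy_speed, 30.0, table)
print(n_opt, PRESETS[p_opt])
```

Passing the predictor as a callable keeps the selection logic independent of how the per-pair speed models are trained and stored.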

Perceptually-redundant representation elimination
JALE uses the JND-based representation elimination algorithm proposed in our previous work [7]. However, it is described in this paper (cf. Algorithm 1) to make it self-contained. The perceptual quality of the $t^{th}$ representation ($\hat{v}_t$) is modeled as a function of the video content complexity features ($E_Y$, $h$, and $L_Y$), the target representation (resolution $r_t$ and bitrate $b_t$), and the encoder preset $p_t$, as shown in the equation [7,24]:

$$\hat{v}_t = f_V(E_Y, h, L_Y, r_t, b_t, p_t)$$

Random forest models [23] are trained to predict the perceptual quality of each representation. This paper uses VMAF as the perceptual quality metric for each representation, while other quality metrics can be envisioned and are subject to future work. In practice, it is often observed that the VMAF scores of different representations are very similar, which introduces perceptual redundancy in the bitrate ladder. To address this issue, this paper leverages the concept of the JND threshold, which represents the minimum quality difference that the human eye can perceive [25][26][27], and uses it to eliminate perceptually redundant representations. While [28,29] have explored VMAF-based JND thresholds, their complexity is unsuitable for live-streaming applications. Therefore, this paper adopts a fixed JND threshold [30,31], denoted as $v_J$, which is an input from the streaming service provider. If the difference in the predicted VMAF of two representations is smaller than $v_J$, the representation with the higher bitrate is eliminated. If the predicted VMAF of a representation is larger than $v_T$ ($v_T = 100 - v_J$), i.e., the threshold above which the representation is deemed perceptually lossless, the corresponding representation is eliminated from the bitrate ladder. In this manner, the overall storage consumption and encoding energy of the representations of an input video segment are reduced.
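The two elimination rules stated above can be sketched as follows. Algorithm 1 is not reproduced in this text, so this is only an illustration under stated assumptions: representations are visited in ascending bitrate order, each candidate is compared with the last representation that was kept, and the function name `eliminate_redundant` and the VMAF values in the example are hypothetical.

```python
from typing import List, Tuple


def eliminate_redundant(ladder: List[Tuple[float, float]],
                        v_j: float) -> List[Tuple[float, float]]:
    """Filter a ladder of (bitrate_mbps, predicted_vmaf) pairs.

    Rule 1: drop a representation whose predicted VMAF exceeds v_T = 100 - v_J.
    Rule 2: drop a representation whose predicted VMAF is less than v_J above
            the last kept (lower-bitrate) representation.
    Comparing against the last KEPT representation is an assumption.
    """
    v_t = 100.0 - v_j
    kept: List[Tuple[float, float]] = []
    for bitrate, vmaf in sorted(ladder):
        if vmaf > v_t:
            continue  # above the perceptually lossless threshold
        if kept and (vmaf - kept[-1][1]) < v_j:
            continue  # not noticeably better than the cheaper representation
        kept.append((bitrate, vmaf))
    return kept


# Hypothetical predicted VMAF scores for ten ladder bitrates (illustration only).
example_ladder = [
    (0.145, 20.0), (0.3, 35.0), (0.6, 38.0), (0.9, 50.0), (1.6, 54.0),
    (2.4, 70.0), (3.4, 85.0), (4.5, 92.0), (5.8, 95.0), (8.1, 97.0),
]
kept = eliminate_redundant(example_ladder, v_j=6.0)
print([bitrate for bitrate, _ in kept])
```

Because the filter operates only on predicted VMAF scores, it can run before any encoding starts, which is what allows the eliminated representations to save encoding resources as well as storage.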

EVALUATION
This section introduces the test methodology used in this paper and presents the experimental results.

Test Methodology
We use four hundred sequences (80 % of the sequences) from the video complexity dataset [13] as the training dataset and the remaining sequences (20 %) as the test dataset. We encode the sequences at 30 fps using x265 v3.5 [14] with multi-threading and x86 SIMD [32] optimizations. The experimental parameters used in this paper are listed in Table 1. We achieve CBR encoding for a target bitrate of $b_t$ (in Mbps) by setting the bitrate and vbv-maxrate options of x265 to $b_t$ and enabling the strict-cbr mode. We run all experiments on a dual-processor server with Intel Xeon Gold 5218R (80 cores, frequency at 2.10 GHz). We consider the bitrate ladder from the HLS authoring specification for Apple devices [9]. We consider $v_J$ values of two [30], four, and six [31], based on current industry practices.
Prediction models: To ensure the robustness and generalization of the prediction models, we perform five-fold cross-validation over the video sequences and average the results. The scheme also ensures that the test and training segments are kept separate. We perform hyperparameter tuning of the random forest prediction models on the ultrafast preset to balance the size and prediction accuracy of the models. The selected hyperparameters [33] are min_samples_leaf=1, min_samples_split=2, n_estimators=100, and max_depth=14.
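As an illustration of the CBR configuration described above, the following sketch assembles an x265 command line for one representation. It is not the authors' script. The `--bitrate`, `--vbv-maxrate`, and `--strict-cbr` options follow the text; setting `--vbv-bufsize` equal to the bitrate and controlling the thread count through `--pools` are assumptions, as the paper does not state how these were configured. The helper name `x265_cbr_command` and the file names are hypothetical, and the raw input is assumed to be already scaled to the target resolution.

```python
from typing import List


def x265_cbr_command(input_path: str, output_path: str,
                     width: int, height: int, fps: int,
                     bitrate_mbps: float, preset: str,
                     threads: int) -> List[str]:
    """Build an x265 CLI argument list for strict CBR encoding."""
    kbps = int(round(bitrate_mbps * 1000))  # x265 expects bitrates in kbps
    return [
        "x265",
        "--input", input_path,
        "--input-res", f"{width}x{height}",
        "--fps", str(fps),
        "--preset", preset,
        "--bitrate", str(kbps),
        "--vbv-maxrate", str(kbps),
        "--vbv-bufsize", str(kbps),   # ASSUMPTION: buffer size not given in the paper
        "--strict-cbr",
        "--pools", str(threads),      # ASSUMPTION: thread count set via the thread pool size
        "--output", output_path,
    ]


cmd = x265_cbr_command("segment_2160p.yuv", "segment_11600k.hevc",
                       3840, 2160, 30, 11.6, "ultrafast", 24)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where the x265 binary is installed; the sketch only builds the command and does not execute it.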
Benchmark schemes: We compare JALE with the following encoding schemes: (1) Default: ultrafast preset with eight threads for each encoding instance [9]. (2) Bruteforce: optimized thread count-preset pair with and without JND-based representation elimination when the models are fully accurate. This is accomplished by bruteforce encoding using all thread count-preset pairs and selecting the optimized pair [8]. Hence, it is suitable only for video-on-demand applications. (3) CAPS [34]: determines the optimized preset for each representation for a target encoding speed of 30 fps. We evaluate CAPS with thread counts of $n$ = 4, 8, and 16, respectively.
Performance metrics: We compare JALE with the benchmark schemes using $BDR_P$ and $BDR_V$ [35], which refer to the average increase in bitrate of the representations compared to the reference bitrate ladder encoding scheme to maintain the same PSNR and VMAF, respectively. A negative BDR indicates a gain in the coding efficiency of the considered encoding scheme compared to the reference bitrate ladder encoding scheme. Furthermore, we calculate BD-PSNR and BD-VMAF, which refer to the average increase in PSNR and VMAF at the same bitrate compared with the reference bitrate ladder encoding scheme. Positive BD-PSNR and BD-VMAF denote an increase in the coding efficiency of the considered encoding scheme compared to the reference bitrate ladder encoding. The relative storage space difference between the considered encoding scheme and the reference encoding scheme, which require storage of $S_{\text{opt}}$ and $S_{\text{ref}}$, respectively, to store all bitrate ladder representations, is evaluated as:

$$\Delta S = \frac{S_{\text{opt}}}{S_{\text{ref}}} - 1$$

We further determine the relative difference in the CPU thread count ($\Delta C$) needed for all bitrate ladder representations of the considered encoding schemes. We measure encoding energy consumption using the Running Average Power Limit (RAPL) interface and the CodeCarbon tool [36].
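The relative-difference metrics can be illustrated with a short sketch. It assumes that, under CBR encoding with segments of equal duration, storage is proportional to the sum of the representation bitrates; this proportionality, the function name `relative_difference`, and the example ladders and thread counts below are illustrative assumptions rather than values from the paper.

```python
from typing import Sequence


def relative_difference(considered: Sequence[float],
                        reference: Sequence[float]) -> float:
    """Relative difference of the totals: sum(considered) / sum(reference) - 1.
    A negative value means the considered scheme needs less than the reference."""
    ref_total = sum(reference)
    if ref_total <= 0:
        raise ValueError("reference total must be positive")
    return sum(considered) / ref_total - 1.0


# Hypothetical example: a three-rung reference ladder (Mbps) of which
# the considered scheme keeps only the two lowest rungs.
ref_bitrates = [1.0, 2.0, 5.0]
opt_bitrates = [1.0, 2.0]
delta_storage = relative_difference(opt_bitrates, ref_bitrates)

# Hypothetical thread counts: eight threads per rung versus (4, 4) for the kept rungs.
delta_threads = relative_difference([4, 4], [8, 8, 8])

print(f"storage: {delta_storage:+.2%}, threads: {delta_threads:+.2%}")
```

The same helper applies to storage and thread count because both metrics compare a total over the considered ladder with the total over the reference ladder.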

Experimental Results
Predictions: Figure 4a shows the average encoder preset and the number of CPU threads chosen for each representation of the bitrate ladder across the test dataset. On average, the 0.145 Mbps representation chooses the medium preset ($p = 5$) and 4 threads, while the 11.6 Mbps and 16.8 Mbps representations choose the ultrafast preset ($p = 0$) and 24 threads. This is because faster encoding presets and more computational resources are needed to encode high-bitrate representations such that the encoding speed is above the threshold. Figure 4b shows the average encoding speed for every representation in the bitrate ladder using the default encoding and JALE.
It is observed that, on average, the JALE encodings have speeds above the lower bound given by the target encoding speed, $s_T$ = 30 fps. It is also observed that the default encoding has a high encoding speed for lower bitrate representations. This occurs when the encoding tasks are relatively undemanding in terms of CPU resources, which leaves the CPU substantially under-utilized while executing the encoding operations during a live feed. Furthermore, the default encoding could not achieve the minimum target encoding speed at higher bitrates (11.6 Mbps and 16.8 Mbps). In contrast, JALE keeps the encoding speed above $s_T$ but not significantly higher, unlike the default encoding. This ensures higher CPU utilization when the encodings are carried out concurrently during a live feed.
Rate-distortion performance: Figure 4c shows the average VMAF for every representation in the bitrate ladder using the default, CAPS, and JALE encodings, while Figure 5 shows the RD curves of representative video segments of various video complexities encoded using the benchmark schemes and JALE ($v_J$ = 6). It is observed that the VMAF achieved by JALE is higher than or close to that of the CAPS encoding and consistently higher than that of the default encoding at the same target bitrates. Moreover, perceptually redundant representations are eliminated in JALE. Furthermore, we observe a substantial improvement in quality at lower bitrate representations, owing to the selection of slower presets.
Bjøntegaard delta rates (BDR): We evaluate the coding efficiency using $BDR_P$, $BDR_V$, BD-PSNR, and BD-VMAF compared to the default encoding, as shown in Table 2. Bruteforce encoding [8] corresponds to fully accurate predictions and therefore represents the upper bound of the compression efficiency improvement (in VMAF) compared to the default encoding. Compared to the default encoding, the coding efficiency improvement achieved by JALE is similar to that of the bruteforce encoding. Hence, the prediction models used in JALE are deemed fairly accurate. Using JALE ($v_J$ = 6), we observe an average bitrate reduction of 25.93 % and 25.47 % to maintain the same PSNR and VMAF, respectively. Furthermore, we observe an average quality improvement of 1.32 dB PSNR and 5.38 VMAF points, respectively, at the same target bitrate.
Storage consumption: We evaluate the relative difference in the storage space between the considered encoding schemes and the default encoding scheme to store all bitrate ladder representations. When the JND value increases, the number of representations in the bitrate ladder decreases, causing a decrease in the overall storage.
Encoding energy consumption: We evaluate the encoding schemes by analyzing the relative difference in energy consumption during encoding ($\Delta E_{\text{enc}}$) of the bitrate ladder compared to the default encoding scheme. Predictably, compared to the default and JALE encodings, CAPS ($n$ = 16) yields the highest encoding energy consumption, owing to the selection of slower presets at all bitrates, as the constraint on total CPU threads is not considered. JALE yields the lowest encoding energy among the compared schemes. Moreover, as $v_J$ increases, the energy consumption is reduced.

CONCLUSIONS
This paper proposed JALE, a JND-aware low latency encoding scheme for adaptive live streaming applications. JALE jointly predicts the optimized encoder preset and CPU thread count for a given representation for each video segment based on the video content complexity features, target encoding speed, and the total number of available CPU threads. It helps improve quality and CPU utilization during encoding. Furthermore, the JND-based representation elimination algorithm removes perceptually redundant representations in the bitrate ladder. The performance of JALE is analyzed using the x265 open-source HEVC encoder for the HLS bitrate ladder encoding. It is observed that JALE yields an overall average quality improvement of 0.98 dB PSNR and 4.41 VMAF points at the same bitrate, compared to ultrafast encoding of the reference HLS bitrate ladder using eight CPU threads for each representation. Considering a JND of six VMAF points, storage, thread count, and encoding time reductions of 72.70 %, 63.83 %, and 37.87 %, respectively, are observed.
In the future, JALE will support the addition of encoder presets beyond the pre-defined options to further enhance the efficiency of the encoding process. This may provide more flexibility in selecting encoding parameters such that the achieved encoding speed matches the target. Furthermore, storage reduction techniques and representation elimination shall be investigated to enhance the overall energy efficiency and sustainability of video streaming systems.

ACKNOWLEDGMENT
The financial support of the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development, Austrian Research Promotion Agency (FFG), grant agreement FO999897846 (GAIA) and the Christian Doppler Research Association is gratefully acknowledged. Christian Doppler Laboratory ATHENA: https://athena.itec.aau.at/.

Figure 2: Live encoding using JALE envisioned in this paper.

Figure 3: $(n, p)$ look-up table used in the experimental validation of this paper.

Figure 4: Results for each representation in JALE. JND-based representation elimination is not considered in these plots.

Table 1: Experimental parameters of JALE used in this paper.

Table 2: Average results of the encoding schemes compared to the default encoding.