Transcoding Quality Prediction for Adaptive Video Streaming

In recent years, the proliferation of video streaming applications has increased the demand for Video Quality Assessment (VQA). Reduced reference video quality assessment (RR-VQA) is a category of VQA where certain features (e.g., texture, edges) of the original video are provided for quality assessment. It is a popular research area for various applications such as social media, online games, and video streaming. This paper introduces a reduced reference Transcoding Quality Prediction Model (TQPM) to determine the visual quality score of a video that may have been transcoded in multiple stages. The quality is predicted using Discrete Cosine Transform (DCT)-energy-based features of the video (i.e., the video's brightness, spatial texture information, and temporal activity) and the target bitrate representation of each transcoding stage. To do that, the problem is formulated, and a Long Short-Term Memory (LSTM)-based quality prediction model is presented. Experimental results illustrate that, on average, TQPM yields PSNR, SSIM, and VMAF predictions with $R^2$ scores of 0.83, 0.85, and 0.87, respectively, and a Mean Absolute Error (MAE) of 1.31 dB, 1.19 dB, and 3.01, respectively, for single-stage transcoding. Furthermore, $R^2$ scores of 0.84, 0.86, and 0.91, respectively, and an MAE of 1.32 dB, 1.33 dB, and 3.25, respectively, are observed for a two-stage transcoding scenario. Moreover, the average processing time of TQPM for 4s segments is 0.328s, making it a practical VQA method in online streaming applications.


INTRODUCTION
The demand for Video Quality Assessment (VQA) is growing in video streaming applications. It plays an essential role in video processing from capturing to rendering, including compression, transmission, restoration, and display [15]. With all the available encoding options and trade-offs to consider in HTTP Adaptive Streaming (HAS) [1], having a lightweight and reliable VQA method is crucial. According to the degree of information available for the reference video signals, VQA is classified into full reference (FR), reduced reference (RR), and no reference (NR) methods. NR-VQA methods are "blind", where the original video content is not used for quality assessment, leading to an unreliable VQA [15]. On the other hand, since RR-VQA methods (i) use less overhead data than FR-based VQA approaches and (ii) are more reliable than NR-based VQA methods, they are employed in real-time scenarios [6].
The workflow of the state-of-the-art RR-VQA methods is shown in Figure 1. The characteristic features of the original video and of the video reconstructed after an arbitrary video processing operation (e.g., pixels, relative entropy or entropy difference [5,34], frequency domain features like DCT [35]) are extracted. The quality score processor (mostly ML-based implementations in the literature) combines these features to predict the resultant video quality [6]. Since Peak Signal to Noise Ratio (PSNR) remains the de facto industry standard for video quality evaluation, many RR-VQA methods have been developed to evaluate it [27,29]. Furthermore, there are methods that predict the Structural Similarity Index (SSIM) [36,37], Spatio-temporal RR Entropic Differences (STRRED) [38], and Spatial RR Entropic Differences (SRRED) [32] metrics. However, the metrics mentioned above have limitations, such as neglecting the temporal nature of compression artifacts [16]. To bridge these gaps, Video Multi-method Assessment Fusion (VMAF) was introduced [14]. VMAF was proposed as an FR-VQA model that combines quality-aware features to predict perceptual quality. For that, it incorporates human vision modeling with machine learning and offers an acceptable prediction of the video QoE [16]. VMAF is used as an optimization criterion for better encoding decisions in different applications.
As an example, Orduna et al. [28] show that VMAF can be used without any specific training or adjustments to obtain the quality of 360-degree virtual reality (VR) sequences perceived by users. Zadtootaghaj et al. [40] use VMAF to analyze the video quality of online video gaming services and calculate the minimum encoding bitrate to significantly reduce the required bandwidth of different streaming games. Sakaushi et al. [30] present a video surveillance system where VMAF is used to measure how the quality of the video is degraded for different bitrates. In [19], optimized bitrate-resolution pairs that maximize VMAF are selected for the bitrate ladder. In [20,25], perceptually-aware optimized bitrate-resolution pairs that maximize the visual quality and compression efficiency are selected for the bitrate ladder. Additionally, in [18], the optimized framerate that yields the highest VMAF is selected for every target bitrate in the ladder. Hence, visual quality prediction enables the server to choose the optimized encoding parameters for the bitrate ladder [24].
Contributions: This paper proposes a reduced-reference transcoding quality prediction model (TQPM) for video streaming applications. To the best of our knowledge, this is the first work proposed to predict VMAF for multi-stage transcoding, especially in video streaming applications, where the video segment is subjected to multiple stages of transcoding before being transcoded to the target bitrate representation. First, DCT-energy-based features are extracted from the input video segment, and the information of the transcoding pipeline (i.e., the target bitrate representation of the encoder in each stage) is used as the reduced reference for VMAF prediction. Second, feature extraction is carried out only for the input video segment. This contrasts with the state-of-the-art RR-VQA methods, where feature extraction is carried out for both the input and the output video segments of the transcoding system. The prediction performance of the proposed model is validated using Apple HTTP Live Streaming (HLS) bitrate ladder transcoding with the x265 HEVC [33] open-source encoder.
Paper outline: Section 2 explains the M-stage transcoding model formulated in this paper, while Section 3 discusses the architecture of TQPM. Section 4 illustrates the evaluation of the TQPM performance. Finally, Section 5 concludes the paper.

M-STAGE TRANSCODING MODEL
Recently, video transcoding has been considered a prevalent solution for reconstructing video sequences at in-network servers (deployed at cloud or edge) in latency-sensitive video streaming applications [7][8][9][10]. Hence, in this paper, a generalized M-stage transcoding model for HAS, as depicted in Figure 2, is proposed, targeting the following scenarios:
(1) Single-stage transcoding: This is the scenario where the user receives the bitstream produced by the source server. As shown in Figure 3, clients A and B receive the bitrate representation generated at the origin server. Here, VQA can be accomplished at the origin server, as the original and reconstructed video segments are both available there. However, in the state-of-the-art VQA methods, the encoding process must be complete to determine the visual quality score. Moreover, the time taken for feature extraction ($\tau_{fe}$) of the input and reconstructed video segments adds to the latency.
(2) Two-stage transcoding: In these applications, a higher bitrate representation already available at the edge server is transcoded to a lower bitrate representation requested by the user. As shown in Figure 3, clients C, D, and E receive the 11.6 Mbps, 5.8 Mbps, and 2.4 Mbps representations, respectively. The edge server transcodes the video segment from the 16.8 Mbps representation to the requested representations. In this manner, the response delay and the backhaul traffic between the origin and the edge servers are expected to be reduced [7].
State-of-the-art VQA methods cannot be used in this scenario as (i) the original input video segment is not available as the reference at the destination (client) and (ii) the final reconstructed video segment is not available at the source (origin server). Assuming a hypothetical scenario where the original and reconstructed video segments are available together at the source or destination, the total processing time would include two encoding and decoding steps and feature extraction of the original and reconstructed segments.
There may also be scenarios of three-stage transcoding that involve two edge servers. As depicted in Figure 2, the generalized M-stage transcoding model for HAS consists of a series of M encoders and M decoders in a chain. M=1 corresponds to single-stage transcoding, while M=2 corresponds to two-stage transcoding. As explained, RR-VQA poses numerous problems when deployed in multi-stage transcoding applications. First, the total transcoding latency to compute video quality ($\tau_T$) using the input and the final reconstructed video segments is very high. This is because the encoding and decoding times of the M-stage transcoding process (M encoding and M decoding processes) and the time taken for feature extraction ($\tau_{fe}$) of the input and reconstructed video segments all add to the latency. The total transcoding latency is formulated in Eq. (1):

$$\tau_T = \sum_{i=1}^{M} \left( \tau_{e_i} + \tau_{d_i} \right) + \tau_{fe} \quad (1)$$

where $\tau_{e_i}$ and $\tau_{d_i}$ represent the time taken to encode and decode at the $i^{th}$ transcoding stage, respectively.
Second, determining VMAF is cumbersome in most video streaming applications where (i) the original input video segment is not available as the reference at the destination; (ii) the final reconstructed video segment is not available at the source; (iii) slow VMAF decision-making is not acceptable for online latency-sensitive services. Performing VQA at the source by predicting VMAF from the input video segment characteristics and the transcoding system characteristics solves the above-mentioned problems.
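The latency argument above can be illustrated with a minimal Python sketch. It adds up the per-stage encoding and decoding times of an M-stage chain and the feature extraction time, as described for Eq. (1). The `StageTiming` container, the function name, and all timing values are hypothetical and chosen only for illustration; they are not measurements from this paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StageTiming:
    """Encode/decode time in seconds of one transcoding stage (hypothetical container)."""
    encode_s: float
    decode_s: float


def total_transcoding_latency(stages: Sequence[StageTiming],
                              feature_extraction_s: float) -> float:
    """Sum of per-stage encode and decode times plus the feature extraction time."""
    if not stages:
        raise ValueError("at least one transcoding stage (M >= 1) is required")
    return sum(s.encode_s + s.decode_s for s in stages) + feature_extraction_s


# Made-up timings for a two-stage (M=2) chain; not measurements from the paper.
two_stage = [StageTiming(encode_s=1.0, decode_s=0.25),
             StageTiming(encode_s=0.5, decode_s=0.25)]
latency = total_transcoding_latency(two_stage, feature_extraction_s=0.5)
print(f"total latency: {latency:.2f} s")
```

Every additional stage adds one encode and one decode term, which is why the latency of measuring quality after the chain grows with M, whereas a prediction made at the source does not depend on running the chain.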

TQPM ARCHITECTURE
The TQPM architecture is shown in Figure 4 and comprises three steps: (1) input video segment characterization (Section 3.1), (2) transcoding model characterization (Section 3.2), and (3) video quality prediction (Section 3.2).
Selecting low-complexity features to characterize the input video segment is critical to utilize lightweight prediction models for quality prediction. High-complexity features would require heavier models (in terms of model size and inference time), contributing to prediction latency. Extracting the state-of-the-art Spatial Information (SI) and Temporal Information (TI) features is computationally intensive, and these features do not correlate well with the transcoded video quality [23]. This paper uses a lightweight and low-latency feature extraction from input video segments, as explained in Section 3.1. The extracted features, along with the encoding target bitrate representations for each stage of the transcoding process, i.e., $b_1, b_2, \ldots, b_M$, are employed to predict the visual quality, in terms of PSNR, SSIM, and VMAF, as discussed in Section 3.2.

Input Video Segment Characterization
Three DCT-energy-based features extracted by the Video Complexity Analyzer (VCA) [23] open-source software, i.e., (i) the average texture energy $E$, (ii) the average temporal energy $h$, and (iii) the average luminescence $L$, are used as the reduced reference for each video segment. These features are based on the luma channel of the video segment. Chroma channels are not considered in the proposed solution since the rate control of most of the state-of-the-art encoders does not consider them. Furthermore, VQA metrics like VMAF emphasize the luma channel more than the chroma channels.
The features are based on our previous work [23] and are included here to keep the paper self-contained. First, the texture of every non-overlapping block $k$ in each frame $p$ is calculated using Eq. (2):

$$H_{p,k} = \sum_{i=0}^{w-1} \sum_{j=0}^{w-1} e^{\left| \left( \frac{ij}{w^2} \right)^2 - 1 \right|} \, |DCT(i,j)| \quad (2)$$

where $w \times w$ pixels is the size of the block, and $DCT(i,j)$ is the $(i,j)^{th}$ DCT component when $i + j > 0$, and 0 otherwise [11]. The texture is averaged to determine the spatial energy feature per frame, i.e., $E_p$, as shown in Eq. (3):

$$E_p = \sum_{k=0}^{C-1} \frac{H_{p,k}}{C \cdot w^2} \quad (3)$$

where $C$ represents the number of blocks in the frame $p$ [22]. Furthermore, the block-wise sum of absolute difference (SAD) of the texture energy of each frame compared to its previous frame is computed and then averaged per frame to obtain the temporal energy feature per frame (i.e., $h_p$), as illustrated in Eq. (4):

$$h_p = \sum_{k=0}^{C-1} \frac{|H_{p,k} - H_{p-1,k}|}{C \cdot w^2} \quad (4)$$

The luminescence of the non-overlapping block $k$ of each frame $p$ is defined as:

$$L_{p,k} = \sqrt{DCT(0,0)} \quad (5)$$

where $DCT(0,0)$ is the DC component in the DCT calculation. Moreover, the block-wise luminescence is averaged per frame, denoted as $L_p$, as shown in Eq. (6):

$$L_p = \sum_{k=0}^{C-1} \frac{L_{p,k}}{C \cdot w^2} \quad (6)$$

The video segment is divided into $T$ chunks with a fixed number of frames (i.e., $f_c$) in each chunk. The averages of the $E$, $h$, and $L$ features of each chunk are computed to obtain the reduced reference representation of the input video segment, expressed as:

$$X = \{x_1, x_2, \ldots, x_T\}$$

where $x_i$ is the feature set of the $i^{th}$ chunk, represented as:

$$x_i = [E_i, h_i, L_i]$$
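The per-frame feature computation described above can be sketched in plain Python as follows. This is an illustrative sketch, not the VCA implementation: it assumes the exponentially weighted texture form used in the VCA work [23], an orthonormal DCT-II scaling (the scaling used by VCA is not stated here), and a temporal energy of zero for the first frame, which has no predecessor. All function names are hypothetical, and the naive DCT is written for readability rather than speed.

```python
import math
from typing import List, Optional, Sequence, Tuple

Plane = Sequence[Sequence[float]]  # luma samples, indexed [row][column]


def dct2(block: Plane) -> List[List[float]]:
    """Naive orthonormal 2-D DCT-II of a w x w block (O(w^4), for clarity only)."""
    w = len(block)
    scale = [math.sqrt((1.0 if u == 0 else 2.0) / w) for u in range(w)]
    basis = [[math.cos((2 * x + 1) * u * math.pi / (2 * w)) for x in range(w)]
             for u in range(w)]
    return [[scale[u] * scale[v] * sum(block[x][y] * basis[u][x] * basis[v][y]
                                       for x in range(w) for y in range(w))
             for v in range(w)] for u in range(w)]


def block_texture(coeffs: Plane, w: int) -> float:
    """Weighted sum of absolute AC coefficients; the DC term (i + j == 0) is skipped."""
    return sum(math.exp(abs((i * j / w ** 2) ** 2 - 1)) * abs(coeffs[i][j])
               for i in range(w) for j in range(w) if i + j > 0)


def block_maps(frame: Plane, w: int) -> Tuple[List[float], List[float]]:
    """Per-block texture and luminescence for all non-overlapping w x w blocks."""
    textures, lums = [], []
    for r in range(0, len(frame) - w + 1, w):
        for c in range(0, len(frame[0]) - w + 1, w):
            coeffs = dct2([row[c:c + w] for row in frame[r:r + w]])
            textures.append(block_texture(coeffs, w))
            lums.append(math.sqrt(abs(coeffs[0][0])))  # square root of the DC term
    return textures, lums


def frame_features(frame: Plane, prev_textures: Optional[List[float]], w: int
                   ) -> Tuple[float, float, float, List[float]]:
    """Return (E, h, L) of one frame plus its block textures for the next call."""
    textures, lums = block_maps(frame, w)
    norm = len(textures) * w * w  # number of blocks times block area
    energy = sum(textures) / norm
    lum = sum(lums) / norm
    # Assumption: the first frame has no predecessor, so its temporal energy is 0.
    temporal = 0.0 if prev_textures is None else sum(
        abs(a - b) for a, b in zip(textures, prev_textures)) / norm
    return energy, temporal, lum, textures


# Tiny synthetic example: a flat 8x8 frame followed by a checkerboard frame, w = 4.
flat = [[100.0] * 8 for _ in range(8)]
checker = [[100.0 + 20.0 * ((r + c) % 2) for c in range(8)] for r in range(8)]
E0, h0, L0, tex0 = frame_features(flat, None, w=4)
E1, h1, L1, _ = frame_features(checker, tex0, w=4)
print(f"flat:    E={E0:.3f} h={h0:.3f} L={L0:.3f}")
print(f"checker: E={E1:.3f} h={h1:.3f} L={L1:.3f}")
```

The flat frame has no AC energy, so its texture energy is numerically zero, while the checkerboard frame yields a positive texture energy and, relative to the flat frame, a positive temporal energy.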
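The chunk-level averaging can be sketched as below. This is a minimal illustration that assumes the per-frame texture energy, temporal energy, and luminescence values are already available as tuples; the function name and the synthetic values are hypothetical, and rejecting segments whose length is not a multiple of the chunk size is an assumption, since that case is not discussed here.

```python
from statistics import fmean
from typing import List, Sequence, Tuple

FrameFeatures = Tuple[float, float, float]  # (E, h, L) of one frame


def chunk_features(per_frame: Sequence[FrameFeatures],
                   frames_per_chunk: int) -> List[List[float]]:
    """Average E, h, and L over fixed-size chunks; returns one [E, h, L] row per chunk."""
    if frames_per_chunk <= 0:
        raise ValueError("frames_per_chunk must be positive")
    if len(per_frame) % frames_per_chunk != 0:
        # Assumption: partial chunks are rejected rather than padded or dropped.
        raise ValueError("segment length must be a multiple of the chunk size")
    rows = []
    for start in range(0, len(per_frame), frames_per_chunk):
        chunk = per_frame[start:start + frames_per_chunk]
        rows.append([fmean(frame[k] for frame in chunk) for k in range(3)])
    return rows


# Synthetic per-frame values for a 4 s segment at 30 fps (120 frames).
frames = [(float(p), 1.0, 2.0) for p in range(120)]
X = chunk_features(frames, frames_per_chunk=15)
print(len(X), X[0])
```

With 120 frames and 15 frames per chunk, the segment is represented by eight feature rows, one per chunk.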

Video Quality Prediction
For the sake of simplicity, the settings of the encoders in the M-stage transcoding process, except the target bitrate-resolution pair, are assumed identical [21]. The resolutions corresponding to the target bitrates in the bitrate ladder are also assumed to be fixed. Therefore, the transcoding model can be characterized as follows:

$$B = [b_1, b_2, \ldots, b_M]$$

where $b_i$ represents the target bitrate of the $\mathcal{E}_i$ encoder (cf. Fig. 2). Note that $B$ is appended to $x_i$, which is determined during the input video segment characterization phase, to obtain:

$$\tilde{x}_i = [E_i, h_i, L_i, b_1, b_2, \ldots, b_M]$$

The predicted quality $\hat{v}_{b_M|\ldots|b_1}$ can be presented as:

$$\hat{v}_{b_M|\ldots|b_1} = f(\tilde{X}), \quad \tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_T\}$$

LSTM models are typically used in time series prediction applications and can mitigate essential issues in long-term prediction, such as vanishing or exploding gradients [39]. Thus, an LSTM-based prediction model [12] is used in this work. The described features are input to the model [12] as a vector of dimension $[T \times (M+3)]$, where $T$ denotes the number of chunks in the video segment. More specifically, the feature sequences in the series $\tilde{X}$ are input to the LSTM model, which predicts the visual quality for the corresponding input video segment and chain of encoders in the transcoding process. The upper bound for the acceptable deviation from the ground truth quality is considered to be one Just Noticeable Difference (JND):

$$|\hat{v} - v| \leq 1 \text{ JND}$$

where $\hat{v}$ and $v$ are the predicted and the ground truth quality, respectively. In this paper, the average target JND is considered as six VMAF points, based on current industry practices.
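The construction of the model input and the JND acceptance check can be sketched as follows. This is a minimal illustration, not the model itself: the function names and example values are hypothetical, and the unit and scaling of the bitrates fed to the model are not specified here, so plain Mbps values are assumed.

```python
from typing import List, Sequence

JND_VMAF = 6.0  # one JND, taken as six VMAF points in the text


def build_model_input(chunks: Sequence[Sequence[float]],
                      target_bitrates: Sequence[float]) -> List[List[float]]:
    """Append the M stage bitrates to every chunk's [E, h, L] row: T x (M + 3)."""
    if not target_bitrates:
        raise ValueError("at least one target bitrate (M >= 1) is required")
    return [list(chunk) + list(target_bitrates) for chunk in chunks]


def within_one_jnd(predicted: float, ground_truth: float,
                   jnd: float = JND_VMAF) -> bool:
    """True if the prediction deviates from the ground truth by at most one JND."""
    return abs(predicted - ground_truth) <= jnd


# Made-up chunk features for T = 8 chunks and a two-stage chain (assumed Mbps units).
chunks = [[50.0, 10.0, 100.0] for _ in range(8)]
model_input = build_model_input(chunks, target_bitrates=[16.8, 5.8])
print(len(model_input), len(model_input[0]))
print(within_one_jnd(predicted=90.0, ground_truth=93.0))
```

For M=2 and T=8, the input is an 8 x 5 sequence: three content features and two stage bitrates per chunk. The bitrates are constant across chunks, while the content features vary along the sequence.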

EVALUATION
This section first explains the evaluation setup and then presents the experimental results.

Evaluation Setup
In this paper, video sequences from the JVET [2], MCML [4], SJTU [31], Berlin [3], UVG [26], and BVI [17] datasets are used. The sequences are encoded at 30 fps using x265 v3.5 with the ultrafast preset and the Video Buffering Verifier (VBV) rate control mode on a dual-processor server with Intel Xeon Gold 5218R (80 cores, frequency at 2.10 GHz). The segment length is set as four seconds. 80% of the five hundred videos considered are used as the training dataset, and the remaining 20% are used as the test dataset. The twelve bitrate representations considered in the experiments, which are used as the target bitrate of encoding in each transcoding stage ($b_i \; \forall i \in [1, M]$), are specified in the Apple HLS authoring specifications. The $E$, $h$, and $L$ features are extracted using the VCA v2.0 open-source video complexity analyzer [23] run in eight CPU threads, with $w$ (cf. Eq. 2) as 32. $f_c$ is set as 15, i.e., the video segment is divided into eight chunks ($T=8$).
Hyperparameter tuning is performed on the LSTM model to obtain the maximum prediction performance [39]. The number of LSTM cells is set to 50, and the model is trained for 100 epochs with a learning rate of $10^{-3}$ using the Adam optimizer [13]. The loss function used to train the LSTM model is the mean absolute error (MAE). The resulting and the predicted quality in terms of PSNR, SSIM, and VMAF [14] are compared for each test sequence for M=1 (single-stage) and M=2 (two-stage) transcoding. Since the content is assumed to be displayed in the highest resolution (i.e., 2160p), the transcoded content is scaled (bi-cubic) to 2160p resolution to determine the visual quality.
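For the two-stage case, each first-stage bitrate is followed by a lower second-stage bitrate from the same ladder. The sketch below enumerates such chains. It is only an illustration: the function name is hypothetical, and the example uses just the four bitrates named in the Figure 3 scenario rather than the full twelve-step ladder.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def two_stage_pairs(ladder_mbps: Sequence[float]) -> List[Tuple[float, float]]:
    """All (first-stage, second-stage) bitrate chains with a strictly lower second stage."""
    ordered = sorted(set(ladder_mbps), reverse=True)  # highest bitrate first
    return list(combinations(ordered, 2))


# Example subset of the ladder (bitrates from the Figure 3 scenario, in Mbps).
example_ladder = [2.4, 5.8, 11.6, 16.8]
pairs = two_stage_pairs(example_ladder)
for first in sorted(example_ladder, reverse=True):
    count = sum(1 for b1, _ in pairs if b1 == first)
    print(f"first stage {first} Mbps -> {count} second-stage targets")
```

The lowest representation never appears as a first stage because there is nothing lower to transcode to, and the number of available chains grows with the first-stage bitrate.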

Experimental Results
In the first experiment, TQPM's processing time (i.e., $\tau_p$) is compared to the total transcoding latency $\tau_T$ (cf. Eq. 1) in state-of-the-art RR-VQA approaches. The average $\tau_T$ for M=1 and M=2 are observed as 1.92s and 3.78s, respectively. The average time taken for feature extraction ($\tau_{fe}$) of a 4s segment is 0.323s. Furthermore, the average inference time of the LSTM model is 5 ms. Hence, the average processing time of TQPM for a 4s segment is 0.328s. Thus, TQPM's processing time is considerably lower than $\tau_T$ for both M=1 and M=2.
The second experiment assesses the correlation between the predicted and the actual quality scores for M=1 and M=2 transcoding. As illustrated in Figures 5a, 5b, and 5c (M=1) and Figures 5d, 5e, and 5f (M=2), there is a strong correlation between the predicted and the actual PSNR, SSIM, and VMAF scores, respectively (e.g., the average $R^2$ scores of VMAF prediction for single-stage and two-stage transcoding are 0.87 and 0.91, respectively). Furthermore, the prediction errors are less than the acceptable threshold of one JND (i.e., six VMAF points), which shows that TQPM works with sufficient accuracy.
In the final experiment, the prediction performance of TQPM for the $b_1$ representations considered in this paper is investigated using the Mean Absolute Error (MAE) for M=1 and M=2 transcoding. As shown in Table 1, the average MAE for VMAF prediction in M=1 and M=2 transcoding are 3.01 and 3.25, respectively. The results of M=2 correspond to the average visual quality prediction accuracy of transcoding from the $b_1$ bitrate representation to the possible lower bitrate representations in the bitrate ladder. Please note that since the lowest bitrate representation in the ladder cannot be transcoded to a lower one, no M=2 scenario exists in which $b_1$ equals the lowest representation. The $R^2$ scores for M=2 are observed to increase as $b_1$ increases. This is because there is a higher amount of training data (transcoding to lower bitrate representations) as $b_1$ increases.
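The two accuracy measures reported above can be computed as in the following sketch. It assumes the standard coefficient-of-determination definition of the $R^2$ score; the function names and the VMAF values are made up for illustration and are not results from this paper.

```python
from statistics import fmean
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute deviation between ground truth and prediction."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the ground truth is constant")
    return 1.0 - ss_res / ss_tot


# Made-up VMAF scores for four test segments.
vmaf_true = [60.0, 70.0, 80.0, 90.0]
vmaf_pred = [62.0, 69.0, 83.0, 88.0]
mae = mean_absolute_error(vmaf_true, vmaf_pred)
r2 = r2_score(vmaf_true, vmaf_pred)
print(f"MAE = {mae:.2f} VMAF points, R^2 = {r2:.3f}")
```

In this toy example the MAE of 2 VMAF points is below the six-point JND threshold, which is the same kind of comparison used to judge the reported prediction errors.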

CONCLUSIONS
This paper proposed TQPM, an online transcoding quality prediction model for video streaming applications. The proposed LSTM-based model uses DCT-energy-based features as the reduced reference to characterize the input video segment, which is used to predict the visual quality of an M-stage transcoding process. The performance of TQPM is validated on Apple HLS bitrate ladder encoding and transcoding using the x265 open-source HEVC encoder. On average, for single-stage transcoding, TQPM predicts PSNR, SSIM, and VMAF with an MAE of 1.31 dB, 1.19 dB, and 3.01, respectively. Furthermore, PSNR, SSIM, and VMAF are predicted for two-stage transcoding with an average MAE of 1.32 dB, 1.33 dB, and 3.25, respectively.
In this paper, only trans-sizing and trans-rating are considered as transcoding, i.e., the encoder/codec used for the bitrate ladder representations is assumed to be the same. In the future, transcoding between bitrate ladder representations of various codecs will be investigated. Another future direction is defining a decision-making component based on the proposed model in an end-to-end live streaming system.

Figure 2: M-stage transcoding model considered in this paper. Here, $\mathcal{E}_i$ and $\mathcal{D}_i$ represent the encoding and decoding in the $i^{th}$ stage of transcoding, while $b_i$ denotes the target bitrate of $\mathcal{E}_i$, where $i \in [1, M]$.

Figure 3: An example scenario of VQA in adaptive streaming applications. Clients A and B receive the highest bitrate representation of the bitrate ladder, encoded at the origin server (single-stage transcoding), while Clients C, D, and E receive lower bitrate representations transcoded at the edge server (two-stage transcoding).

Figure 4: TQPM architecture.

Table 1: Prediction accuracy of TQPM when M=1 and M=2, respectively, for the $b_1$ representations considered in this paper, encoded using the x265 HEVC encoder.