Video Quality Assessment with Texture Information Fusion for Streaming Applications

The rise in video streaming applications has increased the demand for video quality assessment (VQA). In 2016, Netflix introduced Video Multi-Method Assessment Fusion (VMAF), a full-reference VQA metric that strongly correlates with perceptual quality, but its computation is time-intensive. We propose a Discrete Cosine Transform (DCT)-energy-based VQA with texture information fusion (VQ-TIF) model for video streaming applications that determines the visual quality of the reconstructed video compared to the original video. VQ-TIF extracts Structural Similarity (SSIM) and spatiotemporal features of the frames from the original and reconstructed videos and fuses them using a long short-term memory (LSTM)-based model to estimate the visual quality. Experimental results show that VQ-TIF estimates the visual quality with a Pearson Correlation Coefficient (PCC) of 0.96 and a Mean Absolute Error (MAE) of 2.71, on average, compared to the ground truth VMAF scores. Additionally, VQ-TIF estimates the visual quality 9.14 times faster than the state-of-the-art VMAF implementation, along with an 89.44 % reduction in energy consumption, assuming an Ultra HD (2160p) display resolution.


INTRODUCTION
With the ever-increasing demand for high-definition video streaming services, the need for video quality assessment (VQA) is growing rapidly. VQA plays an essential role in video processing from capturing to rendering, including compression, transmission, restoration, and display [1]. With all the available encoding options and trade-offs to consider in HTTP Adaptive Streaming (HAS) [2], having a lightweight and reliable VQA method is crucial. According to the degree of information available for the reference video signals, VQA is classified into full reference (FR), reduced reference (RR), and no reference (NR) methods. NR-VQA methods are "blind": the original video content is not used for the quality assessment, which leads to an unreliable VQA [1]. The advantage of RR-VQA lies in its ability to evaluate video quality using limited information, making it more suitable for real-time VQA, especially in adaptive streaming or live broadcast scenarios [3,4]. As shown in Figure 1, RR-VQA models often focus on extracting specific features or subsets of data, such as selected frames, segments, or critical information related to video content. This approach allows RR-VQA to provide quality evaluations even when access to the entire reference is restricted or when performing evaluations in bandwidth-constrained environments [5]. However, the downside of RR-VQA is that it might not provide the same accuracy or granularity in the quality evaluation as FR-VQA due to the limited information it uses. On the other hand, FR-VQA can offer a more detailed comparison between the distorted video and the entire reference, leading to a more precise VQA.
Peak Signal to Noise Ratio (PSNR) continues to be the predominant industry benchmark for standardizing video codecs. PSNR is an effective method to generate a numeric value that compares an original input file and a coded output file. The limitations of PSNR are (i) its failure to account for the temporal nature of compression artifacts and (ii) the lack of correlation between PSNR improvements and subjective quality, particularly in the presence of camera noise [6,7]. Structural Similarity (SSIM) is another image quality metric, introduced in 2004 [8], that considers image degradation as a perceived change in structural information. It also incorporates critical perceptual phenomena, including both luminance masking and contrast masking terms. Video Multi-Method Assessment Fusion (VMAF) was explicitly formulated by Netflix to correlate strongly with subjective Mean Opinion Scores (MOS). Using machine learning techniques, a large sample of MOS was used as ground truth to train a quality estimation model. Among the state-of-the-art VQA metrics, VMAF achieves the highest correlation with the Difference Mean Opinion Score (DMOS). However, VMAF is computationally intensive, and its time complexity is very high compared to the PSNR and SSIM metrics [9]. We estimate visual quality to support low-latency VQA in video streaming applications. The expected computation time should be comparable to PSNR and SSIM, with the highest possible accuracy compared to the VMAF score. The proposed VQA method is expected to replace the state-of-the-art VMAF computation in streaming applications.

Contributions:
The key contributions of this paper are as follows: (1) We propose a fast machine learning-based RR-VQA model using texture information fusion (VQ-TIF), which is implemented in real time to determine visual quality.
The paper has six sections. Section 2 analyzes state-of-the-art VQA works. Section 3 reviews the related work. We elaborate on the details of the VQ-TIF model in Section 4. Section 5 explains the evaluation setup and experimental results before concluding the paper in Section 6.

STATE-OF-THE-ART VQA ANALYSIS
Peak Signal-to-Noise Ratio (PSNR) is a conventional quality metric used in video quality assessment due to its simplicity and ease of computation [6,7]. Its primary advantage lies in its straightforwardness and popularity as a benchmark for measuring compression performance, which helps to compare different encoding methods or qualities. However, in the context of adaptive streaming, PSNR has several drawbacks. It often does not correlate well with perceived visual quality, especially at lower bitrates or in scenarios with complex video content [10]. Moreover, PSNR fails to capture perceptual differences and remains insensitive to human visual perception. This renders it unsuitable for evaluating artifacts or distortions that may be visually noticeable but are not adequately represented in PSNR values. In addition, its performance diminishes when evaluating video quality affected by compression artifacts or when the video undergoes format or resolution changes. Consequently, in adaptive streaming scenarios where subjective user experience is paramount, relying solely on PSNR can lead to suboptimal quality decisions and hinder delivering an optimal quality of experience (QoE) to viewers [11].
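To make the "ease of computation" concrete, the following is a minimal sketch of the standard PSNR definition for a single 8-bit frame. The function name `psnr` and the flat-list frame representation are assumptions made for illustration; production tools operate on full image planes.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """PSNR in dB between two equally sized frames given as flat lists of pixel values."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("frames must be non-empty and of equal size")
    # Mean squared error over co-located pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    # Logarithmic ratio of the peak signal power to the error power.
    return 10.0 * math.log10(peak ** 2 / mse)
```

Because the score depends only on the pixel-wise error, two distortions with equal MSE receive the same PSNR regardless of how visible they are, which is the limitation discussed above.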
Structural similarity (SSIM) considers structural information, mimicking human visual perception more closely than PSNR [8]. It correlates better with perceived quality changes, particularly in compression, noise, or distortion scenarios [12,13]. This makes SSIM more effective in capturing subtle variations in video quality that can impact viewer experience, particularly at lower bitrates. SSIM also has limitations in adaptive streaming contexts. It can be sensitive to specific distortions and may not consistently reflect human perception across all video content [10]. Moreover, its performance can vary depending on the complexity of the content and the types of distortions present in the video. It also does not always align with subjective assessments and may not accurately represent the visual quality perceived by viewers. As a result, while SSIM provides more nuanced insights than PSNR, it is not by itself a comprehensive metric for assessing video quality in adaptive streaming settings, where a holistic evaluation considering multiple metrics and subjective perception is crucial.
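For illustration, the sketch below evaluates the SSIM formula once over an entire frame. This is a simplification: SSIM as usually deployed is computed over local windows and averaged, so the function `global_ssim` and its single-window evaluation are assumptions made to keep the example short. The constants follow the commonly used defaults of 0.01 and 0.03 times the peak value.

```python
def global_ssim(x, y, peak=255.0):
    """Single-window SSIM between two equally sized frames given as flat lists."""
    if len(x) != len(y) or not x:
        raise ValueError("frames must be non-empty and of equal size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * peak) ** 2  # stabilizes the luminance term
    c2 = (0.03 * peak) ** 2  # stabilizes the contrast/structure term
    luminance = (2 * mu_x * mu_y + c1) / (mu_x * mu_x + mu_y * mu_y + c1)
    structure = (2 * cov + c2) / (var_x + var_y + c2)
    return luminance * structure
```

Identical frames score 1, and the score drops as the means, variances, or covariance of the two frames diverge.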
Video Multi-Method Assessment Fusion (VMAF) is a full-reference, perceptual video quality metric that aims to approximate human perception of video quality. This metric is focused on quality degradation due to compression and rescaling. VMAF estimates the perceived quality score by computing scores from multiple quality assessment algorithms and fusing them with a support vector machine (SVM). In contrast to PSNR and SSIM metrics, which do not take temporal information into account [13], three image fidelity metrics and one temporal signal have been chosen as features of the SVM: (i) Anti-noise SNR, (ii) Detail Loss Measure, (iii) Visual Information Fidelity, and (iv) Mean Co-Located Pixel Difference (MCPD). An essential feature is the MCPD of a frame to the previous frame (i.e., the temporal component). A VMAF score is more straightforward to understand because it operates in a linear range of 0 to 100, whereas PSNR and SSIM are logarithmic. It considers scaling and compression artifacts and has a model trained for mobile video consumption [10].
Table 1 shows the Pearson correlation of the PSNR, SSIM, and VMAF quality metrics analyzed for a thousand video sequences from the Inter4K Dataset [14]. These sequences were encoded at Ultra High Definition (2160p) resolution using the x264 AVC encoder, employing the ultrafast preset and constant rate factor (CRF). CRF values ranging between 1 and 51 are used in the analysis. The correlation of the VMAF score with the PSNR and SSIM scores is 0.83 and 0.88, respectively. The correlation can also be observed graphically in Figure 2, which demonstrates the rate-distortion (RD) curves of selected video sequences from the UVG [15], MCML [16], and SJTU [17] datasets, based on their spatio-temporal complexity, where distortion is measured using PSNR, SSIM, and VMAF.

RELATED WORK
RR-VQA approaches are categorized into three primary types based on the nature of the features they utilize: (i) pixel-based, (ii) frequency-based, and (iii) bitstream-based methods.
Pixel-based RR-VQA: This approach involves extracting and analyzing spatial or temporal pixel-level information from distorted and reference video frames. It typically involves comparing pixel-wise differences or using metrics derived from pixel values, such as MSE or SSIM. Pixel-based methods directly assess visual discrepancies at the pixel level, often considering factors such as luminance, color, and spatial arrangement [18,19].
Frequency-based RR-VQA: This method is based on the analysis of video content in the frequency domain, mainly using information derived from transformations such as the discrete cosine transform (DCT) [20] or the discrete wavelet transform (DWT) [21]. By examining the frequency components and their differences between the reference and distorted video, this scheme captures variations in specific frequency bands or coefficients, providing insights into how the signal's frequency distribution influences perceptual quality.
Bitstream-based RR-VQA: In this approach, the video bitstream data is analyzed to extract features without decoding the video content. It involves inspecting parameters or metadata within the compressed bitstream, including coding information, motion vectors, quantization parameters, or syntax elements [22]. By evaluating these aspects, bitstream-based methods aim to infer quality differences without requiring full access to the original video content.
Other than the aforementioned types of RR-VQA, our previous work [23] introduced a reduced reference transcoding quality prediction model (TQPM) to determine the VMAF of the video possibly transcoded in multiple stages.The quality is predicted using video complexity features (i.e., the video's brightness, spatial texture information, and temporal activity) and the target bitrate representation of each transcoding stage.
Since PSNR remains the de facto industry standard for video quality evaluation, many RR-VQA methods are developed to estimate it [24,25]. Furthermore, there are methods that predict the SSIM [26,27], Spatiotemporal RR Entropic Differences (STRRED) [28], and Spatial RR Entropic Differences (SRRED) [29] metrics. However, the described metrics have limitations, such as neglecting the temporal nature of compression artifacts [30]. Each RR-VQA category has advantages and limitations based on the type of features extracted and the complexities involved in assessing perceptual video quality. The choice of method often depends on the specific application, available resources, and the extent to which the method aligns with the perceptual attributes most relevant to the evaluation task. Although RR-VQA methods are implemented to estimate PSNR and SSIM, no work in the literature determines VMAF, even though VMAF yields a VQA closer to the visual perception of the human visual system (HVS) than its counterparts. However, it is significantly more time-consuming than other metrics.

VQ-TIF MODEL
The architecture of the proposed VQ-TIF-based VMAF estimation is illustrated in Figure 3. Since the correlation between SSIM and VMAF is very high (as observed in Table 1), and the computation time of SSIM is significantly lower than that of VMAF, we select SSIM as a main feature to estimate VMAF. In this architecture, the input video segment is divided into chunks, and a VQ-TIF score is estimated for each chunk. The VQ-TIF scores obtained for each chunk are averaged as the VMAF for the reconstructed video segment.

Texture Information Extraction
An intuitive feature extraction method would be to utilize Convolutional Neural Networks (CNNs) [31]. However, such models have several inherent disadvantages, such as longer training time, longer inference time, and higher storage requirements, making them impractical for streaming scenarios. Although CNN-based approaches could produce rich features, simpler models that yield significant prediction performance are more suitable for video streaming applications.
The popular state-of-the-art video complexity features are Spatial Information (SI) and Temporal Information (TI) [32]. However, the correlation of the SI and TI features with encoding output features such as bitrate and encoding time is very low, which is insufficient for encoding parameter prediction in streaming applications [33]. In this paper, three DCT-energy-based features [34], the average luma texture energy $E_Y$, the average gradient of the luma texture energy $h$, and the average luminescence $L_Y$, are used as the texture information measures [11,33].
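As a rough illustration of what a DCT-energy-based texture measure computes, the sketch below derives three per-frame values from block-wise DCTs. It is a simplified stand-in and not the VCA implementation: the helper names, the unweighted sum of absolute AC coefficients as "texture energy", the block-wise energy difference to the previous frame as "gradient", and the DC coefficient as a brightness proxy are all assumptions made for brevity.

```python
import math


def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    w = len(block)

    def alpha(k):
        return math.sqrt(1.0 / w) if k == 0 else math.sqrt(2.0 / w)

    basis = [[alpha(k) * math.cos(math.pi * (2 * n + 1) * k / (2 * w))
              for n in range(w)] for k in range(w)]
    # Separable transform: basis * block, then multiply by basis transposed.
    tmp = [[sum(basis[u][x] * block[x][y] for x in range(w))
            for y in range(w)] for u in range(w)]
    return [[sum(tmp[u][y] * basis[v][y] for y in range(w))
             for v in range(w)] for u in range(w)]


def block_features(block):
    """Return (AC energy, DC magnitude) of one luma block (illustrative definitions)."""
    coeffs = dct2(block)
    dc = abs(coeffs[0][0])
    ac_energy = sum(abs(c) for row in coeffs for c in row) - dc
    return ac_energy, dc


def frame_features(blocks, prev_energies=None):
    """Average the block values of one frame into (energy, gradient, brightness)."""
    energies, dcs = zip(*(block_features(b) for b in blocks))
    n = len(blocks)
    energy = sum(energies) / n
    brightness = sum(dcs) / n
    # Temporal gradient: block-wise energy change relative to the previous frame.
    gradient = 0.0 if prev_energies is None else (
        sum(abs(e - p) for e, p in zip(energies, prev_energies)) / n)
    return energy, gradient, brightness, list(energies)
```

A flat block yields zero texture energy, while a high-contrast pattern yields a large value; the gradient becomes non-zero only when the block energies change between consecutive frames.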

Texture Information Fusion
The texture information fusion step of VQ-TIF is accomplished using the following steps:

Residual computation.
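Residual features are formed by subtracting the original video texture information features from the reconstructed video features. This difference is known as the error or residual feature, expressed as:

$$r_i = \hat{x}_i - x_i,$$

where $x_i$ and $\hat{x}_i$ are the feature sets of the $i$-th frame of the original and reconstructed video chunks, respectively, and $i \in [1, f_c]$, with $f_c$ the number of frames per chunk. The residual features usually have low information entropy, as the original and reconstructed video frames are similar. The entropy increases with increased distortion introduced in the reconstructed video.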
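The residual step amounts to a per-frame, per-feature subtraction. A minimal sketch follows, assuming each frame's features are stored as a tuple of three numbers (texture energy, its gradient, and brightness); the function name and data layout are illustrative choices, not part of the described implementation.

```python
def residual_features(original, reconstructed):
    """Per-frame residuals: reconstructed features minus original features."""
    if len(original) != len(reconstructed):
        raise ValueError("both chunks must contain the same number of frames")
    return [tuple(r - o for o, r in zip(x, x_hat))
            for x, x_hat in zip(original, reconstructed)]


# Example: one frame whose texture energy drops after lossy encoding.
original_chunk = [(50.0, 10.0, 120.0)]
reconstructed_chunk = [(45.0, 9.0, 119.0)]
print(residual_features(original_chunk, reconstructed_chunk))
# [(-5.0, -1.0, -1.0)]
```

For a lightly compressed video the residuals stay close to zero, which is consistent with the low-entropy observation above.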

Fusion.
The fusion of the texture information features is established using a long short-term memory (LSTM) model. LSTM is selected because it is designed for processing sequential data, making it suitable for combining information from temporally adjacent frames in a video. An advantage of LSTM models is their better handling of long-term dependencies in long sequences. Each chunk's feature averages are considered separate data points in the model training process. Therefore, the input data consists of the residuals of the spatially pooled luma texture information features extracted per frame of the video chunk. Moreover, the frame-wise SSIM values, denoted by $S = \{s_1, s_2, \ldots, s_{f_c}\}$ with $f_c$ the number of frames per chunk, are appended to the residual features. The prediction model is a function of the residual features of the frames and the SSIM values in a chunk, as shown in Eq. 6. This approach can fuse feature information from temporally adjacent frames to estimate visual quality. The estimated VQ-TIF score per chunk, $\hat{v}$, can be presented as:

$$\hat{v} = f(\tilde{x}), \tag{6}$$

where $\tilde{x}$ denotes the chunk input, i.e., the residual features and the SSIM value of each of the $f_c$ frames. The VQ-TIF score of the reconstructed video segment is the average of the $\hat{v}$ values estimated for every chunk.
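The data flow of the fusion step can be sketched as follows: each frame contributes three residual features plus its SSIM value, a chunk-level predictor maps that input to one score, and the chunk scores are averaged. The `predict` callable below is a hypothetical placeholder for the trained LSTM model, and the toy predictor in the usage example exists only to make the sketch runnable.

```python
def build_chunk_input(residuals, ssim_values):
    """Append each frame's SSIM value to its residual features (one row of 4 per frame)."""
    if len(residuals) != len(ssim_values):
        raise ValueError("need one SSIM value per frame")
    return [list(r) + [s] for r, s in zip(residuals, ssim_values)]


def segment_score(chunk_inputs, predict):
    """Average the per-chunk scores returned by `predict` over the segment."""
    scores = [predict(x) for x in chunk_inputs]
    return sum(scores) / len(scores)


# Toy usage with a stand-in predictor (NOT the trained model).
chunk = build_chunk_input([(0.0, 0.0, 0.0), (-1.0, 0.5, 0.0)], [0.9, 0.8])
toy_predict = lambda x: 100.0 * sum(row[3] for row in x) / len(x)
print(len(chunk), len(chunk[0]))          # 2 4
print(segment_score([chunk, chunk], toy_predict))
```

Keeping the predictor behind a plain callable makes it straightforward to swap in a trained model without touching the chunking or averaging logic.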

VQ-TIF Implementation
The $E_Y$, $h$, and $L_Y$ features of the original and reconstructed video segments are extracted using the VCA v2.0 open-source video complexity analyzer [33]. The LSTM model is implemented using the Keras [35] machine learning framework. LSTMs can address the vanishing gradient problem, allowing them to retain information over extended intervals. The input shape of the network is [$f_c \times 4$], where $f_c$ is the number of frames per chunk. The network architecture implemented in this paper has two layers. The first layer consists of 200 LSTM cells, which serve as the memory hub of the network, enabling the model to discern and remember long-range dependencies in input sequences. A dense layer follows, as the target variable consists of a single value. The dual-layer composition enhances the model's capacity to handle intricate sequential data while ensuring robust performance in regression-oriented tasks. The loss function and optimizer are MAE and the Adam optimizer [36], respectively. Hyperparameter tuning was performed on the learning rate, batch size, number of cells, and number of layers to arrive at the lowest MAE score.
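A network with this shape can be sketched in Keras as shown below. This is an assumed reconstruction from the description (one 200-cell LSTM layer, one single-output dense layer, MAE loss, Adam optimizer), not the authors' code; the tuned learning rate and batch size are not stated, so Keras defaults are used. The helper `lstm_regressor_param_count` applies the standard LSTM parameter formula to show the size of such a model.

```python
def lstm_regressor_param_count(units=200, features=4):
    """Trainable parameters of one LSTM layer followed by a single-output dense layer."""
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    lstm = 4 * (units * (units + features) + units)
    dense = units + 1  # weights plus bias for the single regression output
    return lstm + dense


def build_vq_tif_like_model(frames_per_chunk=8, features=4, units=200):
    """Assumed sketch of the described two-layer network (requires TensorFlow/Keras)."""
    from tensorflow import keras  # imported lazily so the helper above needs no TensorFlow

    model = keras.Sequential([
        keras.Input(shape=(frames_per_chunk, features)),
        keras.layers.LSTM(units),   # sequence in, last hidden state out
        keras.layers.Dense(1),      # single quality score per chunk
    ])
    # Default Adam settings; the tuned hyperparameters are not given in the text.
    model.compile(optimizer="adam", loss="mae")
    return model


print(lstm_regressor_param_count())  # 164201
```

With roughly 164 thousand parameters, inference on an eight-frame chunk is light, which is in line with the small fusion time reported in the evaluation.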

EVALUATION
This section introduces the test methodology used in this paper and then discusses the experimental results.

Evaluation Setup
We use 75 % of the thousand UHD video sequences of the Inter4K Dataset [14] as the training dataset. The Pearson correlation coefficient (PCC) and Mean Absolute Error (MAE) scores are analyzed between the VQ-TIF scores and the ground truth VMAF quality scores [37]. In addition, $\tau_T$ and $E_T$, i.e., the total time taken and the energy consumed to compute the quality metrics, are evaluated. We used the codecarbon software to measure the energy consumption [38].

Relevance of features
The importance of the input features to the LSTM-based fusion model is analyzed using the univariate approach. All the other feature values are set to zero, the MAE is computed, and this is subtracted from the MAE of the model with all features intact, which gives a measure of the decrease in accuracy (i.e., increase in error) when that feature is removed from the model [39]. Subsequently, the absolute value of the decrease in accuracy is computed and normalized to obtain the importance score, where higher absolute scores indicate more critical features. The information factor is calculated by applying Min-Max normalization to the average MAE scores. The importance of the features is visualized in Figure 4a.
It is observed that the SSIM feature contributes the most to the VQ-TIF estimation, followed by the $E_Y$, $h$, and $L_Y$ features.

Accuracy
We evaluated the precision of VQ-TIF using the Pearson Correlation Coefficient (PCC) [40] between the VMAF and VQ-TIF scores. A strong correlation between the scores is observed. The average PCC of the VQ-TIF scores to the VMAF scores in the evaluation dataset is 0.96, while the MAE is 2.71. The maximum deviation between the VQ-TIF and VMAF scores is 20.23 points. The accuracy can be improved further by reducing the number of frames per chunk $f_c$, at the cost of increased inference time.
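The two accuracy measures used here are straightforward to compute from paired score lists. The sketch below shows the standard definitions; the function names and the sample values in the usage lines are illustrative and are not results from the paper.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_b)


def mean_absolute_error(predicted, truth):
    """Average absolute difference between predicted and ground-truth scores."""
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


# Illustrative values only.
estimated = [90.0, 80.0, 65.0]
reference = [92.0, 77.0, 66.0]
print(mean_absolute_error(estimated, reference))  # 2.0
print(pearson(estimated, reference))
```

PCC captures how well the estimated scores track the ranking and spread of the reference scores, while MAE reports the typical error in VMAF points; reporting both guards against a model that is well correlated but biased.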

Processing time and energy
We observe the computation time and energy consumed for texture information extraction in VQ-TIF as 0.67 s (i.e., 179.10 fps) and 17.18 µJ, respectively (cf. Figure 4c and 4d). The time taken and energy consumed for the SSIM computation are 0.85 s and 22.01 µJ, respectively. The time taken and the energy consumed for texture information fusion are 0.07 s and 1.83 µJ, respectively. Therefore, the total processing time, $\tau_T$, is 1.59 s (i.e., 75.47 fps), while the processing time for the state-of-the-art VMAF computation is 14.52 s (i.e., 8.26 fps). The computation speed of VQ-TIF is 9.14 times higher than that of the state-of-the-art VMAF evaluation. Furthermore, in terms of total energy consumption, VQ-TIF saves 89.44 % compared to the state-of-the-art VMAF implementation.
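The reported totals can be cross-checked from the per-stage figures. The sketch below assumes 120 frames per sequence, which is implied by the pairing of 0.67 s with 179.10 fps. Using the rounded stage times, the speedup comes out at about 9.13, in line with the reported 9.14 up to rounding of the individual stage times.

```python
# Per-stage processing times in seconds, as reported in the text.
stage_seconds = {"texture_extraction": 0.67, "ssim": 0.85, "fusion": 0.07}
vmaf_seconds = 14.52
frames = 120  # assumption: implied by 0.67 s <-> 179.10 fps

total_seconds = sum(stage_seconds.values())
vq_tif_fps = frames / total_seconds
vmaf_fps = frames / vmaf_seconds
speedup = vmaf_seconds / total_seconds

print(f"total time: {total_seconds:.2f} s")                        # 1.59 s
print(f"VQ-TIF: {vq_tif_fps:.2f} fps, VMAF: {vmaf_fps:.2f} fps")    # 75.47 fps, 8.26 fps
print(f"speedup: {speedup:.2f}x")                                   # 9.13x
```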

CONCLUSIONS
We proposed VQ-TIF, a fast and accurate reduced-reference video quality assessment (RR-VQA) method based on texture information fusion. VQ-TIF includes DCT-energy-based video complexity feature extraction, where features representing luma texture and temporal activity are extracted from the original and reconstructed video segments. The extracted texture information is fused using an LSTM-based model to determine the VQ-TIF score. It is observed that VQ-TIF is determined 9.14 times faster than the state-of-the-art implementation of VMAF for Ultra HD (2160p) videos, consuming 89.44 % less energy. At the same time, VQ-TIF scores yield a PCC of 0.96 and an MAE of 2.71 compared to the VMAF scores.
The evaluation of the proposed VQ-TIF model is limited to standard dynamic range (SDR) content. The evaluations on high dynamic range (HDR) content and the associated optimizations are subject to future work. Furthermore, the VQ-TIF model can be extended to determine visual quality at multiple resolutions, including 8K (4320p). In addition, various signal distortions may be considered during the model's training to broaden the application scope.

Figure 1: The structure of state-of-the-art RR-VQA methods utilized, especially within streaming video coding systems.

Figure 3: VQA for a video segment using VQ-TIF model envisioned in this paper.

Spatial pooling.
The video segments are divided into chunks with a fixed number of frames (i.e., $f_c$). The averages of the $E_Y$, $h$, and $L_Y$ features of each frame in the chunk are calculated to obtain the spatially pooled representation of the chunk, expressed as $X = \{x_1, x_2, \ldots, x_{f_c}\}$ and $\hat{X} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_{f_c}\}$, where $x_i$ and $\hat{x}_i$ are the $i$-th frame feature sets associated with the original and reconstructed video chunks, respectively.
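The chunking step can be sketched as a simple fixed-length split of the per-frame feature list. The function name is illustrative, and dropping an incomplete trailing chunk is an assumption, since the text does not say how leftover frames are handled.

```python
def split_into_chunks(frame_features, frames_per_chunk=8):
    """Split per-frame feature sets into consecutive chunks of fixed length.

    An incomplete trailing chunk is dropped (an assumption for this sketch).
    """
    if frames_per_chunk <= 0:
        raise ValueError("frames_per_chunk must be positive")
    usable = len(frame_features) - len(frame_features) % frames_per_chunk
    return [frame_features[i:i + frames_per_chunk]
            for i in range(0, usable, frames_per_chunk)]


# 120 frames with 8 frames per chunk yield 15 chunks.
frames = [(float(i), 0.0, 0.0) for i in range(120)]
chunks = split_into_chunks(frames, frames_per_chunk=8)
print(len(chunks), len(chunks[0]))  # 15 8
```

Applying the same split to the original and reconstructed feature lists keeps the two chunk sequences frame-aligned for the subsequent comparison.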
A further 5 % of the sequences are set as the validation set, and the remaining 20 % are set as the test dataset. All experiments are run on a system with an Intel i7-11370H processor and 16 GB RAM. All video sequences are encoded using the x264 AVC encoder with CRF values between 1 and 51 to induce different quality distortions, and the corresponding VMAF is evaluated. Each segment comprises eight frames, i.e., $f_c = 8$. Hence, a video sequence is divided into 15 segments. The original and reconstructed video segments' luma texture features are extracted with the VCA v2.0 open-source video complexity analyzer running with eight CPU threads. Furthermore, the feature extraction processes for the original and reconstructed videos run concurrently, with four CPU threads for each process. The ground truth VMAF scores are computed using the model released by Netflix [37].