EVCA: Enhanced Video Complexity Analyzer

The optimization of video compression and streaming workflows critically relies on understanding the video complexity, including both spatial and temporal features. These features play a vital role in guiding rate control, predicting video encoding parameters (such as resolution and frame rate), and selecting test videos for subjective analysis. Traditional methods primarily utilize Spatial Information (SI) and Temporal Information (TI) to measure these spatial and temporal complexity features, respectively. Moreover, the Video Complexity Analyzer (VCA) has been introduced as a tool employing Discrete Cosine Transform (DCT)-based functions, namely E and h, to evaluate the spatial and temporal complexity features, respectively. In this paper, we introduce the Enhanced Video Complexity Analyzer (EVCA), an advanced tool that integrates the functionalities of both VCA and the SITI approach. Developed in Python to ensure compatibility with GPU processing, EVCA enhances the definition of temporal complexity originally used in VCA. This refinement significantly improves the detection of temporal complexity features in VCA (i.e., h), raising its Pearson Correlation Coefficient (PCC) from 0.6 to 0.77. Furthermore, EVCA demonstrates exceptional performance on Graphics Processing Unit (GPU) devices, achieving feature extraction speeds exceeding 1200 fps for 1080p resolution videos. GitHub: https://github.com/cd-athena/EVCA


INTRODUCTION
In recent years, there has been a noteworthy and continuous surge in the consumption of online video content, marking a transformative shift in global media consumption trends [1]. This rise in online video viewership not only reflects changes in consumer preferences but also underscores the pressing need for advanced content delivery optimization techniques. The exponential growth in video data traffic over the internet mandates the development of more sophisticated optimization strategies [2, 3]. These strategies are crucial to enhancing user experience by minimizing buffering, ensuring high-quality streaming, and maintaining the efficiency and sustainability [4] of network infrastructures. This paper focuses on exploring and developing innovative approaches to address the challenges posed by the escalating demand for online video content, with a specific emphasis on optimizing video compression and streaming workflows.
Video compression heavily relies on the intrinsic spatial and temporal intricacies within the video content. Spatial complexity refers to the amount of detail and variation within each frame: the higher the spatial complexity of a frame, the greater the bitrate required to compress it. Conversely, temporal complexity relates to changes in video frames over time; frames with more motion and changes require a higher bitrate for compression. In video compression, intra-frames (I-frames) are encoded without relying on any previous frames, containing complete information to represent a standalone image. Therefore, I-frames depend solely on the spatial complexity of a frame. On the other hand, predictive frames (P-frames) store only the changes from the previous frame, enabling more efficient compression by transmitting only temporal information. Consequently, P-frames heavily rely on the temporal complexity of a frame with regard to the reference frame(s) [5, 6].
Rate control algorithms [7] in video compression meticulously manage the distribution of bits among different frames to optimize overall compression efficiency without sacrificing perceptual quality. These controllers dynamically adjust bitrate allocation based on an analysis of the spatial and temporal intricacies within the video content. Achieving a target bitrate is content-dependent and significantly relies on the video's spatial and temporal complexity, forming the foundational elements for designing an effective rate controller, a critical component in video compression algorithms. As an example, Zhao et al. [8] enhance the rate control performance of Versatile Video Coding (VVC) [9] by considering spatial and temporal complexity features. Menon et al. [10] leverage spatial and temporal complexity features to predict the optimal Constant Rate Factor (CRF) for a given target bitrate. This approach enhances the compression efficiency of video coding, particularly tailored for live video streaming.
Video streaming, often relying on HTTP Adaptive Streaming (HAS) [11], is a primary application for video compression. In HAS, identical content is encoded at various bitrates, referred to as representations. This enables dynamic bitrate adjustments based on the viewer's network conditions and device capabilities, ensuring an optimal viewing experience. Consequently, the significance of rate control in this context grows more pronounced, highlighting the imperative to attain accurate bitrates for each representation. This precision, facilitated by effective rate control, plays a crucial role in empowering Adaptive Bitrate (ABR) algorithms to reduce stalls and increase overall Quality of Experience (QoE) [12].
The representations collectively shape what is commonly referred to as a bitrate ladder. Typically presented as bitrate-resolution pairs, a bitrate ladder provides recommendations for the specific bitrate-resolution pairs at which the content should be encoded. While a conventional approach might involve a fixed or one-size-fits-all bitrate ladder, such as the HTTP Live Streaming (HLS) bitrate ladders recommended by Apple, recent advancements, such as per-title encoding approaches, have demonstrated significant improvements. These approaches dynamically construct the bitrate ladder based on the content itself. Unlike a static bitrate ladder, which may not optimally cater to the nuances of diverse content, a dynamically constructed ladder ensures that each representation aligns more closely with the specific complexities and characteristics of the given content [2, 13-17].
In per-title encoding methodologies, the pivotal factors revolve around the spatial and temporal intricacies of video content. Katsenou et al. [3] employed features related to spatial and temporal complexity to determine the most suitable resolutions for each bitrate. They achieve this by predicting transitions between optimal resolutions. Similarly, Menon et al. [13] leverage spatial and temporal complexity features to predict the optimal resolution for each bitrate, placing particular emphasis on live video streaming. In addition to resolution, these features were also utilized to predict the optimal frame rate [18].
Task scheduling in video encoding optimizes the process by employing advanced algorithms to assess video complexity and allocate encoding resources accordingly. This dynamic resource allocation ensures that simpler videos receive resources with lower power and cost, while more complex ones are provided with optimal computational capabilities. In "Where to Encode" [19], the performance of various Amazon EC2 instances for video encoding has been analyzed across different encoding parameters. To predict the encoding time complexity of videos, various approaches have incorporated spatial and temporal complexity features [20-23].
The application of classifying video content for subjective testing represents another scenario where the spatial and temporal complexity of video content comes into play. Subjective testing, although valuable, is known to be resource-intensive and costly. Given the impracticality of employing a large number of video contents, a pragmatic approach involves clustering videos based on their complexity features. Subsequently, a representative subset is selected from each cluster to streamline the preparation and presentation of video content in subjective tests. This method not only optimizes resource utilization but also ensures a more manageable and efficient subjective testing process [24-26].
Recognizing the crucial role of evaluating the spatial and temporal complexity of videos, this paper introduces an open-source software, the Enhanced Video Complexity Analyzer (EVCA). Implemented in Python for compatibility with a wide range of Graphics Processing Unit (GPU) devices, EVCA incorporates current spatial and temporal complexity features, specifically SITI and the Video Complexity Analyzer (VCA). Additionally, it presents a refined definition of temporal complexity, enhancing the prediction of temporal intricacies. Furthermore, EVCA facilitates the extraction of per-block spatial and temporal complexity, empowering users with detailed insights at a granular level. It also provides the capability to plot the per-block features, offering a visual representation of the spatial and temporal intricacies inherent in the analyzed videos.

VIDEO COMPLEXITY FEATURES
In this section, our objective is to delineate the features characterizing the spatial and temporal complexity of a video. We present three distinct definitions, namely (i) ground truth, (ii) SITI, and (iii) VCA.

Ground truth
In the context of video compression, the ground truth spatial complexity (SC_GT) is quantified as the bit count necessary to encode a video frame as an I-frame under a fixed Quantization Parameter (QP). The ground truth temporal complexity (TC_GT), on the other hand, is characterized by the bit count essential for encoding a frame as a P-frame with respect to its reference I-frame, with both frames encoded under the same fixed QP [5, 6]. This distinction captures the nuanced considerations of both the spatial and temporal aspects of the compression process, providing a comprehensive perspective. Clearly, frames with higher SC_GT and TC_GT demand a greater number of bits for encoding under a constant QP.
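As an illustration of how these ground-truth values can be measured in practice, one can encode the sequence twice at the same fixed QP, once all-intra (every frame an I-frame, whose per-frame bits reflect SC_GT) and once with I/P frames only (whose P-frame bits reflect TC_GT), then read off the per-frame bit counts. The sketch below only builds hypothetical ffmpeg/libx264 command lines; the paper does not prescribe a specific encoder or flags, so these are assumptions:

```python
def gt_qp_commands(src: str, qp: int = 27):
    """Hypothetical ffmpeg/libx264 commands for ground-truth complexity:
    an all-intra pass (per-frame bits ~ SC_GT) and an I/P-only pass
    (P-frame bits ~ TC_GT), both at the same fixed QP."""
    intra = ["ffmpeg", "-i", src, "-c:v", "libx264", "-qp", str(qp),
             "-g", "1",          # GOP size 1: every frame is an I-frame
             "-f", "null", "-"]
    inter = ["ffmpeg", "-i", src, "-c:v", "libx264", "-qp", str(qp),
             "-bf", "0",         # no B-frames: I- and P-frames only
             "-f", "null", "-"]
    return intra, inter
```

The per-frame bit counts can then be parsed from the encoder log or a bitstream analyzer.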

SITI
Spatial Information (SI) and Temporal Information (TI) are the predominant features used to express the spatial and temporal intricacies of video sequences. Notably, ITU-T P.910 [27] incorporates the SI and TI features for the purpose of selecting test scenes in subjective testing scenarios. These features assume a pivotal role in assessing the degree of impairment experienced during the transmission of video over a fixed-rate digital transmission service channel.
SI is derived by applying a Sobel filter to each video frame F_n at a specific time point n. The Sobel-filtered frame then undergoes further processing to compute the standard deviation across all pixels (std_space), and SI is the maximum of this value over time:

SI = max_time { std_space[ Sobel(F_n) ] }.

The calculation of TI incorporates the motion difference feature M_n(i, j), representing the disparity in pixel values of the luminance plane at the same spatial location but across consecutive frames:

M_n(i, j) = F_n(i, j) - F_{n-1}(i, j),

where F_n(i, j) denotes the pixel located at the i-th row and j-th column of the n-th frame. TI is then determined as the maximum over time of the standard deviation over space of M_n(i, j):

TI = max_time { std_space[ M_n(i, j) ] }.

Elevated TI values signify increased motion across successive frames.
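Under these definitions, the per-frame SI and TI values reduce to a few lines of NumPy/SciPy. This sketch operates on a list of 2-D luma arrays; the function name and I/O conventions are ours, not EVCA's API:

```python
import numpy as np
from scipy.ndimage import sobel

def si_ti(frames):
    """Per-frame SI/TI values following ITU-T P.910.
    frames: iterable of 2-D luma arrays. Returns (si_list, ti_list);
    the scene-level SI/TI are the maxima over these lists."""
    si, ti = [], []
    prev = None
    for f in frames:
        f = f.astype(np.float64)
        gx, gy = sobel(f, axis=1), sobel(f, axis=0)
        si.append(float(np.std(np.hypot(gx, gy))))  # std over space of Sobel magnitude
        if prev is not None:
            ti.append(float(np.std(f - prev)))      # std over space of frame difference
        prev = f
    return si, ti
```

Note that TI is undefined for the first frame, so `ti` has one entry fewer than `si`.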

VCA
Recently, the Video Complexity Analyzer (VCA) [28] has been introduced as an open-source software tool capable of extracting spatial and temporal complexity features. VCA achieves this through the use of a Discrete Cosine Transform (DCT)-based energy function derived from [29]. The DCT-based energy function for block k in frame n is defined as follows:

E(k, n) = sum_{i=0}^{w-1} sum_{j=0}^{h-1} e^{|(ij / (wh))^2 - 1|} * |DCT_{k,n}(i, j)|,

where E(k, n) represents the energy of block k in frame n, w and h denote the width and height of the block, and DCT_{k,n}(i, j) corresponds to the (i, j)-th DCT component of block k in frame n; only components with i + j > 2 are included, and the rest are set to 0. This energy function serves to map texture from a multi-dimensional frequency space into a one-dimensional energy space. It assigns exponentially higher costs to higher DCT frequencies, as these are anticipated to arise from a mixture of objects. Notably, the DC value is treated separately due to its color dependence and negligible impact on texture.
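The per-block energy can be sketched as follows. The exponential weight and the i + j <= 2 exclusion follow the description above, while the DCT normalization (`norm='ortho'`) and the vectorized layout are our assumptions, not VCA's exact implementation:

```python
import numpy as np
from scipy.fft import dctn

def block_energy(block):
    """DCT-based texture energy of one block: exponentially weighted sum of
    absolute DCT magnitudes, with DC and lowest frequencies (i + j <= 2)
    excluded, as described in the text."""
    h, w = block.shape
    coeffs = np.abs(dctn(block.astype(np.float64), norm='ortho'))
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    weight = np.exp(np.abs((i * j / (w * h)) ** 2 - 1.0))
    weight[i + j <= 2] = 0.0     # DC and lowest frequencies treated separately
    return float((weight * coeffs).sum())
```

A flat block thus has zero energy (all of its DCT mass sits in the excluded DC term), while textured blocks score higher.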
In VCA, the spatial complexity of each frame, denoted as E, is computed by averaging the energy function over all blocks in a frame, as follows:

E(n) = (1 / (C * w^2)) * sum_{k=1}^{C} E(k, n),

where C denotes the total number of blocks per frame, and w^2 represents the size of each block.
For the temporal complexity, the energy E(k, n) of each block in each frame is compared to the energy of the co-located block in the previous frame using the Sum of Absolute Differences (SAD) measure. The average SAD over a frame then yields the temporal complexity feature h:

h(n) = (1 / (C * w^2)) * sum_{k=1}^{C} |E(k, n) - E(k, n-1)|.
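Given per-block energies, the frame-level E and h are simple normalized sums. This helper (our naming, assuming square w x w blocks) mirrors the two definitions above:

```python
import numpy as np

def frame_E_h(energies_now, energies_prev, w):
    """Frame-level spatial (E) and temporal (h) features from per-block
    energies E(k, n), normalized by block count C and block size w^2.
    energies_prev is None for the first frame, where h is undefined."""
    e_now = np.asarray(energies_now, dtype=np.float64)
    C = e_now.size
    E = float(e_now.sum() / (C * w * w))
    h = None
    if energies_prev is not None:
        e_prev = np.asarray(energies_prev, dtype=np.float64)
        h = float(np.abs(e_now - e_prev).sum() / (C * w * w))
    return E, h
```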

EVCA
In VCA, the temporal complexity h for each frame is measured as the SAD of the block energies E(k, n) with respect to their co-located blocks in the previous frame, using Eq. 5. For EVCA, we modify the definition of temporal complexity for frame n to the SAD of the weighted DCT values themselves, which we term TC, as follows:

TC(n) = (1 / (C * w^2)) * sum_{k=1}^{C} sum_{i=0}^{w-1} sum_{j=0}^{h-1} e^{|(ij / (wh))^2 - 1|} * |DCT_{k,n}(i, j) - DCT_{k,n-1}(i, j)|,

where the absolute differences are taken per weighted DCT coefficient (again only for i + j > 2) and only then aggregated, rather than between the aggregated block energies. The experimental results show that TC achieves a higher correlation with TC_GT compared to h, as will be presented in Section 4. The spatial complexity (SC) of EVCA is set to be the same as E in VCA, i.e., SC(n) = E(n). For all of the above-mentioned spatial-temporal feature extraction methods, we also define the temporal complexity with respect to the two previous frames, in addition to with respect to the previous frame. We name them TI_{-2}, h_{-2}, and TC_{-2} for the SITI, VCA, and EVCA methods, respectively.
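The difference between h and TC is easiest to see in code: h compares scalar per-block energies, while TC applies the SAD per weighted DCT coefficient and only then aggregates. A sketch of TC, with the same assumed weighting and normalization as above rather than EVCA's exact implementation:

```python
import numpy as np
from scipy.fft import dctn

def evca_tc(blocks_now, blocks_prev):
    """TC for one frame: coefficient-wise SAD of exponentially weighted DCT
    values between co-located blocks, with i + j <= 2 excluded, averaged
    over all blocks and normalized by block size."""
    C = len(blocks_now)
    h, w = blocks_now[0].shape
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    weight = np.exp(np.abs((i * j / (w * h)) ** 2 - 1.0))
    weight[i + j <= 2] = 0.0
    total = 0.0
    for b_now, b_prev in zip(blocks_now, blocks_prev):
        d_now = dctn(b_now.astype(np.float64), norm='ortho')
        d_prev = dctn(b_prev.astype(np.float64), norm='ortho')
        total += (weight * np.abs(d_now - d_prev)).sum()  # per-coefficient SAD
    return float(total / (C * w * h))
```

One consequence of excluding the DC term is that a uniform brightness change between frames contributes (almost) nothing to TC, whereas genuine texture motion does.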

COMMAND LINE PARAMETERS FOR EVCA
In this section, we present our Python-based open-source software. It includes three primary spatial-temporal complexity feature extraction methods: SITI, VCA, and EVCA. The specific method can be selected using the command line parameter --method.

SITI
When --method is set to SITI, SI and TI are calculated for each frame, and the results are stored in a CSV file. By default, the Sobel filter is used to extract edges from images.

VCA
When --method is set to VCA, E and h are calculated for each frame, and the results are stored in a CSV file. By default, the DCT transformation is used to transform blocks from the spatial domain to the frequency domain. Additional command line parameters include the block size, which can be controlled using --block_size. By setting --block_info, the per-block features are also stored in a separate CSV file, at the cost of slower performance.

PERFORMANCE
In this section, we evaluate the performance of our proposed software and present some features and results. Figure 1 shows examples of the third frames of two video sequences selected from the JVET CTC [30] and UVG [31] datasets, namely BasketballDrive and Beauty. Their spatial and temporal complexity features, computed with SITI, VCA, and EVCA, are shown in Figure 2 and Figure 3, respectively.
To evaluate the performance of our implementation of SITI, VCA, and EVCA, we executed it on 1080p video sequences comprising 240 frames (equivalent to a 4-second 60 fps video) on an NVIDIA Quadro RTX A6000 GPU. The results are summarized in Table 1. It is observed that all methods achieve a performance exceeding 500 fps for a single batch. We also assessed the performance of the feature extraction component of each method and observed that all of them achieve more than 1200 fps in feature extraction. This suggests that the overall performance is bottlenecked by the frame reading part, which is a subject for future optimization efforts.
We then calculated the SITI, VCA, and EVCA features for the source videos. The scatter plot between SC_GT and the spatial complexity features, after averaging across all frames, is shown in Figure 4.
Finally, we computed the Pearson correlation coefficient (PCC) and Spearman correlation coefficient (SRCC) among these features and present the results in Table 2 and Table 3 for spatial and temporal complexity features, respectively. It is observed that VCA and EVCA, which use the same spatial complexity feature, significantly outperform SITI in spatial complexity correlation. For temporal complexity, SITI and VCA achieve similar performance, but EVCA achieves a noticeably higher correlation, raising the PCC from 0.6 to 0.77; the corresponding scatter plot between TC_GT and the temporal complexity features is shown in Figure 5.

CONCLUSION
In conclusion, we have introduced the Enhanced Video Complexity Analyzer (EVCA), an open-source software designed to enhance the analysis of spatial and temporal complexity features in videos. While traditional methods rely on Spatial Information (SI) and Temporal Information (TI), the Video Complexity Analyzer (VCA) utilizes Discrete Cosine Transform (DCT)-based functions to assess the spatial and temporal complexity features E and h, respectively. EVCA integrates functionalities from both approaches and is developed in Python, ensuring compatibility with GPU processing and contributing to its efficiency. Notably, our method refines the definition of temporal complexity originally used in VCA, significantly improving the detection of temporal features (h) and elevating its Pearson Correlation Coefficient (PCC) from 0.6 to 0.77. Furthermore, EVCA exhibits outstanding performance on GPU devices, achieving feature extraction speeds of over 1200 fps for videos with 1080p resolution. Further enhancements remain possible as future work. The integration of these advancements positions EVCA as a powerful and versatile tool for video complexity analysis, offering enhanced accuracy and processing speeds for researchers and practitioners alike.

Figure 1: The third frame of two uncompressed videos: (a) BasketballDrive, (b) Beauty.

Figure 2: Spatial and temporal complexity features for the third frame of BasketballDrive.

Figure 3: Spatial and temporal complexity features for the third frame of Beauty.

Figure 4: Scatter plot between SC_GT and the spatial complexity features.

Figure 5: Scatter plot between TC_GT and the temporal complexity features.

Table 1: Performance of the proposed open-source software for different methods in terms of frames per second.

Table 2: Performance of spatial complexity features.

Table 3: Performance of temporal complexity features.