Review and evaluation of VVC fast partitioning search methods using a common baseline

An important new tool in Versatile Video Coding (VVC) is its extended block partitioning scheme, which enables recursive block subdivision with five different split modes. This flexibility contributes substantially to VVC's compression performance, but also increases encoder complexity. Many methods have been proposed in the literature to speed up the partitioning search in VVC encoders, ranging from simple heuristics to complex approaches based on neural networks. This paper reviews this literature and compares it to a simple complexity reduction technique, namely the progressive reduction of the available split depths. For each reviewed paper, a working point is found that roughly matches either the reported runtime reduction or the bitrate increase. Our proposed method produces a variety of working points whose convex hull encloses the operating points of many approaches discussed in the literature. Only half of the reviewed works slightly outperform the envelope of the proposed working points. This review and performance evaluation study puts known VVC encoder optimization research into perspective relative to our proposed common baseline method, which may serve as a benchmark for future work in this area.


Introduction
The Versatile Video Coding standard (H.266 / VVC) [1] was finalized in July 2020 by the Joint Video Experts Team (JVET) of ITU-T and ISO/IEC SC29/WG11. Official verification tests [2] have shown that the standard delivers about 50% bitrate reduction over its predecessor, High Efficiency Video Coding (H.265 / HEVC), at the same subjective quality. All modern video codecs, including VVC and HEVC, are based on a block-based hybrid video coding scheme: the input video is subdivided into blocks, on which prediction, transformation, and residual coding are applied.
In VVC, the block subdivision consists of two stages. In the first stage, the frame is subdivided into square blocks of equal size (either 128×128, 64×64, or 32×32 luma samples), called coding tree units (CTUs). This block grid is used to define the high-level picture partitioning into tiles, slices, and sub-pictures. The CTUs can be further subdivided into smaller rectangular blocks, so-called coding units (CUs), on which prediction and residual coding are applied. One of the main new features in VVC is the partitioning scheme governing this subdivision, the quad-tree with nested multi-type tree (QTMTT) [3], which allows recursive block subdivision using one of five predefined splits. With QTMTT, a CTU is first split using the quad-tree (QT). Each leaf of the QT can be further split using the recursive multi-type tree (MTT) scheme, which requires not only a split decision, but also a choice among four split types: vertical or horizontal, each either binary or ternary. This increase in the encoder search space, caused by the additional degree of freedom in the partitioning, poses a challenge for the design of encoding algorithms. While QTMTT alone provides approximately 10% objective bitrate reduction of the VVC test model (VTM) reference software over the HEVC test model (HM), it is also one of the reasons for the eightfold runtime increase of the former over the latter [4].
An exhaustive encoder search would apply each mode and measure the resulting distortion D as well as the number of bits required for signaling (rate R). The optimal mode is usually chosen by means of coding cost J minimization using the method of Lagrange multipliers [5]:

min J, where J = D + λR. (1)

Before we go into details, the common experimental setup is described in Section 2. In Section 3, we review VVC partitioning complexity, propose a simple search space reduction approach, and compare it with state-of-the-art QTMTT complexity reduction approaches for inter-picture prediction. In Section 4, the results are discussed. Section 5 concludes the paper.
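For illustration, the mode decision of Eq. (1) can be sketched as follows; the mode names and the distortion and rate numbers are hypothetical examples, not values taken from VTM.

```python
# Minimal sketch of Lagrangian rate-distortion mode decision, Eq. (1):
# for each candidate mode the encoder measures distortion D and rate R,
# then selects the mode minimizing J = D + lambda * R.

def best_mode(candidates, lam):
    """candidates: list of (mode_name, distortion, rate_bits) tuples."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])

# Hypothetical measurements for one block:
modes = [
    ("no_split",   1200.0,  40),   # cheap to signal, high distortion
    ("quad_split",  400.0, 180),   # expensive to signal, low distortion
    ("bt_hor",      700.0,  90),
]

# A small lambda favors low distortion, a large lambda favors low rate.
print(best_mode(modes, lam=1.0)[0])   # -> quad_split
print(best_mode(modes, lam=20.0)[0])  # -> no_split
```

The same trade-off governs the partitioning search: deeper splits reduce distortion but cost additional signaling bits, and λ controls where on that trade-off the encoder operates.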

Experimental Setup
The presented coding efficiency is measured in terms of Bjøntegaard Delta rates (BDR) [6], and run times are measured as an indicator of complexity. Only the luma BDR is reported, for consistency with the reviewed literature. The reported numbers are averaged over the set of test sequences in the random-access encoder configuration as defined in the JVET common test conditions (CTC) [7]. The results in this paper are based on multiple versions of the VVC reference software model, VTM, to match the versions used in the reviewed literature. In Section 3.4, extensive results for VTM versions 10.2 and 14.0 are presented. Partial results for other versions are shown in Section 3.5.
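As a rough illustration of the BDR metric [6], the commonly used polynomial-interpolation variant can be sketched as follows; this is a simplified stand-in for illustration, not the exact tool used to produce the numbers reported in this paper.

```python
# Sketch of the Bjontegaard Delta rate: both rate-distortion curves are
# interpolated as cubic polynomials of log10(rate) over PSNR, and the
# average horizontal gap between them over the overlapping PSNR range
# is converted to a percentage rate difference.
import numpy as np

def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    la, lt = np.log10(rates_anchor), np.log10(rates_test)
    pa = np.polyfit(psnr_anchor, la, 3)  # log-rate as cubic in PSNR
    pt = np.polyfit(psnr_test, lt, 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlap interval
    hi = min(max(psnr_anchor), max(psnr_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (it - ia) / (hi - lo)        # mean log10 rate difference
    return (10 ** avg_diff - 1) * 100.0     # percent rate change

# Identical curves give 0% BDR; a uniformly 10% costlier curve ~ +10%.
r = [1000, 2000, 4000, 8000]  # hypothetical bitrates (kbps)
p = [32.0, 35.0, 38.0, 41.0]  # hypothetical luma PSNR values (dB)
print(bd_rate(r, p, r, p))                     # -> 0.0
print(round(bd_rate(r, p, [x * 1.1 for x in r], p), 2))  # -> 10.0
```

A positive BDR thus means the tested encoder needs more bits than the anchor for the same quality, which is the "loss" reported throughout this paper.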
All experiments with VTM have been performed on a homogeneous cluster with Intel Xeon E5-2697A v4 CPUs.

Partitioning complexity analysis
An exemplary partitioning of a CTU and its split tree are shown in Figure 1. The partitioning contains all types of splits available in VVC: quad, binary, and ternary splits. In the partitioning, the different depths of the MTT splits are marked in color. The MTT split depth represents the recursion level of the MTT split process, as can be seen in the split tree representation in Figure 1(b). The number of MTT split recursion levels for inter frames can be limited in the VVC high-level syntax using the sps_max_mtt_hierarchy_depth_inter_slice syntax element for the whole sequence, or ph_max_mtt_hierarchy_depth_inter_slice for individual frames, reducing the number of encoding options the encoder is allowed to choose from, as will be discussed further. In all tested VTM versions, the default configuration restricts the MTT depth to three recursive splits, as defined in the common test conditions [7].
A theoretical search space analysis of the VVC partitioning and mode decision complexity has been proposed in [8], evaluating both the empirical partitioning search overhead and the theoretical upper bound under certain constraints. It has been shown that the number of possible partitioning options per sample in inter frames, if no early termination strategies are applied and a top-down incremental search is used, increases from 28 in HEVC as configured in HM to over 220 in VVC as configured in VTM, constituting an almost eightfold increase. The measured number of actually evaluated options per sample, when applying the default search algorithm including the early termination conditions implemented in both reference encoders, increases from 15 in HM to 29 in VTM. This shows that the VTM encoding algorithm is much more aggressive with regard to early termination decisions than HM.

Literature review
The fast partitioning search algorithm used in VTM has been described in [9], roughly confirming the observations in [8]. Its sevenfold speedup over a full search minimizing (1) comes at a cost of around 1% BD-rate increase. Many other partitioning search algorithms further improving on this baseline have been proposed in the literature. In this paper, we review the ones applicable to inter coding.
Previous works on fast VVC partitioning include both statistical-analysis-based methods [10][11][17] and machine learning (ML) based methods using convolutional neural networks (CNNs) [12][13][16], trained classifiers [14], and decision trees [15]. The study of Tang et al. introduces an early-skip partitioning pruning method for inter coding that uses conventional methods to detect the motion in the current block [10]. Kuang et al. introduce a statistical-analysis-based fast CU partitioning algorithm utilizing a history of information from previously checked partitions [11]. Pan et al. proposed a multi-information fusion CNN model that treats the termination of CU partitioning as a binary classification, using as input the current block, the residual, and the bi-directional motion field [12]. The investigation of Tissier et al. [13] summarized and compared the above proposals from Tang et al. [10] and Pan et al. [12], bringing them to the common VTM-10.2 software base. Furthermore, [13] introduced a two-stage ML method, which takes advantage of the inter prediction to predict the split probabilities through a CNN, similar to the work by Amestoy et al. [16]. The work of Shen et al. presents a classification framework that decomposes the multi-decision problem into multiple independent single-classification methods to predict the cost (1) of each partitioning mode, thus allowing early termination [14]. Kulupana et al. propose the use of a decision tree with a tunable threshold to achieve flexible complexity reduction [15]. In [17], we proposed a set of heuristic rules to reduce encoding time at lower-complexity partitioning configurations. The method was initially proposed for an alternative VVC software encoder, but for the purpose of this paper we ported it to VTM-14.0.

Partitioning depth reduction
All the previously mentioned state-of-the-art approaches compare the proposed method with the encoding algorithm of VTM. The possibility of reducing the search space itself, by reducing the maximal possible number of recursive MTT splits (i.e., the maximal MTT depth) down from 3 as configured in VTM, has only been considered in [16] and [17]. In [8], it has been shown that reducing this number from 3 to 2 reduces the search space size from over 220 to 75, i.e., threefold. In [16] and [17], it was shown that such a reduction results in an almost twofold speedup at the cost of around 1.5% BD-rate loss.
Based on those observations and findings from [19], we evaluate the working points available when gradually reducing the maximal MTT depth. We propose as a common baseline a simple but effective method, which relies on limiting the partitioning complexity by means of the appropriate VVC high-level picture header syntax elements. Specifically, to enable per-picture maximum depth configuration, the sps_partition_constraints_override_enabled_flag syntax element and, for each adapted picture, the ph_partition_constraints_override_flag are set to 1. The adapted maximal depth is then signaled per picture using the ph_max_mtt_hierarchy_depth_inter_slice syntax element.
Fine-tuning of the admissible search space for MTT partitioning can be achieved by limiting the maximum MTT hierarchy depth for each temporal layer separately in random-access encoding with hierarchical groups of pictures (GOPs). In the VTM random-access configuration, pictures that belong to a temporal layer can only reference pictures in lower temporal layers. In addition, the quantization parameter increases with the temporal layer, leading to fewer bits being spent on less referenced pictures [20]. In VTM-10.2, five temporal layers are present (L0-L4); in VTM-14, one more layer is present (L5) due to the increased GOP size. Only a minor change to the VTM configuration is required to enable per-temporal-layer instead of global maximum MTT depth configuration. No change to the actual search algorithm is proposed. The resulting working points are shown in Figure 2. The working points for which all TLs have the same maximum MTT depth are labeled (MTT1, 2, 3). The results are shown for two different VTM versions to provide an appropriate comparison point for each state-of-the-art method. The highest-to-lowest TL order in which the maximum MTT depth is reduced is motivated by the fact that higher temporal layers are usually encoded at lower fidelity, thus only requiring a coarser partitioning.
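The highest-to-lowest TL reduction order can be sketched with a small schedule generator; the layer counts follow the paper (five TLs in VTM-10.2), but the function and its names are our illustration, not part of VTM.

```python
# Hedged sketch of the proposed per-temporal-layer working points: the
# maximum MTT depth is reduced one temporal layer at a time, starting
# from the highest layer, producing a dense sequence of working points
# between the labeled MTT3, MTT2, and MTT1 configurations.

def depth_schedules(num_layers, max_depth=3, min_depth=1):
    """Yield per-TL max-MTT-depth vectors, from full depth (MTT3)
    down to minimum depth everywhere (MTT1), reducing the highest
    temporal layer first."""
    depths = [max_depth] * num_layers
    yield tuple(depths)
    for target in range(max_depth - 1, min_depth - 1, -1):
        for tl in range(num_layers - 1, -1, -1):  # highest TL first
            depths[tl] = target
            yield tuple(depths)

schedule = list(depth_schedules(num_layers=5))  # five TLs, as in VTM-10.2
print(schedule[0])   # (3, 3, 3, 3, 3) -> the MTT3 working point
print(schedule[5])   # (2, 2, 2, 2, 2) -> the MTT2 working point
print(schedule[-1])  # (1, 1, 1, 1, 1) -> the MTT1 working point
```

With five layers this yields eleven configurations, which explains the dense spacing of the working points along the runtime/BDR trade-off in Figure 2.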
The proposed method constitutes a highly tunable partitioning complexity reduction method, owing to the dense spacing of its working points. It can be used to easily find reduced-complexity VVC encoding configurations without adapting the search algorithm. For the sake of simplicity, the method does not concern QT partitioning or intra frames.
The advantage of the proposed method is that the MTT depth can be configured at the picture level with minimal additional signaling. For comparison, Figure 2 also shows the working points available when restricting the encoder search to reduced MTT depths (dashed lines; only MTT3 and MTT2 shown, MTT1 out of range), without making use of the aforementioned VVC high-level syntax. Such an approach requires sending many obsolete "don't split" flags, whose signaling is otherwise implicit when the maximal allowed MTT depth is reached. It can be observed that this obsolete signaling reduces the efficiency of such an approach, while still providing the speedups.

Comparison with the state of the art
In addition to the working points for VTM, Figure 2 also includes the available working points described in the literature [10]-[16]. Only [11][12][13] (results for [12] reported in [13] for VTM-10.2) use the exact CTC conditions and can thus be directly compared to our results. Other literature results are marked as "Various" to give an estimate of their performance compared to the proposed search space reduction. In Figure 2, the numbers reported in the papers are presented, regardless of the used sequences or VTM version. The working points are discussed in more detail in Section 3.5. Further, for each reviewed paper and working point, a working point was found that roughly matches either the average runtime or the luma BDR loss. Results are presented in Section 3.5.
For the studies using the exact CTC conditions [11][12][13][17], it can be observed that only the methods proposed in [11] and [17] outperform the complexity reduction method described in Section 3.3 (also see Table II). The results in [11] provide either more speedup compared to the VTM-14.0 working points with similar loss, or less loss compared to the VTM-14.0 MTT2 working point (with MTT depth set to 2 for all inter frames) at only slightly higher runtime. Method [17] is only favorable for lower-depth partitioning, confirming our previous conclusion in that paper. When applied with an MTT depth of 2, it achieves a runtime very comparable to VTM-14.0 with the MTT depth restricted to 1, but with a loss of only 2.67% BDR, compared to 3.51% achieved by the unchanged VTM-14.0, showing that the proposed changes to the search algorithm are beneficial at reduced partitioning depths.
None of the results presented in [12] and [13] extend the Pareto front of the proposed VTM-10.2 partitioning working points.

Comparison with non-CTC-based studies
In this section, a detailed comparison with the literature for which the used sequences differ from the JVET CTC [7] is presented. Table I shows which sequences used in [10][14][15][16] are common with the JVET CTC. Using those common sequences and the same VTM version as in the respective papers, we find a VTM working point based on the gradual MTT depth reduction approach described in Section 3.3, matching either the speedup or the loss. The results are shown in Table II.
The results reported in [13] for [10] can be improved by the proposed method, which provides more speedup at a lower BDR loss. The matched VTM working point has a lower runtime of 63.7%, versus the 69% achieved in [10], at only 1.1% BDR loss compared to the 1.3% achieved in the paper. Shen et al. [14] reported achieving a twofold speedup of VTM-7.0 at around 1.75% luma BDR loss. By restricting the MTT depth of the last temporal layer (L4) to 1 and of the other temporal layers (L0-3) to 2, a working point was achieved with VTM-7.0 with a very similar loss of 1.79% luma BDR increase, requiring 55.4% of the runtime of the original VTM configuration. Thus, it can be concluded that [14] provides around 6% additional speedup over the proposed method.
Kulupana et al. [15] propose three working points based on VTM-8.0. The slowest working point is comparable with one of our proposed working points, showing slightly more speedup at a slightly higher loss. For the medium working point, with 1.48% luma BDR loss at 58% runtime, VTM-8.0 running with an MTT depth of 2 in all inter frames is a good comparison, providing both lower runtime and lower loss, at 55.4% runtime and 1.23% BDR loss. For the fastest point, VTM-8.0 running with an MTT depth of 1 in all inter frames provides a similar runtime of just under 37%, while showing only 3.0% luma BDR loss, compared to the 3.9% achieved in [15]. It can be concluded that the proposed method roughly matches the slow option proposed in [15] and outperforms the two other working points.

Amestoy et al. [16] already compare their results to the working points available in VTM when reducing the MTT depth, similar to [17]. For this reason, it is not surprising that the two results using our approach for VTM-5.0 with similar runtimes show slightly more loss. For the slow option proposed in [16], a working point was found with VTM-5.0 providing 1% less runtime, at 70.2%, but increasing the luma BDR loss from the 0.6% reported in the paper to 0.8%. The fast option proposed in [16] is based on a VTM configuration with the MTT depth reduced to 2 for all inter slices, providing a runtime of 39.7% and 2.4% luma BDR loss. With VTM-5.0, by reducing the MTT depth to 1 for all inter slices, a similar runtime of 38.8% can be achieved, but at a higher BDR loss of 3.3%.

Discussion
In [21], a full option-space optimization of a software encoder derived from VTM was performed. It was observed that increasing the maximum MTT depth from 2 to 3, as used in VTM, is the least beneficial option with regard to the gain provided per runtime increase, also when applied in VTM. Considering the results shown in Figure 2 and the comparison in Section 3.5, a question can be posed: how big is the potential for partitioning optimization beyond simple search space reduction? Some studies do provide additional gain compared to the proposed simple partitioning complexity reduction approach, notably [11][14][16] and [17].
Furthermore, one issue in the comparisons needs to be addressed. Adapting the decision making in the encoder search process does not allow the advantage of reduced signaling overhead, as the encoder might still need to use the full partitioning depth. On the other hand, if the partitioning depth is restricted in the high-level bitstream or frame configuration as discussed in Section 3.3, the encoder cannot use deeper partitioning, even where it would be advantageous. It has to be acknowledged, though, that some of the state-of-the-art methods that do not outperform the proposed solution overall do perform better than simply skipping the MTT partitioning search at specific depths (indicated by the dashed curves in Figure 2), e.g. the fast and slow results from [13].

Conclusion
We proposed a convex hull of partitioning options as a benchmark for early termination algorithms for VVC, and especially VTM, and compared it to methods and results from the literature. We argue that this simple method represents one of the best approaches available for VVC partitioning complexity reduction, being only slightly outperformed by much more sophisticated approaches, if at all. The method should be used to benchmark future work in VVC partitioning complexity reduction. With its tunable and simple design, the method can also be easily used to flexibly define alternative working points for VVC encoding, similar to [21][22].
This review and performance evaluation study intends to increase the understanding of the partitioning scheme of VVC and the VTM, and to put known VVC encoder optimization research into perspective relative to our proposed common baseline method, which may serve as a benchmark for future work in this area.

Figure 1: Exemplary partitioning of a CTU into CUs using the QTMTT scheme of VVC (a), and the split tree representing the structure (b). QT splits are marked in black, while MTT splits are marked in color: blue representing MTT split depth 1, red depth 2, and green depth 3, respectively. The CUs are enumerated in (a) and (b) for easier comparison.

Figure 2: Operating points available in VTM-10 and VTM-14 through the reduction of the MTT partitioning depth. Partitioning complexity reduction results from the literature are also shown. Results marked as "Various" use non-CTC test conditions and cannot be easily compared to the rest of the presented results (they are discussed in more detail in Section 3.5); the plot presents rather rough approximations of the "Various" methods' performances.
Table II: Working points from the literature matched with the fine-tuned MTT depth reduction technique, using the matching JVET CTC sequences and VTM versions.

Table I: Detailed configurations used in the non-CTC studies.