Inter-Frame Parallelization in an Open Optimized VVC Encoder

The Versatile Video Coding (VVC) standard promises high compression efficiency for diverse content types. Based on VVenC, an open and optimized VVC software video encoder, this work presents an inter-frame parallelization (IFP) method designed to exploit the processing power of modern platforms featuring a high number of computing cores. When encoding ultra-high-definition video on a 32-core machine with VVenC's faster preset, the proposed method achieves a more than 20% increase in encoder speed at only a 1% decrease in compression efficiency compared to the default multi-threading mode. Compared to single-threaded execution, this corresponds to a speedup factor of 18, up from the 15x achievable with the previous parallelization scheme. Furthermore, the synergy of the developed inter-frame parallelization technique with other parallelization methods is explored, including tiles and VVC wavefront parallel processing (WPP). The combination of these approaches enables a notable speedup factor of 21, albeit with a trade-off in coding efficiency. With a focus on VVC, this research contributes to the ongoing discourse on video coding optimization, providing valuable insights into possible pitfalls and the potential gains achievable through efficient parallelization techniques on high-core platforms.


INTRODUCTION
The VVC standard [1] [2], which succeeded the HEVC standard [3], was developed by the collaboration of ITU-T and ISO/IEC and was concluded in July 2020. It is capable of decreasing the bitrate by up to 45% compared to HEVC while maintaining the same visual quality. 4K High Dynamic Range (HDR) and immersive video can achieve even higher savings of about 50% [4]. VVC targets various application scenarios, such as high-resolution video, screen content coding, adaptive streaming, and 360-degree video for virtual reality.
Along with the advancement of compression algorithms and their associated increase in complexity, hardware development is also progressing, with manufacturers introducing CPUs with ever more cores. Software encoders can therefore reduce run-time through multi-threaded parallelization. However, the use of systems with a higher number of cores necessitates careful consideration of the encoder's internal processes.
One optimized software encoder implementation of VVC is VVenC [5]. Based on the JVET reference software VTM [6], it includes a substantial amount of algorithmic improvements, such as SIMD optimizations and different approaches to task-based parallelization. The encoder interface is highly configurable, allowing users to select from various complexity-efficiency tradeoffs known as presets. For instance, in the low-complexity preset faster, version 1.9.1 of VVenC achieves a speedup factor of 346x over VTM version 19.2 and of 54x over HM [7], with a coding gain of 13.1% combined YUV PSNR-BD-Rate (BDR) [8] over the latter. VVenC offers an open software model with regular maintenance, allowing researchers and enthusiasts to develop their own algorithms.
Our research is based on VVenC, and the improvements achieved are compared to the original VVenC parallelization methods. It focuses on the high-level algorithms utilized by the encoder. The remainder of this paper is organized as follows. Section 2 gives background information regarding the encoder's high-level processing and related work. Section 3 presents the design of the IFP approach applied to VVenC, followed by an experimental evaluation of the proposed solution and its combination with other parallelization methods in Section 4. Finally, Section 5 concludes this paper with a summary and an outlook on future work.

Video Coding from a High-Level Perspective
Raw video data consists of single frames. Each of them is encoded separately. While some applications require encoding the pictures in their display order, when encoding for maximum efficiency, the frames can be grouped into so-called Groups of Pictures (GOP) [9] and sorted into a hierarchical order different from the display order (Figure 1).
The hierarchy level of a picture within a GOP is usually called its temporal layer, with each layer being assigned an identifier (TID). The compression rate of individual frames changes with the corresponding temporal layer. The bitstream is organized such that frames within a specific temporal layer can only reference frames from lower temporal layers for motion compensation; only frames within TID 0 may also reference frames of the same layer. Frames in lower temporal layers, especially the so-called keyframes with TID 0, have large temporal distances for motion compensation and are the basis for further referencing. Thus, those frames in particular are encoded with higher fidelity. The higher the TID of a frame, the lower the temporal distances to its motion prediction candidates as well as its influence on other frames, and thus the encoding fidelity is usually reduced.
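For illustration, in a dyadic GOP such as the one in Figure 1, the TID of a picture follows directly from its position within the GOP. The following minimal C++ sketch (our own illustration, not VVenC code) derives the temporal layer for a GOP whose size is a power of two:

```cpp
#include <cstdint>
#include <cstdio>

// Temporal layer (TID) of a picture in a dyadic GOP: the key picture
// (GOP position equal to the GOP size) has TID 0; every halving of the
// temporal distance adds one layer.
static int temporalId(uint32_t posInGop /* 1..gopSize */, uint32_t gopSize /* power of two */)
{
  if (posInGop == gopSize) return 0;  // key frame of the GOP
  int tid = 0;
  // each halving step at which the position is still misaligned adds a layer
  for (uint32_t step = gopSize; step > 1 && (posInGop % step) != 0; step >>= 1)
    ++tid;
  return tid;
}

int main()
{
  for (uint32_t p = 1; p <= 8; ++p)
    std::printf("GOP position %u -> TID %d\n", p, temporalId(p, 8));
}
```

For a GOP size of 8, this yields TID 0 for the key frame, TID 1 for the middle picture, TID 2 for the quarter positions, and TID 3 for the odd positions, i.e., the four temporal layers depicted in Figure 1.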
Frames themselves are subdivided into a rigid grid of square blocks of fixed size, called coding tree units (CTU) in VVC. On top of this subdivision, an additional high-level frame subdivision into tiles and slices can be defined. These constitute partially or fully independent regions with no symbol dependency and, depending on the bitstream configuration, possibly no signal dependency. This break in dependencies can also be utilized by the encoder to increase parallelism. The encoding is then implemented as a search for the optimal block subdivision, prediction, and residual coding within each CTU, which is an inherently dependent process that does not allow for much parallelization.
These existing high-level features already offer space for different parallelization approaches, which are discussed in the next section.

Related Work
The techniques described in [10] and [11] are founded on GOP and frame parallel processing methods. Frame parallelization within the GOP, due to temporal dependencies, results in a rigidly defined structure and lacks flexibility. Parallelization of independent GOPs can result in considerable processing delay and is more applicable to offline encoding. It also complicates rate control, making bitrate adaptation a challenge. On the other hand, without rate control, it is a simple and straightforward method that can already be utilized at the host level outside the encoder.
The works [12] and [13] encode slices in parallel. This approach has a disadvantage: an increase in the number of slices or tiles per frame, especially at low bitrates, may incur a significant bitrate overhead. For example, in [12], experiments were conducted using up to 12 slices per frame, with one slice per thread. The results demonstrate an overall speedup factor of 8 (12 threads) for the RA configuration at HD resolution, with a bitrate overhead of approximately 6%. Lower frame resolutions exhibit an even higher overhead, up to 10%.
The work in [14] combines GOP and slice parallelization methods. For the selected set of UHD sequences with a subdivision of 2 CTU lines per slice, a speedup factor of 6.8 for 8 threads is achieved. With 8 encoder instances of 8 threads each, UHD could be encoded in real time on a dual-32-core CPU system. Slices are the main cause of the reported average visual quality degradation of 0.49 dB PSNR.
In [15], the authors suggested a wavefront-style approach for the parallel processing of CTU lines. The method involves processing macroblocks of subsequent macroblock lines concurrently. Due to the prediction dependency on the above and above-right neighbor blocks, coding of the first block in a line must be delayed until the above-right block is coded. By conducting CTU line coding in parallel (with a fixed delay between adjacent lines), a wavefront pattern emerges. It allows an almost linear speedup for a low number of threads (up to 4). The parallelization throughput strongly depends on the number of CTU lines and the number of CTUs in a single line. Today, it is a widely used parallelization technique. Wavefront parallel processing (WPP) was first adopted in HEVC. In VVC, WPP was extended by interoperability with tiles, and the CTU delay was reduced by removing access to symbols and signals of the above-right CTU neighbor.
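To make the wavefront dependency concrete, the following C++ sketch (a generic illustration, not taken from any particular encoder) expresses the readiness condition for the CTU at column x of line y. The line-to-line delay is two CTUs with the HEVC-style above-right dependency, and one CTU in VVC, where that dependency was removed:

```cpp
#include <vector>

// progress[y] = number of CTUs already finished in CTU line y.
// A CTU (x, y) is ready when its left neighbor is done and the line above
// has advanced far enough: by 2 CTUs in HEVC-style WPP (above-right
// dependency), by 1 CTU in VVC WPP.
bool ctuReady(const std::vector<int>& progress, int x, int y, int ctuDelay /* 2 or 1 */)
{
  const bool leftDone  = (x == 0) || (progress[y] >= x);
  const bool aboveDone = (y == 0) || (progress[y - 1] >= x + ctuDelay);
  return leftDone && aboveDone;
}
```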
According to the research in [16], a live 4K intra encoder utilizes a combination of multiple parallelism types: frame, tile, WPP, and overlapping wavefront (OWF), where only tiles or WPP can be used at a time. In contrast to the coding of hierarchically inter-predicted GOPs, in all-intra coding mode, frames do not reference their neighbors, which makes them highly suitable for parallel processing. OWF, as first suggested in [17], extends the execution of WPP by letting threads that finish the CTU rows of one frame overlap into consecutive frames. Overall, the presented all-intra optimized encoder-decoder pipeline achieves 35 fps on average with 4.7% BDR overhead. It was 3x faster in comparison to an alternative streaming pipeline that included the VVenC encoder.
The Inter-Frame Wavefront (IFW) method from [18] is based on inter-frame-parallel processing combined with WPP. Due to the inter-prediction coding dependencies, the motion vector (MV) search is vertically constrained to a predefined search area of a particular number of CTU lines in the reference frame, depending on the progress of its encoding process. The encoding of a CTU can start as soon as the corresponding CTU line in the reference frame is finished and its reconstructed samples are available. IFW is able to improve inter-frame parallelism: speedup factors of 5x for 8 threads and 7.6x for 12 threads are achieved for 1080p sequences, with a reported BDR loss of 4.3% for the tested 1080p sequence. However, IFW is not directly suitable for VVenC. The Adaptive Loop Filter (ALF) [19], a new in-loop filtering stage in VVC, prevents processing a frame in CTU line subsets: ideally, the entire processed image should be available for the estimation of the ALF filter coefficients. Turning off ALF could result in an additional 8% bitrate overhead.
In [20], based on VVenC, the concept of hierarchical pipelined parallelization is proposed. The default method includes pipelined wavefront parallelization over CTU lines and a frame parallel mode, applied to independent frames from a specific temporal layer, thus not requiring dependency synchronization. The results show significant speedups at low-to-medium thread counts. For UHD, the default encoder configuration at 12 threads achieves a speedup of 10x at a negligible loss of compression performance. As explained before, such a strategy scales very well for higher temporal layers, where many independent frames are available. For lower temporal layers, only two independent frames might be available. This problem is amplified by the fact that those frames are encoded at higher fidelity, resulting in longer encoding times. With an increasing number of threads, the default strategy does not scale linearly and saturates at around 11x-12x speedup when using around 16 threads. For an even higher number of threads, additional parallelization techniques, such as tile parallelization, are required. Tiling helps to improve the speedup scaling but saturates as well, at a factor of 16x, and results in a coding loss of up to 5% BDR.
To achieve a higher speedup scaling factor by utilizing many cores (up to 32), an additional parallelization level is required. The goal is to further increase the number of independent tasks and reduce the structural dependencies on which the encoder spends time waiting for further tasks to become available. After a careful study of the potential benefits and consequences, using the IFW idea as a base, we extended the original solution and propose the inter-frame parallelization (IFP) method as a further multi-threading optimization in VVenC and VVC coding in general. Part of the work concerns analyzing and solving the parallelization bottleneck incurred by the ALF parameter derivation.

CTU Line-based Frame Parallel Encoding
An essential part of the VVenC parallelization scheme is fine-grained per-CTU task-based parallelization [20]. Initially designed as a frame-internal parallelization scheme, it decomposes the CTU encoding, reconstruction, and filtering processes into a set of tasks to be executed in a pipelined processing order. These tasks include the CTU mode estimation, reshaping (LMCS), the vertical deblocking filter, the horizontal deblocking filter, the SAO filter, ALF/CCALF statistics collection, ALF/CCALF filter derivation, and ALF/CCALF reconstruction. Besides the intrinsic pipeline dependencies, each CTU task has a set of additional requirements on the processing states of adjacent CTUs. For example, the CTU mode estimation requires that the mode estimation of the left, top-left, top, and top-right CTUs has already been finished. By fulfilling these requirements, the typical wavefront processing order emerges for some, but not all, of the CTU tasks. Note that this parallelization is implemented without the use of the VVC wavefront parallel processing (WPP) tool, although VVenC can be configured to use the WPP tool in combination with the described approach. An in-depth description of the CTU task-based parallelization scheme can be found in [20].
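A simplified sketch of this readiness check is given below; the stage names and data layout are our own simplification of the pipeline described in [20], not the actual VVenC task identifiers:

```cpp
#include <cstdint>

// Simplified per-CTU pipeline stages, loosely following the task split
// described above (names are ours, not the VVenC identifiers).
enum class Stage : uint8_t {
  ModeEstimation, Lmcs, DeblockVer, DeblockHor, Sao,
  AlfStats, AlfDerivation, AlfReconstruction, Done
};

struct Grid {
  int    widthCtus, heightCtus;
  Stage* stage;  // stage[y * widthCtus + x]
  Stage  at(int x, int y) const {
    if (x < 0 || y < 0 || x >= widthCtus || y >= heightCtus)
      return Stage::Done;  // out-of-frame neighbors never block
    return stage[y * widthCtus + x];
  }
};

// Mode estimation of CTU (x, y) may start once the left, top-left, top, and
// top-right neighbors have finished their own mode estimation - this is the
// requirement from which the wavefront processing order emerges.
bool modeEstimationReady(const Grid& g, int x, int y)
{
  return g.at(x - 1, y)     > Stage::ModeEstimation
      && g.at(x - 1, y - 1) > Stage::ModeEstimation
      && g.at(x,     y - 1) > Stage::ModeEstimation
      && g.at(x + 1, y - 1) > Stage::ModeEstimation;
}
```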
Running frames in parallel, each internally using the CTU task-based parallelization scheme of VVenC, can further improve the task throughput and the overall parallelization performance. However, in [20], this additional frame parallel processing relies on the full reconstruction of all inter-dependent previous frames. This allows, for example, in Figure 1, frames 3 and 6 to be processed in parallel after frames 0, 1, 2, and 9 have been reconstructed. In the typical random access coding order, the potential degree of frame parallelism increases with higher temporal layers (TL). Unfortunately, the most time-consuming frames to encode are at the lower temporal layers, e.g., frames 0, 1, 2, 9, and 10 in Figure 1, where the degree of parallelization is limited. This leads to a particular processing delay, because all other frames cannot start and must wait until the lower layer frames are finished.
To overcome this processing delay, the inter-frame dependencies have to be relaxed, allowing the encoding of a frame to start even when the referenced frames are not yet fully reconstructed. Like IFW [18], the inter-frame parallelization of VVenC processes frame subsets, in terms of CTU lines, in parallel. To synchronize between inter-dependent frames, a CTU line counter for each frame tracks the part of the frame that has already been reconstructed. Using a configurable CTU line offset, the encoding of a new CTU line in a frame is only permitted to start when the corresponding CTU line plus the CTU line offset has been reconstructed in the reference frames.
For example, in Figure 2, given a CTU line offset of one, CTU line 0 of the P-frame has already been encoded, while CTU line 1 is delayed and has to wait until CTU line 2 of the I-frame is finished. The intermediate B-frame has two reference frames. Here, CTU line 0 of the B-frame has to wait for the reconstruction of CTU line 1 in both the I- and the P-frame.
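The following C++ sketch illustrates this gating condition, assuming an atomic per-frame counter of fully reconstructed CTU lines (the structure and names are ours):

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

struct FrameState {
  std::atomic<int> reconstructedLines{0};  // CTU lines fully reconstructed so far
  int              totalLines = 0;
};

// A CTU line of the current frame may start mode estimation once every
// reference frame has reconstructed the collocated line plus the configured
// CTU line offset (clipped at the frame bottom). With offset 1, line 1 of
// the P-frame in Figure 2 needs lines 0..2 of the I-frame, i.e., 3 lines.
bool ctuLineMayStart(int line, int ctuLineOffset,
                     const std::vector<const FrameState*>& refFrames)
{
  for (const FrameState* ref : refFrames) {
    const int required = std::min(line + ctuLineOffset + 1, ref->totalLines);
    if (ref->reconstructedLines.load(std::memory_order_acquire) < required)
      return false;  // this reference frame has not advanced far enough yet
  }
  return true;
}
```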
With regard to the CTU task-based parallelization and the pipelined CTU tasks, synchronization is only required for the CTU mode estimation task, the first task of the CTU processing pipeline. Besides the initial synchronization with adjacent CTUs as described above, the CTU mode estimation is now linked to the counter of reconstructed CTU lines in the referenced frames. The remaining CTU tasks follow the initial pipeline synchronization scheme, as summarized above and described in detail in [20].

To enable inter-frame parallelization in VVenC based on partially reconstructed reference frames, the inter prediction capabilities of the encoder have to be constrained. Here, the vertical MV displacement must be limited to those frame regions that have already been reconstructed. In addition, the sub-pel positions within four sample lines at the bottom of the reconstructed area in the reference frames are not accessible, as shown in Figure 3. This is due to the VVC motion compensation interpolation filter process, where the interpolation filter has 8 taps for luma and 4 taps for chroma. Given the MV search algorithm of VVenC, where the sub-pel refinement is built around the best full-pel MV candidate, the available reference picture area has to be reduced by four additional sample lines.

Overall, the restriction of the inter prediction, reducing the reference picture area, has a negative impact on compression performance. However, the CTU line synchronization offset and, hence, the size of the available reference area can be configured in VVenC to balance the tradeoff between compression and parallelization efficiency. A higher CTU line synchronization offset allows additional vertical MV candidates, which may lead to higher compression efficiency, but also increases the initial delay for each CTU line in a frame, eventually reducing the speedup. In Section 4, simulation results illustrating this tradeoff in VVenC will be given.
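The effect of these constraints on the MV search can be summarized in a few lines; the sketch below (our own illustration, not the VVenC search code) computes the furthest vertical full-pel displacement available to a block, including the four-line sub-pel safety margin described above:

```cpp
// Furthest vertical full-pel displacement (in luma samples, downward
// positive) available to a block whose bottom edge lies at blockBottomY,
// given a reference frame with 'reconLines' reconstructed CTU lines of
// height ctuSize. Four sample lines are held back so that the sub-pel
// refinement around the best full-pel candidate (8-tap luma interpolation)
// never reads unavailable samples.
int maxVerticalMv(int blockBottomY, int reconLines, int ctuSize)
{
  const int availableBottom = reconLines * ctuSize - 4;  // sub-pel safety margin
  return availableBottom - blockBottomY;  // negative: the MV must point upward
}
```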

Partial ALF Filter Coefficients Derivation
One of the newly adopted VVC coding tools is the ALF. This filter is the final step in the encoder loop and applies spatial filtering to enhance the reconstructed video signal. Based on a Wiener filter, ALF shows the highest coding gain of all available independent VVC coding tools.
Overall, the ALF process consists of three different adaptive filters [19]. Two ALF filters are applied to the luma and chroma components independently. The third one is the Cross-Component ALF (CCALF), which adds a correction to the chroma components based on the collocated luma signal for each CTU.
For each of the three ALF filters, the encoder can derive new sets of filter coefficients, typically based on statistics gathered from the reconstructed picture after the deblocking and SAO filters have been applied. Newly derived filter coefficient sets are transmitted in the VVC bitstream and can be used for filtering the current as well as successive pictures in coding order. In VVC, up to eight different filter sets, containing coefficients for each of the three ALF filters, can be stored and re-used. In addition, VVC contains a pre-defined bank of 16 filter coefficient sets for the luma ALF filter. However, for ALF chroma and CCALF, no predefined filter coefficient sets have been specified. Consequently, for these filters, new filter sets have to be obtained through the derivation process.
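The available filter set storage can be pictured as follows; this is a structural sketch of the description above, not the actual VVC or VVenC data layout:

```cpp
// Structural sketch of the ALF filter set storage described above
// (illustration only, not the actual VVC/VVenC data layout).
struct AlfFilterSet {
  // coefficients for the luma ALF, chroma ALF, and CCALF filters
};

struct AlfFilterStorage {
  AlfFilterSet stored[8];      // derived sets, transmitted in the bitstream and re-usable
  AlfFilterSet fixedLuma[16];  // pre-defined sets, luma coefficients only; no fixed
                               // sets exist for ALF chroma or CCALF
};
```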
The CTU filter adaptation follows the derivation process. In this step, the encoder performs a rate distortion optimization (RDO)-based selection of the filter coefficient set to be used for each CTU. In addition, after deciding the best ALF filter set for each CTU, the encoder decides if the ALF/CCALF filters for each component should be disabled entirely for the whole frame. Finally, in the reconstruction step, the ALF/CCALF component filters are applied CTU-wise to finalize the in-loop picture reconstruction. Except for the filter derivation step, all other ALF steps, e.g., statistics gathering and reconstruction, have already been parallelized on a CTU basis, as shown in [20].
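The per-CTU filter adaptation can be sketched as a standard RDO loop; the distortion and rate estimates are stubbed out here, and all names are ours:

```cpp
#include <vector>

struct FilterChoice { int filterSetIdx; bool enabled; };

// Per-CTU ALF filter adaptation as an RDO loop: pick the filter set (or
// "filter off") with the lowest rate-distortion cost D + lambda * R.
// distortion() and rate() stand in for the encoder's actual estimates.
FilterChoice adaptCtu(const std::vector<int>& candidateSets, double lambda,
                      double (*distortion)(int setIdx), double (*rate)(int setIdx),
                      double distortionOff, double rateOff)
{
  FilterChoice best{ -1, false };
  double bestCost = distortionOff + lambda * rateOff;  // cost of disabling the filter
  for (int idx : candidateSets) {
    const double cost = distortion(idx) + lambda * rate(idx);
    if (cost < bestCost) { bestCost = cost; best = { idx, true }; }
  }
  return best;
}
```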
For the ALF luma filter, due to the availability of pre-defined fixed filter sets, the derivation of new filter sets could be omitted, and the fixed filter sets could be directly applied to the CTUs. However, using only the fixed filter sets results in a loss of coding performance, as will be shown in Section 4.
In summary, ALF chroma as well as CCALF require the derivation of filter sets, whereas for ALF luma, the compression efficiency can be improved by deriving new filter sets based on the statistics gathered from reconstructed pictures. In [20], VVenC uses the entire reconstructed picture to derive the different ALF filter sets. This approach is not suitable for inter-frame parallelization, where dependent frames start encoding before the reference frames are fully reconstructed. To continue facilitating the use of derived ALF filter sets, the ALF derivation process should be restricted to the partially reconstructed frame. Similar to the CTU line-wise MV synchronization, the ALF derivation subset is also defined in terms of CTU lines. The number of CTU lines used for the ALF derivation can be configured in VVenC as well. Like the CTU line synchronization offset parameter of the MV restriction, the number of ALF derivation lines exposes a tradeoff between compression efficiency and parallelization performance. The larger the number, and hence the more sample statistics that can be gathered, the better the filter performance is expected to be. As shown in Figure 4, for the remaining CTU lines, the encoder performs only the ALF filter adaptation, where it selects which filter set to use for each CTU.
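A sketch of the restricted statistics gathering, assuming per-CTU covariance statistics have already been collected in parallel (types and names are ours, heavily simplified):

```cpp
#include <algorithm>
#include <vector>

// Placeholder for the per-CTU auto-/cross-correlation statistics that the
// ALF derivation accumulates (heavily simplified).
struct AlfCtuStats { double autoCorr = 0.0, crossCorr = 0.0; };

// Accumulate derivation statistics only over the first 'derivationLines'
// CTU lines; CTUs below that boundary are excluded from the derivation and
// later only participate in the per-CTU filter adaptation.
AlfCtuStats gatherPartialStats(const std::vector<std::vector<AlfCtuStats>>& ctuStats,
                               int derivationLines)
{
  AlfCtuStats acc;
  const int lines = std::min<int>(derivationLines, static_cast<int>(ctuStats.size()));
  for (int y = 0; y < lines; ++y)
    for (const AlfCtuStats& s : ctuStats[y]) {
      acc.autoCorr  += s.autoCorr;
      acc.crossCorr += s.crossCorr;
    }
  return acc;
}
```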
In addition to the restricted ALF filter set derivation, the propagation of ALF filter sets, i.e., the re-use of filter sets in successive frames, has been disabled. This is necessary due to the parallel out-of-order encoding of frames, which prevents an ordered propagation of filter sets between successive frames in the encoder. Thus, there are no filter sets from previous frames that can be re-used.

EXPERIMENTAL RESULTS
The performance of the proposed methods is analyzed in terms of multi-threading speedup over single-threaded execution, and with regard to deviations in coding efficiency caused by the VVenC algorithm modifications.

Experimental Setup
The experiments were performed on a Linux server (Ubuntu 22.04) based on an AMD EPYC 7502P CPU. For all multi-threaded encodings, the number of threads was set to the maximum available, i.e., the 32 physical cores of the CPU. The single-threaded results were produced by employing the multi-core capability of the server and running 32 parallel encoder instances. The encoding setup can be found in Table 1. The reported results are averaged over the classes A1, A2 (UHD resolution), and B (HD resolution) of the JVET common test conditions (CTC) [21].
The captured runtime is the wall-clock time measured using the Unix time utility. The resulting speedup is the ratio of the single-threaded runtime T1 to the runtime TN of the multi-threaded encoding with N threads: S_N = T1 / TN. The coding efficiency metric BDR gives the coding efficiency as a weighted sum of the per-component Y, U, and V PSNR values, using the 6:1:1 weighting suggested in [22], i.e., PSNR_YUV = (6 * PSNR_Y + PSNR_U + PSNR_V) / 8. SIMD encoder optimizations are enabled in all experiments.
The proposed parallelization method was tested using the faster preset configuration of VVenC. The original VVenC setting of 4 parallel frames was not altered. This also corresponds to the settings of the experiments conducted in [20].

Evaluation of Inter-Frame Motion Prediction Restriction
In the case of inter-frame parallelization, the height of the reference area for inter-frame MV prediction, and hence the length of the vertical MV component, must be restricted. The resulting impact on coding efficiency can be determined in isolation by turning off the ALF.

The results in Table 2 show that an increase in the reference area height reduces the loss caused by the MV restriction. The significant coding loss at synchronization to the same CTU line (IFP0) arises due to the extremely limited vertical size of the reference region used in inter prediction. Such a loss is unacceptable and can degrade the final IFP performance. The 1-CTU line offset (IFP1), which corresponds to a vertical range of up to 124 samples downward (one 128-sample CTU line minus the four-line sub-pel margin), is already sufficient as a source for inter prediction. However, it works better for HD content than for UHD (0.45% vs. 1.29% loss, respectively). On UHD content, the encoder usually selects larger blocks at the sub-CTU level, whose inter prediction displacement in the IFP case is more restricted than that of smaller blocks. Enabling a 2-CTU line offset for inter prediction further reduces the compression loss while maintaining a similar speedup. In general, the HD speedup is lower than the UHD one. This is due to the smaller number of CTU lines and the smaller number of CTUs per line, which is predetermined by the CTU size and the video resolution. Due to the CTU line synchronization, IFP shifts the focus towards the parallelization of lines of different frames, as opposed to the in-frame WPP.

Evaluation of ALF Dependency
As described in Section 3.2, due to IFP, the ALF derivation process takes place for a subset of the picture, namely for a particular number of CTU lines, starting from the top of the picture. Table 3 presents the results for different ALF solutions with regard to inter-frame parallelization. The most straightforward method is to simply disable ALF. With an 8% BDR loss, this is the worst-case scenario of all those examined.
Using fixed luma filters and disabling chroma ALF improves the results somewhat. There are 16 fixed ALF filter sets available, and restricting ALF to just these reduces the luma loss by around 75%. Due to the missing chroma filtering in this solution, both for the regular ALF and CCALF, the chroma loss is still relatively high, negatively impacting the overall results.
Finally, the best performing solution is ALF filter derivation from only partial data, i.e., only a fixed number of top CTU lines, for both the luma and chroma channels. Such an approach can be applied to the luma ALF, the chroma ALF, as well as CCALF. This reduces both the luma and chroma loss, with the results improving the more CTU lines are used. At the same time, the exact number of used CTU lines does not have a large impact on the overall speedup (as long as it is kept relatively small compared to the total number of CTU lines). This shows that with a larger subset, the distortion statistics of the picture can be approximated better.
Based on these results, partial ALF derivation over the CTU lines was found to be the best way for IFP to reduce the compression performance impact.

Overall IFP Performance
The final IFP setup was tested for different CTU line synchronization settings, where the MV restriction and the partial ALF derivation are both aligned to the same number of CTU lines. Table 4 shows the results for the different synchronization offsets, along with the default VVenC parallelization method.
IFP0, with synchronization to the same CTU line, shows a large coding efficiency loss. The MV restriction, as discussed in Section 3.1, is the main cause. At the same time, the speedup is worse than with IFP1 and IFP2, which might be caused by the increased workload in the quantization steps required with the worse available prediction.
IFP1, with a 1-CTU line offset, achieves over an 18x speedup for UHD content. This is a significant improvement over the default mode, with only a 2.3% BDR bitrate overhead, which is primarily caused by the impaired accuracy of the chroma ALF filter derivation. For UHD, IFP2, with a 2-CTU line offset, shows the best trade-off between coding efficiency and speedup of all regular IFP modes. It achieves a speedup of 18.5x at about 1% BDR loss.
The overall IFP results also show an acceleration for HD. Efficient parallelization of the ALF processing layer leads to a further increase in speedup, in addition to the speedup observed in the MV restriction evaluation.
VVenC also provides other multi-threading approaches, i.e., parallelization over tiles and VVC's WPP, which can utilize the reduced CTU synchronization delay of VVC WPP for improved scaling. The results in Table 5 demonstrate that the use of IFP with tiles and WPP can further increase the speedup. However, as of the current VVenC version, the combination of IFP with tiles is only possible for horizontal tiling of the frame. Vertical tiling would either need an additional restriction of the reference frame area to the tiles above or an additional synchronization to the CTU lines above, which would introduce an extra delay.
Nevertheless, with 2-by-1 tiling, for UHD, IFP with a 2-CTU line offset achieves a speedup factor of more than 21 at a loss of around 5.2%. As can be seen, the additional tools introduce an additional bitrate overhead, with the IFP overhead added on top of it. In the HD use case as well, IFP on top of tiles and WPP provides an additional speedup (over 16x).

CONCLUSION
In this paper, we further investigated the parallelization opportunities in an open and optimized VVC software encoder, VVenC. Using CTU line synchronization, temporally dependent frames can still be parallelized, which results in an additional speedup. The newly introduced adaptive loop filter process, with some restrictions, remains part of the coding loop and thus continues to contribute to the coding efficiency.
The presented IFP approach, implemented in VVenC and using 32 threads, is more than 20% faster than VVenC's previous parallelization approaches, at a relatively small coding efficiency loss of around 1% for UHD content. For UHD, the speedup is between 17x and 21x. IFP can be combined with other parallelization strategies, such as tiles and WPP, still providing additional speedups.
For HD videos, IFP also provides a speedup increase. However, it is not as effective as in the UHD case. Due to the smaller video resolution, the latencies caused by low-level inter-task dependencies of the encoder prevent higher speedups. To improve this, more parallelizable CTUs, and hence less synchronization latency, are needed. A refinement of the CTU line synchronization method could help, where the CTU line could also be synchronized CTU-wise horizontally with respect to a horizontal MV restriction. It should be noted that, depending on the size of the horizontal MV constraint, an additional bitrate overhead is to be expected.
It might be possible to improve the compression of the restricted ALF by enabling ALF temporal prediction through the adaptation parameter sets (APS). This lets subsequent frames re-use the filter sets of previous frames, which cuts down on the amount of data needed to transmit new filter sets. However, the expected improvement is in the range of 0.1-0.2%, and it requires a particular synchronization for the filter set propagation.
Considering the rate control (RC) mode, the IFP might introduce some hurdles due to the fully parallel frame processing. The advanced RC algorithm in VVenC [23] is efficiently aligned with the original frame parallelization approach of the encoder, where only frames that belong to the same TL are encoded in parallel. This allows an update of the RC with the exact bitrates of the encoded frames, enabling better predictions for subsequent frames. IFP requires that the prediction be done before the data of previous frames is available, which makes a reliable update of the RC algorithm harder. Harmonization of the new IFP parallelization strategy with the RC will be a focus of future work.

Figure 1: Group of pictures coding structure with four temporal layers. Arrows depict the motion prediction dependencies. Numbers below frames depict the coding order.

Figure 3: MV restriction at the bottom of the CTU line.

Figure 4: Frame with ALF derivation in the first CTU row and CTU-wise adaptation for the remaining rows.

Table 3: Bitrate overhead (BDR) and speedup results for different solutions to reduce the ALF dependency.

Table 2: Impact of the vertical MV component restriction for different synchronization offsets (32 threads, ALF off in both anchor and candidate).

Table 4: Overall IFP results compared to the default VVenC mode for different synchronization offsets (32 threads).