Spatial Scalability with AV1: A Comparison between Scalable AV1 and MPEG-5 LCEVC in Video Quality and Complexity

Rising video streaming demand calls for scalable delivery methods that enhance the user experience and reduce bitrate and storage for applications requiring multiple versions of the same video. In broadcasting, spatial scalability eliminates the need for simulcasting by serving both HD and UHD users with a single stream. The concept extends to supporting various devices and networks in multi-conferencing, and finds applications in augmented adaptive streaming, scalable video messaging, low-latency pixel streaming, and cloud gaming. Moreover, multi-layer encoding can enhance the user experience by adding features such as HDR on top of an SDR base, or bit-depth scalability.

Scalable Video Coding AV1 (AV1-SVC, not to be confused with SVT-AV1): uses AV1's rescale feature, where the decoded frames of the base layer (BL) are provided to the enhancement layer (EL) as additional reference frames. These are upscaled, and the encoder adaptively decides whether to use the upscaled reference frames from the BL or the previously encoded frames from the EL.

MPEG-5 Part 2 LCEVC [1] (Low Complexity Enhancement Video Coding) enhancing AV1: employs a codec-agnostic EL that is combined with any BL, in this case SVT-AV1 [2], yielding an enhanced video stream. The LCEVC stream typically improves the complexity vs. quality trade-off of the single-layer (SL) codec that it enhances, and it offers quality improvements for LCEVC-supported devices without affecting non-LCEVC devices.
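
Both approaches above pair a lower-resolution base layer with an enhancement layer that restores detail. The toy sketch below illustrates only that general idea, in the form of a residual added on top of an upscaled base. It is not the LCEVC bitstream or its transforms, and it does not model AV1-SVC's inter-layer reference prediction. The function names, the 2x2 averaging downscaler, the nearest-neighbour upscaler and the uniform quantizer are all assumptions chosen for brevity.

```python
def downsample2(img):
    """Halve resolution by averaging each 2x2 block (stand-in for deriving HD from UHD)."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]


def upsample2(img):
    """Double resolution by nearest-neighbour repetition (a deliberately crude upscaler)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def quantize(img, step):
    """Uniform quantizer standing in for lossy coding of a layer."""
    return [[round(v / step) * step for v in row] for row in img]


def encode_two_layers(frame, base_step, enh_step):
    """Return (base layer, enhancement residual) for a frame with even dimensions."""
    base = quantize(downsample2(frame), base_step)      # hypothetical base-layer "codec"
    prediction = upsample2(base)                        # what a base-only device would show
    residual = [[f - p for f, p in zip(f_row, p_row)]
                for f_row, p_row in zip(frame, prediction)]
    return base, quantize(residual, enh_step)


def decode(base, residual=None):
    """Base-only decoders pass residual=None; enhanced decoders add the residual."""
    prediction = upsample2(base)
    if residual is None:
        return prediction
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residual)]


# Synthetic 4x4 "frame" used purely for illustration.
frame = [[10, 12, 50, 52],
         [11, 13, 51, 53],
         [90, 92, 30, 32],
         [91, 93, 31, 33]]
base, residual = encode_two_layers(frame, base_step=4, enh_step=1)
base_only = decode(base)            # upscaled base, detail lost
enhanced = decode(base, residual)   # base plus enhancement
```

In this sketch a decoder that ignores the residual still produces a valid upscaled picture, which mirrors the point that devices without enhancement support are unaffected.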

This study explores the pros and cons of these scalable methods. Scalable video encoding is known to suffer from greater encoding complexity and efficiency loss compared to SL encoding. The research compares SL UHD AV1 encodings performed with two base codecs (AMD xAV1 and the SVT-AV1 software encoder), along with their respective SL HD renditions (upscaled to UHD [3]), with two scalable HD+UHD AV1 encodings, i.e., SVC AV1 and LCEVC-enhanced AV1. Key areas of assessment are:
- Objective quality metrics (VMAF, VMAF_NEG, PSNR [4])
- Rate-distortion (RD) curves
- Visual quality assessment, e.g., upscaled base vs. enhanced
- Encoding complexity

VIDEO QUALITY RESULTS
The content set includes known test clips from the Netflix dataset plus specifically selected very high-detail footage to ensure the difference between upscaled HD and UHD is clearly perceivable, encoded in CBR at streaming-relevant bitrates and qualities. Average VMAF RD curves for SL UHD, SL HD and scalable HD+UHD across the tested contents are reported below. Average BD-rates [5] across the dataset are in the table below (positive means a bitrate increase, negative means bitrate savings).
SVC AV1 UHD is slightly less efficient than SL AV1 UHD and HD. In the case of LCEVC, scalable HD+UHD has encoding efficiency greater than or similar to that of SL SVT-AV1 UHD, depending on the metric, and it clearly outperforms HD. Note that the presented results reflect both the formats themselves (i.e., SVC AV1 vs. the combination of LCEVC with AV1) and how well the encoders utilize the coding tools. Further optimizations can be expected in the future for both alternatives.
Full results are available here.
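
As a reminder of what a BD-rate figure expresses, the sketch below computes the average bitrate difference between two RD curves at equal quality, with the sign convention used above (positive means a bitrate increase, negative means savings). It is a simplified variant: it interpolates log-bitrate piecewise-linearly over the overlapping quality range, whereas the original Bjøntegaard method fits a cubic polynomial. The function names and the sample points are hypothetical and are not results from this study.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the interpolation range")


def bd_rate(anchor, test, steps=1000):
    """Average bitrate change (%) of `test` relative to `anchor` at equal quality.

    Each curve is a list of (bitrate, quality) points with distinct quality values,
    e.g. (kbps, VMAF). Assumes quality increases with bitrate.
    """
    a = sorted((q, math.log(r)) for r, q in anchor)
    t = sorted((q, math.log(r)) for r, q in test)
    a_q, a_logr = zip(*a)
    t_q, t_logr = zip(*t)

    # Only the quality interval covered by both curves is comparable.
    lo, hi = max(a_q[0], t_q[0]), min(a_q[-1], t_q[-1])
    if lo >= hi:
        raise ValueError("the two curves share no quality range")

    # Trapezoidal mean of the log-bitrate difference over [lo, hi].
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        q = min(lo + i * h, hi)  # clamp against floating-point overshoot
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * (_interp(q, t_q, t_logr) - _interp(q, a_q, a_logr))
    mean_log_diff = total * h / (hi - lo)

    return (math.exp(mean_log_diff) - 1.0) * 100.0


# Made-up illustrative points: the test curve needs 10% less bitrate at every quality.
anchor_points = [(2000, 80.0), (4000, 88.0), (8000, 93.0), (16000, 96.0)]
test_points = [(r * 0.9, q) for r, q in anchor_points]
savings = bd_rate(anchor_points, test_points)  # approximately -10 (%)
```

Averaging in the log-bitrate domain is what makes the result a relative (percentage) figure rather than an absolute bitrate gap.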

COMPLEXITY RESULTS
Initial LCEVC findings compare SL HD SVT-AV1, SL UHD SVT-AV1 and scalable LCEVC HD+UHD. Encodes are performed on an Intel Core i9 server and the Linux utility [6] is used to measure CPU time. The table below reports user+system encoding times in

CONCLUSIONS
While SVC AV1 may not be independently advantageous for UHD video delivery, it shows promise in specific applications such as video conferencing and HD+UHD simulcast. LCEVC, in addition to scalability advantages, also offers bitrate savings compared to SL AV1, along with a significant reduction in UHD encoding cost.
Although the benefits of LCEVC for software encoding are readily apparent, assessing its impact on hardware solutions is more complex. Hardware transcoding solutions may be constrained by other factors like decode capacity and available DDR bandwidth. Recognizing this complexity, there is potential for further evaluation of LCEVC in hardware contexts, suggesting new possibilities for efficient and high-quality video delivery.