research-article
Open Access

Deep Learning-Based Intra Mode Derivation for Versatile Video Coding

Published: 17 February 2023


Abstract

In intra coding, Rate Distortion Optimization (RDO) is performed to select the optimal intra mode from a pre-defined candidate list. Besides the residual signal, the optimal intra mode must also be encoded and transmitted to the decoder side, which consumes a considerable number of coding bits. To further improve the performance of intra coding in Versatile Video Coding (VVC), an intelligent intra mode derivation method is proposed in this paper, termed Deep Learning based Intra Mode Derivation (DLIMD). Specifically, the process of intra mode derivation is formulated as a multi-class classification task, which aims to skip the module of intra mode signaling to reduce coding bits. The architecture of DLIMD is developed to adapt to different quantization parameter settings and variable coding blocks including non-square ones, so that only a single trained model is required. Unlike existing deep learning based classification approaches, hand-crafted features are also fed into the intra mode derivation network in addition to the learned features from the feature learning network. To compete with traditional methods, one additional binary flag is utilized in the video codec to indicate the scheme selected by RDO. Extensive experimental results reveal that the proposed method achieves 2.28%, 1.74%, and 2.18% bit rate reduction on average for the Y, U, and V components on the VVC test model platform, which outperforms the state-of-the-art works.


1 INTRODUCTION

With the rapid development of information technology, videos have been applied to the fields of entertainment, surveillance, education, and so on. To adapt to more applications in our daily life, videos have evolved in various dimensions in the last decade, including High Definition (HD), Wide Color Gamut (WCG) [14], High Dynamic Range (HDR) [14], Multi-view Video plus Depth (MVD) [28], 360 degree video [43], light field image/video [11], and dynamic point cloud [33]. Unfortunately, from low dimension to high dimension, the dramatically increased volume of video data challenges the limited storage space and transmission bandwidth. From H.264/Advanced Video Coding (AVC) [39] and High Efficiency Video Coding (HEVC) [35] to the state-of-the-art Versatile Video Coding (VVC) [7] issued in 2020, a large compression ratio has been achieved, yet it still cannot keep up with the growth of video data. Advanced video compression algorithms are always desired to maximize the visual quality at a given bandwidth budget.

In the framework of existing hybrid video coding, the modules mainly consist of intra/inter prediction, transform, quantization, entropy encoding and in-loop filtering. To improve compression efficiency, a variety of novel coding tools have been developed in the issued standards, including QuadTree plus Multi-Type Tree (QT+MTT) structure [17] for coding block partition, Matrix-based Intra Prediction (MIP) [34] and Cross-Component Linear Model (CCLM) [45] for intra luma and chroma prediction, History-based Motion Vector Prediction (HMVP) [46] and Decoder-side MV Refinement (DMVR) [15] for motion estimation/compensation, Multiple Transform Selection (MTS) [8] for transform, CABAC engine with multi-hypothesis probability estimation for entropy encoding, and Sample Adaptive Offset (SAO) and Adaptive Loop Filter (ALF) for in-loop filtering. These mentioned coding tools have achieved significant coding gains.

One of the most important modules is intra prediction [19], which aims to remove spatial redundancy as much as possible. Parts of the available neighboring blocks are weighted to produce the predicted block. Traditionally, intra modes include Planar, DC, and angular modes. To achieve more accurate prediction results, various algorithms have been developed. In [4], intra prediction was analyzed in the frequency domain, and the frequency components were selectively discarded to improve the performance. Li et al. [20] presented a bi-intra prediction method based on the binary combination of existing uni-intra prediction modes. Rather than regular out-block reference pixels, in-block ones were employed in [2] to perform intra prediction for screen content, and an additional in-loop residual signal was used. An iterative filtering method was employed for intra prediction in addition to the traditional intra prediction in [10]. To obtain more reference pixels, a multi-line based scheme was presented in [21], where six more lines of pixels located at the above and left neighbors were collected. Different from a fixed scan order, an adaptive block coding order [49] was proposed for intra prediction to better exploit spatial correlations. Analogous to motion estimation in inter coding, Intra Block Copy (IBC) [42] was introduced for screen content, which aims to exploit long distance correlations in an image. Two modes with high probability from the gradient histogram were combined to generate a new intra mode in [1]. In [47], the local and nonlocal correlations were exploited for hybrid intra prediction, where adaptive template matching prediction, combined local and nonlocal prediction, and combined neighboring modes prediction were performed. The methods mentioned above exploit spatial redundancy from neighbors with manually designed functions, which may limit the performance. Advanced schemes are desired to adapt to diverse video contents.

To further improve the compression efficiency of intra coding, the signal processing problem has been reformulated as an artificial intelligence task, where powerful neural networks are adopted [25, 48] and a training database for deep video compression is provided in [24]. Specifically, the problem of intra luma prediction was formulated as an inpainting task [50], and the problem of intra chroma prediction was modeled as a colorization task [23, 52]. An iterative training strategy for neural networks was presented in [12], where training blocks were collected from the previous iteration to further improve performance. Wang et al. [38] proposed a multi-scale convolutional network based intra prediction approach, in which the neighboring reconstructed L-shape was fed to the network together with the traditional angular intra prediction result to make more accurate predictions. With a conditional autoencoder [6], multi-mode intra prediction was performed for luma and chroma components. Sun et al. [36] proposed two enhanced intra prediction schemes with multiple neural networks, where the appending scheme was to replace the traditional modes and the substitution scheme was to replace the highest and lowest probable traditional modes. In [16], a progressive spatial recurrent neural network was presented for intra prediction, which was able to produce the prediction by passing information along from the previous output. To adapt to variable coding blocks in intra prediction, fully connected and convolutional neural networks were carefully designed [13] for small and large blocks, respectively. Most of these existing learning based methods aim to make more accurate luma and chroma predictions from a regression perspective to achieve coding gains, while the module of intra mode derivation has not been exploited from a classification perspective with deep learning tools.

In intra coding, the intra mode is also required to be encoded and transmitted to the decoder side besides residual signal. For intra mode signaling, Most Probable Mode (MPM) list, which is constructed from the neighboring blocks, plays an important role and saves significant coding bits. In [18], two MPM construction methods were presented for VVC, where one was extended from HEVC, and the other was sorted according to the probability of each candidate. Besides the nearest neighboring lines, Chang et al. [9] extended MPM mechanism to Multi-Reference Line (MRL) scheme for better performance. A conditional random field model was established to re-construct the MPM list in [22], where the short and long range correlations were considered. In addition, decision tree was utilized to exploit multiple dynamic lists of intra mode signaling [32]. By investigating the occurrences of intra modes in the neighboring blocks, Most Frequent Mode (MFM) list [44] was derived to compete with the existing MPM list. To skip intra mode signaling and save coding bits, Xu et al. [41] proposed a predictive coding scheme, in which the angular correlation in spatial domain was calculated with modulo-N arithmetic operations. Additionally, template based [40], histogram of gradients based [30], and texture analysis based [29] intra mode derivation methods were presented in a manual manner. For depth video coding, a coding tool [27] was presented to reduce intra mode signaling bitrate, in which the texture intra modes were inherited for the depth intra modes. Basically, the MPM list construction and intra mode derivation have been investigated by traditional statistics and experience, which can be further improved with advanced learning based schemes.

In this work, to skip the module of intra mode signaling and save coding bits, the process of intra mode derivation is formulated as a multi-class classification task. The main contributions of this work are listed as follows.

(1)

The process of intra mode derivation in intra coding is modeled as a multi-class classification task, termed as Deep Learning based Intra Mode Derivation (DLIMD), which is used to skip the module of intra mode signaling for saving bits.

(2)

In DLIMD, the learned features and hand-crafted features are combined together for intra mode derivation. Additionally, the proposed DLIMD can be applied to variable coding blocks (including non-square blocks) and any different Quantization Parameter (QP) settings.

(3)

To further improve the performance, one additional binary flag is utilized to indicate the finally selected scheme from Rate Distortion (RD) cost competition. The proposed method achieves superior performance when compared with the state-of-the-art algorithms.

The remainder of this work is organized as follows. Motivation is presented in Section 2. The proposed DLIMD for video coding is discussed in detail in Section 3. The experiments are conducted and the results are analyzed in Section 4. Section 5 concludes this work.


2 MOTIVATION

In VVC, intra coding modes/tools [31] include DC, Planar, 65 angular modes, Wide Angle Intra Prediction (WAIP), MRL, Position Dependent Prediction Combination (PDPC), MIP, Intra-Sub Partition (ISP), and CCLM. It should be mentioned that the intra mode is also required to be encoded and transmitted to the decoder side. To signal these intra modes effectively, six candidate modes are derived from the intra modes of the neighbors and placed in the MPM list. The first entry of the MPM list is always fixed to Planar mode, which is encoded with two bits. The other five MPMs are obtained according to spatial correlation with the neighbors, and are encoded with three to six bits. The non-MPM modes are divided into two parts, which contain 3 and 58 modes, respectively. Truncated binary encoding is performed for them with six-bit and seven-bit lengths. The detailed intra mode signaling can be found in Figure 1. In addition, statistical experiments are conducted on the platform of VVC Test Model version 5.0 (VTM 5.0) to measure the coding Bits Per intra Mode (BPM), where ten sequences with various contents from different classes are encoded under the All Intra (AI) configuration. The value of BPM is calculated as the total coding bits of intra mode divided by the number of intra blocks, where the coding bits are collected after CABAC entropy encoding. The statistical results are shown in the left columns of Table 1, and the values of BPM are 3.35, 3.48, 3.44, and 3.39 on average under the four QP settings.


Fig. 1. 67 intra mode signaling in VVC.
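The 3/58 split of the non-MPM modes follows directly from truncated binary coding of 61 symbols. The sketch below reproduces the six-bit and seven-bit lengths quoted above; it assumes that each non-MPM codeword is preceded by a single bin indicating that the mode is not in the MPM list (the constant `MPM_FLAG_BITS` is an illustrative name, not taken from the VTM source), and it counts bins before CABAC rather than final coded bits.

```python
def truncated_binary_length(index: int, alphabet_size: int) -> int:
    """Number of bits truncated binary coding spends on symbol `index`."""
    k = alphabet_size.bit_length() - 1            # floor(log2(alphabet_size))
    short_codes = (1 << (k + 1)) - alphabet_size  # symbols that get only k bits
    return k if index < short_codes else k + 1


NON_MPM_MODES = 67 - 6  # 61 modes remain after the six MPMs
MPM_FLAG_BITS = 1       # assumption: one bin saying "not in the MPM list"

lengths = [
    MPM_FLAG_BITS + truncated_binary_length(i, NON_MPM_MODES)
    for i in range(NON_MPM_MODES)
]
print(lengths.count(6), lengths.count(7))  # 3 58
```

With 61 symbols, `k = 5` and `2^6 - 61 = 3`, so three modes receive the shorter codeword and the remaining 58 receive the longer one.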

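| Class | Sequence | BPM \(\alpha\), QP = 22 | BPM \(\alpha\), QP = 27 | BPM \(\alpha\), QP = 32 | BPM \(\alpha\), QP = 37 | Percentage \(\beta\), QP = 22 | Percentage \(\beta\), QP = 27 | Percentage \(\beta\), QP = 32 | Percentage \(\beta\), QP = 37 |
|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | 2.18 | 2.90 | 2.92 | 2.97 | 5.39% | 10.9% | 14.7% | 18.3% |
| A | FoodMarket4 | 2.45 | 2.61 | 2.63 | 2.71 | 7.14% | 10.4% | 12.9% | 14.9% |
| B | BasketballDrive | 2.94 | 2.92 | 2.93 | 2.85 | 6.60% | 11.5% | 15.6% | 18.3% |
| B | BQTerrace | 3.40 | 3.57 | 3.53 | 3.44 | 5.27% | 9.46% | 14.1% | 18.9% |
| C | BQMall | 3.84 | 3.86 | 3.75 | 3.65 | 8.95% | 12.2% | 16.3% | 20.8% |
| C | BasketballDrill | 3.52 | 3.55 | 3.58 | 3.76 | 14.6% | 18.4% | 21.0% | 25.1% |
| D | BlowingBubbles | 4.34 | 4.25 | 4.17 | 3.82 | 7.88% | 11.5% | 16.3% | 20.6% |
| D | BasketballPass | 3.85 | 4.04 | 3.93 | 3.67 | 8.82% | 11.8% | 16.4% | 22.1% |
| E | FourPeople | 3.59 | 3.62 | 3.57 | 3.53 | 9.80% | 13.8% | 17.7% | 21.8% |
| E | Johnny | 3.37 | 3.43 | 3.43 | 3.50 | 8.39% | 13.1% | 18.7% | 23.2% |
| AVERAGE | | 3.35 | 3.48 | 3.44 | 3.39 | 8.28% | 12.3% | 16.4% | 20.4% |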

Table 1. Statistical Results of Intra Mode Signaling

Furthermore, to demonstrate how many bits are spent in the module of intra mode signaling, the percentage of coding bits of intra mode in a frame is collected and shown in the right columns of Table 1. It can be found that this percentage increases from 8.28% to 20.4% on average as the QP value increases. For small QP settings, the percentage is limited because the coding bits of the residue (the difference between prediction and source) are much larger than those of the intra mode. For large QP settings, the coding bits of the residue become limited, which results in a high percentage of coding bits of intra mode. From these results, we can conclude that a more advanced intra mode signaling approach can further improve the coding performance.


3 PROPOSED DEEP LEARNING BASED INTRA MODE DERIVATION FOR VIDEO CODING

3.1 Problem Formulation and Framework

In this work, we focus on optimizing the signaling of the DC, Planar, and 65 angular modes for the luma component, while the chroma component is not considered. According to Figure 1, the straightforward way to improve coding performance is to predict the best intra mode from all 67 candidates and place it first in the MPM list. This intra mode derivation can achieve promising performance, because the intra mode signaling then consumes only two bits, which is fewer than in the other cases. However, it can be improved further by skipping the RD checking process and the intra mode signaling altogether to save coding bits.

The optimal intra mode of the current block is finally selected based on the minimum RD cost by checking the candidate list. This process can be represented by the following equation,

\begin{equation} n^* = \mathop{\arg\min}_{n}\big\{D_n + \lambda\big(R_n^{r} + R_n^{m} + R_n^{o}\big)\big\}, \tag{1} \end{equation}

where \(n\) indicates the index of the intra mode, \(n \in [0, 66]\) for Planar, DC, and 65 angular modes, \(D_n\) is the distortion, \(\lambda\) is the Lagrange multiplier, and \(R_n^r\), \(R_n^m\), and \(R_n^o\) indicate the coding bits of residue, intra mode, and other information, respectively.

According to Equation (1), selecting the optimal mode from a pre-defined candidate list can be formulated as a multi-class classification task. Generally, the construction of the MPM list can be regarded as a manual classification scheme, in which the top-6 intra modes are manually selected. To further improve the performance, we aim to solve this multi-class classification task with a deep learning approach. Specifically, the optimal intra mode can be derived directly instead of checking the candidate list, and the module of intra mode signaling is expected to be skipped to reduce coding bits. Figure 2 illustrates the framework of the proposed deep learning based intra mode derivation for video coding. T and Q indicate transform and quantization, and T\(^{-1}\) and Q\(^{-1}\) indicate inverse transform and inverse quantization.


Fig. 2. Framework of proposed deep learning based intra mode derivation for video coding.

In the video encoder, intra mode derivation is performed, including the conventional intra mode checking from the intra mode list and the proposed DLIMD. According to the RD cost, only one of them will be finally selected. If DLIMD is selected, the strategy flag is set as 1 and the switch is opened to skip the module of intra mode encoding; otherwise, the strategy flag is set as 0 and the switch is closed to activate the module of intra mode encoding. The strategy flag is always encoded and transmitted to indicate the selected scheme. It is worth mentioning that the other modules in the video codec are not changed. The RD cost competition between DLIMD and the traditional method (including DC, Planar, angular modes, MRL, MIP and ISP) can be represented by the following equation,

\begin{equation} S^* = \mathop{\arg\min}_{S}\big\{D_S + \lambda\big(R_S^{r} + R_S^{m} + R_S^{f} + R_S^{o}\big)\big\}, \tag{2} \end{equation}

where \(S^*\) indicates the selected scheme, i.e., \(S\in\) {DLIMD, traditional method}, \(D_S\) is the distortion under scheme \(S\), \(\lambda\) is the Lagrange multiplier, and \(R_S^r\), \(R_S^m\), \(R_S^f\), and \(R_S^o\) indicate the coding bits of residue, intra mode, strategy flag, and other information, respectively. In addition, it should be noted that if the proposed DLIMD is selected, there are no coding bits for the intra mode, i.e., \(R_S^m = 0\).
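The competition in Equation (2) can be sketched as follows. This is a minimal illustration rather than the VTM implementation: the class `SchemeCost`, the helper `select_scheme`, and all numeric values are hypothetical and chosen only to show how setting the intra mode bits to zero can let DLIMD win despite a slightly larger distortion.

```python
from dataclasses import dataclass


@dataclass
class SchemeCost:
    """Distortion and rate terms of Equation (2) for one scheme S."""
    distortion: float    # D_S
    residue_bits: float  # R_S^r
    mode_bits: float     # R_S^m (zero when DLIMD is selected)
    flag_bits: float     # R_S^f
    other_bits: float    # R_S^o

    def rd_cost(self, lam: float) -> float:
        rate = (self.residue_bits + self.mode_bits
                + self.flag_bits + self.other_bits)
        return self.distortion + lam * rate


def select_scheme(costs: dict, lam: float) -> str:
    """Return the name of the scheme with the minimum RD cost."""
    return min(costs, key=lambda name: costs[name].rd_cost(lam))


# Illustrative numbers only; they are not measurements from the paper.
candidates = {
    "traditional": SchemeCost(100.0, 40.0, 3.0, 1.0, 5.0),
    "DLIMD": SchemeCost(104.0, 41.0, 0.0, 1.0, 5.0),
}
chosen = select_scheme(candidates, lam=4.0)
strategy_flag = 1 if chosen == "DLIMD" else 0
print(chosen, strategy_flag)
```

At a smaller Lagrange multiplier the rate term carries less weight, so the same candidates can favor the traditional scheme instead.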

In the video decoder, the strategy flag is firstly decoded before intra prediction. If this strategy flag is 0, the intra mode will be decoded directly; otherwise, the intra mode will be derived by the proposed DLIMD. With the intra mode, intra prediction is performed accordingly. Finally, the prediction result plus decoded residual information produces the reconstruction.

To estimate the upper bound of performance under the proposed framework, let \(\alpha\) and \(\beta\) denote the original value of BPM and the percentage of coding bits of intra mode in a frame, respectively (their statistical values are given in Table 1), and let \(\gamma\) denote the percentage of intra blocks that select the proposed scheme. One additional binary flag, encoded in context mode, is utilized to indicate whether the proposed scheme or the original scheme is used. Then, the value of BPM becomes \(-\gamma \log_2(\gamma) + (1 - \gamma)\big(\alpha - \log_2(1 - \gamma)\big)\). Accordingly, the bit saving can be calculated as follows,

\begin{equation} \eta = \frac{\alpha - \big[-\gamma \log_2(\gamma) + (1 - \gamma)\big(\alpha - \log_2(1 - \gamma)\big)\big]}{\alpha} \times \beta. \tag{3} \end{equation}

The upper bound is reached when \(\gamma = 100\%\): the bracketed BPM term then vanishes, and the bit saving equals \(\beta\). As such, the upper bound of bit saving can reach 8.28%, 12.3%, 16.4%, and 20.4% when the QP value equals 22, 27, 32, and 37, respectively.
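As a numerical check of Equation (3), the sketch below evaluates the bit saving for the average \(\alpha\) and \(\beta\) values of Table 1. The function names are illustrative, and the limits at \(\gamma = 0\) and \(\gamma = 1\) are handled explicitly because \(\log_2(0)\) is undefined while \(x \log_2(x) \rightarrow 0\) as \(x \rightarrow 0\).

```python
import math


def bpm_with_flag(alpha: float, gamma: float) -> float:
    """BPM after adding the strategy flag, as in the numerator of Eq. (3)."""
    if gamma <= 0.0:
        return alpha  # nothing is skipped and the flag costs nothing
    if gamma >= 1.0:
        return 0.0    # every block skips mode signaling
    return (-gamma * math.log2(gamma)
            + (1.0 - gamma) * (alpha - math.log2(1.0 - gamma)))


def bit_saving(alpha: float, beta: float, gamma: float) -> float:
    """Overall bit saving eta of Eq. (3); beta and the result are fractions."""
    return (alpha - bpm_with_flag(alpha, gamma)) / alpha * beta


# Average alpha (BPM) and beta (share of mode bits) per QP, from Table 1.
table1_average = {22: (3.35, 0.0828), 27: (3.48, 0.123),
                  32: (3.44, 0.164), 37: (3.39, 0.204)}

for qp, (alpha, beta) in table1_average.items():
    upper_bound = bit_saving(alpha, beta, gamma=1.0)
    print(f"QP {qp}: upper bound {100 * upper_bound:.2f}%")
```

For intermediate values of \(\gamma\) the flag overhead reduces the saving, which is why the bound \(\eta = \beta\) is only attained when every intra block selects the proposed scheme.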

Besides theoretical analysis, an experiment has been conducted to present the upper bound of bit saving under the proposed framework. At the encoding stage, the optimal intra mode of the current block is obtained after rate distortion cost comparison among all the candidates. Then, this optimal intra mode is regarded as the predicted one to skip intra mode signaling for bits reduction. The sequences are encoded with small QPs {11, 16, 21, 26}, normal QPs {22, 27, 32, 37}, large QPs {33, 38, 43, 48}, and the default AI configuration. The coding performance is measured by Bjøntegaard Delta Bit Rate (BD-BR) [3] with respect to the original VTM. From Table 2, it can be found that the upper bound of bit saving can reach 5.68%, 12.3%, and 19.5% on average for the luma component under small, normal, and large QP settings, which are close to those from the theoretical analysis. However, this is the ideal case that cannot be achieved because the intra mode cannot be accurately predicted.

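| Class | Sequence | Small QPs {11, 16, 21, 26}: Y | U | V | Normal QPs {22, 27, 32, 37}: Y | U | V | Large QPs {33, 38, 43, 48}: Y | U | V |
|---|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | -5.80 | -4.49 | -4.05 | -12.3 | -11.9 | -11.6 | -18.3 | -17.8 | -17.8 |
| A | FoodMarket4 | -4.12 | -5.19 | -4.76 | -10.8 | -8.99 | -8.73 | -14.2 | -13.4 | -14.9 |
| B | BasketballDrive | -3.00 | -2.31 | -3.45 | -10.4 | -9.63 | -10.7 | -18.2 | -19.0 | -21.1 |
| B | BQTerrace | -3.49 | -2.84 | -2.96 | -8.68 | -8.62 | -8.70 | -16.8 | -16.8 | -20.6 |
| C | BasketballDrill | -8.05 | -8.48 | -8.60 | -15.3 | -13.8 | -15.6 | -22.2 | -21.0 | -22.6 |
| C | BQMall | -6.00 | -5.63 | -5.80 | -11.3 | -9.59 | -10.7 | -19.1 | -20.2 | -17.3 |
| D | BasketballPass | -7.32 | -6.60 | -6.78 | -11.5 | -13.0 | -14.7 | -20.5 | -17.9 | -16.9 |
| D | BlowingBubbles | -6.53 | -6.10 | -6.56 | -12.7 | -10.8 | -13.4 | -20.7 | -19.0 | -15.9 |
| E | FourPeople | -7.14 | -6.89 | -7.33 | -15.3 | -14.5 | -13.8 | -21.8 | -21.0 | -20.2 |
| E | Johnny | -5.32 | -5.26 | -6.27 | -14.5 | -14.9 | -15.9 | -22.8 | -23.1 | -22.2 |
| AVERAGE | | -5.68 | -5.38 | -5.66 | -12.3 | -11.6 | -12.4 | -19.5 | -18.9 | -19.0 |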

Table 2. Upper Bound of Bit Saving under the Proposed Framework in Terms of BD-BR (Unit: %)

3.2 Deep Learning based Intra Mode Derivation

Figure 3 illustrates the proposed architecture of the deep learning based intra mode derivation scheme, which includes two neural networks: a feature learning network and an intra mode derivation network. The former is used to extract high-dimensional features, and the latter aims to infer the optimal intra mode directly without RD cost checking. In particular, the hand-crafted and learned features are combined to exploit their individual benefits. The detailed hyperparameters of these two networks are listed in Tables 3 and 4.


Fig. 3. Architecture of deep learning based intra mode derivation.

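| # | Type | Kernel | Stride | Outputs | Activation |
|---|---|---|---|---|---|
| 1a | CNN | \(1\times 1\) | 1 | 64 | ReLU |
| 1b | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 2a | CNN | \(1\times 1\) | 1 | 64 | ReLU |
| 2b | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 3a | CNN | \(1\times 1\) | 1 | 64 | ReLU |
| 3b | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 4 | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 5 | CNN | \(3\times 3\) | 1 | 64 | ReLU |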

Table 3. Hyper-parameters of the Feature Learning Network

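| # | Type | Input Size | Nodes | Activation |
|---|---|---|---|---|
| 1 | FCN | 33792 + 73 | 2048 | ReLU |
| 2 | FCN | 2048 + 73 | 2048 | ReLU |
| 3 | FCN | 2048 + 73 | 2048 | ReLU |
| 4 | FCN | 2048 + 73 | 2048 | ReLU |
| 5 | FCN | 2048 + 73 | 67 | SoftMax |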

Table 4. Hyper-parameters of the Intra Mode Derivation Network

In the feature learning network, five convolutional layers are included, and each of the first three consists of two sub-layers placed in parallel. The kernel sizes are \(1\times 1\) and \(3\times 3\) in the convolutional layers. Rectified Linear Unit (ReLU) is employed as the activation function. The number of feature maps in each convolutional layer is 64. In the intra mode derivation network, five fully connected layers are included, and the number of nodes in each layer except the last one is 2048. In the last fully connected layer, the activation function is SoftMax, and the number of nodes is 67 to match the number of intra modes. Hand-crafted features are always included in the input of each fully connected layer, which is represented by

\begin{equation} {\bf I}_i^f = {\rm concat}\big({\bf O}^f_{i-1}, {\bf f}_0\big), \quad i \in [1,5], \tag{4} \end{equation}

where \({\bf I}_i^f\) is the input of the \(i^{th}\) fully connected layer, \({\bf O}^f_{i-1}\) is the output of the \((i-1)^{th}\) fully connected layer, \({\bf f}_0\) indicates the hand-crafted features, and \({\bf O}^f_0\) is the reshaped output vector of the feature learning network. To avoid overfitting, dropout is performed at each fully connected layer. The keep probability of dropout is set to 0.5 at the training stage and to 1.0 (i.e., no units are dropped) at the testing stage.

With the neighboring blocks and reference pixels, 73 hand-crafted features are collected, including 67 features from the gradient histogram, five features from the intra modes of neighboring blocks, and the QP value. The gradient histogram can be regarded as the probability of each candidate intra mode, and its detailed calculation can be found in [29]. Due to the high spatial correlation, the Up-Left (UL), Up (U), Up-Right (UR), Left (L) and Bottom-Left (BL) blocks provide their finally selected intra modes as five hand-crafted features, as shown in the module of hand-crafted feature collection in Figure 3. The QP value balances the reconstruction quality and the coding bits, i.e., a lower QP value indicates better reconstruction quality and more coding bits, and vice versa, which has an impact on the intra mode derivation. In addition, with the QP value as a feature, it is unnecessary to train different networks for different QP settings. At frame boundaries, the intra modes of unavailable neighbors are initially set as Planar mode.

Due to lossy video coding, the neighboring blocks and reference pixels are degraded, which may affect the hand-crafted features, especially the gradient histogram. Therefore, the learned features are also employed. As mentioned before, the coding block size is not fixed and can be flexibly partitioned from \(128 \times 128\) down to \(4 \times 4\), including non-square patterns. Accordingly, the number of reference pixels differs among coding blocks. For a luma coding block, the width and height belong to {4, 8, 16, 32, 64} [31]. It therefore seems that 25 networks would be required, which challenges computational and storage resources. Instead, it is expected that one single trained network can be applied to variable coding blocks. In the designed architecture, the multi-line reference pixels are collected and padded to a fixed size to adapt to variable coding blocks, where the fixed memory is allocated for the maximum available coding block \(64 \times 64\). For example, if the current block has a dimension of \(4\times 4\), padding is applied to the lines that exceed the current block size, so that 60 padded lines on the left and 60 padded lines on the top are always used. A matrix with a size of \((64 + 4 + 64)\times 4\) is fed to the feature learning network. Finally, the learned features with a size of \((64 + 4 + 64)\times 4\times 64 = 33792\) are produced and reshaped into a one-dimensional vector.
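The fixed-size input can be illustrated with a few lines of arithmetic. The sketch below rests on two assumptions that the text does not state explicitly: that the factor 4 in \((64 + 4 + 64)\times 4\) is the number of reference lines, and that the padding per side generalizes from the \(4\times 4\) example as 64 minus the block height (left) and 64 minus the block width (top). All names are illustrative.

```python
from itertools import product

SIDES = (4, 8, 16, 32, 64)  # allowed luma block widths and heights
MAX_SIDE = 64               # largest block the fixed buffer is sized for
REF_LINES = 4               # assumption: the "4" in (64 + 4 + 64) x 4
FEATURE_MAPS = 64           # outputs of the last convolutional layer


def padding_for_block(width: int, height: int) -> tuple:
    """Padded lines (left, top) needed to reach the fixed input size.

    Assumption: generalizes the 4x4 example (60 left, 60 top) in the text.
    """
    if width not in SIDES or height not in SIDES:
        raise ValueError("unsupported luma block size")
    return MAX_SIDE - height, MAX_SIDE - width


def network_input_shape() -> tuple:
    """Shape of the matrix fed to the feature learning network."""
    return MAX_SIDE + REF_LINES + MAX_SIDE, REF_LINES


def learned_feature_length() -> int:
    """Length of the flattened output of the feature learning network."""
    rows, cols = network_input_shape()
    return rows * cols * FEATURE_MAPS


block_shapes = list(product(SIDES, repeat=2))
print(len(block_shapes))          # 25 block shapes share one network
print(padding_for_block(4, 4))    # (60, 60)
print(learned_feature_length())   # 33792
```

Because the input shape no longer depends on the block size, all 25 width and height combinations can share the same trained weights.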

In addition, the number of FLoating-point OPerations (FLOPs) [26] is used to evaluate the complexity of the neural networks. For a convolutional layer,

\begin{equation} FLOPs = 2H \times W \times (C_{in}\times K^2 + 1) \times C_{out}, \tag{5} \end{equation}

where \(H\), \(W\) and \(C_{in}\) are the height, width and number of channels of the input feature map, \(K\) is the kernel size, and \(C_{out}\) is the number of output channels. The values of FLOPs in the convolutional layers are \(1.3\times 10^5\), \(6.7\times 10^5\), \(8.7\times 10^6\), \(7.7\times 10^7\), \(8.7\times 10^6\), \(7.7\times 10^7\), \(7.7\times 10^7\), and \(3.9\times 10^7\), respectively. For a fully connected layer,

\begin{equation} FLOPs = (2I - 1)\times O, \tag{6} \end{equation}

where \(I\) is the input dimensionality and \(O\) is the output dimensionality. The values of FLOPs in the fully connected layers are \(1.3\times 10^8\), \(8.6\times 10^6\), \(8.6\times 10^6\), \(8.6\times 10^6\), and \(2.8\times 10^5\), respectively.
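Equations (5) and (6) can be evaluated directly, as in the sketch below. The input size \(132 \times 4\) and the fully connected layer sizes come from the text and Table 4. The input channel counts of the convolutional layers are not listed in Table 3; the values used here (1 for the first layer, 128 where two parallel 64-channel outputs would be concatenated, 64 for the last layer) are an assumption chosen because they reproduce the reported magnitudes.

```python
def conv_flops(h: int, w: int, c_in: int, k: int, c_out: int) -> int:
    """FLOPs of a convolutional layer, Equation (5)."""
    return 2 * h * w * (c_in * k * k + 1) * c_out


def fc_flops(inputs: int, outputs: int) -> int:
    """FLOPs of a fully connected layer, Equation (6)."""
    return (2 * inputs - 1) * outputs


H, W, C_OUT = 132, 4, 64  # (64 + 4 + 64) x 4 input, 64 feature maps

# (layer, assumed input channels, kernel size); channel counts are inferred.
conv_layers = [("1a", 1, 1), ("1b", 1, 3), ("2a", 128, 1), ("2b", 128, 3),
               ("3a", 128, 1), ("3b", 128, 3), ("4", 128, 3), ("5", 64, 3)]

# (layer, input size, nodes), following Table 4.
fc_layers = [("1", 33792 + 73, 2048), ("2", 2048 + 73, 2048),
             ("3", 2048 + 73, 2048), ("4", 2048 + 73, 2048),
             ("5", 2048 + 73, 67)]

for name, c_in, k in conv_layers:
    print(f"conv {name}: {conv_flops(H, W, c_in, k, C_OUT):.2e}")
for name, inputs, outputs in fc_layers:
    print(f"fc {name}: {fc_flops(inputs, outputs):.2e}")
```

Under these assumptions the computed values agree with the reported ones up to the last quoted digit, for example 138,708,992 for the first fully connected layer against the quoted \(1.3\times 10^8\), which suggests that the quoted figures were truncated rather than rounded.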

3.3 Neural Network Training

The DIV2K database [37] with 900 images (the resolution ranges from \(2040\times 648\) to \(2040\times 2040\)) is used to generate the neural network training dataset. These images are all resized to \(2048 \times 1536\), converted from the RGB color space to the YCbCr color space, and packed as a sequence. Then, this pseudo sequence is encoded by VTM 5.0 with the default AI configuration to collect the training samples, where the QP values are set as {22, 27, 32, 37}. During the process of video coding, the hand-crafted features and source reference pixels of the current block are collected with the associated label (intra mode) regardless of the coding block size. According to Figure 3 shown in [51], the distribution of intra modes is uneven. The Planar, DC, horizontal, and vertical modes are selected more frequently than other intra modes. Such unbalanced data may cause the training of the multi-class classification network to fail. Therefore, the number of training samples for each label and QP is fixed as 50,000, so the total number is \(50000 \times 4 \times 67\). In total, the volume of training data reaches about 80 GB. In addition, for the purpose of validation, \(20000\times 67\) samples are selected from the training dataset, where the number of validation samples is 20,000 for each label.

In this work, the TensorFlow package is adopted for network training on an NVIDIA GeForce GTX 1080 Ti GPU with AdamOptimizer. The memory of the workstation is 112 GB, which is able to accommodate the training samples. For this multi-class classification task, the cross entropy is utilized as the loss function,

\begin{equation} L = -\frac{1}{N}\sum^N_{i=1} \sum^M_{j=1} \big\{ y_{j}^{i} \times \ln\big(x_{j}^{i}\big)\big\}, \tag{7} \end{equation}

where \(N\) is the number of training samples in a batch, \(M\) is the number of classes, i.e., \(M=67\), \(y_{j}^{i}\) is the ground truth of the \(i^{th}\) training sample, and \(x_{j}^{i}\) is the output of the intra mode derivation network after the softmax layer. It should be noted that the ground truth is represented in the one-hot manner. For example, if the intra mode is 4 for the \(i^{th}\) training sample, then \(y^i_4 = 1\) and \(y^i_j = 0\) \((j \ne 4)\), which can also be written as \({\bf y}^i = [y^i_1, y^i_2, \dots , y^i_j, \dots , y^i_{67}] = [0, 0, 0, 1, 0, \dots , 0]\). The batch size and the number of epochs are set as 1024 and 1000, respectively. The initial learning rate \(r_0\) is \(1\times 10^{-4}\), and it is updated after each epoch as \(r_0 \times 0.999^i\), where \(i\) is the index of the training epoch.
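The loss of Equation (7) and the learning rate schedule can be written out in plain Python as follows. This is a framework-free sketch rather than the TensorFlow training code: the function names are illustrative, and the small constant `eps` that guards against \(\ln(0)\) is an added assumption not mentioned in the text.

```python
import math

NUM_MODES = 67  # M in Equation (7)


def one_hot(mode: int, num_classes: int = NUM_MODES) -> list:
    """One-hot label; classes are indexed from 1 as in the text's example."""
    label = [0.0] * num_classes
    label[mode - 1] = 1.0
    return label


def cross_entropy(targets: list, probs: list, eps: float = 1e-12) -> float:
    """Batch-averaged cross entropy of Equation (7).

    `eps` (an assumption) avoids taking the logarithm of zero.
    """
    total = 0.0
    for y, x in zip(targets, probs):
        total -= sum(yj * math.log(max(xj, eps)) for yj, xj in zip(y, x))
    return total / len(targets)


def learning_rate(epoch: int, r0: float = 1e-4, decay: float = 0.999) -> float:
    """Learning rate after `epoch` epochs: r0 * 0.999 ** epoch."""
    return r0 * decay ** epoch


# A network that outputs a uniform distribution gives a loss of ln(67).
uniform = [1.0 / NUM_MODES] * NUM_MODES
loss = cross_entropy([one_hot(4)], [uniform])
print(f"{loss:.4f}", f"{math.log(NUM_MODES):.4f}")
print(learning_rate(0), learning_rate(1000))
```

The uniform-output case gives a convenient reference point: a training loss well below \(\ln(67)\) indicates that the network has learned more than chance-level mode prediction.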


4 EXPERIMENTAL RESULTS AND ANALYSES

4.1 Coding Performance Comparison

The experiments are conducted on the VTM 5.0 platform following the default AI configuration and the Common Test Conditions (CTC) [5]. The workstation used for video coding is equipped with an Intel Core i7-4790 CPU @ 2.60 GHz and runs the Windows 7 Enterprise 64-bit operating system. The original VTM 5.0 is regarded as the anchor for coding performance comparison, which is evaluated by BD-BR. Twenty-two sequences with various contents and resolutions, different from the training dataset, are utilized in the experiments.

Table 5 illustrates the values of BPM under the proposed method. Two sequences of each class are utilized for this experiment, which are identical to those in Table 1. During intra coding, the total bits of intra modes and the number of intra blocks are collected for BPM calculation when the QP values are set as {22, 27, 32, 37}. Compared with those shown in Table 1, the average values of BPM are changed from 3.35, 3.48, 3.44, and 3.39 to 2.38, 2.35, 2.22, and 2.20 under the four QP settings, respectively. In general, the coding bit saving of intra mode can be calculated as follows regardless of residue and other information,

\begin{equation} \eta^{\prime} = \frac{\alpha - \alpha^{\prime}}{\alpha} \times 100\%, \tag{8} \end{equation}

where \(\alpha\) is the original value of BPM and \(\alpha^{\prime}\) is the current value of BPM. Accordingly, the bit saving of intra mode can reach 30.4%, 33.2%, 36.2%, and 35.3% on average under the four QP settings, respectively.

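| Class | Sequence | BPM \(\alpha^{\prime}\), QP = 22 | BPM \(\alpha^{\prime}\), QP = 27 | BPM \(\alpha^{\prime}\), QP = 32 | BPM \(\alpha^{\prime}\), QP = 37 | Saving \(\eta^{\prime}\), QP = 22 | Saving \(\eta^{\prime}\), QP = 27 | Saving \(\eta^{\prime}\), QP = 32 | Saving \(\eta^{\prime}\), QP = 37 |
|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | 1.02 | 1.46 | 1.44 | 1.71 | 53.2% | 49.7% | 50.7% | 42.4% |
| A | FoodMarket4 | 1.33 | 1.50 | 1.53 | 1.82 | 45.7% | 42.5% | 41.8% | 32.8% |
| B | BasketballDrive | 1.99 | 1.83 | 1.65 | 1.69 | 32.3% | 37.3% | 43.7% | 40.7% |
| B | BQTerrace | 2.76 | 2.53 | 2.36 | 2.25 | 18.8% | 29.1% | 33.1% | 34.6% |
| C | BQMall | 2.84 | 2.78 | 2.59 | 2.40 | 26.0% | 28.0% | 30.9% | 34.2% |
| C | BasketballDrill | 2.73 | 2.60 | 2.49 | 2.58 | 22.4% | 26.8% | 30.4% | 31.4% |
| D | BlowingBubbles | 3.09 | 2.92 | 2.71 | 2.45 | 28.8% | 31.3% | 35.0% | 35.9% |
| D | BasketballPass | 3.16 | 3.02 | 2.78 | 2.57 | 17.9% | 25.2% | 29.3% | 30.0% |
| E | FourPeople | 2.51 | 2.40 | 2.31 | 2.23 | 30.1% | 33.7% | 35.3% | 36.8% |
| E | Johnny | 2.41 | 2.45 | 2.34 | 2.32 | 28.5% | 28.6% | 31.8% | 33.7% |
| AVERAGE | | 2.38 | 2.35 | 2.22 | 2.20 | 30.4% | 33.2% | 36.2% | 35.3% |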

Table 5. Coding Bits Per Intra Mode under the Proposed Method
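The entries of Table 5 can be cross-checked against Table 1 with Equation (8). The sketch below does so for the QP = 22 column; the function name and the data layout are illustrative, and the BPM values are the rounded figures printed in the two tables.

```python
def mode_bit_saving(alpha: float, alpha_new: float) -> float:
    """Coding bit saving of intra mode in percent, Equation (8)."""
    return (alpha - alpha_new) / alpha * 100.0


# (sequence, original BPM from Table 1, current BPM from Table 5), QP = 22.
qp22 = [("Tango2", 2.18, 1.02), ("FoodMarket4", 2.45, 1.33),
        ("BasketballDrive", 2.94, 1.99), ("BQTerrace", 3.40, 2.76),
        ("BQMall", 3.84, 2.84), ("BasketballDrill", 3.52, 2.73),
        ("BlowingBubbles", 4.34, 3.09), ("BasketballPass", 3.85, 3.16),
        ("FourPeople", 3.59, 2.51), ("Johnny", 3.37, 2.41)]

savings = [mode_bit_saving(a, a_new) for _, a, a_new in qp22]
mean_of_savings = sum(savings) / len(savings)
saving_of_means = mode_bit_saving(3.35, 2.38)

print(f"Tango2: {savings[0]:.1f}%")                    # 53.2%
print(f"mean of per-sequence savings: {mean_of_savings:.1f}%")
print(f"saving from average BPMs: {saving_of_means:.1f}%")
```

The comparison indicates that the AVERAGE row of Table 5 is the mean of the per-sequence savings (about 30.4% at QP = 22), which differs slightly from applying Equation (8) to the average BPM values (about 29.0%).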

Three state-of-the-art works are adopted for coding performance comparison. Narsallah’s scheme [29] derives the intra mode from the gradient histogram, selecting the mode with the highest probability. Abdoli’s scheme [1] produces a new intra prediction result by weighting the two intra modes with the highest probability in the gradient histogram. Li’s scheme [22] re-constructs the MPM list with short and long range correlations. These three works optimize intra coding from different directions but are all related to the proposed method, so their coding performance can be compared directly. The comparison is illustrated in Table 6.

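| Class | Sequence | Nasrallah’s [29] Y | Nasrallah’s [29] U | Nasrallah’s [29] V | Abdoli’s [1] Y | Abdoli’s [1] U | Abdoli’s [1] V | Li’s [22] Y | Li’s [22] U | Li’s [22] V | Proposed Y | Proposed U | Proposed V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1 | Tango2 | –0.20 | –0.41 | 0.01 | –0.69 | –0.91 | –0.16 | –0.08 | –0.04 | 0.13 | –2.50 | –2.77 | –2.20 |
| | FoodMarket4 | –0.36 | 0.06 | –0.44 | –0.81 | –0.33 | –0.68 | –0.02 | –0.14 | 0.00 | –2.65 | –1.66 | –1.14 |
| | Campfire | –0.18 | –0.16 | –0.34 | –0.42 | –0.20 | –0.12 | –0.08 | 0.00 | –0.19 | –2.38 | –1.14 | –1.98 |
| A2 | CatRobot1 | –0.13 | –0.29 | –0.33 | –0.33 | –0.08 | –0.36 | –0.07 | –0.08 | –0.01 | –2.69 | –1.90 | –2.44 |
| | DaylightRoad2 | 0.02 | –0.29 | 0.01 | –0.29 | –0.32 | –0.10 | –0.26 | –0.13 | –0.15 | –2.70 | –2.66 | –2.63 |
| | ParkRunning3 | –0.04 | –0.02 | –0.11 | –0.27 | –0.25 | –0.29 | –0.05 | –0.07 | –0.08 | –1.04 | –0.85 | –0.88 |
| B | MarketPlace | –0.12 | –0.08 | –0.09 | –0.35 | 0.00 | –0.43 | –0.10 | –0.23 | –0.20 | –2.29 | –1.51 | –1.61 |
| | RitualDance | –0.46 | –0.41 | –0.33 | –0.65 | –0.32 | –0.31 | –0.02 | –0.23 | –0.20 | –1.81 | –1.67 | –1.55 |
| | Cactus | –0.06 | 0.09 | 0.02 | –0.35 | 0.03 | –0.21 | –0.07 | –0.18 | –0.06 | –2.49 | –1.17 | –3.41 |
| | BasketballDrive | –0.20 | –0.67 | –0.16 | –0.67 | –0.82 | –0.25 | –0.10 | –0.61 | –0.35 | –2.42 | –2.68 | –2.04 |
| | BQTerrace | 0.00 | –0.28 | –0.11 | –0.27 | –0.37 | –0.26 | –0.15 | –0.31 | –0.41 | –1.88 | –1.77 | –2.40 |
| C | BasketballDrill | 0.15 | 0.04 | 0.80 | –0.27 | 0.14 | 0.89 | –0.31 | –0.01 | –0.69 | –2.29 | 1.18 | –2.62 |
| | BQMall | –0.37 | –0.50 | 0.14 | –0.52 | –0.27 | –0.37 | –0.01 | 0.40 | –0.01 | –2.89 | –1.93 | –1.36 |
| | PartyScene | –0.18 | –0.11 | –0.16 | –0.35 | –0.20 | –0.29 | –0.16 | 0.02 | –0.03 | –1.93 | –2.45 | –0.71 |
| | RaceHorsesC | –0.16 | –0.16 | –0.39 | –0.49 | –0.16 | –0.11 | 0.01 | 0.08 | –0.08 | –1.78 | –0.89 | –2.01 |
| D | BasketballPass | –0.20 | –0.14 | –0.02 | –0.41 | –0.01 | –0.38 | 0.04 | –0.50 | –0.59 | –1.67 | –2.31 | –4.17 |
| | BQSquare | –0.33 | –0.27 | –0.02 | –0.23 | –0.08 | –0.24 | –0.01 | 0.14 | 0.04 | –1.92 | –0.13 | –1.73 |
| | BlowingBubbles | –0.27 | –0.68 | –0.95 | –0.69 | –0.53 | –0.96 | 0.03 | –0.55 | –0.34 | –2.09 | –1.36 | –1.67 |
| | RaceHorses | –0.25 | 0.29 | 0.38 | –0.43 | –0.65 | –0.38 | –0.04 | –0.03 | 0.69 | –1.84 | –2.48 | –1.42 |
| E | FourPeople | –0.41 | –0.52 | –0.38 | –0.55 | –0.77 | 0.04 | 0.03 | –0.17 | 0.00 | –3.21 | –2.61 | –2.28 |
| | Johnny | –0.25 | –0.84 | –0.39 | –0.45 | –1.01 | –0.54 | 0.04 | –0.16 | –0.07 | –2.31 | –3.71 | –5.30 |
| | KristenAndSara | –0.27 | –0.32 | 0.12 | –0.48 | –0.36 | –0.47 | 0.02 | –0.43 | 0.07 | –3.31 | –1.79 | –2.44 |
| AVERAGE | | –0.19 | –0.26 | –0.12 | –0.45 | –0.34 | –0.27 | –0.06 | –0.15 | –0.12 | –2.28 | –1.74 | –2.18 |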

Table 6. Performance Comparison in Terms of BD-BR with QPs {22, 27, 32, 37} (Unit: %)

Nasrallah’s scheme [29] reduces the bit rate by 0.19%, 0.26%, and 0.12% on average for the Y, U, and V components, respectively. Abdoli’s scheme [1] saves 0.45%, 0.34%, and 0.27% for the Y, U, and V components. Li’s scheme [22] achieves 0.06%, 0.15%, and 0.12% bit rate reduction on average for the luma and two chroma components, respectively. The proposed method reaches 2.28%, 1.74%, and 2.18% on average for the luma and two chroma components, respectively, and thus outperforms the other three methods. Compared with Nasrallah’s scheme [29], the proposed method not only adopts the existing hand-crafted features but also learns features in a high-dimensional space for intra mode derivation.

In addition, the test sequences are encoded under a small QP setting {11, 16, 21, 26} and a large QP setting {33, 38, 43, 48} to evaluate the proposed method. It should be noted that the neural network is not re-trained. The coding performance is shown in Table 7. The bit rate reductions reach 0.71% and 3.64% for the luma component under the small and large QP settings, respectively. Compared with the results in Table 6, the gain under the normal QP setting (2.28% for luma) is lower than that under the large QP setting and higher than that under the small QP setting. The reason is that the share of intra mode bits in a frame grows as the QP value increases, and shrinks as it decreases. Consequently, the proposed method improves compression efficiency most in the low bit rate scenario.

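| Class | Sequence | Small QPs {11, 16, 21, 26} Y | Small QPs U | Small QPs V | Large QPs {33, 38, 43, 48} Y | Large QPs U | Large QPs V |
|---|---|---|---|---|---|---|---|
| A1 | Tango2 | –0.76 | –0.32 | 0.11 | –3.68 | –4.01 | –3.88 |
| | FoodMarket4 | –0.31 | –0.77 | –0.60 | –3.24 | –2.89 | –3.28 |
| | Campfire | –1.00 | –0.66 | –0.67 | –4.37 | –2.63 | –3.27 |
| A2 | CatRobot1 | –0.79 | –0.25 | –0.42 | –4.37 | –2.63 | –3.27 |
| | DaylightRoad2 | –0.44 | –0.18 | –0.03 | –5.03 | –4.61 | –5.28 |
| | ParkRunning3 | –0.26 | –0.26 | –0.23 | –2.33 | –1.60 | –1.72 |
| B | MarketPlace | –0.58 | –0.57 | –0.04 | –2.97 | –5.29 | 0.23 |
| | RitualDance | –0.29 | –0.58 | –1.14 | –4.87 | –6.58 | –4.68 |
| | Cactus | –0.60 | –0.49 | –0.46 | –4.15 | –1.81 | –3.41 |
| | BasketballDrive | –0.60 | 0.01 | –0.86 | –3.65 | –4.40 | –4.16 |
| | BQTerrace | –0.63 | –0.36 | –0.47 | –3.70 | –3.57 | –6.81 |
| C | BasketballDrill | –0.87 | –1.63 | –1.20 | –3.03 | –5.43 | 0.93 |
| | BQMall | –1.00 | –0.59 | –0.86 | –4.16 | –4.52 | –5.75 |
| | PartyScene | –0.84 | –0.51 | –0.58 | –3.71 | –0.98 | –6.77 |
| | RaceHorsesC | –0.70 | –0.55 | –0.57 | –3.35 | –2.35 | –4.11 |
| D | BasketballPass | –0.66 | 0.53 | –1.87 | –1.91 | –6.51 | –3.23 |
| | BQSquare | –0.89 | –0.79 | –1.62 | –3.84 | –10.4 | –9.82 |
| | BlowingBubbles | –0.76 | –1.34 | –0.59 | –2.88 | –0.62 | 1.48 |
| | RaceHorses | –0.89 | –0.79 | –1.62 | –3.84 | –10.4 | –9.82 |
| E | FourPeople | –0.87 | –0.71 | –0.97 | –2.96 | –2.23 | –4.26 |
| | Johnny | –1.21 | –1.08 | –0.89 | –4.20 | –3.56 | –4.80 |
| | KristenAndSara | –0.62 | –0.64 | –1.15 | –4.36 | –3.06 | –2.66 |
| AVERAGE | | –0.71 | –0.57 | –0.76 | –3.64 | –4.15 | –4.00 |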

Table 7. Performance Evaluation in Terms of BD-BR with Different QP Settings (Unit: %)

Figure 4 shows the coding blocks that select the proposed DLIMD in the first frames of six sequences, namely BasketballPass (\(416\times 240\)), BQSquare (\(416\times 240\)), BQMall (\(832\times 480\)), BasketballDrill (\(832\times 480\)), FourPeople (\(1280\times 720\)), and Johnny (\(1280\times 720\)), where the QP value is 22. The selected blocks are marked in five colors according to their size. If the size is smaller than \(8\times 8\), the block is marked in red; if the size is greater than \(8\times 8\) and smaller than \(16\times 16\), the block is marked in green; if the size is greater than \(16\times 16\) and smaller than \(32\times 32\), the block is marked in blue; if the size is greater than \(32\times 32\) and smaller than \(64\times 64\), the block is marked in black; otherwise, the block is marked in white. It can be clearly observed that many coding blocks select the proposed DLIMD. Moreover, the quantitative results for different sequences and QP settings are presented in Table 8. The percentage of DLIMD selection is the ratio of the selected area to the whole frame, (9) \(\begin{equation} \Omega = \frac{\sum _{i=1}^{N}C_i \times w_i \times h_i}{\sum _{i=1}^{N}w_i \times h_i} \times 100\%, \end{equation}\) where \(N\) is the number of coding blocks in a frame, \(C_i\) indicates the DLIMD selection (\(C_i = 1\) if the current coding block selects DLIMD and \(C_i = 0\) otherwise), and \(w_i\) and \(h_i\) are the width and height of the current coding block. According to Table 8, the percentage reaches 42.9%, 45.2%, 48.5%, and 50.5% on average under the four QP settings, respectively. This indicates that a large share of each frame benefits from the proposed method, which explains the coding gains.

In addition, the selected blocks under the four QP settings are re-organized according to block size. There are 17 available block sizes for these sequences, i.e., \(4\times 4\), \(4\times 8\), \(4\times 16\), \(4\times 32\), \(8\times 4\), \(8\times 8\), \(8\times 16\), \(8\times 32\), \(16\times 4\), \(16\times 8\), \(16\times 16\), \(16\times 32\), \(32\times 4\), \(32\times 8\), \(32\times 16\), \(32\times 32\), and \(64\times 64\). For each block size, the ratio of the number of selected blocks to the total number of blocks is calculated, as shown in Figure 5. The ratio ranges from 43.0% to 54.5%.
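The area-weighted percentage in Equation (9) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the block tuple layout, the function name `dlimd_selection_percentage`, and the toy frame are all assumptions introduced here.

```python
from typing import Iterable, Tuple

# (width, height, selects_dlimd); the tuple layout is an assumption.
Block = Tuple[int, int, bool]


def dlimd_selection_percentage(blocks: Iterable[Block]) -> float:
    """Equation (9): selected area divided by total area, in percent."""
    selected_area = 0
    total_area = 0
    for width, height, selected in blocks:
        area = width * height
        total_area += area
        if selected:  # C_i = 1
            selected_area += area
    if total_area == 0:
        raise ValueError("frame contains no coding blocks")
    return selected_area / total_area * 100.0


# Hypothetical toy frame, not data from the paper.
toy_frame = [
    (64, 64, False),
    (32, 32, True),
    (16, 8, True),
    (8, 8, False),
    (4, 4, True),
]
omega = dlimd_selection_percentage(toy_frame)
print(f"{omega:.1f}%")
```

Because the measure is weighted by area, one selected \(32\times 32\) block counts as much as 64 selected \(4\times 4\) blocks. The per-size ratio in Figure 5, by contrast, counts blocks rather than area.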

Fig. 4. DLIMD selected in a frame. (They are resized to the same resolution for visualization.)

Fig. 5. Percentage of selected blocks according to block size.

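| Class | Sequence | QP = 22 | QP = 27 | QP = 32 | QP = 37 |
|---|---|---|---|---|---|
| A1 | Tango2 | 39.9 | 41.3 | 44.3 | 46.5 |
| | FoodMarket4 | 38.5 | 37.6 | 41.4 | 44.2 |
| | Campfire | 48.0 | 48.6 | 49.8 | 53.5 |
| A2 | CatRobot1 | 41.3 | 47.3 | 50.5 | 50.6 |
| | DaylightRoad2 | 45.4 | 50.9 | 52.8 | 52.4 |
| | ParkRunning3 | 43.6 | 46.5 | 47.2 | 47.9 |
| B | MarketPlace | 41.3 | 44.5 | 47.1 | 48.9 |
| | RitualDance | 43.8 | 45.9 | 50.2 | 52.2 |
| | Cactus | 40.8 | 45.1 | 48.1 | 50.7 |
| | BasketballDrive | 43.4 | 46.1 | 50.6 | 52.6 |
| | BQTerrace | 41.7 | 48.4 | 51.7 | 54.4 |
| C | BasketballDrill | 57.8 | 60.3 | 59.3 | 56.1 |
| | BQMall | 43.3 | 44.9 | 48.5 | 50.5 |
| | PartyScene | 41.8 | 45.5 | 47.8 | 52.5 |
| | RaceHorsesC | 39.8 | 43.5 | 45.4 | 49.9 |
| D | BasketballPass | 45.5 | 40.8 | 48.8 | 51.3 |
| | BQSquare | 39.6 | 40.6 | 45.5 | 49.8 |
| | BlowingBubbles | 41.5 | 42.8 | 47.7 | 48.7 |
| | RaceHorses | 38.8 | 39.7 | 45.6 | 50.3 |
| E | FourPeople | 43.2 | 45.5 | 47.5 | 48.2 |
| | Johnny | 42.3 | 45.0 | 49.2 | 50.0 |
| | KristenAndSara | 42.2 | 43.3 | 47.4 | 49.4 |
| AVERAGE | | 42.9 | 45.2 | 48.5 | 50.5 |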

Table 8. Percentage of the Proposed Method Selection (Unit: %)

4.2 Influence of Learned and Hand-crafted Features

The individual influence of the learned features and the hand-crafted features in the proposed architecture (shown in Figure 3) is analyzed. Four cases are considered: (1) H: the feature learning module is removed and only the hand-crafted features are used for intra mode derivation; (2) L: the hand-crafted features are removed and only the learned features are used; (3) H’+L: both the hand-crafted features (excluding the gradient histogram) and the learned features are used; and (4) H+L: both the hand-crafted features and the learned features are used.

With the same training samples described in Section 3.3, three more neural networks are trained separately for the listed cases. The training process is the same as that in Section 3.3, where an NVIDIA GeForce 1080 Ti GPU with AdamOptimizer is adopted and the loss function is cross entropy. Figure 6 compares the four cases in terms of training loss and validation classification accuracy. The classification accuracy is calculated as (10) \(\begin{equation} P = \frac{1}{N}\sum _{i=1}^N \delta _i \times 100\%, \end{equation}\) where \(N\) is the number of testing samples, and \(\delta _i=1\) if the difference between the predicted label and the ground truth does not exceed a pre-defined threshold \(\Delta\), i.e., \(\Vert \texttt {argmax}({\bf y}^i) - \texttt {argmax}({\bf O}^f_5)\Vert \le \Delta\); otherwise \(\delta _i=0\). Here, argmax() returns the position of the maximum value in a vector, \({\bf y}^i\) is the ground truth in one-hot representation, and \({\bf O}^f_5\) is the output of the intra mode derivation network. The value of \(\Delta\) is set to 0. The cases with only hand-crafted or only learned features both converge at a cross entropy loss of around 3.8 and achieve about 25% validation classification accuracy. The case of H’+L converges at about 3.6 and achieves about 30%, while the combination of hand-crafted and learned features converges at about 3.5 and achieves about 35%.

Fig. 6. Comparison of four cases with hand-crafted and learned features.
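The tolerance-based accuracy in Equation (10) can be sketched as below. This is a minimal illustration under stated assumptions: the helper names `argmax` and `tolerant_accuracy` are introduced here, and the five-class toy vectors are hypothetical and much smaller than the intra mode set used in the paper.

```python
from typing import Sequence


def argmax(values: Sequence[float]) -> int:
    """Position of the maximum value in a vector."""
    return max(range(len(values)), key=lambda k: values[k])


def tolerant_accuracy(
    one_hot_labels: Sequence[Sequence[float]],
    network_outputs: Sequence[Sequence[float]],
    delta: int = 0,
) -> float:
    """Equation (10): share of samples whose predicted mode index is
    within `delta` of the ground-truth index, in percent."""
    if not one_hot_labels or len(one_hot_labels) != len(network_outputs):
        raise ValueError("need equally many labels and outputs")
    hits = sum(
        1
        for y, o in zip(one_hot_labels, network_outputs)
        if abs(argmax(y) - argmax(o)) <= delta
    )
    return hits / len(one_hot_labels) * 100.0


# Hypothetical toy data: ground-truth indices 0, 2, 4.
labels = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1]]
# Network outputs whose argmax positions are 0, 3, 1.
outputs = [
    [0.7, 0.1, 0.1, 0.05, 0.05],
    [0.1, 0.1, 0.3, 0.4, 0.1],
    [0.1, 0.5, 0.1, 0.1, 0.2],
]
for d in (0, 1, 3):
    print(d, round(tolerant_accuracy(labels, outputs, d), 1))
```

With \(\Delta = 0\) the measure reduces to ordinary top-1 accuracy. Larger values of \(\Delta\) also count predictions whose mode index is close to the ground truth.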

These results clearly show that combining hand-crafted and learned features achieves the best performance among the four cases. The reason is that, although a CNN is able to extract high-level features and latent representations, the hand-crafted features still provide useful information and compensate for the limitations of the learned features. For example, the intra modes of spatial neighbors, which cannot be learned by the feature learning network, play an important role in intra mode derivation.

4.3 Ablation Study of Architecture

In addition, the impact of the modules in the network architecture is analyzed. An alternative network is designed, as illustrated in Figure 7. It differs from the proposed network in Figure 3 as follows: the convolutional layers are placed in a serial manner; the number of feature maps in the first three layers is 128, which matches the input of convolutional layers 2a, 2b, 3a, and 3b in Figure 3; the kernel sizes of the first and third convolutional layers are set to \(1\times 1\) and those of the others to \(3\times 3\); and the hand-crafted features are combined only with the first fully connected layer.

Fig. 7. Alternative network. (H: hand-crafted feature, L: learned feature).

Three configurations are compared: Case A uses the feature learning network and the intra mode derivation network in Figure 3; Case B uses the feature learning network in Figure 3 and the intra mode derivation network in Figure 7; Case C uses the feature learning network in Figure 7 and the intra mode derivation network in Figure 3. Case A is the proposed one. Two more networks, for Cases B and C, are trained with the same samples. The results are compared in terms of multi-class classification accuracy. Two test sequences from each class defined in the CTC [5] are employed, i.e., BasketballPass (\(416\times 240\)), BlowingBubbles (\(416\times 240\)), BQMall (\(832\times 480\)), BasketballDrill (\(832\times 480\)), FourPeople (\(1280\times 720\)), Johnny (\(1280\times 720\)), BasketballDrive (\(1920\times 1080\)), BQTerrace (\(1920\times 1080\)), Tango2 (\(3840\times 2160\)), and FoodMarket4 (\(3840\times 2160\)). These sequences are all encoded by VTM 5.0 with the default AI configuration, where the QP values are set to {22, 27, 32, 37}. The testing samples are collected during encoding. For each sequence, \(800\times 67\times 4\) samples are selected under the four QP settings. Table 9 lists the experimental results, where the classification accuracy is calculated by Equation (10). Under the condition of \(\Delta = 0\), the multi-class classification accuracies are 34.8%, 31.4%, and 32.6% on average for Cases A, B, and C, respectively. As such, we can conclude that feeding the hand-crafted features into each fully connected layer and placing convolutional layers with different kernel sizes in parallel both improve performance.

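| Class | Sequence | Case A (proposed) \(\Delta\) = 0 | Case A \(\Delta\) = 1 | Case A \(\Delta\) = 3 | Case A \(\Delta\) = 5 | Case B \(\Delta\) = 0 | Case B \(\Delta\) = 1 | Case B \(\Delta\) = 3 | Case B \(\Delta\) = 5 | Case C \(\Delta\) = 0 | Case C \(\Delta\) = 1 | Case C \(\Delta\) = 3 | Case C \(\Delta\) = 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | 31.8 | 45.3 | 59.8 | 68.0 | 28.4 | 41.6 | 56.7 | 65.3 | 29.6 | 43.3 | 58.1 | 67.1 |
| | FoodMarket4 | 34.5 | 50.9 | 67.0 | 75.3 | 31.1 | 47.8 | 64.2 | 73.3 | 31.8 | 49.1 | 65.9 | 74.4 |
| B | BasketballDrive | 37.2 | 52.6 | 65.6 | 72.1 | 34.0 | 49.6 | 62.5 | 69.5 | 34.5 | 49.3 | 63.1 | 70.2 |
| | BQTerrace | 32.8 | 47.3 | 58.6 | 64.9 | 28.7 | 43.0 | 54.4 | 61.3 | 31.0 | 46.1 | 58.0 | 64.7 |
| C | BQMall | 35.6 | 50.9 | 62.3 | 67.5 | 32.9 | 47.4 | 58.2 | 64.1 | 33.9 | 50.0 | 60.9 | 66.8 |
| | BasketballDrill | 37.3 | 55.0 | 66.9 | 72.4 | 33.8 | 52.4 | 64.6 | 70.0 | 35.4 | 54.0 | 66.9 | 72.5 |
| D | BlowingBubbles | 32.1 | 47.2 | 61.1 | 67.8 | 29.2 | 44.1 | 58.2 | 65.7 | 30.2 | 46.0 | 60.8 | 68.3 |
| | BasketballPass | 34.7 | 50.8 | 63.4 | 69.3 | 32.0 | 48.0 | 59.5 | 65.8 | 32.5 | 49.9 | 62.2 | 68.2 |
| E | FourPeople | 34.1 | 49.3 | 61.5 | 67.7 | 29.8 | 44.8 | 56.8 | 62.9 | 32.2 | 47.5 | 59.7 | 66.1 |
| | Johnny | 37.6 | 54.7 | 68.4 | 75.1 | 34.2 | 51.5 | 65.5 | 72.7 | 35.2 | 52.5 | 66.6 | 73.4 |
| AVERAGE | | 34.8 | 50.4 | 63.5 | 70.0 | 31.4 | 47.0 | 60.1 | 67.1 | 32.6 | 48.7 | 62.2 | 69.1 |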

Table 9. Multi-class Classification Accuracy (Unit: %)

In addition, the normalized confusion matrices of Case A are illustrated in Figure 8. The horizontal axis is the predicted label and the vertical axis is the ground truth. It can be observed that the difference between the ground truth and the predicted label is limited. For the proposed network (Case A), the average classification accuracies under \(\Delta = \lbrace 0, 1, 3, 5\rbrace\) are 34.8%, 50.4%, 63.5%, and 70.0% in Table 9, respectively. Although the predicted label differs from the ground truth under the conditions of \(\Delta = \lbrace 1, 3, 5\rbrace\), the intra prediction results may be similar, and RDO is performed to balance the distortion and coding bits during video coding. Therefore, coding gains can still be achieved when the predicted label differs only slightly from the ground truth.

Fig. 8. Confusion matrix of multi-class classification under Case A.

4.4 Computational Complexity Analyses

Additionally, the coding/decoding time of the video codec equipped with DLIMD is compared with that of the anchor, which is calculated by (11) \(\begin{equation} \Delta T_m = \frac{1}{4}\sum _{i=1}^{4}{\frac{T_{\Psi }^m(QP_i)}{T_{c}^m(QP_i)}}, \end{equation}\) where \(T_{c}^m(QP_i)\) is the coding/decoding time of the anchor under \(QP_i\), \(T_{\Psi }^m(QP_i)\) is the coding/decoding time of the video codec equipped with the proposed method under \(QP_i\), and \(m \in\) {coding, decoding}. Compared with the anchor, the computational complexity of the proposed method is on average 33.6 times for coding and 140.3 times for decoding on the CPU+GPU platform, and 231.3 times for coding and 604.8 times for decoding on the CPU platform. The computational complexity is therefore a great challenge. In the video codec, DLIMD is performed on variable coding blocks, and the convolutional and fully connected operations in the neural network result in high complexity.
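Equation (11) averages the per-QP time ratios rather than dividing the total times. A minimal sketch follows; the function name `complexity_ratio` and the timing values are hypothetical and are not measurements from the paper.

```python
from typing import Sequence


def complexity_ratio(
    proposed_times: Sequence[float], anchor_times: Sequence[float]
) -> float:
    """Equation (11): mean over the QP settings of the ratio between the
    time of the codec with DLIMD and the time of the anchor."""
    if not anchor_times or len(proposed_times) != len(anchor_times):
        raise ValueError("need one time per QP for both codecs")
    ratios = [p / a for p, a in zip(proposed_times, anchor_times)]
    return sum(ratios) / len(ratios)


# Hypothetical encoding times in seconds for QP = 22, 27, 32, 37.
anchor = [100.0, 80.0, 60.0, 40.0]
with_dlimd = [3400.0, 2720.0, 2040.0, 1360.0]
print(complexity_ratio(with_dlimd, anchor))
```

Averaging the ratios gives each QP setting equal weight, so the slower low-QP runs do not dominate the result.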

For the other deep learning based schemes [12, 38] that focus on optimizing intra prediction, the computational complexity with respect to the anchor is 9.87 times and 87.4 times at the encoder side, and 151.7 times and 124.5 times at the decoder side, respectively. The former and latter schemes, with 1.92% and 3.4% bit rate reductions for the luma component, are run on the CPU and CPU+GPU platforms, respectively. For the conventional schemes [1, 22, 29], whose compression efficiencies are compared in Table 6, the encoding complexity is 109%, 111%, and 101% and the decoding complexity is 104%, 105%, and 100% with respect to the anchor. The computational complexity of deep learning based schemes, including the proposed one, is thus much higher than that of conventional schemes.

Generally, strategies for accelerating deep learning based schemes include SIMD optimization, neural network quantization, and parameter/layer pruning. The first two strategies require support from hardware devices. Therefore, the third one is adopted to investigate the trade-off between computational complexity and compression efficiency. One more architecture (denoted as DLIMD-L) is designed by reducing the parameters: the output of the last layer in the feature learning network is changed from 64 to 16, and the number of nodes of the hidden layers, except the last one, in the intra mode derivation network is changed from 2048 to 128. According to the definition of FLOPs [26], the computational complexity is largely reduced. DLIMD-L is trained with the same training dataset as DLIMD. The coding experiments are performed on the CPU+GPU platform and the results are presented in Table 10. These sequences are all encoded with the default AI configuration, where the QP values are set to {22, 27, 32, 37}. The BD-BR of DLIMD-L is \(-\)1.17% on average for the luma component, which is worse than that of DLIMD (\(-\)2.39% on the same sequences). The encoding and decoding complexity of DLIMD-L is 27.9 times and 40.8 times that of the anchor (VTM 5.0), respectively. Relative to DLIMD (34.7 times and 145.8 times in Table 10), the encoding and decoding complexity ratios are reduced by 6.8 and 105.0, respectively. Although the computational complexity is still high, we believe that it can be optimized in the future.

| Class | Sequence | DLIMD BDBR Y (%) | DLIMD BDBR U (%) | DLIMD BDBR V (%) | DLIMD Encode | DLIMD Decode | DLIMD-L BDBR Y (%) | DLIMD-L BDBR U (%) | DLIMD-L BDBR V (%) | DLIMD-L Encode | DLIMD-L Decode |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | –2.50 | –2.77 | –2.20 | \(36.8\times\) | \(144.7\times\) | –1.71 | –2.53 | –1.30 | \(28.8\times\) | \(38.1\times\) |
| | FoodMarket4 | –2.65 | –1.66 | –1.14 | \(25.8\times\) | \(134.9\times\) | –1.92 | –1.08 | –0.39 | \(19.9\times\) | \(39.1\times\) |
| B | BasketballDrive | –2.42 | –2.68 | –2.04 | \(36.6\times\) | \(135.3\times\) | –1.55 | –1.70 | –2.27 | \(29.0\times\) | \(35.3\times\) |
| | BQTerrace | –1.88 | –1.77 | –2.40 | \(37.0\times\) | \(167.9\times\) | –0.73 | –0.97 | –0.33 | \(30.3\times\) | \(36.2\times\) |
| C | BQMall | –2.89 | –1.93 | –1.36 | \(33.9\times\) | \(138.2\times\) | –2.11 | –0.36 | –1.14 | \(27.2\times\) | \(56.3\times\) |
| | BasketballDrill | –2.29 | 1.18 | –2.62 | \(33.4\times\) | \(176.9\times\) | 1.40 | 0.97 | –0.26 | \(27.4\times\) | \(37.1\times\) |
| D | BlowingBubbles | –2.09 | –1.36 | –1.67 | \(30.4\times\) | \(136.8\times\) | –1.32 | –0.35 | –1.77 | \(25.9\times\) | \(38.6\times\) |
| | BasketballPass | –1.67 | –2.31 | –4.17 | \(32.2\times\) | \(155.8\times\) | –0.46 | 1.77 | –1.65 | \(26.7\times\) | \(43.0\times\) |
| E | FourPeople | –3.21 | –2.61 | –2.28 | \(42.9\times\) | \(136.5\times\) | –2.44 | –2.81 | –1.54 | \(34.1\times\) | \(46.1\times\) |
| | Johnny | –2.31 | –3.71 | –5.30 | \(38.1\times\) | \(131.2\times\) | –0.81 | –0.73 | –3.52 | \(30.4\times\) | \(38.2\times\) |
| AVERAGE | | –2.39 | –1.96 | –2.52 | \(34.7\times\) | \(145.8\times\) | –1.17 | –0.78 | –1.42 | \(27.9\times\) | \(40.8\times\) |

Table 10. Trade-off between Computational Complexity and Compression Efficiency on the Platform of CPU+GPU
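To give a rough sense of why the DLIMD-L parameter reduction lowers complexity, the sketch below counts multiply-accumulate operations (MACs) per layer. It is a back-of-the-envelope illustration, not the FLOPs definition of [26]. It assumes a fully connected layer whose input and output widths both shrink from 2048 to 128, and it ignores biases and the hand-crafted features concatenated to each layer. The spatial size, input channel count, and kernel size of the convolutional layer are hypothetical.

```python
def fc_macs(inputs: int, outputs: int) -> int:
    """Multiply-accumulate count of one fully connected layer (no bias)."""
    return inputs * outputs


def conv_macs(height: int, width: int, c_in: int, c_out: int, kernel: int) -> int:
    """Multiply-accumulate count of one convolutional layer (no bias),
    assuming the output keeps the same spatial size as the input."""
    return height * width * c_in * c_out * kernel * kernel


# Hidden-to-hidden fully connected layer: 2048 nodes versus 128 nodes.
fc_ratio = fc_macs(2048, 2048) // fc_macs(128, 128)

# Last feature learning layer: 64 versus 16 output feature maps.
# The 8x8 spatial size, 128 input maps, and 3x3 kernel are assumptions.
conv_ratio = conv_macs(8, 8, 128, 64, 3) // conv_macs(8, 8, 128, 16, 3)

print(fc_ratio, conv_ratio)  # 256 4
```

Under these assumptions, such a fully connected layer becomes 256 times cheaper and the last convolutional layer 4 times cheaper. The measured decoding speed-up in Table 10 is smaller than these per-layer ratios.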

4.5 Coding Performance under the Latest VVC Test Model and Other Configurations

In addition, the proposed method is evaluated on the latest VVC test model, i.e., VTM 16.0, in which DLIMD has been implemented. Besides the AI configuration, the coding experiments are also conducted under the Low Delay P (LDP) and Random Access (RA) configurations. It should be noted that the neural network trained as described in Section 3.3 is used without any change.

The experimental results are shown in Table 11, where the original VTM 16.0 is used as the anchor to calculate BD-BR. The proposed method achieves 1.91%, 0.87%, and 1.15% bit rate reductions for the Y component under the AI, LDP, and RA configurations, respectively. The coding gains are slightly lower than those under VTM 5.0 shown in Table 6. The reasons are that the neural network is not re-trained and that intra coding has been further optimized from VTM 5.0 to VTM 16.0.

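| Class | Sequence | AI Y | AI U | AI V | LDP Y | LDP U | LDP V | RA Y | RA U | RA V |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | Tango2 | –2.48 | –0.33 | –1.60 | –0.83 | –0.81 | –0.46 | –1.18 | –0.96 | –0.10 |
| | FoodMarket4 | –1.87 | –2.17 | –1.36 | –0.63 | –1.43 | 0.05 | –0.94 | –0.38 | –1.03 |
| | Campfire | –2.37 | –1.49 | –2.72 | –1.18 | –0.95 | –0.71 | –1.82 | –1.47 | –1.32 |
| A2 | CatRobot1 | –2.54 | –1.53 | –2.14 | –1.36 | –1.95 | –1.62 | –1.53 | –1.72 | –1.32 |
| | DaylightRoad2 | –2.48 | –2.06 | –2.86 | –1.84 | –2.55 | –2.29 | –2.30 | –1.69 | –2.02 |
| | ParkRunning3 | –1.07 | –0.78 | –0.37 | –0.41 | –0.41 | –0.43 | –0.48 | –0.36 | –0.35 |
| B | MarketPlace | –1.88 | –1.31 | –1.99 | –0.60 | –0.55 | 0.41 | –1.11 | –1.61 | 0.88 |
| | RitualDance | –3.48 | –4.15 | –3.44 | –0.72 | –0.69 | –1.20 | –1.04 | –0.86 | –1.53 |
| | Cactus | –2.27 | –1.61 | –1.05 | –1.08 | –0.94 | –0.37 | –1.63 | –1.35 | –2.01 |
| | BasketballDrive | –2.35 | –1.94 | –2.38 | –0.78 | –1.39 | –0.24 | –1.10 | –0.29 | –1.06 |
| | BQTerrace | –1.74 | –1.63 | –0.80 | –0.78 | –1.46 | –1.23 | –1.30 | –1.12 | –0.97 |
| C | BasketballDrill | –1.31 | –2.29 | –1.05 | –0.41 | –1.03 | 0.06 | –1.22 | –2.00 | –1.41 |
| | BQMall | –2.11 | –0.34 | –2.58 | –1.14 | –2.15 | –1.16 | –1.34 | 0.04 | –0.12 |
| | PartyScene | –1.50 | –1.10 | –2.58 | –0.77 | –0.57 | –1.27 | –0.90 | –1.08 | –0.52 |
| | RaceHorsesC | –1.47 | 0.18 | –0.42 | –0.25 | –1.02 | –0.67 | –0.65 | –1.23 | 0.11 |
| D | BasketballPass | –0.47 | –1.94 | –3.35 | –0.25 | –1.25 | 0.70 | 0.18 | –2.36 | –0.67 |
| | BQSquare | –1.26 | –2.13 | 0.27 | –0.55 | –2.42 | –3.05 | –0.74 | –0.07 | –0.96 |
| | BlowingBubbles | –1.11 | –1.05 | 0.20 | –0.32 | –0.39 | 0.71 | –1.03 | –0.21 | –0.78 |
| | RaceHorses | –1.12 | –1.04 | 0.81 | –0.25 | –1.01 | 0.40 | –0.38 | 2.08 | 0.16 |
| E | FourPeople | –3.02 | –1.73 | –2.68 | –2.04 | –2.66 | –2.93 | –2.14 | –1.28 | –2.18 |
| | Johnny | –1.88 | –3.26 | –2.15 | –1.23 | –0.35 | –2.26 | –1.06 | –1.87 | –0.71 |
| | KristenAndSara | –2.22 | –4.38 | –2.66 | –1.64 | 0.44 | –2.44 | –1.50 | –1.16 | –1.95 |
| AVERAGE | | –1.91 | –1.73 | –1.68 | –0.87 | –1.16 | –0.91 | –1.15 | –0.95 | –0.90 |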

Table 11. Coding Performance in Terms of BD-BR on the Latest Platform of VTM 16.0 under AI, LDP, RA Configurations (Unit: %)

5 CONCLUSIONS

In this paper, a deep learning based intra mode derivation method is presented, which skips intra mode signaling to save coding bits. Instead of signaling the optimal intra mode selected from the candidate list, the derivation is cast as a multi-class classification task, moving the problem from signal processing to artificial intelligence. The architecture is designed to adapt to variable coding blocks and different QP settings with one single model. In particular, the hand-crafted and learned features are combined to compensate for their individual limitations. Rate-distortion optimization is performed between the proposed method and the traditional method, and a flag is signaled to indicate the selected scheme. Compared with the state-of-the-art works, the proposed method achieves significant coding gains.

REFERENCES

  1. [1] Abdoli Mohsen, Guionnet Thomas, Raulet Mickael, Kulupana Gosala, and Blasi Saverio. 2020. Decoder-side intra mode derivation for next generation video coding. In 2020 IEEE International Conference on Multimedia and Expo (ICME). 16.Google ScholarGoogle ScholarCross RefCross Ref
  2. [2] Abdoli Mohsen, Henry Félix, Brault Patrice, Duhamel Pierre, and Dufaux Frédéric. 2018. Short-distance intra prediction of screen content in versatile video coding (VVC). IEEE Signal Processing Letters 25, 11 (2018), 16901694.Google ScholarGoogle ScholarCross RefCross Ref
  3. [3] Bjontegaard Gisle. 2001. Calculation of Average PSNR Differences between RD Curves. ITU-T Video Coding Experts Group, VCEG-M33.Google ScholarGoogle Scholar
  4. [4] Blasi Saverio G., Mrak Marta, and Izquierdo Ebroul. 2015. Frequency-domain intra prediction analysis and processing for high-quality video coding. IEEE Transactions on Circuits and Systems for Video Technology 25, 5 (2015), 798811.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. [5] Bossen Frank, Boyce Jill, Suehring Karsten, Li Xiang, and Seregin Vadim. 2019. JVET Common Test Conditions and Software Reference Configurations for SDR Video. Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, JVET-N1010-v1.Google ScholarGoogle Scholar
  6. [6] Brand Fabian, Seiler Jürgen, and Kaup André. 2021. Intra-frame coding using a conditional autoencoder. IEEE Journal of Selected Topics in Signal Processing 15, 2 (2021), 354365.Google ScholarGoogle ScholarCross RefCross Ref
  7. [7] Bross Benjamin, Chen Jianle, Ohm Jens-Rainer, Sullivan Gary J., and Wang Ye-Kui. 2021. Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC). Proc. IEEE 109, 9 (2021), 14631493.Google ScholarGoogle ScholarCross RefCross Ref
  8. [8] Cai Xun and Lim Jae S.. 2013. Algorithms for transform selection in multiple-transform video compression. IEEE Transactions on Image Processing 22, 12 (2013), 53955407.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. [9] Chang Yao-Jen, Jhu Hong-Jheng, Jiang Hui-Yu, Zhao Liang, Zhao Xin, Li Xiang, Liu Shan, Bross Benjamin, Keydel Paul, Schwarz Heiko, Marpe Detlev, and Wiegand Thomas. 2019. Multiple reference line coding for most probable modes in intra prediction. In 2019 Data Compression Conference (DCC). 559559.Google ScholarGoogle ScholarCross RefCross Ref
  10. [10] Chen Haoming, Zhang Tao, Sun Ming-Ting, Saxena Ankur, and Budagavi Madhukar. 2016. Improving intra prediction in high-efficiency video coding. IEEE Transactions on Image Processing 25, 8 (2016), 36713682.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. [11] Chen Jie, Hou Junhui, and Chau Lap-Pui. 2018. Light field compression with disparity-guided sparse coding based on structural key views. IEEE Transactions on Image Processing 27, 1 (2018), 314324.Google ScholarGoogle ScholarCross RefCross Ref
  12. [12] Dumas Thierry, Galpin Franck, and Bordes Philippe. 2021. Iterative training of neural networks for intra prediction. IEEE Transactions on Image Processing 30 (2021), 697711.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. [13] Dumas Thierry, Roumy Aline, and Guillemot Christine. 2020. Context-adaptive neural network-based prediction for image compression. IEEE Transactions on Image Processing 29 (2020), 679693.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. [14] François Edouard, Fogg Chad, He Yuwen, Li Xiang, Luthra Ajay, and Segall Andrew. 2016. High dynamic range and wide color gamut video coding in HEVC: Status and potential future enhancements. IEEE Transactions on Circuits and Systems for Video Technology 26, 1 (2016), 6375.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. [15] Gao Han, Chen Xu, Esenlik Semih, Chen Jianle, and Steinbach Eckehard. 2021. Decoder-side motion vector refinement in VVC: Algorithm and hardware implementation considerations. IEEE Transactions on Circuits and Systems for Video Technology 31, 8 (2021), 31973211.Google ScholarGoogle ScholarCross RefCross Ref
  16. [16] Hu Yueyu, Yang Wenhan, Li Mading, and Liu Jiaying. 2019. Progressive spatial recurrent neural network for intra prediction. IEEE Transactions on Multimedia 21, 12 (2019), 30243037.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. [17] Huang Yu-Wen, Hsu Chih-Wei, Chen Ching-Yeh, Chuang Tzu-Der, Hsiang Shih-Ta, Chen Chun-Chia, Chiang Man-Shu, Lai Chen-Yen, Tsai Chia-Ming, Su Yu-Chi, Lin Zhi-Yi, Hsiao Yu-Ling, Chubach Olena, Lin Yu-Cheng, and Lei Shaw-Min. 2020. A VVC proposal with quaternary tree plus binary-ternary tree coding block structure and advanced coding techniques. IEEE Transactions on Circuits and Systems for Video Technology 30, 5 (2020), 13111325.Google ScholarGoogle ScholarCross RefCross Ref
  18. [18] Jiang Minqiang, Li Shanxi, Ling Nam, Zheng Jianhua, and Zhang Philipp. 2018. On derivation of most probable modes for intra prediction in video coding. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS). 14.Google ScholarGoogle Scholar
  19. [19] Lainema Jani, Bossen Frank, Han Woo-Jin, Min Junghye, and Ugur Kemal. 2012. Intra coding of the HEVC standard. IEEE Transactions on Circuits and Systems for Video Technology 22, 12 (2012), 17921801.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. [20] Li Congrui, Zhao Zhenghui, Li Junru, Zhang Xiang, Ma Siwei, and Li Chen. 2019. Bi-intra prediction for versatile video coding. In 2019 Data Compression Conference (DCC). 587587.Google ScholarGoogle ScholarCross RefCross Ref
  21. [21] Li Jiahao, Li Bin, Xu Jizheng, and Xiong Ruiqin. 2018. Efficient multiple-line-based intra prediction for HEVC. IEEE Transactions on Circuits and Systems for Video Technology 28, 4 (2018), 947957.Google ScholarGoogle ScholarCross RefCross Ref
  22. [22] Li Junru, Wang Meng, Zhang Li, Zhang Kai, Liu Hongbin, Wang Shiqi, Ma Siwei, and Gao Wen. 2020. Unified intra mode coding based on short and long range correlations. IEEE Transactions on Image Processing 29 (2020), 72457260.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. [23] Li Yue, Yi Yan, Liu Dong, Li Li, Li Zhu, and Li Houqiang. 2021. Neural-network-based cross-channel intra prediction. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 3, Article 77 (Jul. 2021), 23 pages.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. [24] Ma Di, Zhang Fan, and Bull David. 2022. BVI-DVC: A training database for deep video compression. IEEE Transactions on Multimedia 24 (2022), 38473858.Google ScholarGoogle ScholarCross RefCross Ref
  25. [25] Ma Siwei, Zhang Xinfeng, Jia Chuanmin, Zhao Zhenghui, Wang Shiqi, and Wang Shanshe. 2020. Image and video compression with neural networks: A review. IEEE Transactions on Circuits and Systems for Video Technology 30, 6 (2020), 16831698.Google ScholarGoogle ScholarCross RefCross Ref
  26. [26] Molchanov Pavlo, Tyree Stephen, Karras Tero, Aila Timo, and Kautz Jan. 2017. Pruning convolutional neural networks for resource efficient inference. In 5th International Conference on Learning Representations (ICLR). 117.Google ScholarGoogle Scholar
  27. [27] Mora Elie Gabriel, Jung Joel, Cagnazzo Marco, and Pesquet-Popescu Béatrice. 2014. Depth video coding based on intra mode inheritance from texture. APSIPA Transactions on Signal and Information Processing 3 (2014), 113.Google ScholarGoogle ScholarCross RefCross Ref
  28. [28] Müller Karsten, Schwarz Heiko, Marpe Detlev, Bartnik Christian, Bosse Sebastian, Brust Heribert, Hinz Tobias, Lakshman Haricharan, Merkle Philipp, Rhee Franz Hunn, Tech Gerhard, Winken Martin, and Wiegand Thomas. 2013. 3D high-efficiency video coding for multi-view video and depth data. IEEE Transactions on Image Processing 22, 9 (2013), 33663378.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. [29] Nasrallah Anthony, Abdoli Mohsen, Mora Elie Gabriel, Guionnet Thomas, and Raulet Mickael. 2019. Decoder-side intra mode derivation with texture analysis in VVC test model. In 2019 IEEE International Conference on Image Processing (ICIP). 31533157.Google ScholarGoogle ScholarCross RefCross Ref
  30. [30] Nasrallah Anthony, Mora Elie, Guionnet Thomas, and Raulet Mickael. 2019. Decoder-side intra mode derivation based on a histogram of gradients in versatile video coding. In 2019 Data Compression Conference (DCC). 597597.Google ScholarGoogle ScholarCross RefCross Ref
  31. [31] Pfaff Jonathan, Filippov Alexey, Liu Shan, Zhao Xin, Chen Jianle, De-Luxán-Hernández Santiago, Wiegand Thomas, Rufitskiy Vasily, Ramasubramonian Adarsh Krishnan, and Auwera Geert Van der. 2021. Intra prediction and mode coding in VVC. IEEE Transactions on Circuits and Systems for Video Technology 31, 10 (2021), 38343847.Google ScholarGoogle ScholarCross RefCross Ref
  32. [32] Reuze Kevin, Hamidouche Wassim, Philippe Pierrick, and Deforges Olivier. 2019. Dynamic lists for efficient coding of intra prediction modes in the future video coding standard. In 2019 Data Compression Conference (DCC). 601601.Google ScholarGoogle ScholarCross RefCross Ref
  33. [33] Schwarz Sebastian, Preda Marius, Baroncini Vittorio, Budagavi Madhukar, Cesar Pablo, Chou Philip A., Cohen Robert A., Krivokuća Maja, Lasserre Sébastien, Li Zhu, Llach Joan, Mammou Khaled, Mekuria Rufael, Nakagami Ohji, Siahaan Ernestasia, Tabatabai Ali, Tourapis Alexis M., and Zakharchenko Vladyslav. 2019. Emerging MPEG standards for point cloud compression. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (2019), 133-148.
  34. [34] Schäfer Michael, Stallenberger Björn, Pfaff Jonathan, Helle Philipp, Schwarz Heiko, Marpe Detlev, and Wiegand Thomas. 2020. Efficient fixed-point implementation of matrix-based intra prediction. In 2020 IEEE International Conference on Image Processing (ICIP). 3364-3368.
  35. [35] Sullivan Gary J., Ohm Jens-Rainer, Han Woo-Jin, and Wiegand Thomas. 2012. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22, 12 (2012), 1649-1668.
  36. [36] Sun Heming, Cheng Zhengxue, Takeuchi Masaru, and Katto Jiro. 2020. Enhanced intra prediction for video coding by using multiple neural networks. IEEE Transactions on Multimedia 22, 11 (2020), 2764-2779.
  37. [37] Timofte Radu and Agustsson Eirikur. 2017. NTIRE 2017 challenge on single image super-resolution: Methods and results. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 1110-1121.
  38. [38] Wang Yang, Fan Xiaopeng, Liu Shaohui, Zhao Debin, and Gao Wen. 2020. Multi-scale convolutional neural network-based intra prediction for video coding. IEEE Transactions on Circuits and Systems for Video Technology 30, 7 (2020), 1803-1815.
  39. [39] Wiegand Thomas, Sullivan Gary J., Bjontegaard Gisle, and Luthra Ajay. 2003. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13, 7 (2003), 560-576.
  40. [40] Xiu Xiaoyu, He Yuwen, and Ye Yan. 2016. Decoder-side intra mode derivation for block-based video coding. In 2016 Picture Coding Symposium (PCS). 1-5.
  41. [41] Xu Xiaozhong, Cohen Robert, Vetro Anthony, and Sun Huifang. 2012. Predictive coding of intra prediction modes for high efficiency video coding. In 2012 Picture Coding Symposium. 457-460.
  42. [42] Xu Xiaozhong, Liu Shan, Chuang Tzu-Der, Huang Yu-Wen, Lei Shaw-Min, Rapaka Krishnakanth, Pang Chao, Seregin Vadim, Wang Ye-Kui, and Karczewicz Marta. 2016. Intra block copy in HEVC screen content coding extensions. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6, 4 (2016), 409-419.
  43. [43] Ye Yan, Boyce Jill M., and Hanhart Philippe. 2020. Omnidirectional 360° video coding technology in responses to the joint call for proposals on video compression with capability beyond HEVC. IEEE Transactions on Circuits and Systems for Video Technology 30, 5 (2020), 1241-1252.
  44. [44] Yoon Yong-Uk, Park Do-Hyeon, Kim Jae-Gon, Lee Jinho, and Kang Jung-Won. 2019. Most frequent mode for intra-mode coding in video coding. Electronics Letters 55, 4 (2019), 188-190.
  45. [45] Zhang Kai, Chen Jianle, Zhang Li, Li Xiang, and Karczewicz Marta. 2018. Enhanced cross-component linear model for chroma intra-prediction in video coding. IEEE Transactions on Image Processing 27, 8 (2018), 3983-3997.
  46. [46] Zhang Li, Zhang Kai, Liu Hongbin, Chuang Hsiao Chiang, Wang Yue, Xu Jizheng, Zhao Pengwei, and Hong Dingkun. 2019. History-based motion vector prediction in versatile video coding. In 2019 Data Compression Conference (DCC). 43-52.
  47. [47] Zhang Tao, Fan Xiaopeng, Zhao Debin, Xiong Ruiqin, and Gao Wen. 2018. Hybrid intraprediction based on local and nonlocal correlations. IEEE Transactions on Multimedia 20, 7 (2018), 1622-1635.
  48. [48] Zhang Yun, Kwong Sam, and Wang Shiqi. 2020. Machine learning based video coding optimizations: A survey. Information Sciences 506 (2020), 395-423.
  49. [49] Zheng Amin, Yuan Yuan, Zhou Jiantao, Guo Yuanfang, Yang Haitao, and Au Oscar C. 2016. Adaptive block coding order for intra prediction in HEVC. IEEE Transactions on Circuits and Systems for Video Technology 26, 11 (2016), 2152-2158.
  50. [50] Zhu Linwei, Kwong Sam, Zhang Yun, Wang Shiqi, and Wang Xu. 2020. Generative adversarial network-based intra prediction for video coding. IEEE Transactions on Multimedia 22, 1 (2020), 45-58.
  51. [51] Zhu Linwei, Zhang Yun, Li Na, Pi Jinyong, and Wu Xinju. 2020. Sparse representation-based intra prediction for lossless/near lossless video coding. In 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP). 164-167.
  52. [52] Zhu Linwei, Zhang Yun, Wang Shiqi, Kwong Sam, Jin Xin, and Qiao Yu. 2021. Deep learning-based chroma prediction for intra versatile video coding. IEEE Transactions on Circuits and Systems for Video Technology 31, 8 (2021), 3168-3181.

    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2s
      April 2023
      545 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3572861
      • Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 February 2023
      • Online AM: 16 September 2022
      • Accepted: 1 September 2022
      • Revised: 15 July 2022
      • Received: 4 April 2022