Abstract
Tensor networks, as an effective computing framework for the efficient processing and analysis of high-dimensional data, have been successfully applied in many fields. However, the performance of traditional tensor networks still cannot match the strong fitting ability of neural networks, so data processing algorithms based on tensor networks often fail to achieve the same excellent performance as deep learning models. To further improve the learning ability of tensor networks, we propose a quantized tensor neural network (QTNN) in this article, which integrates the advantages of neural networks and tensor networks, namely, the powerful learning ability of the former and the simplicity of the latter. The QTNN model can be regarded as a generalized multilayer nonlinear tensor network that efficiently extracts low-dimensional features of the data while maintaining the original structure information. In addition, to represent the local information of the data more effectively, we introduce multiple convolution layers into QTNN to extract local features. We also develop a high-order back-propagation algorithm for training the parameters of QTNN. We conducted classification experiments on multiple representative datasets to evaluate the performance of the proposed models, and the experimental results show that QTNN is simpler and more efficient when compared with classic deep learning models.
1 INTRODUCTION
In recent years, the rapid development of information technology, especially technologies such as cloud computing, big data, the Internet of Things, and artificial intelligence, has made smart devices such as mobile phones, PCs, and smart sensors ubiquitous, and the sources and quantity of data are growing at an unprecedented rate. In addition to the typical 4V characteristics of big data (Volume, Velocity, Variety, Value), diversified data sources have given big data further characteristics such as heterogeneity and high dimensionality, and the internal structure and relationships of these data are also more complicated [21]. Moreover, the data in modern applications increasingly exhibit multi-dimensional characteristics, which makes it difficult for traditional vector- or matrix-based machine learning algorithms to deal with such data.
Tensors, as a natural multidimensional generalization of vectors and matrices, are an effective tool for the representation and analysis of massive multi-dimensional data [12]. In recent years, tensor models have been successfully applied to many fields, such as chemistry, neuroscience, computer vision, text mining, clustering, and social networks [10, 25]. However, tensor volume can easily become prohibitively large as the number of dimensions increases, thus requiring enormous computational and memory resources to process such data. Feature extraction and dimension reduction of high-dimensional data, as an important task in the field of data mining, has led many scholars to propose methods based on low-rank tensor networks, which allow huge data tensors to be approximated (compressed) by compact low-rank core tensors, effectively alleviating the curse of dimensionality [7, 16]. In addition, multi-linear analysis methods for image classification based on tensor models are proposed in Reference [3], which model face images involving multiple factors as high-order tensors to extract compact low-dimensional features. Non-negative tensor decomposition models have also been proposed to mine potential patterns and components with stronger interpretability [18], with applications such as image classification [17] and social network mining [6]. Moreover, manifold algorithms for the dimension reduction of high-dimensional datasets based on tensor models have been developed in Reference [14]. By establishing potential correlation patterns among heterogeneous data from multiple sources, coupled matrix and tensor decomposition further realizes joint analysis of the data and effectively enhances knowledge discovery [1].
More generally, the above-mentioned methods rely heavily on the inherent internal structure of the original data and the global characteristics of the entire dataset (sample similarity), while ignoring the local structure information of the high-dimensional data, so it is difficult to further improve their learning ability.
Deep neural networks are widely used in various fields for their powerful learning and representation capabilities, and they can effectively capture the high-level abstract features behind the data through a deep network architecture. The most representative ones are convolutional neural networks, including GoogLeNet [20], VGG [19], ResNet [8], and so on, which have achieved great success in the field of computer vision. However, as the number of network layers continues to increase, the parameter scale of these networks grows ever larger, making it difficult to deploy the models on small embedded devices [9, 24]. In recent years, some tensor models have been successfully integrated into deep neural networks to improve the structure of the network. Convolutional neural networks were accelerated by using fine-tuned CP decomposition [13]. Tensor regression layers were also introduced into CNNs to reduce the number of network parameters while maintaining the multi-linear structure of the hidden-layer features [11]. To further reduce the storage requirements of neural network parameters, the tensor train model was applied by representing the parameters of the fully connected layer in TT format [15, 22]. Furthermore, research using tensors to explore the interpretability of neural network models is proposed in References [2, 5]. However, most of these existing methods use the low-rank structure of the tensor network to reduce the parameters of the neural network, instead of really improving the learning ability of the tensor network. To extend the neural network to high-dimensional tensor space, the tensor factorized neural network (TFNN) directly builds a multi-layer nonlinear tensor network by stacking multiple Tucker regression layers [4]. In addition, an unsupervised tensor-based deep computation model is proposed to learn features of heterogeneous big data in Reference [23].
In this article, we further propose a quantized tensor neural network (QTNN) to improve the learning ability of the tensor network, which effectively integrates the high-order convolution operations into the tensor network to learn the local features of high-dimensional data. First, we generalize the standard convolution operation to more general high-order scenarios and propose a high-order convolution operation to capture the local information of high-dimensional data. Then, the high-order tensors containing local information are further input into a multi-layer nonlinear tensor network for feature learning and classification. Moreover, we also develop a high-order error back-propagation algorithm based on tensor networks for parameter optimization. Finally, we conducted classification experiments on three real-world datasets to verify the performance of the proposed models, and the experimental results fully demonstrate that the QTNN model integrating neural networks and tensor networks has superior performance. The contributions of this article can be summarized as follows:
(1) Considering that traditional tensor networks have limited learning ability and find it difficult to learn the hidden features of data sufficiently, we propose a multi-layer nonlinear tensor network, which combines the idea of neural networks to learn the factor matrices of each layer by optimizing the objective function while maintaining the original structure of the data. We also theoretically demonstrate that it is equivalent to a traditional NN model, but more efficient.
(2) To further improve the classification performance of the tensor network, we integrate high-order convolution operations into the nonlinear tensor network to learn the local neighborhood features of the original data and propose a quantized tensor neural network (QTNN). Compared with classic CNN models, QTNN can learn the hidden features of data more efficiently and has fewer parameters.
(3) To further optimize the parameters in the network, we generalize the classical back-propagation algorithm to higher-order scenarios and propose a high-order error back-propagation algorithm based on the tensor neural network, which also realizes the transformation of traditional optimization algorithms from vector space to higher-order tensor space.
We organize the rest of this article as follows: In Section 2, we review some related preliminaries and notations. Some basic definitions are provided in Section 3, which are necessary for further constructing the tensor neural networks. Section 4 presents two tensor neural network models and a corresponding high-order error back-propagation algorithm. The complexity of the proposed models is analyzed in Section 5. After that, the experiments conducted on three representative datasets are provided in Section 6. Finally, we give the conclusion of this article in Section 7.
2 PRELIMINARIES
2.1 Tensor and Tensor Decomposition
Tensors, as natural multi-dimensional generalizations of vectors and matrices, have been successfully applied to several domains such as computer vision, neuroscience, and pattern recognition. In this section, we briefly review some basic knowledge of tensors, which is also a prerequisite for understanding tensor neural networks.
Tucker decomposition (also called higher-order SVD) provides a more general factorized format for high-dimensional tensors and can be regarded as the higher-order extension of the singular value decomposition. The Tucker model can be formulated as follows: (1) \[\begin{align} \mathcal {X}& \cong \sum _{r_{1} =1}^{R_{1}}\cdots \sum _{r_{N}=1}^{R_{N}}\mathcal {G}_{r_{1},\ldots ,r_{N}}a^{(1)}_{r_{1}}\circ a^{(2)}_{r_{2}}\cdots \circ a^{(N)}_{r_{N}}\nonumber \\ &=\mathcal {G}\times _{1}A^{(1)}\times _{2}A^{(2)}\cdots \times _{N}A^{(N)} , \end{align}\] where \(\mathcal {X}\in R^{I_{1}\times I_{2}{\cdots }\times I_{N}}\) is the original data tensor, \(\mathcal {G}\in R^{R_{1}\times R_{2}{\cdots }\times R_{N}}\) is the core tensor, and \(A^{(n)}=[a^{(n)}_{1},a^{(n)}_{2},\ldots ,a^{(n)}_{R_{n}}]\in R^{I_{n}\times R_{n}}, R_{n}\ll I_{n}\), is the mode-\(n\) factor matrix, which provides the principal components of mode \(n\), \(n=1,2,{\ldots },N\), while the potentially complex interaction patterns between the principal components of different modes are modeled by the core tensor.
Actually, the above Tucker decomposition can be further transformed into the following form to establish the connection between the input data and the compressed low-dimensional core tensor: \[ \mathcal {G}=\mathcal {X}\times _{1}A^{(1)\dagger }\times _{2}A^{(2)\dagger }\cdots \times _{N}A^{(N)\dagger }, \] where \(A^{(n)\dagger }=(A^{(n)^{T}}A^{(n)})^{-1}A^{(n)^{T}}, 1\le n\le N\), are the pseudo-inverses of the factor matrices. In other words, \(\mathcal {G}\) is the low-dimensional compressed feature tensor of the original data tensor \(\mathcal {X}\).
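To make the two formulas above concrete, here is a minimal NumPy sketch (with illustrative sizes and random factors, not the paper's data) that builds a tensor from a Tucker model via mode products and recovers the core tensor with the pseudo-inverse factor matrices:

```python
import numpy as np

def mode_product(X, A, n):
    # Mode-n product X ×_n A: contract mode n of X with the columns of A.
    return np.moveaxis(np.tensordot(A, X, axes=(1, n)), 0, n)

rng = np.random.default_rng(0)
I, R = (6, 7, 8), (2, 3, 4)                       # illustrative sizes, R_n << I_n
G = rng.standard_normal(R)                        # core tensor
A = [rng.standard_normal((I[n], R[n])) for n in range(3)]

# Reconstruct X = G ×_1 A^(1) ×_2 A^(2) ×_3 A^(3)  (Equation (1))
X = G
for n in range(3):
    X = mode_product(X, A[n], n)

# Recover the core: G = X ×_n A^(n)†  with pseudo-inverse factor matrices
G_rec = X
for n in range(3):
    G_rec = mode_product(G_rec, np.linalg.pinv(A[n]), n)
```

Since each random factor matrix has full column rank (almost surely), `G_rec` matches `G` up to floating-point error.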
Alternatively, the modular multiplication transformation from \(\mathcal {X}\) to \(\mathcal {G}\) can be interpreted as a convolution operation: each element of \(\mathcal {G}\) is obtained by the inner product of the original data \(\mathcal {X}\) and a rank-1 convolution kernel tensor. The total number of convolution kernels is \(R_{1}R_{2}\ldots R_{N}\), and they are all of the same order and dimension as the original data. Different from the convolution operation in the classical CNN model, which has a local receptive field, the high-order Tucker model has a global receptive field. For example, consider \[ \mathcal {G}=\mathcal {X}\times _{1}A^{(1)}\times _{2}A^{(2)}\cdots \times _{N}A^{(N)}, \] and examine each element of \(\mathcal {G}\); then we have the following expression: (2) \[\begin{align} \mathcal {G}_{r_{1},r_{2},\ldots ,r_{N}} &= \sum _{i_{1},i_{2},\ldots ,i_{N}}^{I_{1},I_{2},\ldots ,I_{N}}\mathcal {X}_{i_{1},i_{2}, \ldots ,i_{N}} A^{(1)}_{r_{1},i_{1}} A^{(2)}_{r_{2},i_{2}}\ldots A^{(N)}_{r_{N},i_{N}} \nonumber \\ & = \sum _{i_{1},i_{2},\ldots ,i_{N}}^{I_{1},I_{2},\ldots ,I_{N}}\mathcal {X}_{i_{1},i_{2},\ldots ,i_{N}}(\mathcal {A}_{r_{1},r_{2},\ldots ,r_{N}})_{i_{1},i_{2},\ldots ,i_{N}}\nonumber \\ & = \langle \mathcal {X},\mathcal {A}_{r_{1},r_{2},\ldots ,r_{N}}\rangle , \end{align}\] where \(\mathcal {A}_{r_{1},r_{2},\ldots ,r_{N}}= A^{(1)}_{r_{1},:}\circ A^{(2)}_{r_{2},:}\cdots \circ A^{(N)}_{r_{N},:} \in R^{I_{1}\times I_{2}{\cdots }\times I_{N}}\) is an \(N\)-order kernel tensor indirectly generated by the \(N\) factor matrices.
3 DEFINITIONS AND NOTATIONS
To effectively represent the local structure information of data, we define the following high-order quantization and convolution operations to extract the local neighborhood features of high-order tensors. We also introduce a nonlinear Tucker layer to extract the hidden features while maintaining the original structure of data. In fact, the operations defined below can be regarded as the high-order generalizations of operations defined in the standard CNN, and some examples are provided in Figure 1.
Fig. 1. Graphic representation of some defined operations.
(High-order Quantization).
Given an \(N\)-order tensor \(\mathcal {X}\in R^{I_{1}\times I_{2}{\cdots }\times I_{N}}\), and \(I_{n}=I_{n}^{^{\prime }}\times J_{n}, 1\le n \le N\), then \(\mathcal {X}\) can be represented as: \[ \mathcal {X}^{^{\prime }}\in R^{I_{1}^{^{\prime }}\times I_{2}^{^{\prime }}{\cdots }\times I_{N}^{^{\prime }}\times J_{1}\times J_{2}\cdots \times J_{N}}, \] where the first \(N\) orders of \(\mathcal {X}^{^{\prime }}\) are used to represent the local information of the original data, and the remaining \(N\) orders are used to indicate the number of data blocks divided in each dimension of the original data, and the original data is totally divided into \(J_{1} J_{2}\ldots J_{N}\) blocks with the size of \(I_{1}^{^{\prime }}\times I_{2}^{^{\prime }}{\cdots }\times I_{N}^{^{\prime }}\).
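A minimal NumPy sketch of this high-order quantization (sizes are illustrative): the reshape/transpose trick below splits a tensor into non-overlapping blocks, with the first \(N\) modes indexing within a block and the last \(N\) modes indexing the block grid, matching the definition above.

```python
import numpy as np

def quantize(X, block):
    # Split an N-order tensor into non-overlapping blocks of shape `block`.
    # Result: first N modes index within a block, last N modes index the block grid.
    N = X.ndim
    J = [X.shape[n] // block[n] for n in range(N)]
    # Split each mode I_n into (J_n, I'_n), then move all block-grid modes last.
    shape = [d for n in range(N) for d in (J[n], block[n])]
    Xr = X.reshape(shape)
    inner = [2 * n + 1 for n in range(N)]   # the within-block (I'_n) axes
    grid = [2 * n for n in range(N)]        # the block-grid (J_n) axes
    return Xr.transpose(inner + grid)

X = np.arange(24).reshape(4, 6)
Xq = quantize(X, (2, 3))        # blocks of size 2×3, a 2×2 grid of blocks
```

Here `Xq[:, :, j1, j2]` is exactly the `(j1, j2)`-th 2×3 block of `X`.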
(Generalized High-order Quantization).
Given an \(N\)-order tensor \(\mathcal {X}\in R^{I_{1}\times I_{2}\cdots \times I_{N}}\), and fix the size of each data block as \(I_{1}^{^{\prime }} \times I_{2}^{^{\prime }}\cdots \times I_{N}^{^{\prime }}\), the strides are \(\lbrace s_{1}, s_{2},\ldots ,s_{N}\rbrace\), then there are total \(J_{1}^{^{\prime }} J_{2}^{^{\prime }}\ldots J_{N}^{^{\prime }}\) intersecting blocks, \(J_{n}^{^{\prime }}=\lceil \frac{I_{n}-I_{n}^{^{\prime }}}{s_{n}}\rceil +1, 1\le n \le N\), and \(\mathcal {X}\) can be further represented as: \[ \mathcal {X}^{^{\prime }}\in R^{I_{1}^{^{\prime }}\times I_{2}^{^{\prime }}\cdots \times I_{N}^{^{\prime }}\times J_{1}^{^{\prime }}\times J_{2}^{^{\prime }}\cdots \times J_{N}^{^{\prime }}}, \] where the first \(N\) orders represent size of each block, and the last \(N\) orders represent the number of data blocks divided in each dimension of the original data.
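Assuming NumPy ≥ 1.20 (for `sliding_window_view`), the generalized quantization with strides can be sketched as follows; note that, as written, it keeps only fully contained blocks, i.e., \(\lfloor (I_{n}-I_{n}^{\prime })/s_{n}\rfloor +1\) blocks per mode, with no padding to reach the ceiling in the definition:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def quantize_strided(X, block, strides):
    # Overlapping blocks of shape `block`, taken every `strides` steps per mode.
    W = sliding_window_view(X, block)      # grid modes first, block modes last
    W = W[tuple(slice(None, None, s) for s in strides)]
    N = X.ndim
    # Put the block modes first to match X' ∈ R^{I'_1×…×I'_N×J'_1×…×J'_N}
    return W.transpose(list(range(N, 2 * N)) + list(range(N)))

X = np.arange(25).reshape(5, 5)
Xq = quantize_strided(X, (3, 3), (2, 2))   # J'_n = (5-3)/2 + 1 = 2 per mode
```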
(High-order Convolution Operation).
Given an \(N\)-order tensor \(\mathcal {X}\in R^{I_{1}\times I_{2}\cdots \times I_{N}}\) and an \(N\)-order convolution kernel tensor \(\mathcal {W}\) with the size of \(k_{1}\times k_{2}\cdots \times k_{N}\), where the strides and padding sizes are \(\lbrace s_{1}, s_{2},\ldots ,s_{N}\rbrace\) and \(\lbrace p_{1}, p_{2},\ldots ,p_{N}\rbrace\), respectively, the convolution of the two tensors is defined as follows: \[ \mathcal {F}=\mathcal {W}\ast \mathcal {X}, \] where the feature tensor \(\mathcal {F} \in R^{J_{1}\times J_{2}\cdots \times J_{N}}\), \(J_{n}=\lceil \frac{I_{n}+2p_{n}-k_{n}}{s_{n}}\rceil +1,1\le n\le N\), and each element of the feature tensor aggregates a \(k_{1}\times k_{2}\cdots \times k_{N}\) spatial neighborhood of the original tensor.
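The high-order convolution can be sketched in NumPy as follows (cross-correlation without kernel flipping, as in standard CNN layers; `sliding_window_view` requires NumPy ≥ 1.20, and only full windows are taken, so the output size here is the floor rather than the ceiling of the formula above):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def highorder_conv(X, W, strides, padding):
    # F = W * X for an N-order tensor and an N-order kernel.
    N = X.ndim
    Xp = np.pad(X, [(p, p) for p in padding])
    patches = sliding_window_view(Xp, W.shape)   # grid modes, then window modes
    patches = patches[tuple(slice(None, None, s) for s in strides)]
    # Inner product of every patch with the kernel tensor.
    return np.tensordot(patches, W,
                        axes=(list(range(N, 2 * N)), list(range(N))))

X = np.arange(16, dtype=float).reshape(4, 4)
W = np.ones((2, 2)) / 4.0                        # illustrative 2×2 mean kernel
F = highorder_conv(X, W, strides=(1, 1), padding=(0, 0))
```

For this input, `F[0, 0]` is the mean of the top-left 2×2 patch of `X`.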
(High-order Pooling Operation).
Given an \(N\)-order feature tensor \(\mathcal {F} \in R^{J_{1}\times J_{2}\cdots \times J_{N}}\), and the pooling stride along each order is \(s=\lbrace s_{1}, s_{2},\ldots ,s_{N}\rbrace\), then the pooling operation of \(\mathcal {F}\) is defined as follows: \[ \mathcal {F}^{^{\prime }}=Pooling(\mathcal {F},s), \] where the feature tensor \(\mathcal {F}^{^{\prime }} \in R^{K_{1}\times K_{2}\cdots \times K_{N}}\), \(K_{n}= J_{n}/s_{n},1\le n\le N\).
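A minimal sketch of the high-order pooling operation with non-overlapping windows (max pooling by default; the choice of `np.max` over `np.mean` is an illustrative assumption, as the definition above does not fix the pooling function):

```python
import numpy as np

def highorder_pool(F, s, op=np.max):
    # Non-overlapping pooling with window/stride s_n along each mode.
    N = F.ndim
    shape = [d for n in range(N) for d in (F.shape[n] // s[n], s[n])]
    Fr = F.reshape(shape)
    # Reduce over the within-window axes, keeping the K_n = J_n/s_n grid axes.
    return op(Fr, axis=tuple(2 * n + 1 for n in range(N)))

F = np.arange(16, dtype=float).reshape(4, 4)
P = highorder_pool(F, (2, 2))          # 2×2 max pooling → 2×2 output
```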
(Nonlinear Tucker Layer).
Given an \(N\)-order input tensor \(\mathcal {X}\in R^{I_{1}\times I_{2}\cdots \times I_{N}}\), the nonlinear Tucker layer is defined as follows: (3) \[\begin{align} \mathcal {Z}& =h\lbrace \mathcal {X}\times _{1}U^{(1)}\times _{2}U^{(2)}\cdots \times _{N}U^{(N)}+\mathcal {E}\rbrace , \end{align}\] where \(\mathcal {Z}\in R^{J_{1}\times J_{2}\cdots \times J_{N}}\) is potential low-dimensional feature of \(\mathcal {X}, U^{(n)}\in R^{J_{n}\times I_{n}}, J_{n} \ll I_{n}, 1\le n\le N\), are parameter matrices, \(\mathcal {E}\in R^{J_{1}\times J_{2}\cdots \times J_{N}}\) is a high-order bias tensor, and \(h: R^{J_{1}\times J_{2}\cdots \times J_{N}}\rightarrow R^{J_{1}\times J_{2}\cdots \times J_{N}}\) is a nonlinear activation function.
Considering that the size of the bias tensor \(\mathcal {E}\) increases exponentially with the number of orders, we can further represent the bias tensor in CP format to reduce the number of parameters, that is, \(\mathcal {E}=[B^{(1)},\ldots ,B^{(N)}]=\sum _{k=1}^{K}b^{(1)}_{k}\circ b^{(2)}_{k}\cdots \circ b^{(N)}_{k}\), where \(B^{(n)}=[b^{(n)}_{1},\ldots , b^{(n)}_{K}]\in R^{J_{n}\times K }, 1\le n\le N\). Actually, the nonlinear Tucker layer is equivalent to the fully connected layer of a DNN, and the parameter matrix \(W\) of the DNN can be generated from the \(N\) factor matrices in the following way: (4) \[\begin{align} vec(\mathcal {Z}) & =h\lbrace (U^{(N)}\otimes \cdots \otimes U^{(1)})vec(\mathcal {X})+vec(\mathcal {E})\rbrace , \end{align}\] where \(\otimes\) is the Kronecker product, \(W=U^{(N)}\otimes U^{(N-1)}\cdots \otimes U^{(1)}\in R^{J_{1}J_{2}\ldots J_{N}\times I_{1}I_{2}\ldots I_{N}}\), and \(b=vec(\mathcal {E})\in R^{J_{1}J_{2}\ldots J_{N}}\). However, the weights generated by the \(N\) factor matrices require far fewer parameters than weights defined directly in a DNN: the number of parameters drops from \(J_{1}\ldots J_{N}I_{1}\ldots I_{N}\) to \(\sum _{n=1}^{N}J_{n}I_{n}\). In addition, the nonlinear Tucker layer can mine the hidden features of the data more efficiently while maintaining the original structural information of the data.
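The equivalence between the nonlinear Tucker layer (Equation (3)) and a fully connected layer with Kronecker-structured weights (Equation (4)) can be checked numerically. The sketch below uses N = 2, ReLU as \(h\), and column-major vectorization, for which \(vec(U^{(1)}XU^{(2)T})=(U^{(2)}\otimes U^{(1)})vec(X)\) holds; sizes are illustrative:

```python
import numpy as np

def mode_product(X, A, n):
    # Mode-n product: contract mode n of X with the columns of A.
    return np.moveaxis(np.tensordot(A, X, axes=(1, n)), 0, n)

rng = np.random.default_rng(1)
I, J = (5, 6), (2, 3)
X = rng.standard_normal(I)
U = [rng.standard_normal((J[n], I[n])) for n in range(2)]
E = rng.standard_normal(J)
relu = lambda t: np.maximum(t, 0.0)

# Nonlinear Tucker layer (Equation (3))
Z = relu(mode_product(mode_product(X, U[0], 0), U[1], 1) + E)

# Equivalent dense layer with W = U^(2) ⊗ U^(1)  (Equation (4))
W = np.kron(U[1], U[0])
z_vec = relu(W @ X.flatten(order="F") + E.flatten(order="F"))
```

The parameter saving is also visible here: the dense weight `W` has \(J_1J_2I_1I_2 = 180\) entries, while the two factor matrices together have only \(J_1I_1 + J_2I_2 = 28\).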
(Tensor Contraction Layer).
Given an \(N\)-order tensor \(\mathcal {X}\in R^{I_{1}\times I_{2}\cdots \times I_{N}}\), and an \(N+1\)-order parameter tensor \(\mathcal {W}\in R^{I_{1}\times I_{2}\cdots \times I_{N}\times C}\), then the tensor contraction operation between \(\mathcal {X}\) and \(\mathcal {W}\) is defined as follows: (5) \[\begin{align} \langle \mathcal {X},\mathcal {W}\rangle _{c}=\mathcal {X}\times _{i_{1},\ldots ,i_{N}}\mathcal {W}+{\bf \emph {b}}=\sum _{i_{1},i_{2},\ldots ,i_{N}}^{I_{1},I_{2},\ldots ,I_{N}}\mathcal {X}_{i_{1},i_{2},\ldots ,i_{N}}\mathcal {W}_{i_{1},i_{2},\ldots ,i_{N},:}+{\bf \emph {b,}} \end{align}\] where \({\bf \emph {b}}\in R^{C}\) is a bias vector.
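Equation (5) is a full contraction of the input against the first \(N\) modes of the parameter tensor, leaving a length-\(C\) logit vector; a short NumPy sketch (illustrative sizes, C = 3):

```python
import numpy as np

def contraction_layer(X, W, b):
    # <X, W>_c: contract all N modes of X with the first N modes of W (Eq. (5)).
    return np.tensordot(X, W, axes=(list(range(X.ndim)),
                                    list(range(X.ndim)))) + b

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 5, 6))
W = rng.standard_normal((4, 5, 6, 3))      # C = 3 classes
b = rng.standard_normal(3)
logits = contraction_layer(X, W, b)
```

Each logit is simply \(\langle \mathcal{X},\mathcal{W}_{:,\ldots,:,c}\rangle + b_c\).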
4 TENSOR NEURAL NETWORK CONSTRUCTING
4.1 Tensor Neural Network
The main idea of TNN is to extend neural network models to a more general high-dimensional setting through the tensor decomposition model and to establish a close correlation between the tensor network and the neural network [4]. The general TNNs of 1-order, 2-order, and \(N\)-order are shown in Figure 2, and it is obvious that the 1-order TNN is equivalent to the NN model. Combined with Equation (4), 2-order, \(N\)-order, or higher-order TNN models can be converted into standard NN models through vectorization. Actually, each layer of a TNN is composed of the nonlinear Tucker layer.
Fig. 2. High-order generalization of the traditional NN models.
Considering an \(N\)-order input tensor \(\mathcal {X}\in R^{I_{1}\times I_{2}{\cdots }\times I_{N}}\), let \(\mathcal {A}^{(1)}\in R^{J_{1}\times J_{2}{\cdots }\times J_{N}}\) be the corresponding core tensor and \(\mathcal {Z}^{(1)}\) the activated feature tensor of the first layer. Each layer of the TNN is essentially a process of factor analysis or structural analysis of the input data. In the first case, the \(N\)-order core tensor \(\mathcal {A}^{(1)}\) is obtained by \(N\) linear transformations of the original data, and the factor matrix \(U^{(n)}\) can be regarded as the main factor components of mode-\(n\), \(1 \le n \le N\), which are used to project \(\mathcal {X}\) into the core tensor \(\mathcal {A}^{(1)}\); \(\mathcal {Z}^{(1)}\) is then obtained through the activation function \(h(\cdot)\). In the second case, the nonlinear Tucker layer can be further regarded as a structural analysis of the original data, and the basic structure tensors \(\mathcal {U}^{(j)}\in R^{I_{1}\times I_{2}{\cdots }\times I_{N}}, 1 \le j \le J_{1}J_{2}\ldots J_{N}\), are generated by the factor matrices in the following way: (6) \[\begin{align} \mathcal {U}^{(j)}=U^{(1)}_{j_{1},:}\circ U^{(2)}_{j_{2},:}\cdots \circ U^{(N)}_{j_{N},:} , \end{align}\] where \(j=j_{1}+\sum _{n=2}^{N}(j_{n}-1)\prod _{t=1}^{n-1}J_{t}\), and each element of \(\mathcal {A}^{(1)}\) is formulated as an inner product that projects \(\mathcal {X}\) into a high-dimensional tensor space through the structure tensors. For example, \(\mathcal {A}_{j_{1},\ldots ,j_{N}}^{(1)}=\langle \mathcal {X}, \mathcal {U}^{(j)}\rangle\), which represents the component of the original data on the structure tensor \(\mathcal {U}^{(j)}\). This high-dimensional tensor space is spanned by the \(J_{1}J_{2}\ldots J_{N}\) structure tensors.
As described above, the output of the first hidden layer of the \(N\)-order TNN is constructed from \(N\) factor matrices \(\lbrace U^{(1)},U^{(2)},\ldots ,U^{(N)}\rbrace\), a bias tensor \(\mathcal {E}\), and an activation function \(h(\cdot)\): (7) \[\begin{align} \mathcal {Z}^{(1)}&=h\lbrace \mathcal {X}\times _{1}U^{(1)}\times _{2}U^{(2)}\cdots \times _{N}U^{(N)}+\mathcal {E}\rbrace \end{align}\] As shown in Figure 3, we can further construct a deeper TNN by learning multiple hidden features \(\lbrace \mathcal {Z}^{(1)},\mathcal {Z}^{(2)},\ldots \rbrace\), where \(U^{(l)}_{n}, 1\le n \le N\), denote the \(N\) factor matrices of the \(l\)th hidden layer and \(\mathcal {E}^{(l)}\) is the corresponding bias tensor. Each layer of the deep TNN, like a single-layer TNN, performs factor analysis or structural analysis of the high-order data, and the \(N\) factor matrices of a given layer are shared across different input data, which is reasonable.
Fig. 3. The structure of multi-layer TNN.
4.2 Quantized Tensor Neural Network
For the 2-order TNN, the extracted feature \(\mathcal {A}^{(1)}\) in the first layer is obtained by the bilinear transformations \(U^{(1)}\) and \(U^{(2)}\), which is significantly different from typical CNN models. Here, we further propose a quantized tensor neural network that can effectively use the local structure information of the data, and the classic pooling operation can be indirectly completed by modular multiplication on the specified orders. More specifically, the QTNN model for high-dimensional data mainly consists of three steps. The first step is to extract the high-order neighborhood features of the data through the tensor convolution operation. For example, to extract \(k_{1}\times k_{2} \cdots \times k_{N}\) neighborhood features of an \(N\)-order data tensor \(\mathcal {X}\in R^{I_{1}\times I_{2}{\cdots }\times I_{N}}\), we use \(k_{1}k_{2} \ldots k_{N}\) convolution kernels to convolve the original data, obtaining an \(N+1\)-order tensor \(\mathcal {X}^{^{\prime }}\in R^{I_{1}\times I_{2}{\cdots }\times I_{N}\times k_{1}k_{2} \ldots k_{N}}\), and then reshape it into a \(2N\)-order tensor \(\mathcal {X}^{^{\prime }}\in R^{I_{1}\times \cdots \times I_{N}\times k_{1}\times \cdots \times k_{N}}\); then \(\mathcal {X}^{^{\prime }} _{i_{1},i_{2},\ldots ,i_{N},:,\ldots ,:}\in R^{k_{1} \times \cdots \times k_{N}}\) can be regarded as the local feature of element \(\mathcal {X} _{i_{1},i_{2},\ldots ,i_{N}}\). In fact, the main idea of this operation is to reduce redundant features in the network by giving the convolution features the same spatial structure as the original data. The second step is feature extraction from these small data blocks based on the tensor network, and the last step is feature interaction between the feature blocks.
Given an \(N\)-order input \(\mathcal {X}\in R^{I_{1}\times I_{2}\cdots \times I_{N}}\) and its corresponding convolution feature tensor \(\mathcal {X}^{^{\prime }}\in R^{I_{1}\times I_{2}\cdots \times I_{N}\times k_{1} \cdots \times k_{N}}\), which can also be obtained by multiple convolution layers to expand the receptive field of each element, the output of the first hidden layer \(\mathcal {Z}^{(1)}\) of the network can be obtained by the following formula: (8) \[\begin{align} \mathcal {Z}^{(1)}&=h\lbrace \mathcal {X}^{^{\prime }}\times _{1}U^{(1)}_{1}\times \cdots \times _{N}U^{(1)}_{N}\times _{N+1}V^{(1)}_{N+1}\times \cdots \times _{2N}V^{(1)}_{2N}\oplus \mathcal {E}^{(1)}\rbrace , \end{align}\] where \(U^{(1)}_{n}\in R^{J_{n}\times I_{n}}, J_{n}\ll I_{n}\), \(V^{(1)}_{N+n}\in R^{s_{n}\times k_{n}}, s_{n}\le k_{n},1\le n\le N\), and \(\mathcal {E}^{(1)}\in R^{J_{1}\times J_{2}\times \cdots J_{N}}\). Without loss of generality, the \(U_{*}^{(1)}\) are the factor matrices used to extract the features of the data blocks, and the \(V_{*}^{(1)}\) are used to perform feature interaction between the data blocks. Equation (8) can be further expressed in the following forms, which are more meaningful for the QTNN model.
Actually, the \((k_{1}+\sum _{n=2}^{N}(k_{n}-1)\prod _{t=1}^{n-1}k_{t})\)-th block of \(\mathcal {X}^{^{\prime }}\) is denoted as \(\mathcal {X}^{^{\prime }}_{:,\ldots ,:,k_{1},\ldots ,k_{N}}\in R^{I_{1}\times I_{2}\cdots \times I_{N}}\), and the local feature of this block can be extracted as follows: (9) \[\begin{align} \mathcal {H}^{(1)}_{:,\ldots ,:,k_{1},\ldots ,k_{N}}=\mathcal {X}^{^{\prime }}_{:,\ldots ,:,k_{1},\ldots ,k_{N}}\times _{1}U^{(1)}_{1}\times \cdots \times _{N}U^{(1)}_{N}+\mathcal {E}^{(1)} \end{align}\] (10) \[\begin{align} \mathcal {H}^{(1)}=\mathcal {X}^{^{\prime }}\times _{1}U^{(1)}_{1}\times \cdots \times _{N}U^{(1)}_{N}\oplus \mathcal {E}^{(1)} . \end{align}\] Then, the features of arbitrary data blocks of the first hidden layer can be extracted by Equation (9). For a fixed data block, the above operation is equivalent to multiple convolution operations on the data block, and the number of convolution kernels is \(J_{1} J_{2}\ldots J_{N}\). After learning the low-dimensional features of each data block through Equation (10), we further obtain the interactive information between the blocks through Equation (11). For simplicity, we denote the \((k_{1}+\sum _{n=2}^{N}(k_{n}-1)\prod _{t=1}^{n-1}k_{t})\)-th feature block as \(\mathcal {H}^{(1)}_{k_{1},\ldots ,k_{N}}\in R^{J_{1}\times J_{2}\cdots \times J_{N}}\); then we have (11) \[\begin{align} \left(\mathcal {H}^{(1)}\times _{N+1}V^{(1)}_{N+1}\right)_{l_{1},k_{2},\ldots ,k_{N}}=\sum _{k_{1}}\mathcal {H}^{(1)}_{k_{1},\ldots ,k_{N}} V^{(1)}_{N+1(l_{1},k_{1})} \end{align}\] (12) \[\begin{align} \mathcal {Z}^{(1)}=h\left\lbrace \mathcal {H}^{(1)}\times _{N+1}V^{(1)}_{N+1}\times _{N+2}V^{(1)}_{N+2}\cdots \times _{2N}V^{(1)}_{2N}\right\rbrace . \end{align}\] In fact, Equation (11) is equivalent to combining the features of the data blocks along mode-\((N+1)\).
Similarly, we can continue to combine feature blocks along the remaining orders and generate multiple global features by Equation (12), and the overall process is shown in Figure 4.
Fig. 4. The structure of multi-layer QTNN.
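The first QTNN layer (Equations (10) and (12)) can be sketched end-to-end in NumPy; the sizes, ReLU as the nonlinearity \(h\), and the broadcast realization of \(\oplus\) are illustrative assumptions:

```python
import numpy as np

def mode_product(T, A, n):
    # Mode-n product: contract mode n of T with the columns of A.
    return np.moveaxis(np.tensordot(A, T, axes=(1, n)), 0, n)

rng = np.random.default_rng(3)
N = 2
I, k = (6, 7), (3, 3)                 # data size and neighborhood size
J, s = (2, 3), (2, 2)                 # per-block feature size, interaction size
Xc = rng.standard_normal(I + k)       # 2N-order convolution feature tensor X'
U = [rng.standard_normal((J[n], I[n])) for n in range(N)]
V = [rng.standard_normal((s[n], k[n])) for n in range(N)]
E = rng.standard_normal(J)

# Equation (10): per-block feature extraction along the first N modes
H = mode_product(mode_product(Xc, U[0], 0), U[1], 1)
H = H + E[..., None, None]            # ⊕: broadcast the bias over the block grid

# Equation (12): feature interaction along the last N modes, then activation
A1 = mode_product(mode_product(H, V[0], N), V[1], N + 1)
Z1 = np.maximum(A1, 0.0)              # ReLU as the nonlinearity h
```

Because the mode products act only on the first \(N\) modes, each block of `H` equals \(U^{(1)}_{1}\mathcal{X}'_{:,:,k_1,k_2}U^{(1)T}_{2}+\mathcal{E}^{(1)}\), exactly as Equation (9) states.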
4.3 Forward Propagation of QTNN
In this section, we formally define the structure of the QTNN. Without loss of generality, TNN can be regarded as a special case of QTNN in which the input data is not processed by convolution operations. In addition, we abstract the structure of QTNN into three layers, namely, the input layer, hidden layers, and output layer, and the corresponding forward computation process is also provided.
Input layer: This layer mainly includes tensor convolution, feature extraction, and feature interaction. Consider a high-order input tensor \(\mathcal {X}\in R^{I_{1}\times I_{2}\cdots \times I_{N}}\) whose corresponding convolution feature tensor is \(\mathcal {X}^{^{\prime }}\in R^{I_{1}\times \cdots I_{N}\times k_{1}\times \cdots k_{N}}\). Obviously, QTNN degenerates into TNN when the convolution operations are not performed. First, we can obtain the low-dimensional feature of each data block in the first layer by (13) \[\begin{align} \mathcal {H}^{(1)}&=\mathcal {X}^{^{\prime }}\times _{1}U^{(1)}_{1}\times \cdots \times _{N}U^{(1)}_{N}\oplus \mathcal {E}^{(1)} , \end{align}\] where \(U^{(1)}_{n}\in R^{J_{n}\times I_{n}}, 1\le n \le N\). Actually, Equation (13) is also equivalent to the convolution operation of a CNN; the difference between the two models is that each convolution kernel used here is a rank-1 tensor. In addition, each data block is equivalently pooled indirectly, and its size is changed from the original \(I_{1}\times I_{2}\cdots \times I_{N}\) to \(J_{1}\times J_{2}\cdots \times J_{N}\). After obtaining the features of the data blocks of the original data, the feature blocks can be further interacted in the following way: (14) \[\begin{align} \mathcal {A}^{(1)}&=\mathcal {H}^{(1)}\times _{N+1}V^{(1)}_{N+1}\times \cdots \times _{2N}V^{(1)}_{2N} , \end{align}\] where \(V^{(1)}_{N+n}\in R^{s_{n}\times k_{n}}, 1\le n \le N\); then the output feature tensor is \(\mathcal {Z}^{(1)}=h\lbrace \mathcal {A}^{(1)}\rbrace \in R^{J_{1}\cdots \times J_{N}\times s_{1}\cdots \times s_{N}}\). Actually, \(\mathcal {Z}^{(1)}\) is composed of \(s_{1}s_{2}\ldots s_{N}\) feature blocks, and each block can be further regarded as a compressed feature of \(\mathcal {X}\).
Middle layer: For the \(l\)th layer, \(2\le l \le L-1\), the \(2N\)-order tensors \(\mathcal {Z}^{(l-1)}\in R^{J_{1}^{l-1}\cdots \times J_{N}^{l-1}\times s_{1}^{l-1}\cdots \times s_{N}^{l-1}}\) and \(\mathcal {Z}^{(l)}\in R^{J_{1}^{l}\cdots \times J_{N}^{l}\times s_{1}^{l}\cdots \times s_{N}^{l}}\) are the output tensors of layer \(l-1\) and layer \(l\), respectively, and the relationship between them is given by the following equations: (15) \[\begin{align} \mathcal {H}^{(l)}&=\mathcal {Z}^{(l-1)}\times _{1}U^{(l)}_{1}\times \cdots \times _{N}U^{(l)}_{N}\oplus \mathcal {E}^{(l)}, \end{align}\] (16) \[\begin{align} \mathcal {A}^{(l)}&=\mathcal {H}^{(l)}\times _{N+1}V^{(l)}_{N+1}\times \cdots \times _{2N}V^{(l)}_{2N}, \end{align}\] (17) \[\begin{align} \mathcal {Z}^{(l)}&=h\lbrace \mathcal {A}^{(l)}\rbrace , \end{align}\] where \(U^{(l)}_{n}\in R^{J_{n}^{l}\times J_{n}^{l-1}}, V^{(l)}_{n}\in R^{s_{n}^{l}\times s_{n}^{l-1}}\), and \(\mathcal {E}^{(l)}\in R^{J_{1}^{l}\times \cdots \times J_{N}^{l}}\) is bias tensor.
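Equations (15)-(17) can be wrapped into a reusable layer function; the sketch below (NumPy, N = 2, illustrative sizes, ReLU as \(h\)) chains the two mode-product stages exactly as in the middle-layer definition:

```python
import numpy as np

def mode_product(T, A, n):
    # Mode-n product: contract mode n of T with the columns of A.
    return np.moveaxis(np.tensordot(A, T, axes=(1, n)), 0, n)

def qtnn_layer(Z_prev, U, V, E, h=lambda t: np.maximum(t, 0.0)):
    # One QTNN hidden layer (Equations (15)-(17)) for a 2N-order input:
    # feature extraction along modes 1..N, interaction along modes N+1..2N.
    N = len(U)
    H = Z_prev
    for n in range(N):
        H = mode_product(H, U[n], n)
    H = H + E.reshape(E.shape + (1,) * N)   # ⊕: broadcast bias over block modes
    A = H
    for n in range(N):
        A = mode_product(A, V[n], N + n)
    return h(A)

rng = np.random.default_rng(4)
Z0 = rng.standard_normal((4, 5, 3, 3))      # J^{l-1} = (4,5), s^{l-1} = (3,3)
U = [rng.standard_normal((2, 4)), rng.standard_normal((3, 5))]
V = [rng.standard_normal((2, 3)), rng.standard_normal((2, 3))]
E = rng.standard_normal((2, 3))
Z1 = qtnn_layer(Z0, U, V, E)                # 2N-order output of layer l
```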
Output layer: For the classification task of \(C\) classes, we use the softmax function \(softmax(\cdot)\) and the tensor contraction operation to output the probability of each class. The specific definition of layer \(L\) is given as follows: (18) \[\begin{align} \emph {{\bf a}}^{(L)} & =\langle \mathcal {W},\mathcal {Z}^{(L-1)} \rangle _{c} +\emph { {\bf b}}, \end{align}\] (19) \[\begin{align} a^{(L)}_{c} & =\langle \mathcal {W}_{:,\ldots ,:,c},\mathcal {Z}^{(L-1)} \rangle + b_{c}, 1 \le c \le C, \end{align}\] (20) \[\begin{align} {\bf \emph {y}} & = softmax(\emph {{\bf a}}^{(L)}) =\frac{exp(\emph {{\bf a}}^{(L)})}{\sum _{c=1}^{C}exp\left(a^{(L)}_{c}\right)} , \end{align}\] where \(\emph {{\bf y}},\emph {{\bf b}} \in R^{C}\) correspond to the probability distribution of the output label and the bias vector, respectively, and \(\mathcal {W}\in R^{J_{1}^{L-1}\cdots \times J_{N}^{L-1}\times s_{1}^{L-1}\cdots \times s_{N}^{L-1}\times C}\) is a \(2N+1\)-order parameter tensor used to output the probability of each class, and \(\mathcal {W}_{:,\ldots ,:,c}\) is a \(2N\)-order tensor obtained by fixing the index of the \((2N+1)\)-th order to \(c\).
Classification loss function: Given a set of high-order data tensors and the corresponding class labels \(\lbrace \mathcal {X}_{k},r^{k}\rbrace , 1\le k\le K\), let \(y^{k}=\lbrace y^{k}_{c}\rbrace _{c=1}^{C}\) be the output probability distribution of the network for sample \(k\). The classification cross-entropy loss function is defined as follows: (21) \[\begin{align} Loss(\Theta)=\sum _{k=1}^{K}E_{k}(\Theta)=-\sum _{k=1}^{K}\sum _{c=1}^{C}r^{k}_{c}\ln y^{k}_{c} , \end{align}\] where \(\Theta =\lbrace U^{(l)}_{n},V^{(l)}_{N+n},\mathcal {E}^{(l)},\mathcal {W}, 1\le n\le N, 1\le l\le L-1\rbrace\) represents the parameters of the network, and the label vector \(r^{k}\) is one-hot encoded: (22) \[\begin{equation} r^{k}_{c}= {\left\lbrace \begin{array}{ll}1,& {\mathcal {X}_{k}\; \text{belongs to}\; {c}\text{th class}}\\ 0,& \text{otherwise.}\end{array}\right.} \end{equation}\]
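A minimal sketch of the output layer and loss (Equations (18)–(22)) for a single sample, assuming the contraction \(\langle \cdot ,\cdot \rangle _{c}\) sums over all modes of \(\mathcal {Z}^{(L-1)}\); the max-shift inside the softmax is a standard numerical-stability trick, not part of the formulation above.

```python
import numpy as np

def output_layer(Z, W, b):
    """Eqs. (18)-(20): a_c = <W[...,c], Z> + b_c, followed by softmax.
    Z: 2N-order feature tensor; W: same shape as Z plus a trailing class mode."""
    a = np.tensordot(W, Z, axes=(list(range(Z.ndim)), list(range(Z.ndim)))) + b
    e = np.exp(a - a.max())              # shift for numerical stability
    return e / e.sum()

def cross_entropy(y, r):
    """Eq. (21) for one sample with one-hot label r (Eq. 22)."""
    return -np.sum(r * np.log(y))

rng = np.random.default_rng(1)
Z = rng.standard_normal((4, 3, 2, 2))    # a 2N-order feature tensor, N=2
C = 10
W = rng.standard_normal((4, 3, 2, 2, C)) # (2N+1)-order parameter tensor
b = rng.standard_normal(C)
y = output_layer(Z, W, b)
r = np.eye(C)[3]                         # one-hot label for class 3
print(y.sum(), cross_entropy(y, r))
```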
4.4 Back-propagation of QTNN
In this section, our main work is to minimize the classification loss function through the gradient optimization method and further provide a high-order back-propagation algorithm for QTNN. The most important step in this process is to calculate the gradient of the loss function with respect to the network parameters \(\Theta =\lbrace U^{(l)}_{n},V^{(l)}_{N+n},\mathcal {E}^{(l)},\mathcal {W}, 1\le n\le N, 1\le l\le L-1\rbrace\). To solve this problem, we derive the following parameter update formulas:
(1) Output layer: For the output of the last layer of the network \(a^{(L)}\), we can compute the gradient of \(E_{k}(\Theta)\) with respect to each element of \(a^{(L)}\) as follows: (23) \[\begin{align} d^{(L)}_{c}\doteq \frac{\partial E_{k}(\Theta)}{\partial a^{(L)}_{c} }=\sum _{t=1}^{C}\frac{\partial E_{k}(\Theta)}{\partial y^{k}_{t} }\frac{\partial y^{k}_{t}}{\partial a^{(L)}_{c}}=y^{k}_{c}-r^{k}_{c} . \end{align}\] For simplicity, we can denote the above formula as: (24) \[\begin{align} d^{(L)}\doteq \frac{\partial E_{k}(\Theta)}{\partial a^{(L)} }=y^{k}-r^{k} , \end{align}\] then we can derive the derivatives of the parameters of the last layer in the following way: (25) \[\begin{align} \frac{\partial E_{k}(\Theta)}{\partial \mathcal {W}_{k_{1},\ldots ,k_{N},c} }&=\frac{\partial E_{k}(\Theta)}{\partial a^{(L)}_{c} }\frac{\partial a^{(L)}_{c}}{\partial \mathcal {W}_{k_{1},\ldots ,k_{N},c} }=d^{(L)}_{c}\mathcal {Z}^{(L-1)}_{k_{1},\ldots ,k_{N}}, \end{align}\] (26) \[\begin{align} \frac{\partial E_{k}(\Theta)}{\partial b_{c} }&=\frac{\partial E_{k}(\Theta)}{\partial a^{(L)}_{c} }\frac{\partial a^{(L)}_{c}}{\partial b_{c} }=d^{(L)}_{c} , \end{align}\] then we can rewrite the above formulas in tensor form as: (27) \[\begin{align} \frac{\partial E_{k}(\Theta)}{\partial \mathcal {W}_{:,\ldots ,:,c} }&=d^{(L)}_{c}\mathcal {Z}^{(L-1)}, \end{align}\] (28) \[\begin{align} \frac{\partial E_{k}(\Theta)}{\partial b }&=d^{(L)} . \end{align}\] After obtaining the gradients of the parameters of layer \(L\), we can use the gradient descent algorithm to update them. To recursively calculate the gradients of the parameters of layer \(l, 1\le l\le L-1\), we continue to calculate the gradient of \(\mathcal {Z}^{(L-1)}\).
According to Equation (18), we have (29) \[\begin{align} \frac{\partial E_{k}(\Theta)}{\partial \mathcal {Z}^{(L-1)}_{k_{1},\ldots ,k_{N}} }&=\sum _{t=1}^{C}\frac{\partial E_{k}(\Theta)}{\partial a^{(L)}_{t} }\frac{\partial a^{(L)}_{t}}{\partial \mathcal {Z}^{(L-1)}_{k_{1},\ldots ,k_{N}} }=\sum _{t=1}^{C} d^{(L)}_{t}\mathcal {W}_{k_{1},\ldots ,k_{N},t}, \end{align}\] (30) \[\begin{align} \frac{\partial E_{k}(\Theta)}{\partial \mathcal {A}^{(L-1)}_{k_{1},\ldots ,k_{N}} }&=\frac{\partial E_{k}(\Theta)}{\partial \mathcal {Z}^{(L-1)}_{k_{1},\ldots ,k_{N}} }\times h^{^{\prime }}\left\lbrace \mathcal {A}^{(L-1)}_{k_{1},\ldots ,k_{N}}\right\rbrace . \end{align}\] For simplicity, we can also denote the above formula as: (31) \[\begin{equation} \mathcal {\delta }^{(L-1)}\doteq \frac{\partial E_{k}(\Theta)}{\partial \mathcal {A}^{(L-1)} }=(\mathcal {W}\times _{2N+1}d^{(L)})* h^{^{\prime }}\lbrace \mathcal {A}^{(L-1)}\rbrace , \end{equation}\] where \(*\) denotes the Hadamard product.
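The output-layer gradients can be checked numerically. The sketch below uses the standard softmax/cross-entropy error signal \(d = y - r\) and verifies the bias gradient against finite differences of the loss in Equation (21); the shapes and data are arbitrary illustrations, not the paper's configuration.

```python
import numpy as np

def output_grads(Z, W, b, r):
    """Analytic output-layer gradients (Eqs. 24, 27-29) with d = y - r."""
    a = np.tensordot(W, Z, axes=(list(range(Z.ndim)), list(range(Z.ndim)))) + b
    e = np.exp(a - a.max()); y = e / e.sum()
    d = y - r                                      # Eq. (24)
    gW = Z[..., None] * d                          # Eq. (27): dE/dW[...,c] = d_c * Z
    gb = d                                         # Eq. (28)
    gZ = np.tensordot(W, d, axes=(W.ndim - 1, 0))  # Eq. (29): W x_{2N+1} d
    return gW, gb, gZ

rng = np.random.default_rng(2)
Z = rng.standard_normal((3, 2)); C = 4
W = rng.standard_normal((3, 2, C)); b = rng.standard_normal(C)
r = np.eye(C)[1]

def loss(bv):
    """Cross-entropy loss of Eq. (21) as a function of the bias only."""
    a = np.tensordot(W, Z, axes=([0, 1], [0, 1])) + bv
    y = np.exp(a - a.max()); y /= y.sum()
    return -np.sum(r * np.log(y))

gW, gb, gZ = output_grads(Z, W, b, r)
eps = 1e-6  # central finite differences on each bias coordinate
num = np.array([(loss(b + eps * np.eye(C)[c]) - loss(b - eps * np.eye(C)[c])) / (2 * eps)
                for c in range(C)])
print(np.max(np.abs(gb - num)))  # close to machine precision
```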
(2) Hidden layer: For the \(l\)th layer, \(1\le l \le L-1\), we can compute the gradients of \(\Theta =\lbrace U^{(l)}_{i},V^{(l)}_{N+i},\mathcal {E}^{(l)}, 1\le i\le N\rbrace\) through the following formula: (32) \[\begin{align} \frac{\partial E_{k}(\Theta)}{\partial V_{i(m, n)}^{(l)}}&=\sum _{j_{1}=1}^{J_{1}^{l}} \cdots \sum _{j_{N}=1}^{J_{N}^{l}} \sum _{s_{1}=1}^{s_{1}^{l}} \cdots \sum _{s_{N}=1}^{s_{N}^{l}}\frac{\partial E_{k}(\Theta)}{\partial \mathcal {A}_{j_{1} \cdots j_{N},s_{1} \cdots s_{N}}^{(l)}} \frac{\partial \mathcal {A}_{j_{1} \cdots j_{N},s_{1} \cdots s_{N}}^{(l)}}{\partial V_{i(m, n)}^{(l)}}\nonumber \\ &=\sum _{j_{1}=1}^{J_{1}^{l}} \cdots \sum _{j_{N}=1}^{J_{N}^{l}} \sum _{s_{1}=1}^{s_{1}^{l}} \cdots \sum _{s_{i-1}=1}^{s_{i-1}^{l}} \sum _{s_{i+1}=1}^{s_{i+1}^{l}} \cdots \sum _{s_{N}=1}^{s_{N}^{l}} \delta _{j_{1} \cdots j_{N},s_{1} \cdots s_{i-1},m,s_{i+1} \cdots s_{N}}^{(l)}\nonumber \\ & \quad \times \left(\sum _{t_{1}=1}^{s_{1}^{l-1}} \cdots \sum _{t_{i-1}=1}^{s_{i-1}^{l-1}}\sum _{t_{i+1}=1}^{s_{i+1}^{l-1}} \cdots \sum _{t_{N}=1}^{s_{N}^{l-1}} \mathcal {H}_{j_{1} \cdots j_{N},t_{1} \cdots t_{i-1},n,t_{i+1} \cdots t_{N}}^{(l)} V_{N+1\left(s_{1},t_{1}\right)}^{(l)} \cdots V_{2N\left(s_{N}, t_{N}\right)}^{(l)}\right) , \end{align}\] where the factor \(V^{(l)}_{N+i}\) is excluded from the product. For simplicity, we can denote Equation (32) as follows: (33) \[\begin{align} \frac{\partial E_{k}(\Theta)}{\partial V_{i(m,n)}^{(l)}}=\langle \delta _{s_{i}=m}^{(l)}, \mathcal {T}_{s_{i}=n}\rangle , \quad \mathcal {T}=\mathcal {H}^{(l)} \times _{N+1} V_{N+1}^{(l)} \cdots \times _{N+i-1} V_{N+i-1}^{(l)} \times _{N+i+1} V_{N+i+1}^{(l)} \cdots \times _{2N} V_{2N}^{(l)} . \end{align}\] As for the gradient of \(\mathcal {H}^{(l)}\), for simplicity, we denote it as \(\mathcal {\beta }^{(l)} \doteq \frac{\partial E_{k}(\Theta)}{\partial \mathcal {H}^{(l)}}\), then we have: (34) \[\begin{align} \mathcal {\beta }^{(l)}_{j_{1},\ldots ,j_{N},s_{1},\ldots ,s_{N}} &= \sum _{t_{1}=1}^{s_{1}^{l}} \cdots \sum _{t_{N}=1}^{s_{N}^{l}}\mathcal {\delta }^{(l)}_{j_{1},\ldots ,j_{N},t_{1},\ldots ,t_{N}} V^{(l)}_{N+1(t_{1},s_{1})}\cdots V^{(l)}_{2N(t_{N},s_{N})} . \end{align}\] Similarly, we can update the factor matrices \(U^{(l)}_{i}, 1\le i\le N\), by the following formula: (35) \[\begin{align} \frac{\partial E_{k}(\Theta)}{\partial U_{i(m,n)}^{(l)}}=\langle \beta _{j_{i}=m}^{(l)}, \mathcal {T}_{j_{i}=n}\rangle , \quad \mathcal {T}=\mathcal {Z}^{(l-1)} \times _{1} U_{1}^{(l)} \cdots \times _{i-1} U_{i-1}^{(l)} \times _{i+1} U_{i+1}^{(l)} \cdots \times _{N} U_{N}^{(l)} . \end{align}\] As for the gradient of the bias tensor in layer \(l\), we have: (36) \[\begin{align} \frac{\partial E_{k}(\Theta)}{\partial \mathcal {E}^{(l)}_{j_{1},\ldots ,j_{N}} }&= \sum _{s_{1}=1}^{s_{1}^{l-1}} \cdots \sum _{s_{N}=1}^{s_{N}^{l-1}}\mathcal {\beta }^{(l)}_{j_{1},\ldots ,j_{N},s_{1},\ldots ,s_{N}} . \end{align}\] Considering that the size of the bias tensor increases exponentially with its order, we can use the CP model to generate the bias indirectly when the dimension of the input data is extremely high. The gradients of the corresponding parameters are given below.
Assuming \(\mathcal {E}^{(l)}=[B^{(1)},\ldots ,B^{(N)}]=\sum _{k=1}^{K}b^{(1)}_{k}\circ b^{(2)}_{k}\circ \cdots \circ b^{(N)}_{k}\), and \(B^{(n)}=[b^{(n)}_{1},\ldots , b^{(n)}_{K}]\in R^{J_{n}\times K }, 1\le n\le N\), the mode-\(n\) unfolding of \(\mathcal {E}^{(l)}\) is given as follows: (37) \[\begin{align} \mathcal {E}^{(l)}_{(n)}=unfolding(\mathcal {E}^{(l)},n)=B^{(n)}\left(B^{(N)}\odot \cdots \odot B^{(n+1)}\odot B^{(n-1)}\odot \cdots \odot B^{(1)}\right)^{T} , \end{align}\] where \(\odot\) denotes the Khatri-Rao product. Then, we have (38) \[\begin{align} \frac{\partial \mathcal {E}^{(l)}_{(n)}}{\partial B^{(n)} }=\frac{\partial B^{(n)}\left(B^{(N)}\odot \cdots \odot B^{(n+1)}\odot B^{(n-1)}\odot \cdots \odot B^{(1)}\right)^{T}}{\partial B^{(n)} } . \end{align}\]
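The unfolding identity can be verified numerically. The sketch below builds a 3-order CP tensor and checks that its mode-1 unfolding factors into \(B^{(1)}\) times a Khatri-Rao product; note that the ordering of the factors inside the Khatri-Rao product depends on the unfolding convention, and the row-major (C-order) convention used here pairs with the remaining factors in their original order.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product (⊙); rows of A vary slowest."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def unfold(T, n):
    """Mode-n unfolding using the row-major (C-order) convention."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

rng = np.random.default_rng(3)
K = 4
B1, B2, B3 = (rng.standard_normal((d, K)) for d in (5, 3, 2))
# CP tensor E = sum_k b1_k ∘ b2_k ∘ b3_k
E = np.einsum('ik,jk,lk->ijl', B1, B2, B3)
# Mode-1 unfolding factors as B1 times a Khatri-Rao product (cf. Eq. 37)
lhs = unfold(E, 0)
rhs = B1 @ khatri_rao(B2, B3).T
print(np.allclose(lhs, rhs))  # True
```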
In Algorithm 1, a high-order error back-propagation algorithm for the tensor neural network is provided, which realizes the transformation of the traditional back-propagation algorithm from the vector space to higher-order tensor spaces.

5 MODEL ANALYSIS
Considering an \(N\)-order input data tensor \(\mathcal {X}\in R^{I_{1}\times I_{2}\times \cdots \times I_{N}}\), the corresponding compressed rank of each order is \(R_{n}, R_{n}\le I_{n}, 1\le n\le N\). The tensor neural network directly takes the \(N\)-order tensor as input, which avoids the vectorization of the data and preserves the structural information of the original data well. In addition, the weight \(W\in R^{R_{1}R_{2}\ldots R_{N}\times I_{1}I_{2}\ldots I_{N}}\) generated by the factor matrices \(U^{(n)}, 1\le n \le N\), has fewer parameters than a standard fully connected layer. More specifically, the standard fully connected layer takes a flattened vector as input and has \(n_{FC}\) parameters, (39) \[\begin{align} n_{FC}=\prod _{n=1}^{N}I_{n}R_{n}+\prod _{n=1}^{N}R_{n} , \end{align}\] where \(\prod _{n=1}^{N}I_{n}R_{n}\) and \(\prod _{n=1}^{N}R_{n}\) are the sizes of the weight matrix and the bias vector in the fully connected layer, respectively. As for the nonlinear Tucker layer, the hidden feature is obtained by multi-linear transformations of the original data with \(N\) factor matrices, and it has \(n_{TL}\) parameters, (40) \[\begin{align} n_{TL}=\sum _{n=1}^{N}I_{n}R_{n}+\prod _{n=1}^{N}R_{n} , \end{align}\] where \(\sum _{n=1}^{N}I_{n}R_{n}\) is the number of parameters in the factor matrices, and \(\prod _{n=1}^{N}R_{n}\) corresponds to the number of parameters in the bias tensor. In practice, we usually use the CP model to generate the bias to prevent the size of the bias tensor from increasing exponentially with the order. Assuming \(\mathcal {E}^{(l)}=[B^{(1)},\ldots ,B^{(N)}]=\sum _{k=1}^{K}b^{(1)}_{k}\circ b^{(2)}_{k}\circ \cdots \circ b^{(N)}_{k}\), and \(B^{(n)}=[b^{(n)}_{1},\ldots , b^{(n)}_{K}]\in R^{R_{n}\times K }, 1\le n\le N\), then we have (41) \[\begin{align} n_{TL}=\sum _{n=1}^{N}I_{n}R_{n}+K\sum _{n=1}^{N}R_{n}. \end{align}\] To further improve the learning ability of the tensor neural network, we introduce multiple high-order convolution layers in front of the TNN to extract local features of the high-order tensor. Assuming there are three convolution layers with \(n_{1},n_{2},n_{3}\) convolution kernels, respectively, and that every kernel has size \(k\times k\times \cdots \times k\), then the number of parameters in the convolution layers is about \(n_{CL}\), (42) \[\begin{align} n_{CL}=(n_{1}+n_{2}n_{1}+n_{3}n_{2})k^{N} . \end{align}\]
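The parameter counts of Equations (39)–(42) are easy to tabulate. The sizes below (a \(32\times 32\times 3\) input compressed to ranks \((8,8,3)\), CP rank \(K=4\), and a three-layer convolution stack with 16/32/64 kernels of size \(5\times 5\)) are illustrative choices, not the paper's exact configurations.

```python
import numpy as np

# Illustrative sizes: a 32x32x3 input compressed to ranks (8, 8, 3), CP rank K=4
I = [32, 32, 3]
R = [8, 8, 3]
K = 4

n_fc = np.prod(I) * np.prod(R) + np.prod(R)              # Eq. (39): dense layer
n_tl = sum(i * r for i, r in zip(I, R)) + np.prod(R)     # Eq. (40): Tucker layer
n_tl_cp = sum(i * r for i, r in zip(I, R)) + K * sum(R)  # Eq. (41): with CP bias

# Eq. (42): three conv layers with 16/32/64 kernels of size k^N, k=5, N=2
n1, n2, n3, k, N = 16, 32, 64, 5, 2
n_cl = (n1 + n2 * n1 + n3 * n2) * k**N

print(n_fc, n_tl, n_tl_cp, n_cl)  # the dense layer dominates by orders of magnitude
```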
6 EXPERIMENTS
To verify the performance of tensor neural networks on classification tasks, we conducted experiments on three representative datasets, namely, the MNIST, CIFAR10, and CIFAR100 datasets. All experiments were run on a single machine with Ubuntu 16.04 LTS, an Intel Core i7-6700 CPU @ 3.2 GHz (eight cores), 32 GB of RAM, and NVIDIA GPUs (2× RTX 2080 Ti and a GeForce GTX TITAN). CUDA 10.0 is used for parallel acceleration of the experimental code. The experiments are implemented in the PyTorch framework, and the tensor models are computed with the Tensorly library. All models are trained from scratch, and the image classification models are trained for 200 epochs. The initial learning rate is set to 0.003, and the SGD optimizer is used with a learning rate momentum of 0.9. The batch size is set to 16 for MNIST and to 128 for CIFAR10 and CIFAR100.
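The optimizer configuration above amounts to the following update rule; this is a generic SGD-with-momentum sketch using the stated hyperparameters (learning rate 0.003, momentum 0.9), not code from the paper.

```python
def sgd_momentum_step(w, g, v, lr=0.003, mu=0.9):
    """One SGD update with momentum: v <- mu*v + g, w <- w - lr*v."""
    v = mu * v + g          # accumulate the velocity
    return w - lr * v, v    # take the parameter step

# two updates of a scalar parameter under a constant gradient of 0.5
w, v = 1.0, 0.0
for _ in range(2):
    w, v = sgd_momentum_step(w, 0.5, v)
print(w, v)
```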
6.1 The Experiment on the MNIST Dataset
First, we evaluate the performance of the proposed models on the MNIST dataset. Compiled by NIST, it consists of 70,000 grayscale (0–255) images of the 10 handwritten digits (0–9), each of \(28\times 28\) pixels. The training set contains 60,000 handwritten digit images, and the test set contains 10,000. The classification accuracy on the test set is used as the evaluation criterion for our networks.
In the first experiment, we designed a variety of network topologies for each model, namely, NN, TNN, CNN, and QTNN, and compared the performance of the different types of networks. Two-layer and three-layer tensor neural networks (TNNs) and quantized tensor neural networks (QTNNs) are compared with classic NNs and CNNs, respectively. A two-layer NN model with a vector of length 784 as input, 70 hidden features in the middle layer, and 10 outputs for classification is denoted as (784-70-10). Likewise, a two-layer tensor network with a matrix of size \(28\times 28\) as input and a compressed hidden feature of size \(10\times 10\) is denoted as (28*28-10*10-10). The multi-layer TNN models, which are composed of multiple nonlinear Tucker layers, achieve better results than the traditional NN models while using fewer parameters. This is mainly because the tensor models process multidimensional data efficiently without destroying the structural information of the original data. Although we add multiple tensor layers to the TNN models, the number of parameters in each layer is almost negligible compared with that of a fully connected layer in NNs. In the implementations of QTNNs and classic CNNs, we use one convolution layer to extract local features of the original data. More specifically, the first layer consists of 25 convolution kernels of size \(5\times 5\) with a stride of 1, and padding is adopted to preserve the size of the original data.
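The first convolution layer described above (25 kernels of size \(5\times 5\), stride 1, padding that preserves the input size) can be sketched directly in NumPy; the naive loop implementation below is for illustration only, and a real implementation would use an optimized convolution routine.

```python
import numpy as np

def conv2d_same(x, kernels):
    """'Same' 2-D convolution: stride 1, zero padding preserves spatial size.
    x: (H, W); kernels: (n_k, k, k). Returns feature maps of shape (n_k, H, W)."""
    n_k, k, _ = kernels.shape
    p = k // 2
    xp = np.pad(x, p)                    # zero padding on both sides
    H, W = x.shape
    out = np.empty((n_k, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(4)
img = rng.standard_normal((28, 28))        # one MNIST-sized image
kernels = rng.standard_normal((25, 5, 5))  # 25 kernels of size 5x5, as in the text
feat = conv2d_same(img, kernels)
print(feat.shape)  # (25, 28, 28): spatial size preserved, one map per kernel
```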
As shown in Table 1, we use multiple nonlinear Tucker layers in the TNN model to gradually extract global features of the original data, and each layer is actually composed of two parameter factor matrices. Unlike adding a traditional fully connected layer to NNs, which easily brings an unbearable number of parameters, adding a nonlinear Tucker layer to TNNs or QTNNs hardly changes the parameter count of the original models. Moreover, the multi-layer network structure also achieves better classification accuracy than the NN models. As mentioned earlier, this further demonstrates that TNN models are able to learn latent spatial structure information from the observed data. To make better use of the local information of the data, convolution operations are integrated into the TNNs to extract high-order neighborhood features of the data, forming the QTNN network, which combines the advantages of CNNs and TNNs. In the QTNN models, we mainly extract the neighborhood features of each element through one layer of \(5\times 5\) convolutions and then feed the high-order feature tensor directly into the TNN without pooling; the pooling operation is completed indirectly by the subsequent nonlinear Tucker layers. In contrast to the huge number of parameters brought by multiple fully connected layers in CNNs, QTNNs improve the classification accuracy at the cost of only a small number of additional parameters by adding nonlinear Tucker layers. QTNNs also achieve almost the same performance as CNNs and are significantly better than NNs and TNNs.
Table 1. Comparison of Classification Accuracy (%) and the Number of Parameters of Different Networks on MNIST
6.2 The Experiment on the CIFAR-10 Dataset
CIFAR-10 is a commonly used dataset for generic object classification, which includes 10 different types of objects, such as vehicles and animals. The CIFAR-10 data are all three-channel RGB images, which can be represented as 3-order tensors of size \(3\times 32\times 32\). All the samples of CIFAR10 come from the real world, and the size and background of the objects vary significantly, which places higher demands on the performance of the classification network.
Table 2 shows the classification results and the number of parameters of the related models, including NNs, CNNs, TNNs, and QTNNs. Across the different network topologies, the classification performance of TNNs is significantly better than that of NNs. This further shows that preserving the structural information of the original data is highly beneficial, and even necessary, for the construction of neural networks. As mentioned above, TNN models are more inclined to analyze the overall structural properties of the original data and ignore local neighborhood features. However, the local features of the data are of great significance for the analysis of images, videos, and similar data, which is the main reason why TNN models classify worse than QTNN models. In addition, we adopt a single convolution layer with 25 convolution kernels to extract local features of the original data, and the size of each kernel is \(3\times 5\times 5\). Comparing CNNs with multiple fully connected layers to QTNNs with multiple tensor layers, the classification accuracy of QTNNs is clearly better than that of CNNs with only one fully connected layer. When we introduce multiple fully connected layers into CNNs, the classification accuracy improves significantly, but at the cost of a huge number of parameters, which is extremely inefficient compared with improving model performance by adding tensor layers to QTNNs.
Table 2. Comparison of Classification Accuracy (%) and the Number of Parameters of Different Networks on CIFAR-10
To further show that exploiting the local features of the data is conducive to improving the classification accuracy of the different models, we use three convolution layers in CNNs and QTNNs to expand the local field of view of the convolution features. The numbers of convolution kernels in the three layers are 16, 32, and 64, respectively, and the sizes of the corresponding kernels are \(3\times 5\times 5\), \(16\times 5\times 5\), and \(32\times 5\times 5\). We also compare the performance of CNNs and QTNNs with a single convolution layer, where the first layer of the corresponding networks is constructed from 64 convolution kernels of size \(3\times 5\times 5\). For the QTNNs, we do not apply a pooling operation after each convolution, which guarantees the integrity of the structural information of the original data. As shown in Table 3, the performance of both kinds of networks improves significantly as the number of convolution layers increases. Moreover, the classification accuracy of QTNNs is generally better than that of CNNs.
Table 3. Classification Accuracy (%) of Various Models with Different Number of Convolution Layers on CIFAR-10
6.3 The Experiments of Deep Tensor Neural Networks on the CIFAR-10 and CIFAR-100
To meet the needs of industrial and everyday applications, we further integrate the QTNN network into the deep VGG model and build a deeper tensor network to improve classification accuracy. We evaluate the performance of the related models on two datasets, i.e., CIFAR-10 and CIFAR-100. The CIFAR-100 dataset is more complex than CIFAR-10: it contains 100 subclasses, each with 600 images (500 training images and 100 test images), and the 100 classes are grouped into 20 broad categories, so each image carries a "fine" label and a "coarse" label. The experimental result of VGG16 is used as a reference for the subsequent experiments. In this experiment on deep classification networks, we introduce batch normalization (BN) layers to prevent the distribution of the activation inputs from gradually drifting during training as the network deepens, forcing the distribution back to a standard normal distribution with mean 0 and variance 1.
To construct a deeper network model similar to the VGG structure more intuitively, we can broadly consider that the deep tensor neural network consists of two basic layers, namely, the convolution layer and the tensor layer. The convolution layer is used to extract local features of the original data, and the tensor layer is used to further analyze the features while maintaining the data structure. We then construct two different network topologies based on these two basic layers, namely, topology a and topology b.
As shown in Figure 5, the first structure mainly uses convolution operations to extract features of the data without introducing pooling to change the size of the original data; the pooling operation is completed indirectly by the subsequent tensor layers. The second structure reduces the size of the data by introducing pooling while convolutions extract the data features. The classification accuracy of the three models is summarized in Table 4, from which it can be clearly seen that, in the construction of deep networks, the pooling operation helps to improve the classification accuracy of the model and reduce the time for parameter optimization. In fact, the parameters of the deep tensor neural network are mainly contributed by the convolution layers, and the parameters of the tensor layers are almost negligible, which is markedly different from CNNs and NNs. In addition, through the higher-order quantization operations, we make the features obtained by the convolution operations keep the same spatial structure as the original data and learn more compact neighborhood features to avoid redundancy. Although the classification accuracy of VGG is slightly higher than that of the tensor neural networks, VGG adopts a much more complex network structure, including more convolution layers for feature extraction and more fully connected layers with large numbers of parameters, which indirectly verifies the efficiency of tensor neural networks.
Fig. 5. The structure of deep tensor neural networks.
Table 4. Classification Accuracy (%) of Deep Tensor Neural Networks
7 CONCLUSION
In this article, we propose a quantized tensor neural network (QTNN), which effectively integrates the simplicity of tensor networks with the powerful learning ability of neural networks. Traditional tensor neural network models tend to analyze the overall structural information of the data while ignoring its local structural features; QTNN addresses this problem by further introducing convolution operations, realizing efficient processing of the data while maintaining the original data structure. Moreover, through higher-order quantization operations, the features obtained by the convolution operations keep the same spatial structure as the original data, allowing the network to learn more compact neighborhood features and avoid redundancy. We also develop a high-order error back-propagation algorithm based on tensor networks for parameter optimization. Experimental results on three representative datasets further demonstrate the effectiveness and simplicity of the proposed models.
By combining tensor networks with neural networks, the models proposed in this article achieve good performance on classification tasks. In future work, we will continue our research in the following directions: (1) establishing the potential correlations among the hidden-layer features of deep neural networks and reducing redundant features via tensor networks; (2) based on the theory of multi-linear algebra, further exploring tensor neural network models with stronger interpretability.
- [1] 2013. Understanding data fusion within the framework of coupled matrix and tensor factorizations. Chemomet. Intell. Lab. Syst. 129 (2013), 53–63.
- [2] 2019. Compression and interpretability of deep neural networks via Tucker tensor layer: From first principles to tensor valued back-propagation. arXiv preprint arXiv:1903.06133 (2019).
- [3] 2014. Robust face clustering via tensor decomposition. IEEE Trans. Cyber. 45, 11 (2014), 2546–2557.
- [4] 2017. Tensor factorized neural networks. IEEE Trans. Neural Netw. Learn. Syst. 29, 5 (2017), 1998–2011.
- [5] 2016. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory.
- [6] 2015. Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Mining Knowl. Discov. 29, 1 (2015), 203–236.
- [7] 2010. Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal. Applic. 31 (2010), 2029–2054. DOI: https://doi.org/10.1137/090764189
- [8] 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
- [9] 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
- [10] 2009. Tensor decompositions and applications. SIAM Rev. 51 (2009), 455–500.
- [11] 2020. Tensor regression networks. J. Mach. Learn. Res. 21 (2020), 1–21.
- [12] 2014. A tensor-based approach for big data representation and dimensionality reduction. IEEE Trans. Emerg. Topics Comput. 2 (2014), 280–291.
- [13] 2014. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553 (2014).
- [14] 2012. Tensor distance based multilinear globality preserving embedding: A unified tensor based dimensionality reduction framework for image and video classification. Expert Syst. Applic. 39 (2012), 10500–10511.
- [15] 2015. Tensorizing neural networks. arXiv preprint arXiv:1509.06569 (2015).
- [16] 2011. Tensor-train decomposition. SIAM J. Sci. Comput. 33, 5 (2011), 2295–2317.
- [17] 2010. Tensor decompositions for feature extraction and classification of high dimensional datasets. Nonlin. Theor. Applic. 1 (2010), 37–68.
- [18] 2019. Multiscale structure tensor for improved feature extraction and image regularization. IEEE Trans. Image Process. 28 (2019), 6198–6210.
- [19] 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- [20] 2015. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition.
- [21] 2019. Data fusion in cyber-physical-social systems: State-of-the-art and perspectives. Inf. Fusion 51 (2019), 42–57.
- [22] 2017. Tensor-train recurrent neural networks for video classification. In International Conference on Machine Learning. 3891–3900.
- [23] 2015. Deep computation model for unsupervised feature learning on big data. IEEE Trans. Serv. Comput. 9, 1 (2015), 161–171.
- [24] 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition.
- [25] 2017. A tensor-based multiple clustering approach with its applications in automation systems. IEEE Trans. Industr. Inform. 14 (2017), 283–291.