Abstract
Convolutional neural networks (CNNs) have dramatically improved the accuracy of image, video, and audio processing for tasks such as object recognition, image segmentation, and interactive speech systems. CNNs require large amounts of computing resources for both training and inference, primarily because the convolution layers are computationally intensive. Fast convolution algorithms such as Winograd convolution can greatly reduce the computational cost of these layers. However, Winograd convolution has poor numeric properties, such that greater savings in computation cause exponentially increasing floating point errors.
A defining feature of each Winograd convolution algorithm is a set of real-valued points where polynomials are sampled. The choice of points impacts the numeric accuracy of the algorithm, but the optimal set of points for small convolutions remains unknown. Existing work considers only small integers and simple fractions as candidate points. In this work, we propose a novel approach to point selection using points of the form \(\lbrace -\frac{1}{c},-c,c,\frac{1}{c}\rbrace\) using the full range of real-valued numbers for c. We show that groups of this form cause cancellations in the Winograd transform matrices that reduce numeric error. We find empirically that the error for different values of c forms a rough curve across the range of real-valued numbers. It is therefore possible to localize the values of c that lead to lower error. We show that it is not necessary to choose integers or simple fractions as evaluation points, and that lower errors can be achieved with non-obvious real-valued points. We study a range of sizes for small convolutions and achieve reduction in error ranging from 2% to around 59% for both 1D and 2D convolution, when compared to the state of the art. Furthermore, we identify patterns in cases when we select a subset of our proposed points that will always lead to a lower error. Finally, we implement a complete Winograd convolution layer and use it to run state-of-the-art deep convolutional neural networks on real datasets and show that our proposed points achieve reduction in error, ranging from 22% to 63%, while also showing how an increased Winograd output size can result in execution speed-up for some cases.
1 INTRODUCTION
Resource-constrained embedded devices play an important role as edge devices in the Internet of Things (IoT) network and autonomous systems such as driverless cars. To truly enable these technologies, there is a critical need to execute resource-intensive algorithms like convolutional neural networks (CNNs) on edge devices. CNNs are used extensively for applications like facial recognition, image segmentation, speech recognition, and object detection [18, 26, 28, 29]. For such resource-intensive algorithms, computing at the edge is essential because sending large quantities of data over a network to a cloud-based infrastructure consumes much energy and may be too slow for real-time processing of data. It is also a challenge to use resource-constrained devices for such resource-intensive applications due to the limited processing, memory, and/or battery capacity of many edge devices [5, 27].
The computational cost of a CNN is primarily derived from convolution layers. A large array of input data is convolved with a large number of much smaller convolution kernels. A simple implementation of convolution requires \(O(kn)\) operations to convolve an input of size n with a size k kernel. So-called fast convolution algorithms can greatly reduce the operation count. For example, the input and kernel can be transformed to the Fourier domain using the fast Fourier transform (FFT) with only \(O(n \log_2(n))\) and \(O(k \log_2(k))\) operations, respectively. In the Fourier domain, convolution requires only \(n+k-1\) complex-number multiply operations. Thus, the dominant term in FFT convolution is typically \(O(n \log_2(n))\) operations, and FFT has been successfully used for CNN convolution [20].
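As a concrete illustration of the FFT route (our own sketch, not part of the original presentation), the following NumPy code pads both sequences to length \(n+k-1\), multiplies pointwise in the Fourier domain, and transforms back:

```python
import numpy as np

def fft_conv1d(d, g):
    """Linear convolution via the FFT: pad both sequences to length n + k - 1,
    multiply pointwise in the Fourier domain, and transform back."""
    L = len(d) + len(g) - 1
    D = np.fft.rfft(d, L)          # transform of the (zero-padded) input
    G = np.fft.rfft(g, L)          # transform of the (zero-padded) kernel
    return np.fft.irfft(D * G, L)  # L pointwise complex multiplies, then inverse

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])
np.allclose(fft_conv1d(d, g), np.convolve(d, g))  # matches direct convolution
```

The pointwise products are complex, which is the overhead Winograd convolution avoids.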
An alternative to FFT is Winograd’s fast convolution algorithm. Like FFT convolution, Winograd’s approach applies transforms to the inputs and output. However, unlike FFT, Winograd’s algorithm performs the core of the convolution in \(n + k - 1\) real-number multiply operations. Complex-number multiplication requires at least four times as many arithmetic operations as real-number multiplication. Thus, Winograd convolution can be significantly faster than FFT convolution, particularly for the small convolution kernels found in CNNs. Indeed, Winograd’s approach is optimal with respect to the number of multiply operations in the inner core of the algorithm [34]. Thus, with an input of length n and a kernel of length k, Winograd requires only the theoretical minimum of \(n+k-1\) general multiplications (Hadamard product operations) [1, 35]. Note that Lavin and Gray’s seminal paper on Winograd convolution for CNNs actually uses a closely related approach, Toom-Cook convolution. All Toom-Cook convolution algorithms have an equivalent Winograd algorithm, and in the remainder of the article we follow the literature’s convention of using the terms interchangeably.
In Winograd convolution, the input and kernel are sampled at a given set of points using transform matrices. This is followed by a Hadamard product (element-wise multiplication), which is then converted back to the original space by an inverse transform. In CNNs the kernels are typically small, most commonly \(3 \times 3\) or \(5 \times 5\), but the inputs are large. The common practice is to break the input into sub-blocks and convolve the kernel with each sub-block. Winograd gains its efficiency from computing multiple output points at once. Table 1 shows the number of pairwise multiplication operations needed for various input block sizes and for kernels of size \(3 \times 3\) and \(5 \times 5\). Large block sizes need more multiplies in total for the block but result in less computation per output point.
Table 1. Number of Scalar Multiply Operations Used by the Hadamard Product Step of 2D Winograd Convolution for Different Input Block Sizes
Maximizing the input block size minimizes the computation needed per point in the output. However, Winograd convolution suffers from numeric accuracy issues: the floating point (FP) error increases exponentially with the size of the convolution [2]. Thus, we cannot simply use a large block size to minimize computation. Our block size must also be small enough to keep the numerical error low. However, if we can find ways to reduce the numerical error, it will allow us to use larger block sizes and require less computation. Our work aims to reduce numeric errors in Winograd convolution so that larger tile sizes can be used for greater efficiency.
In this work, we propose a new approach to improving the numerical accuracy of Winograd convolution, with the goal of reducing the computation and energy needed for CNNs. We make the following contributions:
We propose using symmetric points of the form \(\lbrace -\frac{1}{c}, -c, c, \frac{1}{c}\rbrace\), which we show cause cancellations in the Winograd transform matrices and thus reduce numeric error.
We demonstrate experimentally that the error curve for different values of these symmetric points is roughly smooth over the range of real-valued numbers. Whereas prior work has focused primarily on exhaustively searching sets of rational numbers with small numerators and denominators, the smooth error curve allows us to find rational or irrational real-number interpolation points that reduce the numeric error.
For larger block sizes, we extend our approach to two variables \(c, d\) with the same form, i.e., \(\lbrace -\frac{1}{d},-\frac{1}{c},-d,-c,c,d,\frac{1}{c},\frac{1}{d}\rbrace\), and demonstrate that we can find complementary pairs of values for c and d that reduce numeric errors.
For intermediate block sizes for which we cannot simply choose four symmetric points, we show how a partial use of our symmetric-point strategy can provide some improvement in numeric error.
We evaluate our proposed symmetric points strategy and find that for most important cases for CNNs it outperforms the best-known existing point selection strategy.
We also evaluate our proposed symmetric points on state-of-the-art deep convolution networks using a real-world dataset and show that they outperform the existing point selection strategy in the majority of cases.
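The core idea of the search can be sketched in a few lines of NumPy. The code below is our own simplified stand-in for the methodology, not the authors' implementation: it builds plain (unmodified) Toom-Cook matrices for the symmetric point set \(\lbrace -\frac{1}{c},-c,c,\frac{1}{c}\rbrace\), runs the algorithm in float32, and measures error against a float64 direct computation, scanning c over a coarse grid.

```python
import numpy as np

def toom_cook_matrices(points, m, k):
    # Plain (unmodified) Toom-Cook transform matrices for n = m + k - 1 points.
    # Here the Lagrange scaling is left inside the inverse Vandermonde matrix.
    n = m + k - 1
    V = np.vander(points, n, increasing=True)
    A_T = np.vander(points, m, increasing=True).T
    G = np.vander(points, k, increasing=True)
    B_T = np.linalg.inv(V).T
    return A_T, G, B_T

def fp32_error(c, trials=200, m=2, k=3, seed=0):
    # Mean absolute error of float32 Winograd correlation with the symmetric
    # point set {-1/c, -c, c, 1/c}, against a float64 direct computation.
    rng = np.random.default_rng(seed)
    A_T, G, B_T = toom_cook_matrices(np.array([-1/c, -c, c, 1/c]), m, k)
    A32, G32, B32 = (M.astype(np.float32) for M in (A_T, G, B_T))
    total = 0.0
    for _ in range(trials):
        d = rng.standard_normal(m + k - 1)
        g = rng.standard_normal(k)
        exact = np.correlate(d, g, 'valid')             # float64 reference
        y = A32 @ ((G32 @ g.astype(np.float32)) * (B32 @ d.astype(np.float32)))
        total += np.abs(y - exact).mean()
    return total / trials

# The error forms a rough curve in c (c = 1 is excluded: the points coincide),
# so a coarse scan is enough to localize promising values of c.
cs = np.linspace(0.3, 0.9, 13)
errs = [fp32_error(c) for c in cs]
```

Because the error varies roughly smoothly with c, a coarse scan followed by local refinement can localize good points without exhaustively enumerating candidates.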
This article is organized as follows: Section 2 provides a literature review and shows how our work differs from it. Section 3 gives an introduction to implementing convolution in deep neural networks using fast convolution methods, while Section 4 describes the Toom-Cook algorithm, a form of fast convolution, in some detail. The proposed point selection and all the relevant details are given in Section 5, while the results comparing our proposal to the state of the art are presented in Section 6. Section 7 outlines some additional experiments performed in an attempt to improve the numeric accuracy of the algorithm, and Section 8 outlines future research directions. Section 9 concludes the article.
2 RELATED WORK
The Toom-Cook algorithm treats the input and kernel as polynomials and computes the product of the polynomials using Lagrange interpolation. The approach was originally designed for multiplication of large multi-word integers [6, 32], but it can also be used for convolution. Later, Winograd [34] generalized Toom-Cook using the Chinese remainder theorem and proved that it is optimal with respect to the number of pairwise multiplications. In Toom-Cook the transforms correspond to sampling and interpolating polynomials at points on the real number line. The choice of points determines the values in the transform matrices. By choosing appropriate points, the cost of computing the transforms can be reduced. In particular, using integer and simple fraction points can simplify the transforms by making some matrix values zero, \(\pm 1\), or constants that allow fixed-point multiplication using shifts and adds. A great deal of work on Toom-Cook and Winograd convolution focuses on finding transform matrices that minimize the number of operations needed for the transforms [3, 23, 31]. Bodrato [4] presents a search algorithm for finding transforms that require a minimum number of integer add and shift operations.
Lavin and Gray were the first to apply Winograd (Toom-Cook) convolution to CNNs and show that Winograd convolution is faster than direct convolution on all layers of the VGG network [19]. Whereas much signal processing has traditionally been performed in fixed point, CNNs are commonly implemented using floating point. Lavin and Gray note that Lagrange interpolation has poor numerical properties that can result in large floating point errors.
There are two main existing approaches to selecting interpolation points that minimize the error in floating point Winograd convolution. First, the Chebyshev nodes are a set of n interpolation points given by \(x_k = \cos(\frac{2k-1}{2n}\pi), k = 1,\ldots, n\). The Chebyshev nodes are designed to reduce Runge’s phenomenon, that is, inaccuracies around the boundaries of the interpolated range.
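The Chebyshev node formula is easy to compute directly; a small sketch (our own illustration):

```python
import numpy as np

def chebyshev_nodes(n):
    # x_k = cos((2k - 1) / (2n) * pi) for k = 1..n, on the interval [-1, 1]
    k = np.arange(1, n + 1)
    return np.cos((2 * k - 1) / (2 * n) * np.pi)

chebyshev_nodes(4)  # four candidate interpolation points, symmetric about zero
```

Note that the nodes come in \(\pm\) pairs, so they already share the sign symmetry exploited by the point sets proposed in this article.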
Second, Barabasz et al. [1] present a theoretical error analysis of FP accuracy for Winograd algorithms and derive a bound on the worst-case FP error. They show that the error bound for the Toom-Cook algorithm grows exponentially with the size of the convolution and propose methods to reduce the error. They propose a canonical evaluation ordering of summation based on Huffman coding and find that it reduces FP error in Winograd-based convolution by 12% to 14%. They also study mixed-precision and pairwise summation algorithms. They also consider the problem of selecting suitable interpolation points to reduce the numerical error. The existing best practice in the signal processing literature held that fractions with small numerators and denominators are the best points for fixed-point Winograd convolution. Barabasz et al. empirically evaluated many combinations of these simple fractional points and found good sets of points for the most common sizes used in CNNs.
Our work differs from Barabasz et al. [1] in three main ways. First, unlike prior work, which considers only simple fractions as suitable points, we consider the full range of rational and irrational numbers as potential points. Second, we propose combinations of points in a specific form that causes terms in the transform matrices to cancel one another: \(-c, -\frac{1}{c}, \frac{1}{c}, c\). We show that points of this form have a mostly smooth error curve, which allows us to find close-to-optimal values of c within a manageable search time. The points proposed by Barabasz et al. are purely empirical in nature, whereas we propose points of a specific form with a particular error distribution that allows us to localize points that reduce the convolution error. Finally, we evaluate our proposed points on synthetic and real CNN data and show that they result in significantly lower errors.
Fernandez-Marques et al. present a Winograd-aware quantized network using two different output sizes, \(4\,\times \,4\) and \(2\,\times \,2\), which incorporates the numerical inaccuracies of the Winograd transformations into the learning of model parameters in a CNN [7]. Their Winograd-aware ResNet-18 network results in a \(2.66\times\) speedup on the CIFAR-10 dataset.
Zhao et al. present a combination of the Winograd convolution and Strassen matrix multiplication algorithms to reduce computational complexity [38]. For different matrix sizes, they show that a combination of Winograd and Strassen reduces the number of multiplications as compared to conventional convolution, at the cost of some increase in addition operations. However, the number of extra addition operations is much less than the reduction in the number of multiplications. This reduction saves 75% of the execution time as compared to other alternatives.
To address the problem of an increased numerical error when Winograd convolution algorithms are applied to large input and kernel tiles, Huang et al. present a novel way to decompose large kernel tiles into several small kernels [12] at the cost of some increased computation. This allows the Winograd algorithm to be applied to a wide range of general convolutions and achieves a speedup of approximately \(2\times\) without affecting numerical accuracy significantly.
From an implementation perspective on resource-constrained embedded devices, a number of contributions have been reported in the literature. Maji et al. present an implementation of Winograd using the ARMv8-A NEON SIMD instruction set [22], using a region-wise multi-channel scheme, and show a speedup of up to 60% in mean absolute runtime. Xygkis et al. present an efficient, application-independent, Winograd-specific software kernel for edge devices like the Intel/Movidius Myriad2 platform, showing up to 42% improvement in evaluation runtime for VGG [37].
Podili et al. present an implementation of Winograd convolution in an FPGA-based accelerator using double-buffering to reduce memory latency [25]. They also propose a data layout to reduce memory bandwidth and claim to achieve \(1.2\times\) improvement in throughput by using \(3\times\) fewer multipliers and \(2\times\) less on-chip memory. Although they claim that their proposed implementation does not impact accuracy, they do not present data to support this claim. Xiao et al. present a fused layer architecture with dynamic programming to determine the structure of fusion and an automatic tool-flow from Caffe to FPGA using Vivado HLS [36]. They fuse multiple layers in CNNs by reusing the intermediate data to save memory transfer, which is important to performance on their memory-bandwidth-limited FPGA.
Vincent et al. [33] propose to reduce the condition number of the Vandermonde matrices by scaling the convolution matrices. They find that this approach allows them to reduce the error in just one specific case: when convolving a \(5\times 5\) kernel to create a \(9\times 9\) output block for AlexNet and Inception v3.
3 DEEP CONVOLUTIONAL NEURAL NETWORKS BASED ON FAST CONVOLUTION
A deep convolutional neural network consists of many different layers, of which the convolutional layer is the most compute intensive. The conventional convolution layer computes an individual element of the output feature map by multiplying and accumulating the corresponding input feature map with filters. However, the convolutions involved in a CNN are short, with popular filter sizes of \(3\times 3, 5\times 5,\) and \(11\times 11\) [18, 28, 29]. Winograd [35] demonstrated that the existing method of Toom and Cook [6, 32] generates optimal convolution algorithms in terms of the minimum number of general multiplications for fixed filter kernel and input tile sizes. These convolution algorithms were also shown by Lavin and Gray [19] to be around \(2\times\) as fast as direct convolution.
The generic 2D convolution can be represented as follows: (1) \(\begin{equation} Y_{i,k,x,y} = \sum _{c=1}^{C}\sum _{v=1}^{R}\sum _{u=1}^{S} D_{i,c,x+u,y+v}G_{k,c,u,v}, \end{equation}\) which reads a 2D input feature map of size \(H\times W\) with C channels and convolves it with a bank of K filters with C channels. Each filter kernel has a dimension of \(R\times S\). The family of Winograd convolution algorithms performs this convolution by (1) transforming the input and kernel tile into the Winograd domain, (2) performing a point-wise multiplication (the actual convolution), and (3) transforming the resultant product back to the original domain using an inverse transform. Here, convolution in the spatial domain is transformed to element-wise multiplication in the Winograd domain.
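For reference, a direct (unoptimized) implementation of Equation (1) can be sketched as follows. This is our own illustration, with the batch index dropped; note that, as is conventional in CNNs, the operation is actually a cross-correlation:

```python
import numpy as np

def direct_conv2d(D, G):
    """Direct multi-channel 2D convolution per Equation (1) (a cross-correlation,
    as is conventional in CNNs): D is C x H x W, G is K x C x R x S;
    returns the K-channel output feature map."""
    C, H, W = D.shape
    K, _, R, S = G.shape
    Ho, Wo = H - R + 1, W - S + 1
    Y = np.zeros((K, Ho, Wo))
    for k in range(K):                 # each output channel...
        for c in range(C):             # ...accumulates over all input channels
            for y in range(Ho):
                for x in range(Wo):
                    Y[k, y, x] += np.sum(D[c, y:y+R, x:x+S] * G[k, c])
    return Y
```

The Winograd algorithms described next compute exactly this quantity but with far fewer multiplications.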
The Winograd family of convolution algorithms consists of a wide variety of algorithms with different tradeoffs. To generate these algorithms, one requires a polynomial, which may be linear or super-linear. Linear polynomials generate algorithms that are equivalent to those generated by the Toom-Cook method; they guarantee a theoretical minimum number of element-wise multiplication operations [2]. In this article we focus on these optimal Winograd convolution algorithms that minimize the element-wise multiplication and are equivalent to the set of Toom-Cook convolution algorithms. A wider range of non-optimal Winograd algorithms are investigated by Barabasz and Gregg in [2].
4 TOOM-COOK ALGORITHM
The Toom-Cook algorithm is a linear convolution algorithm based on representing convolution as a polynomial product. The linear convolution of g of size M and d of size N can be represented as (2) \(\begin{equation} y_k = \sum _{n=0}^{k}g_{k-n}d_n, \quad 0 \le k \lt L, \end{equation}\) where \(L = M+N-1\) and \(g_m = 0\) if \(m \ge M\) and \(d_n = 0\) if \(n \ge N\). Associating the polynomials \(g(x)\) and \(d(x)\), of degrees \(M-1\) and \(N-1\), respectively, with the vectors g and d of sizes M and N, respectively, of Equation (2), direct computation of Equation (2) is equivalent to the polynomial product [31]: (3) \(\begin{equation} y(x) = d(x)g(x). \end{equation}\)
To compute this polynomial product, assuming both polynomials are of degree N, Toom-Cook first evaluates each polynomial at \(2N-1\) distinct points \(\alpha _i\). This is equivalent to transforming the two polynomials to the Toom-Cook (or Winograd) domain. Point-wise multiplications of \(g(\alpha _i)\) and \(d(\alpha _i)\) to produce \(y(\alpha _i) = g(\alpha _i)d(\alpha _i)\) follows (the actual convolution). Finally, using the Lagrange interpolation, \(y(x)\) is recovered as [15, 24] (4) \(\begin{equation} y(x) = \sum _{i=0}^{n-1} y(\alpha _i) \frac{\prod _{j\ne i}(x-\alpha _j)}{\prod _{j\ne i}(\alpha _i - \alpha _j)}. \end{equation}\)
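The evaluate-multiply-interpolate pipeline of Equations (2)-(4) can be checked numerically with standard NumPy polynomial helpers (our own sketch; `np.polyfit` here plays the role of the Lagrange interpolation):

```python
import numpy as np

# Toom-Cook multiplication of two degree-2 polynomials, sampled at
# 2*3 - 1 = 5 distinct points.
g = np.array([1.0, 2.0, 3.0])        # g(x) = 1 + 2x + 3x^2 (ascending coeffs)
d = np.array([4.0, 5.0, 6.0])        # d(x) = 4 + 5x + 6x^2
alphas = np.array([0.0, 1.0, -1.0, 2.0, -2.0])

ge = np.polyval(g[::-1], alphas)     # evaluate g at the points (the transform)
de = np.polyval(d[::-1], alphas)
ye = ge * de                         # pointwise products y(alpha_i)

# Interpolation (Equation (4)) recovers the coefficients of y(x)
y = np.polyfit(alphas, ye, deg=len(alphas) - 1)[::-1]

np.allclose(y, np.convolve(g, d))    # equals the linear convolution of g and d
```

Only five general multiplications (the line `ge * de`) are needed; everything else is a linear transform with fixed constants.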
Lagrange interpolation states that given a set of n distinct points (\(\alpha _i\)) and corresponding values \(y(\alpha _i)\), the polynomial \(y(x)\) can be uniquely determined, provided the degree of \(y(x)\) is less than n.
Polynomial evaluation can be determined by applying the following Vandermonde Matrix to a given polynomial [15]: (5) \(\begin{equation} V[\alpha _0, \dots, \alpha _{n-1}] = \begin{bmatrix} 1 & \alpha _0 & \alpha _0^2 & \dots & \alpha _0^{n-1} \\ 1 & \alpha _1 & \alpha _1^2 & \dots & \alpha _1^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \alpha _{n-1} & \alpha _{n-1}^2 & \dots & \alpha _{n-1}^{n-1} \\ \end{bmatrix}, \end{equation}\) where V is the linear transform matrix of the Chinese remainder theorem (CRT), which maps \(\mathbb {F}[x]/f(x)\) onto \(\mathbb {F}[x]/f_1(x) \times \dots \times \mathbb {F}[x]/f_t(x)\). Here, \(f_1(x) \dots f_t(x)\) are the irreducible rational factors of \(f(x)\). In this way, the Toom-Cook method is a special case of the Winograd family of algorithms, which are based on the CRT.
Applying the inverse Vandermonde matrix (\(V^{-1}\)) returns the original coefficients and thus \(V^{-1}\) corresponds to interpolation and can be computed using either Lagrange’s formulation of Equation (4) [15] or CRT applied on polynomials [31]. Mathematically, the whole transform can be represented as [1] (6) \(\begin{equation} V^{-1}\left(V_{d}d \odot V_{g}g\right)\!, \end{equation}\) where \(\odot\) represents point-wise (Hadamard) multiplication.
Applying the matrix exchange theorem, we obtain the following formulation: (7) \(\begin{equation} V^{-1}\left(V_{d}d \odot V_{g}g\right) = V^{-1} {\it Diag}\left(V_{g}g\right) V_{d}d = V_{d}^T\left(V_{g}g \odot V^{-T}d\right)\!. \end{equation}\)
Representing \(V_{d}^T\) as \(A^T\), \(V_{g}\) as \(G,\) and \(V^{-T}\) as \(B^T\) gives us the following formulations for 1D transform: (8) \(\begin{equation} Y = A^T\left[(Gg) \odot (B^Td)\right]\!. \end{equation}\)
The minimal 2D algorithm (\(F(m\,\times \,m,r\,\times \,r\))) can be obtained by nesting the 1D algorithm (\(F(m,r)\)) with itself to obtain (9) \(\begin{equation} Y = A^T\left[\left[GgG^T\right] \odot \left[B^TdB\right]\right]A. \end{equation}\)
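Equations (8) and (9) can be made concrete with the well-known \(F(2,3)\) transform matrices (interpolation points \(0, 1, -1\) plus the "infinity" point). The following sketch is our own check; note the algorithm computes correlation, matching CNN usage:

```python
import numpy as np

# Transform matrices of the well-known F(2, 3) algorithm
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)
G = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]])
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile, n = 4
g = np.array([1.0, -1.0, 2.0])       # kernel, k = 3

# 1D: Equation (8) -- two outputs from four general multiplications.
y1 = A_T @ ((G @ g) * (B_T @ d))     # equals np.correlate(d, g, 'valid')

# 2D: Equation (9), the 1D algorithm nested with itself.
D = np.outer(d, d)                   # a rank-1 4x4 input tile, for illustration
Gk = np.outer(g, g)                  # 3x3 kernel
y2 = A_T @ ((G @ Gk @ G.T) * (B_T @ D @ B_T.T)) @ A_T.T
```

For the rank-1 tile and kernel above, the nested 2D result is simply the outer product of the 1D result with itself, which makes the nesting easy to verify.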
4.1 Toom-Cook: Matrix Construction
The transform matrices, \(A/A^T\), \(B/B^T\), and \(G/G^T\), identified in Equations (8) and (9) are based on n different real points \(\alpha _0, \dots, \alpha _{n-1}\). These are the real-valued points where the two input polynomials are evaluated. In conventional Toom-Cook convolution, an input of length n is convolved with a kernel of length k to produce a full linear convolution output of length \(n + k - 1\) using \(n + k - 1\) general multiplications. However, since the formulations of Equations (8) and (9) involve matrix interchange, Lavin and Gray expressed their convolution in terms of the output size \(m,\) which is computed using a kernel of length k and input size \(n = m+k-1\) using \(m+k-1\) general multiplications.
Matrices \(A^T\) and G are Vandermonde matrices of sizes \(m\,\times \,n\) and \(n\,\times \,k\), respectively. The inverse Vandermonde matrix \(B^T\) is of size \(n\times n\). The Lagrange formulation of Equation (4) shows that each row of the \(B^T\) matrix carries a scaling factor of \(\frac{1}{N_i}\), where \(N_i = \prod _{j\ne i}(\alpha _i-\alpha _j)\). This scaling factor can either be applied to the output of the pairwise multiplication of Toom-Cook convolution or embedded into the matrix G (the preferred choice) [1].
The generalized forms of the matrices are thus (10) \(\begin{equation} A^T =\begin{bmatrix} 1 & 1 & \dots & 1 \\ \alpha _0 & \alpha _1 & \dots & \alpha _{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha _0^{m-1} & \alpha _1^{m-1} & \dots & \alpha _{n-1}^{m-1}\\ \end{bmatrix}, \quad G = \begin{bmatrix} \frac{1}{N_0} & \frac{\alpha _0}{N_0} & \dots & \frac{\alpha _0^{k-1}}{N_0}\\ \frac{1}{N_1} & \frac{\alpha _1}{N_1} & \dots & \frac{\alpha _1^{k-1}}{N_1}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{N_{n-1}} & \frac{\alpha _{n-1}}{N_{n-1}} & \dots & \frac{\alpha _{n-1}^{k-1}}{N_{n-1}}\\ \end{bmatrix} \end{equation}\) and (11) \(\begin{equation} B^T = \begin{bmatrix} M_{0,0} & \dots & M_{0,n-1} \\ \vdots & \ddots & \vdots \\ M_{n-1,0} & \dots & M_{n-1,n-1} \end{bmatrix}, \end{equation}\) where \(M_{i,j}\) is the coefficient of \(x^j\) in the Lagrange numerator polynomial \(M_i(x) = \frac{M(x)}{m_i(x)}\), with \(M(x) = (x-\alpha _0)\dots (x-\alpha _{n-1})\). The \(m_i(x)\) are the irreducible factors of the polynomial \(M(x)\); \(M(x)\) is also referred to as a reducing polynomial, and the choice of a good reducing polynomial affects the efficiency of the algorithm [31]. For example, if the Lagrange interpolation points are \(0,\pm 1\), then \(M(x) = x(x-1)(x+1),\) with \(m_i(x)\) being \(x, x-1,\) and \(x+1\). Based on these polynomials, the convolution operation can also be mathematically expressed as [31] (12) \(\begin{equation} y(x) = g(x)d(x) \mod {M}(x). \end{equation}\)
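The construction of \(A^T\), G, and \(B^T\) from an arbitrary set of distinct points can be sketched directly in NumPy (our own illustration, with the Lagrange scaling \(\frac{1}{N_i}\) embedded in G and the rows of \(B^T\) holding the coefficients of the \(M_i(x)\) polynomials):

```python
import numpy as np

def winograd_matrices(points, m, k):
    """Construct A^T (m x n), G (n x k), and B^T (n x n) from n = m + k - 1
    distinct interpolation points, with the Lagrange scaling 1/N_i in G."""
    n = m + k - 1
    assert len(points) == n
    A_T = np.vander(points, m, increasing=True).T
    N = np.array([np.prod([a - b for b in points if b != a]) for a in points])
    G = np.vander(points, k, increasing=True) / N[:, None]
    # Row i of B^T holds the coefficients of M_i(x) = prod_{j != i} (x - alpha_j)
    B_T = np.array([np.poly([b for b in points if b != a])[::-1] for a in points])
    return A_T, G, B_T

A_T, G, B_T = winograd_matrices([0.0, 1.0, -1.0, 2.0], m=2, k=3)
d = np.random.default_rng(1).standard_normal(4)
g = np.array([1.0, 2.0, 3.0])
y = A_T @ ((G @ g) * (B_T @ d))      # equals np.correlate(d, g, 'valid')
```

Changing the list of points changes only the constants in the three matrices; the structure of the algorithm, and its \(m+k-1\) general multiplications, stay the same.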
The FP error in the Toom-Cook convolution is largely dependent on the point selection and the size of input/output tiles. With regard to the interpolation points, different points give different FP errors, both in rounding and in representation [1]. Furthermore, larger output and filter sizes reduce the number of element-wise multiplications per computed output, but they also are a source of numerical instability [33]. The growth in the number of additions and constant multiplications in the transforms of input, kernel, and output is quadratic with the tile size. However, each input tile is convolved with many different kernels, and each kernel is convolved with many different input tiles. Thus, although the cost of the transforms is quadratic, this cost is amortized over many uses of the same transformed input tile or kernel.
Note that the point selection may have a small impact on the execution time of Winograd convolution for CNNs. Where the transform matrix contains a zero value, no arithmetic is needed to implement that part of the transform. Similarly, where an element of the transform matrix has the value \(\pm 1\), only addition and no multiplication is needed to implement that part of the transform. Thus, the point selection affects both the floating point error and the number of operations needed to implement the transform matrices. Indeed, zero values in the transform matrices tend to reduce both the computational cost and the arithmetic error. However, the overall computation cost of the transforms is relatively low for Winograd convolution in CNNs. As noted above, each input tile is convolved with many kernels and vice versa. Thus, the computational cost of Winograd convolution for CNNs is typically dominated by the element-wise multiplication of many pairs of input tile and kernel, not by the transforms that are performed just once for each input tile and kernel.
4.2 Modified Toom-Cook Algorithm
The Toom-Cook algorithm reduces the number of element-wise multiplications at the cost of an increase in the number of additions needed to implement the transforms. To reduce the number of additions, while keeping the number of multiplications the same, a modified form of the Toom-Cook algorithm is used. It has been shown by Barabasz et al. [1] that this modified form reduces FP error in both linear transforms and the convolution operation.
The main idea behind the modified form is to have an input of size \(n-1\) instead of n with the same kernel size. Essentially, we are solving a one-size-smaller problem with the interpolation points \(\alpha _0, \dots, \alpha _{n-2}\), and the matrices \(A^T\), \(G,\) and \(B^T\) are constructed based on these points.
In the modified algorithm, the reducing polynomial is taken as \(M^{\prime }(x) = (x-\alpha _0)\dots (x-\alpha _{n-2})\) and we compute (13) \(\begin{equation} t(x) = g(x)d(x) \mod {M}^{\prime }(x), \end{equation}\) where \(t(x)\) is a polynomial of degree one less than that of \(y(x)\) in Equation (12); \(y(x)\) is recovered by (14) \(\begin{equation} y(x) = t(x) + g_{n-1} d_{k-1}M^{\prime }(x). \end{equation}\)
The modified Toom-Cook algorithm can thus be represented as (15) \(\begin{equation} y(x) \equiv g(x)d(x) \mod {M}^{\prime }(x)(x-\infty). \end{equation}\)
The three transform matrices are initially generated for \(n-1\) points and extra rows/columns are added based on the following formulations [1]: (16) \(\begin{align} G^{m(n-1)}d & = & G^{(n-2)}d + d_{k-1} \end{align}\) (17) \(\begin{align} A^{m(n-1)}g & = & A^{(n-2)}g + g_{n-1} \end{align}\) (18) \(\begin{align} B^{m(n-1)}(G^{m(n-1)}d \odot A^{m(n-1)}g) & = & B^{(n-2)}(G^{(n-2)}d \odot A^{(n-2)}g) + d_{k-1}g_{n-1}M^{\prime }(x), \end{align}\) where \(G^{(n-2)}\), \(A^{(n-2)}\), and \(B^{(n-2)}\) are the matrices for the original method with \(n-1\) points and \(G^{m(n-1)}\), \(A^{m(n-1)}\), and \(B^{m(n-1)}\) are the matrices for the modified method.
Essentially, this entails that a final row of all zeros with a 1 in the last position is added to G, and a final column of the same form is added to \(A^T\). For matrix \(B^T\), a final row and column are added: the last column contains all zeros bar a 1 at the last position, and the last row contains the consecutive coefficients of the polynomial \(M^{\prime }(x)\). Thus [1]: (19) \(\begin{equation} G^{m(n-1)} = \begin{bmatrix} & G^{(n-2)} & \\ 0 & \dots & 1 \end{bmatrix} \quad A^{m(n-1)T} = \begin{bmatrix} & 0 \\ A^{(n-2)T} & \vdots \\ & 0 \\ & 1 \end{bmatrix} \quad B^{m(n-1)T} = \begin{bmatrix} B^{(n-2)T} & 0\\ M^{\prime }_0 \dots M^{\prime }_{n-2} & 1 \end{bmatrix}\!. \end{equation}\)
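The row/column extensions of Equation (19) can be sketched and verified in NumPy. This is our own illustration, under the convention that the Lagrange scaling sits in G and the rows of the base \(B^T\) hold the coefficients of the Lagrange numerator polynomials; for points \(0, 1, -1\) it reproduces (up to scaling) the widely used \(F(2,3)\) matrices:

```python
import numpy as np

def modified_winograd_matrices(points, m, k):
    """Modified Toom-Cook: n = m + k - 1 general multiplications using
    n - 1 finite interpolation points plus the 'infinity' point."""
    n = m + k - 1
    assert len(points) == n - 1
    # Base matrices for the n - 1 finite points
    N = np.array([np.prod([a - b for b in points if b != a]) for a in points])
    G = np.vander(points, k, increasing=True) / N[:, None]
    A_T = np.vander(points, m, increasing=True).T
    B_T = np.array([np.poly([b for b in points if b != a])[::-1] for a in points])
    # Append the infinity-point row/column following Equation (19)
    G_m = np.vstack([G, np.eye(1, k, k - 1)])        # kernel row (0, ..., 0, 1)
    A_mT = np.hstack([A_T, np.eye(1, m, m - 1).T])   # output column (0, ..., 0, 1)^T
    Mp = np.poly(points)[::-1]                       # coefficients of M'(x), ascending
    B_mT = np.block([[B_T, np.zeros((n - 1, 1))],
                     [Mp[None, :-1], np.ones((1, 1))]])
    return A_mT, G_m, B_mT

A_T, G, B_T = modified_winograd_matrices([0.0, 1.0, -1.0], m=2, k=3)
d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
y = A_T @ ((G @ g) * (B_T @ d))      # equals np.correlate(d, g, 'valid')
```

Note that the leading (monic) coefficient of \(M^{\prime }(x)\) is exactly the 1 in the bottom-right corner of \(B^{m(n-1)T}\), so the appended row really is the full coefficient list of \(M^{\prime }(x)\).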
With the modified Toom-Cook technique resulting in fewer additions and also significantly reducing the FP error, we will use this technique to perform all of our analysis.
4.3 Memory Requirements
As described in Equation (9), the 2D Winograd/Toom-Cook convolution consists of three main steps: (1) transform of the input feature map (IFM) and kernel from the spatial domain into the Winograd domain, (2) pairwise multiplication (Hadamard product), and (3) transform of the pairwise product back to the spatial domain.
In a 2D DNN convolution, the input feature map has dimensions \(C \times H \times W\) and the weight kernel has dimensions \(M \times C \times k \times k\). Here, C represents the number of input channels in the IFM, and M is the number of channels in the output feature map (OFM) of the layer. For example, the first convolution layer of ResNet-20 [10] convolves an input image of three channels (RGB) and produces an OFM of 16 channels, with each channel within the IFM and OFM consisting of \(32\times 32\) elements, forming the height (H) and width (W) dimensions. k indicates the size of each 2D weight filter.
Winograd convolution involves transforming the IFM and filter kernel into the Winograd domain, performing a Hadamard product between the transformed inputs, and then transforming the product back to the spatial domain. With multiple input channels in the IFM, each channel should undergo these steps before being summed across the channels to form the OFM. When each input channel of the IFM is convolved using this simple approach, the pairwise product of each channel must be transformed back to the spatial domain before summation. However, Lavin and Gray [19] observe that the transforms are linear. Thus, the sum of the transformed pairwise products is equal to the transformed sum of the pairwise products. This means that one inverse transform is needed per OFM channel, greatly reducing the operation count.
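Because the transforms are linear, summing the Winograd-domain products across channels before a single inverse transform gives the same OFM channel as inverse-transforming each channel's product and then summing. A small NumPy check of this observation (our own sketch, using the well-known \(F(2,3)\) matrices):

```python
import numpy as np

# F(2, 3) transform matrices (points 0, 1, -1 plus infinity)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)
G = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]])
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)

rng = np.random.default_rng(0)
C = 8                                    # input channels
d = rng.standard_normal((C, 4, 4))       # one n x n input tile per channel
g = rng.standard_normal((C, 3, 3))       # one k x k kernel per channel

prods = [(G @ g[c] @ G.T) * (B_T @ d[c] @ B_T.T) for c in range(C)]

naive = sum(A_T @ p @ A_T.T for p in prods)   # C inverse transforms
fused = A_T @ sum(prods) @ A_T.T              # a single inverse transform

np.allclose(naive, fused)  # one inverse transform per OFM channel suffices
```

This is the observation that reduces the inverse-transform count from one per input channel to one per OFM channel.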
In the following discussion we assume that the convolution has an output stride of 1. To perform Winograd transformation with an IFM of dimensions \(H \times W\), the IFM must be broken into blocks of size \(m \times m\). Recall that m is the size of the output block from the Winograd algorithm where \(n = m + k -1\) (see Section 4.1). Where H and/or W are not an even multiple of m, the edge blocks must be padded with zeros to create an image of the correct size. This padding can be added physically to the memory of the IFM or added logically in the code that performs the transform of the IFM. The number of vertical and horizontal blocks of size \(m \times m\) are thus \(\lceil H/m \rceil\) and \(\lceil W/m \rceil\), respectively.
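The tile arithmetic above amounts to a pair of ceiling divisions; a small helper (our own illustration, not from the article):

```python
import math

def tile_counts(H, W, m):
    # Number of m x m output tiles covering an H x W feature map, padding the
    # ragged edges with zeros where m does not evenly divide H or W.
    tiles_h = math.ceil(H / m)
    tiles_w = math.ceil(W / m)
    pad_h = tiles_h * m - H            # zero rows appended at the bottom edge
    pad_w = tiles_w * m - W            # zero columns appended at the right edge
    return tiles_h, tiles_w, pad_h, pad_w

tile_counts(32, 32, 6)  # -> (6, 6, 4, 4): a 6 x 6 grid of tiles with padding
```

For a ResNet-20-style \(32\times 32\) feature map and \(m = 6\), four rows and columns of zero padding are needed, which is wasted work at the edges; this is part of the tradeoff in choosing the output block size.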
Note that although the size of each output block is \(m \times m\), a larger input block of size \(n \times n\) is needed for each output block. When the IFM is divided into blocks, the output blocks divide the IFM perfectly into blocks of size \(m\times m\). But the input blocks do not divide the IFM perfectly. Instead, the input blocks overlap at their edges, where the same input points (or “pixels”) are used in the computation of different output blocks. For an input feature map of size (20) \(\begin{equation} C \times H \times W, \end{equation}\) we could construct a tensor of input blocks that have size (21) \(\begin{equation} C \times \lceil H/m \rceil \times \lceil W/m \rceil \times n \times n. \end{equation}\)
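The tile bookkeeping above can be sketched in a few lines. The helper names `tile_counts` and `input_tile_origin` are hypothetical, chosen for illustration rather than taken from any implementation described here.

```python
import math

def tile_counts(H, W, m):
    # Number of m x m output tiles; edge tiles are zero-padded.
    return math.ceil(H / m), math.ceil(W / m)

def input_tile_origin(v, h, m):
    # Input tiles are n x n (n = m + k - 1) but start at a stride of m,
    # so adjacent input tiles overlap by k - 1 rows/columns.
    return v * m, h * m

H, W, m, k = 32, 32, 2, 3
n = m + k - 1                        # 4: input tile size
tv, th = tile_counts(H, W, m)        # 16 x 16 tiles for a 32 x 32 channel
r0, _ = input_tile_origin(0, 0, m)   # first tile covers rows 0..3
r1, _ = input_tile_origin(1, 0, m)   # next tile covers rows 2..5
overlap = (r0 + n) - r1              # overlap of k - 1 = 2 rows
```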
This tensor of input blocks is shown graphically in Figure 1. The Winograd transform matrix \(B^T\) is applied to each of these input blocks, creating a tensor of transformed input blocks with dimensions (22) \(\begin{equation} C \times \lceil H/m \rceil \times \lceil W/m \rceil \times n \times n. \end{equation}\)
Fig. 1. IFM made up of extracted matrices before transformation to the Winograd domain.
Thus, the transform does not change the number and size of the blocks, as compared with a tensor of the input blocks. The filter kernel, however, has a dimension of \(M\,\times \,C\,\times \,k\,\times \,k\) in the spatial domain and \(M\,\times \,C\,\times \,n\,\times \,n\) in the Winograd domain. Each \(n\,\times \,n\) kernel is point-wise multiplied with all \(n\times n\) regions of the transformed IFM for all input and output channels and the corresponding output summed across input channels, as shown in Figure 2.
Fig. 2. Point-wise multiplication of the filter kernel with the IFM and cross-channel summation.
However, to speed up the Hadamard product and cross-channel summation, the dimensions of the IFM and kernel matrix can be re-ordered, which allows both steps to be performed using general matrix multiplication (GEMM). Optimized BLAS libraries such as OpenBLAS and Intel’s MKL are available to speed up this computation. Thus, the dimensions of the transformed IFM are re-ordered to have the following layout in memory: (23) \(\begin{equation} n\times n\times (\lceil H/m \rceil * \lceil W/m \rceil) \times C, \end{equation}\) and the filter kernel is transformed to have the following dimension: (24) \(\begin{equation} n\times n\times C\times M. \end{equation}\)
In other words, the transformed IFM now contains \(n \times n\) matrices, each of dimension \((\lceil H/m \rceil * \lceil W/m \rceil)\,\times \,C,\) and the filter kernel contains \(n\,\times \,n\) matrices, each of dimension \(C\,\times \,M\). Each of the \(n\,\times \,n\) transformed input matrices is multiplied by the corresponding one of the \(n \times n\) transformed kernel matrices, for a total of \(n \times n\) matrix multiplications. The matrix multiplications can be implemented with the standard GEMM library, each yielding a result matrix of size \((\lceil H/m \rceil * \lceil W/m \rceil)\times M\). Thus, the \(n^2\) GEMM calls create an output tensor with dimensions \(n\times n\times (\lceil H/m \rceil * \lceil W/m \rceil)\times M\). The inverse transform is then applied to this output tensor to produce an output of dimension \(M\times H\times W\).
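A minimal sketch of this GEMM formulation follows; the toy `gemm` function stands in for an optimized BLAS `sgemm` call, and the dimensions are small made-up values rather than those of any real layer.

```python
def gemm(A, B):
    # Reference matrix multiply; production code would call BLAS sgemm here.
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(inner))
             for j in range(cols)] for i in range(rows)]

n, T, C, M = 4, 6, 3, 5      # T = ceil(H/m) * ceil(W/m) input tiles
# Transformed IFM: n x n matrices, each of shape T x C (toy values).
U = [[[[float(i + j + t + c) for c in range(C)] for t in range(T)]
      for j in range(n)] for i in range(n)]
# Transformed kernel: n x n matrices, each of shape C x M (toy values).
V = [[[[float(i - j + c + mm) for mm in range(M)] for c in range(C)]
      for j in range(n)] for i in range(n)]

# One GEMM per Winograd-domain coordinate (i, j): n * n calls in total,
# each producing a T x M result summed over the C input channels.
Y = [[gemm(U[i][j], V[i][j]) for j in range(n)] for i in range(n)]
```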
In practice, not all of these intermediate representations need to be created explicitly in memory. A carefully programmed input transform routine can compute the transformed input tensor directly from the original IFM. A similar strategy can be used for the output transform, and for DNN inference, the kernel can be pre-computed in the Winograd domain. Thus, Winograd convolution needs the buffers defined in Table 2, where the extra buffers needed for the GEMM-based Winograd transform are highlighted.
Table 2. Buffer Requirements for Winograd Convolution
Looking at Table 2, the memory requirements for Winograd-based convolution are significant. For each original IFM, kernel, and OFM there is a corresponding new tensor in the Winograd domain. Consider the case of DNN convolution with \(H= 56, W =56, C=128, M=128,\) and \(k= 3\). Table 3 shows the total size of the original spatial-domain tensors and the size of the tensors in the Winograd domain for output block size \(m = 2, 3,\ldots 8\). The size of the Winograd domain tensors is much larger than the original spatial domain tensors. For example, for \(m=2\), the Winograd domain tensors are \(3.65\times\) the size of the original tensors. However, the total size of the Winograd domain tensors does not necessarily grow with m. In fact, for \(m=8,\) the Winograd tensors are just \(3.04\times\) the size of the originals. This smaller factor is due to less duplication of input and output pixels for larger block sizes. However, the overhead is significant in both cases.
Table 3. Size of Tensors in the IFM, Kernel, and OFM for Direct and Winograd Convolution with a \(\langle C,H,W \rangle = \langle 128,56,56 \rangle\) IFM and \(\langle M,C,k,k \rangle = \langle 128,128,3,3 \rangle\) Kernel
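Assuming the three Winograd-domain buffers are the transformed IFM, the transformed kernel, and the GEMM output before the inverse transform (the accounting highlighted in Table 2), the overhead ratios quoted above can be reproduced with a few lines of arithmetic:

```python
import math

def winograd_overhead(H, W, C, M, k, m):
    # Ratio of Winograd-domain buffer elements to spatial-domain elements.
    n = m + k - 1
    tiles = math.ceil(H / m) * math.ceil(W / m)
    spatial = C * H * W + M * C * k * k + M * H * W   # IFM + kernel + OFM
    wino = (n * n * tiles * C      # transformed IFM
            + n * n * C * M        # transformed kernel
            + n * n * tiles * M)   # GEMM output before inverse transform
    return wino / spatial

r2 = winograd_overhead(56, 56, 128, 128, 3, 2)   # ~3.65x, as in Table 3
r8 = winograd_overhead(56, 56, 128, 128, 3, 8)   # ~3.04x, as in Table 3
```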
5 PROPOSED POINT SELECTION FOR TOOM-COOK ALGORITHM
The overall performance of Toom-Cook convolution algorithms, in terms of FP error, depends on the points at which the polynomials are evaluated. For example, to limit additional general multiplications with \(m=2\) and \(r=3\) (\(F(2,3)\)), the evaluation points \([0,1,-1,\infty]\) generate matrices that involve only additions, subtractions, and shifts by 1 [3]. Two of these points, 0 and \(\infty\), guarantee zeros in the three matrices of Equation (6) and thus remove any source of FP error [1].
There is no known systematic method for selecting the best points to minimize the error [33]. Barabasz et al. [1] evaluate sets of points empirically to try to find sets that reduce the numeric error. However, any real number can be chosen as a valid point, leading to an impossibly large search space of sets of points. Barabasz et al. solve this problem by restricting the points that they consider to a small set of rational numbers (including integers) of the form \(\frac{x}{y}\), where \(x, y \in \lbrace 1,\ldots,4\rbrace\). They start off with \([0,1,-1,\infty]\) and add more points as they increase the input size.
Although the set of basic points \([0,+1,-1]\) is good, there is no clear technique for selecting further points as the input/output size is increased. Barabasz et al. identify several features of sets of interpolation points that tend to work well. First, small integers and simple fractions such as \(\pm 1, \pm \frac{1}{2}, \pm 2, \pm \frac{1}{4}, \pm 4\) work well because they have few significant binary digits and can be represented exactly. Second, each point c is raised to successive powers, such as \(c^1, c^2, c^3, c^4 \dots\). Point values that are close to 1.0 do not become too large or small when raised to powers. Third, and on the other hand, parts of the transform matrices depend on the difference (subtraction) between different interpolation points. If different points are too close together, the difference is a very small number that tends to result in increased numeric error. Fourth, selecting pairs of points that differ in sign, that is, c and \(-c\), and those that are reciprocal, that is, c and \(\frac{1}{c}\), often leads to lower numeric errors. Such pairs of points help to create zero or \(\pm 1\) coefficients, while also helping with the conditioning of transform matrices [8, 9]. Note that these four features are often in conflict with one another, and there is no known method for finding sets of points that are optimal with respect to the floating point error.
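The first three features can be illustrated numerically; the specific values below are ours, chosen only for illustration.

```python
import struct

def f32(x):
    # Round a Python float (binary64) to IEEE binary32,
    # emulating single-precision storage.
    return struct.unpack('f', struct.pack('f', x))[0]

# Feature 2: points are raised to successive powers, so values far
# from 1.0 quickly dwarf the other terms in the transforms.
assert 4.0 ** 6 == 4096.0            # exact, but large
assert abs(1.25 ** 6 - 3.815) < 0.01 # stays moderate near 1.0

# Feature 3: transform entries contain factors like 1/(p_i - p_j);
# nearly equal points produce huge, poorly conditioned coefficients.
coef_far = 1.0 / (2.0 - 0.5)         # ~0.67, benign
coef_close = 1.0 / (1.001 - 1.0)     # ~1000, amplifies rounding error

# Feature 1: simple fractions are exact in binary32; 1/3 is not.
assert f32(0.5) == 0.5
assert f32(1.0 / 3.0) != 1.0 / 3.0
```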
Rather than confining ourselves to interpolation points that are simple small integers and fractions, in this article we consider the full range of real values on the number line. This means that the search space of possible values is the full range of floating point values for each interpolation point we choose. This search space is far too large to consider, so we instead propose to search combinations of interpolation points that differ in sign and are reciprocals, that is, points of the form \(-\frac{1}{c},-c,c,\frac{1}{c}\). Thus, instead of having to consider all possible floating point values for four independent points, we instead consider all possible floating point values for a single point c and derive the other values from it. The result is a search space of possible values that is much smaller, although still too large to search exhaustively. However, in Section 6.1 we show that the error curve for different values of c is roughly smooth, and it is therefore possible to find good values of c around a rough global minimum value of the curve.
In our proposed method, we select the following set of points in addition to 0 and \(\infty\): (25) \(\begin{equation} \lbrace -c,c\rbrace \end{equation}\) for \(F(2,3)\) (that is, input block size 4, output block size 2, kernel size 3); (26) \(\begin{equation} \left\lbrace -\frac{1}{c},-c,c,\frac{1}{c}\right\rbrace \end{equation}\) for \(F(4,3)\); and (27) \(\begin{equation} \left\lbrace -1/d,-d,-\frac{1}{c},-c,c,\frac{1}{c},d,1/d\right\rbrace \end{equation}\) for \(F(8,3)\). For input sizes \(n\lt 6\), we empirically evaluate subsets of the points used for \(F(4,3)\). For example, for \(F(3,3)\) the following points are evaluated: (28) \(\begin{equation} \left\lbrace -\frac{1}{c}, -c, c\right\rbrace \quad \left\lbrace -\frac{1}{c}, -c, \frac{1}{c}\right\rbrace \quad \left\lbrace -\frac{1}{c}, c, \frac{1}{c}\right\rbrace \quad \left\lbrace -c, c, \frac{1}{c}\right\rbrace . \end{equation}\)
For \(n\gt 6\), we select all four c points, i.e., \(-\frac{1}{c},-c,c,\frac{1}{c}\), and a subset of the d points, depending on the value of n. For example, for \(F(7,3)\), we evaluate the following points, in addition to all c points, 0, and \(\infty\): (29) \(\begin{align} \lbrace -1/d,-d\rbrace \quad \lbrace -1/d,d\rbrace \quad \lbrace -1/d,1/d\rbrace \end{align}\) \(\begin{align} \lbrace -d,d\rbrace \quad \lbrace -d,1/d\rbrace \quad \lbrace d,1/d\rbrace . \end{align}\)
In addition to these points, we also construct points using a subset of those proposed by Barabasz et al. in [1]. For example, for \(F(4,3)\), Barabasz et al. propose \([0,-1,1,\frac{1}{2},-3,\infty],\) and we construct the following points using these (in addition to 0 and \(\infty\)): (30) \(\begin{align} &\left\lbrace 1,-1,-\frac{1}{c},-c\right\rbrace \quad \left\lbrace 1,-1,-\frac{1}{c},c\right\rbrace \quad \left\lbrace 1,-1,-\frac{1}{c},\frac{1}{c}\right\rbrace \end{align}\) \(\begin{align} &\lbrace 1,-1,-c,c\rbrace \quad \left\lbrace 1,-1,-c,\frac{1}{c}\right\rbrace \quad \left\lbrace 1,-1,c,\frac{1}{c}\right\rbrace \end{align}\) (31) \(\begin{align} \left\lbrace -3, -\frac{1}{c}, -c, c\right\rbrace & \quad \left\lbrace -3, -\frac{1}{c}, -c, \frac{1}{c}\right\rbrace \quad \left\lbrace -3, -\frac{1}{c}, c, \frac{1}{c}\right\rbrace \quad \left\lbrace -3, -c, c, \frac{1}{c}\right\rbrace . \end{align}\)
In the following sections, we describe how selecting the proposed interpolation points results in simpler transform matrices as compared to selecting points with no such structure.
5.1 Point Selection for F(2,3): n = 4, m = 2, k = 3
Beginning with \(F(2,3)\), we select [\(0,c,-c,\infty\)] as the evaluation points. As stated earlier, picking points with opposite signs helps to obtain zero or one coefficients in the matrices. Equation (32) first shows, for comparison, the two matrices generated when two generic points a and b (together with 0 and \(\infty\)) are selected, followed by the matrices generated by our evaluation points: (32) \(\begin{equation} B^T = \begin{bmatrix} ab & -a - b & 1 & 0\\ 0 & -b & 1 & 0\\ 0 & -a & 1 & 0\\ 0 & ab & -a - b & 1 \end{bmatrix} \quad G = \begin{bmatrix} 1/ab & 0 & 0\\ 1/a(a - b) & 1/(a - b) & a/(a - b)\\ -1/b(a - b)& -1/(a - b)& -b/(a - b)\\ 0 & 0 & 1 \end{bmatrix}, \end{equation}\) \(\begin{equation} B^T = \begin{bmatrix} 0 & -c & 1 & 0\\ -c^2 & 0 & 1 & 0\\ 0 & c & 1 & 0\\ 0 & -c^2 & 0 & 1\\ \end{bmatrix} \quad G = \begin{bmatrix} 1/2c^2 & -1/2c & 1/2\\ -\frac{1}{c^2} & 0 & 0\\ 1/2c^2 & 1/2c & 1/2\\ 0 & 0 & 1\\ \end{bmatrix}. \end{equation}\)
When we replace a and b with c and \(-c\), respectively, a number of terms in the two matrices cancel, which leads to simpler computations and helps reduce the error. For example, \(-a-b\) reduces to 0 in \(B^T\) and \(1/(a-b)\) reduces to \(\frac{1}{2c}\) in G. Similar simplifications occur in the \(A^T\) matrix.
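These cancellations can be sanity-checked in exact rational arithmetic. In the sketch below, \(B^T\) and G are taken from Equation (32); the matching \(A^T = \left[\begin{smallmatrix}1 & 1 & 1 & 0\\ -c & 0 & c & 1\end{smallmatrix}\right]\) is our own derivation for this point ordering, not given in the text. Exact equality with direct convolution confirms that the specialized matrices implement \(F(2,3)\) exactly for any rational \(c \ne 0\), so floating point rounding is the only source of error.

```python
from fractions import Fraction as F

def f23_pm_c(d, g, c):
    # F(2,3) with points {0, c, -c, inf}: B^T and G from Equation (32);
    # A^T below is our own derivation for this row ordering.
    BT = [[0, -c, 1, 0],
          [-c * c, 0, 1, 0],
          [0, c, 1, 0],
          [0, -c * c, 0, 1]]
    G = [[1 / (2 * c * c), -1 / (2 * c), F(1, 2)],
         [-1 / (c * c), 0, 0],
         [1 / (2 * c * c), 1 / (2 * c), F(1, 2)],
         [0, 0, 1]]
    AT = [[1, 1, 1, 0],
          [-c, 0, c, 1]]
    mv = lambda M, v: [sum(a * b for a, b in zip(row, v)) for row in M]
    m = [u * w for u, w in zip(mv(G, g), mv(BT, d))]
    return mv(AT, m)

c = F(17, 10)                      # an arbitrary rational value of c
d = [F(x) for x in (1, 2, 3, 4)]
g = [F(x) for x in (2, -1, 5)]
direct = [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]
assert f23_pm_c(d, g, c) == direct  # exact in rational arithmetic
```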
5.2 Point Selection for \(F(4,3)\): \(n = 6, m = 4, k = 3\)
For \(F(4,3)\), we propose choosing [\(-\frac{1}{c},-c,0,c,\frac{1}{c}\)] as the evaluation points. The corresponding matrices (\(B^T\) and G) generated by these points are given in Equation (33): (33) \(\begin{align} B^T & = & \begin{bmatrix} 0 & c & -c^2 & -\frac{1}{c} & 1 & 0\\ 0 & \frac{1}{c} & -\frac{1}{c^2} & -c & 1 & 0\\ 1 & 0 & -(c^4 + 1)/c^2 & 0 & 1 & 0\\ 0 & -\frac{1}{c} & -\frac{1}{c^2} & c & 1 & 0\\ 0 & -c & -c^2 & \frac{1}{c} & 1 & 0\\ 0 & 1 & 0 & -(c^4 + 1)/c^2 & 0 & 1 \end{bmatrix} \end{align}\) \(\begin{align} G & = & \begin{bmatrix} -c^4/(2c^4 - 2) & c^3/(2(c^4 - 1)) & -c^2/(2c^4 - 2)\\ 1/2(c^4 - 1) & -c/(2c^4 - 2) & c^2/2(c^4 - 1)\\ 1 & 0 & 0\\ 1/2(c^4 - 1) & c/2(c^4 - 1) & c^2/2(c^4 - 1)\\ -c^4/(2c^4 - 2) & -c^3/(2c^4 - 2) & -c^2/(2c^4 - 2)\\ 0 & 0 & 1 \end{bmatrix}. \end{align}\) Matrices generated for four disparate points [\(a,b,c,d\)] are given in Equations (34) through (36): (34) \(\begin{align} B^T & = & \begin{bmatrix} B_{0,0} & B_{0,3} \\ B_{3,0} & B_{3,3} \end{bmatrix}, \end{align}\) where (35) \(\begin{align} B_{0,0} & = & \begin{bmatrix} abcd & -ab(c+d)-cd(a+b) & a(b+c+d)+b(c+d)+cd\\ 0 & -bcd & b(c+d)+cd \\ 0 & -acd & a(c+d)+cd \end{bmatrix}, \end{align}\) \(\begin{align} B_{0,3} & = & \begin{bmatrix} -a-b-c-d & 1 & 0\\ -b-c-d & 1 & 0\\ -a-c-d & 1 & 0 \end{bmatrix}, \end{align}\) \(\begin{align} B_{3,0} & = & \begin{bmatrix} 0 & -abd & a(b+d)+bd \\ 0 & -abc & a(b+c)+bc \\ 0 & abcd & -ab(c+d)-cd(a+b) \end{bmatrix}, \end{align}\) \(\begin{align} B_{3,3} & = & \begin{bmatrix} -a-b-d & 1 & 0\\ -a-b-c & 1 & 0\\ a(b+c+d)+b(c+d)+cd & -a-b-c-d & 1 \end{bmatrix}, \end{align}\) and (36) \(\begin{align} G & = & \begin{bmatrix} 1/abcd & 0 & 0 \\ 1/a(a-b)(a-c)(a-d) & 1/(a-b)(a-c)(a-d) & a/(a-b)(a-c)(a-d) \\ -1/b(a-b)(b-c)(b-d) & -1/(a-b)(b-c)(b-d) & -b/(a-b)(b-c)(b-d) \\ 1/c(a-c)(b-c)(c-d) & 1/(a-c)(b-c)(c-d) & c/(a-c)(b-c)(c-d) \\ -1/d(a-d)(b-d)(c-d) & -1/(a-d)(b-d)(c-d) & -d/(a-d)(b-d)(c-d) \\ 0 & 0 & 1 \end{bmatrix}. \end{align}\)
When a, b, c, and d are replaced with \(-\frac{1}{c}, -c, \frac{1}{c}, c\), the simplification in the matrices is evident. A number of entries are either reduced to zero or \(\pm 1\) or significantly simplified. Similar reductions are achieved for larger input/output sizes as well.
Note that the simplifications of matrix terms caused by our point selection may slightly reduce the computational cost of Winograd convolution for CNNs, as well as reducing the numerical error. Where the resulting transform matrices contain a zero value, no computation is needed to implement that part of the transform. Similarly, where the transform matrix contains the value \(\pm 1\), only addition, not multiplication, is needed to implement part of the transform. Our proposed approach to point selection tends to result in simple values in the transform matrices, and thus it may reduce the computation cost of transforms compared to other point selections. But it is not specifically designed to maximize the number of zero and \(\pm 1\) values in the transform matrices. In particular, it is sometimes possible to increase the number of \(\pm 1\) values in the matrix with a different point selection.
However, the impact on execution time of slightly increasing the number of \(\pm 1\) values in the transform matrices is minor. As noted in Section 4, each input tile is convolved with many kernels and vice versa, so the cost of the transforms is amortized across many uses. Thus, the largest computational component of Winograd convolution for CNNs is the element-wise Hadamard product, not the transforms. The Hadamard product of many input tiles and many kernels and the summation of convolution results across channels are normally implemented using matrix multiplication. This part of Winograd convolution dominates the execution time and is entirely independent of the interpolation points that are used in the transforms.
6 EXPERIMENTAL EVALUATION
In this section we apply our method of point selection for Winograd convolution and search for suitable interpolation points. The search space of floating point values is extremely large, but the error curve for points of our form is smooth at a macro level, with a clear region of values where the error is lower than elsewhere on the curve. This smoothness allows us to focus the search on that region and find real-valued points that reduce the numeric error in an otherwise impossibly large search space. We evaluate our selected sets of points on two different types of dataset. First, we evaluate with uniform random values to demonstrate the value of the points in general. Second, we implement our method within the Caffe framework for deep neural networks and evaluate our point selections on several real neural networks. Finally, we measure the time taken by Winograd convolution for various input sizes to demonstrate the relationship between speed and accuracy.
Barabasz et al. [1] propose a Huffman-tree evaluation order that reduces the average error at no additional cost in computation. We use the same evaluation order and present results based on it. Although simulations were carried out for simple Winograd, Winograd with Huffman, and Winograd with Huffman using double-precision floating point for the transforms and single precision for the Hadamard product, we mention the other variants only where their conclusions differ from Huffman-based Winograd convolution. For error calculation, each type of convolution algorithm was evaluated on \(5,\!000\) randomly generated inputs per evaluation. This cancels the effect of individual inputs and makes the results statistically stable. The L1 norm is then computed between the output of each type of convolution and a direct convolution using double-precision floating point values. The selected points were then used to evaluate Winograd convolution on selected deep convolution networks with real-world data.
6.1 The Error Curve
Typically, as suggested in the literature, the points that reduce the error in Winograd/Toom-Cook convolution should be either integers or simple fractions of the form \(\frac{a}{b}\), where both a and b are integers. In this work, however, instead of focusing only on such values, we examine the error curve to find points that reduce the error.
In order to illustrate this, we take the example of \(F(4,3)\) convolution and evaluate it on the points \(\lbrace -\frac{1}{c},-c,0,c,\frac{1}{c}\rbrace\). We vary the value of c from 1.1 to 2.5 with a step size of \(10^{-3}\) and calculate the error in both 1D and 2D convolution.
The error is plotted as a function of c and is shown in Figure 3 for 1D convolution and in Figure 4 for 2D convolution. All three types of convolution, simple, Huffman based, and Huffman with mixed precision, are shown.
Fig. 3. 1D modified Toom-Cook convolution with \(\lbrace -\frac{1}{c},-c,0,c,\frac{1}{c}\rbrace\) input evaluation points.
Fig. 4. 2D modified Toom-Cook convolution with \(\lbrace -\frac{1}{c},-c,0,c,\frac{1}{c}\rbrace\) input evaluation points.
The curves for all three types of convolution show similar behavior, with minimum error values corresponding to values of c in the range of 1.6 to 1.8 and error increasing on either side of this range. The points considered are not of the form favored in the literature but perform well compared with points such as \(-\frac{1}{2}, -2, 0, 2, \frac{1}{2}\). Before comparing points selected with this methodology against those based on simple fractions, in the next section we analyze whether the points proposed by Barabasz et al. in [1] combine well with our proposed point formulation, as shown in Equation (31).
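The sweep can be reproduced in miniature. The sketch below performs the analogous sweep for the smaller \(F(2,3)\) case with points \(\lbrace 0,c,-c,\infty \rbrace\), using \(B^T\) and G from Section 5.1 and an \(A^T\) that is our own derivation for that row ordering. Single precision is emulated by rounding every operation to binary32, and the error is the L1 norm against a double-precision direct convolution; the trial count is reduced from the \(5,\!000\) inputs used in the text.

```python
import random, struct

def f32(x):
    # Round a Python float (binary64) to IEEE binary32.
    return struct.unpack('f', struct.pack('f', x))[0]

def mv32(M, v):
    # Matrix-vector product with rounding to binary32 after every operation.
    out = []
    for row in M:
        acc = 0.0
        for a, b in zip(row, v):
            acc = f32(acc + f32(a * b))
        out.append(acc)
    return out

def winograd_f23(d, g, c):
    # F(2,3) with points {0, c, -c, inf}; A^T is our own derivation.
    BT = [[0.0, -c, 1.0, 0.0], [-c * c, 0.0, 1.0, 0.0],
          [0.0, c, 1.0, 0.0], [0.0, -c * c, 0.0, 1.0]]
    G = [[1 / (2 * c * c), -1 / (2 * c), 0.5], [-1 / (c * c), 0.0, 0.0],
         [1 / (2 * c * c), 1 / (2 * c), 0.5], [0.0, 0.0, 1.0]]
    AT = [[1.0, 1.0, 1.0, 0.0], [-c, 0.0, c, 1.0]]
    m = [f32(u * w) for u, w in zip(mv32(G, g), mv32(BT, d))]
    return mv32(AT, m)

def mean_l1(c, trials=500, seed=0):
    # Average L1 norm against a double-precision direct convolution.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        d = [f32(rng.uniform(-1.0, 1.0)) for _ in range(4)]
        g = [f32(rng.uniform(-1.0, 1.0)) for _ in range(3)]
        ref = [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]
        out = winograd_f23(d, g, c)
        total += sum(abs(a - b) for a, b in zip(out, ref))
    return total / trials

# Sweep c and locate the low-error region of the curve.
errors = {cv / 100.0: mean_l1(cv / 100.0) for cv in range(110, 251, 10)}
best_c = min(errors, key=errors.get)
```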
6.2 Searching for Optimality
Here we combine some of the points proposed in [1] with our proposed evaluation points and analyze the resulting convolution error. In other words, in addition to 0 and \(\infty\), we fix further points while selecting a subset of our proposed points to constitute the remainder. We refer to the fixed points as base points. We present results for three cases, \(F(2,3)\), \(F(3,3),\) and \(F(4,3)\). For the case of \(F(2,3)\), results are shown in Table 4, where we evaluate six different combinations of polynomial evaluation points. The best results are highlighted in bold, and the points 0 and \(\infty\) are implicit.
Table 4. Optimal Points with Respect to Various Choices for \(F(2,3)\)
Among the various options, the combinations \(\lbrace -\frac{1}{c},\frac{1}{c}\rbrace\), \(\lbrace 1,-c\rbrace,\) and \(\lbrace -1,\frac{1}{c}\rbrace\) stand out, with minimal difference between their best values. For the last two combinations, 1 and \(-1\) are fixed while we iterate over one of the points \(\lbrace -\frac{1}{c}, -c, c, \frac{1}{c}\rbrace\). Although for these two combinations the lowest error is given by \(\lbrace 1,-c\rbrace\) and \(\lbrace -1,\frac{1}{c}\rbrace\), plotting the error curve shows that the combinations \(\lbrace 1,-\frac{1}{c}\rbrace\) and \(\lbrace 1,c\rbrace\) give similar results. The error curve is shown in Figures 5(a) and 5(b). Similarly, for the general case where we pick two of the proposed points \(\lbrace -\frac{1}{c}, -c, c, \frac{1}{c}\rbrace\), four combinations give similar results.
Fig. 5. \(F(2,3)\) with evaluation point (a) \(-1\) and (b) 1 as base points and (c, d) the general case.
Table 5 shows the same analysis for \(F(3,3)\). Choosing only from our proposed points does not yield the minimum convolution error; the minimum occurs when 0.5 or \(\pm 1\) is chosen as the base point for 1D and 2D convolution, respectively, giving at least a \(16.67\%\) and 20% improvement. However, for \(F(4,3)\), our proposed points give the lowest error among all options, improving by at least \(2.75\%\) and \(5.95\%\) for 1D and 2D convolution, respectively (see Table 6). This difference between \(F(3,3)\) and \(F(4,3)\) arises because for \(F(3,3)\) not all points from \(\lbrace -\frac{1}{c},-c,c,\frac{1}{c}\rbrace\) can be chosen at the same time, reducing the opportunity for maximum cancellations and reductions to 0 and \(\pm 1\).
Table 5. Optimal Points with Respect to Various Choices for \(F(3,3)\)
Table 6. Optimal Points with Respect to Various Choices for \(F(4,3)\)
As shown in Figure 5, certain combinations of points are better than others. Upon further investigation, we found that when combining base points with a subset of our proposed points, some combinations were always better than others. For example, as shown in Table 7, when using \(\lbrace 0,3,\infty \rbrace\) as the base points, selecting \(\lbrace -\frac{1}{c}, c\rbrace\) and \(\lbrace -\frac{1}{c}, -c, \frac{1}{c}\rbrace\) for \(F(3,3)\) and \(F(4,3)\) convolution always reduces the error. In a similar way, patterns are identified for other combinations with occasional exceptions highlighted.
Table 7. Optimal Pattern with Minimum Error for 1D Convolution
\(\dagger\)For mixed-precision Winograd convolution of [1] with Huffman summation tree, the pattern that gives minimum FP error is \(\lbrace -c,\frac{1}{c}\rbrace .\)
The patterns identified for 2D convolution are given in Table 8 for the same convolution sizes as those in Table 7. The patterns are fairly similar for 1D and 2D convolution, which suggests that finding points for 2D convolution is not a fundamentally different problem from finding good 1D points.
Table 8. Optimal Pattern with Minimum Error for 2D Convolution
6.3 State of the Art
Having identified the best point selections using our proposed method, we compare the floating point errors in 1D and 2D Winograd/Toom-Cook convolution against the state-of-the-art numbers given by Barabasz et al. [1]. The floating point error is defined as the L1 norm between the output of Winograd convolution and a direct convolution using double-precision floating point values. Barabasz et al. analytically and experimentally evaluated a number of points and suggested optimal points that reduce the error. For this work, we experimentally evaluated hundreds of thousands of combinations of input polynomial evaluation points and present the results for 1D and 2D convolution in Tables 9 and 10, respectively. The first row of each table, \(n = 0\), indicates direct convolution in single-precision floating point, with the error computed in the same way as for the Winograd outputs.
| n | Points 1D by [1] | Error 1D [1] | Proposed 1D Points | Error 1D | % Imp. |
|---|---|---|---|---|---|
| 0 | Direct convolution | \(1.75\times 10^{-8}\) | Direct convolution | \(1.75\times 10^{-8}\) | – |
| 4 | \(P_4 = \lbrace 0,-1,1,\infty \rbrace\) | \(2.45\times 10^{-8}\) | \(c=1.028, \lbrace 0,-1,\frac{1}{c}\rbrace\) | \(3.06\times 10^{-8}\) | \(-24.9\) |
| 5 | \(P_4 \cup \lbrace \frac{1}{2}\rbrace\) | \(5.19\times 10^{-8}\) | \(c=1.5, \lbrace 0,\frac{1}{2},-c,c\rbrace\) | \(4.69\times 10^{-8}\) | \(9.6\) |
| 6 | \(P_4 \cup \lbrace \frac{1}{2},-3\rbrace\) | \(6.92\times 10^{-8}\) | \(c=1.829, \lbrace -\frac{1}{c},-c,0,c,\frac{1}{c}\rbrace\) | \(5.65\times 10^{-8}\) | \(18.35\) |
| 7 | \(P_4 \cup \lbrace \frac{1}{2},-\frac{1}{2},-3\rbrace\) | \(9.35\times 10^{-8}\) | \(c=2.22, d=1.0, \lbrace 0,-\frac{1}{c},-c,c,\frac{1}{c},d\rbrace\) | \(1.07\times 10^{-7}\) | \(-14.44\) |
| 8 | \(P_8 = P_4 \cup \lbrace \frac{1}{2},-\frac{1}{2},2,-2\rbrace\) | \(1.15\times 10^{-7}\) | \(c=2.0, d=1.0, \lbrace 0,-\frac{1}{c},-c,c,\frac{1}{c},-d,d\rbrace\) | \(1.16\times 10^{-7}\) | 0 |
| 9 | \(P_8 \cup \lbrace -\frac{1}{4}\rbrace\) | \(2.34\times 10^{-7}\) | \(c=1.313, d=2.478, \lbrace 0,-\frac{1}{c},-c,c,\frac{1}{c},-\frac{1}{d},-d,d\rbrace\) | \(2.29\times 10^{-7}\) | \(2.14\) |
| 10 | \(P_{10} = P_8 \cup \lbrace -\frac{1}{4},4\rbrace\) | \(3.46\times 10^{-7}\) | \(c=1.953, d=1.229, \lbrace -\frac{1}{c},-c,-\frac{1}{d},-d,0,c,d,\frac{1}{d},\frac{1}{c}\rbrace\) | \(1.4\times 10^{-7}\) | \(59.54\) |
Table 9. Proposed Points for Toom-Cook 1D Convolution with Kernel of Size 3 and Corresponding FP Error and Its Comparison with Corresponding Points and FP Error of [1]
| n | Points 2D by [1] | Error 2D [1] | Proposed 2D Points | Error 2D | % Imp. |
|---|---|---|---|---|---|
| 0 | Direct convolution | \(4.63\times 10^{-8}\) | Direct convolution | \(4.63\times 10^{-8}\) | – |
| 4 | \(P_4 = \lbrace 0,-1,1,\infty \rbrace\) | \(7.65\times 10^{-8}\) | \(c=1.054, \lbrace 0,-\frac{1}{c},\frac{1}{c}\rbrace\) | \(9.22\times 10^{-8}\) | \(-20.52\) |
| 5 | \(P_4 \cup \lbrace \frac{1}{2}\rbrace\) | \(2.35\times 10^{-7}\) | \(c=2.0, \lbrace 0,1,-1,-c\rbrace\) | \(1.51\times 10^{-7}\) | \(35.75\) |
| 6 | \(P_4 \cup \lbrace \frac{1}{2},-2\rbrace\) | \(3.29\times 10^{-7}\) | \(c=1.622, \lbrace -\frac{1}{c},-c,0,c,\frac{1}{c}\rbrace\) | \(2.37\times 10^{-7}\) | \(27.96\) |
| 7 | \(P_4 \cup \lbrace \frac{1}{2},-2,-\frac{1}{2}\rbrace\) | \(6.81\times 10^{-7}\) | \(c=2.0, d=1.0, \lbrace 0,-\frac{1}{c},-c,c,\frac{1}{c},d\rbrace\) | \(7.72\times 10^{-7}\) | \(-13.36\) |
| 8 | \(P_8 = P_4 \cup \lbrace \frac{1}{2},-\frac{1}{2},2,-2\rbrace\) | \(8.79\times 10^{-7}\) | \(c=2.0, d=1.003, \lbrace 0,-\frac{1}{c},-c,c,\frac{1}{c},-\frac{1}{d},d\rbrace\) | \(8.79\times 10^{-7}\) | 0 |
| 9 | \(P_8 \cup \lbrace -\frac{1}{4}\rbrace\) | \(3.71\times 10^{-6}\) | \(c=1.305, d=2.485, \lbrace 0,-\frac{1}{c},-c,c,\frac{1}{c},-\frac{1}{d},-d,d\rbrace\) | \(3.06\times 10^{-6}\) | \(17.5\) |
| 10 | \(P_{10} = P_8 \cup \lbrace -\frac{1}{4},4\rbrace\) | \(7.35\times 10^{-6}\) | \(c=1.272, d=2.099, \lbrace -\frac{1}{c},-c,-\frac{1}{d},-d,0,c,d,\frac{1}{d},\frac{1}{c}\rbrace\) | \(5.28\times 10^{-6}\) | \(28.16\) |
Table 10. Proposed Points for Toom-Cook 2D Convolution with Kernel of Size 3 and Corresponding FP Error and Its Comparison with Corresponding Points and FP Error of [1]
Apart from the case of \(F(2,3)\), our proposed method of picking points reduces the error compared to the best existing points [1]. We find improvements for 1D and 2D convolution ranging from \(2.14\%\) to \(59.54\%\) and from \(17.5\%\) to \(35.75\%\), respectively. However, for the two cases of \(F(5,3)\) and \(F(6,3)\), we were not able to improve upon the evaluation points proposed by Barabasz et al. in [1] for either 1D or 2D convolution. It seems possible that the existing point selection for these convolution sizes is already optimal.
This shows that selecting integers or simple fractions as polynomial evaluation points is not necessary and that analyzing the error curve is important for identifying points that reduce the floating point error in 1D and 2D convolution.
6.4 Deep Convolution Networks Using Winograd Convolution
As mentioned earlier, short-length Winograd/Toom-Cook convolution lends itself favorably to convolution in deep neural networks. To assess the applicability of the proposed points to real deep convolution networks, Winograd convolution was implemented in Caffe v1.0. Caffe [14] is a deep learning framework, developed by Berkeley AI Research (BAIR) and community contributors, that allows a neural network to be described in a modular way. Specifically, for this work, a customized distribution of Caffe developed by the Software Tools Group at Trinity College Dublin is used.
A complete convolution layer is implemented using Winograd for those layers where the kernel size is three, because all the preceding evaluation is limited to this kernel size. However, the Winograd library of functions developed for that evaluation was not optimized for use in an actual network. This limited the evaluation on real networks to plain Winograd convolution, excluding the variant whose summation order is based on Huffman coding. Furthermore, running a full network on all the images in the dataset proved prohibitively slow.
Therefore, in order to present a proper evaluation, the networks are run on \(1,\!024\) input images. The performance of Winograd convolution with different evaluation points, relative to normal convolution, is evaluated by computing the L1 norm of the distances between the floating point values at the output of individual layers, normalized by the number of images and layers in a particular network. The choice of \(1,\!024\) images was motivated by runs with different numbers of input images, ranging from 2 to \(1,\!024\), which showed little difference in the L1 norm.
Two datasets are evaluated across a number of networks. For datasets, we chose CIFAR10 and CIFAR100 [16]; for networks, we use pre-trained GoogleNet [30], AlexNet [17], ResNet-20 [10], and SqueezeNet [13] with the CIFAR10 dataset, and ResNet-20 [10] with the CIFAR100 dataset. Among these networks, SqueezeNet is the most suitable for embedded devices because of its smaller size and fewer parameters.
Among the chosen networks, 17 out of 21 convolution layers of ResNet-20, 4 out of 4 layers of AlexNet, 10 out of 57 convolution layers of GoogleNet, and 8 out of 26 layers of SqueezeNet were implemented using Winograd. In the future, all layers can be replaced by Winograd layers that cater to different kernel dimensions, stride factors, and group convolutions. It was not possible to run SqueezeNet using Winograd convolution for \(n=10\), since the dimensions of the feature maps were too small for the given number of interpolation points.
The results are presented in Table 11, where the points are the same as those shown in Table 10. The best result, indicated by the lowest L1 norm, for each network and number of Winograd input points is highlighted in bold and underlined for readability. A number of points proposed as part of the point selection scheme in this work produce much better results than the state of the art, with a maximum improvement of around 60%. Furthermore, the results presented in Table 11 are fairly consistent with those shown in Table 10 for a single 2D convolution. The proposed points produce less error not only for the larger Winograd input sizes but also for most of the smaller ones.
Table 11. Normalized L1 Norm of Floating Point Error for Various Deep Convolution Neural Networks on Two Datasets
The case of \(n=6\) is particularly interesting, because it corresponds to the set of points \(\lbrace 0, \infty, -\frac{1}{c}, -c, \frac{1}{c}, c \rbrace,\) which contains all four symmetric variants of c. This is the size of convolution where we might expect our approach to perform best, and indeed we see good results. The case of \(n=10\) is another sweet spot for our approach, where all four variants of c and d are included in the set of points alongside 0 and \(\infty\).
Some of the largest reductions in error for our method arise where \(n=9\). It is not obvious why our method is so well suited to this case. However, when we compare the errors for \(n=8\) and \(n=9,\) we see that the method that chooses simple fractions performs exceptionally poorly for \(n=9\). The error for \(n=9\) is almost 10 times as large as the error for \(n=8\) when using simple fractions. In contrast, our approach manages to find better real-number values for this awkward-sized convolution case.
Table 11 also presents results for the Chebyshev nodes. Chebyshev nodes improve the conditioning of polynomial interpolation [11], which is an essential step of Winograd/Toom-Cook convolution. The numbers in Table 11 show that Chebyshev nodes produce good results. In fact, for GoogleNet and SqueezeNet with \(n=5\), these nodes achieve the lowest L1 norm, and for ResNet-20 (on the CIFAR100 dataset) they outperform the simple fractional points for \(n=4\) and \(n=5\). However, as the number of input interpolation points increases, the L1 norm for the Chebyshev nodes progressively worsens compared to the other two schemes.
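For reference, the Chebyshev nodes of the first kind have a simple closed form; a minimal sketch of how such interpolation points are generated:

```python
import numpy as np

def chebyshev_nodes(n):
    """Chebyshev nodes of the first kind on [-1, 1]:
    x_k = cos((2k - 1) * pi / (2n)), for k = 1..n."""
    k = np.arange(1, n + 1)
    return np.cos((2 * k - 1) * np.pi / (2 * n))
```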
6.4.1 Execution Time.
Winograd convolution allows multiple output points in the OFM to be computed at the same time. Thus, increasing the output size (m) speeds up the computation. However, as mentioned earlier, the floating point error increases significantly as the output size grows.
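The trade-off can be illustrated with a short calculation for \(F(m\times m, 3\times 3)\), following the standard Winograd operation-count analysis: direct convolution needs \(m^2k^2\) multiplications per output tile, while Winograd needs one multiplication per Winograd-domain point, i.e., \((m+k-1)^2\):

```python
k = 3  # kernel size
for m in (2, 4, 6, 8):
    n = m + k - 1              # Winograd input tile size
    direct = (m * k) ** 2      # multiplications per m x m output tile, direct
    winograd = n ** 2          # one multiplication per Winograd-domain point
    print(f"m={m}, n={n}, multiplication reduction = {direct / winograd:.2f}x")
```

For example, \(F(2\times 2, 3\times 3)\) gives the well-known 2.25x reduction, and the ratio keeps improving as m grows, which is why larger output sizes are attractive despite their worse numeric properties.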
As outlined in Section 4.3, there are three key steps to the Winograd algorithm: the transformation of the IFM and kernel to the Winograd domain, the point-wise multiplication (the Hadamard product; the actual convolution), and the transformation of the Hadamard product back to the spatial domain.
In DNN convolution, the input feature map is divided into many sub-blocks, and each sub-block is convolved with many kernels. Likewise, each kernel is convolved with many sub-blocks. Thus, the cost of the transforms is amortized over many uses of the transformed input blocks and kernels. In DNN convolution, the corresponding elements from the different input-channel-wise convolutions are summed to create a single channel of the output feature map. As described in Section 4.3, the channel-wise summation can be computed in the Winograd domain, so that the output transform is applied to the sum of the channels. Therefore, the cost of the output transform is similarly amortized over many separate convolutions.
In contrast, the largest computational cost is the pairwise Hadamard product, which must be computed separately for the cross-product of every input block and kernel. Note that when computing the Hadamard product, we also sum the products across the input channels. The Hadamard product can be computed using highly optimized implementations of GEMM matrix multiplication.
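This step can be sketched in NumPy. The shapes and names below are illustrative (n is the tile size, K the output channels, C the input channels, T the number of input tiles) and are not taken from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, C, T = 6, 16, 8, 4
U = rng.standard_normal((n, n, K, C))  # Winograd-transformed kernels
V = rng.standard_normal((n, n, C, T))  # Winograd-transformed input tiles

# One small GEMM per Winograd-domain point (i, j): the (K x C) by (C x T)
# product performs the point-wise multiplication and, at the same time,
# the summation across the C input channels.
M = np.empty((n, n, K, T))
for i in range(n):
    for j in range(n):
        M[i, j] = U[i, j] @ V[i, j]

# the same computation expressed as a single batched contraction
assert np.allclose(M, np.einsum('ijkc,ijct->ijkt', U, V))
```

Because each of the \(n^2\) GEMMs is an ordinary matrix multiplication, highly tuned BLAS implementations can be used directly.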
Figure 6 shows the execution time of the Hadamard product and summation across input channels for two different \(3 \times 3\) convolution layers of ResNet-18. The x-axis shows the input block size, n for the Winograd convolution, and the y-axis shows execution time. In Figure 6(a) we see that a larger input block size, n, leads to lower execution times. As described earlier in Table 1, a larger block size leads to a lower operation count, so a speedup for larger block sizes is what we expect.
Fig. 6. Hadamard product execution time of Winograd convolution for two different layers of the ResNet-18 network with dimensions < \(M,C,H,W\) >: (a) <64,64,56,56> and (b) <128,64,28,28>, normalized against number of batches.
Figure 6 also shows data for several different mini-batch sizes in the DNN convolution. In deep convolution neural networks, multiple input images are typically processed together to speed up the execution of the network. Either the full batch is processed at once, known as batch processing, or the full batch of images is divided into multiple smaller sets, known as mini-batches, which are processed in turn, known as mini-batch processing. Increasing the batch size increases the size of one of the inputs to the GEMM operation, which helps achieve a speedup by mitigating the effects of smaller input dimensions or a larger Winograd output size. The mini-batch size, \(B,\) is the number of separate input feature maps we operate upon in the convolution. Thus, \(B=1\) means we operate on a single IFM of size \(C \times H \times W\). For larger B we operate on a tensor of IFMs with dimension \(B \times C \times H \times W\).
In Figure 6(b), there is a speedup up to \(n=6\) for a mini-batch size \(B=1\), but larger values of n are slower. The reason for this is that the height and width of the input are relatively small at just 28, so the input is divided into so few blocks that the matrix multiplication operates on few rows. In contrast, when using a mini-batch of four or greater, much larger block sizes become efficient because of the larger matrix multiplication. Thus, for DNN layers that operate on just a single small input feature map, a medium-sized block size is likely better than a large one.
Something similar can be seen in Figure 8. AlexNet has a similar speed advantage when the IFM dimensions are large (\(32\times 32\), in Figure 8(a)) as compared to a smaller IFM (\(16\times 16\), in Figure 8(b)).
This is further illustrated in Figure 7(a), where each input IFM channel is of dimension \(56\times 56\). The same speedup is not achieved when the dimension is \(28\times 28\), but a higher batch size is still beneficial. By far, the greatest benefit of an increased Winograd output size is attained with large-dimensional IFMs and a large number of input and output channels, as shown in Figure 9 for VGG19.
Fig. 7. Hadamard product execution time of Winograd convolution for two different layers of the GoogleNet network with dimensions < \(M,C,H,W\) >: (a) <192,64,56,56> and (b) <192,96,28,28>, normalized against number of batches.
Fig. 8. Hadamard product execution time of Winograd convolution for two different layers of the AlexNet network with dimensions < \(M,C,H,W\) >: (a) <96,3,32,32> and (b) <384,384,16,16>, normalized against number of batches.
Fig. 9. Hadamard product execution time of Winograd convolution for two different layers of the VGG19 network with dimensions < \(M,C,H,W\) >: (a) <64,64,224,224> and (b) <128,64,112,112>, normalized against number of batches.
Thus, the faster execution obtained with an increased Winograd output size will also help lower the energy cost, since energy consumption is directly proportional to the amount of time an algorithm or system is running.
7 OTHER EXPERIMENTS
Apart from experimenting with the proposed evaluation points, whose details and results are given in Sections 5 and 6, we experimented with the following methods in an attempt to further reduce the FP error:
Implementing Winograd transforms using Kronecker product for 2D convolution
Using pairwise summation while performing addition during the Winograd transforms
Using different base points for the two separate transforms used to implement 2D convolution
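The second item, pairwise summation, can be sketched as follows. This is an illustrative recursive version, not the implementation used in our experiments; its rounding error grows as O(log n) rather than the O(n) of sequential left-to-right summation:

```python
import numpy as np

def pairwise_sum(x):
    """Sum x by recursively splitting it in half, so each value
    participates in only O(log n) additions."""
    n = len(x)
    if n == 1:
        return x[0]
    mid = n // 2
    return pairwise_sum(x[:mid]) + pairwise_sum(x[mid:])
```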
The Kronecker product of two matrices is a generalization of the outer product of vectors. The Kronecker product of A and B, also referred to as direct or tensor product [21], is an \((m_1m_2)\,\times \,(n_1n_2)\) matrix, where A and B are \(m_1\times n_1\) and \(m_2\times n_2\) matrices and the product is written as \(C = A \otimes B\). The Kronecker product has one important property that is relevant to the 2D convolution and is given here: (37) \(\begin{equation} \text{Vec}(ABC) = (C^T \otimes A) \times \text{Vec}(B), \end{equation}\) where \(\text{Vec}(X)\) indicates the vectorized form of the matrix X.
The property shown in Equation (37) matches exactly the terms \(GgG^T\) and \(B^TdB\) of Equation (9). So, instead of implementing the 2D Winograd transforms using two matrix multiplications, one can calculate the Kronecker product between G and G (similarly, B and B) and then perform a single matrix-vector multiplication. A further motivation for this experiment was the idea that multiplying the transform matrices together before performing the vector multiplication with the vectorized input/coefficient matrix might reduce the FP error. However, the number of additions and multiplications involved in the operation defined by Equation (37) is much greater than in the standard matrix multiplications of Equation (9). We found that the FP errors obtained using the Kronecker product were greater than those of the original case. Similarly, we experimented with pairwise summation when multiplying the Kronecker product (\(G\otimes G\)) with the vectorized form of g, by first performing a point-wise multiplication and then adding in pairs. However, the fundamental problem of the Kronecker product requiring more multiplications and additions caused the error to remain large despite the improved summation method.
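The property in Equation (37) is easy to check numerically; a minimal sketch with a random \(4\times 3\) stand-in for the transform G and a random \(3\times 3\) kernel g (note the column-major vectorization that the identity assumes):

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 3))   # stand-in for a Winograd kernel transform
g = rng.standard_normal((3, 3))   # stand-in for a 3 x 3 kernel

# Vec(G g G^T) = (G kron G) Vec(g), with column-major Vec (order='F'):
# take A = G, B = g, C = G^T in Vec(ABC) = (C^T kron A) Vec(B).
lhs = (G @ g @ G.T).flatten(order='F')
rhs = np.kron(G, G) @ g.flatten(order='F')
assert np.allclose(lhs, rhs)
```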
Finally, we experimented with using different sets of points for the first and second dimensions when doing 2D convolution. Typically, 2D convolution is implemented by nesting two 1D convolutions using identical points for both dimensions. We experimentally evaluated having two different sets of interpolation points for the two 1D convolutions that we use to compute 2D convolution. However, the initial results were poor, so we did not experiment further with this approach.
8 DISCUSSION AND FUTURE WORK
This article addresses the problem of selecting interpolation points for Winograd convolution to minimize the numerical error. Reducing the numerical error allows larger input block sizes to be used, which reduces the computational cost of Winograd convolution, as shown in Figure 1. Our proposed method allows us to abandon the existing practice of using just small integer and fractional points and instead search the full space of real-valued points. This allows us to reduce the numerical error significantly for several important convolution sizes. As described in Section 7, we also studied other strategies to reduce the numerical error that were not successful.
The ideal solution to the problem of selecting interpolation points for Winograd convolution would be an analytic mathematical solution. The closest existing such approach is the Chebyshev nodes, which are designed to minimize the interpolation error. The Chebyshev nodes may be close to optimal in reducing the asymptotic error. However, for the small-sized convolutions found in CNNs, the complex interactions between the terms in the transform matrices result in errors that outweigh the asymptotic advantages of the Chebyshev nodes. Given the many small interactions between point selections for constant-sized small convolutions, it seems possible that the structure of an analytic solution might be different for each small convolution size.
One possible area of significant improvement is finding interpolation points that work well for a particular CNN. CNNs are typically trained ahead of time in large data centers, and the trained network is then deployed on edge devices. Thus, at the time that the CNN is deployed, its weights are already known. In the current article, we search for interpolation points that reduce the numerical error for any set of weights and inputs. Inevitably, these points that are suitable across a wide range of weights and inputs are compromises between multiple conflicting goals. When the set of weights is known, fewer such compromises are necessary, and it may be possible to find interpolation points that work better for a specific set of weights.
9 CONCLUSION
Winograd and Toom-Cook convolution are efficient algorithms for computing the short-length convolutions that occur very frequently in deep convolutional neural networks. However, these algorithms suffer from reduced floating point accuracy due to the transforms to and from the Winograd/Toom-Cook domain. They are based on polynomial interpolation using distinct points, and prior research has suggested using small integers or simple fractions of the form \(\frac{a}{b}\), where both a and b are small integers, to reduce the floating point computation errors.
In this work, we propose a particular form of input points for the modified Toom-Cook algorithm, i.e., \(\lbrace -\frac{1}{c}, -c, 0, c, \frac{1}{c}\rbrace\). We evaluate the error curve for these points and find that it is mostly smooth, with a clear region of low errors. This allows us to find good values of c that reduce the numeric error without having to consider all possible floating point values. We find real-valued points that are neither simple integers nor fractions and that reduce the error, especially for the case of \(F(4,3)\) convolution with input and output tile sizes of \(n=6\) and \(m=4\). The reduction in error is 18% and 28% for 1D and 2D convolution, respectively.
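As an illustration of the search, the sketch below constructs the proposed point set for a range of c values and uses the condition number of the associated Vandermonde matrix as a rough proxy for numeric error. The article measures actual floating point error; the proxy, the scan range, and the function names here are our own illustrative choices:

```python
import numpy as np

def points_f43(c):
    """Proposed symmetric point set for F(4,3) (plus the infinity point,
    which is handled separately in the modified Toom-Cook algorithm)."""
    return np.array([-1.0 / c, -c, 0.0, c, 1.0 / c])

# Scan candidate values of c; c = 1 is excluded because it duplicates points.
best = min(
    (np.linalg.cond(np.vander(points_f43(c), increasing=True)), c)
    for c in np.linspace(1.1, 3.0, 200)
)
print(f"lowest-condition c in scan: {best[1]:.3f}")
```

Because the error curve over c is roughly smooth, a coarse scan like this localizes the low-error region, which can then be refined.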
We extended our method to the larger convolution size of \(F(8,3)\) by repeating the pattern of \(F(4,3)\), i.e., \(\lbrace -\frac{1}{d}, -\frac{1}{c}, -d, -c, 0, c, d, \frac{1}{c}, \frac{1}{d}\rbrace\). This allows us to achieve an error reduction of 59% and 28% for 1D and 2D convolutions, respectively, compared to the best existing points. In addition to our proposed form of points, where 0 and \(\infty\) are fixed as base points, we experimentally evaluated further base points in addition to 0 and identified that the \(F(3,3)\) convolution benefits from a different base, achieving 20% and 36% improvement for 1D and 2D convolution.
Finally, we implemented a complete Winograd convolution layer in Caffe v1.0 and used it to run GoogleNet, AlexNet, and SqueezeNet on the CIFAR10 dataset and ResNet-20 on both the CIFAR10 and CIFAR100 datasets. We evaluated our proposed points alongside points chosen with the existing strategy of picking only simple rationals, as well as the Chebyshev nodes, using \(1,\!024\) images. For evaluation, we computed the L1 norm of the distance between the outputs of each convolution layer under Winograd and normal convolution, normalized by the number of convolution layers and images. The points proposed in this work reduce the L1 norm in the majority of cases, with improvements ranging from 22% to almost 63%. We also evaluated the execution time of the Hadamard product for various real-world deep convolution network layers at different Winograd input/output sizes, and showed that larger Winograd input/output sizes speed up the execution, but only when the dimensions of the layer inputs are large and when inputs are provided in batches, as is the norm for real networks.
ACKNOWLEDGMENTS
We extend our gratitude to Dr. Israr Ali Khan of Namal Institute Mianwali, Pakistan, for his support.
REFERENCES
- [1] 2020. Error analysis and improving the accuracy of Winograd convolution for deep neural networks. ACM Trans. Math. Softw. 46, 4, Article 37 (Nov. 2020), 33 pages.
- [2] 2019. Winograd convolution for DNNs: Beyond linear polynomials. In Proc. Int. Conf. Italian Association for Artificial Intelligence. Springer International Publishing, 307–320.
- [3] 2010. Fast Algorithms for Signal Processing. Cambridge University Press.
- [4] 2007. Towards optimal Toom-Cook multiplication for univariate and multivariate polynomials in characteristic 2 and 0. In Arithmetic of Finite Fields. Springer, Berlin, 116–133.
- [5] 2019. Deep learning with edge computing: A review. 107, 8 (2019), 1655–1674.
- [6] 1966. On the Minimum Computation Time of Functions. Ph.D. Dissertation. Cambridge, MA.
- [7] 2020. Searching for Winograd-aware Quantized Networks. arXiv:2002.10711.
- [8] 1974. Norm estimates for inverses of Vandermonde matrices. Numer. Math. 23 (1974), 337–347.
- [9] 1990. How (un)stable are Vandermonde systems? Asymptotic and Computational Analysis 124 (1990), 193–210.
- [10] 2016. Deep residual learning for image recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'16). 770–778.
- [11] 2002. Accuracy and Stability of Numerical Algorithms, Vol. 80. SIAM.
- [12] 2020. DWM: A decomposable Winograd method for convolution acceleration. Proceedings of the AAAI Conference on Artificial Intelligence 34, 4 (2020), 4174–4181.
- [13] 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360 (2016). http://arxiv.org/abs/1602.07360.
- [14] 2014. Caffe: Convolutional architecture for fast feature embedding. In Proc. ACM Int. Conf. on Multimedia (MM'14). 675–678.
- [15] 2004. Automatic derivation and implementation of fast convolution algorithms. Journal of Symbolic Computation 37, 2 (2004), 261–293.
- [16] 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report. https://www.cs.toronto.edu/kriz/learning-features-2009-TR.pdf.
- [17] 2012. ImageNet classification with deep convolutional neural networks. In Proc. Int. Conf. Neural Inf. Proc. Sys. - Volume 1 (NIPS'12). 1097–1105.
- [18] 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (May 2017), 84–90.
- [19] 2016. Fast algorithms for convolutional neural networks. In Proc. IEEE Conf. Comput. Vision Pattern Recog. IEEE, 4013–4021.
- [20] 2018. FFT-based deep learning deployment in embedded systems. In Proc. Design Automation Test Europe. 1045–1050.
- [21] 2000. The ubiquitous Kronecker product. J. Comput. Appl. Math. 123, 1 (2000), 85–100.
- [22] 2019. Efficient Winograd or Cook-Toom convolution kernel implementation on widely used mobile CPUs. In Proc. Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2'19). 1–5.
- [23] 1981. Fast Fourier Transform and Convolution Algorithms. Springer.
- [24] 1999. VLSI Digital Signal Processing Systems, Design and Implementation. Wiley-Interscience.
- [25] 2017. Fast and efficient implementation of convolutional neural networks on FPGA. In Proc. IEEE Int. Applicat.-Specific Syst. Arch. Processors Conf. 11–18.
- [26] 2017. The emergence of edge computing. Computer 50, 1 (Jan. 2017), 30–39.
- [27] 2016. Edge computing: Vision and challenges. 3, 5 (2016), 637–646.
- [28] 2014. Very Deep Convolutional Networks for Large-scale Image Recognition. arXiv:1409.1556.
- [29] 2015. Going deeper with convolutions. In Proc. IEEE Conf. Comput. Vision Pattern Recog. 1–9.
- [30] 2015. Going deeper with convolutions. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'15). 1–9.
- [31] 1997. Algorithms for Discrete Fourier Transform and Convolution (2nd ed.). Springer-Verlag.
- [32] 1963. The complexity of a scheme of functional elements realizing multiplication of integers. Soviet Mathematics - Doklady 3 (1963), 714–716.
- [33] 2017. On improving the numerical stability of Winograd convolution. In Proc. Int. Conf. Learning Representation.
- [34] 1980. Arithmetic Complexity of Computations. SIAM Publications.
- [35] 1980. Signal processing and complexity of computation. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process. IEEE, 94–101.
- [36] 2017. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs. In Proc. Annual Design Automation Conference (DAC'17). Association for Computing Machinery, New York, NY, Article 62, 6 pages.
- [37] 2018. Efficient Winograd-based convolution kernel implementation on edge devices. In Proc. ACM/ESDA/IEEE Design Automation Conference (DAC'18). 1–6.
- [38] 2018. A faster algorithm for reducing the computational complexity of convolutional neural networks. Algorithms 11, 10 (2018).