Abstract
Designing hardware accelerators to run the inference of convolutional neural networks (CNN) is under intensive research. Several different architectures have been proposed along with hardware-oriented optimizations of the neural network models. One of the most used optimizations is quantization, since it reduces the memory required to store weights and layer maps, the memory bandwidth requirements, and the hardware complexity. As a consequence, the inference throughput improves and the computing cost is reduced, allowing inference to be executed on embedded devices. In this work, we propose highly efficient dot-product arithmetic units for ternary and non-ternary convolutional neural networks on FPGA. The non-ternary dot-product unit uses a fused multiply-add that avoids expensive adder trees, while the ternary dot-product unit uses a dual product unit followed by an optimized conditional adder tree structure. In both cases, designs with and without embedded DSPs are considered. The solution is configurable and can be adapted to the number of resources available in the FPGA to achieve the best efficiency. A CNN architecture was developed and characterized using the proposed dot-product units. The results show a 1.8× performance improvement and a 2× area efficiency improvement for low bit-width quantizations when compared to previous works running large CNNs on FPGA.
1 INTRODUCTION
The convolutional neural network (CNN) is one of the most utilized deep learning models for a vast set of applications, including image classification [25], object detection [49], and image segmentation [2]. Compared to traditional machine learning algorithms, convolutional neural networks obtain improved accuracy in many problems. However, it is well known that CNNs have high computing and memory requirements. Known networks [9, 10, 29, 31] have hundreds of millions of parameters and a computational complexity of billions of multiply-accumulations per inference. Multicore computing architectures are therefore necessary to run the inference of these models in acceptable running times. The preferred computing platforms are therefore GPUs (Graphics Processing Units) [26], FPGAs (Field-Programmable Gate Arrays) [6], and ASICs (Application Specific Integrated Circuits) [20].
The programmability and high computing power of GPUs make them the most used platform for training and inference of deep neural networks. The availability of training platforms and highly efficient software libraries turns the development of deep learning models for GPUs into a relatively fast task. The disadvantage of these platforms is that they are power hungry and therefore not adequate for embedded computing where power is scarce. While training of deep neural networks is still a task for GPUs, inference can be deployed on other computing platforms since the computational requirements are orders of magnitude lower. Therefore, designers started to develop architectures for deep learning inference in FPGA and ASIC.
Dedicated hardware accelerators implemented in FPGAs and ASICs have shown very good performance and power efficiency [6, 20]. ASICs are more efficient than FPGA but they are more expensive for medium volume production and not flexible enough to follow the highly dynamic evolution of deep neural networks, which reduces the throughput efficiency of ASIC-based solutions [13]. FPGAs have been extensively used to accelerate the execution of different deep learning models. The hardware flexibility of FPGAs allows designing dedicated hardware accelerators tailored for specific models. This improves the throughput and energy efficiency of the architecture, important for embedded systems. The same underlying hardware architecture can be adapted to the characteristics of a particular model. This improves the efficiency of FPGA-based architectures, compared to ASIC-based solutions.
The throughput of the hardware accelerators of neural networks can be further improved with data quantization [24], in which weights and activations are represented with a particular data width in a fixed-point or custom floating-point format [42]. Representing data with fewer bits reduces the required memory and simplifies the arithmetic units, which improves the area and the performance of the architectures. Quantization is a trade-off between performance, area, power, and accuracy. Typically, 8-bit quantization is adopted, since it incurs negligible accuracy loss in most cases. However, when designing for FPGAs, quantization of any size is possible [5]. The possibility to consider any quantization type and size increases the design space but offers the opportunity to optimize the architecture for a specific problem, which is particularly relevant for embedded applications.
Most FPGA proposals for inference acceleration use quantization and explore the available parallelism of model inference to improve the inference throughput. To achieve faster implementations, a larger FPGA can be used but the performance efficiency is almost constant. To further improve these designs it is important to look at the fine grain aspects of the architectures, namely the core arithmetic units.
The hardware accelerators of deep neural networks are many-core architectures where the cores execute the same operations in parallel. The execution of a layer is mostly the calculation of a huge number of dot products. The hardware design of this operation determines the efficiency of the architecture. Dot product implementations on FPGA explore the parallelism by using a number of parallel multipliers followed by an adder tree. So, the design of the multipliers and the adder tree determines the design of the dot product. For example, in a direct hardware implementation in FPGA of the dot product between vectors of size 8 with data represented with 8-bit integer, the adder represents around 20% of the total area. This increases to 30% with a careful design of the multiplier, and the percentage increases as we reduce the size of the operands. Since a CNN accelerator has hundreds of dot product units, the cost of the adder tree has a high impact over the final circuit area.
In this paper, we propose two different architectures to calculate dot products of integer numbers. One architecture targets non-ternary quantizations and the other targets quantizations where the parameters and/or activations are ternary (values in {\(-\)1, 0, 1}). The first architecture uses a fused multiply-add unit to calculate dot products and does not require adder trees. The second, ternary architecture still uses an adder tree, but the first level of the tree is fused with pairs of multiplications, which halves the size of the required adder tree, and is followed by a conditional adder tree. A CNN accelerator is then proposed that integrates these dot-product units. The utilization of the new multiply-add units improves the efficiency of the CNN accelerator compared to the recent state of the art. The main contributions of this work are the following:
A novel dot-product unit for quantizations above ternary based on a fused multiply-adder;
Optimization of the dot-product unit for ternary quantizations;
A configurable CNN architecture with the new dot product units.
2 RELATED WORK
A convolutional neural network consists of a sequence of layers. Each layer receives input feature maps from a previous layer and generates output feature maps for the next layer. The most common layers of CNNs are the convolutional, the fully connected, and the pooling layers. The convolutional layers are the most computationally intensive where 3D convolutions are executed between the input feature maps and a set of filters. Some convolutional layers are followed by pooling layers that sub-sample the Output Feature Maps (OFMs).
The last convolutional layer is usually followed by one or more fully connected or dense layers, where all nodes between two layers are densely connected. The last dense layer generates the probabilities of each class of the model.
Many CNN accelerators have been proposed in the past decade [20, 27]. The hardware accelerators for CNN architectures typically follow one of three main strategies: a single module to run all layers, a dataflow of modules, or a mix of both. Some works are designed with a single module [11] that serially runs all layers. The on-chip memory is used to store weights and input and output feature maps before sending them to external memory. However, the large data volume transferred to and from external memory may become an execution bottleneck. Also, since there is only a single module, it has to be configurable enough to execute all layers with the same efficiency.
The second design strategy for CNN accelerators implements one specific hardware module for each layer in a pipelined dataflow [18, 21]. These dataflow architectures are quite efficient, since each module is optimized for a particular layer and different modules can even use different quantizations. This architecture reduces the transfers to and from external memory, but requires significant on-chip memory resources to store intermediate maps and weights, which may easily consume all available on-chip memory. This approach is usually considered for large FPGA devices.
A third type of CNN accelerator design maps subsets of layers to a fixed set of modules [28]. This solution consists of several modules, each running a subset of the CNN layers, and sacrifices performance for resource efficiency.
Whatever the model of computation used to design the CNN accelerator, algorithmic and architectural techniques, like pruning and quantization, have been considered to improve the execution of convolutional neural networks [8, 27, 42]. Pruning cuts connections between neurons to reduce the number of operations as well as the memory needed to store the weights. The problem with the method is that the introduced sparsity disrupts the regular structure of computing datapaths. Techniques like block pruning [33], in which weights are pruned in blocks, exist to mitigate the sparsity problem caused by pruning.
The quantization method reduces the size of the operands (weights and activations) and/or the format of the operands. Fixed-point quantization is the most common, although some works have considered custom floating-point formats. In [17], an FPGA-based accelerator is proposed using block floating-point arithmetic. The block floating-point format has one different mantissa for each value of the dataset but a common exponent. It reduces the necessary bits to represent data and at the same time provides a larger dynamic range. Still, the format requires a more expensive hardware design of the arithmetic modules compared to fixed-point but lighter when compared to standard 32 or 16-bit floating-point. Another recent approach considering custom floating-point has proposed a low-precision floating-point quantization [40]. It is an 8-bit floating-point format with sign, mantissa and exponent, like a normal floating-point format. The same format is used for all data in the same layer of a neural network model. Four and five-bit mantissas have shown good accuracy results, and the size of multiplications is kept very small.
Usually, CNNs with fixed-point quantization can achieve, with post-training quantization or fine-tuning, the same accuracy obtained with these custom floating-point formats. So, the authors of these custom floating-point units point out the advantages of their formats for post-training quantization, that is, quantization without retraining.
The quantization design space is large since both weights and activations can be quantized to any size. In spite of this flexibility, most works have proposed architectures for 8-bit [35, 45] and 16-bit quantization [1], as well as for binary quantization [7, 32, 48]. A few exceptions exist that design accelerators for other quantizations. In [12], a custom network model is trained with binary and ternary quantizations and implemented in FPGA. In [44], a custom model is quantized with 1-bit weights and 4-bit activations and also implemented in FPGA. A mixed quantization is proposed in [30] that uses intra-layer mixed quantizations for weights. Weights are quantized to 4-bit or 8-bit while activations are quantized with 5 bits. The data widths can be fixed for all layers or specific for each layer. In [39], the authors propose and implement a mixed-precision VGG16 network on FPGA, where weights have different bitwidths in different layers; more specifically, they implement 8-bit weights for the first layers and 1- or 2-bit weights for hidden layers. In [49], a hybrid quantization scheme uses 8-bit fixed-point and shift quantization (powers of 2) to represent weights, with fixed 8-bit activations in all layers. The solution improves the hardware area due to power-of-two quantization. In [36, 37], a core is proposed to run weights represented with 8 or 2 bits. The idea is to use the same hardware module for both configurations of layers and map the architecture in low-density FPGAs. In [46], an accelerator is proposed to run both CNN and RNN (Recurrent Neural Network) models with 4-bit quantization for both weights and activations. In [19], a CNN accelerator is proposed to accelerate residual-like networks with 8-bit activations and 4-bit weights. An FPGA accelerator for fully ternary networks was proposed in [23] together with custom \(2\times 2\) arithmetic to increase the performance of the solution.
These works on fine-grain quantization are very important since the trade-off between accuracy and quantization determines the best design for a specific application. Therefore, it is important to design CNN accelerators for diverse low-bit quantizations and optimize their core arithmetic units for best efficiency, which is one of the main objectives of the work reported in this paper.
In any CNN accelerator, quantized or not, the main arithmetic operations are multiply-add and multiply-accumulate. CNN accelerators may easily integrate thousands of multipliers and adders. Many CNN architectures implement the 3D convolutions of CNNs as accumulations of 2D convolutions, which allows using the Winograd minimal filtering algorithm [22]. With the Winograd transformation, the number of multiplications is reduced at the cost of some extra additions. Since the hardware complexity of a multiplication is several times higher than that of an addition, the method is quite efficient. However, the method is only feasible for small filter sizes, requires some data manipulation and a fixed hardware structure, and its efficiency decreases with the quantization size, since the hardware gap between multipliers and adders narrows as the operands shrink. Instead of accumulating 2D convolutions, other methods compute 3D convolutions directly [35]. This approach is independent of the filter size and exposes more inter-kernel parallelism, allowing inner products of larger vectors.
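As an illustration of the trade-off described above, the 1D Winograd transform \(F(2,3)\) computes two outputs of a 3-tap filter with 4 multiplications instead of 6, at the cost of extra additions (an illustrative Python sketch; the function names are ours, and the result is computed in floating point because the transform uses halves):

```python
def winograd_f23(d, g):
    """1D Winograd F(2,3): two outputs of a 3-tap filter over a
    4-element input tile using 4 multiplications instead of the 6
    needed by the direct method."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv3(d, g):
    # direct 3-tap convolution: 6 multiplications for 2 outputs
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

The saving of two multiplications per output pair comes at the price of the pre-additions on \(d\) and \(g\), which is only worthwhile while multipliers are much more expensive than adders.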
To further improve the design of a CNN accelerator, low-level hardware optimizations of the arithmetic units are very important since a CNN accelerator may contain thousands of these units. So, any hardware reduction of a single core has a great impact over the full architecture. The multipliers are the most expensive units of the cores compared to the adders. Therefore, some authors have proposed optimizations for the design of multipliers in FPGA. In [15], the multiplier architecture is restructured into smaller architectures. The multiplier is optimized for Intel FPGA devices with reductions in both area and latency. Multiplier optimization on Xilinx devices was proposed in [38]. The work introduces a novel two-operand addition circuit using radix-4 partial-product generation with addition and uses it to implement multipliers. The proposed multipliers improve the area and the performance.
The implementation of multipliers in FPGA is an active research area that considers several different packing problems, including the design of multiple small multipliers with a single large multiplier and the design of a large multiplier with multiple small multipliers. Multipliers larger than the one present in a Digital Signal Processing (DSP) unit are implemented with several DSPs or with a hybrid solution with DSPs and Look-Up Tables (LUTs). The best solution is determined by the size of the target multiplier, as well as the available resources. Therefore, the solution is not unique. The design of several small multipliers with a single multiplier or a chain of multipliers is an architectural problem, whose packing efficiency depends on the size of the multiplier inputs.
With low bitwidth quantization, typically 8 or less bits, a single DSP can implement multiple multiplications and additions. In [34], two 8-bit multiplications with a common operand are implemented in a single \(25\times 18\) DSP. The authors in [43] implemented four \(4\times 4\) multipliers in a single \(27\times 18\) multiplier. Both DSP and LUTs can be mixed to effectively use all the resources of the FPGA and increase the area efficiency, like in [35] where 8-bit cores are implemented with both DSP and LUTs achieving close to 400 GOPS in a low-density ZYNQ7020 FPGA.
The quantization size determines the hardware size ratio between arithmetic implemented with LUTs and that implemented with DSPs. For example, a \(16\times 16\) multiplier can be implemented with a single DSP or with about 136 LUTs, a ratio of 136 LUTs per DSP. The same single DSP can implement two \(8\times 8\) multipliers, and these two multipliers can be implemented with about 72 LUTs, a ratio of 72 LUTs per DSP. As the quantization size decreases, the utilization of DSPs relative to LUTs decreases. When very low bitwidth quantization is considered, DSPs are usually used for accumulation and LUTs are left for multiplication and addition [16].
Since dot products are the main operations of a neural network, a few works have proposed optimizations for this operation. In [34], the authors proposed an implementation of parallel multiply and accumulate units in FPGA for dot-product operations with very high performance/area ratios using a mix of DSP blocks and LUTs. The main emphasis of the work is to consider a balanced utilization of DSPs and LUTs to implement dot-products. In [23], dot product arithmetic units are proposed for the implementation of CNNs using ternary arithmetic. Novel multioperand 2-bit adders were designed to replace the typical adder tree and improve the area and performance of the circuit.
Unless fully binary or ternary CNNs are considered, for which very efficient dot-product arithmetic can be designed since multipliers are simple and adder trees can be optimized [23], adder trees are expensive relative to multipliers in arithmetic for low-bitwidth quantization. Therefore, designing efficient fused multiply-adders will improve the design of dot products. This can be achieved not only by logic manipulation but can be further optimized if the application context is considered. This work is a step forward in this optimization in the context of convolutional neural networks.
3 FUSED MULTIPLY-ADDER UNIT FOR DOT-PRODUCT ARITHMETIC
The method used to design the dot products depends on the size of the operands. This work considers weights and activations quantized with 2, 4, and 8 bits. 2-bit quantizations correspond to ternary quantizations with values in the set {\(-\)1, 0, 1} and are therefore treated differently. These quantizations are considered in this work because they are the most common, but others can be handled with the same proposed methods. Two different design solutions are considered according to the quantization: one for non-ternary and one for ternary quantizations.
The designs to be proposed consider both LUT-only solutions and mixed LUT/DSP solutions.
3.1 Dot-Product Arithmetic Unit for Non-Ternary Quantization
Given two vectors, \(W = \{W_{N-1},W_{N-2},\ldots , W_1,W_0\}\) and \(Y = \{Y_{N-1},Y_{N-2},\ldots , Y_1,Y_0\}\), with the same number, \(N\), of elements, the dot product, DP(W, Y), between vectors \(W\) and \(Y\) is given by Equation (1). (1) \(\begin{equation} DP(W,Y) = \sum _{i=0}^{N-1} W_i \times Y_i \end{equation}\)
The result is obtained after \(N\) multiplications and additions, where the first addition is with 0. Since there are no data dependencies, the operation can be easily parallelized using multiple multipliers in parallel followed by an adder tree. If there are less than \(N\) multipliers in parallel, then the dot product is calculated in multiple steps with the partial results being accumulated. Formally, assuming \(K\) multipliers, with \(N\) a multiple of \(K\) (for non multiple cases, the method is the same but in the last step some multipliers are not used) the addition of all the products is calculated according to Equation (2). (2) \(\begin{equation} DP(W,Y) = \sum _{i=0}^{N/K-1} \sum _{j=0}^{K-1} W_{i*K+j} \times Y_{i*K+j} \end{equation}\)
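The blocked computation of Equation (2) can be sketched as follows (an illustrative Python model of the arithmetic, not of the hardware; the names are ours):

```python
def dot_product_blocked(W, Y, K):
    """Dot product of Equation (2): N/K sequential steps, each using
    K parallel multipliers, with the partial results accumulated."""
    N = len(W)
    assert N % K == 0, "pad the vectors when N is not a multiple of K"
    acc = 0
    for i in range(N // K):  # one accumulation step per group of K products
        acc += sum(W[i * K + j] * Y[i * K + j] for j in range(K))
    return acc
```

Any \(K\) that divides \(N\) yields the same result; \(K\) only selects how much of the sum is computed in parallel per step.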
3.1.1 Designing 8×8 Dot-Products.
A direct hardware implementation of Equation (2) uses \(K\) multipliers followed by an adder tree and an accumulator. Figure 1 illustrates an implementation for 8-bit operands and 8 multipliers.
Fig. 1. Block diagram of a direct implementation of a dot-product for the example with 8 multipliers in parallel of 8-bit operands.
The last adder and the accumulator can be replaced by a single ternary adder. Ternary adders require carry routing through normal routing lines, increasing the pressure over routing and the delay. However, it removes the last adder of the adder tree, reducing the number of LUTs.
Knowing that an M-bit adder is implemented with M 6-input LUTs, the number of 6-input LUTs to implement an adder tree with K M-bits operands plus the accumulator with \(L\) bits, \(AreaTree_{LUT}\), without any optimization with the ternary adder, is given by Equation (3). (3) \(\begin{equation} AreaTree_{LUT} = L + \sum _{i=1}^{log_2K-1} \frac{K}{2^i} \times (M + i) \end{equation}\)
The adder tree has \(log_2K-1\) levels of adders, each level, \(lv\), has \(\frac{K}{2^{lv}}\) adders (where the first level is \(lv=1\)) and all adders in a particular level are (M+lv)-bit adders.
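Equation (3) can be evaluated with a short script (illustrative Python; the function name is ours):

```python
from math import log2

def adder_tree_luts(K, M, L):
    """LUT count of Equation (3): a tree adding K M-bit products,
    where the last tree level and the L-bit accumulator are merged
    into a single ternary adder, leaving log2(K)-1 levels of
    (M+i)-bit adders, each level with K/2**i adders."""
    levels = int(log2(K)) - 1
    return L + sum((K >> i) * (M + i) for i in range(1, levels + 1))
```

For example, for \(K=8\) products of \(M=16\) bits and a 24-bit accumulator, the cost is \(24 + 4\times 17 + 2\times 18 = 128\) LUTs.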
Knowing that the complexity of the area of \(K\) multipliers with products of size \(M\) is \(\mathcal {O}(K\times M^2)\) and that the complexity of the area of the adder tree to add the outputs of the multipliers is \(\mathcal {O}(log_2K \times K \times M)\), the relative area of the adder tree increases with the reduction of the multiplier size.
Since we are targeting dot products for low quantizations, reducing the overhead of the adder tree is even more important. Many compressor trees have been proposed for FPGA [15]. Among the several compressor trees, one of the most efficient for FPGA implements the adder tree with 4:2 compressors [14]. Each set of four terms is compressed to two terms: a sum and a carry. The 4:2 compressor uses the carry chain just like a normal two-operand adder (see the dot product implementation with 4:2 compressors in Figure 2) and it is a regular structure, contrary to several other compression techniques.
Fig. 2. Block diagram of a direct implementation of a dot-product for the example with 8 multipliers in parallel of 8-bit operands using 4:2 compressors.
The two terms of the final 4:2 compressor are accumulated with a ternary adder in the example. When the critical path resides in the ternary adder, it can be replaced by two 2-input adders.
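The arithmetic identity behind the 4:2 compressor can be modeled at word level with two carry-save (3:2) stages; the hardware compressor in [14] is structured differently (it uses the carry chain), but it reduces four terms to two in the same way (illustrative Python sketch; names are ours):

```python
def csa(a, b, c):
    # 3:2 carry-save stage: two words whose sum equals a + b + c
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def compress_4_2(a, b, c, d):
    """Word-level model of a 4:2 compressor: four terms are reduced
    to a sum/carry pair with the same total, leaving a single
    carry-propagate addition at the end."""
    s1, c1 = csa(a, b, c)
    s2, c2 = csa(s1, c1, d)
    return s2, c2
```

Since the pair satisfies \(s2 + c2 = a + b + c + d\), only one final addition remains after the whole compressor tree, which is why the compressor-based tree needs roughly half the LUTs of the plain adder tree.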
The complexity of the adder tree using 4:2 compressors and the ternary adder, \(AreaTree42_{LUT}\), is given by Equation (5). (4) \(\begin{align} AreaTree42_{LUT} & = L + \sum _{i=1}^{log_2K-1} \frac{K}{2^{i+1}} \times (M + i) \end{align}\) (5) \(\begin{align} & = L + \frac{1}{2} \left(AreaTree_{LUT} - L\right) \end{align}\)
The solution with the compressors is about half the size of the adder tree with 2-input adders.
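Equation (4) can likewise be evaluated numerically (illustrative Python; the function name is ours):

```python
from math import log2

def adder_tree_42_luts(K, M, L):
    """LUT count of Equation (4): with 4:2 compressors, every tree
    level needs half the adders of the plain tree of Equation (3),
    so the tree part (everything except the L-bit accumulator)
    is halved."""
    levels = int(log2(K)) - 1
    return L + sum((K >> (i + 1)) * (M + i) for i in range(1, levels + 1))
```

For \(K=8\), \(M=16\), \(L=24\), the cost is \(24 + 2\times 17 + 1\times 18 = 76\) LUTs, against 128 LUTs for the plain adder tree with the same parameters.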
Contrary to the above solutions that utilize an adder tree, the work proposed in this paper introduces a fused multiply-adder to remove the adder tree. We started with the multiplier proposed in [38]. This multiplier uses Booth recoding to reduce the number of partial products. A level of LUTs and a carry chain is used to multiply three recoded bits of the multiplier by the multiplicand and sum the partial product with the previous partial product (see [38] for further details on the multiply-add of partial products).
Let’s consider an \(8\times 8\) multiplication using this method. A 9-bit partial product is generated for each pair of recoded multiplier bits. This has to be sign extended to be added with the next partial product. Instead of the normal sign extension with the replication of the most significant bit, sign extension is obtained by complementing the most significant bit, adding one to this bit and extending the number with ones. The correctness of this property can be verified considering both positive and negative numbers. When the number is positive, the complement of the sign bit plus one is zero and a carry propagates through all extended ones converting them to zeros. When the number is negative, the complement of the sign bit plus one is one and there is no carry propagation, so the number becomes extended with ones. The method is advantageous since the ones of all partial products can be simplified at synthesis time.
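The sign-extension property can be checked exhaustively for the 9-bit partial products of the example (illustrative Python; the bit positions assume a 9-bit two's-complement product extended to 16 bits):

```python
def sign_extend_trick(p9):
    """Extend a 9-bit two's-complement value to 16 bits by
    complementing the sign bit (bit 8), adding one at that position,
    and filling the upper bits with ones, as described in the text."""
    flipped = p9 ^ 0x100        # complement the sign bit
    ones = 0xFE00               # bits 9..15 forced to '1'
    return (flipped + 0x100 + ones) & 0xFFFF

# exhaustive check against ordinary sign extension
for p9 in range(1 << 9):
    signed = p9 - (1 << 9) if p9 & 0x100 else p9  # 9-bit value
    assert sign_extend_trick(p9) == signed & 0xFFFF
```

For positive values the added one ripples through the forced ones and clears them; for negative values no carry propagates and the ones remain, exactly as argued above.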
Figure 3 illustrates the partial products with the sign extension.
Fig. 3. Partial products for the example of the \(8\times 8\) multiplier using the Booth recoding from [38].
After the simplification of ‘1’s, the full multiplier is implemented with multiply-add units as shown in Figure 4.
Fig. 4. Complete \(8\times 8\) multiplier with multiply-addition of partial products.
For the example illustrated in Figure 4, the multiplier has four multiply-add units. All units multiply two bits of the coded multiplier and add the result to the previous partial product, except the first unit at the top, which instead adds the binary constant “100000000”, where the ‘1’ appears after sign extension, as explained above.
We have modified the multiplier in Figure 4 to fuse the addition of an input operand with the product of two other operands, that is, to perform the operation \(W \times Y + X\). This circuit allows us, therefore, to implement a fused multiply-adder (MADD) (see Figure 5).
Fig. 5. Proposed fused multiply-add for an \(8\times 8\) multiply-adder.
It can be observed from Figure 5(a) that the addition of the input with the first partial product requires a four-input adder in the middle bit (\(1 + x_8 + \overline{p_{0,8} + cin}\)). To avoid this adder, the sign bit of the first partial product has been extended two positions to allow the simplification of the initial isolated ‘1’ with another ‘1’ from the partial product. The following partial products must follow the same 2-bit sign extension in order to accommodate the input bits. Therefore, after simplification of all ‘1’s from sign extension, the final configuration requires two more LUTs for each partial product compared to a single multiplier. The additions with ‘1’ at the end of each partial product force the propagate signal of the addition to ‘1’, and the generate signal is never used. In this case, the Xilinx synthesis tool optimizes the entire LUT away and implements the addition only with the carry chain of the FPGA (CARRY4 macro).
The fused multiply-adder (MADD) can be further simplified when used to implement the dot product. This consists of a series of MADD units ending with an accumulator (see Figure 6).
Fig. 6. Example of the proposed dot product unit between two vectors with eight 8-bit values using the proposed fused multiply-adder.
The MADDs increase in size as we move through the chain, but the number of LUTs remains the same, since the extra additions are between a carry in and a single value, which are also implemented with the carry chain only. To obtain this simplification, the CARRY4 primitive is instantiated directly in the VHDL description. The last MADD can be replaced with a multiplier when a ternary adder is used to implement the accumulation. The first MADD of the chain adds the result of the multiplication with zero. The zero input of the first multiplier can be used to input the constant that results from the sign extension of all partial products. As can be seen in Figure 4, this constant equals “1010101100000000” for each MADD. Knowing the number of MADD units used in the dot-product chain, \(K\), the constant for the full dot product equals \(1010101100000000 \times K\). This is the constant to be added in the first fused multiply-adder. Consequently, the remaining fused multiply-adders do not require adding the constant associated with sign extension.
Removing the constant value allows an extra simplification of the fused multiply-adder. After removing the ‘1’s from the partial products shown in Figure 5(a), the input bits from \(x_9\) to \(x_{15}\) can be concatenated with the four partial products as illustrated in Figure 7.
Fig. 7. Proposed fused multiply-add for an \(8\times 8\) multiply-adder without the constant “10101011”.
Compared to the architecture illustrated in Figure 5(b), the simplified architecture has two fewer full-adders (2 LUTs) in the first partial product and one fewer full-adder (1 LUT) in each of the other partial products. In total, it only requires three more LUTs compared to the multiplier only.
A final optimization can be achieved when the number of elements of the complete vectors of the dot-product, \(K\), is known (e.g., in a CNN). In this case, the constant given by \(1010101100000000 \times K\) can be used to initialize the accumulator. This method reduces the first MADD to a single multiplier and reduces the carry path, since the large constant is not introduced in the processing path of the chain of MADDs.
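The effect of pre-loading the accumulator can be captured with a behavioural model. As a sketch, we assume each simplified MADD produces \(W \times Y + prev - C\) modulo the accumulator width, where \(C\) is the per-unit sign-extension constant of Figure 4; pre-loading \(K \times C\) then cancels the missing constants. The function names and the 24-bit accumulator width are our assumptions:

```python
MASK = (1 << 24) - 1           # assume a 24-bit accumulator
C = 0b1010101100000000         # per-MADD sign-extension constant (Fig. 4)

def simplified_madd(w, y, prev):
    # behavioural model of a MADD with the sign-extension constant
    # stripped out: each unit is short of C
    return (w * y + prev - C) & MASK

def dot_product_preloaded(W, Y):
    """Chain of K simplified MADDs with the accumulator initialised
    to K*C, so that the K missing constants cancel exactly."""
    acc = (len(W) * C) & MASK
    for w, y in zip(W, Y):
        acc = simplified_madd(w, y, acc)
    return acc
```

The cancellation holds for any constant \(C\) and any vector length, since the pre-load adds exactly what the chain subtracts.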
3.1.2 Pipelined Dot-Product with 8×8 MADD Units.
Regarding the delay of the circuit, when a dot-product unit with multiple MADDs is used to calculate multiple partial dot products, according to Equation (2), pipelining is important to increase the throughput of the dot-product operation. The architecture with multiple parallel multipliers followed by an adder tree can be easily pipelined since the multiplications run in parallel. In the dot-product circuit implemented with the proposed MADDs, the multiplications occur in sequence and, therefore, to pipeline the circuit, each pair of inputs must be delayed according to its position in the chain of MADDs (see Figure 8). Delaying the inputs in the proposed solution requires extra flip-flops (FFs), since pairs of inputs are consumed at different stages of the pipeline.
Fig. 8. Pipelined dot-product circuit implemented with MADDs and with registers at the output of the MADDs.
In the adder tree configuration, all inputs are consumed in the first pipeline stage. All registered outputs in the several pipeline stages come from LUTs. Therefore, pipelining does not add extra LUTs compared to a solution without pipeline, since each logic block has one LUT and a pair of FF.
In the case of the proposed solution, as can be seen from Figure 8, pairs of inputs are consumed at different stages of the pipeline. In this case, the architecture requires mapping a sequence of flip-flops without LUTs in between. Compared to a solution without pipeline, extra logic blocks are needed to implement these isolated FFs.
To avoid this cost, the proposed dot-product unit is more appropriate for systolic data processing architectures, where the inputs arrive in sequence. In our case, as will be explained in the description of the CNN accelerator, the cost of these registers will be considerably reduced, making the pipelining overhead negligible by sharing it among several dot products. The circuit can be further pipelined by pipelining the MADD blocks, down to a single level of LUTs. Internally, there is no cost associated, since the pipeline registers are implemented with the flip-flops at the output of the LUTs used to implement the partial products.
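The input-delay requirement of Figure 8 can be illustrated with a cycle-level model of the MADD chain, in which pair \(j\) is delayed \(j\) cycles so that all products of the same input vector meet in the chain (an illustrative Python sketch, not a hardware description; names are ours):

```python
from collections import deque

def pipelined_madd_chain(K, vectors_w, vectors_y):
    """Cycle-level model of the pipelined MADD chain: one vector of
    K pairs is fed per cycle; the dot product of the vector fed at
    cycle t appears at cycle t + K."""
    T = len(vectors_w)
    # delay line for pair j holds j cycles of history
    delays = [deque([(0, 0)] * j) for j in range(K)]
    stage = [0] * K                      # register after each MADD
    results = []
    for t in range(T + K):               # run long enough to drain
        w_in = vectors_w[t] if t < T else [0] * K
        y_in = vectors_y[t] if t < T else [0] * K
        pairs = []
        for j in range(K):               # read delayed inputs
            delays[j].append((w_in[j], y_in[j]))
            pairs.append(delays[j].popleft())
        new_stage = [0] * K              # advance the chain one cycle
        for j in range(K):
            prev = stage[j - 1] if j > 0 else 0
            w, y = pairs[j]
            new_stage[j] = prev + w * y
        results.append(stage[K - 1])     # chain output this cycle
        stage = new_stage
    return results
```

The model shows both the throughput of one dot product per cycle and the extra storage (the delay lines) that motivates using the unit in a systolic arrangement, where that storage is shared.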
3.1.3 Fused 8× 8 Dot-Product with LUTs and DSPs.
DSPs can also be utilized to implement dot-products. A single DSP48E1 can implement two multiply-adds of \(8\times 8\) multiplications [34], where the term to be added cannot have more than 16 bits. This means that we cannot directly implement a structure similar to the proposed dot-product unit with multiple MADDs, in which the size of the term to be added increases along the chain.
A simple approach to using both DSPs and LUTs to implement multiple dot-product units is to implement some of the units with LUTs only and the others with DSPs followed by an adder tree. However, given the regular structure of the FPGA fabric, with LUTs and DSPs spread evenly throughout, it is more efficient to design the dot-products with both LUTs and DSPs. To take advantage of the dual multiply-add with a single DSP, the DSPs must be placed at the beginning of the dot-product. Different solutions exist, depending on the ratio between the number of DSPs and LUTs (see Figure 9).
Fig. 9. Different implementations of the dot-product circuit with both LUTs and DSPs. (a) Implementation with half of a DSP (a single DSP implements two multiplications of two different dot-products); (b) two halves of two DSPs; (c) four halves of four DSPs.
The figure illustrates three configurations with different numbers of DSPs. The first configuration (Figure 9(a)) uses half of a DSP (each DSP implements two multiply-adds of two different dot products) to implement one multiplication and one addition with the result of the first multiplier, which is implemented with LUTs. The second configuration (Figure 9(b)) utilizes two halves of two DSPs. Compared to the first configuration, it implements the first multiplier with a DSP instead of LUTs. Finally, the third configuration (Figure 9(c)) implements the first four multiplications with four halves of four DSPs and fuses the results with an adder before entering the chain of MADDs.
All VHDL descriptions with DSPs directly instantiate the DSP48E2 primitive so that the designer has full control over the final synthesis result.
3.1.4 Designing M× N Dot-Products.
The fused multiply-adder can be adapted to any operand size. To illustrate the design of dot products where operands have sizes different from 8, let’s consider the common cases \(8\times 4\) and \(4 \times 4\).
In both configurations the fused multiply-adder has only two partial products (see Figure 10).
Fig. 10. Fused multiply-adder for configuration (a) \(8\times 4\) ; (b) \(4\times 4\) .
The constants relative to the sign extensions to be added are multiples of “101100000000” for the \(8\times 4\) case and “10110000” for the \(4\times 4\) case.
To include DSPs in the design of the dot-products for these operand sizes, a similar approach can be followed. The number of multiply-adders that can be implemented in a single DSP depends on the operand sizes. Considering operands with sizes between 4 and 8, a DSP can only implement two multiply-adders [37], except for the \(5\times 4\) case, where it is possible to implement three multiply-adders with a common operand. With operands of size \(5\times 5\) and \(4\times 6\) it is possible to implement three multiplications with a common operand.
In this paper, we are particularly concerned with the implementation of multiply-additions of size \(4\times 4\). In [41], the authors implement four \(4\times 4\) unsigned MACs in a DSP48E1. That work uses a custom floating-point format where operands are represented in sign-magnitude format, so the operands are unsigned. The work in [43] implements four \(4\times 4\) multiply-adders in a DSP48E2 where one of the operands is signed and the other is unsigned. This was utilized to design a hardware accelerator of CNNs and assumes a ReLU-like activation function that generates unsigned activations. However, some activation functions (e.g., hard-swish) can produce negative values.
In this work, we propose the implementation of four signed \(4\times 4\) multiply-adders with a single DSP48E1 and an adder (both DSP48E1 and DSP48E2 versions are supported). Formally, the solution implements four multiply-additions between 4-bit signed operands, W, X, Y, and Z, namely: (6) \(\begin{align} &M0 = W\times X + C0 \end{align}\) (7) \(\begin{align} &M1 = W\times Y + C1 \end{align}\) (8) \(\begin{align} &M2 = Z\times X + C2 \end{align}\) (9) \(\begin{align} &M3 = Z\times Y + C3 \end{align}\) where C0, C1, C2 and C3 are 8-bit operands.
The solution exploits the arithmetic modules available in a DSP48E1. The DSP48E1 slice has one signed multiplier (\(25\times 18\)) and one 48-bit adder/accumulator (see the outline of the DSP48E1 architecture in Figure 11).
Fig. 11. Architecture of the DSP48E1 slice.
The DSP slice also contains one pre-adder connected to the 25-bit input of the multiplier. This pre-adder followed by the multiplier is used to implement \((A + D) \times B\). The 48-bit adder of the DSP48E1 allows adding the output of the multiplier to the 48-bit input C or to the registered output of the adder, to implement an accumulator.
To implement four 4-bit multiplications with a single DSP slice, we pack the multiplicands, X and Y, into one operand and the multipliers, Z and W, into the other, with enough separation that the four products do not overlap, and then perform a single multiplication (see Figure 12).
Fig. 12. Packing four 4-bit multiplications, \(X \times W\) , \(Y \times W\) , \(X \times Z\) , \(Y \times Z\) in a single DSP slice.
The operands X and Y are separated by 16 zero bits (X is placed at bit 20) and operands Z and W are separated by 6 zero bits (Z is placed at bit 10). This results in four 8-bit products separated by two bits. These two bits allow four accumulations and will be used to add the results from multipliers implemented with LUTs. The result of the operation illustrated in Figure 12 is given by Equation (10). (10) \(\begin{align} P &= (X\cdot 2^{20} + Y) \times (Z\cdot 2^{10} + W) \nonumber\\ &= \underbrace{X \times Z}_{P_d}\cdot 2^{30} + \underbrace{X \times W}_{P_c}\cdot 2^{20} + \underbrace{Y \times Z}_{P_b}\cdot 2^{10} + \underbrace{Y \times W}_{P_a} \end{align}\)
The four products must now be extracted from P. The least significant product, \(P_a\), is obtained directly from bits 7 down to 0 of P. The other products may need an adjustment due to the accumulated sign extensions. The method is to add a carry-in if the previous product was negative (explained below).
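The packing and the extraction adjustment can be checked with a short Python model. The function names are ours, and the hardware performs the adjustment with carry-ins rather than the explicit subtraction used here: Y and W sit in the low bits, while X and Z are shifted 20 and 10 bits up, respectively, so the four 8-bit products land 10 bits apart.

```python
def pack_mul(x, y, z, w):
    """Single multiplication computing all four products at once:
    P = (X*2^20 + Y) * (Z*2^10 + W), with 4-bit signed X, Y, Z, W."""
    return (x * 2**20 + y) * (z * 2**10 + w)

def extract_products(p, fields=4, spacing=10):
    """Recover the packed products. Each 10-bit field is sign-extended;
    subtracting the recovered value before shifting mimics the carry
    adjustment needed when a lower product is negative."""
    out = []
    for _ in range(fields):
        field = p & ((1 << spacing) - 1)
        value = (field ^ (1 << (spacing - 1))) - (1 << (spacing - 1))
        out.append(value)
        p = (p - value) >> spacing
    return out  # [Y*W, Y*Z, X*W, X*Z]

print(extract_products(pack_mul(-3, 5, -7, 2)))  # [10, -35, -6, 21]
```

Since every product of two 4-bit signed operands fits comfortably in a 10-bit signed field, the recovery is exact for all input combinations.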
Since both packed operands are signed, we need one pre-addition for the multiplier and another for the multiplicand. One addition is implemented with the DSP pre-adder and the other must be implemented with LUTs. Terms C0, C1, C2, and C3 are added to the products using the 48-bit adder of the DSP.
From the output of the DSP we need to extract the four independent multiply-additions M0, M1, M2, and M3. The least significant product obtained in the DSP, \(P_a\), is always correct, but the following products may have to be adjusted depending on the sign of the previous products. If a product is negative, a carry-in must be added to the next product; if positive, the result is already correct and must not be changed. Considering a product \(P_x\) and a constant Cx, four different combinations of their signs are possible (see Table 1).
As we can see from the table, if both are positive, the next \(P_x\) is correct without any carry-in. If both are negative, the carry-in is automatically generated by the addition of the two operands. When they have different signs, the result of the addition may be negative or positive, and so the next \(P_x\) may be right or wrong, since a carry-in may be generated when not required or not generated when required. To solve this, we manipulate the term Cx to control the generation of the carry-in, as follows:
DSP product is positive: if the constant Cx to be added is positive, we keep its sign bit; otherwise, we flip it. This guarantees that there is no carry propagation;
DSP product is negative: if the LUT product is negative, we keep its sign bit; otherwise, we flip it. This guarantees that there is carry propagation.
These rules are applied to each constant Cx.
Since we changed the most significant bits of Cx, these bits are recovered before being sent to the next arithmetic unit (see the complete circuit in Figure 13).
Fig. 13. Architecture of the proposed circuit for four parallel dot-product accumulations using a single DSP and some LUTs.
While in the dual multiply-add implementation of operands with size \(8\times 8\) there is only one guard bit between products, in this case there are two guard bits, allowing Cx to have up to 9 bits. This increases the set of possible dot product designs with both LUTs and DSPs (see different implementations of the dot-product between vectors with 16 elements using DSPs and LUTs in Figure 14).
Fig. 14. Different implementations of the dot-product circuit with both LUTs and DSPs. (a) Implementation with two \(4\times 4\) multiplications with two DSPs (each DSP implements four \(4\times 4\) multiplications of four different dot products); (b) four \(4\times 4\) multiplications with four DSPs; (c) eight \(4\times 4\) multiplications with eight DSPs.
The first configuration is a multiplier followed by a chain of fifteen multiply-adders and an accumulator. Two of the multiply-adders are implemented with DSPs. The second implementation uses four DSP-based multiply-adders in parallel, which are merged with a simple adder. The last configuration implements eight MADDs with DSPs in parallel and uses a 4:2 compressor to add their outputs. Any of these circuits can be easily pipelined to improve the throughput.
3.2 Dot-Product Arithmetic Units for Ternary Quantization
In ternary quantization the operands (weights and/or activations) have ternary values {\(-\)1, 0, 1}. In this section, we propose architectures for dot-product calculation with multiplications of size \(Q \times T\), where \(T\) is in {\(-\)1, 0, 1} (designated as ternary multiplications throughout the section). Architectures for \(Q\) in {8, 4, 2} were designed; however, the design solutions also apply to other quantizations of \(Q\).
The result of the multiplication of a value, \(X\), by a ternary value, \(Y\), is \(X\), 0 or \(-X\). If we consider a multiply-addition of a ternary multiplication with an input \(C\), the result, \(MA\), is as follows: (11) \(\begin{align} MA & = C + X \quad \text{if } Y = 1 \end{align}\) (12) \(\begin{align} & = C + 0 \quad \text{if } Y = 0 \end{align}\) (13) \(\begin{align} & = C - X \quad \text{if } Y = -1 \end{align}\)
Therefore, the multiply-addition with ternary multiplications is implemented with a conditional adder/subtractor (see Figure 15).
Fig. 15. Conditional adder/subtractor to implement a fused ternary multiply-addition.
The circuit occupies the same number of LUTs as an adder. The dot product can then be implemented with the same chain structure used for the non-ternary dot products of the previous section. In this case, multiple fused ternary multiply-adders are chained and followed by the accumulator.
An alternative approach is to fuse two multiplications in a single adder module, as proposed in [23]. In this approach, two weights, W0 and W1, are multiplied by two activations, A0 and A1, and added with a single level of LUTs. Considering all possible values of the weights, the following operations must be executed: \(\begin{equation*} MA = {\left\lbrace \begin{array}{ll} +A1 + A0, & \text{if } \lbrace W1,W0\rbrace = \lbrace 1,1\rbrace \\ +A1 + 0, & \text{if } \lbrace W1,W0\rbrace = \lbrace 1,0\rbrace \\ +A1 - A0, & \text{if } \lbrace W1,W0\rbrace = \lbrace 1,-1\rbrace \\ +0 + A0, & \text{if } \lbrace W1,W0\rbrace = \lbrace 0,1\rbrace \\ +0 + 0, & \text{if } \lbrace W1,W0\rbrace = \lbrace 0,0\rbrace \\ +0 - A0, & \text{if } \lbrace W1,W0\rbrace = \lbrace 0,-1\rbrace \\ -A1 + A0, & \text{if } \lbrace W1,W0\rbrace = \lbrace -1,1\rbrace \\ -A1 + 0, & \text{if } \lbrace W1,W0\rbrace = \lbrace -1,0\rbrace \\ -A1 - A0, & \text{if } \lbrace W1,W0\rbrace = \lbrace -1,-1\rbrace \end{array}\right.} \end{equation*}\)
All cases can be handled by a simple adder/subtractor, except the last one, \(-A1-A0\). This case is treated as \(A1 + A0\), and a sign flag is sent to the first adder of the adder tree indicating that the number is negative. To implement the circuit, the four bits corresponding to the weights W0 and W1 are converted into three bits (defining the operation) plus one bit (defining the sign of the result). Thus, each LUT of the conditional adder/subtractor receives five inputs and the operator can be designed with a single level of LUTs. The adders of the adder tree must conditionally add or subtract the inputs according to the sign flag sent by the previous adder. Whenever an adder receives two negative values, it adds the absolute values and signals to the next adder that the result is negative. The final correction is made at the accumulator.
This solution is more expensive than the fused ternary multiply-addition solution. The difference in the number of LUTs results from the additional logic necessary to recode the weights and to generate the sign flags for the next level. However, this logic can be avoided, as described next.
Ternary weights are fixed and statically determined. Hence, they can be recoded offline to avoid runtime recoding and, consequently, the associated logic. Given a pair of weights, W0 and W1, we recode the 4-bit information (2 bits from each weight) as a code that indicates the operation to be executed over the activations plus a bit that specifies whether the result is positive or negative (see Table 2).
| W1 | W0 | Operation | Operation Executed | Code | Sign |
|---|---|---|---|---|---|
| 00 | 00 | 0 + 0 | 0 + 0 | 000 | 0 |
| 00 | 01 | 0 + A0 | 0 + A0 | 001 | 0 |
| 00 | 11 | 0 - A0 | 0 - A0 | 100 | 0 |
| 01 | 00 | A1 + 0 | A1 + 0 | 010 | 0 |
| 01 | 01 | A1 + A0 | A1 + A0 | 011 | 0 |
| 01 | 11 | A1 - A0 | A1 - A0 | 111 | 0 |
| 11 | 00 | -A1 + 0 | 0 - A1 | 110 | 0 |
| 11 | 01 | -A1 + A0 | -A1 + A0 | 101 | 0 |
| 11 | 11 | -A1 - A0 | A1 + A0 | 011 | 1 |
Table 2. Recode of Weights to Optimize the Dot Product Calculation with Two Fused Multipliers
The most significant bit of the code is also used as the carry-in of the fused multipliers. This recoding avoids the 2-LUT logic for each pair of inputs in the first stage of the dot product.
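The static recoding of Table 2 can be checked functionally with a small Python model (the dictionary layout and function name are ours; the actual recoding is done offline on the weight stream). Weights are ternary values, and a set sign bit means the downstream tree must negate the result:

```python
# Static recoding of a weight pair (Table 2): each ternary pair
# (W1, W0) maps to a 3-bit operation code and a sign bit.
RECODE = {
    ( 0,  0): (0b000, 0), ( 0,  1): (0b001, 0), ( 0, -1): (0b100, 0),
    ( 1,  0): (0b010, 0), ( 1,  1): (0b011, 0), ( 1, -1): (0b111, 0),
    (-1,  0): (0b110, 0), (-1,  1): (0b101, 0), (-1, -1): (0b011, 1),
}

def fused_pair(w1, w0, a1, a0):
    """Evaluate W1*A1 + W0*A0 through the recoded operation. Negating a
    sign-flagged result here stands in for the correction performed
    downstream in the adder tree."""
    code, sign = RECODE[(w1, w0)]
    op = {0b000: 0, 0b001: a0, 0b100: -a0, 0b010: a1,
          0b011: a1 + a0, 0b111: a1 - a0, 0b110: -a1, 0b101: a0 - a1}[code]
    return -op if sign else op

print(fused_pair(-1, -1, 3, 5))  # -8
```

Note how the only case needing the sign bit is \((-1,-1)\), which reuses the code of \(A1+A0\), exactly as in the last row of Table 2.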
The LUT used to generate the sign of the result in each intermediate adder of the adder tree can also be avoided by merging its logic with the first full adder and recoding the operations (see Table 3).
| sign1 | sign0 | Operation | Operation Executed | Carry-in | Sign out |
|---|---|---|---|---|---|
| 0 | 0 | A1 + A0 | A1 + A0 | 0 | 0 |
| 0 | 1 | A1 - A0 | A1 - A0 | 1 | 0 |
| 1 | 0 | -A1 + A0 | A1 - A0 | 1 | 1 |
| 1 | 1 | -A1 - A0 | A1 + A0 | 0 | 1 |
Table 3. Recode of Operations to Optimize the Adder Tree for the Fused Multipliers
A1 and A0 are the values to be added; sign1 and sign0 are their signs, respectively.
The operation \(-A1 + A0\) is implemented as \((A1 - A0)\) with a sign flag passed to the next adder. This allows the sign sent to the next level of the adder tree to be equal to the input sign, sign1, of operand A1. So, only the addition, \(MA0\), and the carry-out, \(Cout\), of the first full adder need to be generated, as follows: (14) \(\begin{align} MA0 & = A1 \oplus A0 \end{align}\) (15) \(\begin{align} Cout & = A1 \cdot A0 + \overline{A0} \cdot (S1 \oplus S0) \end{align}\)
Both are bit-level equations with at most four inputs, so each can be generated with a single LUT.
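Equations (14) and (15) can be verified exhaustively against the recoded operations of Table 3. The check below is an illustrative verification model, not part of the design: same signs mean \(A1 + A0\) with carry-in 0, different signs mean \(A1 - A0 = A1 + \overline{A0} + 1\) with carry-in 1.

```python
def check_first_adder():
    """Exhaustively compare Equations (14) and (15) with a conventional
    full adder executing the recoded operations of Table 3."""
    for a1 in (0, 1):
        for a0 in (0, 1):
            for s1 in (0, 1):
                for s0 in (0, 1):
                    cin = s1 ^ s0          # carry-in column of Table 3
                    b = a0 ^ cin           # A0 is inverted for a subtraction
                    total = a1 + b + cin   # reference full-adder behavior
                    ma0 = a1 ^ a0                              # Equation (14)
                    cout = (a1 & a0) | ((1 - a0) & (s1 ^ s0))  # Equation (15)
                    assert (ma0, cout) == (total & 1, total >> 1)
    return True

print(check_first_adder())  # True
```

The sum bit is independent of the signs because the carry-in cancels the inversion of A0, which is what allows both equations to fit in one LUT each.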
The dot-product unit generated with the two-fused-multiplier solution, without the logic overhead associated with the generation of the codes and signs, needs about the same number of resources as the dot-product unit generated with the fused multiply-adder. The main difference resides in the dataflow. The architecture based on the chain of fused multiply-adders is more appropriate for a systolic-like dataflow, while the second solution is based on the usual parallel execution of multipliers. In terms of pipelining, the former solution requires careful design to reduce the overhead associated with delay buffers, while the adder-tree-based solution takes advantage of the flip-flops already present in the adder units. The problems associated with the buffers can be reduced by breaking the chain of units into two or more chains operating in parallel. Finally, in terms of routing, the fused multiply-adder is more regular and potentially easier to route than the adder-tree-based solution.
The implementation of ternary multipliers with DSPs is less efficient, since the number of multipliers that can be implemented with a single DSP does not increase proportionally as the size of the non-ternary operand is reduced. Typically, DSPs are instead used to implement the accumulators. Considering the solution with the two fused multipliers, the propagation of the signs of the operands would require a final adjustment before entering the accumulator of the DSP; therefore, using a DSP in this solution would require extra logic. On the other hand, the solution with the chain of multiply-adders can use the DSPs to perform the final additions in SIMD mode (multiple accumulations in a single DSP). For example, a dot product of ternary representations (both operands are ternary) could use a single DSP to implement four independent ternary accumulations. Figure 16 illustrates an implementation of the dot product with fused ternary multiply-adders and a DSP to implement the accumulator.
Fig. 16. Implementation of the dot product operation with fused ternary multiply-adders and a DSP to implement the accumulator.
The examples consider two and four chains of multiply-adders in parallel and a final DSP-based addition to perform the accumulation. The DSP can be shared by other dot-product units, depending on the size of the accumulator (the SIMD capability of the DSPs allows two independent 24-bit adders or four independent 12-bit adders).
4 CONVOLUTIONAL NEURAL NETWORK ACCELERATOR WITH THE PROPOSED DOT-PRODUCTS
In this section, we describe in detail the architecture of the convolutional neural network accelerator that utilizes the proposed dot-product units.
4.1 Overview of the Architecture
Figure 17 presents the architecture of the hardware/software system with the CNN accelerator for the execution of the inference of CNN models.
Fig. 17. Block diagram of the hardware/software system with the CNN accelerator.
The architecture includes a DMA (Direct Memory Access) connected to the external memory controller to read and write data from the external DDR memory. Allowing direct access to external memory frees the processor to execute other tasks associated with the inference execution, namely the configuration of the DMA and the runtime configuration of the CNN accelerator.
The CNN accelerator has one dedicated block for the first layer, one for the hidden layers, and another for the last layer. This separation allows the design of a dedicated module for the first layer, which usually has a different quantization and only three input channels, and a dedicated block for the dense layer, which is unable to exploit inter-layer parallelism. The designs of the modules are similar, but the first is tailored to deal with only three channels and a particular quantization, and the last is tailored to calculate a single dot-product. The processing blocks of the CNN accelerator have four main types of units: IFM (Input Feature Map) units, OFM (Output Feature Map) units, weight units, and core units. The IFM unit manages the access to the input feature maps. The input feature map received from the external memory is split across the several IFM units and then sent to the core units to be processed. The OFM unit manages the storage of the output feature maps. It receives the output data from the cores, implements pooling, applies the activation function (ReLU), merges data, and sends it to the external memory. The weight unit is responsible for reading weights and biases from external memory and sending them to the cores. Finally, the core units are responsible for the calculation of the convolutions. This is where the dot-products take place in parallel.
The cores are organized in a two-dimensional matrix. The set of cores in the same row receive the same set of activations but different weight kernels. Each of these cores calculates activations of different output feature maps (inter-parallelism). The set of cores in the same column receive the same weight kernel but different activations of an input feature map. Therefore, each of these cores contributes to the calculations of activations of the same output feature map (intra-parallelism). The results from each core are merged and sent to the OFM units.
The execution of the inference of a CNN model in the architecture is divided into three steps: memory read, compute, and memory write. In the memory read step, the input feature map (complete or a tile) is transferred from the external main memory to the on-chip buffers of the IFM units, and the kernels of weights are also transferred from external memory to the weight units. The reading process is controlled by address generator units (AGUs) inside the units, to be detailed later. In the compute step, the IFM units and the weight units read data from the on-chip buffers and send it to the core units to be processed, that is, to calculate the dot-products. Internal AGUs of the IFM and weight units control the reads from the on-chip memories. The memory write step reads data from the OFM buffers and transfers it to main memory. All three steps work in a pipelined dataflow.
The proposed architecture could also be used to design a CNN accelerator with dot products implemented with adder trees. The main difference resides in the design of the pipeline. The dot product based on the proposed fused multiply-adder works in a systolic fashion; therefore, it requires delays at the outputs of the weight and activation memories. The delays in the weight memories could be achieved by rearranging the weights in memory, but this was not considered in this work. The solution with adder trees does not require these delays. This means that the proposed dot-product units could be used with other architectures that consider intra-kernel parallelism. Another difference between the two approaches is the weight recoding necessary in the design with the proposed dot-product for ternary quantizations.
4.2 IFM Unit
The dual-port on-chip memories of the IFM units are controlled by AGUs (see Figure 18).
Fig. 18. Block diagram of the IFM unit.
The write and read addresses of each memory are controlled by input AGUs and output AGUs, respectively. The output AGU is shared by all units, since data is read in parallel. The AGUs are configured before starting the execution of a layer and remain constant until the next layer. The input data comes from the DMAs and the output data is sent to the core units. The data read from the buffer goes through a delay buffer, according to the pipeline configuration of the dot-product units, for correct execution, as explained in Section 3.1.2.
The address generator units generate addresses with a nested loop pattern (see Algorithm 1).
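The nested-loop address pattern can be sketched in a few lines of Python (the parameter names and the `(count, stride)` configuration are illustrative, not the actual AGU register layout):

```python
import itertools

def agu_addresses(base, loops):
    """Generate addresses with a nested-loop pattern. `loops` is a list
    of (count, stride) pairs, outermost loop first; each visited index
    combination contributes its index times the loop stride."""
    ranges = [range(count) for count, _ in loops]
    for indices in itertools.product(*ranges):
        yield base + sum(i * stride
                         for i, (_, stride) in zip(indices, loops))

# Two nested loops: the outer runs twice with stride 9, the inner
# three times with stride 3.
print(list(agu_addresses(0, [(2, 9), (3, 3)])))  # [0, 3, 6, 9, 12, 15]
```

Chaining several such simple generators, each contributing its own loop levels, yields the deeper loop nests needed for convolution, as described next.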

The AGU has several configuration parameters that determine the sequence of addresses to be generated. The
The AGU to generate the output addresses needs five loops to execute the convolution, therefore, it is implemented with a chain of three simple AGUs. Each chained address generator unit adds a set of
The accelerator executes the convolutional layer as a 3D convolution with data stored in memory in a ZXY format instead of the original XYZ format. Figure 19 illustrates the memory storage format for a \(4\times 4\times 3\) input feature map that allows the sequential reading of data for the 3D convolution.
Fig. 19. Data storage format for the 3D convolution.
The XYZ format stores the values by column and row of each feature map. In ZXY format, the data is stored by channel first. Figure 19 also highlights the values used to compute the first 3D convolution. The XYZ format distributes the input values in nine groups of three contiguous values, which requires three loops to read. The ZXY format stores the input values in three groups of nine contiguous values each, which requires only two loops to access all values. The ZXY format also allows reading multiple contiguous data from the IFM buffer (up to the value of Z) in a single cycle.
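The difference in access pattern can be illustrated with a short Python model. The address expressions below are one consistent reading of the layout in Figure 19 (channel-interleaved for ZXY, channel-major for XYZ) and are assumptions for illustration, not formulas from the paper:

```python
def xyz_addr(x, y, z, W, C, H):
    # Channel-major (XYZ): one full feature map per channel.
    return z * H * W + x * W + y

def zxy_addr(x, y, z, W, C, H):
    # Channel-interleaved (ZXY): all channels of a pixel are contiguous.
    return x * W * C + y * C + z

def contiguous_runs(addrs):
    """Count maximal runs of consecutive addresses."""
    addrs = sorted(addrs)
    return 1 + sum(1 for a, b in zip(addrs, addrs[1:]) if b != a + 1)

# First 3x3x3 convolution window of a 4x4x3 input feature map.
H = W = 4
C = 3
window = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)]
xyz = [xyz_addr(x, y, z, W, C, H) for x, y, z in window]
zxy = [zxy_addr(x, y, z, W, C, H) for x, y, z in window]
print(contiguous_runs(xyz), contiguous_runs(zxy))  # 9 3
```

The model reproduces the counts in the text: nine groups of three contiguous values for XYZ versus three groups of nine for ZXY.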
The limited on-chip memory of the IFM units forces the tiling of the input feature maps. The tiling implemented in the accelerator divides the input feature maps into blocks of rows. Since there are
The internal AGU that generates the reading address for the IFM buffers is configured to generate a five-loop pattern. The two inner-most loops iterate over the values to perform a single 3D convolution.
The third and fourth loops generate the pattern for the maxpool. The final, fifth loop generates the pattern to move across tiles. To run models that do not use shortcut connections, this second set of memories should be removed.
4.3 Weight Unit
The weight unit is similar to the IFM unit. The on-chip buffers are also dual-port to allow reading weights while the next kernel is being read from external memory in a ping-pong fashion. It also includes AGUs to generate the write and read addresses and delay buffers (see Figure 20).
Fig. 20. Block diagram of the weight unit.
In this case, both the input and output AGUs are simple AGUs, since the weights are read in sequence. The unit also stores the biases of the kernels. The bias is added beforehand to the dot-product constant when the fused multiply-adder dot-product is used.
The buffer of the weight unit is read in sequence for the whole 3D convolution. After each 3D convolution, the internal AGU points back to the initial address. The starting addresses for reading the weight and bias memories are updated according to the set of kernels being used for the convolution.
4.4 OFM Unit
The output feature map unit receives the outputs from the cores and conditionally implements pooling, executes the activation function, reduces the result to the format of the quantization, and conditionally executes the shortcut addition (fused processing). Then, it sends the results back to external memory. A second set of memories is used to store the outputs of a previous layer to implement shortcut connections (see Figure 21).
Fig. 21. Block diagram of the OFM unit.
The activation block receives a sequence of vectors with the outputs from a group of cores, so multiple values are processed in parallel. The module scales each value according to the scale factor (the architecture considers power-of-two scale factors, so the multiplication by the scale factor is just a shift), applies the activation function followed by pooling, and then conditionally adds the shortcut layer. When pooling is active, the value goes through a pooling circuit that compares it with the previous activation. After running over the full pooling window, the final activation is sent to the external memory. This efficiently merges the pooling and convolutional layers.
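The per-value datapath of the activation block can be modeled as follows (function names and the right-shift scaling are illustrative assumptions; the hardware processes a vector of such values per cycle):

```python
def ofm_activation(raw, shift, use_relu=True):
    """Scale a core output by a power-of-two factor (just a right
    shift) and then apply ReLU."""
    scaled = raw >> shift
    return max(scaled, 0) if use_relu else scaled

def maxpool2x2(values):
    """Running 2x2 max pooling: keep comparing each new activation
    with the best one seen so far in the window."""
    best = values[0]
    for v in values[1:]:
        best = max(best, v)
    return best

# Four raw core outputs of one pooling window, scale factor 2^-4.
window = [ofm_activation(v, 4) for v in (100, -50, 37, 80)]
print(maxpool2x2(window))  # 6
```

Because pooling only keeps a running maximum, it is fused with the convolution output stream instead of being executed as a separate layer.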
4.5 Core Unit
Each computing core receives as inputs the feature map values, biases, and weights from the on-chip memories. The core outputs the computation result to an OFM memory. The parallelism of the core is statically configurable, that is, the number of fused multiply-add units or multipliers is configurable. This parallelism requires multiple words from the IFM and weight units.
The accumulators are designed with enough bits to avoid saturation. This size is determined from the maximum number of accumulations among all filters. The size of the accumulators can be optimized considering the maximum value achieved during the inference of the model. Usually, this results in smaller accumulators. However, the impact over the utilization of hardware resources is relatively low. Therefore, we can design the accumulators to avoid saturation assuming the worst case.
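The worst-case sizing rule can be written down explicitly (the function name is ours, and the rule assumes signed \(Q\times R\) products and no saturation ever being tolerated):

```python
import math

def accumulator_bits(q, r, max_accumulations):
    """Worst-case accumulator width for a dot product of Q x R signed
    multiplications: product width plus log2 of the maximum number of
    accumulations (illustrative sizing rule)."""
    return q + r + math.ceil(math.log2(max_accumulations))

# A 3x3x512 kernel with 8x8 operands needs ceil(log2(4608)) = 13
# guard bits on top of the 16-bit products.
print(accumulator_bits(8, 8, 3 * 3 * 512))  # 29
```

For 256 accumulations this rule gives \(Q + R + 8\) bits, which matches the accumulator size used for the dot-product units evaluated in Section 5.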
The numbers of core units per line and per column are also statically configurable before synthesis. These factors determine the total available parallelism of the architecture.
The outputs of the cores are separated into groups, and the outputs of each group are sent in parallel to the OFM units. Outputs of different groups are serialized. For example, 64 cores in a line can be arranged into groups of eight and serialized in a dataflow pipeline. Serializing groups of values avoids having too large a number of lines running in parallel to the OFM units, while running lines in parallel inside a group avoids long stream sequences that could halt the processing cores waiting for data to be written back.
4.6 Configuration of the Accelerator
Table 4 enumerates the configurable synthesis parameters of the accelerator.
| Parameter | Description |
|---|---|
| Number of core units in a line | |
| Number of core units in a column | |
| Number of multiply-add units in a core unit | |
| Data width of activations | |
| Data width of weights | |
| Weight memory address size | |
| IFM memory address size | |
| OFM memory address size |
Table 4. Configuration Parameters of the CNN Accelerator
The number of weight units is defined by the
4.7 Software API
A C++ API that enables runtime configuration by a CPU was developed for the accelerator. It includes a main class
The main class contains three objects, each from a different class. Each class contains methods to configure specific cores in the architecture. The
5 RESULTS
In this section, we present the area and performance of the proposed dot-product units and of the CNN accelerator. Different configurations of the CNN accelerator were designed and tested for different models. ResNet50 was tested for quantizations \(8\times 8\), \(8\times 4\), \(4\times 4\), \(8\times 2\) and \(4\times 2\) running ImageNet. The base architecture is similar in all designs where the main difference is the core design and the parallelism of the dot-products inside the cores.
All designs were implemented on the KC705 evaluation board, which includes a Xilinx Kintex-7 XC7K325T-2 FPGA and 1GB of DDR3 memory. The core accelerator was integrated with a RISC-V-based SoC. The system uses the low-performance RISC-V soft processor to control the memory subsystem and the peripherals, including the CNN accelerator. The peripheral set includes a boot controller, internal memory to store the firmware, external memory to store the image and weights, and a timer to measure the application's time performance. The core accelerator operates at a clock frequency of 200MHz.
5.1 Results of the Dot-Product Cores
The dot-product units were implemented on the target FPGA for different quantizations and configurations. Hybrid solutions with both LUTs and DSPs were also implemented. The proposed units were compared with architectures without the proposed optimizations. These architectures are designated as \(X\_K\_Q\times R\_D\), where K is the number of multiply-additions, \(Q\times R\) is the size of the multiplication, \(D\) is the number of DSPs, and \(X\) is the type of architecture, as follows:
(1) \(A1\) - Dot product with parallel multipliers followed by an adder tree. The design is synthesized directly from a simple VHDL specification without optimizations (based on the design template illustrated in Figure 1);
(2) \(A2\) - Dot product with parallel optimized multipliers followed by an adder tree. The adder tree is synthesized directly from a simple VHDL specification without optimization (based on the design template illustrated in Figure 1, but with optimized multipliers);
(3) \(A3\) - Dot product with parallel optimized multipliers followed by an adder tree with 4:2 compressors (based on the design template in Figure 2);
(4) \(P\) - Dot product with the proposed dot-product units. Three designs are considered: one without DSPs (based on the design template in Figure 6), one with a single DSP (based on the design template in Figure 9(b)), and one with two DSPs (based on the design template in Figure 9(c)).
All designs include an accumulator of size \(Q\times R + 8\). The proposed dot-products were implemented with and without DSPs (see results in Table 5 for the non-ternary dot products).
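Functionally, all of these designs compute the same K-term signed dot product and differ only in how it is mapped to hardware. A behavioral reference model (our sketch, not the hardware mapping; we read the paper's "accumulator of size \(Q\times R + 8\)" as a \((Q + R + 8)\)-bit register, which is an assumption) can be written as:

```python
def dot_product_ref(acts, weights, q=8, r=8):
    """Behavioral reference of a K-term signed dot product.

    Activations are Q-bit signed, weights R-bit signed. Products are
    accumulated in a finite two's-complement register; the (Q + R + 8)-bit
    width is our reading of the paper's accumulator size.
    """
    assert all(-(1 << (q - 1)) <= a < (1 << (q - 1)) for a in acts)
    assert all(-(1 << (r - 1)) <= w < (1 << (r - 1)) for w in weights)
    acc_bits = q + r + 8
    mask = (1 << acc_bits) - 1
    acc = 0
    for a, w in zip(acts, weights):
        acc = (acc + a * w) & mask      # finite-width accumulation
    if acc >= 1 << (acc_bits - 1):      # reinterpret as signed value
        acc -= 1 << acc_bits
    return acc
```

Any of the A1-A3 or proposed P designs must match this function within the accumulator range, which makes it usable as a simulation oracle.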
Table 5. Area Results of the Dot-Product Unit for Different Designs
Compared to design A1, the proposed solution without DSP achieves area savings ranging from 54% (\(A1\_8\_8\times 8\_0\)) up to 59% (\(A1\_32\_4\times 4\_0\)). When compared to solution A3, with optimized multipliers and an adder tree with 4:2 compressors, the proposed dot-product unit without DSP reduces the number of LUTs by 13% (\(A3\_8\_8\times 8\_0\)), 22% (\(A3\_16\_8\times 4\_0\)), and 26% (\(A3\_32\_4\times 4\_0\)). The LUT savings grow as the multiplier size shrinks because the adder tree then accounts for a larger share of the total area.
The results also show that the DSPs are used more effectively for the \(8\times 8\) multipliers. For example, comparing the number of LUTs of solution \(P\_8\_8\times 8\_2\), with 2 DSPs, against solution \(A3\_8\_8\times 8\_0\), we obtain savings of 47%. The analogous comparison for multipliers of size \(8\times 4\) achieves only 37% savings.
The dot product with multipliers of size \(8\times 4\) benefits less from DSPs than the analogous dot product with multipliers of size \(8\times 8\) because a single DSP still implements only two multiply-adds, just as in the \(8\times 8\) case. The dot product with multipliers of size \(4\times 4\) is more efficient than the \(8\times 4\) case but less efficient than the \(8\times 8\) design.
For the dot-product implementations with ternary quantizations, the two proposed solutions were compared against three other designs, namely:
(1) \(T1\) - Dot product with parallel multipliers followed by an adder tree (based on the design template illustrated in Figure 1, with multipliers of size \(8\times 2\), \(4\times 2\), and \(2\times 2\));
(2) \(T2\) - Dot product with parallel optimized multipliers followed by an adder tree with 4:2 compressors (also based on the design template illustrated in Figure 1, with multipliers of size \(8\times 2\), \(4\times 2\), and \(2\times 2\));
(3) \(T3\) - Dot product with the solution from [23];
(4) \(O1\) - Dot product with the fused multiply-add units (based on the hardware template in Figure 6, with fused MADDs of size \(8\times 2\), \(4\times 2\), and \(2\times 2\));
(5) \(O2\) - Dot product with the dual multiplier followed by the conditional adder tree (based on the design explained in Section 3.2).
The results are presented in Table 6.
Table 6. Area Results of the Ternary Dot-Product Unit for Different Designs
The proposed dot products reduce the number of LUTs of the best previous \(8\times 2\) design by 18% and of the best \(4\times 2\) design ([23]) by 34%. For the fully ternary case (\(2\times 2\)), the proposal from [23], which uses dedicated ternary adder trees, achieves better results than our solution.
The utilization of DSPs is less efficient in the ternary cases because they are used only for the accumulators. An advantage of using the DSPs is that they break the chain of multiply-adds, which reduces the delay buffers needed to pipeline the input data.
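The reason the ternary case avoids multipliers entirely is visible in its functional form: each weight in \(\{-1, 0, +1\}\) turns the multiply into a conditional add, subtract, or skip, which is exactly what the conditional adder tree exploits. A minimal behavioral sketch (our illustration, not the RTL):

```python
def ternary_dot(acts, weights):
    """Ternary dot product: each weight in {-1, 0, +1} selects a
    conditional subtract, skip, or add, so no multiplier is needed."""
    assert all(w in (-1, 0, 1) for w in weights)
    acc = 0
    for a, w in zip(acts, weights):
        if w == 1:
            acc += a        # conditional add
        elif w == -1:
            acc -= a        # conditional subtract
        # w == 0: the operand is skipped entirely
    return acc
```

In hardware, the add/subtract selection reduces to sign-extension and an inverted operand with carry-in, which is why ternary trees are so much cheaper than multiplier-based ones.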
5.2 Results of the CNN Accelerator for the ResNet50 Model
The ResNet50 model was used to test the CNN accelerator with the proposed dot-product units for sizes \(8\times 8\), \(8\times 4\), \(4\times 4\), \(8\times 2\), and \(4\times 2\). The fully ternary configuration was not considered for ResNet50 since its accuracy drops sharply; moreover, our dot-product units for \(2\times 2\) quantization are outperformed by the previous work from [23]. The accuracy achieved for each of these configurations is provided in Table 7.
It is not the objective of this work to improve the state-of-the-art accuracy for ImageNet with ResNet50; the objective is only to obtain relative accuracies and check the hardware solution against the software solution. The variation in accuracy among the quantized solutions is just 1.7 pp, except for the most aggressive quantization, where there is a drop of 10 pp compared to floating point. The accuracies of the \(8\times 2\) and \(4\times 4\) quantizations are very close, with \(4\times 4\) slightly better.
The CNN accelerator was designed and implemented for these quantizations (see results in Table 8).
Table 8. Resource Utilization and Peak Performance of the Accelerator for Different Quantizations and Configurations of the Dot-Product Units
(a) - Peak performance.
Each architecture was configured for best efficiency when running ResNet50, that is, it uses \(64 \times 7\) cores, since the vertical sizes of the feature maps are multiples of 7. Sixteen additional cores are used for the first layer and a single core for the last layer. This core distribution guarantees a balanced dataflow execution of the first, hidden, and last layers in the three independent modules of the accelerator, that is, the execution times of the first and the last layers are close to the total execution time of all hidden layers.
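The balancing argument can be made concrete: in a dataflow pipeline of three modules (first layer, hidden layers, last layer) processing a stream of images, steady-state throughput is set by the slowest module, so cores are best spent equalizing the three stage times. A small illustration with hypothetical stage times (not taken from the paper):

```python
def pipeline_fps(stage_times_s):
    """Steady-state throughput (images/s) of a dataflow pipeline:
    once the pipeline is full, the slowest stage dominates."""
    return 1.0 / max(stage_times_s)

# Hypothetical per-image stage times (first, hidden, last layers):
balanced = pipeline_fps([2.1e-3, 2.2e-3, 2.0e-3])
# Making the first and last stages much faster does not raise the
# throughput; their cores would simply sit idle most of the time.
unbalanced = pipeline_fps([0.5e-3, 2.2e-3, 0.5e-3])
```

Both configurations sustain the same frame rate, which is why the chosen core split dedicates only 16 cores and 1 core to the first and last layers.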
As can be observed from the table, the \(4\times 4\) configuration needs more resources than the \(8\times 2\) configuration but, as seen above, the accuracy of \(4\times 4\) is higher. So, the most appropriate architecture for a particular design is a matter of trade-off between accuracy and area. The peak performance of the architectures running at 200 MHz ranges from 1.484 TOPS to 11.878 TOPS.
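These end points are consistent with counting two operations (a multiply and an add) per MADD per cycle across all cores. Assuming the 448 hidden-layer cores plus the 16 first-layer cores (464 cores, our reconstruction) and between 8 and 64 MADDs per core, the reported range follows:

```python
def peak_tops(cores, madds_per_core, freq_hz):
    """Peak performance: each MADD delivers a multiply and an add
    (2 operations) per cycle."""
    return cores * madds_per_core * 2 * freq_hz / 1e12

# Our reconstruction: 448 hidden-layer cores + 16 first-layer cores.
low = peak_tops(464, 8, 200e6)     # ~1.48 TOPS
high = peak_tops(464, 64, 200e6)   # ~11.88 TOPS
```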
The architectures with the different quantizations were designed and implemented in the FPGA. Only the architectures with DSPs were tested on the board, since the versions without DSPs run at the same frequency and therefore achieve the same performance. The implementation results are given in Tables 9 and 10, without and with DSPs, respectively.
Table 9. Measured Performance of the FPGA System for Different Quantizations Without DSP
Table 10. Measured Performance of the FPGA System for Different Quantizations with DSP
The results include the execution time of the inference of one image, the measured GOPS (giga operations per second), the ratio between the measured and the peak performance (a measure of the efficiency of the architecture), the number of images per second, and the GOPS achieved per thousand LUTs (kLUTs). Two versions of the same configuration were considered, differing in the number of MADDs per core. This allows us to check the influence of the external memory bandwidth on the execution efficiency of the architecture.
As expected, the number of images per second increases with the number of MADDs per core. However, the efficiency (GOPS/peak GOPS) decreases as the number of MADDs per core increases, that is, two designs with the same quantization but a different number of MADDs per core have different efficiencies. Two architectures with the same quantization and number of cores require the same volume of data to be transferred to/from the external memory. As we increase the number of MADDs, the accelerator processes the convolutions faster, so the ratio between communication and computation delay increases. At a certain point, the communication becomes the bottleneck and the performance efficiency drops.
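The tabulated metrics follow directly from one measured run given the model's operation count; the sketch below shows the derivation with purely hypothetical numbers (operation count, time, peak, and LUT figures are illustrative, not taken from the tables):

```python
def accelerator_metrics(total_ops, exec_time_s, peak_gops, klut):
    """Derive the reported table metrics from one measured inference."""
    gops = total_ops / exec_time_s / 1e9    # measured throughput
    efficiency = gops / peak_gops           # fraction of peak achieved
    fps = 1.0 / exec_time_s                 # images per second
    gops_per_klut = gops / klut             # area efficiency
    return gops, efficiency, fps, gops_per_klut

# Hypothetical run: ~7.7 GOP per ResNet50 inference, 5 ms per image,
# 2000 GOPS peak, 100 kLUTs (illustrative numbers only).
gops, eff, fps, gpk = accelerator_metrics(7.7e9, 5e-3, 2000.0, 100.0)
```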
To verify this, two configurations of the architecture with quantization \(4\times 4\) were implemented: one with 32 MADDs per core and the other with 16 MADDs per core. Both architectures were analyzed in detail to see the relation between computation and communication (see the communication and computation delays of the hidden layers of both architectures in Figures 22 and 23).
Fig. 22. Computation versus communication for the architecture with \(4\times 4\) quantization with 32 MADDs/core.
Fig. 23. Computation versus communication for the architecture with \(4\times 4\) quantization with 16 MADDs/core.
As can be observed from Figure 22, in most layers the communication time is higher than the computation time. These are the layers with \(1\times 1\) convolutions. Convolutions with \(3\times 3\) kernels have a computation time higher than the communication time. So, in this case, the bottleneck is the communication, which reduces the computational efficiency of the architecture. When the number of MADDs is halved, the peak performance of the CNN architecture is halved, and most layers have a computation time higher than the communication time. In this case, the performance efficiency increases. Therefore, the most appropriate design should be chosen according to the timing requirements, area, and accuracy.
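The per-layer behavior in Figures 22 and 23 can be approximated with a simple model comparing compute time against transfer time: \(1\times 1\) layers have a much lower operations-to-bytes ratio than \(3\times 3\) layers, so they are the first to become communication-bound. All numbers below are hypothetical (this is a rough estimate that ignores on-chip reuse and overlap details):

```python
def layer_times(out_pixels, in_ch, out_ch, k, bytes_moved,
                bw_bytes_s, macs_per_cycle, freq_hz):
    """Rough compute vs. communication time for one convolution layer."""
    macs = out_pixels * in_ch * out_ch * k * k
    t_compute = macs / (macs_per_cycle * freq_hz)
    t_comm = bytes_moved / bw_bytes_s
    return t_compute, t_comm

# Hypothetical 56x56 layers, 256 -> 256 channels, 7424 MACs/cycle,
# 6.4 GB/s of external bandwidth (illustrative numbers only).
tc1, tm1 = layer_times(56 * 56, 256, 256, 1, 1.67e6, 6.4e9, 7424, 200e6)
tc3, tm3 = layer_times(56 * 56, 256, 256, 3, 2.2e6, 6.4e9, 7424, 200e6)
# The 3x3 layer does 9x the MACs for similar traffic, so it ends up
# compute-bound while the 1x1 layer is communication-bound.
```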
In terms of accuracy versus images/s, a degradation of 2 pp in accuracy (from 73.8% to 71.8%) increases the number of images processed per second by 249 (over 250%).
5.3 Comparison with the State of the Art
The proposed architectures were compared with previous works with different quantizations (see Tables 11 and 12).
These works use different models with different computational complexities and memory bandwidth requirements; therefore, the GOPS/kLUT and GOPS/DSP metrics are used to quantify the area efficiency (GOPS/area) of the different designs. The area efficiency of the proposed work is better than that of previous works. The only comparable work is [35], running at 230 MHz and implemented in a SoC FPGA, where the processor and the memory controller are not implemented in reconfigurable hardware. Even with this advantage, [35] is still worse in terms of area and performance efficiency.
The previous architectures are mainly based on adder trees to calculate dot products and convolutions. As shown in the results for the proposed dot-product unit, our unit achieves from 15% (\(8\times 8\)) up to 30% (\(4\times 4\)) resource savings in the design of the dot products compared to an adder tree with a final ternary adder. Therefore, at least the same savings are expected for the whole CNN accelerator. The results in Tables 11 and 12 confirm this: the GOPS/kLUT and GOPS/DSP are better than those of previous architectures by well over 30%. The only comparable work is the one from [35] with \(8\times 8\) quantization. The dataflow of the architecture proposed in this paper is similar to the one from [35], but with adaptations to include the proposed dot-products. Considering the same frequency, the metrics improve by over 15%, showing the effectiveness of the proposed dot products.
|               | [45]          | [47]          | [35]          | [4](a)        | Ours          |
|---------------|---------------|---------------|---------------|---------------|---------------|
| Model         | DenseNet      | Custom        | VGG16         | VGG16         | ResNet50      |
| FPGA          | XCK325T       | XC7VX690T     | XC7045        | GX1150        | XC7K325T      |
| \(A\times W\) | \(8\times 8\) | \(8\times 8\) | \(8\times 8\) | \(8\times 8\) | \(8\times 8\) |
| LUTs          | 173522        | 73320         | 187007        | 129000(a)     | 138382        |
| DSPs          | 704           | 770           | 824           | 300           | 448           |
| BRAMs         | 194           | 405           | 460           | —             | 384           |
| Freq (MHz)    | 200           | 200           | 230           | 199           | 200           |
| FPS           | 24.1          | 6.8           | 53            | 4.9           | 161           |
| GOPS          | 176           | 210           | 1632          | 151           | 1245          |
| GOPS/kLUT     | 1.01          | 2.86          | 8.70          | 1.17          | 9.00          |
| GOPS/DSP      | 0.25          | 0.27          | 1.98          | 0.50          | 2.71          |
(a) Intel Arria 10 FPGA with ALMs and DSPs.
Table 11. Performance Comparison of the Proposed Architectures with Previous Works Running the Inference of CNN Models on FPGA with Quantization \(8\times 8\)
|               | [30]            | [19]          | Ours(a)       | [3]           | Ours(a)       |
|---------------|-----------------|---------------|---------------|---------------|---------------|
| Model         | ResNet50        | ResNet50      | ResNet50      | ResNet18      | ResNet50      |
| FPGA          | XCZU9EG         | XC7045        | XC7K325T      | XC7045        | XC7K325T      |
| \(A\times W\) | \(4/8\times 5\) | \(8\times 4\) | \(8\times 4\) | \(4\times 4\) | \(4\times 4\) |
| LUTs          | 180100          | 203000        | 99303         | 145049        | 98683         |
| DSPs          | 2092            | 0             | 0             | 900           | 304           |
| BRAMs         | 441             | 443           | 384           | 226           | 304           |
| Freq (MHz)    | 150             | 150           | 200           | 100           | 200           |
| FPS           | 109             | 104           | 175           | 99            | 313           |
| GOPS          | 891             | 804           | 1349          | 359           | 2413          |
| GOPS/kLUT     | 4.95            | 3.96          | 13.58         | 2.48          | 24.5          |
| GOPS/DSP      | 0.43            | —             | —             | 0.39          | 7.9           |
(a) Design with best GOPS/kLUT, according to Table 8.
Table 12. Performance Comparison of the Proposed Architectures with Previous Works Running the Inference of CNN Models on FPGA for Quantizations Below 8
Running the same ResNet50 model, the proposed design for the \(8\times 4\) quantization is about \(3\times\) more area efficient than [19] and uses half of its LUTs, a result of the highly efficient proposed dot products. The same trend is verified for the \(4\times 4\) quantization: comparing the solution with DSPs to [3], the proposed work is \(10\times\) better in terms of GOPS/kLUT and \(13.8\times\) better in terms of GOPS/DSP.
6 CONCLUSIONS AND FUTURE WORK
In this article, we propose new hardware designs for dot-product calculation based on fused multiply-additions and optimized conditional adder trees. The dot-product units were used in the implementation of a CNN accelerator in FPGA. This work is a step forward in the design of hardware accelerators for low bit-width quantized neural network models.
The dot-product units are replicated hundreds of times in a hardware accelerator. Therefore, their efficient design in terms of area and performance is very important to reduce the quantity of resources necessary to implement a particular CNN within specific performance and area constraints.
The results show that the units can be straightforwardly used in the design of CNN accelerators. In this paper, we considered a unified and configurable hardware module to run the whole model, but the same proposed units can be used with dataflow-like CNN accelerators since the core units are independent of the dataflow of the architecture.
In the future, we plan to adapt and apply the proposed arithmetic units in the design of accelerators with other types of layers, such as depthwise layers.
These flexible quantized arithmetic units will also be used to study important issues related to the design of accelerators for deep learning. These include the trade-off between hardware resources and accuracy, hybrid quantization at the layer level, sparse versus quantized models, and layer sensitivity to quantization and sparsity, among others.
REFERENCES
- [1] 2022. Neural Network Accelerator Comparison. https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/.
- [2] 2018. Convolutional neural network based image segmentation: A review. In Pattern Recognition and Tracking XXIX, Vol. 10649. International Society for Optics and Photonics, SPIE, 191–203.
- [3] 2021. Mix and match: A novel FPGA-centric deep neural network quantization framework. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 208–220.
- [4] 2020. CNN2Gate: An implementation of convolutional neural networks inference on FPGAs with automated design space exploration. Electronics 9, 12 (2020).
- [5] 2021. A survey of quantization methods for efficient neural network inference. CoRR abs/2103.13630 (2021). https://arxiv.org/abs/2103.13630.
- [6] 2019. [DL] A survey of FPGA-based neural network inference accelerators. ACM Trans. Reconfigurable Technol. Syst. 12, 1, Article 2 (Mar. 2019), 26 pages.
- [7] 2018. FBNA: A fully binarized neural network accelerator. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), 51–513.
- [8] 2015. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. CoRR abs/1510.00149 (2015).
- [9] 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
- [10] 2018. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7132–7141.
- [11] 2019. A resources-efficient configurable accelerator for deep convolutional neural networks. IEEE Access 7 (2019), 72113–72124.
- [12] 2017. Accelerating low bit-width convolutional neural networks with embedded FPGA. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), 1–4.
- [13] 2017. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 1–12.
- [14] 2014. Efficient high speed compression trees on Xilinx FPGAs. In MBMV.
- [15] 2018. High density and performance multiplication for FPGA. In 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH), 5–12.
- [16] 2018. A GPU-outperforming FPGA accelerator architecture for binary convolutional neural networks. J. Emerg. Technol. Comput. Syst. 14, 2, Article 18 (Jul. 2018), 16 pages.
- [17] 2019. High-performance FPGA-based CNN accelerator with block-floating-point arithmetic. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 8 (Aug. 2019), 1874–1885.
- [18] 2017. Throughput-optimized FPGA accelerator for deep convolutional neural networks. ACM Trans. Reconfigurable Technol. Syst. 10, 3, Article 17 (July 2017), 23 pages.
- [19] 2018. RNA: An accurate residual network accelerator for quantized and reconstructed deep neural networks. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), 60–603.
- [20] 2022. Review of ASIC accelerators for deep neural network. Microprocessors and Microsystems (2022), 104441.
- [21] 2019. A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 8 (Aug. 2019), 1861–1873.
- [22] 2017. Fast and efficient implementation of convolutional neural networks on FPGA. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 11–18.
- [23] 2018. High-efficiency convolutional ternary neural networks with custom adder trees and weight compression. ACM Trans. Reconfigurable Technol. Syst. 11, 3, Article 15 (Dec. 2018), 24 pages.
- [24] 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’16). ACM, New York, NY, USA, 26–35.
- [25] 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (Dec. 2015), 211–252.
- [26] 2019. Analyzing and increasing the reliability of convolutional neural networks on GPUs. IEEE Transactions on Reliability 68, 2 (2019), 663–677.
- [27] 2019. FPGA-based accelerators of deep learning networks for learning and classification: A review. IEEE Access 7 (2019), 7823–7859.
- [28] 2017. Maximizing CNN accelerator efficiency through resource partitioning. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA’17). ACM, New York, NY, USA, 535–547.
- [29] 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations.
- [30] 2022. FILM-QNN: Efficient FPGA acceleration of deep neural networks with intra-layer, mixed-precision quantization. In Proceedings of the 30th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’22).
- [31] 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
- [32] 2016. FINN: A framework for fast, scalable binarized neural network inference. CoRR abs/1612.07119 (2016). arXiv:1612.07119.
- [33] 2021. Efficient design of pruned convolutional neural networks on FPGA. Journal of Signal Processing Systems 93, 5 (May 2021), 531–544.
- [34] 2017. Parallel dot-products for deep learning on FPGA. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), 1–4.
- [35] 2020. A fast and scalable architecture to run convolutional neural networks in low density FPGAs. Microprocessors and Microsystems 77 (2020), 103136.
- [36] 2020. A configurable architecture for running hybrid convolutional neural networks in low-density FPGAs. IEEE Access 8 (2020), 107229–107243.
- [37] 2019. Hybrid dot-product calculation for convolutional neural networks in FPGA. In 2019 29th International Conference on Field Programmable Logic and Applications (FPL), 350–353.
- [38] 2016. Array multipliers for high throughput in Xilinx FPGAs with 6-input LUTs. Computers 5, 4 (2016).
- [39] 2018. A design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA. In 28th International Conference on Field-Programmable Logic and Applications (FPL).
- [40] 2021. Low-precision floating-point arithmetic for high-performance FPGA-based CNN acceleration. ACM Trans. Reconfigurable Technol. Syst. 15, 1, Article 6 (Nov. 2021), 21 pages.
- [41] 2021. Low-precision floating-point arithmetic for high-performance FPGA-based CNN acceleration. ACM Trans. Reconfigurable Technol. Syst. 15, 1, Article 6 (Nov. 2021), 21 pages.
- [42] 2021. Accelerating neural network inference on FPGA-based platforms-a survey. Electronics 10, 9 (2021). https://www.mdpi.com/2079-9292/10/9/1025.
- [43] 2020. Convolutional neural network with INT4 optimization on Xilinx devices. White Paper 521 (2020).
- [44] 2019. Synetgy: Algorithm-hardware co-design for ConvNet accelerators on embedded FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’19). ACM, New York, NY, USA, 23–32.
- [45] 2020. Light-OPU: An FPGA-based overlay processor for lightweight convolutional neural networks. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’20). ACM, New York, NY, USA, 122–132.
- [46] 2019. A fine-grained sparse accelerator for multi-precision DNN. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’19). ACM, New York, NY, USA, 185.
- [47] 2020. An efficient FPGA-based implementation for quantized remote sensing image scene classification network. Electronics 9, 9 (2020).
- [48] 2017. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’17). ACM, New York, NY, USA, 15–24.
- [49] 2019. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems 30, 11 (2019), 3212–3232.