Abstract
With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for three reasons: (1) the different dimensions within same-type layers, (2) the different convolution layer types, especially transposed and dilated convolutions, and (3) CNN’s complex dataflow graph. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency by employing dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA to process normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN in the ONNX representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve 2.3× performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups, with up to 5.98× and 13.42× for transposed and dilated convolutions, with a 6% average area overhead. The pipelined integration leads to a 5× speedup for OpenPose.
1 INTRODUCTION
Convolutional Neural Networks (CNNs) are widely used in many machine learning (ML) applications and have evolved quickly over the years. There is a growing interest in FPGAs for accelerating CNN computation due to their high energy efficiency and performance (e.g., References [6, 7, 22, 31, 37, 44, 48, 52, 59, 60, 62]). However, the recent advancement in CNN models and FPGA-based CNN acceleration has brought several new challenges.
Challenge 1: Performance disparity within CNN layers of the same type: In CNNs, layers of the same type (normal convolution layers, for instance) can have different characteristics in terms of their input and output number of channels, feature map size, and kernel size. This changes the computation to communication (CTC) ratio from layer to layer. Therefore, it is important to handle these layers differently given the performance disparity across these layers. We found that tiling factors can play an important role in performance. Zhang et al. [59] showed that the CTC ratio of a single convolution layer varies with different tiling factors. Yang et al. [55] highlighted the importance of choosing proper tiling factors for data reuse in the near and faster memory (on-chip storage for FPGAs) for the overall latency and energy efficiency. These studies lead us to consider using different tiling factors across the network. Figure 1 depicts how different tiling factors can affect the performance of each layer in one CNN network. We compare the performance of using a single set of tiling factors (uniform tiling) to using different tiling factors for each layer (dynamic tiling). For the uniform tiling, we chose the tiling factor that reduces the latency of the entire network. For the dynamic tiling, we focused on each layer and selected the best tiling factor accordingly. Experimental results show that dynamic tiling can speed up the performance of the whole network by \(1.7\times\).
Fig. 1. Performance comparison of designs using uniform and dynamic tiling factors for the first 24 convolutional layers in the CNN network in Figure 3.
Challenge 2: The inefficiency of general-purpose CNN accelerators in processing special CNN layers: Many modern CNNs feature complex architecture topologies with different layer types. One of these special layers is a fractionally strided or transposed convolution (T-CONV) layer [21] (also referred to as a deconvolution layer). It is an upsampling layer that uses trained weights to produce enlarged high-resolution feature maps. T-CONV layers are often used in image segmentation networks and generative adversarial networks such as U-Net [40], DCGAN [39], ArtGAN [49], DiscoGAN [30], and FSRCNN [20], to name a few. An atrous or dilated convolution (D-CONV) layer is another special layer that maintains the resolution and coverage of feature maps by expanding the receptive fields of convolution filters as discussed in Reference [57]. One famous network that uses D-CONV layers is CSRNet [32]. Some CNNs include a mixture of convolution layers such as E-Net [38], where normal convolution (N-CONV), transposed convolution, dilated convolution, and asymmetric convolution (A-CONV) layers are used. Both T-CONV and D-CONV layers can be naïvely implemented as normal convolution layers. However, such implementations introduce many zeros in the input feature maps of T-CONV layers and in the convolution filters of D-CONV layers, leading to a huge underutilization of the FPGA resources. To tackle this problem, we use a decomposition-based approach (discussed in Section 5) to implement N-CONV, T-CONV, and D-CONV layers efficiently in one versatile systolic array on an FPGA with minimal area overhead. Moreover, other networks such as MobileNetV1 [26] use depth-wise separable convolution layers introduced in Reference [45] to decrease the computation cost. MobileNetV2 [41] introduced the residual bottleneck block (RBB) to further reduce the computation complexity.
These layers reduce the computation cost but keep the same feature map size; this can make the layer more communication-bound and reduce the computation efficiency.
Challenge 3: Integration overheads of using FPGA in ML frameworks: When processing a CNN application in a modern ML framework such as TensorFlow [5], the complete stack consists of reading the input, computing the CNN, processing the result, and displaying and writing the result. Previous works have only focused on optimizing the CNN kernel on FPGA (e.g., References [6, 7, 22, 31, 48, 52, 59, 62]). This is due to the fact that CNN computation is the most time-consuming step of the whole stack. Hence, the rest of the overheads are ignored. While several works [22, 37] have focused on accelerator generation from TensorFlow-described networks, they did not address the challenges of integrating an accelerator into TensorFlow. By integrating our accelerator with TensorFlow, we are able to directly run networks from TensorFlow on an FPGA. Integrating FPGA into TensorFlow introduces a new set of overheads: communication between TensorFlow and FPGA and the communication between the host and the FPGA kernel itself. Figure 2 shows the breakdown of the end-to-end runtime for processing a 384 \(\times\) 384 RGB image using the network in Figure 3. These steps are listed and described in Section 7. The time for CNN processing, using our accelerator denoted as the kernel, only takes 11.8% of the total runtime. This emphasizes the need for an end-to-end SW/HW co-optimization. Our experiments show that this optimization can increase the end-to-end performance of this network from 4.8 FPS to 23.8 FPS, leading to a 5\(\times\) speedup.
Fig. 2. Runtime breakdown of an FPGA-based CNN acceleration pipeline in TensorFlow.
Fig. 3. OpenPose-V2 CNN architecture.
To solve the challenges above, we propose an FPGA-based CNN framework named FlexCNN. Its architecture employs dynamic tiling, layer fusion, and data layout transformation to adapt to the performance disparity of different CNN layers. Another major component of the architecture is our versatile systolic array, which can efficiently process different convolution layer types. The framework has a compilation flow that takes a CNN as an input, performs design space exploration, and generates an optimized hardware accelerator to run on FPGA. The accelerator is further integrated into a software-hardware pipeline to mitigate the large integration overheads by overlapping the software execution with the hardware computation.
A preliminary version of FlexCNN [47] was published in FPGA 2020. The new contributions in this article include: (1) a novel efficient versatile systolic array for normal, transposed, dilated, and asymmetric convolution layers; (2) ONNX support to handle multiple ML frameworks including TensorFlow, PyTorch, and Caffe; (3) code generation for the new TAPA [15] framework, which is integrated with AutoBridge [25] to improve design frequency; and (4) the implementations of U-Net, E-Net, and VGG-16 CNNs on FPGA using the FlexCNN framework.
In summary, the overall contributions of this work are:
An efficient, flexible, and composable dataflow architecture employing dynamic tiling, layer fusion, and data layout optimization to support a wide variety of CNNs;
A novel versatile systolic array that can efficiently process normal, transposed, dilated, and asymmetric convolution layers;
An automated compilation flow that takes a CNN dataflow graph as an input, maps it to the hardware dataflow graph, and performs a design space exploration to generate an optimized accelerator on FPGA;
A software-hardware pipelining scheme that can improve the end-to-end performance of CNNs;
Real-time efficient implementations of OpenPose, U-Net, E-Net, and VGG-16 CNNs on FPGA.
2 FRAMEWORK OVERVIEW
FlexCNN is an end-to-end framework for automatic hardware acceleration of CNNs on FPGA. FlexCNN implements a flexible and composable dataflow architecture that can be tailored for a variety of complex real-world CNNs. Section 4 discusses the architecture and the multiple optimization techniques such as dynamic tiling, layer fusion and layer parallelization, and data layout optimization. Another important component of the FlexCNN architecture is our novel versatile systolic array, which we discuss in Section 5. The versatile systolic array can efficiently process N-CONV, T-CONV, D-CONV, and A-CONV layers. Section 6 reviews the automated compilation flow. It takes an ONNX CNN model and an ordered list of FlexCNN modules as inputs, then outputs an optimized FPGA accelerator. The compilation tool maps the CNN dataflow graph to the given FlexCNN architecture, performs design space exploration for the best hardware parameters, and generates the synthesizable code for the architecture. Furthermore, FlexCNN implements a software-hardware pipelining technique (discussed in Section 7) to overlap the software overheads with the hardware execution, reducing the end-to-end runtime of a CNN’s inference.
3 APPLICATIONS
This section introduces the new layer types and building blocks used in the three real-world CNN applications: OpenPose, U-Net, and E-Net CNNs. It then highlights the applications, architectures, and layer types of each CNN.
3.1 New Layers and Building Blocks
3.1.1 Depthwise Separable Convolution (DSC).
In a normal convolution layer (N-CONV), the feature maps are filtered and combined in one step. The DSC splits this step into two phases. The first phase, depthwise convolution (DW), does the filtering, and the second phase, pointwise convolution (PW), combines the produced filtered feature maps using 1 \(\times\) 1 kernels.
A conv layer takes \(N\) feature maps as the input, each of size \(H \times W\). It uses \(M \times N \times K \times K\) kernels to produce M channels for the output. The total computation cost for this layer is \(M \times N \times H \times W\) \(\times\) \(K \times K\).
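However, a DSC uses \(N \times K \times K\) kernels for DW and \(M \times N \times 1 \times 1\) kernels for PW, for a total cost of \(N \times H \times W \times K \times K + M \times N \times H \times W\). Relative to a normal convolution, the amount of computation is therefore reduced to a fraction \(\frac{1}{M} + \frac{1}{K^2}\) of the original [26].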
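As a quick check of this reduction, the following sketch counts the MAC operations of both layer types. The function names and the example layer dimensions are illustrative assumptions, not values taken from the networks evaluated in this article.

```python
from fractions import Fraction

def conv_macs(M, N, H, W, K):
    """MACs of a normal convolution: M x N x H x W x K x K."""
    return M * N * H * W * K * K

def dsc_macs(M, N, H, W, K):
    """MACs of a depthwise separable convolution (DW phase + PW phase)."""
    depthwise = N * H * W * K * K   # one K x K filter per input channel
    pointwise = M * N * H * W       # 1 x 1 kernels combine the N channels
    return depthwise + pointwise

# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernel.
M, N, H, W, K = 128, 64, 32, 32, 3
ratio = Fraction(dsc_macs(M, N, H, W, K), conv_macs(M, N, H, W, K))
expected = Fraction(1, M) + Fraction(1, K * K)
print(ratio == expected, float(ratio))
```

For this hypothetical layer, the DSC needs roughly 12% of the MACs of the normal convolution, and the ratio is independent of \(N\), \(H\), and \(W\).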
3.1.2 Residual Bottleneck Block.
Google introduced RBB in MobileNetV2 [41] to reduce the computation cost. It consists of a 1 \(\times\) 1 conv followed by a 3 \(\times\) 3 DW and then another 1 \(\times\) 1 conv, each of which is followed by ReLU and a batch normalization layer. The 1 \(\times\) 1 convolutions are used for dimension reduction or restoration. The nature of this block allows us to reduce the number of input and output channels. This reduces the computation intensity and makes the network more efficient.
3.1.3 Special Convolution Layers.
Recent CNNs have introduced variations of the normal convolution layers such as:
A fractionally strided or transposed convolution (T-CONV) layer [21] (also referred to as a deconvolution layer) is an upsampling layer that uses trained weights to produce enlarged high-resolution feature maps. T-CONV layers are often used in image segmentation networks and generative adversarial networks such as U-Net [40], DCGAN [39], ArtGAN [49], DiscoGAN [30], and FSRCNN [20].
An atrous or dilated convolution (D-CONV) layer is another special layer that maintains the resolution and coverage of feature maps by expanding the receptive fields of convolution filters as discussed in Reference [57]. One famous network that uses D-CONV layers is CSRNet [32].
An asymmetric convolution (A-CONV) layer is a normal convolution layer that uses asymmetric filter sizes such as 1 \(\times\) 5 or 5 \(\times\) 1 filters. In terms of hardware acceleration, this layer requires extra logic to handle each dimension of the filters separately.
3.2 OpenPose
OpenPose [8], the winner of the COCO 2016 Keypoints Challenge, detects the 2D poses of multiple people in an image. The OpenPose network first extracts the features of the input image using the first 10 layers of VGG-19 [46]. This is the backbone of the network. These feature maps are the inputs to a two-branch network. The first branch detects confidence maps, representing body part locations, and the second branch detects part affinity fields, a set of 2D vectors showing the location and orientation of the limbs. The results of these two branches are concatenated with the feature maps from the backbone network and form the input for the next stage. After several iterations, these branches produce final predictions.
This network is interesting to us, since it has an irregular architecture compared to modern CNN-based deep-learning applications. Instead of just a linear forward path where each layer consumes the result of its previous layer, it has concatenation layers that need extra data movement. Moreover, to reduce the computation complexity of the network, we use a modified version of OpenPose [29] that replaces the backbone with a modification of MobileNetV2 [41] and employs DSC [45] for the rest of the network, following the trend in the ML community. Figure 3 depicts the network topology of this version; we call this network OpenPose-V2. Due to the space limitation, we only show the convolutional layers. Each convolution is followed by ReLU and batch normalization layers.
3.3 U-Net
U-Net [40] is a famous CNN used for biomedical image segmentation. It is an encoder-decoder neural network, where the encoder includes four downsampling blocks, and the decoder part is made of four upsampling blocks. Furthermore, the outputs of each downsampling block are added with the inputs of the corresponding upsampling block. Figure 4 illustrates U-Net’s architecture and building blocks. We chose U-Net with its irregular architecture topology and various layer types, including T-CONV layers, to show the effectiveness of our versatile systolic array for a real-world application. Table 1 demonstrates the breakdown of U-Net’s layers and the number of Giga floating-point operations (GFLOPs).
Fig. 4. U-Net downsampling and upsampling blocks and CNN architecture.
3.4 E-Net
E-Net [38] is a lightweight CNN used for pixel-wise semantic segmentation in real-time. Like U-Net, E-Net has an encoder-decoder structure, where the feature maps are downsampled by a factor of 4 and then upsampled by a factor of 4 to restore the original image size. The network is made of bottleneck blocks. Each bottleneck block gets its input from the preceding block, processes data in two branches, and then merges the two branches with an Add layer to be sent to the next block. The first branch contains a MaxPooling layer for the encoder blocks, an upsampling layer for the decoder blocks, or it can be empty for the intermediate blocks. The second branch contains N-CONV layers, D-CONV layers, A-CONV layers, or T-CONV layers. Figure 5 illustrates E-Net’s architecture and bottleneck blocks. We chose E-Net to test our framework with such a complex CNN topology and various layer types. E-Net represents a stress test for our compilation framework, which needs to map the CNN complex graph into the FlexCNN architecture. E-Net also contains all four types of convolution layers, which is a perfect test case for our versatile systolic array. All the layers used in E-Net and the number of Giga operations (GOPs) are shown in Table 2. The logical GOPs count is the number of operations including the zero multiply-accumulate (MAC) operations of T-CONV and D-CONV layers.
Fig. 5. E-Net bottleneck blocks and CNN architecture.
4 FLEXCNN ARCHITECTURE
4.1 A Composable Architecture
FlexCNN is a composable dataflow architecture made up of a number of streaming modules that are connected as a directed graph based on the target CNN architecture. It is flexible and composable in the sense that modules can be reordered, new modules can be added, or some modules can be removed, depending on the target CNN graph. This is particularly important, since state-of-the-art CNNs usually come with new special layers that rigid accelerators struggle to process efficiently. Thus, if a CNN layer type is not supported yet, then the user would simply need to develop a single module for that layer. Currently, FlexCNN has modules to support a variety of CNN layers (see Table 3).
| Module | Layers Supported |
|---|---|
| Standard Systolic Array | normal convolution layers |
| Versatile Systolic Array | normal, transposed, dilated, and asymmetric convolution layers |
| DW Conv | depth-wise convolution layer |
| Act & BN | activation (ReLU, ReLU6, PReLU, Leaky ReLU) and batch normalization layers |
| Add | piece-wise addition of two layers |
| Concat | concatenation of two layers |
| Upsample | nearest neighbor or bilinear upsampling layers |
| Pool | max-pooling or average pooling layers |
Table 3. Current Modules and Their Descriptions
Since FlexCNN is a dataflow architecture, it can be thought of as a coarse-grain pipeline where modules are the pipeline stages. To avoid pipeline stalls, we make sure that all modules are fully pipelined with an initiation interval of 1, meaning that each module produces and consumes data every clock cycle. Therefore, the overall latency is calculated as the latency of the longest pipeline stage + the latency of filling and draining the pipeline. Since convolution layers are the most compute-intensive, the longest pipeline stage is the latency of the systolic array modules. Furthermore, FlexCNN supports CNNs in different data types, including float 32-bit, fixed 16-bit, and fixed 8-bit. Figures 6, 7, and 8 show the architectures for OpenPose, U-Net, and E-Net CNNs, respectively.
Fig. 6. FlexCNN with a standard SA for OpenPose.
Fig. 7. FlexCNN with a versatile SA for U-Net.
Fig. 8. FlexCNN with a versatile SA for E-Net.
4.2 Modules
We implement line-buffer-based streaming architectures for the DW Conv, Act & BN, Add, Pool, and Upsample modules using a similar stencil-based architecture as in Reference [14]. All these modules are parameterized by factors as shown in Table 4, which will be explored by the design space exploration (DSE) engine covered in Section 6.2, for optimal performance. We apply double buffering in both the Reader modules and the Writer module. Furthermore, if the outputs of the whole layer can fit into the on-chip buffer, then the data will be pushed into on-chip buffers and directly fetched by the Reader to save the off-chip communication time.
4.3 Layer Fusion and Layer Parallelization
Due to the limited fast on-chip FPGA memory (BRAMs and URAMs), it is usually necessary to use the slow off-chip memory (DRAM), especially for large CNN models. Thus, the intermediate tensors of layers are loaded from the DRAM using the Reader modules, processed by the FlexCNN compute modules, and then written back to DRAM using the Writer module. An important feature of FlexCNN is that each module can be enabled or disabled dynamically during runtime to process or bypass the data flowing through that module. This feature allows FlexCNN to employ layer fusion and layer parallelization, where one DRAM read and write can process multiple CNN layers and reduce off-chip communication, thus improving the hardware utilization of the FPGA. Layer fusion applies to sequential layers from the original CNN graph. Layer parallelization applies to layers that are parallel in the original CNN graph. For example, in a downsampling bottleneck block of E-Net (Figure 5), \(L1\) can be fused with the previous or following ReLU layers on the same branch (layer fusion), and \(L1\) can be executed in parallel with \(L2\), the MaxPool layer of the downsampling block (layer parallelization). Section 6.1 examines the details of mapping CNN layers to the FlexCNN architecture in depth.
4.4 Dynamic Tiling
Tiling is applied when processing the network to improve data locality and minimize communication. Table 4 summarizes the tiling factors employed in FlexCNN, where \(N\) corresponds to the number of input feature maps, \(H\) and \(W\) to the height and width of the input feature maps, and \(M\) to the number of output feature maps. When the tiling factors are not sub-multiples of the tiled dimensions, redundant computation is introduced, which degrades the performance of the design. As explained in Section 1, in a normal CNN network, the types and configurations of layers vary from one layer to another. Therefore, the optimal tiling factors differ across layers as well. We have observed that using a uniform tiling factor for the whole network leads to a slowdown of up to 1.7\(\times\) compared to the ideal case of using different tiling factors across layers. Therefore, in this work, we apply dynamic tiling by reconfiguring the tiling factors of the accelerator on-the-fly for different layers to maximize the performance. Supporting dynamic tiling introduces some hardware overhead. However, this overhead is negligible compared to the performance improvement. Section 8 evaluates the impacts of this technique in detail.
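The redundant computation caused by tiling factors that do not divide the tiled dimension can be estimated with a short sketch. The helper names and the 48 \(\times\) 48 feature map below are hypothetical and only illustrate the effect; they are not part of the FlexCNN tool flow.

```python
import math

def padded_extent(dim, tile):
    """Extent actually processed when `dim` is covered by tiles of size `tile`."""
    return -(-dim // tile) * tile   # integer ceiling of dim / tile, times tile

def tiling_efficiency(dims, tiles):
    """Fraction of processed elements that are useful (1.0 means no redundancy)."""
    useful = math.prod(dims)
    processed = math.prod(padded_extent(d, t) for d, t in zip(dims, tiles))
    return useful / processed

# Hypothetical 48 x 48 feature map with two candidate (Th, Tw) choices.
print(tiling_efficiency((48, 48), (16, 16)))   # tiles divide the map exactly
print(tiling_efficiency((48, 48), (20, 20)))   # each dimension is padded to 60
```

With 20 \(\times\) 20 tiles, only 64% of the processed elements are useful, which is why a tiling factor that suits one layer can be a poor choice for another.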
Previous works such as References [44, 51, 62] have also emphasized the need for different tiling factors across layers. Our architecture differs from previous work by changing all the tiling factors dynamically for each layer, whereas previous work only adjusted part of the tiling factors or used several accelerators, each with distinct uniform tiling factors on-chip. Equation (1) shows the restrictions on the tiling factors of layer \(k\), where \(c_1\), \(c_2\), and \(c_3\) are positive integers:
(1) \(\begin{align} \begin{split} Tw(k) &= c_1 \times SA\_COL\\ Tm(k) &= c_2 \times SA\_ROW\\ Tn(k) &= c_3 \times SIMD\\ Tm(k) &= Tn(k+1) \end{split} \end{align}\)
In FlexCNN, the width and output channels of the feature maps are mapped to columns and rows of the SA, respectively. As a result, for each layer, \(Tw(k)\) and \(Tm(k)\) should be multiples of their respective SA dimension. The reduction of multiple input channels is computed in parallel inside each PE of the SA, which is defined as the SIMD lane. This implies that \(Tn(k)\) should be a multiple of SIMD lane. \(Th(k)\) can be any arbitrary value.
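The restrictions of Equation (1) can be expressed as a small validity check. This is only a sketch: the dictionary representation of a layer's tiling factors and the array dimensions in the example are assumptions made for illustration.

```python
def valid_tiling(tiles, sa_col, sa_row, simd):
    """Check Equation (1) for a list of per-layer dicts with keys Tw, Tm, Tn."""
    for k, t in enumerate(tiles):
        # Tw, Tm, and Tn must be multiples of SA_COL, SA_ROW, and SIMD.
        if t["Tw"] % sa_col or t["Tm"] % sa_row or t["Tn"] % simd:
            return False
        # The output-channel tile of layer k must equal the
        # input-channel tile of layer k + 1.
        if k + 1 < len(tiles) and t["Tm"] != tiles[k + 1]["Tn"]:
            return False
    return True

SA_COL, SA_ROW, SIMD = 8, 8, 8   # hypothetical array dimensions
layers = [
    {"Tw": 48, "Tm": 16, "Tn": 8},
    {"Tw": 24, "Tm": 32, "Tn": 16},
]
print(valid_tiling(layers, SA_COL, SA_ROW, SIMD))
```

\(Th(k)\) does not appear in the check because it is unconstrained.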
As mentioned before, the computation in the DW Conv module can be seen as a stencil kernel. Figure 9 depicts the 3 \(\times\) 3 stencil window connected by line buffers. As depicted in the figure, at each cycle, the line buffers fetch one pixel from a feature map and the data are shifted by one location. The length of the first two lines (for a general case, the first \(K-1\) lines with K being the filter size) is determined by \(Tw(k)\). After all the registers in the line buffers are filled with data (\((K-1) \times Tw(k) + K\) cycles), the computation can start by convolving the registers marked in black with the respective filter. Since the SA module needs to fetch SIMD elements in each cycle, the architecture in Figure 9 is duplicated SIMD times with each one fetching the data from a different feature map. As the length of the line buffer determines the \(Tw(k)\), each line should have “\(\max _{k} Tw(k)\)” registers. We realize dynamic tiling by connecting consecutive rows of the line buffer via a MUX, enabling data feeding from different locations.
Fig. 9. Architecture support for dynamic tiling in the Depth Conv module for a 3 \(\times\) 3 kernel with Tw of size 6/8/10.
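The behavior of the line buffers can be modeled in software as a single shift register of \((K-1) \times Tw + K\) entries. The sketch below is a functional model under that assumption, not the hardware description: it sizes the register from the tile width, whereas the hardware keeps \(\max_k Tw(k)\) registers per line and selects the tap positions through MUXes. The function name is hypothetical.

```python
from collections import deque

def stencil_windows(tile, K):
    """Stream a tile (list of rows, width Tw) pixel by pixel through a shift
    register of (K-1)*Tw + K entries and yield each valid K x K window."""
    Tw = len(tile[0])
    length = (K - 1) * Tw + K
    regs = deque(maxlen=length)          # regs[-1] is the newest pixel
    for y, row in enumerate(tile):
        for x, pixel in enumerate(row):
            regs.append(pixel)           # one pixel enters per cycle
            if y >= K - 1 and x >= K - 1:
                # Tap K registers from each of the K buffered rows.
                window = [[regs[-1 - ((K - 1 - r) * Tw + (K - 1 - c))]
                           for c in range(K)] for r in range(K)]
                yield (y, x), window

# Hypothetical 4 x 6 tile whose pixel values encode their coordinates.
tile = [[10 * y + x for x in range(6)] for y in range(4)]
windows = dict(stencil_windows(tile, 3))
print(windows[(2, 2)])
```

The first window becomes available once \((K-1) \times Tw + K\) pixels have entered the register, which matches the fill latency stated above.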
4.5 Data Layout Optimization
Data layout optimizations are applied to reduce the number of accesses to DRAM and increase the effective DRAM bandwidth. The first optimization is on the concatenation layers. A CNN network may contain blocks that concatenate the results of several layers. As shown in Figure 3, after each stage in the OpenPose-V2 network, results from two branches will be concatenated with the first outputs from the backbone network. This then serves as the inputs for the following stages. Figure 10 presents the optimized data organization of the network.
Fig. 10. Data organization for OpenPose.
The outputs of the backbone (region B) and each stage (region A, C) are placed close to each other, as shown in Figure 10. To be more specific, the outputs of Stage 1 will be written to region A. Regions A and B will serve as the inputs of Stage 2. In Stage 2, the outputs will be written to region C. The regions B and C will serve as the inputs of Stage 3, similarly. The outputs of each stage are written to regions A and C in a round-robin fashion. With this layout, the outputs of stage branches are concatenated on-the-fly, eliminating unnecessary off-chip DRAM movements.
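The round-robin use of regions A and C can be summarized by the following sketch. The function name is hypothetical, and the sketch assumes that Stage 1 reads only the backbone output in region B, which the description above implies but does not state explicitly.

```python
def stage_regions(num_stages):
    """Return (input regions, output region) for each stage. Region B holds
    the backbone output; regions A and C alternate as stage outputs."""
    plan = []
    for stage in range(1, num_stages + 1):
        out = "A" if stage % 2 == 1 else "C"
        if stage == 1:
            inputs = ["B"]                       # assumption: backbone output only
        else:
            prev = "C" if out == "A" else "A"    # region written by the previous stage
            inputs = sorted([prev, "B"])
        plan.append((inputs, out))
    return plan

for stage, (inputs, out) in enumerate(stage_regions(4), start=1):
    print(f"Stage {stage}: reads {inputs}, writes {out}")
```

Because region B is adjacent to both A and C, each stage reads its concatenated input as one contiguous range.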
To further improve the effective DRAM bandwidth, we change the data layout of the feature maps from \(N(k) \times H(k) \times \frac{W(k)}{Tw(k)} \times Tw(k)\) to \(\frac{N(k)}{Tn(k)} \times H(k) \times \frac{W(k)}{Tw(k)} \times Tn(k) \times Tw(k)\). This allows us to increase the burst length from \(Tw(k)\) to \(Tn(k) \times Tw(k)\). A DSC layer can easily become communication-bound because of its low computation to communication (CTC) ratio, since it is mostly using 1 \(\times\) 1 convolution kernels. In this case, when the kernel size of the next layer is 1 \(\times\) 1, since there is no overlapped region between different tiles, we further change the data layout to \(\frac{N(k)}{Tn(k)} \times \frac{H(k)}{Th(k)} \times \frac{W(k)}{Tw(k)} \times Tn(k) \times Tw(k) \times Th(k)\). It further increases the burst length for these layers to \(Tn(k) \times Th(k) \times Tw(k)\). For other kernel sizes, padding is applied, because a tile of \(Tn(k) \times Tw(k) \times Th(k)\) does not have all the data needed for the computation. We need to have \((p-1)\) and \(((p-1) \times Th(k) + (p-1)^2)\) extra DRAM accesses with burst length of \(Tn(k) \times Tw(k)\) and \(Tn(k)\), respectively, to fetch all the data (\(p\) denoting the kernel size). This increases the number of DRAM accesses with a burst length of \(Tn(k)\), which further increases the communication time and prevents us from applying this data layout.
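The effect of the first layout change on burst length can be illustrated with an address computation. This is a sketch under the assumption that \(N\) and \(W\) are multiples of \(Tn\) and \(Tw\); the function names and the example sizes are hypothetical.

```python
def tiled_offset(n, h, w, H, W, Tn, Tw):
    """Linear offset of element (n, h, w) in the
    (N/Tn) x H x (W/Tw) x Tn x Tw layout."""
    n_blk, n_in = divmod(n, Tn)
    w_blk, w_in = divmod(w, Tw)
    return (((n_blk * H + h) * (W // Tw) + w_blk) * Tn + n_in) * Tw + w_in

def burst_offsets(n_blk, h, w_blk, H, W, Tn, Tw):
    """Offsets touched when loading one row of one tile (Tn channels x Tw pixels)."""
    return sorted(tiled_offset(n_blk * Tn + i, h, w_blk * Tw + j, H, W, Tn, Tw)
                  for i in range(Tn) for j in range(Tw))

# Hypothetical sizes: H = 4, W = 8, Tn = 2, Tw = 4.
offsets = burst_offsets(n_blk=1, h=2, w_blk=1, H=4, W=8, Tn=2, Tw=4)
print(offsets)
```

The \(Tn \times Tw\) elements of one tile row occupy consecutive addresses, so they can be fetched in a single burst instead of \(Tn\) bursts of length \(Tw\).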
5 THE VERSATILE SYSTOLIC ARRAY
5.1 Problem Formulation
5.1.1 Transposed Convolution.
At first glance, transposed convolution seems to be a completely different operation from a normal convolution. As shown in Figure 11, one T-CONV operation is a scalar multiplication of an input pixel by a \(K \times K\) filter, and the output result (of size \(K \times K\)) is placed in the output feature map (FM) separated by a distance determined by the T-CONV stride (\(S^\prime\)). The overlapping results in the output feature map are then added together to give the final output feature map.
Fig. 11. T-CONV original operation ( \(K=3,S^\prime =2\) ).
This same operation can be performed as a normal convolution by inserting \(S^\prime -1\) zeros between adjacent pixels of the input feature maps and convolving a reversed filter with stride \(S=1,\) as shown in Figure 12. Note that the gray zeros are part of the padding, which is required for N-CONV layers as well.
Fig. 12. Naïve computation of T-CONV ( \(K=3,S^\prime =2\) ).
Let \(N, I_h, I_w\) and \(M, O_h, O_w\) represent the channels, height, and width of the input and output FMs, respectively. This naïve implementation requires \(K^2 N M O_h O_w\) multiply-accumulate (MAC) operations (\(3^2\) \(\times\) 4 \(\times\) 4 MACs in Figure 12), but the non-zero MAC operations are only \(K^2 N M I_h I_w\) (\(3^2\) \(\times\) 2 \(\times\) 2 MACs in Figure 14). The ideal speedup for T-CONV is given by Equation (2):
(2) \(\begin{equation} Transposed\ Convolution\ Ideal\ Speedup = \frac{O_h O_w}{I_h I_w} = \frac{S^\prime I_h \times S^\prime I_w}{I_h I_w} = S^{\prime 2} \end{equation}\)
5.1.2 Dilated Convolution.
Similar to transposed convolution, dilated convolution can be naïvely implemented as a normal convolution operation by inserting \(d-1\) zeros between the filters’ values (Figure 13), where \(d\) is the D-CONV dilation rate. The number of MAC operations using this method is \((dK-d+1)^2 NO_hO_wM\) (\(3^2\) \(\times\) 2 \(\times\) 2 MACs in Figure 13). However, the effectual non-zero MAC operations are only \(K^2NO_hO_wM\) (\(2^2\) \(\times\) 2 \(\times\) 2 MACs in Figure 15). Equation (3) gives the ideal speedup for D-CONV:
(3) \(\begin{equation} Dilated\ Convolution\ Ideal\ Speedup = \frac{(dK-d+1)^2}{K^2} \end{equation}\)
Fig. 13. Naïve computation of D-CONV ( \(K=2,d=2\) ).
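Equations (2) and (3) can be evaluated directly for the configurations shown in the figures. The function names below are hypothetical; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def tconv_ideal_speedup(stride):
    """Equation (2): naive MACs / non-zero MACs = S'^2."""
    return stride ** 2

def dconv_ideal_speedup(K, d):
    """Equation (3): (dK - d + 1)^2 / K^2."""
    return Fraction((d * K - d + 1) ** 2, K ** 2)

print(tconv_ideal_speedup(2))          # K = 3, S' = 2 (Figures 12 and 14)
print(dconv_ideal_speedup(K=2, d=2))   # K = 2, d = 2 (Figures 13 and 15)
```

For the D-CONV example, the naïve filter is 3 \(\times\) 3 while only 2 \(\times\) 2 values are non-zero, which gives the ratio 9/4.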
Now, the problem is how to design a versatile SA that can eliminate the ineffectual zero MAC operations to achieve the theoretical ideal speedups with minimal area overhead.
5.2 Approach
Previous FPGA works attempted to accelerate either T-CONV layers, such as References [19, 33, 34, 56, 58], or D-CONV layers, such as Reference [61], but not both. However, some ASIC works proposed versatile accelerators for T-CONV and D-CONV layers. References [10] and [35] target general sparsity including T-CONV and D-CONV layers. Reference [28] uses a systolic array with delay cells to skip the zero MAC operations. Reference [9] proposes a decomposition approach that decomposes T-CONV and D-CONV layers into dense N-CONV layers. However, none of these previous works discussed the area overhead of supporting T-CONV and D-CONV layers efficiently. Table 5 summarizes these works.
| Work | Device | N-CONV Support | T-CONV Support | D-CONV Support | Design Generation |
|---|---|---|---|---|---|
| FlexCNN (ours) | FPGA | Yes | Yes | Yes | Automatic |
| Electronics 2021 [61] | FPGA | Yes | No | Yes | Manual |
| VLSI 2020 [58] | FPGA | Yes | Yes | No | Automatic |
| Electronics 2020 [19] | FPGA | Yes | Yes | No | Manual |
| FCCM 2018 [56] | FPGA | Yes | Yes | No | Automatic |
| ISCAS 2020 [9] | ASIC | Yes | Yes | Yes | Manual |
| ISCAS 2019 [28] | ASIC | Yes | Yes | Yes | Manual |
| VLSI 2020 [10] | ASIC | Yes | Yes | Yes | Manual |
| ISCAS 2019 [35] | ASIC | Yes | Yes | Yes | Manual |
Table 5. Versatile SA Comparison with Other Works
We chose to base our work on the decomposition approach in Reference [9], since it requires the fewest changes to a standard SA and the least area overhead. However, their work did not provide enough formulation and details on ideal speedups and filter/feature map decomposition for arbitrary filter size, T-CONV stride (\(S^\prime\)), and dilation rate (\(d\)). In this section, we illustrate the decomposition approach and provide a decomposition algorithm for T-CONV and D-CONV.
5.2.1 Decomposition of T-CONV Operation.
The decomposition of T-CONV operation gets rid of the non-effectual zero MAC operations by decomposing the convolution filters into \(S^{\prime 2}\) sub-filters that convolve over the dense input feature maps producing the same outputs as the naïve implementation, as shown in Figure 14.
Fig. 14. Efficient computation of T-CONV ( \(K=3,S^\prime =2\) ).
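The equivalence between the naïve and decomposed forms can be checked on a small 1-D example. The following Python sketch is illustrative only: it assumes no padding and a unit output stride, and the helper names are hypothetical rather than part of FlexCNN. In 1-D the filter splits into \(S^\prime\) sub-filters; the 2-D case of Figure 14 applies the same split along both axes, giving \(S^{\prime 2}\) sub-filters.

```python
def naive_tconv_1d(x, w, stride):
    """Zero-insertion T-CONV: place (stride - 1) zeros between pixels, then slide w."""
    up = [0] * ((len(x) - 1) * stride + 1)
    up[::stride] = x
    n_out = len(up) - len(w) + 1
    return [sum(w[i] * up[y + i] for i in range(len(w))) for y in range(n_out)]


def decomposed_tconv_1d(x, w, stride):
    """Same outputs, using only the filter taps that meet a real input pixel."""
    n_out = (len(x) - 1) * stride + 1 - len(w) + 1
    out = []
    for y in range(n_out):
        a = (-y) % stride          # first tap aligned with a non-zero input
        sub_filter = w[a::stride]  # one of the `stride` sub-filters
        start = (y + a) // stride  # matching position in the dense input
        out.append(sum(c * x[start + k] for k, c in enumerate(sub_filter)))
    return out


x, w = [1, 2, 3, 4], [1, 10, 100]
assert naive_tconv_1d(x, w, 2) == decomposed_tconv_1d(x, w, 2)
```

For this input (\(K=3\), \(S^\prime=2\)), the sub-filters have 2 taps and 1 tap, so the decomposed form performs 8 MACs where the zero-inserted form performs 15.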
5.2.2 Decomposition of D-CONV Operation.
The decomposition of dilated convolution is more straightforward. While filters are decomposed in T-CONV, the input feature maps of D-CONV are decomposed into \(d^2\) sub-feature maps. Each sub-feature map contains non-contiguous pixels separated by a distance \(d-1\), as shown in Figure 15.
Fig. 15. Efficient computation of D-CONV ( \(K=2,d=2\) ).
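A minimal Python sketch of this idea follows, assuming stride 1, no padding, and a square \(K \times K\) filter. The function names are hypothetical. It computes a D-CONV directly and then again as \(d^2\) dense convolutions over strided sub-feature maps, writing each partial result back to its interleaved position in the output.

```python
def conv2d(fm, w):
    """Dense N-CONV (stride 1, no padding) of a 2-D list with a K x K filter."""
    k = len(w)
    rows, cols = len(fm) - k + 1, len(fm[0]) - k + 1
    return [[sum(w[i][j] * fm[y + i][x + j] for i in range(k) for j in range(k))
             for x in range(cols)] for y in range(rows)]


def naive_dconv(fm, w, d):
    """D-CONV computed directly: neighbouring filter taps are d pixels apart."""
    k = len(w)
    span = (k - 1) * d + 1  # footprint of the dilated filter
    rows, cols = len(fm) - span + 1, len(fm[0]) - span + 1
    return [[sum(w[i][j] * fm[y + i * d][x + j * d]
                 for i in range(k) for j in range(k))
             for x in range(cols)] for y in range(rows)]


def decomposed_dconv(fm, w, d):
    """D-CONV as d*d dense convolutions over strided sub-feature maps."""
    k = len(w)
    span = (k - 1) * d + 1
    rows, cols = len(fm) - span + 1, len(fm[0]) - span + 1
    out = [[0] * cols for _ in range(rows)]
    for a in range(d):
        for b in range(d):
            sub = [row[b::d] for row in fm[a::d]]  # sub-feature map (a, b)
            if len(sub) < k or len(sub[0]) < k:
                continue  # too small to produce any output pixel
            for u, part_row in enumerate(conv2d(sub, w)):
                for v, val in enumerate(part_row):
                    out[a + u * d][b + v * d] = val  # interleave results back
    return out
```

Each sub-feature map is dense, so the convolutions in `decomposed_dconv` contain no zero MAC operations.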

5.2.3 Unified Decomposition Algorithm.
Algorithm 1 formulates the decomposition of an \(N \times N\) 2-D input matrix \(I\) given a constant \(Z\), where \(I\) is a dense filter and \(Z=S^\prime\) for T-CONV or \(I\) is a dense input FM and \(Z=d\) for D-CONV. The algorithm has two steps: First, it gets the height and width dimensions of each sub-matrix of \(I\). Second, it gets the values of each sub-matrix. After that, it returns the decomposed sub-matrices of \(I\).
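The two steps can be sketched in Python as follows. This is a minimal sketch, not a transcription of Algorithm 1: it assumes that sub-matrix \((a, b)\) is formed by sampling every \(Z\)-th element of \(I\) starting at offset \((a, b)\). The name `decompose` and the dictionary return type are illustrative choices.

```python
from math import ceil


def decompose(I, Z):
    """Split an N x N matrix I into Z*Z sub-matrices by strided sampling."""
    N = len(I)
    subs = {}
    for a in range(Z):
        for b in range(Z):
            # Step 1: height and width of sub-matrix (a, b).
            h = ceil((N - a) / Z)
            w = ceil((N - b) / Z)
            # Step 2: gather every Z-th element starting at offset (a, b).
            subs[(a, b)] = [[I[a + i * Z][b + j * Z] for j in range(w)]
                            for i in range(h)]
    return subs


filt = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for offset, sub in decompose(filt, 2).items():
    print(offset, sub)
```

For a 3 \(\times\) 3 input and \(Z=2\), this yields sub-matrices of sizes 2 \(\times\) 2, 2 \(\times\) 1, 1 \(\times\) 2, and 1 \(\times\) 1, which matches the sub-filter sizes of the T-CONV example in Figure 14.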
5.3 The Versatile Systolic Array
The versatile systolic array efficiently supports four types of convolutional layers, i.e., N-CONV, T-CONV, D-CONV, and A-CONV layers. Figure 16 illustrates the high-level architecture of the versatile SA. Both T-CONV and D-CONV layers can be naïvely implemented as N-CONV layers by inserting zeros in the input FMs of a T-CONV layer or in the filters of a D-CONV layer, as shown in Figures 12 and 13. However, this naïve implementation leads to huge underutilization of computation resources due to the zero MAC operations. To the best of our knowledge, this is the first efficient FPGA implementation of N-CONV, T-CONV, and D-CONV in one systolic array.
Fig. 16. The versatile SA architecture.
5.3.1 The Architecture and Dataflow of the VSA.
To implement the decomposition approach in a systolic array, we used the open-source framework PolySA [16]. The systolic array is output-stationary and consists of \(SA\_COL \times SA\_ROW\) PEs, each containing \(SIMD\) MAC engines. The SA has \(SA\_COL\)
5.3.2 T-CONV Implementation.
In a normal convolution layer with a \(3 \times 3\) filter and one input feature map, each output pixel is computed as the dot product of the filter and the corresponding input pixels. This translates to 9 MAC operations on one of the PE’s local registers. In a T-CONV layer with a \(3 \times 3\) filter, \(S^\prime =2\), and one input feature map, the 9 MAC operations are decomposed into \(S^{\prime 2}=4\) N-CONV operations with \(2 \times 2\), \(2 \times 1\), \(1 \times 2\), and \(1 \times 1\) sub-filters, as illustrated in Figure 14. The MAC operations of the four sub-filters are computed in four different registers. Thus, changing the address of result accumulations is the only modification in the PEs. At this stage, the output pixels in the PEs are not organized. We avoided implementing data reorganization in the PEs and shifted that logic to the
5.3.3 D-CONV Implementation.
The decomposition approach for D-CONV is slightly different from T-CONV. Instead of decomposing the filters, the input feature maps are decomposed, as illustrated in Figure 15. The
6 COMPILATION FRAMEWORK
The compilation framework takes a CNN graph and an ordered list of the modules needed for that CNN as inputs and generates an optimized FPGA accelerator. The compilation framework has three major components (Figure 17). This section discusses the components of the framework in detail.
Fig. 17. Compilation system.
6.1 CNN Layer Mapper
6.1.1 ONNX.
While the original FlexCNN framework supported TensorFlow CNNs only, the updated framework uses the Open Neural Network Exchange (ONNX) format. ONNX is an open-source framework that establishes open standards for representing machine learning algorithms and software tools. The ONNX representation supports models from multiple popular ML frameworks, such as TensorFlow, PyTorch, Caffe, and scikit-learn. ONNX packages a deep neural network (DNN) model into a single file. This file contains: (1) the DNN’s graph, where each node represents a DNN layer and each edge represents the data flow from one node to another, and (2) the DNN’s parameters, mainly weights and biases.
6.1.2 CNN Layer Mapping.
Now, having a compact representation of any CNN, it is easier for the CNN Layer Mapper to map the nodes of the CNN to the ordered list of FlexCNN modules. This is the component that performs layer fusion and layer parallelization. The architecture must have modules to support all the CNN’s layers. In most CNNs, the convolution layers are the most compute-intensive operations, and the SA is the bottleneck module. Our mapping algorithm iterates through each convolution node and checks to see if the predecessor, successor, or parallel nodes of the convolution node can be mapped to the ordered list. It then outputs a list of layer bundles that are sent to the design space exploration, which is discussed next.
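The bundling idea can be illustrated with a deliberately simplified Python sketch. It is not the FlexCNN mapper: it assumes a linear chain of layers, considers only successors (not predecessors or parallel branches), and uses hypothetical layer and module names. Each convolution absorbs the layers that follow it for as long as they map to later modules in the ordered list.

```python
def bundle_layers(layers, module_order):
    """Greedily group each layer with successors that map to later modules."""
    rank = {name: i for i, name in enumerate(module_order)}
    bundles, i = [], 0
    while i < len(layers):
        bundle = [layers[i]]
        last = rank[layers[i]]
        i += 1
        # Absorb successors while they fit later in the module pipeline.
        while i < len(layers) and layers[i] != "conv" and rank[layers[i]] > last:
            last = rank[layers[i]]
            bundle.append(layers[i])
            i += 1
        bundles.append(bundle)
    return bundles


# Hypothetical module order and network, for illustration only.
modules = ["conv", "relu", "pool"]
network = ["conv", "relu", "pool", "conv", "relu", "conv", "pool", "relu"]
print(bundle_layers(network, modules))
```

In this toy network the trailing `relu` follows a `pool`, which is out of order with respect to the module list, so it ends up in a bundle of its own.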
6.2 Design Space Exploration
Given the network, the accelerator architecture, and the FPGA’s resource information, we perform a design space exploration (DSE) to select the optimal design parameters that minimize the inference latency of the CNN on the target FPGA. Table 4 lists the design parameters to be determined.
Two analytical models, resource_est() and latency_est(), are built to estimate the resource usage and latency of designs. Currently, the resource model estimates block RAM (BRAM) and DSP usage, which are usually the bottleneck resources. The DSE process sweeps through the design space over all feasible combinations of design parameters. For each design parameter list, the resource usage is examined first, and designs that over-utilize the resources are pruned. Then, we follow a greedy algorithm to select the optimal tiling factors that minimize the latency layer by layer. The DSE process finishes within minutes on a standard workstation.
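The sweep, prune, and greedy selection steps can be sketched as follows. Only the names `resource_est` and `latency_est` come from the text; their bodies here are toy stand-ins (assumptions) and do not reproduce the actual analytical models. The parameter spaces and budgets are also made up.

```python
from itertools import product


def resource_est(rows, cols, simd):
    """Toy resource model (assumption): returns (DSP count, BRAM count)."""
    return rows * cols * simd, 2 * rows * cols


def latency_est(macs, tile, rows, cols, simd):
    """Toy latency model (assumption): compute time plus a tiling trade-off."""
    return macs / (rows * cols * simd) + 0.01 * macs / tile + 10 * tile


def explore(layer_macs, budget, hw_space, tile_space):
    """Sweep hardware parameters, prune by resources, pick tiles greedily."""
    best = None
    for rows, cols, simd in product(*hw_space):
        dsp, bram = resource_est(rows, cols, simd)
        if dsp > budget["dsp"] or bram > budget["bram"]:
            continue  # design over-utilizes the FPGA: prune it
        # Greedy step: choose the best tiling factor for each layer on its own.
        tiles = [min(tile_space, key=lambda t: latency_est(m, t, rows, cols, simd))
                 for m in layer_macs]
        total = sum(latency_est(m, t, rows, cols, simd)
                    for m, t in zip(layer_macs, tiles))
        if best is None or total < best["latency"]:
            best = {"latency": total, "hw": (rows, cols, simd), "tiles": tiles}
    return best


best = explore([1_000_000, 100_000], {"dsp": 512, "bram": 200},
               ([4, 8, 16], [4, 8, 16], [4, 8]), [8, 16, 32, 64])
print(best["hw"], best["tiles"])
```

Because the tiling factor is chosen per layer, different layers can end up with different tiles under the same hardware configuration, which is the effect dynamic tiling relies on.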
6.3 Design Generation
This step creates the code that is synthesized into the hardware accelerator. Since we are targeting Xilinx/AMD FPGAs, our design generator creates Xilinx/AMD High-Level Synthesis (HLS) code [53]. Generating the bitstream for such complex architectures has been challenging, especially when using large systolic arrays. The bitstream generation task would usually fail the placement and routing step. For this reason, we recently added support to generate TAPA code [15]. TAPA is a dataflow HLS framework that offers fast compilation, and it generates high-frequency designs with the help of AutoBridge [25]. AutoBridge is a tool targeted at large dataflow architectures. It helps the process of placement and routing by placing the dataflow modules evenly across the FPGA fabric and connecting them with pipelining registers to minimize the critical paths of the design.
Now, having the optimal hardware parameters from the DSE, the user can choose to produce Xilinx/AMD HLS code or TAPA code. The code generation is template-based. The original FlexCNN paper used the PolySA [16] compiler to generate a standard systolic array. To automate the process of generating new versatile SAs with different dimensions based on an application target, we integrated our modifications on the standard SA into the PolySA compilation framework to create new versatile SAs with a push of a button. We also used Algorithm 1 and other scripts to automatically prepare test data to run on FPGA.
7 SOFTWARE-HARDWARE PIPELINING
Figure 2 illustrates the software overheads incurred when integrating an FPGA kernel into a machine learning framework such as TensorFlow. Left unaddressed, these overheads defeat the purpose of hardware acceleration. To overcome this challenge, we use a software-hardware pipelining technique that overlaps the software execution with the hardware kernel execution. We chose TensorFlow as our ML framework, since it is widely used for inference in the ML community (e.g., References [27, 36]). To invoke the FPGA from TensorFlow, we redefine the nodes in the original computation graph. All computation nodes of the CNN are merged into one node that is implemented on the FPGA. The rest of the graph is still processed on the CPU.
When FPGA is connected to TensorFlow, the whole integration stack consists of the following steps: (1) reading the inputs of CNN, (2) pre-processing including stages such as image resizing, (3) re-organizing the initial data layouts in CPU memory, (4) transferring data from CPU to FPGA device memory, (5) computation on FPGA, (6) fetching the results back via PCIe, (7) reformatting and passing it to TensorFlow, (8) non-CNN computation stages on CPU, (9) processing the results (e.g., estimating the human poses based on the attained results and drawing them for the OpenPose network), and (10) writing out and displaying the results.
Figure 2 shows the breakdown of these stages in the OpenPose application for a 384 \(\times\) 384 RGB input. Of the whole pipeline, which takes 208.8 ms, the FPGA computation in Step 5 accounts for only 11.8% of the total time. The integration overheads thus lead to an \(8.45\times\) performance slowdown. To reduce these overheads, we apply optimized software/hardware pipelining.
A two-level pipelining is applied on the whole integration stack that enables the simultaneous processing of the aforementioned steps. The first level overlaps TensorFlow’s overheads (steps 1, 2, 9, 10) with the rest of the steps. The second one overlaps FPGA’s computation with data movement steps (steps 3, 4, 6, 7).
Figure 18 illustrates the first level of the pipeline, which is applied at the TensorFlow level. The numbers in the figure are the related step numbers. Pipelining is enabled by multiprocessing: steps 1, 2, 9, and 10 and the rest of the steps are assigned to separate processes that pass data to one another through queues. As a result, steps 1, 2, 9, and 10 are overlapped with the FPGA-related steps, and the overall performance is determined by the stage with the longest latency.
Fig. 18. First level of the pipeline.
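A queue-connected pipeline of this kind can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it uses threads from the standard library in place of separate processes to stay self-contained, and the stage functions are placeholders, not the actual TensorFlow or FPGA steps. The structure, with one worker per stage and a queue between neighbours, is the same.

```python
import queue
import threading


def stage(work, inbox, outbox):
    """Apply `work` to every item from inbox; None is the shutdown signal."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def run_pipeline(frames, stages):
    """Run frames through the stages, one worker per stage, linked by queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    workers = [threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(stages)]
    for worker in workers:
        worker.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(None)
    results = []
    while True:
        item = queues[-1].get()
        if item is None:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


# Placeholder stages: read and pre-process, FPGA-related steps, post-process.
print(run_pipeline([1, 2, 3], [lambda f: f + 1, lambda f: f * 2, str]))
```

While one frame is in the middle stage, the next frame can already be pre-processed, so in steady state the throughput is set by the slowest stage rather than by the sum of all stages.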
To further improve the performance, we fully pipeline the communication and computation of the FPGA, which consists of steps 3 to 7. This builds the second level of the pipeline. To allow pipelining, a batch of images is sent to the FPGA. For a certain batch size, the additional latency incurred by batch processing is hidden when the first level of the pipeline is applied. After the FPGA finishes processing the batch, the results are passed back to TensorFlow and the non-CNN computations are done in parallel for all the images. Figure 19 depicts the redefined graph that we use to achieve such a pipeline. With this optimization, the data movement steps are overlapped with kernel computation and the latency for non-CNN computation (Step 8) is amortized over the whole batch. Note that such deep software+hardware pipelining techniques were also used in References [12, 17] for integrating FPGA accelerators into Spark-based applications.
Fig. 19. The overview of the Process Graph stage.
8 EXPERIMENTAL RESULTS
8.1 Experiment Setup
As mentioned before, the FlexCNN architecture is described either in Xilinx/AMD HLS [53] or TAPA HLS [15]. The target platforms are Xilinx/AMD Virtex Ultrascale+ VCU1525 and Alveo U250 and U280 Data Center Accelerator Cards. Table 6 lists the generated designs and the corresponding tools and FPGA platforms used for each design.
| Target CNN | Code | Xilinx/AMD Tool | Platform | Systolic Array | Precision |
|---|---|---|---|---|---|
| OpenPose-V2 | Vivado HLS | SDAccel 2018.3 | VCU1525 | Standard SA | float 32-bit |
| Individual Layer Tests | Vivado HLS | SDAccel 2018.3 | VCU1525 | Standard SA | float 32-bit |
| U-Net | Vivado HLS | SDAccel 2018.3 | VCU1525 | Versatile SA | float 32-bit |
| E-Net | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | float 32-bit |
| E-Net | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 16-bit |
| E-Net | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 8-bit |
| E-Net | TAPA HLS | Vitis 2021.2 | U280 | Versatile SA | fixed 8-bit |
| VGG-16 | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | float 32-bit |
| VGG-16 | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 16-bit |
| VGG-16 | TAPA HLS | Vitis 2021.2 | U250 | Versatile SA | fixed 8-bit |
Table 6. Experiments’ Setup
Observe that the second design in the table, with a standard systolic array, is used to compare the performance of a standard systolic array against the versatile systolic array on individual layers.
8.2 Hardware Optimization
The target FPGA platforms come with four DDR banks. In our implementations, we use two DDR banks, assigning feature maps and weights (including bias) to two separate DDR banks. All the architecture choices are parameterizable and can be adjusted based on the target FPGA. We found the following configurations that work best for the OpenPose-V2 application on Xilinx/AMD VCU1525: The systolic array for our standard conv module is organized as an 8 \(\times\) 8 array with a SIMD factor of 8. For the rest of the modules, we use the same SIMD factor. Table 7 shows the frequency and resource utilization under this configuration.
Table 8 shows the benefits of dynamic tiling and data layout transformation. We can see that these optimizations increase the performance by \(2.3\times\). Figure 1 depicts the performance gain of using dynamic tiling in a layer-by-layer fashion for the first 24 convolutional layers. Table 9 shows how applying dynamic tiling and dynamic data layout affects the tiling factors and effective DRAM bandwidth (BW) for the first layer of the last RBB in OpenPose-V2 compared to a design without these optimizations. The kernel size for this layer is 1 \(\times\) 1, which means it can use the optimized data layout with a burst length of \(Tn(k) \times Tw(k) \times Th(k)\), as described in Section 4.5. This data layout, along with the best tiling factor used for this layer, increases the effective DRAM BW and CTC ratio by \(2.8\times\). This results in a \(6.1\times\) performance improvement.
| Model | Precision | Frequency (MHz) | Runtime (ms) (1) | Runtime (ms) (2) |
|---|---|---|---|---|
| All Uniform | float 32-bit | 237 | 57.7 | 41.5 |
| All Dynamic | float 32-bit | 242.9 | 35.6 | 24.7 |
(1): Without applying DRAM organization for concatenation layers.
(2): With applying DRAM organization for concatenation layers.
Table 8. Performance on OpenPose-V2
We further test the DSP efficiency of our design on a given convolution layer. Of all the DSPs, 78.7% are used in the standard SA module and 11.2% in the DW Conv module. We measure DSP efficiency relative to two baselines: the total number of DSPs in the design and the number of DSPs in the modules used by that layer. All the tests are on a 256 \(\times\) 384 \(\times\) 384 input, producing 256 output channels. Table 10 summarizes the results. DSC layers require \(K^2\times\) less computation, making them communication-bound, as shown in Figure 20. The figure shows that DSC layers fall in the memory-bound region of the roofline model, since they have a lower CTC ratio. Therefore, we achieve lower computation efficiency in these layers. Additionally, it shows that the data layout optimization for the DSC with the \(1 \times 1\) kernel increases the burst length. This helps to increase the effective DRAM bandwidth, leading to a performance improvement over the \(3 \times 3\) DSC.
Fig. 20. Layers in Table 10 under the roofline model.
Table 10. Performance on Different Convolutional Layers
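The roofline reasoning above can be made concrete with a small sketch. It assumes the standard roofline bound, in which attainable performance is the smaller of the peak compute rate and the product of bandwidth and CTC ratio, and it counts each MAC engine as two operations per cycle. The bandwidth and CTC values in the usage lines are made-up placeholders, not measurements from Table 10.

```python
def peak_gflops(sa_rows, sa_cols, simd, freq_mhz):
    """Peak rate of the SA, assuming 2 operations (multiply + add) per MAC per cycle."""
    return 2 * sa_rows * sa_cols * simd * freq_mhz / 1000.0


def attainable_gflops(peak, bandwidth_gbs, ctc_flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak, bandwidth_gbs * ctc_flops_per_byte)


peak = peak_gflops(8, 8, 8, 242.9)          # 8 x 8 array, SIMD factor 8
low_ctc = attainable_gflops(peak, 10.0, 5.0)     # hypothetical memory-bound layer
high_ctc = attainable_gflops(peak, 10.0, 100.0)  # hypothetical compute-bound layer
print(round(peak, 1), low_ctc, round(high_ctc, 1))
```

A layer whose CTC ratio places it left of the ridge point is limited by the memory term, so raising the effective bandwidth (for example, through longer bursts) raises its attainable performance, whereas a compute-bound layer is unaffected.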
8.3 The Versatile Systolic Array
To compare the effectiveness of the decomposition approach and the implementation, we conducted tests on the standard SA and the versatile SA using 10 different layers with various filter sizes, T-CONV strides (\(S^\prime\)), and dilation rates (\(d\)), as shown in Table 11.
\(^{*}\) DSP efficiency is measured as the actual performance using non-zero MAC operations divided by the peak performance (GFLOP/s) of the SA.
\(^\dagger\) Ideal speedup is based on our analysis in Section 5.1 using our systolic array architectures.
Table 11. Performance of Different T-CONV and D-CONV Layers
Notice that layers with small \(N\), \(M\), or \(I_{h/w}\) have low computation-to-communication ratios, which makes them communication-bound. This explains the low DSP efficiency for these layers. In contrast, the last three layers are computation-bound, and the DSP efficiency of the T-CONV and D-CONV layers is around \(98\%\), while the DSP efficiency of the standard SA is capped at \(\frac{100}{\text{Ideal Speedup}}\%\). This matches our ideal speedup analysis in Section 5.1.
Table 12 shows the frequency and resource utilization of the versatile SA design and the standard SA design. In terms of area overhead, the versatile SA requires only about 7% more LUTs, 3% more Flip Flops, and around 3% more DSPs. For on-chip memory, the PEs utilize the BRAMs for local buffers, while the input, weight, and output buffers are implemented using URAMs. The PEs’ local buffers are larger in the versatile SA, as the decomposition approach requires \(S^{\prime 2}\times\) the buffer size for T-CONV decomposition. This explains the 24% increase in BRAM utilization. However, the standard SA needs larger weight buffers to accommodate the zeros inserted in the filters, which explains the lower URAM utilization for the versatile SA.
8.4 Software-hardware Integration Optimization
In this section, we evaluate the effect of our integration optimization on OpenPose-V2. FlexCNN processes one image in 24.7 ms, which translates to a peak performance of 40.5 FPS. However, without proper optimization, the direct integration into the TensorFlow framework achieves only 4.8 FPS, as shown in Table 13, which summarizes the impact of two-level pipelining on the overall performance. We use a batch size of 16 for the OpenPose network to enable pipelining on the FPGA, since it produces the best performance and the smoothest output when displaying the result. With two-level pipelining, we achieve up to a 5\(\times\) speedup, which leads to the final performance of 23.8 FPS.
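The relation between these figures is simple arithmetic, shown here as a short Python check. It uses only numbers quoted in this article (24.7 ms kernel time, 208.8 ms unpipelined stack time from Section 7, and the 4.8 and 23.8 FPS figures); the helper name `fps` is illustrative.

```python
def fps(latency_ms):
    """Frames per second when one frame takes latency_ms milliseconds."""
    return 1000.0 / latency_ms


kernel_fps = fps(24.7)        # FPGA kernel alone
direct_fps = fps(208.8)       # whole unpipelined integration stack
speedup = 23.8 / 4.8          # pipelined FPS over direct-integration FPS

print(round(kernel_fps, 1), round(direct_fps, 1), round(speedup, 2))
```

The pipelined result of 23.8 FPS corresponds to an effective 42 ms per frame, which remains above the 24.7 ms kernel time, so the kernel's 40.5 FPS is an upper bound rather than the achieved rate.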
8.5 Applications
In this subsection, we evaluate the performance of the three real-world CNNs we implemented on FlexCNN and compare the results with other works.
8.5.1 OpenPose-V2.
To the best of our knowledge, there is only one work [6] that has implemented a variant of OpenPose on FPGA. However, they take a different approach. They reduce the computation cost of the original network by making the weights sparse and using only two stages after the backbone network. Furthermore, they quantized the data to 16-bit fixed point and stored feature maps and weights on-chip. After these modifications, they reported neither their network’s computation cost nor their architecture’s resource utilization. Thus, we cannot compare our results to theirs directly. Instead, we have compared our results against the network implementation using TensorFlow on CPU and GPU.
The CPU is a 56-core Intel Xeon CPU E5-2680 v4 that operates at 2.40 GHz. For GPU, we use the NVIDIA Tesla V100 GPU, and it uses cuDNN [13] to run the network. To have a fair comparison of the latency of running the network on different platforms, we measure the runtime of a single image inference using OpenPose-V2 network. Table 14 summarizes the results. The runtime considers only the CNN inference time on RGB images of size 384 \(\times\) 384. For both the FPGA and GPU, the time to transfer the data from host to device and device to host is excluded from the measurement.
8.5.2 U-Net.
The U-Net CNN model is made of 51 layers. The breakdown of all the layers is shown in Table 1. The number of T-CONV layers’ operations is 2.1 Giga floating-point operations (GFLOPs), without counting the inserted zeros for T-CONV layers.
First, we compared U-Net performance with the TensorFlow implementation of the network on CPU and GPU. The CPU is a 56-core Intel Xeon CPU E5-2680 v4 that operates at 2.40 GHz. For GPU, we ran the network on NVIDIA A100-PCIE-40GB operating at 1.4 GHz. We measured the runtime of a single image inference. Table 15 summarizes the results. Similar to the OpenPose-V2 experiment, the runtime considers only the CNN inference time on RGB images, excluding the data transfer time for both the FPGA and GPU.
Second, we found two works [33, 34] that implement U-Net on FPGA. The first work used two separate accelerators—one for N-CONV and one for T-CONV layers. Although their approach gets rid of the zero MAC operations in T-CONV, it results in low performance and a low DSP efficiency compared to N-CONV, as shown in Table 16. The second work has a better overall performance and DSP efficiency, since it is using an 8-bit fixed point precision and combines DSP and ALM resources to create denser MAC units with higher performance. However, this work does not report the DSP efficiency or performance of the T-CONV and N-CONV individually.
| Measure | FlexCNN | TRETS 2018 [33] | FPL 2019 [34] |
|---|---|---|---|
| Platform | Xilinx/AMD VCU1525 | Xilinx/AMD XC7Z045 | Intel A10 660 |
| Data Type | float 32-bit | fixed 16-bit | fixed 8-bit |
| Frequency (MHz) | 234 | 200 | 200 |
| N-CONV GOPs | 9.9 | 5.6 | N/A |
| T-CONV GOPs | 2.1 | 0.3 | N/A |
| Total GOPs | 12.0 | 5.9 | 27.4 |
| N-CONV GOP/s | 206.5 | 125 | N/A |
| T-CONV GOP/s | 209.8 | 29 | N/A |
| Total GOP/s | 207.0 | 107 | 1578 |
| Peak GOP/s | 239.5 | N/A | 1638 |
| T-CONV support | Yes | Yes | Yes |
| D-CONV support | Yes | No | No |
Table 16. U-Net Evaluation against other Works
8.5.3 E-Net.
Table 2 shows the breakdown of E-Net’s layers. The actual Giga Operations (GOPs) is the number of operations without counting the inserted zeros for T-CONV or D-CONV layers. Two of the previous works [9, 28] use a concept of logical GOPs for T-CONV and D-CONV layers, which counts the redundant zero MAC operations as actual MAC operations. While we do not think it is a good measure, we considered that metric for consistency and comparison purposes. The FlexCNN architecture of E-Net is shown in Figure 8, and we created three designs with float 32-bit, fixed 16-bit, and fixed 8-bit data types. The clock frequency and resource utilization for each design are shown in Table 17. First, we compared the E-Net performance against CPU and GPU. We used the same experimental setup as the U-Net tests. The comparison results are illustrated in Table 18.
Table 17. E-Net Designs and Hardware Utilization on U250 FPGA
While we did not find any FPGA implementation of E-Net, there are three ASIC-based implementations of E-Net. The comparison results are shown in Table 19. Compared to Reference [28], our fixed 8-bit and 16-bit designs achieve lower latencies and higher frames per second (FPS), but our actual performance is slightly lower than theirs. Reference [28] reports only the performance (GOP/s) and FPS, but not the network’s number of operations (GOPs). When we calculated the number of GOPs from the given GOP/s and FPS numbers, we found their operation count to be 1.4 GOPs, which is higher than ours (1.2 GOPs). They may have included operations from the non-convolution layers, which we did not, and this would explain why we have higher FPS but lower GOP/s. Similarly, the second work [9] did not report the number of operations, nor did it report the latency or FPS of their implementation. We used our E-Net model to calculate the number of operations for a 512 \(\times\) 512 input image, which is 3.79 GOPs. Given their performance (168 GOP/s), we calculated the latency and FPS numbers (see Table 19). In terms of FPS, our three designs achieve higher rates, but we are using a smaller image size. Also, their work achieves higher performance in terms of GOP/s, but the ASIC frequency is more than \(2\times\) the frequencies of our designs. For Reference [10], we could not make a good comparison, as they report only the performance but not the input image size, the GOPs, the latency, or the FPS numbers.
| Work | FlexCNN (8 \(\times\) 9 \(\times\) 8) | FlexCNN (16 \(\times\) 9 \(\times\) 16) | FlexCNN (16 \(\times\) 9 \(\times\) 16) | ISCAS 2019 [28] | ISCAS 2020 [9] | VLSI 2020 [10] |
|---|---|---|---|---|---|---|
| Platform | FPGA | FPGA | FPGA | ASIC | ASIC | ASIC |
| Frequency (MHz) | 241 | 219 | 229 | 200 | 500 | 200 |
| Image Size | 288 \(\times\) 288 | 288 \(\times\) 288 | 288 \(\times\) 288 | 288 \(\times\) 288 | 512 \(\times\) 512 | N/A |
| Data Type (w/a)\(^{*}\) | float 32-bit | fixed 16-bit | fixed 8-bit | fixed 8-bit | fixed 16-bit | fixed 2/16-bit |
| Latency (ms) | 20.95 | 13.86 | 12.92 | 14.62 | 22.55 | N/A |
| FPS | 47.72 | 72.15 | 77.39 | 68.40 | 44.35 | N/A |
| Actual GOP/s | 57.2 | 86.5 | 92.8 | 96.0 | 168.0 | 196.2 |
| Logical GOP/s | 426.2 | 644.4 | 691.2 | 639.7 | 1,377.0 | N/A |
Table 19. E-Net Comparison with Other Works
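The derived entries discussed above follow from two identities: operation count = (GOP/s) / FPS, and latency = GOPs / (GOP/s). The following Python check applies them to the values in Table 19. The function names are illustrative, and small differences from the table are expected because the inputs are rounded.

```python
def gops_per_frame(gop_per_s, fps):
    """Operation count per image implied by a throughput and a frame rate."""
    return gop_per_s / fps


def latency_ms(gops, gop_per_s):
    """Latency of one image given its operation count and the throughput."""
    return 1000.0 * gops / gop_per_s


# Reference [28]: 96.0 GOP/s at 68.40 FPS implies about 1.4 GOPs per image.
print(round(gops_per_frame(96.0, 68.40), 1))
# FlexCNN fixed 8-bit: 92.8 GOP/s at 77.39 FPS implies about 1.2 GOPs per image.
print(round(gops_per_frame(92.8, 77.39), 1))
# Reference [9]: 3.79 GOPs at 168 GOP/s gives roughly 22.6 ms per image.
print(round(latency_ms(3.79, 168.0), 1))
```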
8.6 Comparison with Vitis AI
Vitis AI [4] is a Xilinx/AMD library for accelerating AI models on Xilinx/AMD FPGAs. The library uses optimized deep-learning processor unit (DPU) cores as an overlay, along with a software stack, to accelerate a variety of DNN models. Different DPUs are optimized for different workloads (such as CNNs, RNNs, and NLPs) and different goals, such as latency or throughput. In contrast, the FlexCNN architecture mainly targets CNNs and focuses on optimizing the latency of CNN inference. In this subsection, we compare the performance of E-Net on FlexCNN vs. Vitis AI. Xilinx/AMD reported E-Net performance on the U280 using two different DPUs, DPUCAHX8H [1] and DPUCAHX8L [2]. DPUCAHX8H is optimized for throughput, while DPUCAHX8L is optimized for latency. Both DPUs use fixed-point 8-bit formats. For a fair comparison, we used FlexCNN to generate an accelerator on the U280 with the same 512 \(\times\) \(1,\!024\) input image size.
Table 20 shows the resource utilization of Vitis AI DPUs and our FlexCNN-generated design on U280. First, note that Vitis AI deploys multiple DPU cores on the FPGA (3 for DPUCAHX8H and 2 for DPUCAHX8L). The DPUCAHX8H core can be configured to have 3, 4, or 5 processing engines (PENs), and the DPUCAHX8L core is configured to have 1 PEN. Thus, the DPUCAHX8H design has a total of 14 PENs, and the DPUCAHX8L design has 2 PENs. Each PEN can process a separate image batch, allowing it to process multiple images in parallel. FlexCNN, however, is optimized for latency with a single VSA. Thus, it processes multiple image batches sequentially. We noticed that for such a low-bit (fixed 8-bit) data format, the LUT and FF dominate the resource utilization in FlexCNN, as they are used along with the DSPs to implement the compute units of the VSA. Vitis AI DPUs, however, are designed in RTL and take more advantage of the DSPs to implement the arithmetic logic. In terms of on-chip memory utilization, FlexCNN consumes less URAM than both DPUs and slightly more BRAMs than the DPUCAHX8H design. In terms of frequency, FlexCNN’s design achieves the highest working frequency of 256 MHz.
Table 20. Hardware Utilization of FlexCNN Accelerator and Vitis AI DPUs on U280
Table 21 shows the performance of E-Net on Vitis AI DPUs vs. FlexCNN’s design in terms of throughput and latency. First, we noticed that the E-Net model used in the Vitis AI experiments has a slightly higher operation count (GOPs). After investigation, we found that, unlike the original E-Net model in Reference [38], each pair of asymmetric convolution layers (Figure 5) is implemented as a single convolution layer with 5 \(\times\) 5 filters in the Vitis AI E-Net model, which explains the slight increase in GOP count. The DPUCAHX8H design achieves the highest throughput of 1,057.8 GOP/s, delivering 123 frames/s. However, such a high throughput is due to using a batch size of 14. The inference latency of E-Net on DPUCAHX8H is not reported in Reference [3], but we can calculate a lower and an upper bound for latency. The lower bound is calculated as \(\frac{1}{FPS}\) (8.1 ms), meaning that the 14 PENs run sequentially, which is very unlikely, because this defeats the purpose of deploying 3 cores with 14 PENs. The upper bound is calculated as \(\frac{Batch\ Size}{FPS}\) (113.8 ms), meaning that the 14 PENs run in parallel, which is more likely. Thus, it is more likely that FlexCNN’s design has a comparable or better inference latency than the DPUCAHX8H. The same analysis applies to VGG-16 in Table 24. Compared to the DPUCAHX8L design, FlexCNN delivers \(2.7\times\) faster inference. Moreover, while the DPUCAHX8L design achieves lower latencies than DPUCAHX8H for various CNNs [3] (see Table 24 for the VGG-16 results), it surprisingly gets the slowest inference for E-Net. FlexCNN’s design achieves both higher throughput and lower latency than the DPUCAHX8L design. Finally, in terms of performance density, DPUCAHX8H achieves the highest GOP/s/kLUT and GOP/s/DSP, while FlexCNN, written in HLS, achieves higher GOP/s/DSP than DPUCAHX8L.
| Design | Model Complexity (GOPs) | Batch Size | Frames/s | Latency (ms) | Throughput (GOP/s) | GOP/s/kLUT | GOP/s/DSP |
|---|---|---|---|---|---|---|---|
| DPUCAHX8H | 8.60 | 14 | 123.0 | 8.1–113.8 | 1,057.8 | 1.5606 | 0.1417 |
| DPUCAHX8L | 8.60 | 2 | 8.1 | 175.4 | 69.7 | 0.1637 | 0.0142 |
| FlexCNN | 7.58 | 1 | 15.1 | 66.0 | 114.6 | 0.1478 | 0.1166 |
Table 21. E-Net Performance Comparison with Vitis AI [3] (Image Size: 512 \(\times\) \(1,\!024\) )
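The latency bounds and throughput figures for the DPU designs can be reproduced from the frame rate, batch size, and model complexity in Table 21. The short Python check below is a sketch of that arithmetic; the function names are illustrative.

```python
def latency_bounds_ms(fps, batch_size):
    """Lower bound 1/FPS (engines run back to back); upper bound batch/FPS (in parallel)."""
    return 1000.0 / fps, 1000.0 * batch_size / fps


def throughput_gops(fps, gops_per_frame):
    """Throughput in GOP/s from frame rate and per-image operation count."""
    return fps * gops_per_frame


low, high = latency_bounds_ms(123.0, 14)       # DPUCAHX8H
print(round(low, 1), round(high, 1))
print(round(throughput_gops(123.0, 8.60), 1))  # DPUCAHX8H
print(round(throughput_gops(8.1, 8.60), 1))    # DPUCAHX8L
print(round(175.4 / 66.0, 1))                  # DPUCAHX8L latency over FlexCNN latency
```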
8.7 Comparison with Other Frameworks
In this subsection, we compare FlexCNN with other FPGA-based DNN frameworks in terms of the scope of the frameworks and the performance of their respective architectures.
8.7.1 Scope of the Framework.
The scope and families of DNNs a framework can support depend on the architecture it employs. Most previous works, such as DNNWeaver [42], Angel-Eye [23, 24], Caffeine [52, 60], fpgaConvNet [50], DNNBuilder [62], 2D & 3D CNN [43], Cloud-DNN [11], DNNVM [54], DNNExplorer [63], and 3D-VNPU [18], focused on designing architectures that support normal convolution, fully connected (FC), pooling, and activation and batch normalization (Act & BN) layers. These layers are sufficient for simple sequential CNNs such as AlexNet and VGG-16. Beyond these common layers, however, many CNNs contain additional layer types such as depth-wise convolution, dilated convolution, transposed convolution, upsampling, and bilinear upsampling layers. Therefore, these previous frameworks cannot support complex CNNs with various layer types and complex branching\(^\dagger\) graph topologies such as OpenPose, U-Net, and E-Net. To accelerate a wide range of real-world CNN applications, FlexCNN supports all the aforementioned layer types except fully connected layers, which have become less common in recent models; many popular models such as MobileNet-V2 (used in OpenPose) are employed as a backbone for feature extraction without the FC layers. While FlexCNN and these previous works target CNNs, the FP-DNN [22] framework features an architecture that supports recurrent neural networks (RNNs) in addition to CNNs. We will explore such a direction in our next work. Finally, Vitis AI [4] is a comprehensive AI compiler supporting CNNs, RNNs, and natural language processing models (NLPs). Table 22 summarizes the scope of all these frameworks.
| Framework | DNNs | Model Topology Branching\(^\dagger\)? | N-CONV | T-CONV | D-CONV | DW-CONV | FC | Pool | Act & BN | Upsample | Add | Concat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DnnWeaver [42] | CNNs | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Angel-Eye [24] | CNNs | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| DAC’17 [52] | CNNs | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| FP-DNN [22] | CNNs, RNNs | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ |
| Caffeine [60] | CNNs | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| fpgaConvNet [50] | CNNs | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ |
| DNNBuilder [62] | CNNs | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| 2D & 3D CNN [43] | CNNs | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Cloud-DNN [11] | CNNs | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ |
| DNNVM [54] | CNNs | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ |
| DNNExplorer [63] | CNNs | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| 3D-VNPU [18] | CNNs | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Vitis AI [4] | CNNs, RNNs, NLPs | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| FlexCNN (ours) | CNNs | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
Table 22. Scope of the Frameworks
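To make the coverage argument concrete, the following minimal sketch encodes a few rows of Table 22 and reports which required layer types a framework lacks. The function name `missing_layers` and the example requirement set are hypothetical and only illustrate the check; the support flags themselves are copied from Table 22.

```python
# Layer-support columns of Table 22, in table order.
LAYERS = ["N-CONV", "T-CONV", "D-CONV", "DW-CONV", "FC",
          "Pool", "Act & BN", "Upsample", "Add", "Concat"]

# A subset of Table 22 rows: (supports branching, per-layer support flags).
# 1 = supported, 0 = not supported.
TABLE_22 = {
    "DnnWeaver":   (False, [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]),
    "fpgaConvNet": (True,  [1, 0, 0, 0, 0, 1, 1, 0, 1, 1]),
    "Cloud-DNN":   (True,  [1, 0, 0, 0, 1, 1, 1, 0, 1, 0]),
    "Vitis AI":    (True,  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
    "FlexCNN":     (True,  [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]),
}


def missing_layers(framework, required_layers, needs_branching=False):
    """Return the requirements that `framework` does not cover (hypothetical helper)."""
    branching, flags = TABLE_22[framework]
    supported = {name for name, flag in zip(LAYERS, flags) if flag}
    missing = [layer for layer in required_layers if layer not in supported]
    if needs_branching and not branching:
        missing.append("branching")
    return missing


# Illustrative requirement set for a branching, encoder-decoder style model.
# This is an assumption for the example, not a layer inventory of a real network.
required = ["N-CONV", "T-CONV", "Pool", "Act & BN", "Concat"]

for name in TABLE_22:
    print(name, missing_layers(name, required, needs_branching=True))
```

An empty list means the framework covers the requirement set; under this example set, only the frameworks that support transposed convolution together with concatenation and branching pass the check.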
Aside from the type of DNNs, an important aspect of the scope of a framework/architecture is the model size it can handle. Some frameworks, such as DNNBuilder [62], create a dedicated hardware module for each CNN layer. These modules consume most of the FPGA fabric resources, which limits such frameworks to small CNNs with a few layers. In contrast, FlexCNN places no limit on the model size, as it stores the weights off-chip and time-shares the same hardware modules.
8.7.2 Performance.
We compare the performance of FlexCNN’s generated accelerators with multiple frameworks on the well-known VGG-16 [46] CNN, since it is used by all these previous frameworks. Similar to some other works, we implemented only the feature extraction part of VGG-16 (the convolution layers) and not the classification part (the last three FC layers), since FlexCNN does not yet have a dedicated FC module; we will consider adding one in future work. For this comparison, we created three designs with various bit widths and data types, detailed in Table 23.
Table 23. VGG-16 Designs and Hardware Utilization on U250 FPGA
We surveyed many previous frameworks targeting CNNs and summarize the results in Table 24. Because we did not implement the FC layers, and to keep the comparison fair, each metric in Table 24 is reported in the format m1 (m2), where m1 is the feature extraction metric (convolution layers, with 30.69 GOPs making up 99.6% of VGG-16 operations) and m2 is the feature extraction + classification metric (convolution + FC layers, with 30.81 GOPs). In general, the FlexCNN designs achieve performance better than or comparable to the other frameworks. In terms of throughput, DNNBuilder delivers the highest throughput, followed by DNNVM, while Cloud-DNN and DNNExplorer achieve throughput comparable to FlexCNN. In terms of feature extraction latency, FlexCNN’s 8-bit design achieves the lowest latency of 13.18 ms, followed by DNNVM; for feature extraction and classification, DNNBuilder achieves the lowest latency of 15.39 ms. In terms of performance density, DNNVM has the highest GOP/s/kLUT, followed by Vitis AI DPUs and DNNBuilder, all of which are implemented and optimized in RTL. FlexCNN’s fixed-point designs have GOP/s/kLUT comparable to Caffeine, 2D & 3D CNN, and Cloud-DNN, all of which are implemented in Xilinx/AMD HLS. As for DSP performance density, FlexCNN’s 8-bit design delivers the highest GOP/s/DSP, at 2.179. Finally, an important metric is accelerator efficiency, measured as the ratio of the achieved performance to the peak performance of the accelerator. DNNBuilder achieves the highest accelerator efficiency, followed by DNNExplorer, since they exploit layer-level parallelism by deploying an accelerator for each layer (or group of layers) of a CNN model. FlexCNN, in contrast, achieves between 82% and 96% accelerator efficiency (higher than DAC’17, Caffeine, DNNVM, and Vitis AI DPUs) while using a single systolic array, thanks to dynamic tiling and the other hardware optimizations it employs.
| Framework | Platform | Precision\(^\dagger\) | Frequency (MHz) | Batch Size | Throughput (GOP/s) | Latency (ms) | Performance Density (GOP/s/kLUT) | Performance Density (GOP/s/DSP) | Actual/Peak Performance |
|---|---|---|---|---|---|---|---|---|---|
| DnnWeaver [42] | Zynq Z020 | FX(16,16) | 150 | 1 | 31.35 (31.38) | - | 0.896 (0.897) | 0.224 (0.224) | - |
| | Stratix V SGSD5 | FX(16,16) | 200 | 1 | 157.39 (157.51) | - | 1.040 (1.041) | 0.265 (0.265) | - |
| | Arria 10 GX115 | FX(16,16) | 200 | 1 | 390.02 (361.55) | - | 1.079 (1.000) | 0.290 (0.269) | - |
| Angel-Eye [24] | Zynq Z045 | FX(16,16) | 150 | 1 | 187.80 (136.97) | 163.42 (224.60) | 1.028 (0.750) | 0.241 (0.176) | - |
| DAC’17 [52] | Arria 10 GT115 | FX(16,8) | 232 | 1 | - (1,171.30) | - (26.85) | - (3.742) | - (0.781) | 89.11% (-) |
| Caffeine [60] | UltraScale KU060 | FX(16,16) | 200 | 1 | 310.00 (266.00) | - (101.15) | 3.100 (2.660) | 0.293 (0.251) | 84.93% (72.88%) |
| | Virtex 690T | FX(16,16) | 150 | 1 | 488.00 (354.00) | - (65.13) | 1.627 (1.180) | 0.172 (0.125) | 76.72% (55.66%) |
| fpgaConvNet [50] | Zynq Z045 | FX(16,16) | 125 | 1 | 155.81 (-) | 249.50 (-) | - | 0.182 (-) | - |
| DNNBuilder [62] | UltraScale KU115 | FX(16,16) | 235 | 1 | - (2,011.00) | - (15.39) | - (7.799) | - (0.466) | - (99.1%) |
| | UltraScale KU115 | FX(8,8) | 235 | 2 | - (4,022.00) | - (15.39) | - (15.597) | - (0.931) | - (99.1%) |
| 2D & 3D CNN [43] | Virtex 690T | FX(16,16) | 150 | 1 | - (570.00) | - (54.06) | - (3.257) | - (0.414) | - |
| | UltraScale VU440 | FX(16,16) | 200 | 1 | - (821.00) | - (37.53) | - (4.829) | - (0.597) | - |
| Cloud-DNN [11] | UltraScale VU9P | FX(16,16) | 125 | 1 | - (1,068.37) | - (28.96) | - (1.397) | - (0.200) | - |
| | UltraScale VU9P | FX(16,16) | 214 | 1 | - (1,828.61) | - (16.92) | - (2.645) | - (0.342) | - |
| DNNVM [54] | UltraScale ZU2 | FX(8,8) | 330 | 1 | 334 (-) | 91.90 (-) | 15.215 (-) | 1.722 (-) | 87.9% (-) |
| | UltraScale ZU9 | FX(8,8) | 330 | 3 | 2,820 (-) | 17.24 (-) | 23.94 (-) | 1.829 (-) | 69.6% (-) |
| DNNExplorer [63] | UltraScale KU115 | FX(16,16) | 200 | 1 | 1,702.30 (-) | 18.05 (-) | - (-) | 0.363 (-) | 95.8% (-) |
| 3D-VNPU [18] | UltraScale ZCU102 | FX(8,8) | 200 | 1 | 1,150 (-) | 26.69 (-) | - (-) | 1.123 (-) | - |
| DPUCAHX8H [1] | Alveo U280 | FX(8,8) | 150 | 14 | - (5,812.07) | - (5.30–74.23) | - (8.575) | - (0.779) | - (67.6%) |
| DPUCAHX8L [2] | Alveo U280 | FX(8,8) | 250 | 2 | - (3,272.75) | - (18.83) | - (7.688) | - (0.409) | - (40.9%) |
| FlexCNN (ours) | Alveo U250 | FL(32,32) | 266 | 1 | 458.6 (-) | 66.92 (-) | 0.632 (-) | 0.082 (-) | 96.2% (-) |
| | Alveo U250 | FX(16,16) | 241 | 1 | 1,543.4 (-) | 19.89 (-) | 2.262 (-) | 0.331 (-) | 89.3% (-) |
| | Alveo U250 | FX(8,8) | 198 | 1 | 2,329.1 (-) | 13.18 (-) | 2.256 (-) | 2.179 (-) | 82.1% (-) |
Table 24. VGG-16 Performance Comparison with Other Frameworks (Image Size: 224 \(\times\) 224)
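The relationship between the metrics in Table 24 can be checked with a short calculation. The sketch below assumes that, for the batch-size-1 FlexCNN rows, throughput is simply the convolution workload (30.69 GOPs) divided by the reported latency, and that the peak performance implied by a row is its throughput divided by its actual/peak ratio. The function names are illustrative, and small deviations from the table are expected because the reported latencies are rounded.

```python
# Convolution-layer workload of VGG-16 as stated in the text (in GOPs).
CONV_GOPS = 30.69


def throughput_gops(ops_gop, latency_ms):
    """Throughput in GOP/s for one image that takes `latency_ms` to process."""
    return ops_gop / (latency_ms / 1000.0)


def implied_peak_gops(throughput, efficiency):
    """Peak GOP/s implied by an achieved throughput and an actual/peak ratio."""
    return throughput / efficiency


# FlexCNN rows of Table 24: precision -> (latency in ms, actual/peak ratio).
flexcnn_rows = {
    "FL(32,32)": (66.92, 0.962),
    "FX(16,16)": (19.89, 0.893),
    "FX(8,8)":   (13.18, 0.821),
}

for precision, (latency_ms, eff) in flexcnn_rows.items():
    tput = throughput_gops(CONV_GOPS, latency_ms)
    peak = implied_peak_gops(tput, eff)
    print(f"{precision}: {tput:.1f} GOP/s achieved, about {peak:.0f} GOP/s implied peak")
```

The same two functions can be applied to other batch-size-1 rows that report both latency and efficiency, although frameworks may define latency and batching differently, so the result should be read as a consistency check rather than a reproduction of their numbers.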
9 CONCLUSION
In this work, we presented FlexCNN, an end-to-end framework for accelerating CNNs on FPGAs. The framework targets three challenges of accelerating modern CNNs. The first challenge stems from the disparity among layers of the same type, which results in different computation and communication requirements. As a solution, we proposed several architectural techniques: dynamic tiling, layer fusion and layer parallelization, and data layout optimizations. The second challenge arises from the variety of convolution types, such as transposed and dilated convolutions. If not processed efficiently, these two layer types can severely underutilize the FPGA’s computation resources due to the large number of redundant zeros. To address this, we proposed a versatile systolic array that handles all these layer types efficiently with a small area overhead compared to a standard SA. The third challenge is caused by the software overheads in the end-to-end runtime of CNN inference. To mitigate this issue, we proposed a software-hardware pipelining technique that overlaps those overheads with the hardware kernel execution. Finally, we presented our automated compilation flow, which takes a CNN model in ONNX format, maps it to the FlexCNN architecture, finds the best hardware parameters and tiling factors using a DSE, and generates accelerators in either Xilinx/AMD HLS or TAPA HLS.
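The scale of the redundant zeros mentioned above can be estimated with a small calculation. The sketch below assumes the common naive formulation in which a transposed convolution is run as a normal convolution over a zero-inserted input and a dilated convolution is run with a zero-expanded filter; padding is ignored, and the function names are illustrative. It is not a model of the versatile SA itself, only of the work a standard SA would waste.

```python
def transposed_conv_nonzero_fraction(height, width, stride):
    """Fraction of non-zero inputs after inserting (stride - 1) zeros between pixels."""
    up_h = (height - 1) * stride + 1
    up_w = (width - 1) * stride + 1
    return (height * width) / (up_h * up_w)


def dilated_conv_nonzero_fraction(kernel, dilation):
    """Fraction of non-zero weights in a kernel x kernel filter expanded by `dilation`."""
    effective = (kernel - 1) * dilation + 1
    return kernel ** 2 / effective ** 2


# Example settings (assumed for illustration): a 32 x 32 input and a 3 x 3 filter.
for stride in (1, 2, 4):
    frac = transposed_conv_nonzero_fraction(32, 32, stride)
    print(f"T-CONV stride {stride}: {frac:.3f} useful, bound on speedup {1 / frac:.2f}x")

for dilation in (1, 2, 4):
    frac = dilated_conv_nonzero_fraction(3, dilation)
    print(f"D-CONV dilation {dilation}: {frac:.3f} useful, bound on speedup {1 / frac:.2f}x")
```

The reciprocal of each fraction is an upper bound on the speedup available from skipping every zero operand under these assumptions; an actual design may achieve less once padding, tiling, and control overheads are included.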
Footnotes
1 Open Neural Network Exchange.
2 Jason Cong has a financial interest in AMD.
4 Asymmetric convolution layers are the same as N-CONV layers but use non-square filter sizes, like 1 \(\times\) 5 filters.
5 Based on TensorFlow “same” padding.
6 Processing engines are abbreviated as (PENs) so as not to be confused with the systolic array processing elements (PEs).
References
- [1] DPUCAHX8H Resource Utilization. (n.d.). Retrieved from https://docs.xilinx.com/r/en-US/pg367-dpucahx8h/Resource-Utilization.
- [2] DPUCAHX8L Resource Utilization. (n.d.). Retrieved from https://docs.xilinx.com/r/en-US/pg366-dpucahx8l/Resource-Utilization.
- [3] U280 Performance with 14E300 MHz DPUCAHX8H. (n.d.). Retrieved from https://docs.xilinx.com/r/1.4.1-English/ug1354-xilinx-ai-sdk/Alveo-U280-Data-Accelerator-Card.
- [4] Vitis AI. (n.d.). Retrieved from https://www.xilinx.com/products/design-tools/vitis/vitis-ai.html.
- [5] 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 265–283.
- [6] 2018. An FPGA realization of OpenPose based on a sparse weight convolutional neural network. In International Conference on Field-Programmable Technology (FPT’18). IEEE, 310–313.
- [7] 2018. A CNN accelerator on FPGA using depthwise separable convolution. IEEE Trans. Circ. Syst. II: Express Briefs 65, 10 (2018), 1415–1419.
- [8] 2017. Realtime multi-person 2D pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition. 7291–7299.
- [9] 2020. Efficient accelerator for dilated and transposed convolution with decomposition. In IEEE International Symposium on Circuits and Systems (ISCAS’20). IEEE, 1–5.
- [10] 2020. An efficient accelerator for multiple convolutions from the sparsity perspective. IEEE Trans. Very Large Scale Integ. Syst. 28, 6 (2020), 1540–1544.
- [11] 2019. Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 73–82.
- [12] 2016. When Spark meets FPGAs: A case study for next-generation DNA sequencing acceleration. In 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’16).
- [13] 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
- [14] 2018. SODA: Stencil with optimized dataflow architecture. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). IEEE, 1–8.
- [15] 2021. Extending high-level synthesis for task-parallel programs. In IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’21). IEEE, 204–213.
- [16] 2018. PolySA: Polyhedral-based systolic array auto-compilation. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). IEEE, 1–8.
- [17] 2018. From JVM to FPGA: Bridging abstraction hierarchy via optimized deep pipelining. In 10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’18).
- [18] 2021. 3D-VNPU: A flexible accelerator for 2D/3D CNNs on FPGA. In IEEE 29th Annual International Symposium on Field-programmable Custom Computing Machines (FCCM’21). IEEE, 181–185.
- [19] 2020. Exploring efficient acceleration architecture for winograd-transformed transposed convolution of GANs on FPGAs. Electronics 9, 2 (2020), 286.
- [20] 2016. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision. Springer, 391–407.
- [21] 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285 (2016).
- [22] 2017. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’17). IEEE, 152–159.
- [23] 2016. Angel-Eye: A complete design flow for mapping cnn onto customized hardware. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI’16). IEEE, 24–29.
- [24] 2017. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 37, 1 (2017), 35–47.
- [25] 2021. AutoBridge: Coupling coarse-grained floorplanning and pipelining for high-frequency HLS design on multi-die FPGAs. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 81–92.
- [26] 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
- [27] 2018. AI benchmark: Running deep neural networks on android smartphones. In European Conference on Computer Vision (ECCV’18). 0–0.
- [28] 2019. DT-CNN: Dilated and transposed convolution neural network accelerator for real-time image segmentation on mobile devices. In IEEE International Symposium on Circuits and Systems (ISCAS’19). IEEE, 1–5.
- [29] 2018. tf-pose-estimation. Retrieved from https://github.com/ildoonet/tf-pose-estimation.
- [30] 2017. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning. PMLR, 1857–1865.
- [31] 2016. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In 26th International Conference on Field Programmable Logic and Applications (FPL’16). IEEE, 1–9.
- [32] 2018. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In IEEE Conference on Computer Vision and Pattern Recognition. 1091–1100.
- [33] 2018. Optimizing CNN-based segmentation with deeply customized convolutional and deconvolutional architectures on FPGA. ACM Trans. Reconfig. Technol. Syst. 11, 3 (2018), 1–22.
- [34] 2019. Towards an efficient accelerator for DNN-based remote sensing image segmentation on FPGAs. In 29th International Conference on Field Programmable Logic and Applications (FPL’19). IEEE, 187–193.
- [35] 2019. USCA: A unified systolic convolution array architecture for accelerating sparse neural network. In IEEE International Symposium on Circuits and Systems (ISCAS’19). IEEE, 1–5.
- [36] 2017. GPflow: A Gaussian process library using TensorFlow. J. Mach. Learn. Res. 18, 1 (2017), 1299–1304.
- [37] 2018. LeFlow: Enabling flexible FPGA high-level synthesis of TensorFlow deep neural networks. In 5th International Workshop on FPGAs for Software Programmers. VDE, 1–8.
- [38] 2016. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016).
- [39] 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
- [40] 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention. Springer, 234–241.
- [41] 2018. MobileNetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
- [42] 2016. From high-level deep neural models to FPGAs. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, 1–12.
- [43] 2018. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 97–106.
- [44] 2017. Maximizing CNN accelerator efficiency through resource partitioning. In ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA’17). IEEE, 535–547.
- [45] 2014. Rigid-motion scattering for image classification. École Normale Supérieure, Département d’Informatique, Ph.D. Dissertation.
- [46] 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- [47] 2020. End-to-end optimization of deep learning applications. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 133–139.
- [48] 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. ACM, 16–25.
- [49] 2017. ArtGAN: Artwork synthesis with conditional categorical GANs. In IEEE International Conference on Image Processing (ICIP’17). IEEE, 3760–3764.
- [50] 2018. fpgaConvNet: Mapping regular and irregular convolutional neural networks on FPGAs. IEEE Trans. Neural Netw. Learn. Syst. 30, 2 (2018), 326–342.
- [51] 2018. TGPA: Tile-grained pipeline architecture for low latency CNN inference. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). IEEE, 1–8.
- [52] 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In 54th Annual Design Automation Conference. ACM, 29.
- [53] 2018. Vivado design suite user guide - high-level synthesis (UG902). https://docs.xilinx.com/v/u/2018.2-English/ug902-vivado-high-level-synthesis.
- [54] 2019. DNNVM: End-to-end compiler leveraging heterogeneous optimizations on FPGA-based CNN accelerators. IEEE Trans. Comput.-Aid. Des. Integ. Circ. Syst. 39, 10 (2019), 2668–2681.
- [55] 2018. DNN dataflow choice is overrated. arXiv preprint arXiv:1809.04070 (2018).
- [56] 2018. FlexiGAN: An end-to-end solution for FPGA acceleration of generative adversarial networks. In IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’18). IEEE, 65–72.
- [57] 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
- [58] 2020. Uni-OPU: An FPGA-based uniform accelerator for convolutional and transposed convolutional networks. IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 28, 7 (2020), 1545–1556.
- [59] 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In ACM/SIGDA International Symposium on Field-programmable Gate Arrays. ACM, 161–170.
- [60] 2018. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Trans. Comput.-Aid. Des. Integ. Circ. Syst. 38, 11 (2018), 2072–2085.
- [61] 2021. FPGA implementation for CNN-based optical remote sensing object detection. Electronics 10, 3 (2021), 282.
- [62] 2018. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs. In International Conference on Computer-aided Design. ACM, 56.
- [63] 2020. DNNExplorer: A framework for modeling and exploring a novel paradigm of FPGA-based DNN accelerator. In 39th International Conference on Computer-aided Design. 1–9.