Abstract
With the proliferation of low-cost sensors and the Internet of Things, the rate of data production far exceeds the compute and storage capabilities of today’s infrastructure. Much of this data takes the form of time series, and in response, there has been increasing interest in the creation of time series archives over the past decade, along with the development and deployment of novel analysis methods to process the data. The general strategy has been to apply a plurality of similarity search mechanisms to various subsets and subsequences of time series data to identify repeated patterns and anomalies; however, the computational demands of these approaches render them incompatible with today’s power-constrained embedded CPUs.
To address this challenge, we present FA-LAMP, an FPGA-accelerated implementation of the Learned Approximate Matrix Profile (LAMP) algorithm, which predicts the correlation between streaming data sampled in real-time and a representative time series dataset used for training. FA-LAMP lends itself to real-time time series analysis problems such as classification. We present the implementation of FA-LAMP on both edge- and cloud-based prototypes. On the edge devices, FA-LAMP integrates accelerated computation as close as possible to IoT sensors, thereby eliminating the need to transmit and store data in the cloud for subsequent analysis. On the cloud-based accelerators, FA-LAMP can execute multiple LAMP models on the same board, allowing simultaneous processing of incoming data from multiple data sources across a network.
LAMP employs a Convolutional Neural Network (CNN) for prediction. This work investigates the challenges and limitations of deploying CNNs on FPGAs using the Xilinx Deep Learning Processor Unit (DPU) and the Vitis AI development environment. We expose several technical limitations of the DPU, while providing a mechanism to overcome them by attaching custom IP block accelerators to the architecture. We evaluate FA-LAMP using a low-cost Xilinx Ultra96-V2 FPGA as well as a cloud-based Xilinx Alveo U280 accelerator card and measure their performance against a prototypical LAMP deployment running on a Raspberry Pi 3, an Edge TPU, a GPU, a desktop CPU, and a server-class CPU. In the edge scenario, the Ultra96-V2 FPGA improved performance and energy consumption compared to the Raspberry Pi; in the cloud scenario, the server CPU and GPU outperformed the Alveo U280 accelerator card, while the desktop CPU achieved comparable performance; however, the Alveo card offered an order of magnitude lower energy consumption compared to the other four platforms. Our implementation is publicly available at https://github.com/aminiok1/lamp-alveo.
1 INTRODUCTION
The proliferation of IoT sensors and the volume of data that they generate creates unique challenges in edge computing [24]. One motivating application, among many, is real-time seismic event prediction, which can inform hazard response strategies and enhance early-warning systems [3, 23, 33]. In this case, the relevant question is whether or not the most recent seismic measurements strongly correlate to the relatively short window of time leading up to a previously observed seismic event. Such a system could benefit from increasing the throughput of the near-sensor raw data processing, and acceleration using an FPGA represents one potential avenue to do so.
This article describes an FPGA-based accelerator for a streaming time series prediction scheme called the Learned Approximate Matrix Profile (LAMP) [54]. Given the most recent window of data points, LAMP uses a Convolutional Neural Network (CNN) to predict whether or not a similarly correlated pattern occurred in the time series used to train the model. Exact methods to compute these correlations are impractical due to the requirement that the streaming time series be archived, and the fact that computing the correlations entails execution of an \(O(n^2)\) algorithm on a time series of ever-increasing length [53]. It is certainly more practical to perform inference on a moderately sized CNN; nonetheless, the overhead of CNN inference remains a computational bottleneck that limits the achievable sampling rate. Embedded CPU-based solutions are state-of-the-art, but higher performance and lower energy consumption could be achieved through FPGA acceleration.
We call our approach FPGA-Accelerated LAMP, or FA-LAMP, for short. We implemented our design on both edge- and cloud-based accelerators. We compiled the LAMP model to run on a Xilinx Deep Learning Processing Unit (DPU) using the Vitis AI development environment and executed it on a Xilinx Zynq UltraScale+ MPSoC edge device as well as a Xilinx Alveo U280 cloud-based accelerator card. Several layers of the CNN were not compatible with the DPU; to complete the system, we implemented these layers as custom hardware IP blocks. One challenge involved the output layer, which computes a sigmoid activation function; we considered two approximations and evaluated them in terms of accuracy, performance (latency and throughput), resource utilization, and energy consumption on three time series datasets from the domains of seismology, entomology, and poultry farming. Our highest-performing FA-LAMP system configuration on the Zynq device achieved a throughput of 453.5 GOPS, with a 10.7\(\times\) faster inference rate and a 15.8\(\times\) improvement in energy consumption compared to running LAMP on a Raspberry Pi. Our highest-performing design on the Alveo U280 accelerator card achieved a throughput of 5.53 TOPS and demonstrated a 12.3% higher inference rate in comparison to a high-end CPU, while consuming one order of magnitude less energy. Using a dataset obtained from the entomology domain, we show how FA-LAMP can be combined with a post-processing classifier to better understand insect feeding behavior. We also demonstrate how the DPU on the Alveo U280 accelerator card can be connected to an Ethernet module to process incoming network data while bypassing the host CPU; this capability allows FA-LAMP to process streaming data coming from external sources across the network.
This article is an extension of our prior work [18], which presented results for the edge-based FA-LAMP implementation on the Xilinx Zynq UltraScale+ MPSoC. This article makes the following contributions as extensions: (1) We extend the evaluation of edge-class devices to include a comparison with a Google Edge TPU. (2) We deploy and evaluate FA-LAMP on a Xilinx Alveo U280 accelerator card, representative of a cloud-based deployment, and we compare FA-LAMP’s performance and energy consumption to desktop and server-class CPUs as well as a GPU. (3) We integrate the DPU with a 100 G Ethernet module on the Alveo card and describe steps taken to optimize throughput accounting for both computational and network performance factors. As a methodological difference, we employ quantization-aware training for our edge scenario, whereas our prior work [18] trained the FA-LAMP CNN in a quantization-oblivious manner.
2 RELATED WORK
We classify previous FPGA-based Deep Neural Network (DNN) studies along three axes: (a) techniques to optimize accelerator design from the perspective of computing engine or memory system; (b) user-accessible frameworks that deploy DNNs on FPGAs; and (c) overlays for DNN acceleration. With respect to these axes, our work: (1) leverages the Xilinx DPU in conjunction with a custom HLS kernel to enable efficient whole-network acceleration on-chip; (2) evaluates the impact of different DPU configurations on throughput, latency, and resource utilization on both edge- and cloud-scale FPGAs; (3) analyzes inherent tradeoffs between different approximate implementations of the sigmoid activation function; and (4) co-optimizes computational and network performance by integrating the Xilinx DPU with a 100 Gigabit Ethernet module.
2.1 Optimizing Accelerator Design
Zhang et al. proposed a novel CNN accelerator architecture that performs loop tiling and transformation to explore the design space and balance computation and memory bandwidth [50]. Another recent accelerator architecture [44] implements a large-scale matrix multiplication algorithm that statically allocates constant weights to physical multipliers, allowing the design to operate at a near-peak FPGA clock rate. A similar, yet effective, strategy for FPGA-based edge acceleration is to pack parameter memories into groups that optimize BRAM usage, enabling the accelerator to be synthesized onto a smaller FPGA while maintaining throughput compared to a larger device [26].
Colangelo et al. extended Intel’s FPGA Deep Learning Acceleration (DLA) Suite [4] to accelerate networks with 8-bit and sub 8-bit activations and weights [8]. Similar techniques achieve high throughput in FPGA-based CNN inference by either quantizing the model’s weights or training the model with lower bit precision [28, 30, 47].
We take inspiration from these studies in implementing our handcrafted kernels. We employ loop tiling [50], data reuse [7, 15, 16], and quantization [8] to improve their efficiency.
2.2 Automated Frameworks for DNN Compilation
A number of domain-specific DNN compilers translate a high-level description of a model into synthesizable RTL coupled with an execution schedule. They facilitate DNN deployment on FPGAs but limit opportunities for further optimization, as the generated HLS/RTL code is hard to interpret.
HeteroCL [21] is a Python-based domain-specific language (DSL) extended from TVM [6] that maps high-level specifications of designs to hardware implementations, targeting systolic arrays and stencil architectures. It has been reported that deeply pipelined kernels designed in this framework result in routing congestion in large FPGAs [19]. DNNWEAVER [34] generates target-specific Verilog code for FPGA-based DNN accelerators using hand-optimized design templates; however, the framework can only handle conventional CNNs and does not support quantization. Other automatic DNN generation frameworks include: HLS4ML [11], which targets low-power applications; fpgaConvNet [40], which achieved the best throughput per DSP unit in a recent survey [41]; VTA [6], which uses a TVM-based compiler stack; and FINN [39], which is developed and maintained by Xilinx.
2.3 Xilinx DPU
Recently, Xilinx introduced the Deep Learning Processor Unit (DPU), a programmable engine optimized for CNNs [45]. The DPU supports a variety of deep learning models, including, but not limited to, ResNet [51], VGG [35], and YOLO [31]. Programmable parameters allow the FPGA designer to control the degree of parallelism and resource utilization of the DPU IP, as we have done in this study. Operations not supported by the DPU can be offloaded to a CPU or to custom IP kernels.
Project Brainwave [12] translates a pre-trained DNN model specified in a graph-based intermediate representation and partitions it for execution on multiple FPGAs in a datacenter. The tool compiles the FPGA sub-graph to Neural Processing Unit (NPU) instruction set architecture (ISA) binary. The NPU ISA supports matrix-vector and vector-vector operations. Intel DLA [1] applies the Winograd transformation [22] to optimize the performance and bandwidth of convolutional and fully connected layers. Last, Light-OPU [49] uses a single uniform computation engine to accelerate lightweight convolutional neural networks.
One key challenge that we faced was that the Xilinx DPU could not execute the three final stages of our FA-LAMP CNN. This required us to design custom kernels to accelerate those functions. It remains an open question as to whether the cost of extending the DPU architecture and ISA to support these functions would be justifiable.
3 FA-LAMP SYSTEM OVERVIEW
3.1 Background: Time Series and the Matrix Profile
A time series \(T = \langle t_1, t_2, \ldots , t_n\rangle\) is an ordered sequence of \(n\) scalar data points. A subsequence of length \(m\) starting at position \(i\) is denoted \(T_{i,m}\) (or just \(T_i\) if \(m\) is known from context, an assumption that we make here). The Pearson correlation between subsequences \(T_i\) and \(T_j\), which measures their similarity, is denoted \(c_{i,j}\) (\(c_{i,j}\) values closer to 1 indicate strong correlation; values closer to 0 indicate no relationship; values closer to \(-1\) indicate negative correlation). Once we obtain all of the \(c_{i,j}\) values, we can extract the nearest neighbor of \(T_i\) in \(T\). Subsequence \(T_j\) is defined to be the nearest neighbor of subsequence \(T_i\) if \(c_{i,j} \ge c_{i,k}, \forall k \ne j\). The Matrix Profile (MP) [53] (Figure 1) is a vector that contains the correlations of the nearest neighbors of each subsequence in \(T\): \(P(T) = \langle c^{max}_i|\ 1 \le i \le n - m + 1\rangle\), where \(c^{max}_i\) is the maximum correlation between \(T_i\) and any other subsequence \(T_j \in T\), excluding subsequences in an exclusion zone surrounding \(T_i\). Once we compute the MP (correlation to the nearest neighbor of every subsequence), determining time series motifs (repeated patterns) and time series discords (anomalies) becomes trivial [52].
Fig. 1. Matrix Profile (MP) computation for subsequences of length \(m\) : \(c_{i,j}\) denotes the Pearson Correlation between the \(i\) th and \(j\) th subsequences, \(T_{i,m}\) and \(T_{j,m}\) for all \(j\) , excluding an exclusion zone surrounding \(T_{i,m}\) . The maximum Pearson Correlation value \(c^{max}_{i}\) is stored as the \(i\) th entry in the MP.
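As a concrete illustration of this definition, the MP can be sketched as a naive \(O(n^2)\) computation in a few lines of Python. The function names, the use of dot products of z-normalized subsequences to obtain Pearson correlations, and the exclusion-zone radius of \(m/2\) are illustrative choices, not the optimized algorithms of [53]:

```python
import numpy as np

def znorm(x):
    # z-normalize a subsequence (zero mean, unit variance)
    s = x.std()
    return (x - x.mean()) / s if s > 0 else np.zeros_like(x)

def matrix_profile(T, m, excl=None):
    """Naive O(n^2) MP: for each length-m subsequence, the Pearson
    correlation to its nearest neighbor outside an exclusion zone."""
    n = len(T) - m + 1
    excl = m // 2 if excl is None else excl
    subs = np.array([znorm(T[i:i + m]) for i in range(n)])
    # For z-normalized subsequences, the Pearson correlation reduces
    # to the dot product divided by m.
    C = subs @ subs.T / m
    P = np.full(n, -np.inf)
    for i in range(n):
        for j in range(n):
            if abs(i - j) > excl:      # skip trivial (near-self) matches
                P[i] = max(P[i], C[i, j])
    return P
```

On a periodic signal, every subsequence has a near-perfect match one period away, so all MP entries approach 1; a discord manifests as a noticeably lower entry.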
3.2 Background: LAMP
The MP is itself a time series; while the MP can be computed efficiently with GPUs [53], doing so is not amenable to streaming data. While the time complexity to compute the MP is \(O(n^2 \log n)\) [48], in the streaming context, the time complexity of updating the MP for each newly sampled data point is \(O(n \log n)\) as \(n \rightarrow \infty\). In other words, not only is it necessary to store the entire time series as it grows over time, but each new data point requires a super-linear pass over all of the data points that have been stored. To sidestep this issue, the Learned Approximate Matrix Profile (LAMP) [54] predicts the maximum correlation between the most recently sampled length-\(m\) window of streaming data points and a representative time series used to train the model. This enables real-time analytics, such as anomaly detection and classification, using predicted MP values. The objective of this article is to accelerate LAMP inference using an FPGA.
Figure 2 illustrates the LAMP inference process. Each input consists of \(\mathbf {J}\) z-normalized (zero mean and unit variance) subsequences of length \(\mathbf {M}\), extracted with stride \(\mathbf {S}\). This scheme defines an extraction window in the data, \(\mathbf {W}\), where \(\mathbf {||W|| = J\cdot S + M - 1}\). We slide \(\mathbf {W}\) across the time series and extract a new input for the model for each position of \(\mathbf {W}\). This procedure generates vectors of length \(\mathbf {M}\) with \(\mathbf {J}\) channels as inputs to LAMP’s neural network (a CNN), shown in Figure 3. For each input, the model predicts \(\mathbf {J\cdot S}\) LAMP values, one for each subsequence in \(\mathbf {W}\).
Fig. 2. Illustration of the parameters used for LAMP inference on a streaming time series.
Fig. 3. The CNN used for LAMP inference. Batch normalization layers are omitted to simplify the presentation.
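The input extraction scheme above can be sketched in Python as follows. The function name is ours, and the choice to advance \(W\) by \(J\cdot S\) samples per input (so that consecutive inputs predict disjoint sets of \(J\cdot S\) LAMP values) is an assumption for illustration:

```python
import numpy as np

def extract_inputs(T, J, M, S):
    """Build LAMP model inputs from time series T: each input is J
    z-normalized length-M subsequences at stride S, stacked as channels.
    The window W advances by J*S samples per input (our assumption)."""
    W = J * S + M - 1                                  # ||W|| = J*S + M - 1
    inputs = []
    for start in range(0, len(T) - W + 1, J * S):
        window = T[start:start + W]
        channels = []
        for j in range(J):
            sub = window[j * S : j * S + M]
            sub = (sub - sub.mean()) / (sub.std() + 1e-8)  # z-normalize
            channels.append(sub)
        inputs.append(np.stack(channels, axis=-1))     # shape (M, J)
    return np.array(inputs)

# The paper uses J = 32, M = 100, S = 8, giving ||W|| = 355 and
# J*S = 256 predicted LAMP values per input.
```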
LAMP’s CNN is a simplified version of ResNet [51] for time series classification [42, 54]. Model inputs and outputs are modified to support concurrent predictions. The first layer in the LAMP CNN is batch normalization, and each convolutional layer in the model is followed by a batch normalization layer (both omitted from Figure 3 for simplicity); branches are aggregated by addition layers followed by ReLU activation functions. The final three layers are Global Average Pool (GAP), a fully connected layer, and a sigmoid activation function. The kernel dimensions and number of filters are reported below each convolutional layer in Figure 3.
3.3 Xilinx DPU: Objective and Technical Challenges
The Xilinx DPU is a programmable architecture that accelerates many common CNN operations, such as convolution, deconvolution, max pooling, and fully connected layers [45]. The objective of this work is to accelerate LAMP neural network inference on the Xilinx Ultra96-V2 and Alveo U280 FPGA boards, leveraging the DPU to achieve a balance between performance and programmability. The on-board Xilinx Zynq UltraScale+ FPGA features two Arm CPUs and has sufficient capacity to realize at most one DPU, with additional logic remaining to implement custom IP block accelerators; the larger capacity UltraScale+ FPGA in the Alveo U280 card can fit multiple DPU instances.
We ran into several technical challenges. First, the DPU does not support the Global Average Pooling (GAP) and sigmoid layers, shown on the right-hand side of Figure 3; these layers must be implemented in software running on one of the Arm CPU cores (UltraScale+) or as custom hardware IP block accelerators (UltraScale+ or Alveo). Second, implementing the fully connected layer, which sits between the GAP and sigmoid layers, would entail significant data transfer overhead between the DPU and the Arm CPU / IP block. Third, the DPU for the Ultra96-V2 board uses different configurations to perform the convolutional layers (including accumulation and ReLUs) and the fully connected layer; with space for just one DPU, supporting the fully connected layer on the DPU would require dynamic reconfiguration during inference. The alternative, which we adopted, is to implement the fully connected layer externally on the CPU or as an IP block; this approach worked well for both platforms.
3.4 DPU for Edge Processing
Figure 4 depicts the DPU architecture for Zynq devices. The DPU features user-configurable parameters to optimize resource utilization and to select which features are needed for a given deployment scenario. For example, our implementation does not use softmax, channel augmentation, or depthwise convolution. Seven DPU variants exist, which differ in the amount of parallelism provided by the convolution units, with IDs ranging from B512 (smallest, 512 operations per clock cycle) to B4096 (largest, 4,096 operations per clock cycle); the largest variant that fits onto the Ultra96-V2 board is the B2304. The DPU compiler translates a neural network model into a sequence of DPU instructions. After start-up, the DPU fetches these instructions from off-chip memory to control the compute engine’s operations. The compute engine employs deep pipelining and comprises one or more processing elements (PEs), each consisting of multipliers, adders, and accumulators. DSP blocks can be clocked at twice the frequency of general logic.
Fig. 4. Zynq DPU architecture.
The DPU buffers input, output, and intermediate values in BRAM to reduce external memory bandwidth. It directly connects to the Processing System (PS) through the Advanced eXtensible Interface 4 (AXI4) to transfer data. The host program uses the Xilinx Deep Neural Network Development Kit (DNNDK) to control the DPU, service interrupts, and coordinate data transfers. In our design, data transfers were necessary, as the final three layers of the CNN (GAP, fully connected, and sigmoid) were performed outside the DPU.
3.5 DPU for Cloud Acceleration
Two different DPU architectures are currently available that support the High Bandwidth Memory (HBM) on the Alveo FPGA card; one is high-throughput (Figure 5) and the other is low-latency (Figure 6). The Alveo DPUs are named DPUCAHX8, as they are targeted towards CNN applications (C) for the Alveo platform with HBM (AH) using 8-bit quantization (X8). The two variants are named DPUCAHX8H (high-throughput) and DPUCAHX8L (low-latency), respectively. Both architectures are provided as device binary files and cannot be further configured. The high-throughput architecture is configured with three DPUCAHX8H DPUs; the low-latency architecture is configured with two DPUCAHX8L DPUs. The DPU compiler for Alveo allows the user to partition the inference model (a graph) between the FPGA and the host. We use the default partitioning option, which divides the model between the layers that are supported by the DPU and those that are not.
Fig. 5. High-throughput DPUCAHX8H architecture, comprising three DPU instances with multiple batch engines for parallel data processing.
Fig. 6. Low-latency DPUCAHX8L architecture, comprising two DPU instances with one convolution engine, scheduler, and code FIFO units.
Figure 5 depicts the high-throughput DPUCAHX8H microarchitecture. The DPUCAHX8H consists of shared weights control logic, an instruction scheduler that fetches, decodes, and dispatches jobs, and a control register bank that provides a control interface between the DPU and the host CPU; it can be configured with four or five batch engines that allow the DPU to process multiple input data streams simultaneously. The DPU requires all of the batch engines in a kernel to execute the same neural network; the weight buffer, the instruction scheduler, and the control register bank serve all of the batch engines. Each batch engine contains a compute engine comprising two sub-engines, a convolution engine and a MISC engine, along with a local memory pool that stores trained model parameters (weights). The convolution engine executes regular convolution/deconvolution operations, and the MISC engine handles other operations such as ReLU and pooling. Each batch engine communicates with the device memory through AXI read/write master interfaces.
Figure 6 depicts the low-latency DPUCAHX8L microarchitecture. This microarchitecture comprises convolution and MISC engines and control bank registers, but omits the batch engine and local memory pool. The low-latency architecture is compatible with compiler optimizations such as kernel fusion, which can achieve higher throughput via pipeline-level parallelism.
3.6 HLS Kernel
This subsection summarizes the steps taken to design an IP accelerator that performs the GAP, fully connected, and sigmoid layers using High-Level Synthesis (HLS).
(1) Global Average Pool (GAP): The output of the final convolutional layer in Figure 3 is an array of feature maps \(D\in \mathbb {R}^{M\times N}\) corresponding to each of the \(N\) channels. The GAP generates an \(N\)-dimensional vector \(q\in \mathbb {R}^{N}\) consisting of the average value of each feature map. In other words, (1) \(\begin{equation} q_j \xleftarrow {} \frac{1}{M} \sum \limits _{i=1}^{M} D_{i,j}, \quad 1 \le j \le N. \end{equation}\) The vector \(q\) is then passed to the fully connected layer.
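As a minimal numerical check of Equation (1), with toy dimensions chosen purely for illustration (the real \(M\) and \(N\) come from the CNN's final convolutional layer):

```python
import numpy as np

# Toy feature-map dimensions for illustration only
M, N = 6, 4
D = np.arange(M * N, dtype=float).reshape(M, N)  # one column per channel

# Global Average Pool (Equation (1)): one average per channel
q = D.sum(axis=0) / M
```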
(2) Fully Connected Layer: The input to the fully connected layer is a feature vector \(q \in \mathbb {R}^{N}\). The fully connected layer left-multiplies a weight matrix \(W\in \mathbb {R}^{N\times M}\) by \(q\) and adds a bias vector, \(b \in \mathbb {R}^{M}\), to the result, yielding a new feature vector \(z \in \mathbb {R}^{M}\). (2) \(\begin{equation} z \xleftarrow {} qW + b. \end{equation}\)
Initially, we set \(z \xleftarrow {} b\) in BRAM. We then process each feature \(q_i, 1 \le i \le N\) and multiply it by the elements in the \(i\)th row of the weight matrix, \(W_{i,j=1 \ldots M}\), adding each scalar product term to \(z_j\), i.e., \(z_j \xleftarrow {} z_j + q_iW_{i,j}\), once again, storing the accumulated sum in BRAM. (We store the weights, biases, and accumulated sum in UltraRAM in our Alveo implementation.) This scheme allows the execution of the fully connected layer to start as soon as the first element \(q_1\) produced by the GAP layer arrives; likewise, each feature \(q_i\) can be discarded as soon as all of its intermediate products are computed.
We use row-wise vector-matrix multiplication and tiling [29] to optimize performance. We tile the weight matrix \(W\) into small \(n_{c}\times n_{r}\) blocks as shown in Figure 7; each vector element is multiplied by \(n_{r}\) matrix elements, allowing the accelerator to perform \(n_{c}\times n_{r}\) scalar multiplication operations per cycle. Parameter \(n_c\) must be chosen to make sure that the latency of the GAP layer is greater than the number of cycles required to process \(n_c\) vector elements; \(n_r\) is chosen to be as large as possible to increase system parallelism, subject to resource constraints. We set \(n_c = 8\) and \(n_r = 4\) for the Ultra96-V2 implementation and set \(n_c = 16\) and \(n_r = 16\) for the Alveo card in our experiments.
Fig. 7. Column-wise vector-matrix multiplication tiling scheme.
Figure 8 depicts the hardware architecture for the fully connected layer. The design starts by reading \(n_c\) elements from the previous layer (GAP) and inserting them into \(n_r\) FIFOs. During each iteration, a tile of size \(n_{c}\times n_{r}\) of the weights is read from the BRAM and is multiplied by the corresponding vector, which is provided by the GAP layer. The vector is reused until the final column of the weight matrix is processed; then the next \(n_{c}\) elements are read from the GAP layer and the process repeats. The Multiply-Accumulate (MAC) module executes \(n_{c}\times n_{r}\) parallel multiplications per clock cycle, storing the accumulated sums in a BRAM. The MAC module outputs a vector of length \(n_{r}\) that is added to the bias values stored in a separate BRAM; the resulting sum is then transmitted to the Sigmoid layer.
Fig. 8. Fully connected layer hardware architecture.
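The streaming accumulation described above can be modeled in software. This sketch (the function name and loop structure are ours) mirrors the tile-by-tile processing order of the HLS kernel, not its cycle-level behavior:

```python
import numpy as np

def fc_streaming(q_stream, Wm, b, n_c=8, n_r=4):
    """Software model of the tiled FC layer z = qW + b: initialize z
    with the bias, then accumulate q_i * W[i, :] as features arrive
    from GAP, consuming n_c features per tile and updating n_r output
    elements per step."""
    N, M_out = Wm.shape
    z = b.copy()                            # z <- b
    for i0 in range(0, N, n_c):             # next n_c features from GAP
        tile = q_stream[i0:i0 + n_c]
        for i, qi in enumerate(tile):
            # each q_i is reused across all output columns, n_r at a time
            for j0 in range(0, M_out, n_r):
                z[j0:j0 + n_r] += qi * Wm[i0 + i, j0:j0 + n_r]
    return z
```

The result is bit-identical (in exact arithmetic) to computing \(qW + b\) all at once; the tiling only changes the order of accumulation so that computation can begin before the GAP layer finishes.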
(3) Sigmoid Activation: The LAMP CNN applies the sigmoid activation function to each scalar element of the feature vector \(z\) produced by the fully connected layer. To simplify notation, we present the sigmoid function of a scalar input \(x\) that can represent any of the scalars \(z_i \in z\): (3) \(\begin{equation} f(x) = \frac{1}{1 + e^{-x}} . \end{equation}\) Computing the sigmoid function directly on an FPGA is impractical due to the cost of division and exponentiation. Informed by extensive studies regarding sigmoid approximations [14], we chose two variants to evaluate: \(\mathsf {ultra\_fast\_sigmoid}\), a piece-wise approximation used in the Theano library [5]; and \(\mathsf {sigm\_fastexp\_512}\), which approximates the exponential function by truncating its limit definition [37].
There are inherent tradeoffs among these approximations in terms of accuracy, throughput/ latency, area, and energy consumption; additionally, their implementation differs radically, depending on the chosen precision and whether they are implemented using fixed- or floating-point arithmetic. A thorough survey of the tradeoffs involved is beyond the scope of this article. The final design, which we evaluate in the following section, uses 8-bit fixed-point arithmetic.
The \(\mathsf {ultra\_fast\_sigmoid}\) approximation is defined as follows: (4) \(\begin{equation} f(x) = \left\lbrace \begin{array}{ll} 0.5\left(\frac{\frac{1.5x}{2}}{1+\frac{x}{2}} + 1\right) &\quad 0 \le \frac{x}{2} \lt 1.7 \\[1em] 0.5\left(1 + 0.935 + 0.045\left(\frac{x}{2}-1.7\right)\right) & \quad 1.7 \le \frac{x}{2} \lt 3 \\[1em] 0.5(1+0.995) & \quad \frac{x}{2} \ge 3 \\[1em] 0.5\left(-\frac{\frac{-1.5x}{2}}{1-\frac{x}{2}} + 1\right) & \quad -1.7 \le \frac{x}{2} \le 0 \\[1em] 0.5\left(1 - \left(0.935 + 0.045\left(-\frac{x}{2}-1.7\right)\right)\right) & \quad -3 \lt \frac{x}{2} \le -1.7 \\[1em] 0.5(1-0.995) & \quad \frac{x}{2} \le -3 \end{array} \right. . \end{equation}\) Due to the relative simplicity of the operations compared to directly computing the sigmoid function, \(\mathsf {ultra\_fast\_sigmoid}\) can be implemented as a low-latency kernel.
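A scalar Python model of Equation (4) may clarify the piecewise structure; it exploits the symmetry \(f(-x) = 1 - f(x)\) and does not model the rounding behavior of the 8-bit hardware version:

```python
def ultra_fast_sigmoid(x):
    """Floating-point model of the ultra_fast_sigmoid piecewise
    approximation (Equation (4)); all cases evaluate on h = |x|/2
    and the sign of x selects 0.5*(1+z) or 0.5*(1-z)."""
    h = abs(x) / 2.0
    if h < 1.7:
        z = 1.5 * h / (1.0 + h)            # rational segment near zero
    elif h < 3.0:
        z = 0.935 + 0.045 * (h - 1.7)      # linear segment
    else:
        z = 0.995                          # saturated tail
    return 0.5 * (1.0 + z) if x >= 0 else 0.5 * (1.0 - z)
```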
The \(\mathsf {sigm\_fastexp\_512}\) approximation replaces the exponential with its limit definition, \(e^x = \lim _{n \rightarrow \infty } \left(1+\frac{x}{n}\right)^n\), using \(n = 512\) to render the approximation computable with \(\lg (n) = 9\) successive squarings [37]: (5) \(\begin{equation} {\text{exp}}(x) \approx \left(1+\frac{x}{n}\right)^{n}, \hspace{5.0pt}n=512 , \end{equation}\) (6) \(\begin{equation} {\text{sigm}}(x) = \frac{1}{1+{\text{exp}}(-x)} . \end{equation}\)
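A scalar software model of this approximation, assuming the repeated-squaring evaluation of \((1+x/n)^n\) with \(n = 512\) (\(\lg(n) = 9\) squarings); fixed-point effects are not modeled:

```python
def sigm_fastexp_512(x):
    """Sigmoid via the fastexp trick: exp(-x) ~ (1 - x/512)^512,
    computed with lg(512) = 9 successive squarings, then one divide."""
    y = 1.0 - x / 512.0
    for _ in range(9):      # y <- y^2, nine times: y = (1 - x/512)^512
        y *= y
    return 1.0 / (1.0 + y)
```

The division remains, but the exponential reduces to nine multiplications, which maps well to a pipelined datapath; accuracy degrades for large \(|x|\), where \((1+x/512)^{512}\) diverges from \(e^x\).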
We implemented our sigmoid layer in HLS using a loop that takes \(x\) as an input from the fully connected layer and approximates the sigmoid using either Equations (4) or (6). In both scenarios, we pipelined the loop with an Initiation Interval (II) of 1; the latency of the loop for \(\mathsf {sigm\_fastexp\_512}\) is higher due to the complexity of the operations.
Figure 9 shows the \(\mathsf {sigm\_fastexp\_512}\) and \(\mathsf {ultra\_fast\_sigmoid}\) approximations, along with their associated errors, defined as the squared difference between them and an exactly computed sigmoid function. Neither is uniformly more accurate than the other for all reported values of \(x\), but \(\mathsf {ultra\_fast\_sigmoid}\) has noticeably higher error closer to zero. This error is tolerable for classification problems [10], where results are normally determined through comparison, not exact values. The error has a greater impact for regression systems that subsequently process the neural network’s calculated output.
Fig. 9. (a) Approximation functions for sigmoid and (b) their error. Both charts were computed using an 8-bit fixed-point data type.
(4) HLS Optimizations: We optimized our design using directives provided by Vivado HLS and through manual redesign of the fully connected layer. As shown in Figure 10, we achieved a \(20\times\) speedup over our baseline implementation, while increasing resource usage by \(1.5\times\):
Fig. 10. Improvements in custom kernel latency and resource utilization due to HLS optimizations.
\(\mathsf {Baseline}\): our starting point design using a 32-bit floating-point data format.
\(\mathsf {Unroll}\): unrolls the inner loops of the GAP and fully connected layers.
\(\mathsf {Pipeline}\): pipelines the outer computation loops and I/O interface loops to infer burst reads/writes; the three layers execute as a pipeline to maximally overlap computation.
\(\mathsf {Fixed\text{-}Point}\): implements the design in an 8-bit fixed-point (\(\mathsf {ap\_fixed}\)<8, 3>) data format, which reduces resource utilization by \(3\times\) [13].
\(\mathsf {Loop\text{-}Tiling}\)-\(n_r\): tiles the fully connected layer (see Figure 7), while retaining the 8-bit data format.
The average resource axis in Figure 10 is the average percentage of BRAMs, LUTs, DSP blocks, and registers used for each design. Most of the speedup arises from pipelining and unrolling loops, which increases the number of DSP blocks and registers used in a design.
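To illustrate what the \(\mathsf {ap\_fixed}\)<8, 3> format implies for precision, the following is a simplified software model. It assumes round-to-nearest with saturation; the HLS defaults are truncation and wraparound unless other quantization/overflow modes are selected:

```python
import numpy as np

def ap_fixed_8_3(x):
    """Simplified model of ap_fixed<8,3>: 8 bits total, 3 integer bits
    (including sign), hence 5 fractional bits. Values are quantized to
    steps of 2^-5 and saturated to [-4, 4 - 2^-5]. (Round-to-nearest
    with saturation is assumed; HLS modes are configurable.)"""
    step = 2.0 ** -5
    lo, hi = -4.0, 4.0 - step
    return np.clip(np.round(np.asarray(x) / step) * step, lo, hi)
```

With only 5 fractional bits, the quantization step is 0.03125, which is one reason the sigmoid approximation error near zero (Section 3.6) is tolerable for classification but matters for regression.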
Figure 11 shows the overall design on the Ultra96-V2 board. The HLS kernel implements the GAP, fully connected, and sigmoid layers while the rest of the neural network runs on the DPU. The DPU and HLS kernel connect to the processing system via AXI4 ports to allow access to the DDR memory space. The Zynq UltraScale+ processing system in our platform has four High-Performance (HP) ports and two High-Performance Cache coherent (HPC) ports. The DPU I/O interfaces and HLS kernel connect to the HP ports, which provide lower latency than the HPC ports; the DPU instruction fetch port connects to an HPC port.
Fig. 11. The FA-LAMP edge implementation comprises a Zynq UltraScale+ processing system, DPU IP, and custom HLS kernel; the HLS kernel implements the GAP, fully connected, and sigmoid layers.
Figure 12 shows the Alveo U280 FPGA configured to run the high-throughput DPUCAHX8H architecture. The host CPU, which pre-processes the input time series, communicates with the Alveo card via the PCIe bus. The FPGA is partitioned into static and dynamic regions. The static region is a fixed logic partition that contains the board interface logic and cannot be programmed by the user. The dynamic region contains memories, memory interfaces, and user kernels compiled using the Xilinx Vitis compiler. The resources in the dynamic region are further divided into three Super Logic Regions (SLR0-2). The DPU architecture consists of three DPUCAHX8H instances, each of which is mapped to a separate logic region. The DPUs in SLR1 and SLR2 are configured with five batch engines for maximum parallelism; the DPU in SLR0 contains four batch engines to leave space for our custom kernel (which implements the GAP, fully connected, and sigmoid layers) as well as the AXI switch network and HBM controller that connect to the device memory. The switch network connects to all three DPU instances, providing 7, 7, and 6 HBM AXI ports, respectively, and provides two additional ports to the custom kernel in SLR0.
Fig. 12. Alveo architecture programmed with the high-throughput DPU.
4 EXPERIMENTAL SETUP
Figure 13 depicts the LAMP model training process and DPU deployment workflow; a detailed explanation follows.
Fig. 13. Overview of deploying a LAMP model on a DPU.
4.1 Model Training
FA-LAMP deployment on an FPGA begins by training the model. We set the number of subsequences J to 32 [54], the window length M to 100, and the stride S to 8. We used the Adam optimizer [20] to train the model using stochastic gradient descent with a learning rate of 1e-3 and a batch size of 128. The training objective is to minimize the mean squared error loss between the predicted and exact MP values for the training dataset. We removed the first batch normalization layer from the LAMP CNN [54]: The Vitis compiler merges each convolutional layer followed by a batch normalization layer followed by a ReLU layer; a CNN with a batch normalization layer preceding the first convolutional layer caused an error, because the Vitis compiler interpreted the CNN as a sequence of batch normalization layers followed by convolutional layers. Removing the initial batch normalization layer was the most straightforward way to rectify the problem.
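To make these parameters concrete, the following sketch (an illustration only; the exact tensor layout fed to the LAMP CNN may differ) extracts J = 32 overlapping subsequences of length M = 100 with stride S = 8 from a segment of the input time series:

```python
# Illustration only: extract J overlapping subsequences of length M with
# stride S from a 1-D time series segment. The exact input layout expected
# by the LAMP CNN may differ from this sketch.
def extract_subsequences(series, J=32, M=100, S=8):
    needed = (J - 1) * S + M          # minimum segment length: 348 points here
    if len(series) < needed:
        raise ValueError(f"segment too short: need {needed} points")
    return [series[j * S : j * S + M] for j in range(J)]

segment = list(range(400))            # toy stand-in for sensor samples
windows = extract_subsequences(segment)
print(len(windows), len(windows[0]))  # 32 windows of 100 samples each
```

With these parameters, a segment of (J-1)·S + M = 348 samples suffices to fill one CNN input.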
We trained a LAMP model for each dataset offline using the TensorFlow quantization-aware training API on an Nvidia Tesla P100 GPU. This API improves the accuracy of the model prior to quantization to INT8, which is performed post hoc by downstream tools (the Vitis AI Quantizer in our case). The model is then calibrated and partitioned into two sub-graphs using Vitis AI: (i) the layers to be executed on a custom kernel (GAP, fully connected, and sigmoid); and (ii) the rest of the model, which runs on the DPU. The custom kernel code includes a header that contains the weights and activations of the fully connected layer for high-level synthesis; the GAP and sigmoid layers do not feature any trained parameters. The second sub-graph of the model is stored in the \(\mathsf {h5}\) file format.
4.2 Model Inference
4.2.1 DPU Deployment.
We use Vitis AI 1.3 to quantize and compile the trained LAMP model. AI Quantizer converts all of the model weights and activations into a fixed-point INT8 format. The Xilinx Intermediate Representation (XIR)-based Compiler then maps the model to the DPU instruction set and data flow. We specified the custom kernel (fully connected, GAP, and sigmoid layers) in Vitis HLS using C++ and the \(\mathsf {ap\_fixed}\)<8, 3> data type. We synthesized the custom kernel using Vivado HLS 2019.2 and integrated the resulting IP block with the DPU using Vitis 2019.2.
We evaluated the LAMP CNNs on a Xilinx Ultra96-V2 development board and Alveo U280 card. Table 1 compares the resources provided by the two platforms. The Alveo card is 30\(\times\) more expensive than the Ultra96-V2 board, while providing considerably more logic, memory, and DSP resources and higher off-chip memory capacity and bandwidth.
| | Alveo U280 | Ultra96-V2 |
|---|---|---|
| INT8 Peak Throughput | 24.5 TOPS | 691 GOPS |
| HBM2 Capacity | 8 GB | N/A |
| HBM2 Bandwidth | 460 GB/s | N/A |
| DDR Capacity | 32 GB | 2 GB |
| DDR Bandwidth | 38 GB/s | 25 GB/s |
| Look-Up Tables | 1,304K | 70,560 |
| DSP Slices | 9,024 | 360 |
| Block RAMs | 2,016 | 432 |
| UltraRAMs | 960 | N/A |
| Price | $7,500 | $250 |
Table 1. Comparison between Ultra96-V2 and Alveo U280 FPGA Specifications
The Ultra96-V2 integrates two Arm CPUs (a 1.5 GHz Arm Cortex-A53 and a 600 MHz Cortex-R5) within a Xilinx Zynq UltraScale+ MPSoC featuring 70,560 LUTs, 360 DSP slices, and 7.5 MB of BRAM. We used a 16 GB SD card to store an embedded Linux image created with PetaLinux 2019.2, along with the input time series datasets used for inference. We wrote a host program in C++ that uses the DNNDK API (VART for Alveo) to communicate with the DPU IP core.
We inserted the Alveo FPGA card into a Dell PowerEdge R730 Rack Server that contains a six-core 2.60 GHz Intel Xeon E5-2640 processor. The host connects to the FPGA through a PCI Express 4.0 interface. The Alveo card features 32 GB of DDR memory and 8 GB of HBM with 460 GB/s of bandwidth.
In the standard DPU flow, unsupported layers can be offloaded to a host CPU as an alternative to utilizing custom IP blocks. The Zynq FPGA on the Ultra96-V2 development board features two integrated Arm cores: a Cortex-A53 and a Cortex-R5. As a baseline for comparison in the edge deployment scenario, we implemented the custom kernel layers on the Cortex-A53, which supports a higher clock frequency than the Cortex-R5. The source code running on the Cortex-A53 employs the same 8-bit fixed-point data type as we used on the FPGA. We use the C++ \(\mathsf {exp()}\) function from the <cmath> library to compute the sigmoid and a for-loop to compute the global average pool layer. For the cloud deployment, we evaluated the software performance of the custom kernel on the Intel Xeon E5-2640 CPU, noting that the latency over the PCIe communication channel is significant.
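The following pure-Python reference (with illustrative shapes and weights, not the actual model parameters) mirrors the three custom-kernel layers as computed on the Cortex-A53: a for-loop global average pool, a fully connected layer, and a sigmoid evaluated via exp():

```python
import math

# Software reference for the three custom-kernel layers (GAP, fully
# connected, sigmoid), mirroring the CPU baseline described above.
# Shapes and weights are illustrative, not taken from the actual model.
def global_average_pool(feature_map):
    # feature_map: list of channels, each a list of activations
    return [sum(ch) / len(ch) for ch in feature_map]

def fully_connected(x, weights, bias):
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

fmap = [[0.5] * 4, [1.0] * 4, [0.25] * 4]   # 3 channels, 4 activations each
pooled = global_average_pool(fmap)          # [0.5, 1.0, 0.25]
out = sigmoid(fully_connected(pooled, [0.2, 0.1, 0.4], 0.0))
```

The hardware kernels perform the same computation in 8-bit fixed-point (\(\mathsf {ap\_fixed}\)<8, 3>) rather than double precision.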
4.2.2 LAMP Deployment on CPU and GPU.
To quantify FA-LAMP’s performance in the cloud scenario, we implemented LAMP inference on a server CPU, a desktop CPU, and a GPU. The GPU platform comprises two NVIDIA GeForce RTX 2080 cards inserted into a Rack Server containing an Intel Core i9-9900 processor with 16 hardware threads operating at 3.1 GHz; the Server CPU is the six-core Intel Xeon E5-2640 server described earlier; and the Desktop CPU is an Intel Core i7-8750 CPU with six cores running at 2.2 GHz. All of the platforms mentioned above execute CNN inference in Python 3.7 using Keras’ predict_generator API with multiprocessing enabled.
4.2.3 Raspberry Pi3 and Edge TPU.
We also ported the LAMP inference engine to run on a Raspberry Pi 3 board, which provides a 100% software baseline that is representative of edge computing. We wrote a short Python script that converts the pre-trained LAMP model saved in the Keras format to TensorFlow Lite with 8-bit full integer quantization, and we configured the optimizer to minimize latency. We performed inference using the trained model on the Raspberry Pi using the TensorFlow Lite Interpreter. The Raspberry Pi 3 features a Quad Core 1.2 GHz Broadcom BCM2837 CPU. Ideally, we would have run the full LAMP model in software on either of the two Arm cores on the Ultra96-V2 board; however, it was not possible to do so, as Keras does not support the Ultra96-V2 board at the time of writing.
We also executed LAMP inference on a Coral USB Accelerator [9] that contains a Google Edge TPU coprocessor [38], an ASIC optimized for AI inference. We first quantized the LAMP model to an 8-bit format using TensorFlow quantization-aware training; we then exported the quantized model as a frozen graph, converted it to a TensorFlow Lite model, and used the Edge TPU compiler to convert it to the supported format for the USB accelerator. The Edge TPU allows pipelining to decompose a large model into segments spread across multiple Edge TPUs; this is particularly important for models whose data segments exceed the Edge TPU cache capacity. Our LAMP model fits within the Edge TPU on-chip memory (8 MB), allowing us to run two models on two Coral USB accelerators concurrently. We inserted the two Coral USB accelerators into two USB 3.0 ports on a desktop PC running Ubuntu 18.04 Linux. We installed the Edge TPU runtime version 13 on Ubuntu and used the increased frequency option, which is known to increase power consumption. We loaded the model and data onto the Edge TPUs using the PyCoral API with Python 3.7.
4.2.4 Comparison to Recent CNN-to-FPGA Compilation Frameworks.
To quantify the DPU’s performance, we deployed our LAMP model on several state-of-the-art edge- and cloud-oriented CNN-to-FPGA compilation frameworks: HLS4ML [11], fpgaConvNet [40], VTA [6], and FINN [39]. All of these frameworks can target the Ultra96-V2 development board, but only fpgaConvNet and FINN can target the Alveo card.
HLS4ML is a Python package that converts a trained neural network in the ONNX format into an HLS project for synthesis onto an FPGA; layers are implemented by choosing and configuring HLS modules from a template library. We trained and quantized the LAMP model using TensorFlow and then converted it to ONNX using tf2onnx [36]. HLS4ML performs integer scaling during quantization and can be configured on a per-layer basis. To ensure that the model was synthesizable, we limited the amount of loop unrolling. We also corrected some compilation errors that occurred because HLS4ML did not define the correct AXI Stream interface between modules. We set the precision of all weights and biases to 8-bit fixed-point and used the default reuse factor parameter. While HLS4ML supports the sigmoid layer using a lookup table implementation, we replaced the last three layers of the CNN with our own custom layers to ensure a fair comparison.
Similar to HLS4ML, fpgaConvNet converts a trained model in the ONNX format into an HLS project, propagating model quantization settings into its internal representation. samo [32], a design space exploration tool, can optimize the model implementation on the FPGA using simulated annealing; we used samo’s rule-based optimizer and selected the latency performance objective.
VTA uses a template deep learning accelerator consisting of load, store, and compute (RISC processor) units. We used TVM to translate a trained LAMP model into a Relay module (TVM’s front-end IR) and applied 8-bit quantization (VTA exclusively supports the int-8 format). We then applied constant folding to reduce the number of operators and created an object file to load onto the FPGA. The last three layers of the LAMP model are executed in fp32 on the CPU, as VTA’s front-end compiler is not compatible with custom kernel IP accelerators.
For the FINN framework, we defined our LAMP model in PyTorch and quantized it using Brevitas [25], which exports the model to the FINN-ONNX format. FINN’s compiler then converts the model into one or more FPGA accelerators; the network must be redefined with Brevitas layers, which correspond to standard PyTorch layers, e.g., there is a \(\mathsf {QuantLinear}\) layer type. FINN’s non-standard use of ONNX restricted our ability to quantize the LAMP model. To target the Ultra96-V2 board, we quantized weights and activations into a 4-bit representation; to target the Alveo card, we quantized weights and activations into 1-bit and 4-bit representations, respectively. FINN was unable to support our custom IP kernels, so we implemented them using the fp32 format on the host CPU.
4.3 Measurements
We report the throughput and the energy consumption of FA-LAMP CNN inference by direct execution of the model on the aforementioned platforms using three time series datasets, which are summarized in the next subsection. The throughput is reported as the total number of multiply-accumulate operations in the model (7.71 GOPs) executed per second. We also report the inference rate of each platform, which we define as the number of Matrix Profile values predicted per second. We measure the Ultra96-V2 and Raspberry Pi power consumption using a commercially available Kuman power meter, which provides power measurements for the entire board.
We estimated the power consumption of the FPGA on the Alveo card by periodically transmitting queries through the xbutil tool. xbutil measures FPGA power consumption but does not report the current of the HBM power rails, which we omit from our estimation. We estimated the power consumption of the host Intel Xeon CPU using the PyRAPL software toolkit [27], terminating all other application programs beforehand; we could not eliminate variability arising from the operating system itself. We report the GPU’s power consumption using the NVIDIA System Management Interface (nvidia-smi). We estimate energy consumption by multiplying the power measurement by the time required to perform inference on a batch of size 128. Each element of a batch predicts 256 MP values based on the configured LAMP parameters, for a total of 128\(\times\)256 predictions per inference. Batch sizes larger than 128 led to degraded results on the Raspberry Pi. We report resource utilization results from Vivado’s post-implementation reports.
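The reported metrics follow from these measurements by simple arithmetic; the sketch below (with placeholder timing and power values, and assuming the 7.71 GOP figure is the work performed per batch inference, which the text does not state explicitly) shows the derivation:

```python
# Illustrative derivation of the reported metrics. The timing and power
# numbers below are made-up placeholders, and we assume the 7.71 GOP
# figure corresponds to one batch inference.
GOPS_PER_BATCH = 7.71            # multiply-accumulate work per batch (GOPs)
BATCH_SIZE = 128
MP_PER_SAMPLE = 256              # MP values predicted per batch element

def report(batch_time_s, power_w):
    throughput = GOPS_PER_BATCH / batch_time_s                  # GOPS
    inference_rate = BATCH_SIZE * MP_PER_SAMPLE / batch_time_s  # MP values/s
    energy = power_w * batch_time_s                             # J per batch
    perf_per_watt = throughput / power_w                        # GOPS/Watt
    return throughput, inference_rate, energy, perf_per_watt

t, r, e, ppw = report(batch_time_s=0.02, power_w=3.0)           # placeholders
```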
We evaluated the efficiency of all DPU variants that we could fit onto the Ultra96-V2 UltraScale+ Zynq FPGA, which can fit no more than one DPU core. We set the DPU’s BRAM and DSP usage to low and disabled the average pool and softmax instructions, since the LAMP neural network does not perform these operations. For the Alveo card, we evaluated the efficiency of high-throughput and low-latency DPU kernels. The DPU IP provides two distinct clock inputs: We set the input clock for DSP blocks to 300 MHz and the input clock for general logic to 150 MHz in both evaluated platforms. We set the HBM clock on the Alveo card to 450 MHz.
4.4 Benchmarks
We trained neural networks for three time series datasets and measured the error of the model’s predictions; this methodology is similar in principle to prior work on LAMP [54].
(1) Seismology Domain: The Earthquake dataset is obtained from a seismic station [53]. Real-time event prediction impacts seismic hazard assessment, response, and early-warning systems [3, 23, 33]. We split the time series into 120 million and 30 million data points for training and inference.
(2) Entomology Domain: The Insect EPG dataset is obtained from an Electrical Penetration Graph (EPG) that records insect behavior [53]. This time series is the record of an insect feeding on a plant, and observed behaviors were classified by an entomologist as \(\mathsf {Xylem\ Ingestion}\), \(\mathsf {Phloem\ Ingestion}\), or \(\mathsf {Phloem\ Salivation}\). Understanding feeding behavior of insects can help farmers identify vector-bearing pests that may decimate crops. We split the time series into 2.55 million and 5 million data points for training and inference.
(3) Poultry Farming Domain: The Chicken Accelerometer dataset was collected by placing a tracking sensor on the back of a chicken [2]. The sensor outputs acceleration measurements along the x-, y-, and z-axes at a 100 Hz sampling rate. The data was labeled to classify the chicken’s behavior into one of three categories: Pecking, Preening, or Dustbathing. This is relevant to disease detection because infected chickens exhibit a marked increase in preening and dustbathing behavior compared to uninfected chickens. Figure 14 depicts a snippet of the dataset corresponding to the x-, y-, and z-axes and behavioral labels. Using only the x-axis measurements, we split the time series into 6 million and 2 million data points for training and inference.
Fig. 14. A snippet of chicken accelerometer data with corresponding labels (Preening: label height = 3, dustbathing: label height = 4, and pecking: label height = 6).
4.5 Source Code and Data Availability
We have publicly released all of the code, data, and LAMP inference models used to produce the results in this article [17].
5 RESULTS
5.1 Edge: Throughput and Resource Utilization
Table 2 summarizes the resource utilization and the measured throughput of FA-LAMP inference using various system configurations on the Ultra96-V2 FPGA board. The DPU + Arm columns report results when the custom kernel (fully connected, GAP, and sigmoid layers) runs on the Arm CPU, while the DPU + IP columns report results for the custom kernel implemented as FPGA IP blocks that connect directly to the DPU; the largest and best-performing B2304 DPU is used when reporting results for DPU + IP. Results are reported for the custom kernel implemented using two sigmoid approximations: \(\mathsf {ultra\_fast\_sigmoid}\) (\(\mathsf {ultra\_fast}\)) and \(\mathsf {sigmoid\_fastexp\_512}\) (\(\mathsf {fastexp\_512}\)).
Table 2. Edge Prototype: Throughput (GOPS) and Resource Utilization Comparison between Different DPU Architectures; (DPU + IP) Uses a B2304 DPU
The DPU + Arm results in Table 2 show that system throughput increases as DPU size and complexity increases, from B512 to B2304. The highest overall throughput is achieved for the DPU + IP configurations, as the three custom kernel layers that the DPU cannot execute are moved from the Arm CPU to a custom accelerator. Data transfer overhead remains present in both cases between the DPU and Arm CPU/IP block: Each read for an input batch of data takes around 0.12 ms, and each write takes around 0.1 ms; the port throughput is around 850 MB/s.
Table 2 also reports the peak (achievable) DPU throughput for each system configuration; this does not include the throughput of the Arm CPU or IP block, because the inference procedure, at present, does not lend itself to concurrent execution. The percentage of achievable throughput ranges from 43.6% to 53.1% for the DPU + Arm configurations and jumps to 65.6% and 62.0% for the two DPU + IP configurations. Even if a hypothetical next-generation DPU could support the three custom kernel operations, the overhead of DPU reconfiguration, which we avoided in the design(s) evaluated here, would also limit the achievable throughput.
DPU resource utilization depends on the degree of parallelism in the chosen configuration; on-chip RAM buffers the weights, bias, and intermediate features. As DPU I/O channel parallelism increases, more on-chip RAM is needed to store more intermediate data and more DSP slices are needed to process that data. When the low DSP usage option is chosen, the DPU uses DSP slices exclusively for multiplication in the convolution layers and offloads accumulation to LUTs. This explains the observed increase in LUT usage as DPU throughput increases.
The custom IP kernels consume additional resources. \(\mathsf {sigmoid\_fastexp\_512}\) performs more multiplication operations and constant division operations than \(\mathsf {ultra\_fast\_sigmoid}\), noting that the latter performs mostly constant multiplications. As a consequence, \(\mathsf {ultra\_fast\_sigmoid}\) achieves higher throughput and lower resource utilization compared to \(\mathsf {sigmoid\_fastexp\_512}\); however, as we will see in the next subsection, these benefits come at the cost of lower accuracy.
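The two approximations can be sketched in a few lines. We assume here that \(\mathsf {sigmoid\_fastexp\_512}\) uses the limit approximation exp(-x) ≈ (1 - x/512)^512, and we stand in for \(\mathsf {ultra\_fast\_sigmoid}\) with a generic cheap rational form; neither is guaranteed to match the HLS kernels bit-for-bit:

```python
import math

# Assumed form of sigmoid_fastexp_512: approximate exp(-x) with the
# classical limit (1 - x/512)^512, which needs only constant divisions
# and multiplications (repeated squaring in hardware).
def sigmoid_fastexp_512(x):
    e_neg = (1.0 - x / 512.0) ** 512        # approximates exp(-x)
    return 1.0 / (1.0 + e_neg)

# Illustrative stand-in for ultra_fast_sigmoid: a cheap rational
# approximation (not the exact HLS implementation).
def ultra_fast_sigmoid(x):
    return 0.5 + 0.5 * x / (1.0 + abs(x))

exact = 1.0 / (1.0 + math.exp(-1.85))
# The fastexp form tracks the exact sigmoid closely; the rational form
# drifts more as |x| grows, trading accuracy for fewer operations.
```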
5.2 Edge: Comparison to a Raspberry Pi 3 and Edge TPU
Next, we compare the performance and energy consumption of FA-LAMP neural network inference running on the Ultra96-V2 FPGA board to a Raspberry Pi 3 and an Edge TPU device, which are representative of low-cost edge computing systems.
Table 3 reports the throughput (inference rate), energy consumption (in Joules), and performance per Watt (GOPS/Watt) of processing a single batch of size 128 on each platform. The runtime of FA-LAMP neural network inference does not depend on the size of the representative dataset used for training; thus, the inference rate and energy consumption are identical across all datasets.
Table 3. Edge Prototype: Inference Rate and Energy Consumption of LAMP Neural Network Inference on an Edge TPU, Raspberry Pi 3, and Ultra96-V2 Board
Both the inference rate and energy consumption of all three Ultra96-V2 configurations improve by an order of magnitude compared to the Edge TPU and by \(\sim 6\times\) compared to the Raspberry Pi; according to our power measurements, the Ultra96-V2 FPGA board consumed \(\sim\)3 W of power compared to \(\sim\)4 W for the Raspberry Pi. We use the nominal power consumption of 4.5 W for the Edge TPU device as reported in its datasheet. As expected, the DPU + IP options achieve a higher inference rate than the reported DPU + Arm configuration. Notably, the DPU + IP option using \(\mathsf {sigmoid\_fastexp\_512}\) consumes more energy than both the DPU + Arm configuration and the DPU + IP option using \(\mathsf {ultra\_fast\_sigmoid}\); referring back to Table 2, this occurs due to the higher demand for DSP blocks (36 more than \(\mathsf {ultra\_fast\_sigmoid}\)), which are clocked twice as fast as the FPGA general logic. All of the evaluated edge platforms exhibit comparable power consumption; performance per Watt therefore scales linearly with the inference rate, with the Ultra96-V2 outperforming the Edge TPU by an order of magnitude and the Raspberry Pi by \(\sim 6\times\). The Edge TPU has the lowest performance among all the edge platforms due to its limited RAM capacity and its inability to support batch processing; we conclude that it is not a good option for streaming applications.
5.3 Cloud Prototype: Throughput and Energy
Table 4 details the measured performance and energy consumption of FA-LAMP in different scenarios. The columns starting with LL and HT report measurements for the low-latency and high-throughput DPUs on the Alveo card. Similar to Table 2, in the LL (HT) + CPU columns, the custom kernel (fully connected, GAP, and sigmoid layers) is offloaded to the CPU, while in the LL (HT) + IP columns the custom kernel is implemented as an FPGA kernel that runs on programmable logic. The FA-LAMP program in all Alveo implementations is multi-threaded to maximize DPU utilization.
LL = Alveo Low-latency, HT = Alveo High-throughput.
Table 4. Cloud Prototype: Throughput, Latency, Inference Rate, and Energy Consumption
Throughput: The server CPU and GPU achieved an order of magnitude higher throughput than the other systems tested, due to their high core count and parallel processing capabilities; the desktop CPU achieves comparable performance to the high-throughput DPU configurations. The high-throughput DPU achieves higher throughput than the low-latency DPU. Referring back to Figures 5 and 6, the high-throughput architecture has three DPUs, each with multiple batch engines, while the low-latency architecture has two DPUs with a single compute engine and no local memory pool; the low-latency DPU’s fusion engine improves latency, but not throughput.
Latency: We report the latency on each platform as the inference time for a single input. The FPGA-based platforms achieved two orders of magnitude lower latency than the two CPUs and the GPU. The low-latency DPU performs inference approximately 1 ms faster than the high-throughput DPU, benefiting from compiler optimizations such as layer fusion, as supported by its fusion engine (Figure 6). The hardware IP kernel implemented using the \(\mathsf {ultra\_fast\_sigmoid}\) approximation runs around 0.2 ms faster than the \(\mathsf {sigmoid\_fastexp\_512}\) implementation. The FPGA + CPU systems incur the latency associated with transferring data between the FPGA and the server CPU and with reprogramming the DPU at runtime to execute the fully connected layer.
Inference Rate: The inference rate is the number of predictions per second, which correlates to throughput: The GPU and the Server CPU have the highest inference rate, while the inference rate of the Desktop CPU is comparable to those of the FPGA with high-throughput DPU configurations. The high-throughput DPU connected to the custom kernel with the \(\mathsf {ultra\_fast\_sigmoid}\) has the highest overall inference rate among all DPU implementations; this results from the greater arithmetic parallelism provided by the high-throughput DPU compared to the low-latency DPU.
Energy Consumption: The Energy row in Table 4 reports the energy consumption of processing a single batch of size 128 on each platform. The FPGAs are an order of magnitude more energy-efficient than the GPU or CPUs. The lowest overall energy consumption was achieved using the high-throughput DPU and the custom IP kernel with the \(\mathsf {ultra\_fast\_sigmoid}\) approximation, which requires far fewer arithmetic operators than \(\mathsf {sigmoid\_fastexp\_512}\). In terms of performance per Watt, the GPU outperforms all the other platforms, while the high-throughput DPU with \(\mathsf {sigmoid\_fastexp\_512}\) improves on the CPU’s performance per Watt by 44%.
5.4 Inference Accuracy
Table 5 summarizes the accuracy of the FA-LAMP neural network models that we evaluated in the preceding section. Columns labeled “edge” present the results from our previous implementation [18], and columns labeled “qa_edge” and “qa_cloud” detail the results obtained using quantization-aware training. We include results for a 32-bit floating-point CPU-only implementation of the FA-LAMP models as a baseline to quantify the loss in accuracy due to quantization, which is 2.1–2.8 percentage points (pp) for \(\mathsf {sigmoid\_fastexp\_512}\) and 3.1–6.3 pp for \(\mathsf {ultra\_fast\_sigmoid}\). The 6.3 pp accuracy loss for the Chicken Accelerometer dataset with \(\mathsf {ultra\_fast\_sigmoid}\) can be attributed to the range of input values to the sigmoid kernel: When inference is performed on this dataset, the sigmoid layer’s input values lie in the range [\(-\)0.12, 1.85], where \(\mathsf {ultra\_fast\_sigmoid}\) has its largest error.
qa = quantization-aware Training, edge = Ultra96, Cloud = Alveo.
Table 5. FA-LAMP Neural Network Inference Accuracy
Compared to our previous work [18], the results reported in Table 5 show a 1.6–2.6 pp improvement in \(\mathsf {sigmoid\_fastexp\_512}\) accuracy and a 1.7–3.3 pp improvement in \(\mathsf {ultra\_fast\_sigmoid}\) accuracy, due to the use of quantization-aware training in this study. The differences in accuracy reported for the Ultra96-V2 and Alveo implementations are due to the different model compilation flows for the two platforms and potential microarchitectural differences, noting that neither fixed-point nor floating-point addition and multiplication are associative.
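The non-associativity of floating-point arithmetic is easy to demonstrate: reordering a reduction changes the rounded result, so two compilation flows that sum the same activations in a different order can legitimately disagree in the low-order bits.

```python
# Floating-point addition is not associative: the same three operands
# summed in a different order round to different results.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)   # False: 0.6000000000000001 vs 0.6
```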
5.5 Comparison to Recent CNN-to-FPGA Compilation Frameworks
We deployed our LAMP model on several state-of-the-art FPGA edge-based and cloud-based CNN frameworks and compared their performance; Table 6 reports the resource utilization and throughput of each framework.
| | HLS4ML [11] | fpgaConvNet [40] | VTA [6] | FINN [39] | DPU | fpgaConvNet [40] | FINN [39] | DPU |
|---|---|---|---|---|---|---|---|---|
| FPGA Platform | Ultra96 | Ultra96 | Ultra96 | Ultra96 | Ultra96 | Alveo U280 | Alveo U280 | Alveo U280 |
| Precision | fix-8 | fix-8 | int-8 | fix-4 | int-8 | fix-8 | mix | int-8 |
| DSPs | 256 | 220 | 186 | 220 | 326 | 451 | 1,865 | 2,600 |
| BRAMs | 132 | 164 | 152 | 101 | 174 | 230 | 412 | 628 |
| Throughput | 156 GOPS | 198 GOPS | 101 GOPS | 471 GOPS | 453 GOPS | 2.37 TOPS | 6.12 TOPS | 5.53 TOPS |
Table 6. Performance Comparison with Other FPGA-based Edge and Cloud DNN Deployment Frameworks
For the Ultra96-V2 board, the DPU column represents the results for the DPU integrated with our custom kernel using \(\mathsf {ultra\_fast\_sigmoid}\); for the Alveo card, we picked the high-throughput DPU with \(\mathsf {ultra\_fast\_sigmoid}\), as this combination yielded the best performance in our prior experiments. fpgaConvNet achieved a throughput of 198 GOPS, outperforming HLS4ML and VTA by 1.26\(\times\) and 1.96\(\times\), respectively. fpgaConvNet’s higher throughput appears to be due to its streaming architecture, which outperforms single computation engine frameworks for large batch sizes. fpgaConvNet also benefits from the design space exploration performed by the samo optimizer. While FINN outperforms fpgaConvNet and the DPU by 2.37\(\times\) and 1.03\(\times\), respectively, its low-precision architecture degrades accuracy by more than 30%, which we consider unacceptable from the application perspective.
On the Alveo card, the DPU outperforms fpgaConvNet by 2.33\(\times\); upon inspection, fpgaConvNet was unable to fully utilize the resources provided by the larger FPGA (in comparison to the Ultra96-V2). FINN achieved throughput 1.10\(\times\) higher than the DPU, while implementing an (almost) binary neural network, whose accuracy was around 55\(\%\), which is non-competitive for our purposes.
5.6 Case Study: Interpreting the FA-LAMP Output
The Matrix Profile can be computed using existing methods in an offline context [53], whereas LAMP is used to predict it on streaming data [54]. Regardless of how the Matrix Profile is obtained, subsequent post-processing steps are needed to extract actionable information from it.
As a representative example, we explain how FA-LAMP neural network inference can help a scientist to classify the behavior of an insect in real-time. First, we take the training data (2.5M data points, collected over seven hours) from an insect feeding on a plant. We then create two classes [43]:
Class A: \(\mathsf {Xylem\ Ingestion/Stylet\ Passage}.\)
Class B: \(\mathsf {Non{\text{-}}Probing}.\)
We take a representative dataset from each class (RA and RB) and train two distinct FA-LAMP models, which we denote MA and MB, respectively. Let S be a subsequence of streaming data. If \({\bf M_{A}(S)} \gt {\bf M_{B}(S)}\), then we predict that behavior A is occurring; if \({\bf M_{A}(S)} \lt {\bf M_{B}(S)}\), then we predict that behavior B is occurring; otherwise, the prediction is inconclusive.
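The decision rule above can be sketched as follows; classify() takes the two model outputs for a subsequence S (the trained models MA and MB are stubbed out here with fixed scores):

```python
# Two-model classification rule: given the predicted correlations of a
# subsequence against the two representative datasets, pick the class
# whose model reports the higher correlation.
def classify(score_a, score_b):
    if score_a > score_b:
        return "A"      # Xylem Ingestion / Stylet Passage
    if score_a < score_b:
        return "B"      # Non-Probing
    return "inconclusive"

label = classify(0.9, 0.4)   # stand-in scores for M_A(S) and M_B(S)
```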
For evaluation data, we consider the inference data (2.5M data points, collected over the next five hours from the same insect), whose behavior has also been labeled by an entomologist to provide ground truth. We observed 98.2% accuracy in the results of classification using FA-LAMP. Figure 15 shows the time series and the actual and predicted labels reported by the FA-LAMP model for a snippet of test data. To simplify the representation, the time series is rearranged so the first half represents class A and the second half represents class B. Figure 15 shows a snippet of the first half.
Fig. 15. A snippet of insect EPG time series dataset along with the actual and predicted behavior (Class A: label height=1; Class B: label height=0).
6 DPU INTEGRATION WITH ETHERNET
In a real-world cloud-scale deployment, multiple Alveo cards in a server would be connected through a network switch, allowing them to receive data from external sources. For example, multiple edge devices may transmit sensor data to the server in real time over the Internet. To address the needs of such a deployment, this section describes the integration of a high-throughput DPU with a 100 G Ethernet IP core, allowing an Alveo-based deployment to receive and process data directly from the network.
We built our design on top of the Xilinx TCP stack IP repository [46], which comprises an UltraScale+ Integrated 100 Gb/s Ethernet (CMAC) kernel and a network layer kernel. The CMAC kernel is connected to the Alveo’s GT pins exposed by the Vitis shell and runs at the 100 G Ethernet Subsystem clock frequency of 322 MHz. It exposes two 512-bit AXI4-Stream interfaces (\(S\_AXIS\) and \(M\_AXIS\)) to the user logic, which run at the kernel clock frequency; internally, clock domain crossing logic converts between the kernel clock and the 100 G Ethernet Subsystem clock. The network kernel is a collection of HLS IP cores that provide TCP/IP network functionality, consisting of TCP, ICMP, and ARP modules clocked at 250 MHz. The network kernel exposes AXI4-Stream interfaces to enable the user kernel to open and close TCP/IP connections and to send and receive network data.
Figure 16 depicts the Ethernet subsystem and custom kernel IPs implemented in SLR0 in the Alveo card; due to resource constraints, we had to remove the DPU kernel with four batch engines in SLR0 to fit the CMAC and network layer kernels. As mentioned in Section 3.6, the DPUCAHX8H can be configured to have multiple batch engines that execute model inference in parallel. Each batch engine connects to the global HBM memory using an AXI4 memory mapped interface. The DPU also has an \(s\_axi\_control\) interface, used to start running a task on a DPU core, wait for the task to finish, and clear the DPU’s status. Since the network kernel provided by Xilinx has AXI4-Stream interfaces, we cannot directly connect the kernel to the DPU input ports. One solution would be to transmit the network data to the host and then to the DPU using the VART API; however, this would lead to sub-optimal performance.
Fig. 16. The SLR0 in the Alveo card configured with the Ethernet subsystem and the custom kernel IPs.
To address this bottleneck, we added a memory arbiter module to the network kernel that writes the incoming network data to the memory addresses used by the DPU batch engines. Removing the SLR0 DPU kernel frees up HBM memory channels 14–18, which the memory arbiter uses; it divides the incoming network data into equally sized batches and writes the data to memory channels 0–6 for DPU kernel 2 and memory channels 7–13 for DPU kernel 1. The memory arbiter also provides two memory-mapped AXI master interfaces that connect to the \(s\_axi\_control\) interfaces of the two DPU kernels.
After writing the input data to the corresponding addresses of the five batch engines of each DPU, the memory arbiter starts execution of that DPU kernel by setting the \(reg\_ap\_control\) register to 1 through the \(s\_axi\_control\) interface. This allows the Alveo card to process incoming network data without CPU involvement. The memory arbiter waits for the DPU's interrupt before it signals the start of a new batch.
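The arbiter's dispatch policy can be illustrated with a small software model. This is a sketch under stated assumptions, not the hardware implementation: the constant names, the 256 MiB-per-channel address map, and the `Batch`/`dispatch` helpers are ours; only the two-kernel, five-batch-engine, channels-0–6/7–13 layout comes from the design above.

```cpp
#include <cstdint>
#include <vector>

// Software model of the memory arbiter's dispatch logic (illustrative).
constexpr int kDpuKernels   = 2;   // DPU kernel 2 and DPU kernel 1
constexpr int kBatchEngines = 5;   // batch engines per DPU
constexpr uint64_t kHbmChannelBytes = 256ull << 20;  // assumed HBM channel size

// DPU kernel 2 reads from HBM channels 0-6, DPU kernel 1 from channels 7-13.
constexpr int kBaseChannel[kDpuKernels] = {0, 7};

struct Batch { int dpu; int engine; uint64_t addr; uint64_t bytes; };

// Divide one network payload into equally sized batches and assign each
// batch a destination address in the HBM region of its DPU kernel.
std::vector<Batch> dispatch(uint64_t payload_bytes) {
    std::vector<Batch> out;
    const uint64_t per_batch = payload_bytes / (kDpuKernels * kBatchEngines);
    for (int d = 0; d < kDpuKernels; ++d)
        for (int e = 0; e < kBatchEngines; ++e)
            out.push_back({d, e,
                           kBaseChannel[d] * kHbmChannelBytes + e * per_batch,
                           per_batch});
    return out;
}
```

In the hardware, once all five batches of a DPU have been written, the arbiter raises \(reg\_ap\_control\) for that kernel and then idles until the DPU's interrupt arrives.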
We tested the DPU integrated with the Ethernet subsystem by directly connecting two Alveo U280 cards through their Quad Small Form-factor Pluggable (QSFP) ports. We programmed one of the Alveo cards as a producer of data, combining the CMAC and network layer kernels with a custom user TCP kernel. The TCP kernel opens a TCP connection, providing the IP address and TCP port of the destination, and transmits the data over the network. A Tx control handshake is required before each payload transfer: the user kernel first transmits the session ID and the payload size and, upon receiving a positive acknowledgment from the TCP module, transmits the data. The second Alveo card is programmed as a consumer, with two DPU kernels, the CMAC kernel, and the modified network kernel, which includes the aforementioned memory arbiter module.
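The handshake-then-payload protocol on the producer side can be sketched as follows. The struct and function names are ours, not those of the Xilinx IP, and the three queues merely stand in for the AXI4-Stream interfaces exposed by the network kernel:

```cpp
#include <cstdint>
#include <queue>

// Illustrative model of the producer's Tx control handshake.
struct TxRequest  { uint32_t session; uint32_t bytes; };
struct TxResponse { bool granted; };

// One handshake-then-payload transfer. 'tx_meta', 'tx_status', and
// 'tx_data' model the metadata, status, and data streams, respectively.
bool send_payload(std::queue<TxRequest>& tx_meta,
                  std::queue<TxResponse>& tx_status,
                  std::queue<uint32_t>& tx_data,
                  uint32_t session, uint32_t bytes) {
    tx_meta.push({session, bytes});          // 1. announce session ID + length
    if (tx_status.empty()) return false;     //    (model: reply already queued)
    TxResponse r = tx_status.front();        // 2. wait for the TCP module's reply
    tx_status.pop();
    if (!r.granted) return false;            //    negative ack: do not transmit
    for (uint32_t w = 0; w < bytes / 64; ++w)
        tx_data.push(w);                     // 3. stream the payload in 64-byte beats
    return true;
}
```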
To achieve 100 Gb/s, we pipelined the control handshake and payload transfer between the user kernel and the network kernel in the producer FPGA. Since the control handshake is required for each payload transfer and takes 10 to 30 clock cycles, a sequential handshake-then-transfer scheme may stall the pipeline. To pipeline the process, we established 32 concurrent connections and pinned them to different threads using the OpenMP API; further increasing the number of concurrent connections yielded no additional improvement in our experiments. Next, we transmitted packets whose sizes were a positive integer multiple of 64 bytes. The transmission process buffers portions of the payload in global memory for retransmission in the event of packet loss; both packet loss and memory accesses with unaligned addresses decrease the achievable bandwidth.
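The two producer-side size constraints above can be sketched in a few lines. The helper names are ours; only the 64-byte alignment and the 32-connection count come from the deployment described in the text:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the producer-side sizing rules (illustrative helper names).
constexpr uint32_t kBeatBytes   = 64;  // 512-bit AXI4-Stream beat
constexpr uint32_t kConnections = 32;  // concurrent TCP connections

// Round a payload up to a positive integer multiple of 64 bytes.
uint64_t pad_to_beat(uint64_t bytes) {
    return (bytes + kBeatBytes - 1) / kBeatBytes * kBeatBytes;
}

// Split 'total' bytes across the connections in beat-aligned chunks,
// one chunk per OpenMP thread in our deployment.
std::vector<uint64_t> per_connection_bytes(uint64_t total) {
    std::vector<uint64_t> share(kConnections, 0);
    uint64_t beats = pad_to_beat(total) / kBeatBytes;
    for (uint32_t c = 0; c < kConnections; ++c)
        share[c] = (beats / kConnections + (c < beats % kConnections)) * kBeatBytes;
    return share;
}
```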
Figure 17 shows that the 100 Gb/s QSFP port saturates the available bandwidth at sufficiently large payload sizes. We achieved a peak throughput of 86 Gbit/s for payloads larger than 4 KiB, which is feasible because the DPU and our custom kernel achieve an initiation interval of 1, meaning that no stall cycles occur in the design pipeline. At smaller payloads, the control handshake required for each payload transfer impedes throughput. Maximizing Ethernet throughput therefore required optimizations on both the producer and consumer sides: in the producer's software, we leveraged concurrent TCP connections to hide the control handshake latency, and in the consumer's hardware, we implemented a memory arbiter module that initiates execution of the DPU kernels as soon as network data is received.
Fig. 17. Ethernet module throughput on the Alveo card as a function of payload size.
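A first-order model illustrates why small payloads cannot saturate the link. This is our own back-of-the-envelope formula, not a measured characterization: it charges each transfer a sequential handshake of 10–30 cycles, so it lower-bounds the pipelined 32-connection design, which hides most of that latency.

```cpp
#include <algorithm>
#include <cmath>

// Effective throughput when each payload pays a sequential control
// handshake (illustrative model; assumes one 64-byte beat per cycle).
double effective_gbps(double payload_bytes,
                      double handshake_cycles,
                      double clock_hz = 322e6,    // kernel/CMAC clock
                      double bus_bytes = 64.0) {  // 512-bit AXI4-Stream
    double payload_cycles = std::ceil(payload_bytes / bus_bytes);
    double raw  = bus_bytes * 8.0 * clock_hz / 1e9;  // raw stream rate, Gb/s
    double link = std::min(raw, 100.0);              // capped by the 100 G link
    return link * payload_cycles / (payload_cycles + handshake_cycles);
}
```

Under this model, a 256-byte payload spends most of its cycles on the handshake, while payloads in the multi-kilobyte range approach the link rate, matching the trend in Figure 17.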
7 CONCLUSION AND FUTURE WORK
This article explored FPGA accelerator architectures for time series similarity prediction using CNNs. We integrated a custom IP accelerator block with different Xilinx DPUs to enable whole-model acceleration of the FA-LAMP CNN on two platforms: a Xilinx Ultra96-V2, representative of FPGA-accelerated edge computing, and an Alveo U280 FPGA, representative of a cloud-based system. Compared to a Raspberry Pi 3 and an Edge TPU, our edge design achieved 5.7\(\times\) and 18.2\(\times\) higher inference rates and improved energy efficiency by 8.7\(\times\) and 24\(\times\), respectively. We compared the cloud-based accelerator with LAMP running on a high-end desktop CPU, a server-class CPU, and a GPU. While the FPGAs could not compete with the server CPU in terms of throughput or inference rate, they reduced latency by two orders of magnitude and energy consumption by one order of magnitude. We also compared the performance of the DPU running FA-LAMP to four state-of-the-art frameworks for CNN compilation onto FPGAs; the DPU achieved the highest overall performance, with the exception of one framework (FINN) that uses much lower precision and therefore suffers significant degradation in inference accuracy. Last, we integrated the DPU with a Xilinx 100 Gb/s Ethernet module on the Alveo card, demonstrating the ability to process streaming data obtained directly from the network without the involvement of a host CPU.
We envision several avenues of future work to improve FA-LAMP. We would like to more thoroughly explore the space of sigmoid approximation functions, including piecewise alternatives to \(\mathsf {ultra\_fast\_sigmoid}\), which might be able to reduce its error, and variants of \(\mathsf {sigmoid\_fastexp\_N}\) for values of \(N\) other than 512; there is also considerable opportunity to explore the internal architecture and precision of \(\mathsf {sigmoid\_fastexp\_N}\). We also would like to demonstrate that DPU-like overlays can efficiently implement global average pooling and sigmoid approximation functions, which would alleviate the need to transfer data out of the overlay. Long-term, we would like to harden the FA-LAMP inference engine so it can be integrated into a system-on-chip (SoC), creating a near-sensor CNN inference system that can process streaming data.
Footnotes
1. Historically, Euclidean distance between z-normalized subsequences is used as the distance function for time series data mining tasks [48]; the use of Pearson correlation, which limits the range of correlation values to [\(-\)1, +1], is more recent [53, 54] and is arguably more intuitive, as the maximum Euclidean distance value is unbounded.
2. While FA-LAMP is optimized for streaming time series generated by external sensors, we evaluate FA-LAMP by loading the time series into the HBM and streaming it directly into the FPGA.
3. A single-cycle multiplier is acceptable for our design because we use an 8-bit fixed-point data format; increasing the precision or switching to a floating-point data format may necessitate multi-cycle or pipelined multipliers.
4. Alternative implementations, such as logarithmic number systems or Posits, are also possible, but are neither discussed nor evaluated here.
REFERENCES
- [1] 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. 411–418.
- [2] 2020. Fitbit for chickens? Time series data mining can increase the productivity of poultry farms. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). ACM, 3328–3336.
- [3] 2009. Real-time earthquake detection and hazard assessment by ElarmS across California. Geophys. Res. Lett. 36 (2009).
- [4] 2017. An OpenCL™ deep learning accelerator on Arria 10. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 55–64.
- [5] 2010. Theano: A CPU and GPU math compiler in Python. In Proceedings of the 9th Python in Science Conference. 3–10.
- [6] 2018. TVM: An automated end-to-end optimizing compiler for deep learning. 578–594.
- [7] 2017. Using dataflow to optimize energy efficiency of deep neural network accelerators. IEEE Micro 37, 3 (2017), 12–21.
- [8] 2018. Exploration of low numeric precision deep learning inference using Intel FPGAs. In Proceedings of the IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 73–80.
- [9] 2021. Retrieved from https://coral.ai/.
- [10] 2012. Efficient digital implementation of extreme learning machines for classification. IEEE Trans. Circ. Syst. II Expr. Briefs 59-II, 8 (2012), 496–500.
- [11] 2021. hls4ml: An open-source codesign workflow to empower scientific low-power machine learning devices.
- [12] 2018. A configurable cloud-scale DNN processor for real-time AI. In Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture. IEEE Computer Society, 1–14.
- [13] 2017. Deep Learning with INT8 Optimization on Xilinx Devices. Retrieved from https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf.
- [14] 2020. FPGA-based implementation of stochastic configuration networks for regression prediction. Sensors 20 (2020), 4191.
- [15] 2016. CNN-MERP: An FPGA-based memory-efficient reconfigurable processor for forward and backward propagation of convolutional neural networks. In Proceedings of the 34th IEEE International Conference on Computer Design. IEEE Computer Society, 320–327.
- [16] 2019. Reconfigurable convolutional kernels for neural networks on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 43–52.
- [17] 2021. Retrieved from https://github.com/aminiok1/lamp-alveo.
- [18] 2021. FA-LAMP: FPGA-accelerated learned approximate matrix profile for time series similarity prediction. In Proceedings of the 29th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 40–49.
- [19] 2021. High-level FPGA accelerator design for structured-mesh-based explicit numerical solvers. arXiv preprint arXiv:2101.01177.
- [20] 2014. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
- [21] 2019. HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 242–251.
- [22] 2016. Fast algorithms for convolutional neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4013–4021.
- [23] 2018. The limits of earthquake early warning: Timeliness of ground motion estimates. Sci. Adv. 4, 3 (2018).
- [24] 2017. Predictive edge computing for time series of industrial IoT and large scale critical infrastructure based on open-source software analytic of big data. In Proceedings of the IEEE International Conference on Big Data. IEEE Computer Society, 1663–1669.
- [25] 2021. Xilinx/brevitas.
- [26] 2020. Memory-efficient dataflow inference for deep CNNs on FPGA. arXiv preprint arXiv:2011.07317.
- [27] 2019. PyRAPL software toolkit. Retrieved from https://github.com/powerapi-ng/pyRAPL.
- [28] 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 26–35.
- [29] 2020. Optimizing reconfigurable recurrent neural networks. In Proceedings of the 28th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 10–18.
- [30] 2019. PIR-DSP: An FPGA DSP block architecture for multi-precision deep neural networks. In Proceedings of the 27th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 35–44.
- [31] 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 779–788.
- [32] 2022. SAMO. Retrieved from https://github.com/AlexMontgomerie/samo.
- [33] 2008. Real-time evolutionary earthquake location for seismic early warning. Bull. Seismol. Soc. Amer. 98 (2008), 1482–1494.
- [34] 2016. From high-level deep neural models to FPGAs. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 17:1–17:12.
- [35] 2015. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [36] 2022. tensorflow-onnx. Retrieved from https://github.com/onnx/tensorflow-onnx.
- [37] 2020. Approximating activation functions. arXiv preprint arXiv:2001.06370.
- [38] 2021. Retrieved from https://cloud.google.com/edge-tpu.
- [39] 2022. Retrieved from https://github.com/Xilinx/finn.
- [40] 2016. fpgaConvNet: A framework for mapping convolutional neural networks on FPGAs. In Proceedings of the 24th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE Computer Society, 40–47.
- [41] 2018. Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions. ACM Comput. Surv. 51, 3 (June 2018).
- [42] 2017. Time series classification from scratch with deep neural networks: A strong baseline. 1578–1585.
- [43] 2016. Machine learning for characterization of insect vector feeding. PLoS Comput. Biol. 12, 11 (Nov. 2016), 1–14.
- [44] 2019. Compute-efficient neural-network acceleration. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 191–200.
- [45] 2019. DPU for Convolutional Neural Network v3.0, DPU IP Product Guide. Retrieved from https://www.xilinx.com/support/documentation/ip_documentation/dpu/v3_0/pg338-dpu.pdf.
- [46] 2022. Retrieved from https://github.com/Xilinx/xup_vitis_network_example.
- [47] 2020. CNN-based feature-point extraction for real-time visual SLAM on embedded FPGA. In Proceedings of the 28th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 33–37.
- [48] 2016. Matrix profile I: All pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM). 1317–1322.
- [49] 2020. Light-OPU: An FPGA-based overlay processor for lightweight convolutional neural networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 122–132.
- [50] 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 161–170.
- [51] 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 770–778.
- [52] 2020. The Swiss army knife of time series data mining: Ten useful things you can do with the matrix profile and ten lines of code. Data Min. Knowl. Discov. 34, 4 (2020), 949–979.
- [53] 2019. Matrix profile XIV: Scaling time series motif discovery with GPUs to break a quintillion pairwise comparisons a day and beyond. In Proceedings of the ACM Symposium on Cloud Computing. ACM, 74–86.
- [54] 2019. Matrix profile XVIII: Time series mining in the face of fast moving streams using a learned approximate matrix profile. In Proceedings of the IEEE International Conference on Data Mining. IEEE, 936–945.
FPGA-based Acceleration of Time Series Similarity Prediction: From Cloud to Edge