
QUIDAM: A Framework for Quantization-aware DNN Accelerator and Model Co-Exploration

Published: 24 January 2023


Abstract

As the machine learning and systems communities strive to achieve higher energy efficiency through custom deep neural network (DNN) accelerators, varied precision or quantization levels, and model compression techniques, there is a need for design space exploration frameworks that incorporate quantization-aware processing elements into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QUIDAM, a highly parameterized quantization-aware DNN accelerator and model co-exploration framework. Our framework can facilitate future research on design space exploration of DNN accelerators for various design choices such as bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, number of total processing elements, and DNN configurations. Our results show that different bit precisions and processing element types lead to significant differences in terms of performance per area and energy. Specifically, our framework identifies a wide range of design points where performance per area and energy varies more than 5× and 35×, respectively. With the proposed framework, we show that lightweight processing elements achieve on par accuracy results and up to 5.7× more performance per area and energy improvement when compared to the best 16-bit integer quantization–based implementation. Finally, due to the efficiency of the pre-characterized power, performance, and area models, QUIDAM can speed up the design exploration process by three to four orders of magnitude as it removes the need for expensive synthesis and characterization of each design.


1 INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable accomplishments across various applications, ranging from image recognition [47] and object detection [48] to natural language processing [6]. However, the increasing model size and computational cost of these models have become a major challenge for on-device machine learning (ML) due to the stringent performance per area and energy constraints of edge devices. To this end, while machine learning practitioners focus on model compression techniques [5, 8, 15], computer architects investigate hardware architectures to overcome the energy-efficiency problem and improve overall system performance [2, 3, 14, 18, 19, 20, 21, 22, 23, 24, 25, 39, 42].

As the computing community hits the limits of consistent performance scaling for traditional architectures, there has been rising interest in enabling on-device machine learning applications through custom domain-specific DNN accelerators. These domain-specific system-on-chip architectures are designed to exploit application characteristics. Because performance per area and energy efficiency are first-order concerns from a hardware point of view, tailored DNN accelerators have shown significant improvements over general-purpose CPUs and GPUs [2, 10, 27, 28, 29, 37]. To better understand the tradeoffs of various architectural design choices and DNN workloads for domain-specific hardware architectures, there is a need for a design space co-exploration framework that can rapidly iterate over various designs and generate power, performance, and area (PPA) results for various DNN models. To this end, in this work we present QUIDAM, a framework for quantization-aware DNN accelerator and model co-exploration.

This work makes the following contributions:

  • We present QUIDAM, a quantization-aware power, performance, and area modeling framework for DNN accelerators. Our framework can enable future research on DNN accelerator and model co-exploration for various design choices such as bit precision, processing element types, scratchpad sizes of processing elements, global buffer size, device bandwidth, number of total processing elements, and DNN architectures.

  • Our framework provides power, performance, and area results not just for a single hardware design point but for a range of different hardware designs as opposed to prior art [1, 2, 4, 31, 38]. Thus, it can be used to jointly analyze tradeoffs of various architectural design choices and DNN workloads and perform multi-objective optimization to achieve a better tradeoff front between accuracy and hardware-efficiency metrics such as performance per area and energy.

The rest of the article is organized as follows. In Section 2, we present a literature review on power and runtime models for CNNs and on design space exploration frameworks for hardware accelerators. In Section 3, we describe the architectural details of the QUIDAM framework and our methodology for power, performance, and area modeling of DNN accelerators. In Section 4, we show experimental results demonstrating the accuracy of QUIDAM’s power, performance, and area models and the efficacy of lightweight processing elements compared to conventional designs in terms of performance per area and energy through a suite of case studies. Finally, Section 5 concludes the article by summarizing the results.


2 RELATED WORK

Prior art has proposed runtime and energy models for DNN workloads [1, 34, 38]. However, these models are implemented specifically for GPU platforms, which severely limits their use for design space exploration of hardware architectures and for hardware and machine learning model co-design [13, 51, 53].

In contrast, the systems community has proposed several tools and simulation methodologies for DNN accelerator design. Table 1 compares existing hardware accelerator frameworks to QUIDAM in terms of various design features, listed in chronological order.

Table 1.
Optimized for CNNs | Fully-Parameterized RTL | Row-Stationary Dataflow | Quantization Support | Lightweight PE (Shift-Add Based) | Pre-Characterized Model-based DSE | DNN Accelerator/Model Co-Exploration
Aladdin [43]
SCALE-Sim [40]
Eyeriss [2]
MAERI [31]
MAESTRO [30]
HAQ [49]
Accelergy [50]\(^{1}\)
Timeloop [36]\(^{2}\)
Gemmini [11]\(^{3}\)
QUIDAM (Ours)\(^{4}\)
  • 1: Supports INT8, INT16, FP32.

  • 2: Supports INT8, INT16, FP32.

  • 3: Supports INT8, UINT32, FP32.

  • 4: Supports INT4, INT8, INT16, FP32.

Table 1. Comparison of DNN Accelerator Frameworks


For example, Aladdin [43] is a pre-RTL power, performance, and area estimation tool for arbitrary hardware accelerators. This simulator takes high-level language descriptions of algorithms as inputs, similar to a high-level synthesis methodology, and generates dynamic data dependence graphs as an approximate representation of a hardware accelerator without generating RTL. While this approach can be useful for fast exploration of various algorithms, it has limitations when optimizing the generated hardware accelerators for deep neural networks because it is not built as a domain-specific architecture.

Similarly, SCALE-Sim [40] is a cycle-accurate, systolic-array-based DNN accelerator simulator. It has a Python-based cycle-accurate model that generates hardware performance and utilization metrics. Although this tool can support rapid exploration of systolic-array-based DNN accelerators for a given DNN layer, the built-in model is difficult to modify for a different hardware accelerator architecture, and it lacks key features such as quantization support, lightweight processing elements, and DNN accelerator and model co-exploration options.

As mapping deep neural networks onto DNN accelerators plays an important role in energy efficiency, Eyeriss [2] analyzed various dataflow strategies in the DNN accelerator domain and proposed a novel approach called row-stationary dataflow, which outperforms other dataflow strategies in terms of throughput and energy efficiency [2]. However, Eyeriss is only one instance in the vast design space of DNN accelerators, and its implementation is not open source to foster future research on DNN accelerator and model co-exploration.

To this end, MAERI [31] and MAESTRO [30] have been proposed to support spatial-array architectures and enrich the design space with various dataflows, such as the row-stationary dataflow originally proposed by Eyeriss [2]. MAERI [31] presents a new DNN accelerator architecture implemented with a set of modular and reconfigurable building blocks that can easily support various DNN mappings onto the accelerator by utilizing switches. On top of that, MAESTRO [30] presents an analytical cost model to predict the hardware cost of dataflows and perform a design space exploration. These open source frameworks help researchers perform a design space exploration over various dataflows. However, they cannot support various bit precision levels or efficient processing elements that utilize cheap shift logic instead of expensive multipliers.

To further improve the computational efficiency of hardware accelerators, researchers have investigated tuning the optimal bit precision level for each layer of a deep neural network. To this end, HAQ [49] proposed an automated hardware-aware quantization framework that leverages reinforcement learning to automatically tune the quantization scheme for a given hardware architecture by integrating the power and performance feedback signals coming from the hardware architecture into the design loop.

As HAQ reveals that the optimal bit precision levels on different hardware architectures and resource constraints are significantly different, recently Accelergy [50] and Timeloop [36] have been proposed to complement the design space exploration process by supporting different bit precision levels with an architecture-level energy estimation methodology [50] and a design space exploration framework [36] for DNN accelerators.

Accelergy [50] presents an architecture-level energy estimation methodology for hardware accelerators that takes in an architecture description and hardware activity statistics, such as action counts, which must be generated for a given workload by a separate performance model. Although Accelergy [50] proposes a fast energy estimation methodology for DNN accelerators, it cannot perform a design space exploration of hardware architectures by itself. To this end, Timeloop [36] presents a framework for evaluating and performing a design space exploration of DNN accelerators. Timeloop [36] proposes a generic template to describe DNN hardware accelerators, characterizes deep learning workloads, and provides a practical design space exploration tool to understand the tradeoffs in designing DNN accelerators across different workloads.

Although these tools perform preliminary analyses of the design space for DNN accelerators from different angles, they do not incorporate specialized quantization-aware lightweight processing elements, and they do not generate or share a highly parameterized RTL implementation of the chosen design based on the input hardware configuration. This is a significant impediment to deploying DNNs onto edge devices, as the actual deployment of the hardware design takes a significant amount of engineering effort.

Finally, Gemmini [11] has been proposed as an open source co-processor/accelerator generator framework that leverages a flexible architectural template to represent different accelerator architectures. Moreover, Gemmini [11] provides a parameterized RTL implementation that gives hardware architects more control over the design and potentially subtle insights on how different architectural decisions affect the overall performance of the system. To enable system researchers to investigate and run software stacks on the generated hardware accelerator, Gemmini [11] also supports a full system-on-chip environment. From the hardware implementation point of view, its internal DNN accelerator is a systolic-array-based architecture similar to SCALE-Sim [40], which is relatively simpler than spatial-array-based dataflow architectures [2]. Therefore, it does not support more sophisticated dataflows such as row-stationary dataflow or quantization-aware lightweight processing elements that can enlarge the hardware accelerator design space and push the Pareto-frontier even further in terms of accuracy and hardware-efficiency metrics such as energy efficiency and performance per area. Moreover, this framework does not have a pre-characterized analytical hardware cost model, so it is not suitable for rapid design space exploration of DNN accelerators or for DNN accelerator and model co-exploration research. Consequently, there is a need for a framework that helps system architects and machine learning researchers quickly iterate over various hardware accelerator designs and DNN configurations while providing an accurate hardware cost model and a flexible implementation, so that researchers can easily build their novel ideas on top of it without going through the tedious and laborious effort of RTL implementation from scratch.

To this end, we propose QUIDAM, a highly parameterized spatial-array-based DNN accelerator framework. As a domain-specific architecture, it has implicit optimizations for CNNs, yet it also offers sufficient flexibility to change the microarchitectural features of the design, to enrich the design space with lightweight processing element implementations, and to vary crucial DNN model parameters such as bit precision, thereby fully supporting DNN accelerator and model co-exploration.

As the systems community investigates novel hardware architectures and tools to enable efficient inference on the edge, the ML community has focused on model compression techniques toward the same objectives. Specifically, pruning [15] and quantization [52] have received great interest in the ML community for designing models for edge devices. In quantization, prior work focuses on improving fixed-point quantization [9, 26]. Additionally, hardware-friendly quantization schemes have also been proposed [8, 33, 46]. In this work, we implement a specific quantization scheme chosen for its hardware efficiency, as it relies on reduced representations using a limited sum of powers-of-two [8].


3 METHODOLOGY

In this section, we present the proposed QUIDAM framework, shown in Figure 1. First, the framework takes hardware accelerator parameters and deep neural network configurations as inputs to obtain power, performance, and area models for the next stages. Figure 2 shows the available hardware accelerator and DNN configuration parameters that can be chosen in the QUIDAM framework. We describe the implementation details and architectural components of QUIDAM, as depicted in Figure 2 (Section 3.1). We also detail the lightweight processing elements (LightPE) implemented in our framework to provide a specialized processing element (PE) type for quantized DNN models (Section 3.2). After developing the power, performance, and area models for a wide range of accelerator and DNN configurations (Section 3.3), we perform a Pareto-optimality analysis for accuracy and crucial hardware-efficiency metrics such as performance per area and energy (Sections 4.3 and 4.4). Finally, we carry out a DNN accelerator and model co-exploration analysis by generating candidate hardware and DNN configurations, demonstrating the generalizability of the proposed QUIDAM framework by co-exploring the design space of both hardware and DNN configurations (Section 4.5). As a result, the proposed framework provides power, performance, and area estimates along with a parameterized RTL implementation, fostering future research on design space co-exploration of DNN accelerators and DNN configurations.

Fig. 1. Overview of the QUIDAM framework.

Fig. 2. Schematic of the QUIDAM framework. The framework takes in accelerator parameters and layerwise DNN configurations and generates power, performance, and area results, as well as statistics on hardware utilization and memory accesses.

3.1 QUIDAM Framework

To enable comprehensive design space exploration of DNN accelerators for on-device machine learning, we implemented QUIDAM, a highly parameterized spatial-array-based DNN accelerator framework, in RTL. Our framework enables hardware designers and machine learning practitioners to rapidly iterate over various accelerator designs and DNN configurations and to better understand the tradeoffs of different architectural components of the design under the demanding requirements of deploying machine learning models to edge devices. Moreover, hardware designers can also use the automatically generated RTL code to follow the design synthesis flow.

As depicted in Figure 2, the QUIDAM framework is based on spatial-array accelerators and utilizes the row-stationary dataflow, which has been demonstrated to optimize data movement in the storage hierarchy and improve the energy efficiency of the system [2]. QUIDAM features a set of processing elements organized as a 2D array and a global buffer that stores input feature maps, filters, and activations. The number of PEs in each dimension can be tuned for different power, performance, and area requirements. Each PE includes input feature map, filter, and partial sum scratchpads, and a multiply-accumulate (MAC) unit that can be implemented as a conventional MAC unit or a specialized shift-add unit based on the desired bit precision. Each of these architectural components can be tuned in a flexible and automated manner to perform a comprehensive design space co-exploration for on-device edge accelerators and DNN models.
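
To make this parameterization concrete, the sketch below shows one way such a design point could be captured as a configuration object. The field names, value ranges, and the example design point are illustrative assumptions made for exposition, not QUIDAM's actual interface.

from dataclasses import dataclass

@dataclass
class AcceleratorConfig:
    """Illustrative accelerator design point (names and example values are assumptions)."""
    pe_rows: int        # number of PE rows in the 2D array
    pe_cols: int        # number of PE columns in the 2D array
    sp_if_words: int    # input feature map scratchpad size per PE
    sp_fw_words: int    # filter weight scratchpad size per PE
    sp_ps_words: int    # partial sum scratchpad size per PE
    glb_kb: int         # global buffer size in KB
    pe_type: str        # one of {"FP32", "INT16", "LightPE-1", "LightPE-2"}

    @property
    def num_pes(self) -> int:
        return self.pe_rows * self.pe_cols

# Example design point (values chosen only to illustrate the parameter space)
cfg = AcceleratorConfig(pe_rows=12, pe_cols=14, sp_if_words=12, sp_fw_words=224,
                        sp_ps_words=24, glb_kb=108, pe_type="LightPE-1")
print(cfg.num_pes)  # 168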

3.2 Lightweight Processing Elements

To enrich the design space of hardware accelerators and achieve a better Pareto-frontier for performance per area and energy efficiency, we include LightPE implementations in our framework. LightPEs utilize 8 bits for activations, as well as 4 bits and 8 bits for weights for LightPE-1 and LightPE-2 designs, respectively. As 4-bit and 8-bit quantization techniques for on-device machine learning have become prevalent in various computing platforms, we provide these specialized quantization-aware PE types in our QUIDAM framework to help hardware designers to enrich their design space and find better Pareto-frontiers.

Specifically, LightPEs use a special power-of-two quantization scheme [8] that quantizes the weights of the neural network into a limited sum of powers-of-two. In this case, the multiplication between the activation and weight can then be replaced by shifts and adds. More generally speaking, a multiplication between an 8-bit activation x and an 8-bit weight w can be formulated as follows: (1) \(\begin{align} \begin{split} y &= x\times w\\ &= \sum _{i=0}^7 {1\!\!1}(w_i) \times (x \lt \lt i)\\ &\text{where}~{1\!\!1}(w_i) = {\left\lbrace \begin{array}{ll} 1 & \text{the $i{\rm th}$ bit of $w$ is 1}\\ 0 & \text{otherwise} \end{array}\right.}, \end{split} \end{align}\) where \(\lt \lt\) denotes a left shift operator. Based on Equation (1), LightPE implementations approximate the sum with k shifts and \(k-1\) add operations, which was shown to be effective for significantly increasing energy efficiency [7, 8]. We implement such an idea in LightPE-1 (for one shift) and LightPE-2 (for two shifts and one addition). In addition to LightPEs, we have also incorporated the design of a conventional 16-bit integer quantization (INT16) implementation and a conventional full-precision 32-bit floating point (FP32) implementation.
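
As a quick functional check of Equation (1), the snippet below re-implements the bit-decomposition in plain Python and compares it against an ordinary multiplication; it only illustrates the shift-add idea and is not the hardware implementation.

def shift_add_multiply(x: int, w: int, bits: int = 8) -> int:
    """Multiply activation x by weight w using only shifts and adds (Equation (1))."""
    y = 0
    for i in range(bits):
        if (w >> i) & 1:        # indicator: the i-th bit of w is 1
            y += x << i         # accumulate the shifted activation
    return y

# A weight with two set bits needs two shifts and one add
assert shift_add_multiply(57, 0b00010010) == 57 * 0b00010010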

Figure 3 shows the detailed architectures of the FP32, INT16, LightPE-1, and LightPE-2 processing elements. Each processing element has four FIFOs for the input feature map, the filter, the input partial sum, and the output partial sum. Moreover, each processing element contains three scratchpad memories, namely an input feature map scratchpad, a filter scratchpad, and a partial sum scratchpad, which receive data from the aforementioned FIFOs. After the data are read from the scratchpads, the multiplication between weights and activations is implemented differently for each processing element type. More specifically, Figure 3(a) shows the FP32 processing element used in the QUIDAM framework. The FP32 implementation uses a 32-bit floating point multiplier and adder, as is common in conventional systems. Figure 3(b) shows the INT16 processing element, which has a 16-bit integer multiplier and adder. Figures 3(c) and 3(d) show the LightPE-1 and LightPE-2 implementations, respectively. The LightPE-1 implementation utilizes a shift instead of a multiplier and uses 8 bits for activations and 4 bits for weights. To store a weight w = \(\pm 2^{-m}\), where m = \(0,1, \ldots , 7\), LightPE-1 needs 4 bits: one bit for sign(w) and 3 bits for \(|m|\). In contrast, the LightPE-2 implementation utilizes two shifts and an addition, as shown in Figure 3(d). LightPE-2 utilizes 8 bits for activations and 8 bits for weights. To store a weight w = \(\pm (2^{-m_{1}} + 2^{-m_{2}})\), where \(m_{1},m_{2}\) = \(0, 1, \ldots , 7\), LightPE-2 requires 7 bits: one bit for sign(w), 3 bits for \(|m_{1}|\), and 3 bits for \(|m_{2}|\); for easier hardware implementation, 8 bits are used. After completing the multiplication or shift operations, all processing element implementations rely on a multiplexer for accumulating the input partial sum data. Similarly, a second multiplexer on the partial sum scratchpad is used to reset the accumulation. Finally, after the final addition operation, the data are sent to the output partial sum FIFO and the result is available.
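
A small functional sketch of the LightPE weight formats described above is shown below. The bit-packing order and the truncating integer shifts are assumptions made for illustration; the datapaths in Figure 3 are the authoritative description.

def encode_lightpe1(sign: int, m: int) -> int:
    """Pack a LightPE-1 weight w = +/- 2^(-m), m in 0..7, into 4 bits:
    one sign bit followed by 3 bits for |m| (illustrative packing order)."""
    assert sign in (0, 1) and 0 <= m <= 7
    return (sign << 3) | m

def lightpe1_multiply(x: int, packed_w: int) -> int:
    """Approximate x * w with a single shift (result truncated to an integer)."""
    sign = (packed_w >> 3) & 1
    m = packed_w & 0b111
    y = x >> m                      # x * 2^(-m)
    return -y if sign else y

def lightpe2_multiply(x: int, sign: int, m1: int, m2: int) -> int:
    """LightPE-2: w = +/- (2^(-m1) + 2^(-m2)) becomes two shifts and one add."""
    y = (x >> m1) + (x >> m2)
    return -y if sign else y

# w = -2^(-2) = -0.25 applied to activation x = 100 gives -25
print(lightpe1_multiply(100, encode_lightpe1(sign=1, m=2)))
# w = 2^(-1) + 2^(-3) = 0.625 applied to x = 80 gives 40 + 10 = 50
print(lightpe2_multiply(80, sign=0, m1=1, m2=3))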

Fig. 3. Detailed architectures of FP32, INT16, LightPE-1, and LightPE-2 processing elements.

Besides their low-precision benefits such as reducing the storage requirements, LightPEs also replace the multiplications with a more energy and area-efficient shift operation or a limited number of shifts and add operations [7, 8]. Therefore, LightPEs also achieve significant energy and area gains when compared to full-precision 32-bit floating point– and 16-bit integer-based designs. As a result, LightPEs provide an enriched design space for hardware designers and machine learning practitioners to analyze various tradeoffs between accuracy and performance per area and energy.

To this end, we perform a design space exploration with four different processing element types, namely FP32, INT16, LightPE-1, and LightPE-2, and compare them in terms of normalized performance per area and normalized energy with respect to the INT16 design point with the highest performance per area. As seen in Figure 4, normalized energy varies by \(35 \times\) within almost the same performance per area region, and normalized performance per area varies by \(5 \times\) within almost the same energy region. In addition, while most of the configurations are clustered around the knee of the scatter plot regardless of the quantization level, we note that FP32 configurations dominate the highest-energy ones, while LightPE-1 configurations push performance per area to the highest values, orders of magnitude larger than the INT16 case. Therefore, different PE types and precision levels lead to significant differences in terms of performance per area and energy. These results also reinforce the need for a design space exploration framework that incorporates quantization-aware hardware and rapidly iterates over various designs.

Fig. 4. Different PE types and bit precisions lead to significant differences in performance per area and energy. Therefore, there is a need for a design space exploration framework that incorporates quantization-aware processing elements and rapidly iterates over various designs.

3.3 Power, Performance, and Area Modeling

To build our quantization-aware power, performance, and area models, we use various hardware and DNN configurations. Specifically, to cover this comprehensive design space of hardware accelerators, we generate a variety of possible designs by varying global buffer size, number of PEs per row and column in the 2D PE array, bit precision, and PE type (FP32, INT16, LightPE-1, and LightPE-2). Within each PE, we also vary individual scratchpad sizes for input feature map, filter, and partial sum scratchpads.

We use Synopsys Design Compiler and FreePDK45, a commonly used open source process design kit [45], to synthesize our designs and obtain power, area, and initial timing results. We use the Synopsys VCS RTL simulator to perform functional verification and collect timing information for various DNN configurations, namely VGG-16 [44], ResNet-20, ResNet-34, ResNet-50, and ResNet-56 [16], which are implemented in our testbenches. After collecting power, area, and timing results from these tools, we fit polynomial regression models and use model selection techniques based on k-fold cross validation [35] to tune the degree of the polynomial.

More concretely, a degree-K polynomial regression model is defined as (2) \(\begin{align} \begin{split} F(\mathbf {x}) = \sum _j c_j \prod _i x_i^{q_{ij}}\\ \text{where}~\mathbf {x}\in \mathbb {R}^{d}; q_{ij}\in \mathbb {N}; \forall j, \sum _{i} q_{ij} \le K, \end{split} \end{align}\) where \(\mathbf {x}\) is a d-dimensional input feature vector with \(x_i\) its \(i{\text{th}}\) feature, the index j denotes the \(j{\text{th}}\) term of the polynomial, and the \(c_j\) are learnable coefficients. We note that polynomial regression has previously been used to model the power and performance of deep neural networks running on fixed hardware (i.e., desktop GPUs) [1, 38]. For the novel design space where both the hardware and the network configurations can vary, we explain below how the features should be chosen for modeling. We detail the feature space for our proposed power, area, and latency models in the following.
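
To illustrate how a model of the form in Equation (2) is fit in practice, the sketch below uses scikit-learn's polynomial feature expansion with a linear regressor on synthetic stand-in data; in QUIDAM, the training data instead come from the synthesis and simulation results described above, and this is not the framework's actual fitting code.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Stand-in data: four hardware features (SP_if, SP_ps, SP_fw, #PE) and a target metric
rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(200, 4))
y = 0.5 * X[:, 3] + 0.01 * X[:, 0] * X[:, 3] + rng.normal(0, 0.1, 200)

# Degree-K polynomial regression as in Equation (2): PolynomialFeatures enumerates
# every monomial of total degree <= K, and LinearRegression fits the coefficients c_j.
K = 5
model = make_pipeline(PolynomialFeatures(degree=K), LinearRegression())
model.fit(X, y)
print(model.predict(X[:3]))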

Power. We use Synopsys Design Compiler with inherently assumed switching activity when generating the power consumption. As a result, we choose \(\mathbf {x}\) to be a four-dimensional vector that includes the scratchpad size for input feature map (\(\mathsf {SP_{\mathsf {if}}}\)), the scratchpad size for partial sum (\(\mathsf {SP_{\mathsf {ps}}}\)), the scratchpad size for filter weights (\(\mathsf {SP_{\mathsf {fw}}}\)), and the number of PEs (\(\mathsf {\#PE}\)). Finally, we develop individual models for each processing element type to improve the performance of our models as power depends on the PE type.

Area. For area modeling, we use the same features as in power modeling, because the features that affect power and area come from the same source. The area model only depends on the hardware configuration in contrast to the latency model that depends on both the hardware and the deep neural network configuration. More specifically, we choose \(\mathbf {x}\) to be a four-dimensional vector that includes the following: the scratchpad size for input feature map (\(\mathsf {SP_{\mathsf {if}}}\)), the scratchpad size for partial sum (\(\mathsf {SP_{\mathsf {ps}}}\)), the scratchpad size for filter weights (\(\mathsf {SP_{\mathsf {fw}}}\)), and the number of PEs (\(\mathsf {\#PE}\)). Similarly to power modeling, we build individual models for each processing element type as the arithmetic units differ between PE types.

Latency. Latency depends on both the hardware and the neural network configurations. To cope with diverse network configurations, we adopt a layer-level latency modeling strategy. Specifically, we use the polynomial model to infer the per-layer latencies and sum them to obtain a network-level value. As a result, our training data for the polynomial model is at a layer-level granularity. We use a 12-dimensional feature vector for the latency model, which includes \(\mathsf {SP_{\mathsf {if}}}\), \(\mathsf {SP_{\mathsf {ps}}}\), \(\mathsf {SP_{\mathsf {fw}}}\), the number of rows in the PE array (\(\mathsf {PE}_{\mathsf {rows}}\)), the number of columns in the PE array (\(\mathsf {PE}_{\mathsf {col}}\)), the global buffer size, the input feature map dimension (\(\mathsf {A}\)), the input channel count (\(\mathsf {C}\)), the filter count (\(\mathsf {F}\)), the kernel size (\(\mathsf {K}\)), the stride (\(\mathsf {S}\)), and the padding (\(\mathsf {P}\)). For ResNets, we add two more binary features that indicate whether the layer contains a regular skip connection \(\mathsf {RS}\in \lbrace 0,1\rbrace\) or a dotted skip connection \(\mathsf {DS}\in \lbrace 0,1\rbrace\). Similarly to the power and area models, we build latency models specific to each processing element type to capture the dependence of latency on the PE implementation. We use these latency models to obtain the performance results by taking the inverse of the latency estimates. Therefore, we refer to the inverse of latency as performance throughout the article.
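
The layer-level strategy amounts to concatenating the hardware features with each layer's features, querying the latency model per layer, and summing the predictions. A minimal sketch is shown below, assuming a scikit-learn-style regressor; it illustrates the strategy rather than QUIDAM's actual code structure.

def predict_network_latency(latency_model, hw_features, layer_features):
    """Sum per-layer latency predictions to obtain a network-level estimate.

    latency_model  : fitted regressor over the 12(+2)-dimensional feature vector
    hw_features    : list of hardware features (scratchpad sizes, PE rows/cols, GLB size)
    layer_features : list of per-layer DNN features (A, C, F, K, S, P, ...)
    """
    total_latency = 0.0
    for layer in layer_features:
        x = hw_features + layer                      # concatenate hardware and layer features
        total_latency += float(latency_model.predict([x])[0])
    return total_latency

# Performance is then reported as the inverse of the estimated latency:
# performance = 1.0 / predict_network_latency(model, hw, layers)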

Figure 5 depicts the model selection methodology used in our framework. To tune the degree of the polynomial, we apply model selection techniques based on k-fold cross validation [35] and jointly compare the mean absolute percentage error (MAPE) and the root mean square percentage error (RMSPE). As shown in the figure, both MAPE and RMSPE decrease continuously as the polynomial degree increases from two to five; beyond degree five, both errors increase and become significantly higher. Therefore, for the power, performance, and area models, we use degree-five polynomials, for which both MAPE and RMSPE are negligibly small, as shown in Figure 5.
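
The degree selection described above can be reproduced with a standard cross-validation loop. The sketch below, using scikit-learn's KFold, is one plausible realization of the procedure rather than the exact script behind Figure 5.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def mape(y_true, y_pred):
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def rmspe(y_true, y_pred):
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)) * 100)

def degree_selection(X, y, degrees=range(2, 8), n_splits=5):
    """Return cross-validated (MAPE, RMSPE) for each candidate polynomial degree."""
    scores = {}
    for k in degrees:
        fold_mape, fold_rmspe = [], []
        kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, val_idx in kfold.split(X):
            model = make_pipeline(PolynomialFeatures(degree=k), LinearRegression())
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[val_idx])
            fold_mape.append(mape(y[val_idx], pred))
            fold_rmspe.append(rmspe(y[val_idx], pred))
        scores[k] = (np.mean(fold_mape), np.mean(fold_rmspe))
    return scores  # pick the degree where both errors are lowest (degree five in Figure 5)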

Fig. 5. Comparison of the performance results of the model with respect to the degree of the polynomial model. A polynomial order of five is chosen for the power, performance, and area modeling, since it achieves the lowest root mean square percentage error (RMSPE) and mean absolute percentage error (MAPE) at the same time.


4 RESULTS

In this section, we present the power, performance, and area modeling results for each processing element type and perform a design space exploration on various DNN models, namely VGG-16 [44], ResNet-20, ResNet-34, ResNet-50, and ResNet-56 [16], on the CIFAR-10, CIFAR-100, and ImageNet datasets, demonstrating the flexibility of QUIDAM for future studies.

4.1 Power, Performance, and Area Model Accuracy

The power, performance, and area models detailed in Section 3 speed up the design space exploration by three to four orders of magnitude. Indeed, QUIDAM enables fast design exploration, since it reduces the characterization process from days (synthesizing the RTL implementation and determining power, performance, and area) to seconds (querying the trained models).

Figures 6, 7, and 8 show the actual and estimated power, performance, and area results for each processing element type (FP32, INT16, LightPE-1, and LightPE-2). Each data point in these figures corresponds to a different hardware accelerator configuration in the comprehensive design space. As the results show, QUIDAM’s PPA models achieve high correlation with the actual PPA values.

Fig. 6. Power estimation results for various processing element types such as FP32, INT16, LightPE-1, and LightPE-2. Each data point corresponds to a different hardware configuration that can be achieved by using the corresponding processing element type. As can be seen, the proposed polynomial model agrees closely with the actual values extracted from the synthesis tools.

Figures 6 and 8 also show that the FP32 implementation has the highest area and power cost, whereas LightPEs have the lowest area and power when a single processing element is considered. This demonstrates the hardware efficiency of LightPEs compared to conventional PE implementations.

We also note that Figures 6 and 8 show a higher correlation with the actual power and area results than the corresponding latency results shown in Figure 7. This is expected, because the performance results depend on both the hardware accelerator and the deep neural network configurations, whereas the power and area models use only hardware accelerator features. Therefore, building a near-perfect performance model is more difficult given the dimensionality and richness of the feature space.

Fig. 7. Performance estimation results for various processing element types such as FP32, INT16, LightPE-1, and LightPE-2. Each data point corresponds to a different hardware configuration that can be achieved by using the corresponding processing element type. As can be seen, the proposed polynomial model agrees closely with the actual values extracted from the synthesis tools.

Fig. 8. Area estimation results for various processing element types such as FP32, INT16, LightPE-1, and LightPE-2. Each data point corresponds to a different hardware configuration that can be achieved by using the corresponding processing element type. As can be seen, the proposed polynomial model agrees closely with the actual values extracted from the synthesis tools.

4.2 Accelerator Design Space Exploration Results

To show the efficacy of LightPEs compared to conventional PE designs, we perform a design space exploration on various DNN models, namely VGG-16 [44], ResNet-20, ResNet-34, ResNet-50, and ResNet-56 [16], on the CIFAR-10, CIFAR-100, and ImageNet datasets, as shown in Figure 9. We show the normalized performance per area and normalized energy results for each PE type with respect to the baseline INT16-based implementation with the highest performance per area in the given design space. Performance per area is a useful metric because different processing element implementations use different bit precisions, which affects their required storage. Therefore, we use performance per area as a comparison metric in our analysis.

Fig. 9. Violin plots showing the full distribution of normalized performance per area (left chart) and normalized energy (right chart) results with respect to the best INT16 configuration for various processing element types such as FP32, INT16, LightPE-1, and LightPE-2. Each plot shows the full spectrum results for each PE type and black bars show the minimum, the maximum, and the median values. LightPEs provide higher performance per area and energy-efficient designs when compared to FP32- and INT16-based designs.

As can be seen, LightPE implementations consistently outperform conventional INT16 and FP32 in both the performance per area and energy objectives, which demonstrates their hardware efficiency. Figure 9 shows the full distribution, including the minimum, the maximum, and the median results for each PE type. LightPEs consistently provide higher performance per area and lower-energy design points than FP32- and INT16-based designs. Specifically, LightPE-1 and LightPE-2 achieve \(4.8 \times\) and \(4.1 \times\) more performance per area and \(4.7 \times\) and \(4 \times\) less energy on average across the VGG-16, ResNet-20, and ResNet-56 workloads on CIFAR-10/CIFAR-100 and the VGG-16, ResNet-34, and ResNet-50 workloads on ImageNet when compared to the best INT16 hardware configuration, respectively. In comparison, the INT16 baseline implementation achieves \(1.8 \times\) more performance per area and \(1.5 \times\) less energy on average when compared to the best FP32 configuration.

These conclusions hold for all the models and datasets considered in this work, namely VGG-16, ResNet-20, ResNet-34, ResNet-50, and ResNet-56, thereby showing that the benefits of using lower precision generalize across a variety of models. We conclude that different bit precisions and PE types can lead to significantly different performance per area and energy results, which are two critical metrics that the machine learning and systems communities strive to improve upon.

4.3 Pareto-optimality for Accuracy and Performance per Area

To show the accuracy and performance per area tradeoff for different processing element types, we perform a Pareto-front analysis by training the VGG-16, ResNet-20, and ResNet-56 models on the CIFAR-10 and CIFAR-100 datasets. For both datasets, we perform five runs for each DNN model and processing element type and plot the mean top-1 accuracy results. The training recipe for both CIFAR-10 and CIFAR-100 follows prior art [5, 17]: stochastic gradient descent with Nesterov momentum, weight decay 0.0005, batch size 128, and an initial learning rate of 0.1 decreased by \(5 \times\) at epochs 60, 120, and 160, trained for 200 epochs in total. We note that this training recipe is tuned for full-precision models; therefore, the accuracy results for the LightPE variants might be higher with proper hyperparameter tuning.
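
For reference, the recipe above maps directly onto a standard optimizer and step schedule. The sketch below is one PyTorch rendering of it; the momentum value of 0.9 and the placeholder model are assumptions, since the text does not specify them.

import torch

model = torch.nn.Linear(3 * 32 * 32, 10)   # placeholder; VGG-16/ResNet-20/ResNet-56 in the paper
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=5e-4)
# Learning rate decreased by 5x at epochs 60, 120, and 160 (gamma = 1/5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60, 120, 160], gamma=0.2)

for epoch in range(200):
    # ... one training epoch over CIFAR with batch size 128 ...
    scheduler.step()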

Figure 10 shows the normalized performance per area and accuracy results for FP32, INT16, LightPE-1, and LightPE-2. Performance per area results are normalized with respect to the best INT16 configuration for each DNN model. We plot the hardware configurations with the highest performance per area for each processing element type. Next, we perform a Pareto-front analysis among the different processing element types and show the Pareto-frontier with a dashed line for each DNN model. As can be seen, LightPEs are consistently on the Pareto-front for the various DNN models and datasets, whereas FP32- and INT16-based designs are occasionally dominated by LightPE variants, mostly because the LightPE implementations push the Pareto-frontier by being more hardware-efficient in terms of performance per area and energy. Moreover, in certain cases, such as the CIFAR-10 results, the LightPE-2-based design dominates FP32 (for ResNet-20) and both INT16 and FP32 (for ResNet-56) not only in the hardware-efficiency metrics but also in accuracy. To sum up, LightPE-1 and LightPE-2 achieve accuracy on par with FP32 and INT16 while achieving up to \(5.7 \times\) and \(4.9 \times\) more performance per area than the INT16 configuration, respectively.

Fig. 10. Normalized performance per area and top-1 accuracy results for various processing element types such as FP32, INT16, LightPE-1, and LightPE-2 for CIFAR-10 (left chart) and CIFAR-100 (right chart). Each data point corresponds to the hardware configuration with the highest performance per area for the corresponding processing element type. The Pareto-front is shown with a dashed line for each DNN model. LightPEs are consistently on the Pareto-front for the various DNN models.

4.4 Pareto-optimality for Accuracy and Energy

We also perform a Pareto-front analysis for the accuracy and energy results, following the same training methodology explained in Section 4.3. Figure 11 shows the normalized energy and accuracy results for the FP32-, INT16-, LightPE-1-, and LightPE-2-based designs. Energy results are normalized with respect to the best INT16 configuration for each DNN model. As can be seen, LightPEs are systematically on the Pareto-front for the various DNN models and datasets. Specifically, LightPE-1 and LightPE-2 achieve \(4.7 \times\) and \(4 \times\) less energy on average across different workloads and datasets when compared to the INT16 configuration, respectively. In addition, we note that as the model complexity increases, the accuracy gap between LightPEs and conventional FP32- and INT16-based designs decreases. Thus, we conclude that our proposed LightPEs show promising results for larger models, with negligible accuracy loss and significant performance per area and energy improvements.

Fig. 11. Normalized energy and top-1 error results for various processing element types such as FP32, INT16, LightPE-1, and LightPE-2 for CIFAR-10 (left chart) and CIFAR-100 (right chart). Each data point corresponds to the hardware configuration with the lowest energy for the corresponding processing element type. The Pareto-front is shown with a dashed line for each DNN model. LightPEs are consistently on the Pareto-front for the various DNN models.

We summarize our findings in Table 2, which shows the Pareto-optimal results for each PE type in terms of model accuracy and hardware-efficiency metrics such as performance per area and energy for the VGG-16, ResNet-20, and ResNet-56 models on the CIFAR-10 and CIFAR-100 datasets. Our results show that LightPE implementations provide on-par accuracy across models and datasets while improving hardware efficiency in terms of energy and performance per area when compared to FP32- and INT16-based designs. As can be seen from Table 2, LightPEs consistently dominate FP32- and INT16-based designs in terms of hardware-efficiency metrics such as energy and performance per area. In addition, although we only claim that LightPEs can achieve accuracy similar to FP32 and INT16 designs, in certain cases, such as ResNet-20 and ResNet-56 on the CIFAR-10 dataset, the LightPE-2-based design achieves higher accuracy than the FP32-based design (for ResNet-20) and than both the FP32- and INT16-based designs (for ResNet-56).

Table 2.
Network | PE Type | Accuracy (%) CIFAR-10 | Accuracy (%) CIFAR-100 | Energy | Performance per Area
VGG-16 | FP32 | 93.96 | 73.28 | \(1.2 \times\) | \(0.69 \times\)
VGG-16 | INT16 | 93.87 | 73.31 | \(1 \times\) | \(1 \times\)
VGG-16 | LightPE-2 | 93.78 | 73.16 | \(0.20 \times\) | \(4.9 \times\)
VGG-16 | LightPE-1 | 93.60 | 72.88 | \({\bf 0.18} \times\) | \({\bf 5.7} \times\)
ResNet-20 | FP32 | 92.48 | 68.85 | \(1.8 \times\) | \(0.48 \times\)
ResNet-20 | INT16 | 92.82 | 69.13 | \(1 \times\) | \(1 \times\)
ResNet-20 | LightPE-2 | 92.68 | 68.64 | \(0.29 \times\) | \(3.4 \times\)
ResNet-20 | LightPE-1 | 92.22 | 66.78 | \({\bf 0.25} \times\) | \({\bf 4.1} \times\)
ResNet-56 | FP32 | 93.72 | 72.18 | \(1.6 \times\) | \(0.53 \times\)
ResNet-56 | INT16 | 93.60 | 72.03 | \(1 \times\) | \(1 \times\)
ResNet-56 | LightPE-2 | 93.75 | 71.94 | \(0.27 \times\) | \(3.8 \times\)
ResNet-56 | LightPE-1 | 93.13 | 70.83 | \({\bf 0.22} \times\) | \({\bf 4.6} \times\)

Table 2. Pareto-optimal Results

Furthermore, Table 3 shows the clock frequency values found by QUIDAM for designs with different PE types. We note that LightPE-based implementations provide up to \(1.7 \times\) and \(1.6 \times\) speedup when compared to the FP32- and INT16-based designs, respectively. In addition, the LightPE-2 and LightPE-1 implementations achieve 435 and 455 MHz in the 45-nm technology node [45], respectively. As the Eyeriss design reports its core clock frequency as 200 MHz in a 65-nm technology node [4], we apply the prominent technology scaling rules to make a fair comparison among the designs. Based on these scaled calculations [41], we note that QUIDAM finds LightPE implementations that are \(1.5 \times\) to \(1.6 \times\) faster than the Eyeriss [4] design. Moreover, with the same INT16-based implementation, the QUIDAM-generated DNN accelerator configuration achieves a similar clock frequency (197 MHz) after technology scaling.

Table 3.
PE Type | Clock Frequency (MHz)
FP32 | 275
INT16 | 285
LightPE-2 | 435
LightPE-1 | 455

Table 3. Clock Frequency Values of QUIDAM-generated Designs with Different PE Types
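
The scaled comparison above can be reproduced with a simple first-order rule in which clock frequency is assumed to scale inversely with the technology feature size; this rule is an assumption made for illustration, and the exact scaling rules from [41] may differ.

def scale_frequency(f_mhz: float, node_from_nm: float, node_to_nm: float) -> float:
    """First-order node scaling: frequency assumed proportional to 1 / feature size."""
    return f_mhz * node_from_nm / node_to_nm

eyeriss_at_45nm = scale_frequency(200, 65, 45)   # Eyeriss: 200 MHz at 65 nm -> ~289 MHz
print(455 / eyeriss_at_45nm)                     # LightPE-1: ~1.6x faster
print(435 / eyeriss_at_45nm)                     # LightPE-2: ~1.5x faster
print(scale_frequency(285, 45, 65))              # QUIDAM INT16 projected to 65 nm: ~197 MHz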

4.5 DNN Accelerator and Model Co-Exploration

So far, we have demonstrated the usefulness of our proposed framework mainly by varying the hardware architecture for a set of commonly adopted neural network designs. In this subsection, we demonstrate the generalizability of the proposed framework by co-exploring the design space of both hardware configurations and neural network architectures. To do so, we need an accuracy model for neural architectures, in addition to the proposed power, performance, and area hardware cost models, to rapidly iterate over different DNN models and perform a DNN accelerator and model co-exploration analysis. We note that an accuracy proxy model is needed, since the search space of DNN architectures is extremely large (hundreds of thousands of candidates), which would incur an untenable training cost. To this end, we adopt the weight-sharing evaluation technique to estimate the accuracy of a candidate neural network architecture [12, 32]. More specifically, we first define a neural architecture search space around the existing VGG-16 network, as shown in Table 4. The search space used in our analysis is composed of Conv-BN-ReLU and MaxPool blocks, with the number of repetitions of each block ranging from 1 to 3 and the number of channels ranging from 40 to 512 depending on the layer position. The largest configuration is the VGG-16 architecture, and smaller variants can be searched; the baseline VGG-16 model is obtained by choosing the largest number of repetitions per block and the largest number of channels available in the search space of Table 4. The entire search space contains 110,592 candidate neural network architectures, obtained by multiplying the number of possible choices at each step of candidate architecture construction (a small sketch following Table 4 reproduces this count). To obtain an accuracy predictor, we train the neural network by randomly sampling an architecture from the search space for each batch while sharing the weights with the largest neural network architecture [12, 32]. After training, we randomly select 1,000 network architectures and directly evaluate their accuracy on the validation set as the output of the accuracy predictor.

Table 4.
Block | Number of Repetitions | Channels
Conv-BN-ReLU | {1, 2} | {40, 48, 56, 64}
MaxPool | 1 | N/A
Conv-BN-ReLU | {1, 2} | {80, 96, 112, 128}
MaxPool | 1 | N/A
Conv-BN-ReLU | {1, 2, 3} | {160, 192, 224, 256}
MaxPool | 1 | N/A
Conv-BN-ReLU | {1, 2, 3} | {320, 384, 448, 512}
MaxPool | 1 | N/A
Conv-BN-ReLU | {1, 2, 3} | {320, 384, 448, 512}
MaxPool | 1 | N/A

Table 4. Search Space for Neural Architectures Used in DNN Accelerator and Model Co-exploration Analysis
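
The sketch below reads the search space off Table 4 under one plausible interpretation (one repetition count and one channel width chosen per Conv-BN-ReLU block), which reproduces the stated count of 110,592 candidates, and shows the kind of per-batch random sampling used during weight-sharing training.

import random

# (repetition choices, channel choices) per Conv-BN-ReLU block, as listed in Table 4
search_space = [
    ([1, 2],    [40, 48, 56, 64]),
    ([1, 2],    [80, 96, 112, 128]),
    ([1, 2, 3], [160, 192, 224, 256]),
    ([1, 2, 3], [320, 384, 448, 512]),
    ([1, 2, 3], [320, 384, 448, 512]),
]

# Size of the space: product of the number of choices per block
size = 1
for reps, channels in search_space:
    size *= len(reps) * len(channels)
print(size)  # 110592

def sample_architecture():
    """Randomly pick one candidate, as done for each batch during supernet training."""
    return [(random.choice(reps), random.choice(chs)) for reps, chs in search_space]

print(sample_architecture())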

Figure 12 shows the normalized energy and normalized area versus top-1 model error for various DNN accelerator and model pairs on the CIFAR-10 dataset. We randomly sample accelerator configurations, use 1,000 DNN models, and evaluate each accelerator and model pair in terms of Pareto-optimality. Energy and area results are normalized with respect to the minimum-energy and minimum-area points in the INT16 design space, respectively. Figure 12 shows that LightPEs remain on the Pareto-front even when the DNN accelerator and model configurations are co-explored, which demonstrates the efficacy of LightPEs not only on a few commonly adopted DNN models but across a generalized DNN accelerator and model co-design space.
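
The Pareto-optimality evaluation of accelerator and model pairs amounts to filtering out dominated points. A minimal sketch of such a filter over (top-1 error, cost) pairs is shown below, with illustrative values rather than the actual measurements behind Figure 12.

def pareto_front(points):
    """Return the points that are not dominated in (error, cost).

    A point is dominated if another point is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for p in points:
        dominated = any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)
        if not dominated:
            front.append(p)
    return front

# Illustrative (top-1 error %, normalized energy) pairs for candidate design points
candidates = [(6.1, 1.20), (6.2, 0.20), (6.4, 0.18), (7.0, 0.50)]
print(pareto_front(candidates))   # the last point is dominated and filtered out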

Fig. 12. Normalized energy (left chart) and normalized area (right chart) vs. top-1 error results for various DNN configurations and processing element types such as FP32, INT16, LightPE-1, and LightPE-2. Each data point corresponds to a different hardware and DNN architecture pair, normalized to the minimum-energy (left chart) and minimum-area (right chart) pair in the INT16 design space. The Pareto-front for the co-exploration space is shown with a dashed line. As can be seen, LightPEs are consistently on the Pareto-front even when DNN accelerator and model configurations are co-explored.

Based on these analyses and results, we conjecture that QUIDAM can successfully provide a wide range of DNN accelerator and model pairs for different needs in terms of accuracy and critical hardware-efficiency metrics such as performance per area and energy. Therefore, we conclude that QUIDAM can be used for DNN accelerator and model co-design, as it incorporates quantization-aware hardware and PPA models that significantly speed up the co-design effort.


5 CONCLUSION

In this work, we present QUIDAM, a quantization-aware, highly parameterized DNN accelerator and model co-exploration framework. Our framework can foster future research on design space co-exploration of DNN accelerators for various design choices such as bit precision, processing element type, scratchpad size of processing elements, global buffer size, device bandwidth, number of total processing elements in the design, and DNN configurations. Our results show that different bit precisions and processing element types lead to significant differences in terms of performance per area and energy. Specifically, LightPE-1 and LightPE-2 achieve \(4.8 \times\) and \(4.1 \times\) more performance per area and \(4.7\times\) and \(4 \times\) less energy on average when compared to the best INT16 hardware configuration, respectively. We also show that our proposed LightPEs consistently achieve Pareto-optimal results in terms of accuracy, performance per area, and energy for commonly used DNN models as well as when DNN accelerator and model configurations are co-explored. Therefore, design space co-exploration of quantization-aware DNN accelerators and models merits a meticulous analysis that takes these factors into account.

REFERENCES

  1. [1] Cai Ermao, Juan Da-Cheng, Stamoulis Dimitrios, and Marculescu Diana. 2017. Neuralpower: Predict and deploy energy-efficient convolutional neural networks. In Proceedings of the Asian Conference on Machine Learning (ACML’17), 622637.Google ScholarGoogle Scholar
  2. [2] Chen Yu-Hsin, Emer Joel, and Sze Vivienne. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, Los Alamitos, CA, 367379. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. [3] Chen Yu-Hsin, Emer Joel, and Sze Vivienne. 2017. Using dataflow to optimize energy efficiency of deep neural network accelerators. IEEE Micro 37, 3 (2017), 1221. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. [4] Chen Yu-Hsin, Krishna Tushar, Emer Joel S., and Sze Vivienne. 52(1):127-138, 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circ. 52, 1 (January2017), 127138. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  5. [5] Chin Ting-Wu, Ding Ruizhou, Zhang Cha, and Marculescu Diana. 2020. Towards efficient model compression via learned global ranking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20).Google ScholarGoogle ScholarCross RefCross Ref
  6. [6] Devlin J., Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT’19).Google ScholarGoogle Scholar
  7. [7] Ding Ruizhou, Liu Zeye, Blanton R. D. (Shawn), and Marculescu Diana. 2018. Lightening the load with highly accurate storage- and energy-efficient LightNNs. ACM Trans. Reconfig. Technol. Syst. 11, 3, Article 17 (December2018), 24 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. [8] Ding Ruizhou, Liu Zeye, Shi Rongye, Marculescu Diana, and Blanton R. D. (Shawn). 2017. LightNN: Filling the gap between conventional deep neural networks and binarized networks. In Proceedings of the on Great Lakes Symposium on VLSI 2017 (GLSVLSI’17). Association for Computing Machinery, New York, NY, 3540. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. [9] Esser Steven K., McKinstry Jeffrey L., Bablani Deepika, Appuswamy Rathinakumar, and Modha Dharmendra S.. 2020. Learned step size quantization. In Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=rkgO66VKDS.Google ScholarGoogle Scholar
  10. [10] Gao Mingyu, Pu Jing, Yang Xuan, Horowitz Mark, and Kozyrakis Christos. 2017. TETRIS: Scalable and efficient neural network acceleration with 3D memory. SIGARCH Comput. Arch. News 45, 1 (2017), 751764. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. [11] Genc Hasan, Kim Seah, Amid Alon, Haj-Ali Ameer, Iyer Vighnesh, Prakash Pranav, Zhao Jerry, Grubb Daniel, Liew Harrison, Mao Howard, Ou Albert, Schmidt Colin, Steffl Samuel, Wright John, Stoica Ion, Ragan-Kelley Jonathan, Asanovic Krste, Nikolic Borivoje, and Shao Yakun Sophia. 2021. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In Proceedings of the 58th Annual Design Automation Conference (DAC’21).Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. [12] Guo Zichao, Zhang Xiangyu, Mu Haoyuan, Heng Wen, Liu Zechun, Wei Yichen, and Sun Jian. 2020. Single path one-shot neural architecture search with uniform sampling. In European Conference on Computer Vision. Springer, 544560.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. [13] Gupta Suyog and Akin Berkin. 2020. Accelerator-aware neural network design using AutoML. arXiv preprint arXiv:2003.02838, 2020. Retrieved from https://arxiv.org/abs/2003.02838.Google ScholarGoogle Scholar
  14. [14] Han Song, Liu Xingyu, Mao Huizi, Pu Jing, Pedram Ardavan, Horowitz Mark A., and Dally William J.. 2016. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the International Conference on Computer Architecture (ISCA’16).Google ScholarGoogle Scholar
  15. [15] Han Song, Mao Huizi, and Dally William J. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations (ICLR’16).Google ScholarGoogle Scholar
  16. [16] He K., Zhang X., Ren S., and Sun J.. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 770778.Google ScholarGoogle ScholarCross RefCross Ref
  17. [17] He Yang, Kang Guoliang, Dong Xuanyi, Fu Yanwei, and Yang Yi. 2018. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18). AAAI Press, 22342240. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. [18] Inci Ahmet, Bolotin Evgeny, Fu Yaosheng, Dalal Gal, Mannor Shie, Nellans David, and Marculescu Diana. 2020. The architectural implications of distributed reinforcement learning on CPU-GPU systems. arXiv:2012.04210. Retrieved from https://arxiv.org/abs/2012.04210.Google ScholarGoogle Scholar
  19. [19] Inci Ahmet, Isgenc Mehmet Meric, and Marculescu Diana. 2021. Cross-layer design space exploration of NVM-based caches for deep learning In Proceedings of the 12th Non-Volatile Memories Workshop (NVMW). Retrieved from http://nvmw.ucsd.edu/nvmw2021-program/nvmw2021-data/nvmw2021-paper37-final_version_your_extended_abstract.pdf.Google ScholarGoogle Scholar
  20. [20] Inci Ahmet, Isgenc Mehmet Meric, and Marculescu Diana. 2021. DeepNVM++: Cross-layer modeling and optimization framework of non-volatile memories for deep learning. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. (2021), 11. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. [21] Inci Ahmet, Isgenc Mehmet Meric, and Marculescu Diana. 2022. Efficient deep learning using non-volatile memory technology. arXiv:2206.13601. Retrieved from https://arxiv.org/abs/2006.13601.Google ScholarGoogle Scholar
  [22] Inci Ahmet and Marculescu Diana. 2018. Solving the non-volatile memory conundrum for deep learning workloads. In Proceedings of the Architectures and Systems for Big Data Workshop in Conjunction with ISCA.
  [23] Inci Ahmet, Virupaksha Siri Garudanagiri, Jain Aman, Thallam Venkata Vivek, Ding Ruizhou, and Marculescu Diana. 2022. QADAM: Quantization-aware DNN accelerator modeling for Pareto-optimality. arXiv:2205.13045. Retrieved from https://arxiv.org/abs/2205.13045.
  [24] Inci Ahmet, Virupaksha Siri Garudanagiri, Jain Aman, Thallam Venkata Vivek, Ding Ruizhou, and Marculescu Diana. 2022. QAPPA: Quantization-aware power, performance, and area modeling of DNN accelerators. arXiv:2205.08648. Retrieved from https://arxiv.org/abs/2205.08648.
  [25] Inci Ahmet Fatih, Isgenc Mehmet Meric, and Marculescu Diana. 2020. DeepNVM: A framework for modeling and analysis of non-volatile memory technologies for deep learning applications. In Proceedings of the 23rd Conference on Design, Automation and Test in Europe (DATE'20). 1295–1298.
  [26] Jacob Benoit, Kligys Skirmantas, Chen Bo, Zhu Menglong, Tang Matthew, Howard Andrew, Adam Hartwig, and Kalenichenko Dmitry. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'18). 2704–2713.
  [27] Jang Jun-Woo, Lee Sehwan, Kim Dongyoung, Park Hyunsun, Ardestani Ali Shafiee, Choi Yeongjae, Kim Channoh, Kim Yoojin, Yu Hyeongseok, Abdel-Aziz Hamzah, Park Jun-Seok, Lee Heonsoo, Lee Dongwoo, Kim Myeong Woo, Jung Hanwoong, Nam Heewoo, Lim Dongguen, Lee Seungwon, Song Joon-Ho, Kwon Suknam, Hassoun Joseph, Lim SukHwan, and Choi Changkyu. 2021. Sparsity-aware and re-configurable NPU architecture for Samsung flagship mobile SoC. In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA'21). 15–28.
  [28] Jouppi N., Young C., Patil Nishant, Patterson David A., Agrawal Gaurav, Bajwa R., Bates Sarah, Bhatia Suresh, Boden N., Borchers Al, Boyle Rick, Cantin Pierre luc, Chao Clifford, Clark Chris, Coriell Jeremy, Daley Mike, Dau Matt, Dean J., Gelb Ben, Ghaemmaghami T., Gottipati R., Gulland William, Hagmann R., Ho C. R., Hogberg Doug, Hu John, Hundt R., Hurt D., Ibarz J., Jaffey A., Jaworski Alek, Kaplan Alexander, Khaitan Harshit, Killebrew Daniel, Koch Andy, Kumar Naveen, Lacy Steve, Laudon J., Law James, Le Diemthu, Leary Chris, Liu Zhuyuan, Lucke Kyle A., Lundin Alan, MacKean G., Maggiore A., Mahony Maire, Miller K., Nagarajan R., Narayanaswami Ravi, Ni Ray, Nix K., Norrie Thomas, Omernick Mark, Penukonda Narayana, Phelps A., Ross Jonathan, Ross Matt, Salek Amir, Samadiani E., Severn C., Sizikov G., Snelham Matthew, Souter J., Steinberg D., Swing Andy, Tan Mercedes, Thorson G., Tian Bo, Toma H., Tuttle Erick, Vasudevan Vijay, Walter Richard, Wang Walter, Wilcox Eric, and Yoon D. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA'17). 1–12.
  [29] Jouppi Norman P., Yoon Doe Hyun, Ashcraft Matthew, Gottscho Mark, Jablin Thomas B., Kurian George, Laudon James, Li Sheng, Ma Peter, Ma Xiaoyu, Norrie Thomas, Patil Nishant, Prasad Sushma, Young Cliff, Zhou Zongwei, and Patterson David. 2021. Ten lessons from three generations shaped Google's TPUv4i: Industrial product. In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA'21). 1–14.
  [30] Kwon Hyoukjun, Chatarasi Prasanth, Pellauer Michael, Parashar Angshuman, Sarkar Vivek, and Krishna Tushar. 2019. Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'19). ACM, 754–768.
  [31] Kwon Hyoukjun, Samajdar Ananda, and Krishna Tushar. 2018. MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. In Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'18). 461–475.
  [32] Li Liam and Talwalkar Ameet. 2020. Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence. PMLR, 367–377.
  [33] Li Yuhang, Dong Xin, and Wang Wei. 2020. Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks. In Proceedings of the International Conference on Learning Representations.
  [34] Marculescu Diana, Stamoulis Dimitrios, and Cai Ermao. 2018. Hardware-aware machine learning: Modeling and optimization. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'18). IEEE Press, 1–8.
  [35] Mosteller F. and Tukey J. W. 1968. Data analysis, including statistics. In Handbook of Social Psychology, Lindzey G. and Aronson E. (Eds.). Addison-Wesley, Vol. 2.
  [36] Parashar Angshuman, Raina Priyanka, Shao Yakun Sophia, Chen Yu-Hsin, Ying Victor A., Mukkara Anurag, Venkatesan Rangharajan, Khailany Brucek, Keckler Stephen W., and Emer Joel. 2019. Timeloop: A systematic approach to DNN accelerator evaluation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'19). 304–315.
  [37] Parashar Angshuman, Rhu Minsoo, Mukkara Anurag, Puglielli Antonio, Venkatesan R., Khailany B., Emer J., Keckler Stephen W., and Dally W. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In Proceedings of the International Symposium on Computer Architecture (ISCA'17).
  [38] Qi Hang, Sparks Evan R., and Talwalkar Ameet. 2017. Paleo: A performance model for deep neural networks. In Proceedings of the International Conference on Learning Representations.
  [39] Aly Mohamed M. Sabry, Gao Mingyu, Hills Gage, Lee Chi-Shuen, Pitner Greg, Shulaker Max M., Wu Tony F., Asheghi Mehdi, Bokor Jeff, Franchetti Franz, Goodson Kenneth E., Kozyrakis Christos, Markov Igor, Olukotun Kunle, Pileggi Larry, Pop Eric, Rabaey Jan, Ré Christopher, Wong H.-S. Philip, and Mitra Subhasish. 2015. Energy-efficient abundant-data computing: The N3XT 1,000x. Computer 48, 12 (2015), 24–33.
  [40] Samajdar Ananda, Zhu Yuhao, Whatmough Paul, Mattina Matthew, and Krishna Tushar. 2018. SCALE-Sim: Systolic CNN accelerator simulator. arXiv:1811.02883. Retrieved from https://arxiv.org/abs/1811.02883.
  [41] Sarangi Satyabrata and Baas Bevan. 2021. DeepScaleTool: A tool for the accurate estimation of technology scaling in the deep-submicron era. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS'21). 1–5.
  [42] Shao Y., Clemons Jason, Venkatesan Rangharajan, Zimmer B., Fojtik Matthew R., Jiang Nan, Keller Ben, Klinefelter Alicia, Pinckney N., Raina Priyanka, Tell S., Zhang Yanqing, Dally W., Emer J., Gray C. T., Khailany B., and Keckler S. 2019. Simba: Scaling deep-learning inference with multi-chip-module-based architecture. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'19).
  [43] Shao Yakun Sophia, Reagen Brandon, Wei Gu-Yeon, and Brooks David. 2014. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. In Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture (ISCA'14). 97–108.
  [44] Simonyan Karen and Zisserman Andrew. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations.
  [45] Stine J. E., Castellanos I., Wood M., Henson J., Love F., Davis W. R., Franzon P. D., Bucher M., Basavarajaiah S., Oh J., and Jenkal R. 2007. FreePDK: An open-source variation-aware design kit. In Proceedings of the IEEE International Conference on Microelectronic Systems Education (MSE'07).
  [46] Tambe Thierry, Yang En-Yu, Wan Zishen, Deng Yuntian, Reddi Vijay Janapa, Rush Alexander, Brooks David, and Wei Gu-Yeon. 2020. Algorithm-hardware co-design of adaptive floating-point encodings for resilient deep learning inference. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC'20). IEEE, 1–6.
  [47] Tan Mingxing and Le Quoc. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 6105–6114.
  [48] Tan Mingxing, Pang R., and Le Quoc V. 2020. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'20). 10778–10787.
  [49] Wang Kuan, Liu Zhijian, Lin Yujun, Lin Ji, and Han Song. 2019. HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'19).
  [50] Wu Yannan N., Emer Joel S., and Sze Vivienne. 2019. Accelergy: An architecture-level energy estimation methodology for accelerator designs. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'19).
  [51] Yang Lei, Yan Zheyu, Li Meng, Kwon Hyoukjun, Lai Liangzhen, Krishna Tushar, Chandra Vikas, Jiang Weiwen, and Shi Yiyu. 2020. Co-exploration of neural architectures and heterogeneous ASIC accelerator designs targeting multiple tasks. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC'20). 1–6.
  [52] Zhou Shuchang, Wu Yuxin, Ni Zekun, Zhou Xinyu, Wen He, and Zou Yuheng. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160. Retrieved from https://arxiv.org/abs/1606.06160.
  [53] Zhou Yanqi, Dong Xuanyi, Akin Berkin, Tan Mingxing, Peng Daiyi, Meng Tianjian, Yazdanbakhsh Amir, Huang Da, Narayanaswami Ravi, and Laudon James. 2021. Rethinking co-design of neural architectures and hardware accelerators. arXiv:2102.08619. Retrieved from https://arxiv.org/abs/2102.08619.


Published in

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 2 (March 2023), 560 pages.
ISSN: 1539-9087; EISSN: 1558-3465. DOI: 10.1145/3572826.
Editor: Tulika Mitra.


            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 24 January 2023
            • Online AM: 1 September 2022
            • Accepted: 28 July 2022
            • Revised: 16 May 2022
            • Received: 20 October 2021
            Published in TECS Volume 22, Issue 2.
