Abstract
The efficiency of a deep neural network (DNN) solution on real hardware devices is mainly determined by the DNN architecture and the compiler-level scheduling strategy on the hardware. While trying to fully exploit the underlying hardware and obtain the optimal tradeoff between DNN accuracy and runtime performance, we discovered that the two optimization goals of DNN architecture and scheduling policy are intimately related to each other. However, current hardware-aware Neural Architecture Search (NAS) methods primarily focus on the DNN architecture search process, ignoring the effects of various compiler-level scheduling strategies (e.g., graph-level optimization, loop transformations, parallelization, etc.) on the network candidates being evaluated during the search. As a result, they may overlook the true-optimal DNN implementations on hardware, which can only be discovered by trying out different combinations of scheduling strategies and DNN architectures. This work proposes a NAS framework (CHaNAS) that searches for not only the network architecture but also the dedicated compiler-level scheduling policy, yielding the optimal co-design solution on the target hardware. We propose a block-based pre-scheduling methodology that reduces the co-design search space and enables the automatic generation of the optimal co-design, including the network architecture and the tensor programs that realize the scheduling policy. Further, we introduce a new search objective function based on the generalization gap to prevent the selection of architectures that are prone to overfitting. We evaluate CHaNAS on ImageNet on different hardware back-ends against the state-of-the-art hardware-aware search method based on the MobileNet-v3 search space.
Experimental results show that the co-design solutions obtained by CHaNAS yield up to 1.6×, 1.9×, and 1.7× performance boosts on the NVIDIA P100 GPU, Intel Xeon 8163 CPU, and Samsung Note 10 Mobile, respectively, over baselines of the same-level accuracy.
1 INTRODUCTION
The widespread use of deep learning applications has fueled the development of Deep Neural Network (DNN) model design and efficient deployment on rapidly evolving hardware. There is a deep stack of optimization technologies accessible when building an efficient application or domain-specific neural network system [6, 17], including enhanced neural network architecture [24, 25, 26, 54], optimized frameworks and compilers [4, 12, 18, 40, 46], and even customized hardware design [16, 47, 52, 53]. However, rather than using DNN optimization technology on its own, it has been demonstrated that cross-stack co-design approaches that orchestrate techniques from several layers can produce better outcomes [5, 14, 30, 52]. For example, some recent works [14, 22, 23, 28, 37] have investigated how to design an optimized neural network architecture that fully considers the characteristics of the target hardware. Such a hardware-aware network architecture design method, i.e., hardware-aware Neural Architecture Search (NAS), can automate the DNN design and even exceed earlier human-designed models by incorporating hardware features into the NAS loop.
Previous hardware-aware NAS methods simply took into account the co-optimization of DNN architecture and hardware-related design variables like representation precision and resource provision [30, 44, 50]. However, as we have emphasized, the efficiency of DNN applications also depends on multiple layers of optimization techniques. Specifically, in addition to neural network architectures, how to schedule the neural network onto the hardware at the compiler level, such as task graph reordering, loop reordering, loop tiling, memory customization, and other compiler-managed scheduling policies, plays an important role in the performance of the target neural network system [15, 19, 29]. For example, a typical hardware-aware NAS framework usually relies on a performance model to estimate the performance of the candidate neural networks on the target hardware, then decides whether the discovered network should be kept or updated during the iterative search process. However, in these evaluation-and-search iterations, almost all prior hardware-aware NAS works assume a fixed network scheduling policy that may not extract the best performance from the under-evaluation network architecture on the hardware. Some other works focus on tuning the schedule mapping strategy when given a neural network model to optimize the performance for different hardware [15, 20, 49], but they ignore the correlation between the interactive stacks and lead to locally optimal solutions. In contrast, we show that to discover the best solution at the system level, an ideal hardware-aware NAS framework should incorporate compiler-level optimization, i.e., scheduling strategies, for the target hardware. In this article, we offer the Compiler and Hardware aware Network Architecture Search (CHaNAS) framework for the first time to achieve this goal.
In CHaNAS, we orchestrate two key components from the deep learning optimization stack: the DNN architecture and the scheduling policy in the compiler that tactically maps a model onto the target hardware. As shown in Figure 1, CHaNAS constructs a joint search space that combines DNN architectures and scheduling-level optimizations, as opposed to traditional methods that either optimize the neural network using NAS based on a fixed schedule policy from a given deep learning library (e.g., Tensorflow) or tune the scheduling policy to maximize the inference speed of a given network. For each candidate network evaluated in the search space, CHaNAS can measure and discover the realistic peak performance that is revealed by trying out different schedule policies, such as instruction-level scheduling, memory allocation, loop transformations, and so on, and then choose the true-optimal one. Realizing this goal poses multiple challenges. First, we must automatically construct the joint search space to cover as many DNN/schedule-policy design pairs as possible for the given AI task and target hardware. Second, we must precisely and efficiently evaluate each design pair, which is a time-consuming process. Finally, we need to search efficiently in the huge search space to return the optimal co-design solution, which includes not only the DNN implementation but also the corresponding high-performance tensor programs for the potential hardware back-ends.
Fig. 1. CHaNAS jointly designs the neural architecture and the computation-graph scheduler to fit the target hardware characteristics. With an enlarged optimization design search space, CHaNAS is more likely to achieve better results than existing scheduler-agnostic hardware-aware NAS methods.
To this end, CHaNAS utilizes a block-based hierarchical design representation to construct the huge co-design search space automatically. In the CHaNAS search space, both the network architecture design and the network scheduling are conducted on the basis of neural network blocks, which helps divide the entire network design space exploration into two coordinated phases. We then apply a supernet that acts as an accuracy predictor and use the pre-scheduled blocks to reduce the evaluation cost. Finally, to reduce the search cost, we divide the search space and employ an evolutionary search technique. Overall, as an automated schedule-aware neural architecture search method, CHaNAS combines the strengths of NAS with schedule optimizations that fit the target hardware, allowing a larger design space for architecture design. Therefore, with a greater degree of design freedom than previous schedule-agnostic hardware-aware NAS, it is more likely to find the optimal specialized DNN solution for varied tasks and different hardware back-ends. We evaluate CHaNAS on ImageNet on different hardware back-ends against the state-of-the-art (SOTA) hardware-aware search method MobileNet-v3 [24]. Experiments show that the co-design solutions obtained by CHaNAS yield up to 1.6×, 1.9×, and 1.7× performance boosts on the NVIDIA P100 GPU, Intel Xeon 8163 CPU, and Samsung Note 10 Mobile, respectively, over baselines of the same-level accuracy.
In summary, this article makes the following contributions:
We propose the first automatic joint search methodology for neural network architecture and scheduling policy on different hardware, CHaNAS. By using a block-based hierarchical design representation, CHaNAS constructs a large joint co-design space and is more likely to locate the optimal solution for the target hardware under specified design constraints than existing scheduler-agnostic hardware-aware NAS methods.
We carefully design the blocks that are used to construct the target DNN model and choose the block size to trade off the architecture search cost against the schedule optimization effect. We also propose an elastic supernet that includes all blocks and is employed by CHaNAS to generate child neural network models during the solution search. The basic building blocks of this supernet can be automatically pre-scheduled and evaluated on the target hardware to expose their optimal performance to CHaNAS during the search at little cost.
Due to the huge co-design search space, we divide the co-design search space according to the constraints of the deployment scenario to reduce the search cost, and employ an evolutionary search strategy to explore the reduced co-design search space, which can gradually generate higher-quality co-design results for the target hardware platform.
To further improve the quality of the automatically generated co-design solution, we propose a new objective function to enhance the generalization capability of the target network, which we empirically show helps find networks that are less influenced by the weight-sharing problem. In our experiments, the introduction of the new objective function improves the accuracy of the discovered DNN solutions by \( 0.5\% \).
2 BACKGROUND AND MOTIVATION
In this section, we first introduce two critical factors that impact the DNN solution on the target hardware: the efficient DNN architecture design and the schedule strategy that maps the DNN on different hardware. Then we introduce the heuristic approach that is not only typically used in the DNN architecture search but also used in our co-design process. Finally, we present the observation that motivated us to automatically co-design the DNN architecture and corresponding schedule strategy for the target hardware platforms.
Efficient Neural Network Design. Efficient DNN model design is crucial to the overall performance of a deep learning system. Many efficient neural network architectures have been proposed to improve hardware efficiency, such as SqueezeNet [26], MobileNets [25], ShuffleNets [54], and so on, which mainly focus on reducing computation (e.g., adopting depthwise convolutions). To reduce the manual effort in efficient DNN design, NAS automates the architecture design process and has begun to dominate recent efficient network design, although early NAS methods [33, 59] searched for high-accuracy architectures without considering hardware efficiency. With the development of NAS techniques [8], researchers began to incorporate multi-objective optimization into the NAS process [23, 28, 37, 44, 48, 52, 58], typically searching for models with higher accuracy, fewer parameters, fewer FLOPs (number of multiply-add operations), and lower latency (or energy). These methods incorporate hardware characteristics into the NAS loop to increase inference efficiency, but their performance highly depends on the quality of the search space and search strategy [34, 42]. To evaluate each architecture efficiently, one-shot NAS substantially reduces the computation cost by training only one super-network, a.k.a. supernet, to approximate the performance of every architecture in the search space via weight sharing. Specifically, one-shot NAS uses the supernet to predict the performance of a specific architecture by deactivating the extra edges w.r.t. a target architecture on the supernet via masking, then performing evaluations with the masked supernet. In general, search space design follows manual heuristics, and either heuristic or ML methods are used for the design search.
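As a concrete illustration of the weight-sharing idea (a simplified sketch, not the exact mechanism of any particular framework), a single shared kernel can serve several kernel-size choices by slicing out a centered sub-kernel, so each candidate is evaluated without training it from scratch. The `center_slice` helper and the tuple-valued "weights" below are our own hypothetical stand-ins:

```python
def center_slice(kernel, k):
    """Extract the centered k x k sub-kernel from a shared n x n kernel."""
    n = len(kernel)
    off = (n - k) // 2
    return [row[off:off + k] for row in kernel[off:off + k]]

# Shared 7x7 weights, trained once in the supernet (index tuples stand
# in for real parameter values).
shared = [[(i, j) for j in range(7)] for i in range(7)]

sub3 = center_slice(shared, 3)  # evaluate the kernel-size-3 candidate
sub5 = center_slice(shared, 5)  # evaluate the kernel-size-5 candidate
print(sub3[0][0])  # (2, 2): the sub-kernel is cut from the center
```

Note that Reference [9] additionally applies a kernel transformation matrix after extraction; plain slicing is only the minimal version of the idea.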
DNN Model Deployment on Hardware. How to map DNNs onto a particular hardware device is crucial to system performance [7, 36]. Contemporary DNN compilers [12, 18, 43] rely on an intermediate representation (IR) of the task graph to abstract the network architecture, so they can either manipulate the network IR at the graph level or redefine the implementation of each single operator, which is a node in the IR graph. As shown in Figure 2, optimizing these DNN models is generally performed in two steps: (i) high-level graph optimizations and (ii) low-level kernel optimizations such as those found in vendor libraries. Typically, at the beginning of compiler optimization, the compiler partitions the large computation graph of a DNN model into several subgraphs, which become the basic units of the whole compilation process. This partition has a negligible effect on performance due to the layer-by-layer construction nature of DNNs [55]. Then graph-level optimization techniques are applied to the subgraphs, including layout optimizations, operator fusion, constant folding, auto-batching, and so on. In contrast, operator-level computation optimization is often hardware-specific. For example, achieving latency hiding for operator computation requires different strategies on different hardware back-ends: on the CPU, memory latency hiding is achieved implicitly with simultaneous multi-threading or hardware pre-fetching [12], while GPUs rely on rapid context switching of many warps of threads. Generally, in the deep learning compilation process, to optimize an operator or a (sub-)graph of multiple operators, the compiler requires users to define the computation in a high-level declarative language; the compiler then searches for programs tailored toward different hardware platforms based on human-written schedule templates, including different schedule primitives (e.g., split, reorder, fuse, etc.)
and proper parameters to achieve parallelism, vectorization, memory tiling, and memory-access latency hiding on different hardware. For example, TVM [12] requires the user to write a template that defines the structure of the operator programs with some tunable parameters. The compiler then searches for the best values of these parameters for a specific input-shape configuration and a specific hardware target.
Fig. 2. Illustration of DNN compiler optimizations.
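To make the graph-level step concrete, the following toy pass sketches operator fusion on a list-shaped IR. The IR encoding and the rule that any elementwise op folds into its producer are simplifying assumptions of ours, not a real compiler's representation:

```python
# Elementwise ops that can be folded into the preceding producer op,
# saving a round-trip of intermediate data through memory.
FUSABLE = {"relu", "add_bias"}

def fuse_elementwise(graph):
    """Fuse each fusable elementwise op into the op right before it."""
    fused = []
    for op in graph:
        if op in FUSABLE and fused:
            fused[-1] = fused[-1] + "+" + op  # one fused kernel
        else:
            fused.append(op)
    return fused

graph = ["conv2d", "add_bias", "relu", "pool", "conv2d", "relu"]
print(fuse_elementwise(graph))
# ['conv2d+add_bias+relu', 'pool', 'conv2d+relu']
```

Real compilers apply such rewrites on a dataflow graph with legality checks; the linear list here only conveys the effect of the optimization.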
Heuristic search method. Many algorithms have been applied in both the NAS process and the schedule strategy search process, e.g., heuristic algorithms [12], Bayesian optimization [57], reinforcement learning [44, 59], and so on. In this article, we mainly consider the heuristic method in the co-design search process, so we briefly introduce the heuristic algorithm here. As illustrated in Figure 3, the heuristic method mainly consists of five parts. At first, a group of candidates is chosen to serve as the initial population. The algorithm then selects the best candidate after several iterations. In each iteration, new individuals (called children) are created from the selected individuals (called parents), depending on whether mutation or crossover is utilized. Each child is then evaluated to obtain a fitness value that determines the quality of the new individual. The new individual is inserted into the population if its fitness value is better than that of the worst-performing individual in the population; otherwise, the new individual is discarded. After that, the population is updated. Each iteration produces a new generation, and the search terminates when the maximum number of generations is reached.
Fig. 3. Illustration of the heuristic search method.
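The five-part loop above can be sketched as follows; the bit-string encoding and the `sum` fitness function are toy stand-ins for real architecture/schedule encodings and their measured quality:

```python
import random

def evolve(fitness, length=8, pop_size=10, generations=50, seed=0):
    """Minimal mutation-only evolutionary loop."""
    rng = random.Random(seed)
    # 1. initialize the population with random candidates
    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        parent = max(pop, key=fitness)        # 2. select a parent
        child = parent[:]
        child[rng.randrange(length)] ^= 1     # 3. mutate one gene
        worst = min(pop, key=fitness)         # 4. evaluate fitness
        if fitness(child) > fitness(worst):   # 5. keep only improvements
            pop[pop.index(worst)] = child
    return max(pop, key=fitness)

best = evolve(fitness=sum)  # toy objective: maximize the number of ones
```

Because a child only ever replaces the worst individual when it is strictly better, the best fitness in the population never decreases across generations.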
Observation. Figure 4 reveals a phenomenon we discovered in SOTA NAS frameworks running realistic workloads. For different neural network task specifications, e.g., different datasets, optimization goals, or performance constraints such as accuracy or latency requirements, an ideal NAS framework is supposed to generate the optimal neural network model for the target hardware through an automated search mechanism. However, as the user-specified accuracy constraint changes, the Pareto frontiers of NAS schemes that assume different scheduling solutions intersect with one another, which means none of the NAS baselines in Figure 4 can always generate a better solution than the others when the task specification or constraint changes. For example, in Figure 4, if the designer seeks a neural network that must run faster than 35 ms, then the NAS scheme assuming Schedule-B is better, while the one assuming Schedule-A is more accurate when the latency constraint changes to 20 ms. In other words, prior schedule-agnostic NAS technologies cannot guarantee the optimal network solution in the search process, as they assume a fixed compilation strategy. Thus, effective and automatic co-optimization of the neural network architecture and the scheduling policy is necessary for an optimized neural network system. In this work, we are the first to investigate scheduler- and hardware-aware NAS that simultaneously searches for both the network architecture and the corresponding network scheduling solution on the target hardware.
Fig. 4. Relative Pareto frontiers of three different scheduling strategies. We randomly extract 200 models from OFA [9] and test them on the Note 10 mobile. Schedule-A denotes the original schedule of TF_Lite, while Schedule-B and Schedule-C denote two schedule strategies that change the loop-split knobs in TF_Lite.
3 CHANAS FRAMEWORK
3.1 Problem Formalization
CHaNAS explores the massive joint design space, which is the confluence of the DNN architecture space and the network scheduling space for the target hardware. The DNN architecture space includes the model hyper-parameters such as the layer types, input size, channels, and so on, while the scheduling space contains many compiler-level optimization knobs with their parameters, such as loop tiling, operator fusion, reordering, and so on. Assume the DNN architecture search space is \( \lbrace arch\rbrace =\lbrace arch_{1},\ldots ,arch_{n}\rbrace \), the scheduling strategy space is \( C=\lbrace c_{1}, \ldots ,c_{m}\rbrace \), and the target design performance constraint is \( P_{thres} \). The objective of CHaNAS is to search the joint space \( \lbrace \lbrace arch\rbrace , C\rbrace \) for the best DNN architecture \( arch^{*} \) with the associated scheduling strategy \( c^{*}_{arch^{*}} \), which together achieve the maximum network accuracy at a performance P that satisfies the constraint \( P_{thres} \). We formalize this problem as (1) \( \begin{equation} \begin{aligned}\left(arch^{*}, c^{*}_{arch^{*}}\right) \in \underset{arch \in \lbrace arch\rbrace , c \in C}{\operatorname{argmin}} \mathcal {L}_{\text{val}}(arch,\omega ^{*}, c) \\ s.t. \quad P(arch, c) \lt P_{thres} \\ s.t. \quad \omega ^{*}=\underset{\omega }{\operatorname{argmin}} \ \mathcal {L}_{\text{train}}(\omega , arch), \end{aligned} \end{equation} \) where \( \mathcal {L}_{\text{train}} \) and \( \mathcal {L}_{\text{val}} \) are the task losses on the training and validation sets, respectively, and \( \omega \) denotes the network weights.
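On a toy joint space, Equation (1) amounts to a constrained argmin over (architecture, schedule) pairs. All loss and latency numbers below are made-up placeholders:

```python
# Hypothetical validation losses per architecture (accuracy proxy) and
# measured latencies per (architecture, schedule) pair.
val_loss = {"arch1": 0.30, "arch2": 0.25, "arch3": 0.28}
latency_ms = {("arch1", "c1"): 18, ("arch1", "c2"): 25,
              ("arch2", "c1"): 32, ("arch2", "c2"): 22,
              ("arch3", "c1"): 21, ("arch3", "c2"): 19}
P_thres = 24  # latency budget in milliseconds

# Keep only pairs meeting the performance constraint, then take the
# pair whose architecture has the lowest validation loss.
feasible = [pair for pair in latency_ms if latency_ms[pair] < P_thres]
best = min(feasible, key=lambda ac: val_loss[ac[0]])
print(best)  # ('arch2', 'c2')
```

Note that a scheduler-agnostic search fixing schedule c1 would find arch2 infeasible (32 ms) and settle for arch3, missing the more accurate arch2, which is exactly the pitfall the joint space avoids.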
3.2 CHaNAS Overview
As shown in Figure 1(c), in contrast to prior works that only search for solutions inside the network architecture space \( \lbrace arch\rbrace \), CHaNAS has a significantly larger design space to explore. Thus, we need to answer the following questions: (i) How should we automatically construct the co-design search space? It needs to cover both the DNN model and the scheduling policy, which requires a compact design space representation to describe the potential co-design solutions efficiently. (ii) How should we evaluate each candidate co-design pair efficiently? We have to perform costly DNN model training and sophisticated schedule optimization to get the real performance on the target hardware. (iii) How should we reduce the huge search space and design an effective search strategy that quickly converges to high-quality co-design solutions? Thus, for the first time, CHaNAS conducts the architecture/schedule-policy co-design by employing a block-based hierarchical design representation to construct the co-design search space efficiently and reduce the effort spent exploring unnecessary options.
As a sub-graph of interconnected neural network operators (e.g., capsule conv2d, dilated conv2d), a CHaNAS block is the fundamental unit of both the network architecture design and the scheduling policy search. The choice of block-level construction is based on two observations. First, the block is the basic unit in network architecture search, as mainstream NAS approaches pre-define the elementary blocks and generate the inter-block connections to form a candidate network architecture. Second, in a deep learning compilation system, the compiler usually divides the full computation graph into sub-graphs, where the sub-graphs can be treated as blocks. The scheduler then attempts to reorder the operators in the blocks and maps each block onto hardware by trying out back-end scheduling techniques such as loop unrolling and blocking. Therefore, block-level construction is flexible enough to support different design constraints and different hardware platforms. Furthermore, such a factorized hierarchical search space strikes a good balance between the diversity of potential co-design options and the size of the entire co-design search space, and it also mitigates the difficulty of model-level schedule optimization. Suppose we partition the network into B blocks, each block has a parameter search space of size R, and there are on average S scheduling policies per block. Due to our block-based design, the neural network architecture search and the schedule policy search are separated, with search space sizes of \( R^{B} \) and \( B*S \), respectively. So our total search space is \( B*S+R^{B} \), versus the original single-level co-design search space of size \( R^{B}*S^{B} \).
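A quick back-of-envelope check of this factorization, with hypothetical sizes B = 5 blocks, R = 9 architecture choices per block, and S = 1000 schedule policies per block:

```python
# Hypothetical sizes, chosen only to illustrate the gap in scale.
B, R, S = 5, 9, 1000

joint = R ** B * S ** B        # coupled single-level co-design space
factored = R ** B + B * S      # CHaNAS: architectures + per-block schedules

print(factored)  # 64049
print(joint)     # 59049000000000000000
```

Because schedules are tuned once per block rather than once per full network candidate, the schedule term grows linearly in B instead of exponentially.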
In the whole design process, the hierarchical exploration flow of CHaNAS hinges on neural blocks and involves two coordinated phases: the top-down network scheduling component pre-generates the implementations for each possible neural block and evaluates them on the target hardware, while the bottom-up network architecture generation unit searches through the possible sequences of the neural blocks that have already been virtually mapped and optimized on hardware by the scheduling unit. With these two exploration paths, the optimal co-design solution can be found with high probability. Figure 5 depicts the overview of CHaNAS, which has three major components: (1) a block-stacked super-net that captures the DNN architecture search space (all candidate DNN architectures can be extracted from the super-net, which enables fast accuracy evaluation as each candidate DNN directly inherits its parameters and bypasses the expensive network training stage); (2) a block-based scheduling space explorer that transforms each block into a computational sub-graph and optimizes it on the target hardware; and (3) a search algorithm that divides the joint space to reduce the search cost and then searches within the co-design sub-spaces. In CHaNAS, there are three major steps in the co-design flow.
Fig. 5. Overview of CHaNAS. Both the network architecture design and the network scheduling are conducted on the basis of neural network blocks.
Step 1: Construction of the Elastic Super-Net. We first build the super-net from which many candidate DNN architectures can be derived. The super-net is stacked with blocks; each block in the super-net has many variable hyper-parameters, such as kernel size, channels, input shape, and so on. After the super-net training completes, CHaNAS extracts child networks for the target hardware from the super-net during the search for co-design solutions. In the super-net training process, we use a new generalization-based search objective function, which avoids the overfitting seen in previous one-shot NAS methods.
Step 2: Block-level Pre-Scheduling. For the scheduling-level optimization space, the basic DNN building blocks are virtually pre-scheduled and optimized on the target hardware, generating many block-level co-designs that are used in the evaluation stage of the global joint design space search. Given a parameterized block, the scheduling space, including the graph-level optimizations of the overall block topology and the operator-level optimizations, is explored via a heuristic method, and the scheduling policies are fine-tuned under the direction of a learned cost model until the best scheduling point for that block is obtained.
Step 3: Co-design Exploration. Given the deployment constraints (latency in our tests as an example), we first reduce the co-design search space according to the cumulative distribution function of network inference latency. Then we use an evolutionary searcher to explore the reduced search space, for which we build a DNN accuracy predictor and a performance predictor based on the block performance Look-Up Table (LUT) profiled at the block-level pre-scheduling stage. Finally, the target DNN model with a dedicated scheduling strategy is returned.
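The block-level LUT makes the performance predictor a simple sum over per-block profiled latencies. The entries and the (unit, kernel, expand) keying below are illustrative stand-ins, not measured values:

```python
# Hypothetical LUT: best-schedule latency per block configuration,
# profiled once during the block-level pre-scheduling stage.
block_lut_ms = {("mbconv", 3, 3): 1.2,   # (unit, kernel, expand) -> ms
                ("mbconv", 5, 3): 1.9,
                ("mbconv", 5, 6): 2.8,
                ("mbconv", 7, 6): 3.7}

def predict_latency(blocks):
    """Estimate network latency as the sum of its blocks' LUT entries."""
    return sum(block_lut_ms[b] for b in blocks)

candidate = [("mbconv", 3, 3), ("mbconv", 5, 6), ("mbconv", 7, 6)]
print(round(predict_latency(candidate), 1))  # 7.7
```

This additive model is what makes candidate evaluation cheap during the evolutionary search: no candidate is ever compiled or run end-to-end.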
3.3 Elastic Super-net Construction
The block is the basic unit in both the CHaNAS architecture search and the scheduling search. However, designing the block size involves a tradeoff. In the CHaNAS design space, if the block structure is defined at an over-fine granularity, i.e., with fewer layers per block, then the graph-level scheduling search space within a block will be too small for thorough scheduling-level optimization. In contrast, if the blocks are too large and lead to a massive block design space, then the search algorithm will be less likely to converge to the optimal architecture in the oversized architecture search space. Therefore, we propose a medium-grained block design method to build an elastic super-net that balances architecture search efficiency and scheduling search efficacy. The super-net does not induce too much search complexity, yet contains blocks large enough to explore the potential of scheduling-level search.
Figure 6 shows the architecture of the super-net, which defines the architecture search space. First, the DNN is composed of N basic building blocks. In this work, to design hardware-friendly DNNs and reduce search time, we adopt the single-path DNN structure without branches [10, 58]. Second, each block \( block_{i}, (1 \le i \le N) \), includes M units. Following the common practice of NAS approaches [9, 44], we adopt the elastic MBConv cell as the basic unit in a block. An elastic MBConv cell is composed of sequential layers of conv-1 × 1 (convolution with kernel size 1 × 1), dwconv-k × k (depthwise convolution with kernel size k), SE (Squeeze-and-Excitation), and conv-1 × 1. Between the conv-1 × 1 and the dwconv-k × k, the number of channels expands by a ratio E. In the elastic MBConv, we search the kernel size k over \( \lbrace 3,5,7\rbrace \), and we also search the channel expansion ratio E over \( \lbrace 3,6\rbrace \) for each block except the first one, which has a default expansion ratio of 1. As in Reference [9], we allow a parameterizable kernel size through a kernel transformation matrix in each MBConv, which expedites the training process. Furthermore, the depth D of the block can also vary, meaning only the first D units are kept in the sampled block. In addition to these configurable parameters, we also allow the DNN model to take arbitrary input image sizes by assigning the model a size ratio.
Fig. 6. Super-net architecture, which is composed of a number of blocks.
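Under the simplifying assumption that each unit independently picks its kernel size and expansion ratio, and that the block depth D ranges over {2, 3, 4} (a hypothetical depth set, chosen here only for illustration), the per-block architecture choices described above can be counted as:

```python
from itertools import product

KERNELS = (3, 5, 7)   # searchable kernel sizes
EXPANDS = (3, 6)      # searchable channel expansion ratios
DEPTHS = (2, 3, 4)    # assumed block depths (first D units kept)

def block_configs():
    """Count per-block architecture configurations."""
    per_unit = len(list(product(KERNELS, EXPANDS)))  # 6 choices per unit
    # each of the D active units picks (kernel, expand) independently
    return sum(per_unit ** d for d in DEPTHS)

print(block_configs())  # 6^2 + 6^3 + 6^4 = 1548
```

Multiplying such per-block counts across N blocks is what produces the \( R^{B} \)-sized architecture space discussed in Section 3.2.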
The elastic design of blocks allows the super-net to offer candidate sub-networks that are sufficiently flexible for different deployment constraints while keeping the search cost affordable. To reduce the evaluation cost further, we train the super-network that contains all the possible sub-networks through weight sharing and use it to estimate the accuracy of each sub-network. Specifically, in the super-net training process, we design an objective function to reduce overfitting. In the following section, we introduce the generalization-based loss function design.
3.4 Training Objective of the Super-net
It has been demonstrated that by sharing the weight parameters among all the architecture instances, we will gain several orders of magnitude speedup in the search process [33, 41]. However, this comes with the cost that the performance of all the candidate networks cannot be fairly evaluated, since they inherit the sub-optimal shared parameters from the supernet. This is due to the super-net containing many sub-networks of different sizes and shapes, and they share the weights with each other, making the sub-networks inherit the common sub-optimal weights. Intuitively, this factor leads to low-quality decision-making of the search controller when selecting the candidate network without considering its genuine optimal performance. Therefore, we propose to use a new training objective function to reduce the phenomenon of model overfitting and prevent interference between the sub-networks.
Recall that previous NAS methods [33, 44, 48, 59] use either the training loss or the validation loss to update the network parameters, i.e., (2) \( \begin{equation} \begin{aligned}\mathbb {E}\left[\mathcal {L}_{\text{train}}(arch, \boldsymbol {w})\right], \ or \ \mathbb {E}\left[\mathcal {L}_{\text{val}}(arch, \boldsymbol {w})\right]\!. \end{aligned} \end{equation} \)
Compared to the naive approach, our objective function is based on the generalization gap of the network architecture. The intuition is that the selected sub-network should perform well on data it has not been trained on, thus reducing the effect of the parameter-sharing problem. Therefore, we focus on enhancing the generalization ability of the candidate networks by adding a generalization loss to the super-net training objective function. This improves the accuracy of the search controller by enforcing a fairer evaluation result and helps it select network models with better generalization performance. Formally, we define the objective function in the super-net training process as follows: (3) \( \begin{equation} \begin{aligned}\mathbb {E}\left[\mathcal {L}_{\text{train }}(arch, \boldsymbol {w})+\lambda \left|\mathcal {L}_{\mathrm{val}}(arch, \boldsymbol {w})-\mathcal {L}_{\text{train }}(arch, \boldsymbol {w})\right|\right], \end{aligned} \end{equation} \) where w represents the current network’s weights and \( \mathcal {L}_{\mathrm{val}}(arch, \boldsymbol {w})-\mathcal {L}_{\text{train }}(arch, \boldsymbol {w}) \) can be treated as the generalization loss. The scalar \( \lambda \) balances the training loss and the generalization loss; we observe that \( \lambda =0.45 \) works well in our experiments. We find that the generalization loss helps CHaNAS gain a \( 0.5\% \) network accuracy improvement, and we present a detailed analysis of how it improves the overall performance in the evaluation section.
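Numerically, Equation (3) penalizes sub-networks whose validation loss runs far above their training loss; the loss values below are placeholders:

```python
LAMBDA = 0.45  # the balance factor reported to work well above

def supernet_objective(train_loss, val_loss, lam=LAMBDA):
    """Training loss plus a weighted penalty on the train/val gap."""
    return train_loss + lam * abs(val_loss - train_loss)

# A sub-network that fits the training set but generalizes poorly is
# ranked worse than one with a slightly higher but consistent loss:
print(round(supernet_objective(0.10, 0.60), 3))  # 0.325 (overfit)
print(round(supernet_objective(0.20, 0.30), 3))  # 0.245 (small gap)
```

Under the plain training-loss objective of Equation (2), the first sub-network (0.10 vs. 0.20) would have been preferred; the gap term reverses that ranking.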
3.5 Block-level Pre-scheduling
In the prior NAS process, when a network architecture candidate is selected, it is mapped to the underlying hardware or a hardware model for performance evaluation. In CHaNAS, however, the generated network must first try out and choose the optimal scheduling policy for the specific hardware before its true-optimal performance can be measured. As a result, for each network, we must ensure its blocks are mapped to the hardware with the best scheduling policy. To select the best scheduling point for a block on the target hardware, we build a block schedule optimizer \( \mathcal {F} \) that takes the parameterized block \( block_{i} = (A_{1}, A_{2}, \ldots , A_{\mathrm{b}}) \) from the super-net and the hardware descriptor \( D_{hard} \) as inputs; the outputs of the schedule optimizer are the best scheduling policy \( c_{block}^{*} \) and the associated network performance \( P_{c_{block}^{*}} \) on the hardware: (4) \( \begin{equation} \begin{aligned}\mathrm{c_{block}^{*}}, P_{c_{block}^{*}} =\mathcal {F}(A_{1}, A_{2}, \ldots , A_{\mathrm{b}}, D_{hard}). \end{aligned} \end{equation} \)
The block scheduling process, which guarantees efficient execution of these blocks on the target hardware, is depicted in Figure 7 and mainly comprises two parts: graph-level optimization and lower-level operator scheduling for the hardware platform. For graph-level optimization, we mainly adopt two methods: layer fusion (including intrinsic fusion, pointwise fusion, and kernel fusion), which reduces the memory and communication overhead of intermediate data, and data-layout transformation (such as the flatten, concat, and reorganization operations), which transforms the feature data layouts into back-end-friendly patterns.
Fig. 7. Block schedule optimization, including graph-level optimization and lower-level operators schedule mapping.
For lower-level operator scheduling, since each block is composed of MBConv cells with similar structures but different tensor shapes, we optimize the schedule of each operator (e.g., conv2d, depthwise conv2d) in the MBConv cell independently. Lower-level operator scheduling is a self-tuning process that aims to optimize low-level implementations for maximum hardware performance. Initially, we generate a scheduling space by enumerating different combinations of hardware-specific schedule primitives and their adjustable parameters (listed in Table 1), including memory tiling, loop transformations, vectorization/tensorization, and parallelization strategies [13]. Then, without human intervention, we apply a heuristic method to search the scheduling space and find schedules that maximize performance for a particular combination of operator, tensor shape, and hardware configuration.
| Name | Description & Parameters |
|---|---|
| split | divide a loop into several sub-loops |
| fusion | merge several loops into a hyper-loop |
| reorder | change execution orders of loops |
| unroll | unroll a loop to a given depth |
| vectorize | apply vector operation to a loop |
| inline | inline a function |
| compute_at | put producer in the body of consumer |
| parallel | parallelize a loop with multithreading |
| cache | use shared memory to store inputs/results |
| bind | assign a loop to parallel blocks/threads |
Table 1. Different Scheduling Primitives with Their Description
Encoding the operator schedules. The operator schedule is encoded as a vector, with each element representing a specific primitive or parameter choice. Figures 8(c) and (e) are two scheduling examples for the conv2d operator instance in Figure 8(b). Both split the original seven loops of conv2d into 13 sub-loops, but with different split parameters; they then reorder the sub-loops and generate a larger outermost loop by fusion. Figures 8(d) and (f) are the corresponding encoded representations of Figures 8(c) and (e), respectively. The encoding follows a fixed order to ensure that configuration points with the same number of parameters are grouped together. For the split primitive, we record the split factors and form the Cartesian product of the resulting sub-spaces. Supposing that a loop of extent L in a two-dimensional (2D) convolution is split into Z parts, we generate the different choices via Z-factorization of the integer L; a split is represented as \( [f_{1},f_{2},\ldots ,f_{Z}] \), where \( f_{1} \times f_{2} \times \cdots \times f_{Z}=L \). For reordering, we simply record the new order of the loops. For fusion, loops that are not recorded are fused with their nearest outer loops. For unrolling, each loop maps to a value of 1 when unrolled and 0 otherwise. Besides, we always parallelize the outermost loop and vectorize the innermost loop; therefore, parallel and vectorize are not encoded.
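The Z-factorization underlying the split encoding can be illustrated as follows (a minimal sketch; the function names are ours, not from the CHaNAS code):

```python
from itertools import product

def z_factorizations(L, Z):
    """Enumerate ordered factor tuples (f1, ..., fZ) with
    f1 * f2 * ... * fZ == L, i.e. the encoded choices for
    splitting a loop of extent L into Z sub-loops."""
    if Z == 1:
        return [(L,)]
    out = []
    for f in range(1, L + 1):
        if L % f == 0:  # only divisible factors are valid splits
            out.extend((f,) + rest for rest in z_factorizations(L // f, Z - 1))
    return out

# Splitting a loop of extent 8 into 3 sub-loops yields tuples such as
# (1, 2, 4), (2, 2, 2), (8, 1, 1), ...
splits = z_factorizations(8, 3)

# The per-operator split sub-space is the Cartesian product of the
# per-loop choices (reorder/unroll dimensions omitted here):
space = list(product(z_factorizations(8, 2), z_factorizations(4, 2)))
```

This also makes the space sizes concrete: a single extent-8 loop split three ways has 10 encodings, and combining two loops multiplies the sub-spaces.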
Fig. 8. The example of Operator scheduling example in CHaNAS. (a) Conv2d description. (b) Conv2d code example. (c) Schedule description. (d) Schedule encoding of (c). (e) Schedule description. (f) Schedule encoding of (e).
Scheduling-level Search. To explore this large design space efficiently, we employ a heuristic method guided by a Gaussian-Process-based (GP) cost model, which is pre-trained on runtime measurement data collected from the target hardware. Different points in the search space are evaluated by querying the GP-based cost model. Figure 9 shows the schedule exploration process. Before exploration begins, an initial set H is maintained that holds scheduling points \( \lbrace p_{1}, p_{2}, \ldots , p_{H}\rbrace \) and their performance results \( E_{p} \). During the scheduling-policy exploration, we choose a starting point p from H with probability proportional to \( \exp \left(-\gamma \frac{E^{*}-E_{p}}{E^{*}}\right) \), where \( \gamma \) is a hyperparameter and \( E^{*} \) is the performance of the best schedule point in H; the closer \( E_{p} \) is to \( E^{*} \), the more likely p is to be chosen. We then generate a set of new points from p through mutation and evaluate these schedule points with the GP-based cost model. We also define rules in the mutation mechanism to avoid generating inefficient schedule points. For example, we discover that in the vast majority of cases a divisible split factor is the most efficient, while other split factors produce inferior outcomes; as a result, for each loop we restrict mutated split factors to divisors of the loop extent. The top-k newly generated points are then measured on the hardware and added to H, and all of this measurement data is used to update the GP-based cost model. Thanks to the blocks’ structural similarity, we can reuse the GP-based cost model to speed up the scheduling-policy exploration.
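The two exploration mechanics described above, performance-biased selection of a starting point and divisibility-restricted mutation, can be sketched as follows (the GP cost model and the full mutation set are omitted; all names are hypothetical, not from the CHaNAS code):

```python
import math
import random

def pick_start(H, gamma=1.0, rng=random):
    """Select a starting schedule from the history H = [(point, perf), ...]
    with probability proportional to exp(-gamma * (E_star - E_p) / E_star),
    so points whose performance E_p is close to the best-so-far E_star
    are favored."""
    e_star = max(perf for _, perf in H)
    weights = [math.exp(-gamma * (e_star - perf) / e_star) for _, perf in H]
    return rng.choices(H, weights=weights, k=1)[0][0]

def mutate_split(point, L, rng=random):
    """Mutate one split factor of a schedule point while keeping it
    divisible: mutation is restricted to divisors of the loop extent L,
    since non-divisible splits were found to be inferior in most cases."""
    divisors = [d for d in range(1, L + 1) if L % d == 0]
    new = list(point)
    i = rng.randrange(len(new))
    new[i] = rng.choice(divisors)
    return tuple(new)
```

In the full loop, mutated points would be scored by the GP cost model, the top-k measured on hardware, and the measurements fed back into both H and the model.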
Consequently, after a certain number of iterations, we gradually find the optimal scheduling policy of an operator on the hardware platform, which is stored in a LUT for CHaNAS to query during the coordinated search process.
Fig. 9. Overview of the operator-level schedule exploration process, which breeds the next generation using mutation and crossover operators. The schedule explorer evaluates the performance of each schedule candidate using a Gaussian-Process-based cost model and then selects the schedule with the highest score to run on a distributed device cluster via RPC, and, finally, we get the real performance and collect the performance data into the database H. To improve the cost model’s predictive accuracy, we update the cost model periodically using collected data in the database H.
3.6 Co-design Space Exploration
The size of the factorized hierarchical search space is reduced exponentially compared to the original joint space, since the candidate DNN architectures in the design space can be directly compared by their best performance, given that the blocks comprising the DNNs have already been evaluated on the target hardware in the pre-scheduling stage. However, the sub-net search space is still enormous: evaluating each candidate DNN still requires additional effort even when weight sharing is used to avoid direct training. We first automatically divide the search space according to the performance requirements, which eliminates the search in sub-spaces that do not satisfy the constraints. Then we use an evolutionary method to search for the target solution.
Automated NAS search space division. In our analysis, we find that the input resolution and the model expansion ratio are the two most important factors affecting a network’s performance, so we divide the network search space according to these two factors. In this work, the input resolution factor can be selected from \( In\_size = \lbrace 160,176,192,208,224\rbrace \) and the model expansion ratio from \( Md\_E = \lbrace 1.0, 1.1, 1.2, 1.3\rbrace \). These parameters can be adjusted for different situations. The space division leads to \( |In\_size| \times |Md\_E| \) (\( 5\times 4=20 \) in our experiment) possible sub-spaces. Our first-step goal is to find the best sub-space that contains the target model.
Locating the target sub-space is non-trivial. One way is to perform a network search in each sub-space and compare the final results, but the computation overhead would be astronomical. Instead, we evaluate the quality of a sub-space by randomly sampling \( \delta \) networks from it and comparing the distributions of the qualified networks. We collect the Cumulative Distribution Function (CDF) of the latency of each satisfying network and choose the sub-space with the lowest average latency. The intuition is that, within the same model family, a sub-space whose latency CDF is shifted toward lower latencies is more likely to yield higher-performing models. To compute a network’s latency, we construct a block performance LUT for the target device to enable fast performance estimation of DNN candidates. In the LUT, we record block-level latency measured on real devices under different input dimensions, channel expansion ratios, and so on. The latency \( Lat_{net} \) of a network model is then obtained by summing the block-level latencies \( Lat_{block_{i}} \) from the LUT: (5) \( \begin{equation} \begin{aligned}Lat_{net} = \sum _{i=1}^{F}{Lat_{block_{i}}}, \end{aligned} \end{equation} \) where F denotes the total number of blocks. Because the block-level entries already include the optimal graph-level and operator-level runtime optimizations, this estimate is more accurate than previous operator-wise lookup tables: the block-level method captures the runtime optimizations by representing model sub-graphs together with their measured performance.
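Eq. (5) and the sub-space comparison can be sketched as follows (a simplified illustration: `sample` and the latency-budget qualification rule are our assumptions, and the real system compares full latency CDFs rather than just means):

```python
def net_latency(blocks, lut):
    """Eq. (5): network latency as the sum of the pre-measured
    block-level latencies stored in the LUT."""
    return sum(lut[b] for b in blocks)

def best_subspace(subspaces, lut, delta, sample, latency_budget):
    """Rank candidate sub-spaces by the mean latency of `delta`
    randomly sampled networks that satisfy the budget, and return
    the sub-space with the lowest average latency.
    `sample(space)` draws one architecture (a tuple of block ids)
    from a sub-space."""
    best, best_avg = None, float("inf")
    for space in subspaces:
        lats = [net_latency(sample(space), lut) for _ in range(delta)]
        qualified = [t for t in lats if t <= latency_budget]
        if qualified:
            avg = sum(qualified) / len(qualified)
            if avg < best_avg:
                best, best_avg = space, avg
    return best
```

The LUT lookup makes each latency estimate a constant-time sum, which is what allows thousands of candidates to be screened without running them on the device.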
Evolutionary search. After finding the best sub-space, we adopt an evolutionary (heuristic) search algorithm to find the target model efficiently. In the evolutionary search process, each DNN architecture is represented as a genome vector, denoted as \( arch_{i} = [block_{1},\dots , block_{b}]\in R_{v} \), where v is the length of the vector. We randomly choose 15K sub-networks with different architectures and measure their accuracy on 10K input samples from the validation set. To accelerate the evolution process, the \( [arch_{i}, accuracy_{i}] \) pairs are used to train a multilayer-perceptron-based accuracy predictor, which can quickly predict a model’s accuracy from its architecture parameters. By iteratively mutating high-quality DNN architectures, we generate new DNN architectures of potentially higher quality. In each iteration, the searcher evaluates the fitness of each DNN architecture candidate based on the outputs of the accuracy and latency predictors, then selects the architectures with the highest fitness to breed the next generation using mutation and crossover operators. When a certain number of iterations is reached, or the constraints are satisfied, the searcher returns the DNN architecture from the evaluated set together with its dedicated optimal scheduling policy for the target hardware.
4 EVALUATION
4.1 Experiments Setup
To demonstrate the efficacy of our proposed method, we evaluate CHaNAS against four previous hardware-aware NAS works (Mnasnet [44], Fbnet [48], ProxylessNas [10], and MobileNet-v3 [24]) on the ImageNet2012 classification dataset. For a fair comparison, we run their source code end-to-end under the same settings as our experiments, with different random initialization seeds, using the hyperparameters and commands released by the authors. As mentioned before, we first build the elastic super-net that can fit different deployment constraints, and then optimize each possible block in the super-net for the target hardware. Finally, we reduce the co-design search space and use the evolutionary algorithm to perform the search. We use five blocks to form the super-net, with each block having at most four MBConvs. Each MBConv has a kernel size within \( \lbrace 3,5,7\rbrace \) and a channel expansion ratio E within \( \lbrace 3,4,6\rbrace \). The input image size ranges within \( \lbrace 160,176,192,208,224\rbrace \), and the model expansion ratio ranges within \( \lbrace 1.0, 1.1, 1.2, 1.3\rbrace \). The standard SGD optimizer is used to train the super-net with Nesterov momentum 0.9 and weight decay \( 3e^{-5} \). The initial learning rate is 2.5, and the learning rate decays following a cosine schedule. We also use the knowledge distillation technique to progressively fine-tune the sub-networks. The whole training process takes around 4.5 days on 16 NVIDIA P100 GPUs with 32 GB memory each. For block compile optimization, we implement the optimization strategy in Python and use TVM [12] tools (version 0.7.dev) for code generation targeting various hardware platforms. Notably, the block optimization process is independent of the super-net training process; thus, the whole block optimization process can be executed in parallel with the training process.
In general, the time overhead of the pre-scheduling process can be hidden by training, and the joint-space search can be achieved with reasonable overhead, comparable to prior NAS methods that are agnostic to the scheduling optimization space.
We apply CHaNAS to three different hardware platforms: NVIDIA P100 GPU, Intel Xeon 8163 CPU, and Samsung Note 10 phone (Snapdragon 855). For comparison, the performance of previous works on GPU is measured in Pytorch 1.3 + Cuda 10.1; the NAS solutions on CPU are evaluated in Pytorch 1.3; on the mobile phone, the networks are implemented in TF_Lite. The reported performance is averaged over 1,000 measurements of each workload for more stable results.
4.2 Improvement from Scheduling-level Optimization
To show the benefit of our block-based pre-schedule optimization, we provide the baseline implementation CHaNAS-W/O, which denotes solutions extracted from the CHaNAS super-net without scheduling-level optimization; we measure the latency of these solutions in Pytorch on both GPU and CPU and in TF_Lite on the mobile phone. CHaNAS-W denotes solutions that have gone through block-based scheduling-level optimization in the NAS process. Table 2 reports the comparison between CHaNAS-W, CHaNAS-W/O, and previous hardware-aware NAS methods on the Samsung Note 10 phone. As we can see, although CHaNAS-W has higher MACs than CHaNAS-W/O, it shows an obvious reduction in inference latency (16.4 ms vs. 27.5 ms), because the compiler-level co-design can dynamically adapt to different block structures and help search for the proper network schedules for the target hardware platform. In general, CHaNAS-W achieves high performance because it strikes a balance between inter-thread and intra-thread workload decomposition on the ARM CPU by exploring numerous scheduling strategies. Besides, comparing the DNN architectures obtained from CHaNAS-W and CHaNAS-W/O, the architecture found by CHaNAS-W is more regular and thus more amenable to acceleration.
| Model | ImageNet Top-1(%) | Mobile Latency | MACs | Search Cost (GPU hours) | NN Training Cost (GPU hours) | Total Cost (GPU hours) | AWS Cost |
|---|---|---|---|---|---|---|---|
| Mnasnet | 74.0 | 34.4 ms | 317M | 4000N | — | 4000N | $12250N |
| Fbnet-C | 74.9 | 33.6 ms | 375M | 216N | 360N | 576N | $1764N |
| ProxylessNas-R | 74.6 | 35.7 ms | 320M | 200N | 300N | 500N | $1530N |
| MobileNet-v3(large) | 75.2 | 28.3 ms | 219M | — | 180N | 180N | $550N |
| CHaNAS-W/O | 76.5 | 27.5 ms | 224M | 40N | 1300 | 1300 + 40N | $<124N |
| CHaNAS-W | 76.6 | 16.4 ms | 240M | 50N | 1300 | 1300 + 50N | $<150N |
CHaNAS-W/O indicates the solutions that are extracted from the CHaNAS super-net without scheduling-level re-optimization. “AWS Cost” is calculated based on the price of on-demand P3.16xlarge instances on AWS Cloud. N is the number of deployment scenarios we experimented with in the evaluation.
Table 2. Performance Comparisons on Samsung Note10
Even without pre-schedule optimization, CHaNAS-W/O achieves higher ImageNet top-1 accuracy than MobileNet-v3 [24] with similar MACs. Specifically, CHaNAS-W/O achieves 1.2×, 1.4×, and 1.5× speedups over MobileNet-v3 on GPU, CPU, and Mobile, respectively. This is attributed to the elastic super-net design: the sub-networks extracted from the CHaNAS super-net offer both high flexibility and a medium granularity suitable for hardware-oriented search.
4.3 Improvement from the Adoption of Generalization Loss
Previous NAS methods use only the validation loss or the training loss to update the super-net weights; in such a training process, the network search controller evaluates and selects architectures using their sub-optimal shared parameters. CHaNAS instead uses the generalization loss to ensure that the candidates in the search process are fairly evaluated, which improves the final training results of the networks under evaluation. From Figure 11, we observe that the network search with the generalization loss yields better models in terms of average accuracy and performance. We now give insight into why the generalization loss helps improve network accuracy in CHaNAS. Figure 11 analyzes the source of this improvement by visualizing the results of three different loss objectives used in the super-network training stage, together with the true validation loss during the search process. As we can see, even when the validation loss is used for updating architecture parameters, the candidate networks still suffer from the overfitting problem caused by weight sharing. This provides evidence that neural architecture search can benefit from the adoption of the generalization loss, which may point to a different optimization direction for hardware-aware NAS in co-design cases.
4.4 Benefits Gained by Co-design
To demonstrate the effectiveness of the coordinated search method, Figure 10 shows the Pareto-optimal points found by different hardware-aware NAS methods on three hardware platforms. Through a joint search over network architecture and scheduling policy, CHaNAS-W achieves the highest performance across all cases under different performance requirements. CHaNAS-W achieves 78.4% ImageNet top-1 accuracy on the P100 GPU, 2.3% more accurate than MobileNet-v3, the previous best result reported by a hardware-aware NAS approach. Most importantly, CHaNAS-W runs 1.6×, 2.1×, and 2.2× faster on the P100 GPU; 1.9×, 2.2×, and 2.4× faster on the Intel Xeon 8163 CPU; and 1.7×, 2.7×, and 2.8× faster on the Note 10 phone than MobileNet-v3, Fbnet, and Mnasnet, respectively, while delivering similar accuracy. With the co-design approach, DNN solutions can be customized and optimized for the target hardware to achieve better performance.
Fig. 10. Neural network implementations provided by CHaNAS; coordinated search achieves similar accuracy but significant speedups over neural network solutions with a fixed scheduling policy on GPU, CPU, and Mobile, respectively.
Fig. 11. The results for different objective functions are used as the metric to update super-network parameters. ChaNAS with \( \mathcal {L}_{\mathrm{gen}} \) significantly outperforms \( \mathcal {L}_{\mathrm{train}} \) -based approaches.
We also show the co-design cost of CHaNAS compared with previous hardware-aware NAS methods in Table 2, when developing candidate networks and schedules for N different application scenarios with different constraints; the search cost includes the pre-schedule optimization cost and the co-search overhead before network deployment. “AWS Cost” is calculated based on the AWS price of on-demand P3.16xlarge instances. Most previous hardware-aware NAS methods need to re-design or re-train the candidate DNN model for a new hardware platform, or even for a change in the design constraints. For example, Mnasnet [44] needs 4,000 GPU hours (near $12,250 AWS cost) for a new development scenario. In contrast, because the block-based design decouples the training and search processes, the CHaNAS super-net training is performed only once, and only a marginal search cost is required for fast deployment in a new application case: CHaNAS needs only an additional 50 GPU hours ($150 AWS cost) for redeployment.
4.5 Analysis of Design Points Found by CHaNAS
We visualize some of the discovered architectures for the three platforms in Figure 12, using different colors and shapes to denote the kernel sizes and channel expansion ratios of the MBConvs. As shown in Figure 12, CHaNAS uses more 7 × 7 kernels in the neural network for GPU than for other platforms; we conjecture that CHaNAS can find more efficient GPU schedules for blocks with 7 × 7 kernels due to the large parallelism of the computation array, automatically finding a solution with higher arithmetic intensity to improve GPU utilization. On the Note 10 mobile phone, CHaNAS uses network models with a smaller input size but more layers than on the Intel CPU, which implies that slimmer but deeper architectures are needed to maintain model accuracy given the limited on-chip memory of mobile phones. This is reasonable, since deep-but-slim network architectures require less on-chip memory space than wider ones.
Fig. 12. Visualization of the architectures found by CHaNAS for NVIDIA P100 GPU, Intel Xeon CPU, and Note 10 mobile.
We also analyze the low-level operator schedules found by the schedule search strategy. Taking Note 10 as an example, we select 18 different convolution layers from the search space and list their configurations in Table 3. We then compare the latency of these 18 distinctive convolution layers as scheduled by CHaNAS-W and TF_Lite, respectively. The absolute performance is shown in Figure 13: CHaNAS-W achieves an average latency of 0.496 ms per layer, which is 40% lower than TF_Lite. To analyze why CHaNAS achieves higher speed, we carefully study the scheduling behavior generated by CHaNAS-W. On the Note 10 phone, CHaNAS-W enables register blocking through multi-level tiling and vectorizes the innermost loop, two critical factors for high-quality schedules on this platform. CHaNAS-W splits the spatial and reduction loops according to the split factors found during exploration and applies the split and reorder primitives recursively to produce a series of small tiles, achieving multi-level tiling. After splitting, the loops become small enough to potentially fit in the cache and exploit data locality. To exploit parallelism, CHaNAS-W dynamically fuses several outer loops into one outermost loop and parallelizes it. The performance improvement brought by CHaNAS-W shows that our scheduling-level search process can dynamically optimize the code implementation for various hardware, in contrast to previous fixed mapping strategies.
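As a plain-Python analogy for the split/reorder tiling described above (the real schedules are emitted as TVM tensor programs, not Python; the function name is ours), the effect of tiling is purely on iteration order, not on the computed result:

```python
def tiled_sum2d(A, ti, tj):
    """Tile a 2-D traversal: each of the (i, j) loops is split into a
    tile loop and an intra-tile loop, then the loops are reordered so
    every ti x tj tile is visited contiguously (better cache locality).
    The result is identical to the untiled traversal."""
    n, m = len(A), len(A[0])
    total = 0.0
    for io in range(0, n, ti):                      # outer (tile) loops
        for jo in range(0, m, tj):
            for ii in range(io, min(io + ti, n)):   # inner (intra-tile) loops
                for jj in range(jo, min(jo + tj, m)):
                    total += A[ii][jj]
    return total
```

In the real schedule, the outer tile loops would additionally be fused and parallelized, and the innermost loop vectorized, per the primitives in Table 1.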
Fig. 13. Latency results of CHaNAS-W compared to TF_lite on Note 10 mobile for different convolution layers.
| Name | L1 | L2 | L3 | L4 | L5 | L6 | L7 | L8 | L9 | L10 | L11 | L12 | L13 | L14 | L15 | L16 | L17 | L18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| input_size | 224 | 224 | 112 | 56 | 28 | 14 | 224 | 224 | 112 | 56 | 28 | 14 | 224 | 224 | 112 | 56 | 28 | 14 |
| in/out channel | 3/16 | 16/24 | 24/40 | 40/80 | 80/112 | 160/160 | 3/16 | 16/24 | 24/40 | 40/80 | 80/112 | 160/160 | 3/16 | 16/24 | 24/40 | 40/80 | 80/112 | 160/160 |
| kernel size | 3 | 3 | 3 | 3 | 3 | 3 | 5 | 5 | 5 | 5 | 5 | 5 | 7 | 7 | 7 | 7 | 7 | 7 |
| stride | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 |
Table 3. Configurations of 18 Distinctive Convolution Layers
5 RELATED WORK
Hardware-aware NAS. Recent hardware-aware NAS methods [22, 28, 44] directly incorporate hardware feedback into the architecture search loop to discover the neural networks that work best on specific hardware. For instance, Mnasnet [44] directly measures inference latency by executing the model on mobile phones, allowing it to use performance measurements to guide the model search process. ProxylessNAS [10] directly learns the DNN architecture for the ImageNet dataset by proposing a gradient-based approach to train binarized parameters. FBNet [48] proposes a differentiable platform-aware NAS using Gumbel Softmax sampling. OFA [9] can quickly search within a pre-trained supernet for a sub-graph network that achieves the required accuracy and speed on hardware. By these means, efficient network architectures with improved inference efficiency on particular hardware can be obtained automatically. EDD [31] proposes a fully simultaneous, efficient differentiable DNN architecture and implementation co-search methodology. Targeting ASICs, Yang et al. [51] propose NASAIC, which can simultaneously identify multiple DNN architectures and the associated heterogeneous ASIC accelerators to meet both DNN model accuracy and hardware performance requirements. Accelerator-aware NAS [14, 58] explores neural-architecture/hardware-accelerator co-design by parameterizing neural architecture search and accelerator search in a unified joint search space. However, prior hardware-aware NAS methods are oblivious to compiler-level design choices: they do not consider the various scheduling policies when evaluating network candidates in the search process. Consequently, they may fail to find the de facto optimal network model, which must eventually be scheduled and mapped onto the hardware via a compiler. Hence, we believe automated network-model/scheduling-policy co-search is an approach toward better co-design solutions.
High-performance DNN model scheduling. Scheduling a DNN model on the target hardware is another critical factor for improving the efficiency of DNN systems and has attracted much attention from both academia and industry [19, 20]. The current best practice is still to develop schedule libraries for different hardware, which requires manual tuning for each new DNN model. Most existing deep learning frameworks [4, 11, 27, 40] rely on these libraries to achieve high performance. For CPUs, MKL-DNN [2] uses JIT techniques to optimize Convolutional Neural Networks on Intel Xeon CPUs, and SWIRL [46] generates high-quality vectorized and parallelized code for DNNs to improve CPU efficiency. For GPUs, cuBLAS [3] accelerates linear algebra kernels to extremely high performance, and cuDNN [17] accelerates deep learning applications by assembling a set of efficient algorithms on the GPU. TF-Lite Micro [4] focuses on accelerating DNN models on embedded hardware. To adapt to more hardware back-ends and improve the generality of different deep learning frameworks, Intel nGraph [18] and Google XLA [1] serve as bridges between deep learning frameworks and hardware back-ends: Intel nGraph utilizes MKL-DNN to produce highly optimized implementations on CPUs and the PTX-emitting back-end of LLVM to generate assembly code for GPUs, while the XLA compiler acts as a back-end for TensorFlow. TVM [12] proposes an ahead-of-time compiler that supports multiple front-ends and hardware platforms. These compilers adopt high-level computing graphs and leverage fusion across layers based on predetermined rules. Besides, auto-scheduling algorithms have gradually attracted a substantial amount of attention and provide appealing productivity. Tensor Comprehensions [45] adopts polyhedral IRs and employs a genetic search over affine transformation options (e.g., tile sizes, loop fusion, and scheduling strategies).
PolyMage [38] introduces fusion methods with loop nests and determines the rules of fusion and the range of tiling sizes to ensure a small auto-tuning space. AutoTVM [12] utilizes high-level abstractions to represent the computing graph and includes a template-guided search algorithm for low-level tensor code generation. As a framework for generating tensor programs, Flextensor [56] attempts to reduce human efforts in writing templates by using more general templates to cover more operators. To expand the optimization scope of tensor scheduling, Ansor [55] explores a larger search space to cover the useful tensor-level program optimization options. However, none of the prior compiling or scheduling framework considers the joint search of both scheduling policy and network architecture.
DNN/Compiler Co-design. While DNN/accelerator co-design has been attracting growing research interest, DNN/compiler co-design remains largely underexplored. This may be partly because designers are inclined to treat compilers as well-developed tools that should not be touched. Some recent works address this issue and successfully demonstrate the practicality of DNN/compiler co-design. Examples include PCONV [35], PatDNN [39], and CoCoPIE [21], which tackle model compression and compilation simultaneously. During model compression, they focus on structured pruning guided by pre-determined compiler-friendly patterns; during compilation, they propose efficient compiler code generation, enabling the compilers to maximize or maintain both instruction-level and thread-level parallelism. MCUNet [32] is another framework that integrates model design and compiler optimization. It is composed of two components, TinyNAS and TinyEngine: TinyNAS searches for specialized DNNs, while TinyEngine generates specialized code to eliminate instruction and memory redundancy.
Compared to traditional methods that either optimize the neural network via neural architecture search under a fixed scheduling strategy from a given deep learning library (e.g., Tensorflow [4] and Pytorch [40]) or tune the scheduling policy to maximize the inference speed of a given network [12, 13], CHaNAS can better utilize the hardware resources through network-architecture/scheduling-policy co-design.
6 CONCLUSION
We proposed an automated NAS framework, CHaNAS, to co-optimize the DNN architecture and the dedicated network scheduling policy for a specific machine learning task on the target hardware. It directly involves compiler optimization in the NAS loop, aiming to deliver higher-performance DNN solutions for the target hardware platforms. With the proposed hierarchical co-design search space and exploration method, both the network architecture design and the network scheduling can be effectively conducted on the basis of neural network blocks. Specifically, we also introduced a new objective function for super-network training based on the generalization gap, which helps improve the accuracy of the candidate DNN models, and we showed that it outperforms the previously used training or validation loss functions. In extensive experiments on the ImageNet2012 classification task across different hardware back-ends, CHaNAS generates better co-design solutions than the SOTA hardware-aware search method MobileNet-v3. The experiments show that the co-design solutions generated by CHaNAS achieve 1.6×, 1.9×, and 1.7× performance boosts, respectively, on NVIDIA P100 GPU, Intel Xeon CPU, and Samsung Note 10 Mobile, over the baselines of the same-level accuracy. This approach can easily transfer to other image classification tasks. We believe that a larger neural network architecture and schedule co-design search space, together with an effective exploration strategy, could benefit hardware-aware NAS methods further.
- [1] 2020. Google, XLA. Retrieved from https://www.tensorflow.org/xla.
- [2] 2020. Intel MKL-DNN. Retrieved from https://github.com/intel/mkl-dnn.
- [3] 2020. NVIDIA, CUBLAS Library. Retrieved from https://www.nvidia.com/.
- [4] 2017. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’17). 265–283.
- [5] 2020. Best of both worlds: AutoML co-design of a CNN and its hardware accelerator. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC’20). 1–6.
- [6] 2018. Neuromorphic computing across the stack: Devices, circuits and architectures. In Proceedings of the IEEE International Workshop on Signal Processing Systems (SiPS’18). IEEE, 1–6.
- [7] 2017. A pipelined and scalable dataflow implementation of convolutional neural networks on FPGA. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’17). IEEE, 90–97.
- [8] 2018. Understanding and simplifying one-shot architecture search. In Proceedings of the International Conference on Machine Learning (ICML’18). 550–559.
- [9] 2020. Once-for-all: Train one network and specialize it for efficient deployment. In Proceedings of the International Conference on Learning Representations (ICLR’20).
- [10] 2019. ProxylessNAS: Direct neural architecture search on target task and hardware. In Proceedings of the International Conference on Learning Representations (ICLR’19).
- [11] 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274. Retrieved from http://arxiv.org/abs/1512.01274.
- [12] 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). 578–594.
- [13] 2018. Learning to optimize tensor programs. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS’18). 3389–3400.
- [14] 2020. You only search once: A fast automation framework for single-stage DNN/Accelerator co-design. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE’20). 1283–1286.
- [15] 2017. Using dataflow to optimize energy efficiency of deep neural network accelerators. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (Micro’17). 12–21.
- [16] 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Select. Top. Circ. Syst. 9, 2 (2019), 292–308.
- [17] 2014. cuDNN: Efficient primitives for deep learning. arXiv:1410.0759. Retrieved from http://arxiv.org/abs/1410.0759.
- [18] 2018. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv:1801.08058. Retrieved from http://arxiv.org/abs/1801.08058.
- [19] 2014. Energy-aware task mapping and scheduling for reliable embedded computing systems. ACM Trans. Embed. Comput. Syst. 13, 2s (2014), 1–27.
- [20] 2015. Reliability and energy-aware mapping and scheduling of multimedia applications on multiprocessor systems. IEEE Trans. Parallel Distrib. Syst. 27, 3 (2015), 869–884.
- [21] 2021. CoCoPIE: Enabling real-time AI on off-the-shelf mobile devices via compression-compilation co-design. Commun. ACM 64, 6 (2021), 62–68.
- [22] 2020. Accelerator-aware neural network design using AutoML. arXiv:2003.02838. Retrieved from https://arxiv.org/abs/2003.02838.
- [23] 2019. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. In Proceedings of the 56th ACM/IEEE Design Automation Conference (DAC’19).
- [24] 2019. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’19). 1314–1324.
- [25] 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Retrieved from http://arxiv.org/abs/1704.04861.
- [26] 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv:1602.07360. Retrieved from http://arxiv.org/abs/1602.07360.
- [27] 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. 675–678.
- [28] 2019. Accuracy vs. efficiency: Achieving both through FPGA-implementation aware neural architecture search. In Proceedings of the 56th ACM/IEEE Design Automation Conference (DAC’19). 1–6.
- [29] 2019. Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Micro’19). 754–768.
- [30] 2018. Co-design of deep neural nets and neural net accelerators for embedded vision applications. In Proceedings of the 55th ACM/IEEE Design Automation Conference (DAC’18). 1–6.
- [31] 2020. EDD: Efficient differentiable DNN architecture and implementation co-search for embedded AI solutions. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC’20). IEEE, 1–6.
- [32] 2020. MCUNet: Tiny deep learning on IoT devices. arXiv:2007.10319. Retrieved from https://arxiv.org/abs/2007.10319.
- [33] 2019. DARTS: Differentiable architecture search. In Proceedings of the International Conference on Learning Representations (ICLR’19).
- [34] 2019. On neural architecture search for resource-constrained hardware platforms. arXiv:1911.00105. Retrieved from http://arxiv.org/abs/1911.00105.
- [35] 2020. PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5117–5124.
- [36] 2017. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’17). 45–54.
- [37] 2018. Hardware-aware machine learning: Modeling and optimization. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’18).
- [38] 2015. PolyMage: Automatic optimization for image processing pipelines. ACM SIGARCH Comput. Arch. News 43, 1 (2015), 429–443.
- [39] 2020. PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. 907–922.
- [40] 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703. Retrieved from https://arxiv.org/abs/1912.01703.
- [41] 2018. Efficient neural architecture search via parameters sharing. In Proceedings of the International Conference on Machine Learning (ICML’18). PMLR, 4095–4104. Retrieved from https://arxiv.org/abs/1802.03268.
- [42] 2020. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). 10428–10436.
- [43] 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Not. 48, 6 (2013), 519–530.
- [44] 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 2820–2828.
- [45] 2018. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv:1802.04730. Retrieved from http://arxiv.org/abs/1802.04730.
- [46] 2019. SWIRL: High-performance many-core CPU code generation for deep neural networks. Int. J. High Perf. Comput. Appl. 33, 6 (2019), 1275–1289.
- [47] 2016. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In Proceedings of the 53rd ACM/IEEE Design Automation Conference (DAC’16). 1–6.
- [48] 2019. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 10734–10742.
- [49] 2017. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs. In Proceedings of the 54th Annual Design Automation Conference (DAC’17). 1–6.
- [50] 2020. Latency-aware differentiable neural architecture search. arXiv:2001.06392. Retrieved from https://arxiv.org/abs/2001.06392.
- [51] 2020. Co-exploration of neural architectures and heterogeneous ASIC accelerator designs targeting multiple tasks. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC’20). IEEE, 1–6.
- [52] 2019. Synetgy: Algorithm-hardware co-design for ConvNet accelerators on embedded FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’19).
- [53] 2016. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In Proceedings of the 53rd ACM/IEEE Design Automation Conference (DAC’16).
- [54] 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18). 6848–6856.
- [55] 2020. Ansor: Generating high-performance tensor programs for deep learning. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI’20). 863–879. Retrieved from https://www.usenix.org/conference/osdi20/presentation/zheng.
- [56] 2020. FlexTensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous systems. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20). 859–873.
- [57] 2019. BayesNAS: A Bayesian approach for neural architecture search. In Proceedings of the 36th International Conference on Machine Learning (ICML’19), Vol. 97. 7603–7613. Retrieved from http://proceedings.mlr.press/v97/zhou19e.html.
- [58] 2021. Rethinking co-design of neural architectures and hardware accelerators. arXiv:2102.08619. Retrieved from https://arxiv.org/abs/2102.08619.
- [59] 2017. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR’17).