Research Article | Open Access

A Framework for Neural Network Architecture and Compiler Co-optimization

Published: 29 October 2022


Abstract

The efficiency of deep neural network (DNN) solutions on real hardware devices is mainly determined by the DNN architecture and the compiler-level scheduling strategy on the hardware. When trying to fully exploit the underlying hardware and obtain the optimal tradeoff between DNN accuracy and runtime performance, we discovered that the two optimization goals of DNN architecture and scheduling policy are intimately related to each other. However, current hardware-aware Neural Architecture Search (NAS) methods primarily focus on the DNN architecture search process, ignoring the effects of various compiler-level scheduling strategies (e.g., graph-level optimization, loop transformations, parallelization, etc.) on the network candidates being evaluated in the search process. As a result, they may overlook the true-optimal DNN implementations on hardware, which can only be discovered by trying out different combinations of scheduling strategies and DNN architectures. This work proposes a NAS framework (CHaNAS) that searches for not only the network architecture but also the dedicated compiler-level scheduling policy, as the optimal co-design solution on the target hardware. We propose a block-based pre-scheduling methodology that reduces the co-design search space and enables the automatic generation of the optimal co-design, including the network architecture and the tensor programs that realize the scheduling policy. Further, we introduce a new search objective function based on the generalization gap to prevent the selection of architectures that are prone to overfitting. We evaluate CHaNAS on ImageNet on different hardware back-ends against the state-of-the-art hardware-aware search method based on the MobileNet-v3 search space.
Experimental results show that the co-design solutions obtained by CHaNAS achieve up to 1.6×, 1.9×, and 1.7× performance boosts on the NVIDIA P100 GPU, Intel Xeon 8163 CPU, and Samsung Note 10 Mobile, respectively, over baselines of the same-level accuracy.


1 INTRODUCTION

The widespread use of deep learning applications has fueled the development of Deep Neural Network (DNN) model design and efficient deployment on rapidly evolving hardware. There is a deep stack of optimization technologies accessible when building an efficient application or domain-specific neural network system [6, 17], including enhanced neural network architecture [24, 25, 26, 54], optimized frameworks and compilers [4, 12, 18, 40, 46], and even customized hardware design [16, 47, 52, 53]. However, rather than using DNN optimization technology on its own, it has been demonstrated that cross-stack co-design approaches that orchestrate techniques from several layers can produce better outcomes [5, 14, 30, 52]. For example, some recent works [14, 22, 23, 28, 37] have investigated how to design an optimized neural network architecture that fully considers the characteristics of the target hardware. Such a hardware-aware network architecture design method, i.e., hardware-aware Neural Architecture Search (NAS), can automate the DNN design and even exceed earlier human-designed models by incorporating hardware features into the NAS loop.

Previous hardware-aware NAS methods have only considered the co-optimization of the DNN architecture and hardware-related design variables such as representation precision and resource provisioning [30, 44, 50]. However, as we have emphasized, the efficiency of DNN applications also depends on multiple layers of optimization techniques. Specifically, in addition to neural network architectures, how the neural network is scheduled onto the hardware at the compiler level, through task graph reordering, loop reordering, loop tiling, memory customization, and other compiler-managed scheduling policies, plays an important role in the performance of the target neural network system [15, 19, 29]. For example, a typical hardware-aware NAS framework usually relies on a performance model to estimate the performance of candidate neural networks on the target hardware, then decides whether the discovered network should be kept or updated during the iterative search process. However, in these evaluation-and-search iterations, almost all prior hardware-aware NAS works assume a fixed network scheduling policy that may not extract the best performance from the under-evaluation network architecture on the hardware. Some other works focus on tuning the schedule mapping strategy for a given neural network model to optimize performance on different hardware [15, 20, 49], but they ignore the correlation between the interacting stacks and thus arrive at locally optimal solutions. In contrast, we show that to discover the best solution at the system level, an ideal hardware-aware NAS framework should incorporate compiler-level optimization, i.e., scheduling strategies, for the target hardware. In this article, we present the Compiler- and Hardware-aware Network Architecture Search (CHaNAS) framework, the first to achieve this goal.

In CHaNAS, we orchestrate two key components from the deep learning optimization stack: the DNN architecture and the scheduling policy in the compiler that tactically maps a model onto the target hardware. As shown in Figure 1, CHaNAS constructs a joint search space that combines DNN architectures and scheduling-level optimizations, as opposed to traditional methods that either optimize the neural network using NAS based on a fixed schedule policy from a given deep learning library (e.g., TensorFlow) or tune the scheduling policy to maximize the inference speed of a given network. For each candidate network evaluated in the search space, CHaNAS can measure and discover the realistic peak performance that is revealed by trying out different schedule policies, such as instruction-level scheduling, memory allocation, loop transformations, and so on, and then choose the true-optimal one. Realizing this goal faces multiple challenges. First, we must automatically construct the joint search space to cover as many DNN/schedule-policy design pairs as possible for the given AI task and target hardware. Second, we must precisely and efficiently evaluate each design pair, which is a time-consuming process. Finally, we need to search efficiently in the huge search space to return the optimal co-design solution, which includes not only the DNN implementation but also the corresponding high-performance tensor programs for the potential hardware back-ends.

Fig. 1.

Fig. 1. CHaNAS jointly designs the neural architecture and the computation-graph scheduler to fit the target hardware characteristics. With an enlarged optimization design search space, CHaNAS is more likely to achieve better results than existing scheduler-agnostic hardware-aware NAS methods.

To this end, CHaNAS utilizes a block-based hierarchical design representation to construct the huge co-design search space automatically. In the CHaNAS search space, both the network architecture design and the network scheduling are conducted on the basis of neural network blocks, which helps divide the entire network design space exploration into two coordinated phases. We then apply a supernet that acts as an accuracy predictor and use the pre-scheduled blocks to reduce the evaluation cost. Finally, to reduce the search cost, we divide the search space and employ an evolutionary search technique. Overall, as an automated schedule-aware neural architecture search method, CHaNAS combines the advantages of NAS with schedule optimization that fits the target hardware, allowing a larger design space for architecture design. Therefore, with a greater degree of design freedom than previous schedule-agnostic hardware-aware NAS, it is more likely to find the optimal specialized DNN solution for varied tasks and different hardware back-ends. We evaluate CHaNAS on ImageNet on different hardware back-ends against the state-of-the-art (SOTA) hardware-aware search method MobileNet-v3 [24]. Experiments show that the co-design solutions obtained by CHaNAS achieve up to 1.6×, 1.9×, and 1.7× performance boosts on the NVIDIA P100 GPU, Intel Xeon 8163 CPU, and Samsung Note 10 Mobile, respectively, over baselines of the same-level accuracy.

In summary, this article makes the following contributions:

  • We propose CHaNAS, the first automatic joint search methodology for neural network architecture and scheduling policy on different hardware. By using a block-based hierarchical design representation, CHaNAS constructs a large joint co-design space and is better able to locate the optimal solution for the target hardware under specified design constraints than existing scheduler-agnostic hardware-aware NAS methods.

  • We carefully design the blocks used to construct the target DNN model and consider the block size to trade off the architecture search cost against the schedule optimization effect. We also propose an elastic supernet that includes all blocks and is employed by CHaNAS to generate child neural network models in the solution search process. The basic building blocks of this supernet can be automatically pre-scheduled and evaluated on the target hardware to expose their optimal performance to CHaNAS during the search process at little cost.

  • Due to the huge co-design search space, we divide the co-design search space according to the constraints of the deployment scenario to reduce the search cost, and employ an evolutionary search strategy to explore the reduced co-design search space, which gradually generates higher-quality co-design results for the target hardware platform.

  • To further improve the quality of the automatically generated co-design solution, we propose a new objective function to enhance the generalization capability of the target network, which has been empirically shown to yield networks that are less affected by the weight-sharing problem. In our experiments, the introduction of the new objective function improves the accuracy of the discovered DNN solutions by \( 0.5\% \).


2 BACKGROUND AND MOTIVATION

In this section, we first introduce two critical factors that impact the DNN solution on the target hardware: the efficient DNN architecture design and the schedule strategy that maps the DNN onto different hardware. Then we introduce the heuristic approach that is typically used not only in DNN architecture search but also in our co-design process. Finally, we present the observation that motivated us to automatically co-design the DNN architecture and the corresponding schedule strategy for the target hardware platforms.

Efficient Neural Network Design. Efficient DNN model design is crucial for the overall performance of a deep learning system. Many efficient neural network architectures have been proposed to improve hardware efficiency, such as SqueezeNet [26], MobileNets [25], ShuffleNets [54], and so on, which mainly focus on reducing computation (e.g., adopting depthwise convolution). To reduce the manual effort in efficient DNN design, NAS automates the architecture design process and has begun to dominate recent efficient network design, although early NAS methods [33, 59] searched for high-accuracy architectures without considering hardware efficiency. With the development of NAS techniques [8], researchers began to incorporate multi-objective optimization into the NAS process [23, 28, 37, 44, 48, 52, 58]. They typically search for efficient NAS models with higher accuracy, fewer parameters, fewer FLOPs (number of multiply-add operations), and lower latency (or energy). As a result, those methods incorporate hardware characteristics into the NAS loop to increase inference efficiency, but their performance highly depends on the quality of the search space and search strategy [34, 42]. To evaluate each architecture efficiently, one-shot NAS substantially reduces the computation cost by training only one super-network, a.k.a. supernet, to approximate the performance of every architecture in the search space via weight sharing. Specifically, one-shot NAS uses the supernet to predict the performance of a specific architecture by deactivating the extra edges w.r.t. a target architecture on the supernet via masking, then performing evaluations using the masked supernet. In general, people follow manual design heuristics for NAS search-space design and use either heuristic or ML methods for the design search.
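The weight-sharing idea above can be illustrated with a deliberately minimal sketch (not the paper's implementation): a "supernet" stores one shared weight per candidate operator per layer, and a sub-network is evaluated by masking out the operators it does not use. The operator names and numeric "weights" here are purely illustrative.

```python
# Toy sketch of one-shot weight sharing: the supernet holds weights for every
# candidate operator at every layer, and a sub-network is evaluated by keeping
# only the operators it selects (the "mask"). All names/values are illustrative.

def evaluate_subnet(supernet, choices, x):
    """Run one input through the sub-network selected by `choices`.

    supernet: list of dicts, one per layer: {op_name: shared_weight}
    choices:  list of op_names, one per layer (the mask)
    Each toy "operator" is just multiplication by its shared weight.
    """
    for layer, op_name in zip(supernet, choices):
        x = x * layer[op_name]  # only the chosen op's weight participates
    return x

# Two sub-networks share the weights of any operator they have in common.
supernet = [{"conv3x3": 2.0, "conv5x5": 3.0},
            {"conv3x3": 0.5, "conv5x5": 1.0}]

y_a = evaluate_subnet(supernet, ["conv3x3", "conv5x5"], 1.0)  # uses 2.0 * 1.0
y_b = evaluate_subnet(supernet, ["conv5x5", "conv3x3"], 1.0)  # uses 3.0 * 0.5
```

Because both sub-networks read from the same shared tables, updating one architecture's weights affects every other architecture that uses the same operator, which is exactly the interference problem Section 3.4 later addresses.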

DNN Model Deployment on Hardware. How to map DNNs onto a particular hardware device is crucial to system performance [7, 36]. Contemporary DNN compilers [12, 18, 43] rely on an intermediate representation (IR) of the task graph to abstract the network architecture, so they can either manipulate the network IR at the graph level or redefine the implementation of each single operator, i.e., a node in the IR graph. As shown in Figure 2, optimizing these DNN models is generally performed in two steps: (i) high-level graph optimizations and (ii) low-level kernel optimizations such as those found in vendor libraries. Typically, at the beginning of compile optimization, the compiler partitions the large computation graph of a DNN model into several subgraphs, which become the basic units in the whole compilation process. This partition has a negligible effect on performance due to the layer-by-layer construction of DNNs [55]. Then graph-level optimization techniques are applied to the subgraphs, including layout optimization, operator fusion, constant folding, auto-batching, and so on. In contrast, operator-level computation optimization is often hardware specific. For example, achieving latency hiding for operator computation requires different strategies on different hardware back-ends: on the CPU, memory latency hiding is achieved implicitly with simultaneous multi-threading or hardware pre-fetching [12], while GPUs rely on rapid context switching across many warps of threads. Generally, in the deep learning compile process, to optimize an operator or a (sub-)graph of multiple operators, the compiler requires users to define the computation in a high-level declarative language; the compiler then searches for programs tailored toward different hardware platforms based on human-written schedule templates, including different schedule primitives (e.g., split, reorder, fuse, etc.) and proper parameters to achieve parallelism, vectorization, memory tiling, and memory-access latency hiding for different hardware. For example, TVM [12] requires the user to write a template that defines the structure of the operator programs with some tunable parameters. The compiler then searches for the best values of these parameters for a specific input shape configuration and a specific hardware target.
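To make the split/reorder/tile primitives concrete, here is a plain-Python sketch (not TVM's actual API) of the kind of loop transformation a schedule template exposes as a tunable knob. Both functions compute the same matrix product; the tiled version splits each loop by `tile` and reorders the resulting loops, which on real hardware improves cache reuse.

```python
# Illustrative only: a loop-tiling ("split" + "reorder") transformation of the
# kind a schedule template parameterizes. Semantics are unchanged; only the
# iteration order differs.

def matmul_naive(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, tile):
    C = [[0.0] * n for _ in range(n)]
    # "split" each of i, k, j by `tile`, then "reorder" the six resulting loops
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        for j in range(j0, min(j0 + tile, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

A tuner like the one described above would treat `tile` (and the loop order) as search knobs and measure each variant on the target hardware.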

Fig. 2.

Fig. 2. Illustration of DNN compiler optimization.

Heuristic search method. Many algorithms have been applied in both the NAS process and the schedule strategy search process, e.g., heuristic algorithms [12], Bayesian optimization [57], reinforcement learning [44, 59], and so on. In this article, we mainly use the heuristic method in the co-design search process, so we give a brief introduction to it here. As illustrated in Figure 3, the heuristic method mainly consists of five parts. First, a group of candidates is chosen to serve as the initial population. The algorithm then selects the best candidate after several iterations. In each iteration, some new individuals (called children) are created from the selected individuals (called parents), depending on whether mutation or crossover is applied. Each child is then evaluated to obtain a fitness value that determines the quality of the new individual. The new individual is inserted into the population if its fitness value is better than that of the worst-performing individual of the population; otherwise, the new individual is discarded. After that, the population is updated. Each iteration produces a new generation, and the process terminates when the maximum number of generations is reached.
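The five parts above can be sketched as a minimal evolutionary loop. This is a generic illustration, not the paper's implementation: the integer-vector encoding, mutation-only variation, and `sum` fitness are placeholders for a real design encoding and hardware-measured fitness.

```python
import random

# Minimal sketch of the heuristic (evolutionary) search loop described above:
# initialize, select a parent, mutate to create a child, evaluate its fitness,
# and replace the worst individual if the child is better.

def evolutionary_search(fitness, dims, pop_size=8, generations=20, seed=0):
    rng = random.Random(seed)
    def new_individual():
        return [rng.randint(0, 9) for _ in range(dims)]

    population = [new_individual() for _ in range(pop_size)]   # initialization
    for _ in range(generations):
        parent = max(population, key=fitness)                  # selection
        child = parent[:]
        child[rng.randrange(dims)] = rng.randint(0, 9)         # mutation
        worst = min(population, key=fitness)
        if fitness(child) > fitness(worst):                    # survival test
            population[population.index(worst)] = child        # update pop.
    return max(population, key=fitness)

best = evolutionary_search(fitness=sum, dims=4)  # toy fitness: maximize sum
```

A real co-design search would replace `fitness` with the accuracy/latency predictors described in Section 3 and add crossover alongside mutation.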

Fig. 3.

Fig. 3. Illustration of the heuristic search method.

Observation. Figure 4 reveals a phenomenon we discovered in SOTA NAS frameworks running realistic workloads. For different neural network task specifications, e.g., different datasets, optimization goals, or different performance constraints such as accuracy or latency requirements, an ideal NAS framework is supposed to generate the optimal neural network model for the target hardware through an automated search mechanism. However, as the user-specified accuracy constraint changes, the Pareto frontiers of NAS schemes that assume different scheduling solutions intersect with each other, which means none of the NAS baselines in Figure 4 can always generate a better solution than the others when the task specification or constraint changes. For example, in Figure 4, if the designer seeks a neural network that must run faster than 35 ms, then the NAS scheme that assumes Schedule-B is better, while the one with Schedule-A is more accurate when the latency constraint changes to 20 ms. In other words, prior schedule-agnostic NAS technologies cannot guarantee the optimal network solution in the search process because they assume a fixed compilation strategy. Thus, enabling effective and automatic co-optimization of the neural network architecture and the scheduling policy is necessary for an optimized neural network system. In this work, we are the first to investigate scheduler- and hardware-aware NAS that simultaneously searches for the network architecture and the corresponding network scheduling solution on the target hardware.
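The crossing-frontier effect can be reproduced with a small Pareto filter. The (latency, accuracy) values below are made up for illustration, not measurements from Figure 4; they are chosen so that each schedule wins in a different latency regime, and the combined frontier therefore mixes points from both.

```python
# Sketch: a point (latency, accuracy) is Pareto-optimal if no other point is
# both at least as fast and at least as accurate. Data below is illustrative.

def pareto_front(points):
    """Keep (latency, accuracy) pairs not dominated by any other pair."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] >= p[1] and q != p
                       for q in points)]

schedule_a = [(18, 74.0), (25, 75.5), (40, 76.0)]  # stronger at low latency
schedule_b = [(22, 73.5), (30, 76.5), (38, 77.0)]  # stronger at high latency

front = pareto_front(schedule_a + schedule_b)
# The joint frontier draws from both schedules, so fixing either schedule
# up front would discard part of the true frontier.
```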

Fig. 4.

Fig. 4. Relative Pareto frontiers of three different scheduling strategies. We randomly extract 200 models from OFA [9] and test them on the Note 10 mobile. Schedule-A shows the original schedule result of TF_Lite, while Schedule-B and Schedule-C show two schedule strategies that change the loop-split knobs in TF_Lite.


3 CHANAS FRAMEWORK

3.1 Problem Formalization

CHaNAS explores the massive joint design space, which is the confluence of the DNN architecture space and the network scheduling space for the target hardware. The DNN architecture space includes the model hyper-parameters such as layer types, input size, channels, and so on, while the scheduling space contains many compiler-level optimization knobs with their parameters, such as loop tiling, operator fusion, reordering, and so on. Assume the DNN architecture search space is \( \lbrace arch\rbrace =\lbrace arch_{1},\ldots ,arch_{n}\rbrace \), the scheduling strategy space is \( C=\lbrace c_{1}, \ldots ,c_{m}\rbrace \), and the target design performance constraint is \( P_{thres} \). The objective of CHaNAS is to search the joint search space \( \lbrace \lbrace arch\rbrace , C\rbrace \) for the best DNN architecture \( arch^{*} \) with the associated scheduling strategy \( c^{*}_{arch^{*}} \), which together yield the maximum network accuracy while the performance P at least satisfies the constraint \( P_{thres} \). Formally, we formalize this problem as (1) \( \begin{equation} \begin{aligned}\left(arch^{*}, c^{*}_{arch^{*}}\right) \in \underset{arch \in \lbrace arch\rbrace , c \in C}{\operatorname{argmin}} \mathcal {L}_{\text{val}}(arch,\omega ^{*}, c) \\ s.t. \quad P(arch^{*}, c^{*}_{arch^{*}}) \lt P_{thres} \\ \quad \omega ^{*}=\underset{\omega }{\operatorname{argmin}} \mathcal {L}_{\text{train}}(\omega , arch), \end{aligned} \end{equation} \) where \( \mathcal {L}_{\text{train}} \) and \( \mathcal {L}_{\text{val}} \) are the task losses on training and validation data, respectively, and \( \omega \) denotes the network weights.
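A direct brute-force reading of Equation (1) may help: over all (arch, schedule) pairs that satisfy the performance constraint, pick the pair whose architecture has the lowest validation loss. The loss and latency tables below are toy stand-ins for trained-network losses and hardware measurements.

```python
# Brute-force sketch of Equation (1) with illustrative numbers. In CHaNAS the
# space is far too large to enumerate; this only expresses the objective.

val_loss = {"arch1": 0.30, "arch2": 0.25, "arch3": 0.28}   # L_val(arch, w*)
latency = {("arch1", "c1"): 12, ("arch1", "c2"): 18,       # P(arch, c), in ms
           ("arch2", "c1"): 25, ("arch2", "c2"): 19,
           ("arch3", "c1"): 15, ("arch3", "c2"): 22}
P_thres = 20

# Constraint: keep only pairs with P(arch, c) < P_thres.
feasible = [(a, c) for (a, c) in latency if latency[(a, c)] < P_thres]
# Objective: minimize validation loss over the feasible pairs.
arch_star, c_star = min(feasible, key=lambda ac: val_loss[ac[0]])
```

Note that `arch2` is the most accurate architecture overall, but it is only feasible under one of its two schedules; a scheduler-agnostic search that fixed schedule `c1` would have rejected it.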

3.2 CHaNAS Overview

As shown in Figure 1(c), in contrast to prior works that only search for solutions inside the network architecture space \( \lbrace arch\rbrace \), CHaNAS has a significantly larger design space to explore. Thus, we need to answer the following questions: (i) How should we automatically construct the co-design search space? It needs to cover both the DNN model and the scheduling policy, which requires a compact design-space representation to describe the potential co-design solutions with high efficiency. (ii) How should we evaluate each candidate co-design pair efficiently? We have to perform costly DNN model training and sophisticated schedule optimization to get the real performance on the target hardware. (iii) How should we reduce the huge search space and design an effective search strategy that quickly converges to high-quality co-design solutions? Thus, for the first time, CHaNAS conducts the architecture/schedule-policy co-design by employing a block-based hierarchical design representation to construct the co-design search space efficiently and reduce the effort spent exploring unnecessary options.

As a sub-graph of interconnected neural network operators (e.g., capsule conv2d, dilated conv2d), a CHaNAS block is the fundamental unit in both network architecture design and scheduling policy search. The choice of block-level construction is based on two observations. First, the block is the basic unit in network architecture search, as mainstream NAS approaches pre-define the elementary blocks and generate the inter-block connections to form a candidate network architecture. Second, in a deep learning compile system, the compiler usually divides the full computation graph into sub-graphs, which can be treated as blocks. The scheduler then attempts to reorder the operators within blocks and maps each block onto hardware by trying out back-end scheduling techniques such as loop unrolling and blocking. Therefore, block-level construction is flexible enough to support different design constraints and different hardware platforms. Furthermore, such a factorized hierarchical search space strikes a good balance between the diversity of potential co-design options and the size of the entire co-design search space, which addresses the difficulty of model-level schedule optimization. Suppose we partition the network into B blocks, each block has a parameter search space of size R, and there are on average S scheduling policies per block. Due to our block-based design, the neural network architecture search and the schedule policy search are separated, with search space sizes of \( R^{B} \) and \( B*S \), respectively. So our total search space would be \( B*S+R^{B} \), versus the original single-level co-design search space of size \( R^{B}*S^{B} \).
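The size argument above is easy to check numerically. The values of B, R, and S below are illustrative, not the paper's actual space sizes, but they show that the additive factorization removes the multiplicative \( S^{B} \) blow-up.

```python
# The factorization argument in numbers (B, R, S are illustrative):
# decoupling per-block scheduling from architecture search shrinks the joint
# space from R^B * S^B points to B*S + R^B points.

B, R, S = 5, 12, 100            # blocks, arch options/block, schedules/block

joint = (R ** B) * (S ** B)     # single-level co-design space: R^B * S^B
factored = B * S + R ** B       # CHaNAS: pre-schedule blocks, then search archs
reduction = joint / factored    # how many times smaller the factored space is
```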

In the whole design process, the hierarchical exploration flow of CHaNAS hinges on neural blocks and involves two coordinated phases: the top-down network scheduling component pre-generates the implementations for each possible neural block and evaluates them on the target hardware, while the bottom-up network architecture generation unit searches through the possible sequences of neural blocks that have already been virtually mapped and optimized on hardware by the scheduling unit. With these two exploration paths, the optimal co-design solution can be found with high probability. Figure 5 depicts the overview of CHaNAS, which has three major components: (1) a block-stacked super-net that captures the DNN architecture search space (all candidate DNN architectures can be extracted from the super-net, which enables fast accuracy evaluation as a candidate DNN directly inherits its parameters to bypass the expensive network training stage); (2) a block-based scheduling space explorer that transforms each block into a computational sub-graph and optimizes it on the target hardware; and (3) a search algorithm that divides the joint space to reduce the search cost and then searches within the co-design sub-space. In CHaNAS, there are three major steps in the co-design flow.

Fig. 5.

Fig. 5. Overview of CHaNAS. Both the network architecture design and the network scheduling are conducted on the basis of neural network blocks.

Step 1: Construction of the Elastic Super-Net. We first build the super-net, from which many candidate DNN architectures can be derived. The super-net is stacked with blocks; specifically, each block in the super-net has many variable hyper-parameters, such as kernel size, channels, input shape, and so on. After the super-net training completes, CHaNAS extracts child networks for the target hardware from the super-net during the search for co-design solutions. In the super-net training process, we use a new generalization-based search objective function, which mitigates the overfitting seen in previous one-shot NAS methods.

Step 2: Block-level Pre-Scheduling. For the scheduling-level optimization space, the basic DNN building blocks are virtually pre-scheduled and optimized on the target hardware, generating many block-level co-designs that will be used in the evaluation stage of the global joint design-space search. Given a parameterized block, the scheduling space, which covers both graph-level optimization of the overall block topology and operator-level optimizations, is explored via a heuristic method; the candidate scheduling policies are fine-tuned under the direction of a learned cost model until the best scheduling point for that block is obtained.
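In spirit, the pre-scheduling step behaves like the following sketch: enumerate candidate scheduling policies for one parameterized block, score each with a cost model, and keep the best. The policy encoding and the closed-form cost model below are invented placeholders for the learned cost model and hardware measurements used by CHaNAS.

```python
# Hedged sketch of a block schedule optimizer: try each candidate policy,
# score it with a (here, fake) cost model, and return the best policy and its
# predicted performance, mirroring Equation (4)'s (c*, P) output.

def schedule_block(block_params, policies, cost_model):
    best_policy, best_perf = None, float("inf")
    for c in policies:
        perf = cost_model(block_params, c)  # lower = faster (e.g., latency)
        if perf < best_perf:
            best_policy, best_perf = c, perf
    return best_policy, best_perf

# Toy cost model: latency grows with channel count, shrinks with tile size,
# and pays a small fixed cost per unroll factor. Purely illustrative.
def toy_cost(params, c):
    return params["channels"] / c["tile"] + c["unroll"]

policies = [{"tile": 4, "unroll": 2},
            {"tile": 8, "unroll": 1},
            {"tile": 16, "unroll": 4}]

c_star, p_star = schedule_block({"channels": 64}, policies, toy_cost)
```

The (policy, performance) pair returned per block is what gets stored in the look-up table consumed in Step 3.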

Step 3: Co-design Exploration. Given the deployment constraints (latency, in our tests, as an example), we first reduce the co-design search space according to the cumulative distribution function of network inference latency. Then we use an evolutionary searcher to explore the reduced search space, for which we build a DNN accuracy predictor and a performance predictor based on the block-performance Look-Up Table (LUT) profiled at the block-level pre-scheduling stage. Finally, the target DNN model with a dedicated scheduling strategy is returned.
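A minimal sketch of the LUT-based performance predictor: whole-network latency is estimated by summing the pre-scheduled per-block latencies, so a candidate can be checked against the deployment budget without touching hardware. The keys and millisecond values in the table are illustrative, not profiled numbers.

```python
# Sketch of the block-LUT latency predictor used in co-design exploration.
# Keys encode a block configuration; values are its pre-scheduled latency.

block_lut = {("mbconv", 3, 3): 1.8,   # (block type, kernel, expand) -> ms
             ("mbconv", 5, 3): 2.6,
             ("mbconv", 5, 6): 4.1,
             ("mbconv", 7, 6): 5.9}

def predict_latency(network):
    """Estimate end-to-end latency as the sum of per-block LUT entries."""
    return sum(block_lut[blk] for blk in network)

candidate = [("mbconv", 3, 3), ("mbconv", 5, 3), ("mbconv", 5, 6)]
lat = predict_latency(candidate)   # 1.8 + 2.6 + 4.1
meets_budget = lat < 10.0          # deployment latency constraint
```

Inside the evolutionary loop, this predictor plays the role of the constraint check, while the supernet-based accuracy predictor supplies the fitness value.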

3.3 Elastic Super-net Construction

The block is the basic unit in CHaNAS architecture search and scheduling search. However, designing the block size involves a tradeoff. In the CHaNAS design space, if the block structure is defined at an overly fine granularity, i.e., with fewer layers in the block, then the graph-level scheduling search space within a block will be too small to conduct thorough scheduling-level optimization. In contrast, if the blocks are too large and lead to a massive block design space, then the search algorithm will be less likely to converge to the optimal architecture in an oversized architecture search space. Therefore, we propose a medium-grained block design method to build an elastic super-net that balances architecture search efficiency and scheduling search efficacy. The super-net does not induce too much search complexity and contains blocks large enough to explore the potential of scheduling-level search.

Figure 6 shows the architecture of the super-net, which defines the architecture search space. First, the DNN is composed of N basic building blocks. In this work, to design hardware-friendly DNNs and reduce search time, we adopt the single-path DNN structure without branches [10, 58]. Second, each block \( block_{i}, (1 \le i \le N) \), contains M units. Following the common practice of NAS approaches [9, 44], we adopt the elastic MBConv cell as the basic unit in a block. An elastic MBConv cell is composed of sequential layers of conv-1 × 1 (convolution with kernel size 1 × 1), dwconv-k × k (depthwise convolution with kernel size k), SE (Squeeze-and-Excitation), and conv-1 × 1. Between conv-1 × 1 and dwconv-k × k, the number of channels expands by a ratio E. In the elastic MBConv, we can search the kernel size k from \( \lbrace 3,5,7\rbrace \), and we also search the channel expansion ratio E from \( \lbrace 3,6\rbrace \) for each block except the first one, which has a default expansion ratio of 1. As in Reference [9], we allow a parameterizable kernel size through a kernel transformation matrix in each MBConv, which expedites the training process. Furthermore, the depth D of the block can also vary, meaning only the first D units are kept in the sampled block. In addition to these configurable parameters, we also allow the DNN model to take arbitrary input image sizes by assigning the model a size ratio.
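The elastic search space above can be sketched as a sampler that draws one child architecture: per block, a kernel size from {3, 5, 7}, an expansion ratio from {3, 6} (fixed at 1 for the first block), and a depth D of at most M units. The block count, unit count, and the uniform sampling itself are illustrative choices, not the paper's exact settings.

```python
import random

# Sketch: sample one child architecture from the elastic super-net space
# described above. n_blocks (N) and max_units (M) are illustrative.

def sample_child(n_blocks=5, max_units=4, seed=0):
    rng = random.Random(seed)
    child = []
    for i in range(n_blocks):
        child.append({
            "kernel": rng.choice([3, 5, 7]),          # elastic kernel size k
            "expand": 1 if i == 0 else rng.choice([3, 6]),  # ratio E
            "depth": rng.randint(1, max_units),       # keep first D units only
        })
    return child

child = sample_child()
```

In the real framework, a child sampled this way inherits its weights from the trained super-net (via masking) instead of being trained from scratch.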

Fig. 6.

Fig. 6. Super-net architecture, which is composed of a number of blocks.

The elastic design of blocks allows the super-net to offer candidate sub-networks that are sufficiently flexible for different deployment constraints while keeping the search cost affordable. To further reduce the evaluation cost, we train the super-network that contains all possible sub-networks through weight sharing and use it to estimate the accuracy of each sub-network. Specifically, in the super-net training process, we design an objective function to reduce overfitting. In the following section, we introduce the generalization-based loss function design.

3.4 Training Objective of the Super-net

It has been demonstrated that sharing the weight parameters among all architecture instances yields several orders of magnitude speedup in the search process [33, 41]. However, this comes at the cost that the performance of the candidate networks cannot be fairly evaluated, since they inherit sub-optimal shared parameters from the supernet. This is because the super-net contains many sub-networks of different sizes and shapes that share weights with each other, so each sub-network inherits common sub-optimal weights. Intuitively, this leads to low-quality decision-making by the search controller, which selects candidate networks without considering their genuine optimal performance. Therefore, we propose a new training objective function to reduce model overfitting and prevent interference between the sub-networks.

Recall that previous NAS methods [33, 44, 48, 59] use either the training loss or the validation loss to update the network parameters, i.e., (2) \( \begin{equation} \begin{aligned}\mathbb {E}\left[\mathcal {L}_{\text{train}}(arch, \boldsymbol {w})\right], \ or \ \mathbb {E}\left[\mathcal {L}_{\text{val}}(arch, \boldsymbol {w})\right]\!. \end{aligned} \end{equation} \)

Compared to this naive approach, our objective function is based on the generalization gap of the network architecture. The intuition is that the selected sub-network should perform well on data it has not been trained on, thus reducing the effect of the parameter-sharing problem. Therefore, we focus on enhancing the generalization ability of the candidate networks by adding a generalization loss to the super-net training objective function. This improves the accuracy of the search controller by enforcing a fairer evaluation and helps it select network models with higher generalization performance. Formally, we define the objective function in the super-net training process as follows: (3) \( \begin{equation} \begin{aligned}\mathbb {E}\left[\mathcal {L}_{\text{train }}(arch, \boldsymbol {w})+\lambda \left|\mathcal {L}_{\mathrm{val}}(arch, \boldsymbol {w})-\mathcal {L}_{\text{train }}(arch, \boldsymbol {w})\right|\right], \end{aligned} \end{equation} \) where w represents the current network’s weights and \( \mathcal {L}_{\mathrm{val}}(arch, \boldsymbol {w})-\mathcal {L}_{\text{train }}(arch, \boldsymbol {w}) \) can be treated as the generalization loss. The scalar \( \lambda \) balances the training loss and the generalization loss; we observe that \( \lambda =0.45 \) works well in our experiments. We find that the generalization loss helps CHaNAS gain a \( 0.5\% \) network accuracy improvement, and we present a detailed analysis of how it improves overall performance in the evaluation section.
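Equation (3) is cheap to express in code. The loss values below are invented to show the intended effect: at equal training loss, a sub-network with a large generalization gap scores worse than one with a small gap.

```python
# Equation (3) as code: training loss plus a lambda-weighted |val - train|
# generalization-gap term (lambda = 0.45 in the paper's experiments).

def supernet_objective(train_loss, val_loss, lam=0.45):
    return train_loss + lam * abs(val_loss - train_loss)

# An overfitting sub-network (low train loss, high val loss) is penalized
# relative to one with a small gap, even at identical training loss.
overfit = supernet_objective(train_loss=0.20, val_loss=0.60)   # gap = 0.40
balanced = supernet_objective(train_loss=0.20, val_loss=0.25)  # gap = 0.05
```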

3.5 Block-level Pre-scheduling

In the prior NAS process, when a network architecture candidate is selected, it is mapped to the underlying hardware (or a hardware model) for performance evaluation. In CHaNAS, however, the generated network must first try out and choose the optimal scheduling policy for the specific hardware before its true-optimal performance can be measured. As a result, we must ensure that the blocks of each network are mapped to the hardware with the best scheduling policy. To select the best scheduling point for a block on the target hardware, we build a block schedule optimizer \( \mathcal {F} \) that takes the parameterized block \( block_{i} = (A_{1}, A_{2}, \ldots , A_{\mathrm{b}}) \) from the super-net and the hardware descriptor \( D_{hard} \) as inputs, and outputs the best scheduling policy \( c_{block}^{*} \) and the associated network performance \( P_{c_{block}^{*}} \) on the hardware: (4) \( \begin{equation} \begin{aligned}\mathrm{c_{block}^{*}}, P_{c_{block}^{*}} =\mathcal {F}(A_{1}, A_{2}, \ldots , A_{\mathrm{b}}, D_{hard}). \end{aligned} \end{equation} \)

This guarantees efficient execution of these blocks on the target hardware. The block-scheduling process, depicted in Figure 7, comprises two parts: graph-level optimization and lower-level operator scheduling for the hardware platform. For graph-level optimization, we mainly adopt two methods: layer fusion (including intrinsic fusion, pointwise fusion, and kernel fusion), which reduces the memory and communication overhead of intermediate data, and data layout transformation (such as flatten, concat, and reorganization operations), which transforms feature data layouts into back-end-friendly patterns.
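As a rough illustration of the pointwise-fusion part of this pass, the sketch below greedily merges elementwise consumers into their producer over a linear op sequence; the op names and the single fusion rule are simplified assumptions, not the actual rule set used by CHaNAS:

```python
def fuse_pointwise(ops):
    """Greedy pointwise fusion over a linear op sequence: merge elementwise
    consumers (relu, add, ...) into their producer so the intermediate tensor
    never round-trips through memory.  The rule set is deliberately simplified."""
    POINTWISE = {"relu", "add", "bias_add"}
    fused, i = [], 0
    while i < len(ops):
        node = ops[i]
        # absorb any run of pointwise consumers into the current node
        while i + 1 < len(ops) and ops[i + 1] in POINTWISE:
            node += "+" + ops[i + 1]
            i += 1
        fused.append(node)
        i += 1
    return fused
```

For example, `["conv2d", "bias_add", "relu", "depthwise_conv2d", "relu"]` collapses to two fused kernels instead of five separate ones.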

Fig. 7.

Fig. 7. Block schedule optimization, including graph-level optimization and lower-level operators schedule mapping.

For lower-level operator scheduling, since each block is composed of MBConv cells with similar structures but different tensor shapes, we optimize the schedule of each operator (e.g., conv2d, depthwise conv2d) in an MBConv cell independently. Lower-level operator scheduling is a self-tuning process that searches for low-level implementations with maximum hardware performance. Initially, we generate a scheduling space by enumerating different combinations of hardware-specific schedule primitives and their adjustable parameters (listed in Table 1), including memory tiling, loop transformations, vectorization/tensorization, and parallelization strategies [13]. Then, without human intervention, we apply a heuristic method to search this space for schedules that maximize performance for a particular combination of operator, tensor shape, and hardware configuration.
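The scheduling-space generation can be sketched as a Cartesian product over a few knobs; the concrete knobs and value ranges below are illustrative placeholders for the primitives in Table 1:

```python
import itertools

def factor_pairs(extent):
    """All (outer, inner) ways to split one loop of the given extent."""
    return [(extent // f, f) for f in range(1, extent + 1) if extent % f == 0]

def enumerate_schedule_space(loop_extent, unroll_depths=(0, 2, 4), vector_widths=(4, 8)):
    """Cartesian product of a few schedule knobs (split factor, unroll depth,
    vector width).  This mirrors how the space is generated from the primitives
    in Table 1; the knob set here is a small illustrative subset."""
    return list(itertools.product(factor_pairs(loop_extent), unroll_depths, vector_widths))
```

Even this tiny subset already yields 30 candidate schedules for a single loop of extent 16, which is why a heuristic search rather than exhaustive enumeration is needed for real operators.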

Table 1.
Name | Description & Parameters
split | divide a loop into several sub-loops – loop to split and split factors
fusion | merge several loops into a hyper-loop – adjacent loops to fuse
reorder | change the execution order of loops – loops to reorder and the new order
unroll | unroll a loop by a given depth – which loop to unroll and the unroll depth
vectorize | apply vector operations to a loop – which loop to vectorize
inline | inline a function – which node to inline
compute_at | put the producer in the body of the consumer – which node and how deep to compute at
parallel | use multithreading – which loop to parallelize
cache | use shared memory to store inputs/results – how much data to cache
bind | assign a loop to parallel blocks/threads – which loop to bind to a block/thread

Table 1. Different Scheduling Primitives with Their Description

Encoding the operator schedules. The operator schedule is encoded as a vector, with each element representing a specific primitive or parameter choice. Figures 8(c) and (e) show two schedules for the conv2d operator instance in Figure 8(b): both split the original seven loops of conv2d into 13 sub-loops but with different split parameters, then reorder them and generate a larger outer-most loop by fusion. Figures 8(d) and (f) are the corresponding encoded representations of Figures 8(c) and (e), respectively. The encoding follows a fixed order so that configuration points with the same number of parameters are put together. For the split primitive, we record the split factors and form a Cartesian product of sub-spaces: supposing a loop of extent L in a two-dimensional (2D) convolution is split into Z parts, the choices are the Z-factorizations of the integer L, i.e., tuples \( [f_{1},f_{2},\ldots ,f_{Z}] \) with \( f_{1} \times f_{2}\ldots \times f_{Z}=L \). For reorder, we record the new order of the loops. For fusion, loops that are not recorded are fused with their nearby outer loops. For unroll, each loop is encoded as 1 if it is unrolled and 0 otherwise. Besides, we always parallelize the outer-most loop and vectorize the inner-most loop; therefore, parallel and vectorize are not encoded.
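The split sub-space described above, the ordered Z-factorizations of a loop extent L, can be enumerated with a short recursion; this is a sketch of the encoding space, not the exact implementation:

```python
def z_factorizations(L, Z):
    """All ordered tuples [f1, ..., fZ] with f1 * f2 * ... * fZ == L: the
    encoding sub-space for one loop of extent L split into Z sub-loops."""
    if Z == 1:
        return [[L]]
    result = []
    for f in range(1, L + 1):
        if L % f == 0:
            # fix the first factor, recurse on the remaining extent
            for rest in z_factorizations(L // f, Z - 1):
                result.append([f] + rest)
    return result

splits = z_factorizations(8, 3)  # e.g., [1, 1, 8], [2, 2, 2], [8, 1, 1], ...
```

Note that order matters: `[2, 4, 1]` and `[4, 2, 1]` are distinct schedule points, since each factor determines the extent of a different sub-loop.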

Fig. 8.

Fig. 8. The example of Operator scheduling example in CHaNAS. (a) Conv2d description. (b) Conv2d code example. (c) Schedule description. (d) Schedule encoding of (c). (e) Schedule description. (f) Schedule encoding of (e).

Scheduling-level Search. To explore this large design space efficiently, we employ a heuristic method guided by a Gaussian-Process- (GP) based cost model, which is pre-trained using runtime measurement data collected from the target hardware. Different points in the search space are evaluated by querying the GP-based cost model. Figure 9 shows the schedule-exploration process. Before exploration, an initial set H is maintained, which holds scheduling points \( {p_{1}, p_{2} \dots p_{H}} \) together with their performance results \( E_{p} \). During exploration, we choose a starting point p in H with probability \( \exp ^{-\gamma \frac{\left(E^{*}-E_{p}\right)}{E^{*}}} \), where \( \gamma \) is a hyperparameter and \( E^{*} \) is the performance of the best schedule point in H; the closer \( E_{p} \) is to \( E^{*} \), the more likely p is to be chosen. We then mutate this point to generate a set of new points and evaluate them with the GP-based cost model. We also define rules in the mutation mechanism to avoid generating inefficient schedule points: for example, we find that in the vast majority of cases a divisible split factor is the most efficient, while other split factors produce inferior outcomes, so for each loop we restrict mutation to divisible split factors. The top-k newly generated points are then measured on the hardware and added to H, and all of this measurement data is used to update the GP-based cost model. Thanks to the blocks’ structural similarity, we can reuse the GP-based cost model to speed up the scheduling-policy exploration. Consequently, after a certain number of iterations we gradually find the optimal scheduling policy of an operator on the hardware platform, which is kept in a LUT for CHaNAS to query during the coordinated search process.
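A minimal sketch of the starting-point selection rule is shown below, with performance treated as higher-is-better; the helper names are our own:

```python
import math
import random

def selection_weights(history, gamma=1.0):
    """Weight of each point p in H: exp(-gamma * (E* - E_p) / E*), where E* is
    the best performance in H.  The closer E_p is to E*, the larger the weight."""
    e_best = max(perf for _, perf in history)
    return [math.exp(-gamma * (e_best - perf) / e_best) for _, perf in history]

def pick_start_point(history, gamma=1.0, rng=random):
    """Sample a starting point for mutation in proportion to the weights above."""
    points = [p for p, _ in history]
    return rng.choices(points, weights=selection_weights(history, gamma), k=1)[0]
```

The best point in H always receives weight 1, and worse points decay exponentially but never to zero, so the search keeps some chance of escaping a local optimum.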

Fig. 9.

Fig. 9. Overview of the operator-level schedule exploration process, which breeds the next generation using mutation and crossover operators. The schedule explorer evaluates the performance of each schedule candidate using a Gaussian-Process-based cost model and then selects the schedule with the highest score to run on a distributed device cluster via RPC, and, finally, we get the real performance and collect the performance data into the database H. To improve the cost model’s predictive accuracy, we update the cost model periodically using collected data in the database H.

3.6 Co-design Space Exploration

Compared to the original joint space, the size of the factorized hierarchical search space is reduced exponentially, since candidate DNN architectures can be compared directly by their best performance, given that the blocks composing the DNNs have already been evaluated on the target hardware in the pre-scheduling stage. However, the sub-net search space is still enormous, and evaluating each candidate DNN still needs additional effort even though weight sharing avoids direct training. We first automatically divide the search space according to the performance requirements, which eliminates the search in sub-spaces that do not satisfy the constraint, and then use an evolutionary method to search for the target solution.

Automated NAS search space division. In our analysis, we find that the input resolution and the model expansion ratio are the two most important factors affecting the network’s performance, so we divide the network search space according to these two factors. In this work, the input resolution factor is selected from \( In\_size = \lbrace 160,176,192,208,224\rbrace \) and the model expansion ratio from \( Md\_E = \lbrace 1.0, 1.1, 1.2, 1.3\rbrace \); these parameters can be adjusted for different situations. The space division leads to \( |In\_size| \times |Md\_E| \) (\( 5\times 4=20 \) in our experiment) possible sub-spaces. Our first-step goal is to find the best sub-space that contains the target model.
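The division itself is just the cross product of the two factor sets:

```python
import itertools

IN_SIZE = [160, 176, 192, 208, 224]   # candidate input resolutions
MD_E = [1.0, 1.1, 1.2, 1.3]           # candidate model expansion ratios

# Each (resolution, expansion-ratio) pair defines one sub-space to rank.
sub_spaces = list(itertools.product(IN_SIZE, MD_E))
```

This yields the 20 sub-spaces mentioned above; the next step ranks them by sampled latency rather than searching each one exhaustively.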

Locating the target sub-space is non-trivial. One way is to perform a network search on each sub-search space and compare the final results, but the computation overhead would be astronomical. Instead, we evaluate the quality of a search space by randomly sampling \( \delta \) networks from it and comparing the distribution of the qualified networks. We collect the Cumulative Distribution Function (CDF) of each satisfying network’s latency and choose the sub-search space with the lowest average latency. The intuition is that, within the same model family, a sub-space with a lower latency CDF is more likely to contain higher-performance models. For computing network latency, we construct a block performance LUT for the target device to enable fast performance estimation of DNN candidates. In the LUT, we track block-level latency measured on real devices for different input dimensions, channel expansion ratios, and so on. The latency \( Lat_{net} \) of a network model is then obtained by summing the block-level latencies \( Lat_{block_{i}} \) from the LUT: (5) \( \begin{equation} \begin{aligned}Lat_{net} = \sum _{i=1}^{F}{Lat_{block_{i}}}, \end{aligned} \end{equation} \) where F denotes the total number of blocks. Since the block-level method includes the optimal graph-level and operator-level runtime optimizations, this is more accurate than previous operator-wise lookup tables: the block-level method captures the runtime optimizations by representing model sub-graphs together with their measured performance.
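Equation (5) and the sub-space ranking reduce to simple lookups and averages; the sketch below assumes a dict-based LUT keyed by block configuration, a simplification of the real multi-dimensional table:

```python
import random

def net_latency(blocks, lut):
    """Eq. (5): estimated network latency is the sum of pre-measured
    block latencies looked up in the block-level LUT."""
    return sum(lut[b] for b in blocks)

def subspace_score(subspace_sampler, lut, delta=100, rng=random):
    """Score a sub-space by the average latency of `delta` randomly sampled
    networks; the sub-space with the lowest score is searched first."""
    samples = [subspace_sampler(rng) for _ in range(delta)]
    return sum(net_latency(net, lut) for net in samples) / delta
```

Because every block latency has already been measured during pre-scheduling, scoring a sub-space costs only dictionary lookups, no model execution.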

Evolutionary search. After finding the best sub-search space, we adopt an evolutionary (heuristic) search algorithm to find the target model efficiently. In the evolutionary search process, each DNN architecture is represented as a genome vector, denoted as \( arch_{i} = [block_{1},\dots , block_{b}]\in R_{v} \), where v is the length of the vector. In the evolution iterations, we randomly choose 15K sub-networks with different architectures and measure their accuracy on 10K input samples from the validation set. To accelerate the evolution process, the \( [arch_{i}, accuracy_{i}] \) pairs are used to train a multilayer perceptron-based accuracy predictor, which can quickly predict model accuracy from the architecture parameters. By iteratively mutating high-quality DNN architectures, we can generate new DNN architectures of potentially higher quality. In each iteration, the searcher evaluates the fitness of each DNN architecture candidate based on the outputs of the accuracy and latency predictive models, then selects the architectures with the highest fitness to breed the next generation using mutation and crossover operators. When a certain number of iterations is reached, or the constraints are satisfied, the searcher returns the DNN architecture from the evaluated set together with the dedicated optimal scheduling policy for the target hardware.
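The evolutionary loop can be sketched as follows; `fitness` stands in for the combined accuracy/latency predictors, and all names and hyperparameters are illustrative:

```python
import random

def crossover(parent_a, parent_b, rng):
    """Uniform crossover over the genome [block_1, ..., block_b]."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]

def mutate(arch, choices, rng, p=0.1):
    """Resample each block with probability p from the block choices."""
    return [rng.choice(choices) if rng.random() < p else b for b in arch]

def evolve(population, fitness, choices, rng, generations=10, parents=4):
    """Skeleton of the evolutionary loop: keep the fittest architectures,
    then breed the next generation by crossover + mutation."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[:parents]          # survivors of this generation
        children = []
        while len(children) < len(population) - parents:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, rng), choices, rng))
        population = elite + children
    return max(population, key=fitness)
```

Because the elite architectures are carried over unchanged, the best fitness found never decreases across generations.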


4 EVALUATION

4.1 Experiments Setup

To demonstrate the efficacy of our proposed method, we evaluate CHaNAS against four previous hardware-aware NAS works (Mnasnet [44], Fbnet [48], ProxylessNas [10], and MobileNet-v3 [24]) on the ImageNet2012 classification dataset. For a fair comparison, we run their source code end-to-end under the same settings as our experiments, with different random-initialization seeds, using the hyperparameters and commands released by the authors. As mentioned before, we first build the elastic super-net that can fit different deployment constraints, then optimize each possible block used in the super-net for the target hardware, and finally reduce the co-design search space and run the evolutionary search. We use five blocks to form the super-net, with each block having at most four MBConvs. Each MBConv has a kernel size within \( \lbrace 3,5,7\rbrace \) and a channel expansion ratio E within \( \lbrace 3,4,6\rbrace \). The input image size ranges within \( \lbrace 160,176,192,208,224\rbrace \), and the model expansion ratio ranges within \( \lbrace 1.0, 1.1, 1.2, 1.3\rbrace \). The standard SGD optimizer is used to train the super-net with Nesterov momentum 0.9 and weight decay \( 3e^{-5} \). The initial learning rate is 2.5, and the learning rate decays following a cosine schedule. We also use the knowledge distillation technique to progressively fine-tune the sub-networks. The whole training process takes around 4.5 days on 16 NVIDIA P100 GPUs with 32 GB memory each. For block compile optimization, we implement the optimization strategy in Python and use TVM [12] tools (version 0.7.dev) for code generation targeting various hardware platforms. Notably, the block optimization process is independent of the super-net training process, so the whole block-optimization process can be executed in parallel with training.
In general, the time overhead of the pre-scheduling process can be hidden by training, and the joint-space search can be achieved with reasonable overhead, comparable to prior NAS methods that are agnostic to the scheduling optimization space.

We apply CHaNAS to three different hardware platforms: an NVIDIA P100 GPU, an Intel Xeon 8163 CPU (2.50 GHz), and a Samsung Note 10 phone (Snapdragon SoC). For comparison, the performance of previous works on the GPU is measured in Pytorch 1.3 + CUDA 10.1; the NAS solutions on the CPU are evaluated in Pytorch 1.3; on the mobile phone, the networks are implemented in TF_Lite. Reported performance is averaged over 1,000 measurements of the workloads for more stable results.

4.2 Improvement from Scheduling-level Optimization

To show the benefit of our block-based pre-schedule optimization, we provide the baseline CHaNAS-W/O, i.e., the solutions extracted from the CHaNAS super-net without scheduling-level optimization; we measure the latency of these solutions in Pytorch on both GPU and CPU and in TF_Lite on the mobile phone. CHaNAS-W denotes the solutions that have gone through block-based scheduling-level optimization during the NAS process. Table 2 reports the comparison between CHaNAS-W, CHaNAS-W/O, and previous hardware-aware NAS methods on the Samsung Note 10 phone. Although CHaNAS-W has higher MACs than CHaNAS-W/O, it shows an obvious reduction in inference latency (16.4 ms vs. 27.5 ms), because the compiler-level co-design can dynamically adapt to different block structures and find the proper network schedules for the target hardware platform. In general, CHaNAS-W achieves high performance because it strikes a balance between inter-thread and intra-thread workload decomposition on the ARM CPU by exploring numerous scheduling strategies. Besides, comparing the DNN architectures obtained by CHaNAS-W and CHaNAS-W/O, the architecture found by CHaNAS-W is relatively more regular and thus more suitable for acceleration.

Table 2.
Model | ImageNet Top-1 (%) | Mobile Latency | MACs | Search Cost (GPU hours) | NN Training Cost (GPU hours) | Total Cost (GPU hours) | AWS Cost
Mnasnet | 74.0 | 34.4 ms | 317M | 4000N | – | 4000N | $12250N
Fbnet-C | 74.9 | 33.6 ms | 375M | 216N | 360N | 576N | $1764N
ProxylessNas-R | 74.6 | 35.7 ms | 320M | 200N | 300N | 500N | $1530N
MobileNet-v3(large) | 75.2 | 28.3 ms | 219M | 180N | – | 180N | $550N
CHaNAS-W/O | 76.5 | 27.5 ms | 224M | 40N | 1300 | 1300 + 40N | <$124N
CHaNAS-W | 76.6 | 16.4 ms | 240M | 50N | 1300 | 1300 + 50N | <$150N
  • CHaNAS-W/O indicates the solutions that are extracted from the CHaNAS super-net without scheduling-level re-optimization. “AWS Cost” is calculated based on the price of on-demand P3.16xlarge instances on AWS Cloud. N is the number of deployment scenarios we experimented with in the evaluation.

Table 2. Performance Comparisons on Samsung Note10


Even without pre-schedule optimization, CHaNAS-W/O achieves higher ImageNet top-1 accuracy than MobileNet-v3 [24] with similar MACs. Specifically, CHaNAS-W/O achieves 1.2×, 1.4×, and 1.5× speedup over MobileNet-v3 on GPU, CPU, and Mobile, respectively. This is attributed to the elastic super-net design: the sub-networks extracted from the CHaNAS super-net have both high flexibility and a medium granularity suitable for hardware-oriented search.

4.3 Improvement from the Adoption of Generalization Loss

Previous NAS methods use only the validation loss or the training loss to update the super-net weights; the network search controller then evaluates and selects network architectures using these sub-optimal parameters. ChaNAS instead uses the generalization loss to ensure that candidates are fairly evaluated during the search, which improves the final training results of the networks under evaluation. From Figure 11, we observe that the search with the generalization loss yields better models in terms of average accuracy and performance. We analyze the source of this improvement in Figure 11, which visualizes the three different loss objectives used in the super-network training stage together with the true validation loss during the search. Even when the validation loss is used to update architecture parameters, the candidate networks still suffer from the overfitting caused by weight sharing. This provides evidence that neural architecture search can benefit from the adoption of the generalization loss, which may point to a different optimization direction for hardware-aware NAS in co-design settings.

4.4 Benefits Gained by Co-design

To prove the effectiveness of the coordinated search method, Figure 10 shows the Pareto-optimal points found by different hardware-aware NAS methods on three hardware platforms. Through a joint search of network architecture and scheduling policy, CHaNAS-W achieves the highest performance across all cases with different performance requirements. CHaNAS-W achieves 78.4% ImageNet top-1 accuracy on the P100 GPU, 2.3% more accurate than MobileNet-v3, the previous best result reported by a hardware-aware NAS approach. Most importantly, CHaNAS-W runs 1.6×, 2.1×, and 2.2× faster on the P100 GPU; 1.9×, 2.2×, and 2.4× faster on the Intel Xeon 8163 CPU; and 1.7×, 2.7×, and 2.8× faster on the Note 10 phone than MobileNet-v3, Fbnet, and Mnasnet, respectively, while delivering similar accuracy. With the co-design approach, DNN solutions can be customized and optimized for the target hardware to achieve better performance.

Fig. 10.

Fig. 10. Neural network implementations provided by CHaNAS; the coordinated search achieves similar accuracy but significant speedups over neural network solutions with a fixed scheduling policy on GPU, CPU, and Mobile, respectively.

Fig. 11.

Fig. 11. The results for different objective functions are used as the metric to update super-network parameters. ChaNAS with \( \mathcal {L}_{\mathrm{gen}} \) significantly outperforms \( \mathcal {L}_{\mathrm{train}} \) -based approaches.

We also report the co-design cost of CHaNAS compared with previous hardware-aware NAS methods in Table 2, when developing candidate networks and schedules for N different application scenarios with different constraints; the search cost includes the pre-schedule optimization cost and the co-search overhead before network deployment. “AWS Cost” is calculated from the price of on-demand P3.16xlarge instances on the AWS Cloud. Most previous hardware-aware NAS methods need to re-design or re-train the candidate DNN model for a new hardware platform, or even for a change to the design constraint. For example, Mnasnet [44] needs 4,000 GPU hours (nearly $12,250 in AWS cost) for each new development scenario. Because the block-based design decouples the training and search processes, our super-net training is performed only once, and only a marginal search cost is needed for fast deployment in a new application case: CHaNAS needs only an additional 50 GPU hours ($150 AWS cost) for redeployment.

4.5 Analysis of Design Points Found by CHaNAS

We visualize some of the discovered architectures for the three platforms in Figure 12, using different colors and shapes to denote the kernel sizes and channel expansion ratios of the MBConvs. As shown in Figure 12, CHaNAS uses more 7 × 7 kernels in the network for the GPU than for the other platforms; we conjecture that CHaNAS finds more efficient GPU schedules for blocks with 7 × 7 kernels due to the large parallelism of the computation array, automatically finding solutions with higher arithmetic intensity to improve GPU utilization. On the Note 10 mobile phone, CHaNAS uses network models with smaller input sizes but more layers than on the Intel CPU, which implies that slimmer but deeper architectures are needed to maintain model accuracy given the limited on-chip memory of mobile phones. This is reasonable, since a deep-but-slim network architecture requires less on-chip memory than wider architectures.

Fig. 12.

Fig. 12. Visualization of the architectures found by CHaNAS for NVIDIA P100 GPU, Intel Xeon CPU, and Note 10 mobile.

We also analyze the low-level operator schedules found by the schedule search strategy. Taking Note 10 as an example, we select 18 different convolution layers from the search space and list their configurations in Table 3, then compare the latency of these 18 distinctive convolution layers as scheduled by CHaNAS-W and TF_Lite, respectively. The absolute performance is shown in Figure 13: CHaNAS-W achieves an average latency of 0.496 ms per layer, 40% lower than TF_Lite. To understand why ChaNAS achieves higher speed, we study the scheduling behavior generated by CHaNAS-W in detail. On the Note 10 phone, CHaNAS-W enables register blocking through multi-level tiling and vectorizes the inner-most loop, which are two critical factors for good schedules on this device. CHaNAS-W splits spatial and reduction loops according to the split factors found during exploration and applies the split and reorder primitives recursively to produce a series of small tiles, achieving multi-level tiling. After splitting, the loops become small enough to potentially fit in cache and exploit data locality. To exploit parallelism, CHaNAS-W dynamically fuses several outer loops into one outer-most loop and parallelizes it. The performance improvement shows that our scheduling-level search dynamically optimizes the code implementation, in contrast to previous fixed mapping strategies across various hardware.

Fig. 13.

Fig. 13. Latency results of CHaNAS-W compared to TF_lite on Note 10 mobile for different convolution layers.

Table 3.
Name | L1 | L2 | L3 | L4 | L5 | L6 | L7 | L8 | L9 | L10 | L11 | L12 | L13 | L14 | L15 | L16 | L17 | L18
input size | 224 | 224 | 112 | 56 | 28 | 14 | 224 | 224 | 112 | 56 | 28 | 14 | 224 | 224 | 112 | 56 | 28 | 14
in/out channel | 3/16 | 16/24 | 24/40 | 40/80 | 80/112 | 160/160 | 3/16 | 16/24 | 24/40 | 40/80 | 80/112 | 160/160 | 3/16 | 16/24 | 24/40 | 40/80 | 80/112 | 160/160
kernel size | 3 | 3 | 3 | 3 | 3 | 3 | 5 | 5 | 5 | 5 | 5 | 5 | 7 | 7 | 7 | 7 | 7 | 7
stride | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1

Table 3. Configurations of 18 Distinctive Convolution Layers


5 RELATED WORK

Hardware-aware NAS. Recent hardware-aware NAS methods [22, 28, 44] directly incorporate the hardware feedback into the architecture search loop to discover the neural networks that work best on specific hardware. For instance, Mnasnet [44] directly measures the inference latency by executing the model on mobile phones, allowing it to use performance measurement to guide the model search process. ProxylessNAS [10] directly learns the DNN architecture for the ImageNet dataset by proposing a gradient-based approach to train binarized parameters. FBNet [48] proposes a differentiable platform-aware NAS using Gumbel Softmax sampling. OFA [9] can quickly search within a pre-trained supernet for a sub-graph network that achieves the required accuracy and speed on hardware. By these means, efficient network architectures with improved inference efficiency on particular hardware can be obtained automatically. EDD [31] proposes a fully simultaneous, efficient differentiable DNN architecture and implementation co-search methodology. Targeting ASICs, Yang et al. [51] propose the NASAIC that can simultaneously identify multiple DNN architectures and the associated heterogeneous ASIC accelerators to meet both DNN model accuracy and hardware performance requirements. Accelerator-aware NAS [14, 58] explores the neural architecture and hardware accelerator co-design, by parameterizing neural architecture search and accelerator search in a unified joint search space. However, prior hardware-aware NAS methods are oblivious to the compiler-level design choices, which means they do not consider the optimization of various scheduling policies when evaluating network candidates in the search process. Consequently, they may fail to find the de-facto optimal network model that must be eventually scheduled and mapped onto the hardware via a compiler. Hence, we believe automated network-model/scheduling-policy co-search is an approach toward better co-design solutions.

High-performance DNN model scheduling. Scheduling DNN models on the target hardware is another critical factor for improving the efficiency of a DNN system and has attracted much attention from both academia and industry [19, 20]. The current best practice is still to develop schedule libraries for different hardware, which require manual tuning for each new DNN model. Most existing deep learning frameworks [4, 11, 27, 40] rely on these libraries to achieve high performance. For CPUs, MKL-DNN [2] uses JIT techniques to optimize convolutional neural networks on Intel Xeon CPUs, and SWIRL [46] generates high-quality vectorized and parallelized code for DNNs. For GPUs, cuBLAS [3] accelerates linear algebra kernels to extremely high performance, and cuDNN [17] accelerates deep learning applications by assembling a set of efficient algorithms on the GPU. TF-Lite Micro [4] focuses on accelerating DNN models on embedded hardware. To adapt to more hardware back-ends and improve the generality of deep learning frameworks, Intel nGraph [18] and Google XLA [1] act as bridges between deep learning frameworks and hardware back-ends: nGraph utilizes MKL-DNN to produce highly optimized implementations on CPUs and the PTX-emitting back-end of LLVM to generate assembly code for GPUs, while the XLA compiler acts as a back-end for TensorFlow. TVM [12] proposes an ahead-of-time compiler that supports multiple front-ends and hardware platforms. These compilers adopt high-level computing graphs and leverage fusion across layers based on predetermined rules. Besides, auto-scheduling algorithms have gradually attracted substantial attention and provide appealing productivity. Tensor Comprehension [45] adopts polyhedral IRs and employs a genetic search over affine transformation options (e.g., tile sizes, loop fusion, and scheduling strategies).
PolyMage [38] introduces fusion methods for loop nests and determines the rules of fusion and the range of tiling sizes to keep the auto-tuning space small. AutoTVM [12] utilizes high-level abstractions to represent the computing graph and includes a template-guided search algorithm for low-level tensor code generation. As a framework for generating tensor programs, Flextensor [56] attempts to reduce the human effort of writing templates by using more general templates that cover more operators. To expand the optimization scope of tensor scheduling, Ansor [55] explores a larger search space that covers more useful tensor-level program optimizations. However, none of these prior compiling or scheduling frameworks considers the joint search of both scheduling policy and network architecture.

DNN/Compiler Co-design. While DNN/accelerator co-design has been attracting growing research interest, DNN/compiler co-design remains largely underexplored. This may be partly because designers are inclined to treat compilers as well-developed tools that should not be touched. Some recent works address this issue and successfully demonstrate the practicality of DNN/compiler co-design: PCONV [35], PatDNN [39], and CoCoPIE [21] tackle model compression and compilation simultaneously. During model compression, they focus on structured pruning guided by pre-determined compiler-friendly patterns; during compilation, they propose efficient compiler code generation that enables the compilers to maximize or maintain both instruction-level and thread-level parallelism. MCUNet [32] is another framework that integrates model design and compiler optimization. It is composed of two components, TinyNAS and TinyEngine: TinyNAS searches for specialized DNNs, while TinyEngine generates specialized code to eliminate instruction and memory redundancy.

Compared to traditional methods that either optimize the neural network using neural architecture search with a fixed schedule strategy from a given deep learning library (e.g., Tensorflow [4] and Pytorch [40]) or tune the scheduling policy to maximize inference speed for a given network [12, 13], CHaNAS can better utilize the hardware resources through neural-network-architecture/scheduling-policy co-design.


6 CONCLUSION

We proposed an automated NAS framework, CHaNAS, to co-optimize the DNN architecture and the dedicated network scheduling policy for a specific machine learning task on the target hardware. It directly involves compiler optimization in the NAS loop, aiming to deliver higher-performance DNN solutions for the target hardware platforms. With the proposed hierarchical co-design search space and exploration method, both network architecture design and network scheduling can be conducted effectively on the basis of neural network blocks. We also introduced a new objective function for super-network training based on the generalization gap, which helps improve the accuracy of the candidate DNN models and outperforms the previously used training- or validation-loss objectives. In extensive experiments on the ImageNet2012 classification task across different hardware back-ends, CHaNAS generates better co-design solutions than the SOTA hardware-aware search method based on MobileNet-v3: the co-design solutions generated by CHaNAS achieve 1.6×, 1.9×, and 1.7× performance boosts on the NVIDIA P100 GPU, Intel Xeon CPU, and Samsung Note 10 Mobile, respectively, over baselines of the same-level accuracy. The approach transfers easily to other image classification tasks. We believe a larger architecture/schedule co-design search space and effective exploration strategies could further benefit hardware-aware NAS methods.

REFERENCES

[1] 2020. Google XLA. Retrieved from https://www.tensorflow.org/xla.
[2] 2020. Intel MKL-DNN. Retrieved from https://github.com/intel/mkl-dnn.
[3] 2020. NVIDIA cuBLAS Library. Retrieved from https://www.nvidia.com/.
[4] Abadi Martín, Barham Paul, Chen Jianmin, Chen Zhifeng, Davis Andy, Dean Jeffrey, Devin Matthieu, Ghemawat Sanjay, Irving Geoffrey, Isard Michael, et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16). 265–283.
[5] Abdelfattah Mohamed S., Dudziak Łukasz, Chau Thomas, Lee Royson, Kim Hyeji, and Lane Nicholas D. 2020. Best of both worlds: AutoML co-design of a CNN and its hardware accelerator. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC'20). 1–6.
[6] Ankit Aayush, Sengupta Abhronil, and Roy Kaushik. 2018. Neuromorphic computing across the stack: Devices, circuits and architectures. In Proceedings of the IEEE International Workshop on Signal Processing Systems (SiPS'18). IEEE, 1–6.
[7] Bacis Marco, Natale Giuseppe, Sozzo Emanuele Del, and Santambrogio Marco Domenico. 2017. A pipelined and scalable dataflow implementation of convolutional neural networks on FPGA. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW'17). IEEE, 90–97.
[8] Bender Gabriel, Kindermans Pieter-Jan, Zoph Barret, Vasudevan Vijay, and Le Quoc. 2018. Understanding and simplifying one-shot architecture search. In Proceedings of the International Conference on Machine Learning (ICML'18). 550–559.
[9] Cai Han, Gan Chuang, Wang Tianzhe, Zhang Zhekai, and Han Song. 2020. Once-for-all: Train one network and specialize it for efficient deployment. In Proceedings of the International Conference on Learning Representations (ICLR'20).
[10] Cai Han, Zhu Ligeng, and Han Song. 2019. ProxylessNAS: Direct neural architecture search on target task and hardware. In Proceedings of the International Conference on Learning Representations (ICLR'19).
[11] Chen Tianqi, Li Mu, Li Yutian, Lin Min, Wang Naiyan, Wang Minjie, Xiao Tianjun, Xu Bing, Zhang Chiyuan, and Zhang Zheng. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274. Retrieved from http://arxiv.org/abs/1512.01274.
[12] Chen Tianqi, Moreau Thierry, Jiang Ziheng, Zheng Lianmin, Yan Eddie, Shen Haichen, Cowan Meghan, Wang Leyuan, Hu Yuwei, Ceze Luis, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI'18). 578–594.
[13] Chen Tianqi, Zheng Lianmin, Yan Eddie, Jiang Ziheng, Moreau Thierry, Ceze Luis, Guestrin Carlos, and Krishnamurthy Arvind. 2018. Learning to optimize tensor programs. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS'18). 3389–3400.
[14] Chen Weiwei, Wang Ying, Yang Shuang, Liu Chen, and Zhang Lei. 2020. You only search once: A fast automation framework for single-stage DNN/Accelerator co-design. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'20). 1283–1286.
[15] Chen Yu-Hsin, Emer Joel, and Sze Vivienne. 2017. Using dataflow to optimize energy efficiency of deep neural network accelerators. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'17). 12–21.
[16] Chen Yu-Hsin, Yang Tien-Ju, Emer Joel, and Sze Vivienne. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Select. Top. Circ. Syst. 9, 2 (2019), 292–308.
[17] Chetlur Sharan, Woolley Cliff, Vandermersch Philippe, Cohen Jonathan, Tran John, Catanzaro Bryan, and Shelhamer Evan. 2014. cuDNN: Efficient primitives for deep learning. arXiv:1410.0759. Retrieved from http://arxiv.org/abs/1410.0759.
[18] Cyphers Scott, Bansal Arjun K., Bhiwandiwalla Anahita, Bobba Jayaram, Brookhart Matthew, Chakraborty Avijit, Constable Will, Convey Christian, Cook Leona, Kanawi Omar, et al. 2018. Intel nGraph: An intermediate representation, compiler, and executor for deep learning. arXiv:1801.08058. Retrieved from http://arxiv.org/abs/1801.08058.
[19] Das Anup, Kumar Akash, and Veeravalli Bharadwaj. 2014. Energy-aware task mapping and scheduling for reliable embedded computing systems. ACM Trans. Embed. Comput. Syst. 13, 2s (2014), 1–27.
[20] Das Anup, Kumar Akash, and Veeravalli Bharadwaj. 2015. Reliability and energy-aware mapping and scheduling of multimedia applications on multiprocessor systems. IEEE Trans. Parallel Distrib. Syst. 27, 3 (2015), 869–884.
[21] Guan Hui, Liu Shaoshan, Ma Xiaolong, Niu Wei, Ren Bin, Shen Xipeng, Wang Yanzhi, and Zhao Pu. 2021. CoCoPIE: Enabling real-time AI on off-the-shelf mobile devices via compression-compilation co-design. Commun. ACM 64, 6 (2021), 62–68.
[22] Gupta Suyog and Akin Berkin. 2020. Accelerator-aware neural network design using AutoML. arXiv:2003.02838. Retrieved from https://arxiv.org/abs/2003.02838.
[23] Hao Cong, Zhang Xiaofan, Li Yuhong, Huang Sitao, Xiong Jinjun, Rupnow Kyle, Hwu Wen-mei, and Chen Deming. 2019. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. In Proceedings of the 56th ACM/IEEE Design Automation Conference (DAC'19).
[24] Howard Andrew, Sandler Mark, Chu Grace, Chen Liang-Chieh, Chen Bo, Tan Mingxing, Wang Weijun, Zhu Yukun, Pang Ruoming, Vasudevan Vijay, et al. 2019. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV'19). 1314–1324.
[25] Howard Andrew G., Zhu Menglong, Chen Bo, Kalenichenko Dmitry, Wang Weijun, Weyand Tobias, Andreetto Marco, and Adam Hartwig. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Retrieved from http://arxiv.org/abs/1704.04861.
[26] Iandola Forrest N., Han Song, Moskewicz Matthew W., Ashraf Khalid, Dally William J., and Keutzer Kurt. 2016. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv:1602.07360. Retrieved from http://arxiv.org/abs/1602.07360.
[27] Jia Yangqing, Shelhamer Evan, Donahue Jeff, Karayev Sergey, Long Jonathan, Girshick Ross, Guadarrama Sergio, and Darrell Trevor. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. 675–678.
[28] Jiang Weiwen, Zhang Xinyi, Sha Edwin H.-M., Yang Lei, Zhuge Qingfeng, Shi Yiyu, and Hu Jingtong. 2019. Accuracy vs. efficiency: Achieving both through FPGA-implementation aware neural architecture search. In Proceedings of the 56th ACM/IEEE Design Automation Conference (DAC'19). 1–6.
[29] Kwon Hyoukjun, Chatarasi Prasanth, Pellauer Michael, Parashar Angshuman, Sarkar Vivek, and Krishna Tushar. 2019. Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'19). 754–768.
[30] Kwon Kiseok, Amid Alon, Gholami Amir, Wu Bichen, Asanovic Krste, and Keutzer Kurt. 2018. Co-design of deep neural nets and neural net accelerators for embedded vision applications. In Proceedings of the 55th ACM/IEEE Design Automation Conference (DAC'18). 1–6.
[31] Li Yuhong, Hao Cong, Zhang Xiaofan, Liu Xinheng, Chen Yao, Xiong Jinjun, Hwu Wen-mei, and Chen Deming. 2020. EDD: Efficient differentiable DNN architecture and implementation co-search for embedded AI solutions. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC'20). IEEE, 1–6.
[32] Lin Ji, Chen Wei-Ming, Lin Yujun, Cohn John, Gan Chuang, and Han Song. 2020. MCUNet: Tiny deep learning on IoT devices. arXiv:2007.10319. Retrieved from https://arxiv.org/abs/2007.10319.
[33] Liu Hanxiao, Simonyan Karen, and Yang Yiming. 2019. DARTS: Differentiable architecture search. In Proceedings of the International Conference on Learning Representations (ICLR'19).
[34] Lu Qing, Jiang Weiwen, Xu Xiaowei, Shi Yiyu, and Hu Jingtong. 2019. On neural architecture search for resource-constrained hardware platforms. arXiv:1911.00105. Retrieved from http://arxiv.org/abs/1911.00105.
[35] Ma Xiaolong, Guo Fu-Ming, Niu Wei, Lin Xue, Tang Jian, Ma Kaisheng, Ren Bin, and Wang Yanzhi. 2020. PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5117–5124.
[36] Ma Yufei, Cao Yu, Vrudhula Sarma, and Seo Jae-sun. 2017. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'17). 45–54.
[37] Marculescu Diana, Stamoulis Dimitrios, and Cai Ermao. 2018. Hardware-aware machine learning: Modeling and optimization. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'18).
[38] Mullapudi Ravi Teja, Vasista Vinay, and Bondhugula Uday. 2015. PolyMage: Automatic optimization for image processing pipelines. ACM SIGARCH Comput. Arch. News 43, 1 (2015), 429–443.
[39] Niu Wei, Ma Xiaolong, Lin Sheng, Wang Shihao, Qian Xuehai, Lin Xue, Wang Yanzhi, and Ren Bin. 2020. PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'20). 907–922.
[40] Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703. Retrieved from https://arxiv.org/abs/1912.01703.
[41] Pham Hieu, Guan Melody, Zoph Barret, Le Quoc, and Dean Jeff. 2018. Efficient neural architecture search via parameter sharing. In Proceedings of the International Conference on Machine Learning (ICML'18). PMLR, 4095–4104. Retrieved from https://arxiv.org/abs/1802.03268.
[42] Radosavovic Ilija, Kosaraju Raj Prateek, Girshick Ross, He Kaiming, and Dollár Piotr. 2020. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'20). 10428–10436.
[43] Ragan-Kelley Jonathan, Barnes Connelly, Adams Andrew, Paris Sylvain, Durand Frédo, and Amarasinghe Saman. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Not. 48, 6 (2013), 519–530.
[44] Tan Mingxing, Chen Bo, Pang Ruoming, Vasudevan Vijay, Sandler Mark, Howard Andrew, and Le Quoc V. 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'19). 2820–2828.
[45] Vasilache Nicolas, Zinenko Oleksandr, Theodoridis Theodoros, Goyal Priya, DeVito Zachary, Moses William S., Verdoolaege Sven, Adams Andrew, and Cohen Albert. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv:1802.04730. Retrieved from http://arxiv.org/abs/1802.04730.
[46] Venkat Anand, Rusira Tharindu, Barik Raj, Hall Mary, and Truong Leonard. 2019. SWIRL: High-performance many-core CPU code generation for deep neural networks. Int. J. High Perf. Comput. Appl. 33, 6 (2019), 1275–1289.
[47] Wang Ying, Xu Jie, Han Yinhe, Li Huawei, and Li Xiaowei. 2016. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In Proceedings of the 53rd ACM/IEEE Design Automation Conference (DAC'16). 1–6.
[48] Wu Bichen, Dai Xiaoliang, Zhang Peizhao, Wang Yanghan, Sun Fei, Wu Yiming, Tian Yuandong, Vajda Peter, Jia Yangqing, and Keutzer Kurt. 2019. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'19). 10734–10742.
[49] Xiao Qingcheng, Liang Yun, Lu Liqiang, Yan Shengen, and Tai Yu-Wing. 2017. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs. In Proceedings of the 54th Annual Design Automation Conference (DAC'17). 1–6.
[50] Xu Yuhui, Xie Lingxi, Zhang Xiaopeng, Chen Xin, Shi Bowen, Tian Qi, and Xiong Hongkai. 2020. Latency-aware differentiable neural architecture search. arXiv:2001.06392. Retrieved from https://arxiv.org/abs/2001.06392.
[51] Yang Lei, Yan Zheyu, Li Meng, Kwon Hyoukjun, Lai Liangzhen, Krishna Tushar, Chandra Vikas, Jiang Weiwen, and Shi Yiyu. 2020. Co-exploration of neural architectures and heterogeneous ASIC accelerator designs targeting multiple tasks. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC'20). IEEE, 1–6.
[52] Yang Yifan, Huang Qijing, Wu Bichen, Zhang Tianjun, Ma Liang, Gambardella Giulio, Blott Michaela, Lavagno Luciano, Vissers Kees, Wawrzynek John, et al. 2019. Synetgy: Algorithm-hardware co-design for ConvNet accelerators on embedded FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'19).
[53] Wang Ying, Xu Jie, Han Yinhe, Li Huawei, and Li Xiaowei. 2016. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In Proceedings of the 53rd ACM/IEEE Design Automation Conference (DAC'16).
[54] Zhang Xiangyu, Zhou Xinyu, Lin Mengxiao, and Sun Jian. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'18). 6848–6856.
[55] Zheng Lianmin, Jia Chengfan, Sun Minmin, Wu Zhao, Yu Cody Hao, Haj-Ali Ameer, Wang Yida, Yang Jun, Zhuo Danyang, Sen Koushik, et al. 2020. Ansor: Generating high-performance tensor programs for deep learning. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI'20). 863–879. Retrieved from https://www.usenix.org/conference/osdi20/presentation/zheng.
[56] Zheng Size, Liang Yun, Wang Shuo, Chen Renze, and Sheng Kaiwen. 2020. FlexTensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'20). 859–873.
[57] Zhou Hongpeng, Yang Minghao, Wang Jun, and Pan Wei. 2019. BayesNAS: A Bayesian approach for neural architecture search. In Proceedings of the 36th International Conference on Machine Learning (ICML'19), Vol. 97. 7603–7613. Retrieved from http://proceedings.mlr.press/v97/zhou19e.html.
[58] Zhou Yanqi, Dong Xuanyi, Akin Berkin, Tan Mingxing, Peng Daiyi, Meng Tianjian, Yazdanbakhsh Amir, Huang Da, Narayanaswami Ravi, and Laudon James. 2021. Rethinking co-design of neural architectures and hardware accelerators. arXiv:2102.08619. Retrieved from https://arxiv.org/abs/2102.08619.
[59] Zoph Barret and Le Quoc V. 2017. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR'17).

• Published in: ACM Transactions on Embedded Computing Systems, Volume 22, Issue 1 (January 2023), 512 pages
• ISSN: 1539-9087; EISSN: 1558-3465
• DOI: 10.1145/3567467
• Editor: Tulika Mitra

• Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 13 October 2021
• Revised: 20 February 2022
• Accepted: 21 April 2022
• Online AM: 10 May 2022
• Published: 29 October 2022

Published in TECS Volume 22, Issue 1
