Cocco: Hardware-Mapping Co-Exploration towards Memory Capacity-Communication Optimization



Abstract
Memory is a critical design consideration in current data-intensive DNN accelerators, as it profoundly determines energy consumption, bandwidth requirements, and area costs. As DNN structures become more complex, a larger on-chip memory capacity is required to reduce data movement overhead, but at the expense of silicon costs. Some previous works have proposed memory-oriented optimizations, such as different data reuse and layer fusion schemes. However, these methods are not general and potent enough to cope with various graph structures.
In this paper, we explore the intrinsic connection between network structures and memory features to optimize both hardware and mapping. First, we introduce a graph-level execution scheme with a corresponding dataflow and memory management method. This scheme enables the execution of arbitrary graph patterns with high data reuse and low hardware overhead. Subsequently, we propose Cocco, a hardware-mapping co-exploration framework leveraging graph-level features of networks. It aims to minimize communication overhead, such as energy consumption and bandwidth requirements, with a smaller memory capacity. We formulate the graph-partition scheduling and memory configuration search as an optimization problem and employ a genetic-based method to achieve efficient co-exploration for large and irregular networks. Experiments demonstrate that Cocco obtains lower external memory access, lower bandwidth requirements, and more stable optimization for graph partition compared to the greedy algorithm and dynamic programming introduced in prior works. Cocco also reduces the costs by 1.89% to 50.33% using co-exploration compared to other typical methods.

Introduction
The evolution of neural network topology has driven the remarkable progress of artificial intelligence from the early single-layer perceptron (SLP) [45,54] and multi-layer perceptron (MLP) [17,22,39] to modern DNNs with plain [36,57]/inception [59]/residual [20,55] structures based on manual design, and even irregular structures using neural architecture search (NAS) [53,75] or random network generation [68]. These technological innovations have resulted in increasingly complex computation graphs, which pose challenges for efficient memory design and deployment.
Memory design is crucial in the accelerator system, as it performs data preparation at the start of each processing stage according to the scheduling scheme, determining energy consumption, bandwidth requirements, and area costs. Figure 1 shows the trade-off between the on-chip memory size and the external memory access in DNN accelerators. A smaller on-chip buffer (left side) saves area but requires more data reloading. A larger buffer (right side) can reduce external memory access and save energy and bandwidth, but at the cost of increasing the memory overhead. An excessively large SRAM may not be feasible due to the high silicon area cost, typically ranging from 1 to 2 mm²/MB in 12nm, and the high energy overhead, dozens of times that of a MAC operation for a large SRAM.
Therefore, the key problem is: between the two extremes in Figure 1, how can we find an appropriate memory configuration with efficient workload mapping and data management, especially under the growing complexity of neural network architectures?

Figure 1. The effect of different memory capacities for a computation graph. Intermediate results can be buffered in the on-chip memory if it is large enough. An on-chip memory of small capacity can only buffer two nodes (marked in the red dotted box), while a larger memory can cover a larger subgraph (right side).
The critical status of memory design has attracted extensive research. Most previous studies focus on simple layer-level optimization (the left one of Figure 1) by applying loop transformation techniques such as tiling and reordering to fit the memory size and reuse the on-chip data [23,43,44,61,70]. In addition, several works also guide the memory capacity and hierarchy design using design-space exploration [12,32,37,66,67]. However, these layer-level optimizations are confined to the limited intra-layer reuse, which is insufficient for memory-intensive networks. A subgraph-level scheme (e.g., the middle one and the right one of Figure 1) provides a larger optimization space via inter-layer reuse [3,4,38,73] to reduce the I/O overhead. Therefore, this paper aims to leverage the subgraph-level computing flow to optimize the memory capacity and external communication for networks with any topology.
However, there are three primary challenges to fully exploit the subgraph-level optimization.
First, we need a general execution flow for any subgraph. Due to the various kernel sizes and strides, a parent node in a subgraph may have unbalanced data requirements from its consumers, which makes it difficult to determine the tensor tiling scheme and the memory allocation for each node (layer). In the traditional single-layer execution, we usually divide a large tensor into loop tiles, which are processed through a series of regular computing steps. Similarly, we want the subgraph execution to be a series of elementary computing steps with a simple control flow.
Second, we require a suitable memory management method for the subgraph execution. Due to the complicated dependencies among nodes in a subgraph, careful management is needed to reuse overlapping and inter-layer intermediate data.
Solving these two challenges contributes to a basic hardware execution model compatible with subgraph-level optimization. However, we also encounter the third challenge: how to partition a model into subgraphs and how much memory to allocate. The optimization space is huge, so we need to devise a search method with high sampling efficiency to find a proper subgraph partition and memory configuration.
In this paper, we first introduce a complete graph-level scheme for memory. In particular, it contains a consumption-centric flow that enables the execution of arbitrary subgraphs with low memory footprints (for challenge 1). Accordingly, we provide an explicit memory dataflow and the corresponding memory management scheme for effective data reuse (for challenge 2). Building on the graph-level memory scheme, we propose Cocco, a hardware-mapping co-exploration framework, to establish a connection between model features and the memory configuration (for challenge 3).
Cocco aims to find a combination of on-chip buffers and the corresponding graph-level scheduling for lower memory and communication overhead.In particular, we develop a genetic-based algorithm to efficiently explore the search space of graph partitions and the associated memory configuration for a series of neural networks.
In summary, this work makes the following contributions:
• Subgraph execution scheme. We first introduce a consumption-centric flow to determine a low-cost execution sequence by throttling and aligning the dataflow.
• Efficient dataflow and memory management for subgraph data reuse. We propose a memory management scheme featuring multiple reconfigurable regions and the corresponding dataflow to support arbitrary subgraph execution with full data reuse.
• Hardware-mapping co-exploration framework. Based on the subgraph execution scheme and memory dataflow, we propose Cocco, a genetic-based framework combining the graph-level partition and memory design-space exploration. Cocco achieves 1.89% to 50.33% lower costs (lower communication with a smaller size) using co-exploration in contrast to other methods.
2 Background and Motivation

Design of Neural Network Accelerators
The DNN accelerator unit is the most basic execution unit in a computing system; on top of it, we can scale out to many-core, many-socket, and many-drawer systems [24,40,48,60]. An accelerator unit usually employs a processing element (PE) array on a sophisticated interconnection network to enable efficient tensor-level computation. A global buffer is located next to the PE array to serve as the data interface and manage data between the PE array and the external memory (e.g., DRAM or other cores). Due to the limited capacity of the global buffer, the compiler has to partition the network execution into a series of elementary workloads that are scheduled along the parallel spatial resources and the temporal dimension [18,61,72]. The capacity of the global buffer usually dominates the external memory access and bandwidth requirements, significantly impacting system performance. A larger global buffer is more likely to hold more intermediate data and avoid data being evicted to DRAM. As shown in Figure 1, a larger buffer expands the scope of elementary workloads from a single layer to a larger subgraph, reducing the communication overhead. However, choosing an appropriate memory specification is always a challenge. In Figure 2, we survey 16 popular industrial neural network processors with various memory/performance/area characteristics, where nine of them target the training domain [6,11,24,34,35,40,41,48,60,63,69] and seven target model inference [1,7,8,26-28,49,65]. According to the survey, we observe several trends:
1. Memory occupies a significant portion of the silicon footprint on an NPU chip, ranging from 4% to 79% of the area, with capacities from 2.5MB to 896MB.
2. Figure 2 (left) shows a trend of diminishing marginal benefit of memory capacity. This is because a critical capacity meets the data reuse and bandwidth requirements at the beginning, and the increments become negligible at higher memory capacities.
3. We can infer that there is a saturated capacity equivalent to ideal unlimited memory, especially for inference designs. For example, Hanguang [26] is a special SRAM-only inference system without DDR, and its 394MB of buffers is large enough to hold the intermediate data in their scenarios.
This survey implies a design trade-off between memory capacity and performance based on workloads and commercial considerations. Motivated by the observations above, this paper aims to provide several memory design considerations and study the connection between workload features and memory capacity in an NPU accelerator.

Workload Deployment
A neural network is usually executed in a DNN accelerator with layer or graph granularities based on the buffer capacity and dataflow.

2.2.1 Layer-level Assignment. This manner assigns tasks layer by layer. Most previous studies employ a tiling-based layer-wise execution manner [10,21,30,37,50,61], which elaborates the tiling sizes of tensors to fit in the accelerator buffers and maintain performance. A proper tiling scheme should overlap the data loading latency with the computing time of each tile and try to reduce the repeated access of local weight buffers. Tiles of data are transferred between the external memory and the global buffer, and PEs subsequently fetch data from the global buffer to their local buffers. Given the larger bit-width of partial sums (e.g., 24-bit partial sums vs. 8-bit inputs in Simba), the output-centric tiling scheme is more commonly used to calculate the final results before writing back to the global buffer [61].
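To make the capacity constraint concrete, the sketch below finds the largest square output tile whose 8-bit input and output activation footprints fit a given buffer at once. This is our own simplification: it ignores weights, partial sums, and latency overlap, which a real tiling search must also consider.

```python
def largest_output_tile(buf_bytes, in_ch, out_ch, kernel, stride):
    """Largest t such that a t x t output tile and its corresponding
    input tile (edge F + (t-1)*s), both 8-bit, fit in buf_bytes together."""
    t = 0
    while True:
        nxt = t + 1
        in_edge = kernel + (nxt - 1) * stride  # receptive field of nxt outputs
        footprint = in_edge * in_edge * in_ch + nxt * nxt * out_ch
        if footprint > buf_bytes:
            return t
        t = nxt

# a 3x3 stride-1 layer with 64 input / 64 output channels and a 1MB buffer
tile = largest_output_tile(1 << 20, 64, 64, 3, 1)
```

For these numbers the search saturates at an 89 × 89 output tile; doubling the channel count roughly halves the tile area, which is why deeper layers tend to use smaller tiles.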

Graph-level Assignment.
Unlike the layer-level assignment, which refrains from leveraging inter-layer reuse, a graph-level assignment processes several layers of a neural network as a whole. To demonstrate the effectiveness of the graph-level assignment, we evaluate four networks on a 2TOPS accelerator model, as shown in Figure 3. The results show that fusing layers into subgraphs significantly reduces external memory access by 42.3% ∼ 74.7% and average bandwidth requirements by 26.8% ∼ 67.8%. However, the improvements from larger subgraphs are marginal, indicating that there is an optimal trade-off between inter-layer reuse and subgraph size, which determines the memory requirement. For example, executing three-layer subgraphs reduces external memory access by 53.7% in ResNet50, while executing five-layer subgraphs only further reduces it by 13.6%.
Several works have studied inter-layer reuse and graph partition. However, they have several limitations in terms of performance and flexibility. LCP [42] groups similar layers into a cluster and executes them as a whole, which makes it challenging to generalize to an arbitrary graph. Fused-CNN [4] and SR-CNN [38] fuse large contiguous layers for plain networks using manually designed strategies. Irregular-NN [73] attempts to execute a complex subgraph using a DP-based algorithm, but the constrained search space limits the exploration.
To overcome these challenges, we propose an end-to-end framework that automatically optimizes the graph partition and memory configuration for any neural network. Our framework consists of two main components: a graph-level dataflow and a hardware-mapping co-exploration algorithm. We first introduce the graph-level dataflow and its hardware implementation. Then, we present Cocco, an efficient algorithm that explores the trade-offs among memory configurations and graph partition schemes based on workload features.

The Proposed Graph-Level Scheme
To execute layers on an NPU core in a graph-level manner, we need an effective approach to reuse intermediate data and decide the memory allocation. This section presents our comprehensive scheme for subgraph execution, which addresses the first two challenges mentioned in Section 1. First, we describe a multi-layer execution flow that minimizes the memory footprint through a hardware-friendly tiling approach (for challenge 1). Second, we explain how to implement this flow on a real NPU using an efficient data reuse pattern (for challenge 2). The consistent target is to reduce the memory footprint while remaining friendly to implementation.

Subgraph execution scheme
It is common practice for layer-level scheduling to partition the output tensor into several tiles as layer-level elementary operations [56,61,72,74], simplifying the scheduling and instruction generation. Likewise, our high-level idea is also to generate a series of explicit subgraph-level elementary operations. However, we need to address the challenges of various kernel sizes and strides in different paths to prevent unbalanced data production and unnecessary memory usage.
A model's subgraph consists of multiple layers (nodes) with dependencies. Section 4 provides detailed information on subgraph partition. In Figure 4(a), we present a straightforward production-centric scheme for executing a subgraph with different kernel sizes in two branches, deriving the tile sizes of subsequent layers from the predetermined input tile sizes. For example, we can produce a 1 × 1 tile of Node(0) and a 2 × 2 tile of Node(2) with a given 5 × 5 feature map of input Node(-1). In this case, these intermediate results only reduce to 1 × 1 in Node(3), limited by the smallest input of Node(0), so the remaining results of Node(2) cannot be consumed immediately. As shown in Figure 4, three extra data elements of Node(2), along with sixteen extra source data elements of Node(1), take up extra memory space. More redundant data are cached when the subgraph becomes larger and more complicated. The disadvantages of this manner stem from the production-centric idea that consumes all related activations from the producers at once.
To avoid the memory overhead of storing unused data, we propose a consumption-centric scheme in Figure 4(b), where the results of each node are produced on demand based on its consumer(s) (i.e., output node(s)). For example, given a 1 × 1 tile of Node(3), we derive the 1 × 1 tile size for Node(2), which subsequently decides a 3 × 3 tile for Node(1).
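This backward derivation follows the receptive-field relation F + (t − 1) × s for each producer. A minimal sketch (the 1 × 1 and 3 × 3 kernel sizes and unit strides below are our assumptions chosen to match the numbers in this example):

```python
def required_input_tile(out_tile, kernel, stride):
    # receptive field of out_tile contiguous outputs: F + (t - 1) * s
    return kernel + (out_tile - 1) * stride

# A 1x1 output tile of Node(3) (assumed 1x1 conv) needs a 1x1 tile of Node(2)
tile_node2 = required_input_tile(1, kernel=1, stride=1)
# That 1x1 tile of Node(2) (assumed 3x3 conv) needs a 3x3 tile of Node(1)
tile_node1 = required_input_tile(tile_node2, kernel=3, stride=1)
```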
The backward derivation for each producer node is non-trivial because of the diverse kernel sizes and strides in different paths. Therefore, we propose a three-stage flow to determine the behavior of each node, as illustrated in Figure 5. The high-level idea is to let output nodes drive the whole execution and match the data consumption and production in each subgraph-level elementary operation.
Stage 1 is similar to traditional single-layer scheduling, where the tile size is optimized for higher computation utilization. In order to hold a larger subgraph, the tile size tends to be smaller. In the 1D-CONV example, we set the tile size to 2 for the output nodes.

Figure 5. The flow to determine the execution scheme of a subgraph (i.e., the computed tile size of each node, the tile offset, and the processing sequence of nodes). Variables of node u: F^(u) and s^(u) are the kernel size and stride, x^(u) is the tile size, and Δ^(u) is the offset of adjacent tiles. For simplicity, we discuss 1D-CONV in this example; the 2D-CONV case is similar.
Stage 2 aims to determine the data update offset Δ^(u) and the memory allocation size for each node based on its consumer(s), processing in reverse topological order. We use the least common multiple (LCM) operation to determine Δ^(v) of a producer v, aligning the different input offset requirements (Δ^(u) s^(u)) from its consumers u. Hence, one producer update may correspond to multiple updates of a consumer. For example, Δ^(-2) = lcm{Δ^(0) s^(0), Δ^(1) s^(1)} = 4 = 2 Δ^(1) s^(1), so one update of Node(-2) corresponds to two updates of Node(1). As for the tile size deduction, F^(u)(Δ^(v)/s^(u)) derives the required input tile size x^(v,u) for output node u,¹ where Δ^(v)/s^(u) is the consumer offset (updated data) per update of producer v. The maximum x^(v,u) over all outputs u is the tile size x^(v) of input node v.
As mentioned above, since we use the LCM to align production and consumption, one producer update may correspond to multiple updates of a consumer. In Stage 3, we use upd_num to represent the number of memory updates per subgraph elementary operation. The generated result of the example in Figure 5 is shown in Figure 6. The upd_num of Node(-1), Node(1), and Node(2) is two, where the second updates are highlighted in red boxes. Note that the {upd_num^(-2), ..., upd_num^(2)} solution is not unique, but the unique co-prime one, {1, 2, 1, 2, 2}, corresponds to the minimal elementary operation.
¹For example, assume node u is a convolution layer with kernel size F^(u) and stride s^(u); then F^(u)(δ) = F^(u) + (δ − 1) × s^(u).
The proposed flow is based on a general directed acyclic computation graph and is not limited to specific layer features. In this way, we can determine the execution scheme for any complex irregular network such as NasNet [75] and RandWire [68].
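The LCM alignment of Stage 2 can be sketched as follows. The demand values 4 and 2 mirror the Node(-2) example above; since the per-node kernels and strides of Figure 5 are not reproduced here, the concrete inputs are our assumptions.

```python
from math import lcm

def align_producer_offset(consumer_demands):
    # Stage 2: the producer's update offset is the LCM of its consumers'
    # input offset demands (each demand = consumer offset x consumer stride)
    return lcm(*consumer_demands)

# Node(-2): consumer demands assumed to be 4 (via Node(0)) and 2 (via Node(1))
delta = align_producer_offset([4, 2])
# one update of Node(-2) then corresponds to delta / demand updates of Node(1)
node1_updates = delta // 2
```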

Memory Management for the subgraph execution
Up to now, we have inferred the execution scheme for subgraphs; the remaining challenge is how to implement it efficiently on hardware. Figure 7 shows the memory allocation and update scheme for the subgraph execution. Before computing a subgraph, the compiler determines logical blocks for input, intermediate, and output nodes, where the block sizes depend on the tile sizes derived from the execution flow.
For convenient management, we introduce two types of memory regions: MAIN and SIDE. The MAIN region stores the source data for the PEs (i.e., the current input tile in Figure 7). The SIDE region reserves the horizontally overlapping data. Considering that some output nodes have no reuse requirement, we only need a MAIN region to buffer the results of the current tile. Except for the input nodes (negative numbers), which load data from DRAM, the other nodes update data locally based on the computed results of their input node(s).
In detail, the update scheme leverages the collaboration between the MAIN region and the SIDE region to achieve full reuse across sliding tiles (we consider kernel size > stride). As shown in Figure 7, when the convolution windows slide across the feature maps, the vertically overlapping data (e.g., column 5) are reused locally in the MAIN region. In contrast, the horizontally overlapping data (e.g., the first row of columns 6 ∼ 8) are loaded from the SIDE region (path ①). Only a subset of the data is replaced by the newly calculated results (marked in green). Besides, the bottom horizontal slices write new data to the SIDE region for the next row loop (path ②).
The extra hardware overhead for the proposed memory scheme is slight. Figure 8 presents our 12nm NPU core for the subgraph processing, with a buffer region manager to logically partition the global buffer to support contiguous layer processing. The buffer region manager is a 2N-depth register file, where N determines the maximum subgraph size, and each entry pair indicates the start and end addresses of each region. The area overhead is quite small; in our test chip, the area ratio is only 0.18% with N = 64 and a 272-byte size (17-bit addresses for the 1MB 64-bit-wide global buffer).
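The 272-byte figure is consistent with these parameters, as a quick back-of-the-envelope check shows:

```python
# Buffer region manager footprint, using the numbers quoted above
N = 64                                 # maximum subgraph size (nodes/regions)
words = (1 << 20) // 8                 # 1MB global buffer, 64-bit words
addr_bits = (words - 1).bit_length()   # 2^17 words -> 17-bit addresses
rf_bytes = 2 * N * addr_bits // 8      # a start and an end address per region
```

With N = 64 and 17-bit addresses this yields exactly 272 bytes, matching the reported register-file size.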
In summary, our high-level idea is to divide the buffer into logical blocks for different layers and reuse data across sliding convolution windows. The memory management approach is compatible with any accelerator as long as it supports data movement inside the on-chip memory and flexible data assignment for computing. Coupled with our subgraph execution scheme introduced before, intermediate outputs in the subgraph avoid being recomputed. Only those layers required by other subgraphs are written back to DRAM for further reuse.

Memory Communication-Capacity Co-Exploration
The aforementioned hardware model enables arbitrary subgraph execution, but the buffer capacity in hardware is always limited. Therefore, we need to partition the whole computation graph G = (V, E) into a series of subgraphs that fit the memory. Below, we move up to the optimization of graph partition and memory design-space exploration for challenge 3. We aim to find a partition scheme f : V → N that assigns each layer to a subgraph, where layer v ∈ V is computed in the f(v)-th subgraph. A valid partition scheme should ensure that any layer is computed before use; therefore, for any edge (u, v) ∈ E, we have f(u) ≤ f(v). Moreover, any subgraph should be connected in G; otherwise, it is meaningless.
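Both validity conditions can be checked directly on a graph; below is a minimal sketch (the node-list/edge-list representation and the function name are our own):

```python
from collections import defaultdict

def is_valid_partition(nodes, edges, f):
    """Check that producers are scheduled no later than their consumers and
    that each subgraph is connected (viewing edges as undirected)."""
    if any(f[u] > f[v] for u, v in edges):
        return False
    groups, adj = defaultdict(set), defaultdict(set)
    for v in nodes:
        groups[f[v]].add(v)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for members in groups.values():
        seen, stack = set(), [next(iter(members))]
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(adj[x] & members)
        if seen != members:       # subgraph is disconnected
            return False
    return True
```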
We cast the partition exploration as an optimization problem. The objective is to find a valid partition scheme f that minimizes the total cost:

min_f Σ_i C_M(G_i),    (1)

where C_M is a cost function of a given subgraph G_i based on a target metric M (e.g., external memory access (EMA) or energy). For each subgraph, the EMA cost contains the loading of weights and input activations and the storage of output activations. The energy cost includes the overhead of EMA, on-chip buffers, and computation units.

Design-Space Exploration (DSE).
Our work further extends the optimization to combine with the memory design-space exploration. In this paper, we focus on the global buffer and the weight buffer, given that they dominate the energy and area overhead of an NPU core. As illustrated in Figure 1, a larger buffer capacity can take in more layers inside a subgraph, reducing communication costs but compromising the silicon area. To co-explore the hardware configuration and mapping, we construct an objective function as a linear combination of the hardware and mapping costs:

Cost = Cost_map + λ · Cost_hw,    (2)

where λ is a preference hyper-parameter to adjust the proportion between the two costs.
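In code form, the two objectives compose as below (a sketch: the cost names and the λ blend are our own rendering of the mapping-cost sum and the linear combination described above):

```python
def partition_cost(subgraph_costs):
    # total mapping cost = sum of per-subgraph costs C_M
    return sum(subgraph_costs)

def co_exploration_cost(subgraph_costs, hardware_cost, lam):
    # linear blend of mapping and hardware costs, weighted by
    # the preference hyper-parameter lam
    return partition_cost(subgraph_costs) + lam * hardware_cost
```

A larger lam biases the search toward smaller buffers; lam = 0 reduces the problem to pure partition optimization on fixed hardware.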

Baseline Methods
Several existing optimization methods can perform graph-level partition. However, most of them fail to directly co-explore hardware and partition. Below, we list four typical methods as our baselines and sketch their features.

Enumeration-based Algorithm.

Fused-CNN [4] applies a straightforward way to enumerate all possible partition schemes and return the best one. Jangda et al. [25] proposed state-compression dynamic programming to speed up the enumeration-based algorithm. We migrate their methods as our baseline and further improve them by only recording one subgraph in the state to reduce the time complexity. Nonetheless, there are still exponentially many states in the improved implementation. Let n be the number of nodes in a graph; the enumeration-based method may explore an exponential number of states for irregular networks. Consequently, the search can hardly complete within a reasonable time for large-scale networks, not to mention the co-exploration with DSE.

Greedy Algorithm.

Halide [47] employs a greedy algorithm to perform function grouping, which can be applied to the graph-level partition. Specifically, it first assigns each layer into a single-layer subgraph. Then it iteratively fuses the pair of subgraphs contributing the greatest benefit until all benefits are negative. Therefore, this algorithm tends to be trapped at a local minimum. Moreover, since the fusion decision rules are based on a given hardware configuration, the greedy method cannot co-explore with DSE.
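The grouping loop can be sketched as follows (a simplified rendition of Halide's strategy; the subgraph representation and the benefit callback are our own):

```python
from itertools import combinations

def greedy_fuse(layers, benefit):
    """Start from single-layer subgraphs; repeatedly merge the pair of
    subgraphs with the greatest benefit until no merge helps."""
    groups = [frozenset([layer]) for layer in layers]
    while len(groups) > 1:
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda p: benefit(groups[p[0]], groups[p[1]]))
        if benefit(groups[i], groups[j]) <= 0:
            break  # all remaining merges are harmful: a possible local minimum
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return groups
```

The early exit at the first all-negative round is exactly where the method can get stuck: a temporarily harmful merge that would enable a better one later is never taken.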

Dynamic Programming (DP)-based Algorithm.
For the irregular network scheduling problem, Zheng et al. [73] proposed a DP-based algorithm.They arrange the layers based on their depth and perform DP in a sequential manner.
This method is restricted to assigning layers that are contiguous in the depth order into a subgraph; hence, the exploration is confined to a constrained search space. It is unlikely to find the global optimum, especially for non-plain network structures. In addition, since the state transition of DP depends on the predefined buffer size, it is also difficult to carry out co-exploration.

Simulated Annealing (SA)
SA [33] is a popular optimization algorithm that samples a point and updates it iteratively to improve. It accepts new sample points with a probability affected by the performance difference and a hyper-parameter named temperature. We employ the customized mutation operations (described in Section 4.4.3) to update the sample points and implement an SA-based algorithm as a baseline.
SA is an alternative optimization method for our framework with compatible operators, but it is not as stable as the genetic algorithm across a range of benchmarks, as will be shown in later experiments.

Genetic Algorithm
Previous research shows competitive performance of the Genetic Algorithm (GA) in several scheduling optimization problems [30,31]. We summarize several benefits of GA for our hardware-mapping co-exploration problem:
1. White-box property: we can track and tune its optimization process conveniently, so it is easy and intuitive to understand.
2. Complete search space: it has the potential to explore the complete search space through customized mutation and crossover operations.
3. Avoiding local optima: in contrast to the greedy algorithm, GA can naturally jump out of local minima, benefiting from the diversity of the population.
4. Flexible initialization: we can use the results of other optimization algorithms to initialize GA and use GA to fine-tune the result.
5. Co-exploration: through the proposed GA operations and genome encoding, it can further support partition-DSE co-exploration.
We encode each candidate solution (a partition scheme and the corresponding memory configuration for our problem) as a genome, and the population contains a set of genomes. The GA goes through a series of generations to obtain a lower cost. It performs crossover and mutation operations on the population in each generation. Specifically, a crossover operation blends two genomes selected from the population to generate one offspring, while a mutation operation modifies a genome randomly. At the end of each generation, the evaluation environment evaluates the fitness of each genome, and the population of the new generation is selected based on the fitness results.
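A generic skeleton of this generational loop (our own minimal sketch, with elitism added so the best genome is never lost; Cocco's actual operators encode partitions and memory configurations rather than the toy integers used here):

```python
import random

def genetic_search(init_pop, fitness, crossover, mutate,
                   generations=50, tournament_k=3, seed=0):
    """Generic GA loop: crossover + mutation produce offspring, tournament
    selection forms the next generation, and the best genome is carried
    over (elitism) so fitness never regresses."""
    rng = random.Random(seed)
    pop = list(init_pop)
    for _ in range(generations):
        children = [mutate(crossover(*rng.sample(pop, 2), rng), rng)
                    for _ in pop]
        candidates = pop + children
        elite = max(candidates, key=fitness)
        pop = [elite] + [max(rng.sample(candidates, tournament_k), key=fitness)
                         for _ in range(len(pop) - 1)]
    return max(pop, key=fitness)
```

The tournament size (here 3) is the hyper-parameter mentioned in the selection step: larger tournaments increase selection pressure at the cost of population diversity.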

Cocco Optimization Framework
Cocco is a GA-based optimization framework that enables the co-exploration of memory configuration and graph-level partition, as shown in Figure 10. The core of Cocco is a series of operations that explore a complete search space. We build a genetic algorithm based on these customized operations. Fed with the neural network structure and DSE requirements, Cocco goes through several steps to obtain the optimization results. The execution model described in Section 3 is embedded in the evaluation environment. In the following, we introduce the five stages of Cocco.

4.4.4 Evaluation.

Since GA tries to maximize the fitness of the genomes, we set the fitness to be the opposite of the cost (e.g., Formula 1 and 2). To evaluate the fitness of each genome in the population, we use our simulator (introduced in the next section) to extract the execution costs of subgraphs (e.g., EMA and energy).
During the evaluation, the simulator decodes the subgraph and hardware configuration of each genome and calculates the fitness by aggregating the cost of each subgraph. In particular, when a large subgraph exceeds the buffer capacity, we perform the split-subgraph operation to ensure genome validity. This kind of in-situ tuning increases the number of valid samples during the optimization operations and thus improves the sample efficiency.

Selection.
At the end of each generation, Cocco performs tournament selection. Specifically, it holds multiple tournaments among a few randomly selected genomes, and the winners (the genomes with the best fitness) of these tournaments form the population of the new generation. This operation promotes superior fitness in the new generation. The number of genomes in each tournament is decided by a hyper-parameter. The new generation subsequently starts from the crossover step again.

Experiments
In the evaluations, we first present the superiority of Cocco for graph partition, and then demonstrate the outstanding stability and sample efficiency of its co-exploration for hardware optimization, followed by additional discussions of the results under different configurations.

Methodology
5.1.1 Evaluated Models. In the following evaluations, we consider three types of model structures: plain (VGG16 [57]), multi-branch (ResNet50, ResNet152 [20], GoogleNet [59], Transformer [64], and GPT [52]), and irregular (RandWire-A/B [68] and NasNet [75]). RandWire-A/B are generated based on the small and regular regime configurations introduced in the paper [68]. FC layers are transformed into 1×1 CONV, while pooling and element-wise layers are analyzed as depth-wise CONV without weights. The scalar operations (e.g., activation functions) are hidden in the pipeline (e.g., the post-process module following the PEs in Simba [56]), and their overhead can be ignored.

As shown in Figure 10, we consider a SIMBA-like hierarchical accelerator with a global buffer, a weight buffer, and a 4×4 PE array in each core, as used in several previous works [56,61,71]. Each PE contains an 8×8 MAC array to process a sub-tile from the global buffer. In particular, we model the execution flow based on the scheme described in Section 3. The parallelism of the two dimensions of the PE array can be dynamically configured by the mapper results to ensure high utilization. We schedule subgraphs in topological order and prefetch the weights of the next subgraph during the current computation. We also extend our platform to support fundamental multi-core studies by interconnecting cores with a crossbar; the cores share weights to relieve the burden on each core.
The arithmetic and memory overheads are extracted from synthesized RTL implementations in a 12nm library (SRAM generated with the ARM memory compiler) at 1GHz. The DRAM energy is set to 12.5pJ/bit [70]. The extra footprint of the plug-in design is mainly a 272-Byte register file that stores the head and end logical region addresses of at most 64 nodes, which is negligible. Based on the off-the-shelf evaluators Timeloop [50] and MAESTRO [37] for spatial accelerators, we developed a modified simulator that supports the evaluation of latency and energy. It employs the consumption-centric scheme to determine the tile size of each layer, and the memory access in the model is free from padding data. The latency per subgraph is the maximum of the calculation and external-communication cycles. We allocate 16GB/s of DRAM bandwidth per accelerator core for loading weights and input activations and for writing back data needed by subsequent subgraphs. The off-chip communication consists of the weight loading of each layer and the inputs and outputs of each subgraph. As described in Section 3, our subgraph execution scheme avoids recomputing intermediate outputs.
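The latency rule above can be sketched directly. This is an illustrative simplification, not the modified simulator: the function names are ours, and we assume 16GB/s at 1GHz, i.e., roughly 16 bytes per cycle, with weight prefetch fully overlapped so subgraph latencies add.

```python
def subgraph_latency_cycles(compute_cycles, dram_bytes,
                            bandwidth_bytes_per_cycle=16):
    """Latency of one subgraph: the slower of computation and external
    communication (16 GB/s at 1 GHz is about 16 B/cycle)."""
    comm_cycles = dram_bytes / bandwidth_bytes_per_cycle
    return max(compute_cycles, comm_cycles)

def network_latency_cycles(subgraphs):
    """Subgraphs run in topological order; with next-subgraph weight
    prefetch overlapping the current computation, latencies sum."""
    return sum(subgraph_latency_cycles(c, b) for c, b in subgraphs)
```

A compute-bound subgraph (first call below) hides its DRAM traffic entirely, while a communication-bound one is limited by bandwidth.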

Baselines.
The three optimization baselines for graph partition are the greedy algorithm used in Halide [47], dynamic programming (DP) used in Irregular-NN [73], and an enumeration-based method as a reference.
For the DSE studies, we compare Cocco with simulated annealing (SA) [33] to demonstrate the better stability of GA. Both are co-optimization schemes that optimize the partition and the hardware settings at the same time. In contrast to co-optimization, the two-step scheme is another method for design-space exploration.

Figure 11. The evaluation results for graph partition using the EMA-opt configuration (EMA as the optimization metric). The enumeration-based method is deterministic and figures out the optimal solution as a reference for the first four models. It cannot complete for the large-scale models (Transformer, GPT, RandWire-A, and RandWire-B) in a reasonable time because of the exponential search space.

Specifically, we use random search (RS) or grid search (GS) to sample memory capacity candidates and then explore the corresponding partition schemes. During the search, we evaluate 5,000 samples for each capacity candidate and keep the best candidate as the output. As for the sampling method, RS randomly samples memory capacity candidates, while GS uses a coarser granularity to enumerate the candidates.
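The two-step baseline can be sketched as below. This is our own illustration, not the paper's code: `partition_cost(capacity)` is a hypothetical stand-in for running a full partition search (e.g., 5,000 samples) at one fixed capacity.

```python
import random

def two_step_search(partition_cost, cap_min, cap_max, step,
                    num_candidates, method="grid", seed=0):
    """Two-step DSE: sample memory capacities first, then evaluate each
    capacity's best partition independently and keep the cheapest.

    partition_cost(capacity) stands in for a per-capacity partition
    search; lower return values are better.
    """
    rng = random.Random(seed)
    grid = list(range(cap_min, cap_max + 1, step))
    if method == "grid":
        # GS enumerates deterministically at a coarse granularity.
        candidates = grid[:num_candidates]
    else:
        # RS draws capacity candidates at random from the same grid.
        candidates = [rng.choice(grid) for _ in range(num_candidates)]
    best = min(candidates, key=partition_cost)
    return best, partition_cost(best)
```

Because each capacity is explored in isolation, nothing learned at one capacity transfers to another, which is the weakness the co-optimization scheme addresses.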

Graph Partition Evaluations
We start by presenting the partition performance on single-core hardware with a 1MB global buffer and a 1.125MB weight buffer. The number of samples in Cocco is set to 400,000. We evaluate the external memory access (EMA) and bandwidth requirements of the eight models shown in Figure 11, where the results are normalized to the Halide baseline. This experiment validates the effectiveness of our Cocco framework in graph partition. For networks with simpler structures, Cocco finds the same optimal solutions as the enumeration-based results. For large-scale irregular networks (Transformer, GPT, RandWire-A, and RandWire-B), the enumeration-based method cannot complete in a reasonable time, while Cocco provides better solutions than Halide and DP. A better subgraph partition strategy eases the communication burden, thus reducing the EMA cost and bandwidth requirements.

Hardware-Mapping Co-Exploration
Having established the superiority of Cocco for graph partition, we further co-explore the memory configuration and graph-partition mapping as the core study of this work. Three categories of exploration methods are used: the fixed-hardware scheme and the two-step scheme as baselines, and the proposed co-optimization scheme. We set three fixed memory configurations with Small, Medium, and Large capacities, each followed by a partition-only procedure. The two-step scheme is implemented with decoupled steps for capacity search (RS or GS) and partition (GA). The co-optimization methods include the proposed Cocco and an SA-based one as the comparison. All methods sample up to 50,000 points. The energy-capacity co-optimization is used in the following evaluations.

DSE analysis using separate and shared buffer.
We first perform the hardware-mapping co-exploration to determine a suitable memory configuration (except for the fixed-HW scheme) with the cost-function coefficient set to 0.002, and then solely execute the partition-only Cocco to obtain the final cost. In particular, we also compare the results using two memory designs: a separate buffer and a shared buffer. In the separate-buffer design, activations and weights are stored in different buffers, while they share the same space in the shared-buffer design. The memory capacity candidates for the global buffer (for activations) range from 128KB to 2048KB with a 64KB interval, while those for the weight buffer range from 144KB to 2304KB with a 72KB interval. The exploration range of the shared buffer is from 128KB to 3072KB with a 64KB interval. The evaluation using separate buffers is shown in Table 1, where Cocco achieves better optimization, with a 1.89% (compared to SA on ResNet50) to 50.33% (compared to Fixed-HW(L) on RandWire) lower cost than the various baselines across four models. The two-step scheme fails to combine information across different capacities, so it is generally worse than the co-optimization method.

Table 2. Hardware-mapping co-exploration for the shared buffer. We evaluate the cost using Formula 2 (the lower the cost, the better), where the metric is energy.
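Formula 2 itself is not reproduced in this excerpt; the description of Figure 13 (points on a line, where a lower intercept means a smaller cost and the slope encodes the energy-capacity preference) suggests a weighted-sum form. A minimal sketch under that assumption, with illustrative names:

```python
def co_opt_cost(metric, capacity_kb, coeff):
    """Assumed weighted-sum form of the co-optimization cost: the
    metric (e.g., energy) plus a capacity penalty scaled by the
    trade-off coefficient. A larger coeff favors smaller buffers."""
    return metric + coeff * capacity_kb
```

Under this form, sweeping the coefficient (0.0005 to larger values, as in Figure 14) shifts the optimum from capacity-generous, energy-minimal designs toward smaller buffers.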
The capacity results also reflect the inherent capacity preferences of different models. The data amounts in GoogleNet and RandWire are relatively small, so their capacity requirements are lower. In contrast, the data amount in NasNet is larger, so a higher capacity is preferred.
As shown in Table 2, the evaluation under the shared-buffer setting shows a similar trend. Furthermore, most of the cost results of the shared buffer are lower than those of the separate configuration. Although the shared-buffer design requires additional control flows, it is indeed more efficient than the separate-buffer design.

Sample efficiency analysis.
We next study the sample efficiency of the two-step and co-optimization schemes in Figure 12. We record the cost trends of the first 50,000 samples on ResNet50, GoogleNet, and RandWire during the exploration. Overall, Cocco shows a consistent convergence trend on these three networks, converging faster and achieving lower costs than the other baselines, which indicates higher sample efficiency. The two-step methods perform graph partition separately under different capacities, so they fail to utilize partition information across capacities. In particular, the GS method uses a deterministic search direction (from large to small capacity in this experiment), so its convergence time depends on the optimal capacity. Since GoogleNet and RandWire require relatively small buffers, GS takes a considerable number of samples to converge.

Optimization procedure analysis.
We next study how the distribution of sample points changes during the optimization procedure of Cocco. While searching for 20 generations with 500 genomes each, we divide them into ten groups with different colors in Figure 13. The results show that the distribution moves towards a lower intercept, i.e., a lower cost, as the optimization proceeds.

Study of the coefficient in the cost function.
The results in Figure 14 demonstrate the effectiveness of the trade-off coefficient in adjusting the preference between the memory capacity and the given metric (energy is used here). The optimization trades memory capacity for lower energy cost as the coefficient increases. In addition, a larger memory capacity indeed contributes to lower energy, but the yields differ because of the models' inherent graph and layer patterns. For example, NasNet is more memory-intensive and structurally more complex than the other three models, so it requires a larger memory capacity for less energy consumption.

Multi-core study.
Table 2 shows that the increase of capacity brings sub-linear performance gains. To study this observation, we scale our model to the multi-core version and share the weights of a subgraph across cores. Different cores buffer only a subset of the weights and transfer data between cores, similar to BSD in Tangram [18] or data rotation in NN-Baton [61]. The overhead of the interconnection crossbar is extracted from the implemented Arteris IP [5]. An accelerator with more cores can cover a larger subgraph but brings more core-to-core overhead. As shown in Table 3, in most cases, energy increases from the single-core to the dual-core configuration because of the communication overhead. Moreover, profiting from the data-sharing mechanism, the required memory of each core drops as the core number increases.

Batch size study.
For the batch size evaluation shown in Table 3, the latency with a larger batch size principally presents a sub-linear increase, which benefits from the lower bandwidth requirement for weights via inter-sample data reuse. In addition, such data reuse amortizes the energy burden per batch. Owing to the better weight reuse in multi-batch processing, a larger batch size does not require a proportional capacity.
Related Works

Intra-layer Optimization
Prior works focus on data reuse for intra-layer assignments, such as output-stationary in ShiDianNao [14] and Envision [46], weight-stationary in NeuFlow [15] and NVDLA [49], input-stationary in SCNN [51], and row-stationary in Eyeriss [13]. Based on these primitive dataflow patterns, extensive studies have explored optimal tiling and reordering schemes via brute-force, feedback-based, and constraint-optimization approaches [23,30,50]. These works focus on layer-level optimization and miss the graph information at a higher level. The efficiency of tile updates depends on the memory architecture. Simba [56,74] and NN-Baton [61] view each tile as an independent workload, so the tile size has a prominent impact on memory access due to halo regions. Motivated by traditional vision processors, Ascend [40] and DRQ [58] employ line buffers to achieve data reuse in the row direction, but line buffers cannot well support 2D-tiling reuse in both the row and column directions.

Inter-layer Optimization
Intra-layer scheduling is sub-optimal because it is limited to the data reuse within a single layer. Therefore, Fused-CNN [4], SR-CNN [38], and LCP [42] introduce layer-fusion methods that cache intermediate data on-chip to reduce data transfer overhead, using handcrafted or heuristic methods for the fusion partition. Although Irregular-NN [73] suggests a customized DP algorithm, its exploration space is constrained because the layers in an assignment need to be successive in a specific order. A recent work named DNNFuser [29] employs an RL-based method, but its formulation towards 1D layer fusion can hardly handle complex irregular networks. Tangram [18] and Atomic [72] schedule DNN workloads on a multi-core (scalable) accelerator, but they focus on executing a single layer on each core at a time rather than processing multiple layers with local data reuse. Also, some previous works [2,19,62] tackle the workload placement problem for multiple devices without discussing the downstream execution on each device.
Cocco proposes an automatic framework for inter-layer scheduling with a comprehensive memory scheme. It focuses on the fundamental core-level temporal execution, which can potentially be scaled up to multi-core or multi-device scenarios with a spatial parallelism mechanism.

Design-Space Exploration for Memory
Memory design exploration methods lie primarily on two sides: analysis-driven and search-driven. For the analysis-driven methods, Chen et al. [12] leverage red-blue pebble models to derive proper memory capacity representations. Subsequently, Cai et al. [9] propose Olympus, which generalizes the framework to a batch of successive layers and adds more scheduling and data reuse techniques. However, these methods struggle to represent a subgraph with complex inter-layer connections. As for the search-driven methods, Xiao et al. [67], Kwon et al. [37], and Feng et al. [16] explore the memory configuration for layer-level assignments using brute-force search, while Kao et al. [32] employ a genetic algorithm to improve efficiency. These works principally focus on layer-level information, whereas Cocco exploits graph-level features for better optimization.

Conclusion
While layer-level scheduling has been widely studied to improve memory efficiency, graph-level optimization remains relatively unexplored. This paper proposed a graph-level dataflow with a corresponding memory management scheme that enables flexible graph partitions with high memory utilization. On top of it, we proposed Cocco, a framework that recommends a memory configuration together with graph-level scheduling strategies. Cocco shows outstanding graph-partition ability compared to the greedy algorithm and DP employed in previous works, and it enables efficient graph-level hardware-mapping co-exploration. This work thus offers an implementation philosophy for accelerator memory and guidance for better deployment upon it.

Figure 4. A conceptual comparison between two manners to process a subgraph. The node marked with a negative number represents the input node. The corresponding subgraph is shown in the upper right, where K×K/S refers to the convolution kernel size (K) and stride (S).

Figure 6. The memory snapshot during two subgraph elementary operations, based on the execution scheme of the Figure 5 example. The allocated memory size and update offset correspond to the size and offset annotations in the figure (the [i:j] notation denotes data ranging from index i to j). The arrows denote the data dependencies according to the node relations in the subgraph.

Figure 7. Memory allocation and data update scheme in the global buffer for full data reuse. The data layout used in our implementation is NWHC8c (aligned to 8 channels), which can be changed in another design. H0 and W0 are the height and width of an input tile; C is the input channel size; w is the global width-dimension index of the input tensor; and w0 is the width-dimension index of an input tile.

Figure 8. Hardware implementation with the buffer region manager in our 12nm NPU as a demonstration. The layout is an NPU core extracted from part of our in-house chip.

4.1 Problem Formulation
4.1.1 Graph-Level Partition. Formally, a DNN model can be represented as a computation graph G = (V, E), where V is the vertex set consisting of all the layers in the DNN model, and E is the edge set that defines the structure of the DNN. In particular, an edge (u, v) ∈ E represents that the output of layer u is an input of layer v.
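The G = (V, E) formulation above pairs naturally with the topological scheduling of subgraphs described in the experiments. A minimal sketch with illustrative names (layers as integers 0..n-1, edges as (u, v) pairs), not the Cocco implementation:

```python
from collections import defaultdict, deque

def topological_order(num_layers, edges):
    """Execution order for a DNN graph G = (V, E): an edge (u, v) means
    layer u's output feeds layer v, so u must run before v (Kahn's
    algorithm)."""
    succ = defaultdict(list)
    indeg = [0] * num_layers
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(i for i in range(num_layers) if indeg[i] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != num_layers:
        raise ValueError("graph has a cycle; not a valid DNN DAG")
    return order
```

The same routine applies at the subgraph level once each partition is contracted to a single node, since a valid partition of a DAG yields a DAG of subgraphs.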

Figure 9. Illustration of the crossover and mutation operations in Cocco.


Figure 12. The convergence curves of Cocco compared with other baselines in the hardware-mapping co-explorations. An optimization method requiring fewer samples in (d) has higher sample efficiency.

Figure 13. Visualization of the sample-point distribution during optimization. The slope of the red dashed line denotes the preference between energy and capacity cost. A point on a line with a lower intercept has a smaller cost.

Figure 14. The trade-off between energy and memory capacity. The optimization target is to minimize the cost defined in Formula 2, where the metric is energy. The energy results of each model are normalized to the results at the first coefficient value (0.0005).

Table 1. Hardware-mapping co-exploration for the separate buffer. In this table, A refers to the global buffer, and W refers to the weight buffer. We evaluate the cost using Formula 2 (the lower the cost, the better), where the metric is energy. We use RandWire-A as RandWire in the following experiments.

Table 3. Multi-core and batch evaluation using the energy-capacity co-opt configuration. Size denotes the shared-buffer size in each core.