A Spatial-Designed Computing-In-Memory Architecture Based on Monolithic 3D Integration for High-Performance Systems

The computing-in-memory (CIM) technology effectively addresses the bottleneck of data movement in the traditional von-Neumann architecture, especially for deep neural network (DNN) acceleration. However, with the improving performance and parallelism of CIM processing elements (PEs), the substantial latency and power overhead caused by high-density intermediate-result transmission has become a new bottleneck in CIM architectures. In this paper, we propose a spatial-designed CIM architecture based on the emerging Monolithic 3D (M3D) technology, together with a spatiality-aware DNN mapping method, for high-performance CIM systems. The proposed architecture introduces a novel hierarchy by implementing staggered tiers, enabling PEs to be shared by multiple tiles, and uses the ultra-dense, low-power Inter-Layer Vias (ILVs) as shared buses, enabling CIM PEs to exploit the ultra-high bandwidth of M3D for inter-tile and intra-tile data transfer. Experimental results show that the proposed M3D-enabled CIM architecture, combined with the proposed mapping method, achieves a 6.52× latency improvement, a 40.84× interconnection energy-delay product (EDP) improvement, and a 7.62× system-level EDP improvement compared to a state-of-the-art CIM architecture.


INTRODUCTION
Deep Neural Network (DNN) acceleration tasks are facing significantly increasing amounts of data [1]. In traditional von-Neumann architectures, hardware acceleration of data-intensive DNN tasks faces substantial latency and power overhead from data movement, which introduces a "memory wall" in hardware accelerators. Built with multiple processing elements (PEs), computing-in-memory (CIM) architectures address this memory-access overhead with high-density local memory and highly parallel, energy-efficient in-situ computation inside the PEs [2]. CIM architectures show immense potential in the face of the ever-increasing complexity and scale of AI inference applications.
However, the performance and parallelism of PEs are improving significantly [3], and CIM systems are facing more complex DNN structures, causing the data density of the intermediate results to become considerably higher [4]. As a result, more contention occurs on the interconnect network during communication, as shown in Fig. 1(a). Limited network resources in traditional architectures can cause the overall latency of the system to become "interconnection-dominant" rather than "PE-dominant." In addition, more frequent data transmission leads to higher power consumption on the interconnection network. This latency and power overhead caused by intermediate results has become a new bottleneck (a new "memory wall") in high-performance CIM architectures.
To address the aforementioned problems, we propose a novel spatial-designed CIM architecture based on emerging Monolithic 3D (M3D) technology for data-intensive DNN acceleration tasks. In M3D integration, the memory tier and the logic tier are connected through high-density vertical connections of nanoscale Inter-Layer Vias (ILVs) [5], [6], significantly reducing the 2D die footprint and the data transmission overhead. However, conventional M3D architectures typically stack PEs directly atop or below their designated memory block. This provides PEs with ultra-high point-to-point bandwidth to their local memory [7], [8], [9], but in the acceleration of complex DNN structures, such as residual blocks [1] and FPNs [10], intermediate results that need to be written to non-local buffers are frequent and unavoidable. As a result, a significant amount of data transfer cannot leverage the high bandwidth of ILVs in conventional M3D architectures, and the data movement bottleneck remains unresolved. To address this problem, the proposed architecture implements a staggered tier placement, as shown in Fig. 1(b), enabling PEs to be shared across tiles and providing sharable vertical ultra-dense ILV buses with higher connectivity for both inter-tile and intra-tile communication. The proposed architecture enables scalable, high-speed, and power-efficient movement of intermediate results with minimal routing and logic overhead. We also introduce a spatiality-aware DNN mapping algorithm that fully exploits the high-density ILVs based on the positions of PEs and ILV buses, so that most of the data transmission can be offloaded to the ILV buses with minimal latency and power consumption. The major contributions of this paper are as follows:
• A novel M3D-based CIM architecture that enables the PEs to use ILVs as inter-tile and intra-tile buses, breaking through the connectivity limitations of traditional CIM architectures.
• A detailed scheme for using ILVs as buses, and an on-chip dataflow for the M3D architecture, achieving a reusable and scalable interconnect network between tiers.
• A spatiality-aware DNN mapping method that further reduces the overhead of potential data transfers.
• We employ RRAM-based CIM to evaluate our architecture on various DNN tasks. Experiments show that the proposed architecture outperforms a state-of-the-art CIM architecture, achieving an improvement of 6.52× in latency, and 40.84× and 7.62× in interconnect and system EDP, respectively.

PRELIMINARIES
Conventional CIM Data Communication
CIM PEs are typically composed of digital-to-analog and analog-to-digital converters (DACs/ADCs) and crossbars. Crossbars can be built on emerging memory devices such as resistive random-access memory (RRAM) [11], or on conventional memory cells such as SRAM [12]. In particular, with its highly parallel analog computing, a non-volatile RRAM-based PE can achieve high-speed and low-power convolution operations. The highly parallel computation generates intermediate results, which are stored in a buffer and transmitted between PEs through an on-chip network.
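As a behavioral illustration of such a PE (not a circuit-level model), the in-situ matrix-vector multiplication (MVM) of a crossbar can be sketched in a few lines; the array size follows the 576×128 configuration used later in this paper, while the bit widths and variable names are assumptions for illustration only.

```python
# Minimal behavioral sketch of a CIM crossbar performing MVM.
# The array size matches the 576x128 configuration used in the experiments;
# bit widths and names are illustrative assumptions.
import numpy as np

ROWS, COLS = 576, 128

# Quantized DNN weights mapped onto the crossbar as conductance states.
weights = np.random.randint(-8, 8, size=(ROWS, COLS))

# Input activations applied to the rows (after the DAC in a real PE).
inputs = np.random.randint(0, 16, size=ROWS)

# In-situ accumulation along each column; the ADCs would then digitize
# these column sums into the intermediate results sent to the buffer.
intermediate_results = inputs @ weights   # shape: (COLS,)
```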
Traditional multi-core CIM architectures typically follow a hierarchical design. ISAAC-based architectures use a concentrated-mesh organization [2], [11], where several PEs form a tile and share a localized memory, and several tiles are concentrated at a node of the Network-on-Chip (NoC) [2]. The work in [13] also introduced an optimized three-level heterogeneous network, significantly improving the efficiency of data movement.
However, the limited NoC channel width results in longer packets to accommodate the output of highly parallel ADCs (e.g., 128 ADCs or more). Additionally, as shown in Fig. 2(a) and (b), due to the constraints of array size, complex networks such as ResNet [1] increase the complexity of data communication after mapping, leading to a higher flit injection rate for inter-tile communication. Broadcasting and convergence may also occur between PEs that are far apart, which further introduces contention on the network, causing more latency and power consumption during data movement. We used BookSim 2.0 [14] to evaluate the impact of the length and density of intermediate results on network latency. As shown in Fig. 2(c), for a typical 32-bit 4×4 mesh NoC at 1 GHz [11], the network latency significantly exceeds the 14.4 ns processing latency of high-performance CIM PEs [3]. This gap between network latency and PE processing latency leads to PE idling, creating bubbles in the pipeline and reducing system performance [13].
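The following first-order sketch, using only the numbers cited in this section (a 32-bit channel at 1 GHz, 8-bit results from 128 ADCs, and a 14.4 ns PE latency), illustrates why even bare packet serialization already exceeds the PE processing latency; it is an illustrative estimate that ignores hops and contention, so actual network latency is higher.

```python
# First-order estimate: serializing one PE's output over a narrow planar
# channel already exceeds the PE processing latency, before any routing
# or contention. Numbers follow the text; the model itself is illustrative.
import math

CHANNEL_WIDTH_BITS = 32        # planar NoC flit width
CLOCK_NS = 1.0                 # 1 GHz -> 1 ns per cycle
PE_LATENCY_NS = 14.4           # high-performance CIM PE [3]

packet_bits = 128 * 8          # 128 ADC outputs, 8-bit each
serialization_ns = math.ceil(packet_bits / CHANNEL_WIDTH_BITS) * CLOCK_NS

print(f"serialization only: {serialization_ns} ns vs PE: {PE_LATENCY_NS} ns")
# -> 32 ns of pure serialization per packet, so PEs stall waiting on the NoC.
```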

Conventional M3D-Enabled Architectures
As shown in Fig. 3(a), M3D-enabled architectures use nanoscale ILVs as inter-tier links between a PE and the memory atop it, transforming complex planar routing wires into vertical connections and significantly reducing the wire length [5]. As shown in Fig. 3(b) and (c), ILVs have a diameter below 100 nm, providing over 1000 times higher connection density than the alternative Through-Silicon Vias (TSVs). The work in [16] shows that the routing delay of M3D ILVs outperforms TSVs by 4.7× and traditional planar metal routing by 10.6×. M3D technology with high-speed, ultra-dense ILVs provides ultra-high inter-tier bandwidth and has further enabled highly efficient fused system designs with emerging device technologies such as carbon-nanotube field-effect transistors (CNTFETs) and RRAM [5]. M3D architectures that accelerate CIM with back-end-of-line (BEOL) compatible oxide-semiconductor devices (e.g., IGZO [17] and IWO [18]) have also been widely studied.
Previous works have also explored the design of interconnection networks in M3D architectures. Ref. [17] uses ILVs for memory access and top-tier IGZO/CNT-based routers for data transfer. Ref. [9] proposed an M3D-enabled energy-efficient SW-NoC (Switching-NoC) through the design of multi-tier routers. Ref. [8] introduced an architecture with inter-tier routers and bypass links to provide extra inter-tile channels, reducing the contention on the NoC.
However, conventional M3D-enabled architectures still strictly follow a hierarchical structure, in which the high ILV bandwidth can only be leveraged as the connection between PEs and their designated local memory. As a result, previous M3D CIM architectures still depend on the planar global interconnect for a significant amount of data movement between tiles and are unable to fully leverage the ultra-high ILV bandwidth. Moreover, adding new tiers to build a multi-tier NoC [9] leads to much higher costs. Furthermore, transmission through ILVs can be completed in a single cycle, while CIM PEs typically need multiple cycles for calculation; as a result, unshared ILV interconnects can only achieve a low bandwidth utilization.

ARCHITECTURE AND OPTIMIZATION
Architecture Overview
From the observations above, we conclude that: 1) higher bandwidth with low power and area overhead is needed in high-performance CIM architectures; 2) a new hierarchical partitioning and interconnection network design is needed for the M3D-enabled CIM architecture to bypass the memory-PE correspondence limits and maximize the utilization of ILV connections.
To achieve these goals, we propose a novel M3D-enabled CIM architecture with a new hierarchy and interconnection design. As shown in Fig. 4(a), the architecture includes a memory tier atop a CIM tier; we specifically use RRAM-based CIM for demonstration. High-density ILVs are clustered into shared buses that provide efficient vertical connections between the two tiers. Memory tiles are connected through a mesh NoC in the memory tier, and each tile consists of four distributed SRAM blocks, serving as buffers for intermediate results from the child PEs placed in the CIM tier below. Unlike traditional hierarchical architectures, we change the correspondence between memories and PEs: the memory blocks and PE macros are placed in a staggered way, so that each memory block corresponds to the four PEs directly below its vertices, and each PE corresponds to the four memory blocks atop it, through shared vertical ILVs with minimal detour. High-density shared ILV buses allow data movement between PEs and memory blocks to be completed within one cycle, for both intra-tile and inter-tile communication.
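The staggered correspondence can be sketched with a simple coordinate convention, assuming memory blocks are indexed by their grid cell and PEs by the vertex they sit under; this is an illustrative model rather than the actual floorplan.

```python
# Sketch of the staggered memory-block/PE correspondence. Memory blocks are
# indexed by their grid cell (bx, by); PEs sit under the block vertices and
# are indexed by integer vertex coordinates. Illustrative convention only.

def pes_under_block(bx: int, by: int):
    """The four PE positions directly below the vertices of memory block (bx, by)."""
    return [(bx, by), (bx + 1, by), (bx, by + 1), (bx + 1, by + 1)]

def blocks_above_pe(px: int, py: int, grid_w: int, grid_h: int):
    """The (up to four) memory blocks whose vertices sit atop PE (px, py)."""
    candidates = [(px - 1, py - 1), (px, py - 1), (px - 1, py), (px, py)]
    return [(x, y) for (x, y) in candidates if 0 <= x < grid_w and 0 <= y < grid_h]
```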
As shown in Fig. 4(a), the tightly arranged ILVs are staggered in groups, with each group serving as a bus connecting the PEs and their corresponding memory block. We have Read ILV Buses (RIBs) for fetching kernel data from memory to PEs, and Write ILV Buses (WIBs) for transmitting calculation results from PEs to memories. The ILV bus disperses into a planar H-connection after reaching its target tier; RIBs connect the read ports of a memory block to its corresponding PEs below. As shown in Fig. 4(b), each memory tile contains a Scheduler that records the dataflow configuration and the detailed DNN tasks after mapping, and is responsible for PE calculation scheduling and managing all control signals. A Special Function Unit (SFU) is responsible for non-matrix-vector-multiplication (non-MVM) functions, such as concatenation, pooling, etc. We preserve the planar NoC on the memory tier, with routers and channels connecting memory tiles in a scalable mesh fashion and providing links for the small amount of data that cannot utilize the ILV buses.
The structure of a PE in the CIM tier is shown in Fig. 4(c): the memory tier writes control data (e.g., kernel size and destination node) through RIBs into the State Buffer. The computational data is arranged in the Kernel Buffer through RIBs before being sent to the CIM macro for processing. Additionally, a Bypass Link forwards the RIB's data to the WIB for direct memory access without PE processing, which further exploits the ILV buses.
As shown in Fig. 4(d), each WIB is shared by four SRAMs, and each RIB is shared by four PEs. We categorize PEs into three types: tile-exclusive PEs (E-PEs), below the intersection of the four memory blocks of a tile; PEs shared by two tiles (2s-PEs), below the center of a tile edge; and PEs shared by four tiles (4s-PEs), below the intersection of four memory tiles. E-PEs can access any memory block in a tile through the ILV buses, while 2s-PEs and 4s-PEs leverage the inter-tile ILV buses for inter-tile data transfer.
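Reusing the vertex coordinates from the sketch above, and assuming each memory tile is a 2×2 group of SRAM blocks, the PE type can be derived from the parity of its vertex coordinates; the parity rule below is an illustrative assumption consistent with Fig. 4(d), not an exact description of the layout.

```python
# Classify a PE by the number of tiles it can serve, assuming tiles are 2x2
# groups of memory blocks and PEs sit under block vertices (illustrative rule).

def pe_type(px: int, py: int) -> str:
    interior_x = px % 2 == 1      # vertex lies inside a tile along x
    interior_y = py % 2 == 1      # vertex lies inside a tile along y
    if interior_x and interior_y:
        return "E-PE"             # under the intersection of one tile's four blocks
    if interior_x or interior_y:
        return "2s-PE"            # under the center of a tile edge, shared by two tiles
    return "4s-PE"                # under the intersection of four tiles
```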

Supported Types of Dataflow
In the proposed architecture, the data can be categorized into Intermediate Results (IR) and Control Data (CD). The IR is the calculation result generated by a PE. The structure of the CD is shown in Fig. 5(a); it is used for packet routing and for tracking the calculation process by the State Buffer and the Scheduler. As shown in Fig. 5(b), the proposed architecture supports four types of dataflow on the heterogeneous interconnect network: 1) if the IR is written through an intra-tile WIB to the target SRAM, the task information in the scheduler (e.g., the remaining operation count) is updated locally, and no packet is injected into the planar NoC; 2) if the IR is written to adjacent memory in another tile through an inter-tile WIB, the source tile sends CD through the planar mesh to update the target scheduler, and the target tile synchronizes the CD and IR after it receives the IR from the inter-tile WIB; 3) for long-distance transfer, the CD goes through the planar NoC following an X-Y routing algorithm, while the IR utilizes the ILV buses and the Bypass Links following the same X-Y route, and at each hop the local scheduler synchronizes the CDs and the corresponding IRs; 4) if the WIBs are completely occupied, both CD and IR travel through the planar NoC following an X-Y routing algorithm. This design significantly reduces the average packet length injected into the planar network, alleviating the latency and power bottlenecks caused by data transfer.
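The selection among these four dataflow types can be summarized as a simple decision, as in the following sketch; the tile/adjacency checks and parameter names are illustrative assumptions rather than the Scheduler's actual implementation.

```python
# Illustrative sketch of the dataflow selection made at mapping time, based on
# the source PE's tile, the destination SRAM's tile, and WIB occupancy.

def select_dataflow(src_tile, dst_tile, adjacent_tiles, wib_free: bool) -> int:
    if not wib_free:
        return 4   # CD and IR both travel on the planar NoC (X-Y routing)
    if dst_tile == src_tile:
        return 1   # intra-tile WIB; scheduler updated locally, no NoC packet
    if dst_tile in adjacent_tiles:
        return 2   # inter-tile WIB for IR; only CD goes over the planar NoC
    return 3       # long distance: IR hops over ILV buses and Bypass Links,
                   # CD follows the same X-Y route on the planar NoC
```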

Spatiality-Aware DNN Mapping
Unlike traditional architecture designs, the physical positioning of PEs in the proposed architecture significantly benefits the utilization of the low-power, high-bandwidth ILV buses, which in turn benefits the system-level EDP. We propose a spatiality-aware DNN mapping algorithm, shown in Algorithm 1, whose goal is to exploit the ILV buses. The mapping procedure includes the allocation of input buffers and of PEs for calculation. First, the input buffer is allocated close to the PEs that generate the results of the previous DNN layer, to reduce contention on the NoC. We aim to place it in the same tile to leverage the intra-tile WIB without the need for inter-tile CD transfer. If this is not feasible, the algorithm chooses an adjacent SRAM block in another tile, which can still exploit the inter-tile WIB for IR. If neither of these options is achievable, the nearest available SRAM block is preferred. Next is the mapping of PEs. Due to the constraints of array size, a single network layer may be mapped to multiple PEs that use the same input data. For PEs using the same input data from the same SRAM block, the priority for PE allocation, from highest to lowest, is E-PE, 2s-PE, and 4s-PE. This ensures that PEs with superior inter-tile communication capabilities are reserved for subsequent DNN layers. At the end of mapping each layer, the type of dataflow, as described in the previous section, is determined based on the utilization of the ILV buses between memory blocks and PEs. The utilization of each ILV bus is then updated for the mapping of subsequent layers.
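A condensed, self-contained sketch of these allocation priorities is given below; all data structures and helper names are hypothetical stand-ins, and only the priority orders follow Algorithm 1.

```python
# Sketch of the spatiality-aware allocation priorities (Algorithm 1).
# Tiles are identified by (x, y) coordinates; data structures are illustrative.
PE_PRIORITY = {"E-PE": 0, "2s-PE": 1, "4s-PE": 2}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def allocate_input_sram(prev_tile, free_srams):
    """SRAM priority: same tile > adjacent tile > nearest available block."""
    same = [s for s in free_srams if s["tile"] == prev_tile]
    adjacent = [s for s in free_srams if manhattan(s["tile"], prev_tile) == 1]
    pool = same or adjacent or free_srams
    return min(pool, key=lambda s: manhattan(s["tile"], prev_tile))

def allocate_pes(free_pes, num_pes):
    """PE priority: E-PE > 2s-PE > 4s-PE, keeping shared PEs for later layers."""
    return sorted(free_pes, key=lambda pe: PE_PRIORITY[pe["type"]])[:num_pes]

# Example: a layer whose producer PEs sit in tile (0, 0) and that needs 2 PEs.
srams = [{"tile": (0, 0)}, {"tile": (1, 0)}, {"tile": (2, 2)}]
pes = [{"type": "4s-PE"}, {"type": "E-PE"}, {"type": "2s-PE"}]
print(allocate_input_sram((0, 0), srams), allocate_pes(pes, 2))
```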
Taking the Res-block [1] shown in Fig. 2(a) and (b) as an example, as shown in Fig. 6, with the proposed algorithm, PEs with broadcasting or convergence requirements are located close to the same memory block, and PEs with inter-tile data transmission requirements are located at the tile boundaries to exploit the inter-tile WIBs. As a result, all intermediate results can leverage the ILV buses for inter-tile and intra-tile communication, and only a small amount of CD needs to be transferred through the planar NoC.

EXPERIMENTS
Experiment Setup
We evaluated the proposed architecture and mapping method using RRAM-based CIM specifically, and conducted experiments to analyze the latency and power performance for various DNN tasks.
We selected the CIFAR-10 dataset and evaluated the performance of the proposed architecture on large-scale CNN models, e.g., ResNet-18, ResNet-50 [1], VGG-11, and VGG-19 [19]. For the simulators, we used a modified BookSim 2.0 [14] to evaluate the latency and power consumption of the interconnect network of the proposed architecture, and we built an in-house cycle-accurate simulator based on SystemC to assess the end-to-end latency.
Specifically, to further analyze the performance of the proposed architecture in high-data-density scenarios, we added a control group employing a weight replication method. As shown in Fig. 7 and Fig. 8, we replicate the weights of the initial DNN layers four times (acceleration rate ×4), and the replicated PEs are responsible for the parallel calculation of different kernels in the same feature map. In this way, the pipeline bubbles introduced by kernel striding can be eliminated, providing an end-to-end speedup and a higher computation density.
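The effect of weight replication can be illustrated with a first-order throughput model, assuming the replicated PEs split the kernel-striding positions of a layer evenly; the model ignores buffering and interconnect effects and is not the simulator used in our experiments.

```python
# First-order model: with R weight copies, R kernel-striding positions of the
# same feature map are processed in parallel, so the layer needs roughly 1/R
# as many PE rounds. Illustrative assumption only.
PE_LATENCY_NS = 14.4

def layer_compute_ns(num_windows: int, replication: int = 1) -> float:
    """Time to stream all sliding-window positions of one layer through its PEs."""
    rounds = -(-num_windows // replication)   # ceiling division
    return rounds * PE_LATENCY_NS

# Example: a 32x32 feature map with stride 1 (1024 window positions).
print(layer_compute_ns(1024, replication=1))  # 14745.6 ns
print(layer_compute_ns(1024, replication=4))  # 3686.4 ns, ~4x fewer PE rounds
```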

Architecture Configurations
We compare our architecture with a state-of-the-art CIM architecture [4], whose heterogeneous interconnection design improved the inference latency by 5.4× compared to ISAAC [2]. The parameters of the proposed architecture are listed in Table 1. We employed two architecture scales for networks of different sizes: a 10×10 mesh for ResNet-50 and VGG-19, and an 8×8 mesh for ResNet-18 and VGG-11. We use a 576×128 RRAM array size [20] and adopt the state-of-the-art PE latency of 14.4 ns from [3]. The data precision of intermediate results is 8 bits. Memory blocks and PEs are connected through 576-bit RIBs and 1024-bit WIBs. For the other NoC parameters, we adopt typical values of a 32-bit data width at 1 GHz.
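For reference, the parameters stated above can be summarized as a simple configuration record; the field names are illustrative and only the values follow Table 1 and the text.

```python
# Evaluated configuration, collected from the parameters stated in this section.
ARCH_CONFIG = {
    "mesh_size":         {"ResNet-50/VGG-19": (10, 10), "ResNet-18/VGG-11": (8, 8)},
    "rram_array":        (576, 128),   # rows x columns [20]
    "pe_latency_ns":     14.4,         # state-of-the-art PE [3]
    "ir_precision_bits": 8,            # intermediate results
    "rib_width_bits":    576,          # Read ILV Bus (memory -> PE)
    "wib_width_bits":    1024,         # Write ILV Bus (PE -> memory)
    "noc_width_bits":    32,           # planar NoC channel
    "noc_clock_ghz":     1.0,
}
```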

Results and Discussions
As shown in Fig. 7(a), for the control group without weight replication, the experimental results show that the proposed architecture achieves a latency improvement of 3.7× to 5.5× across the different networks. This improvement can be attributed to the high-bandwidth RIBs and WIBs, which significantly reduce the transmission overhead on the planar global routing. Additionally, the low-power ILV buses bring an improvement of 5.3× to 5.7× in interconnection power consumption, owing to lower injection rates and shorter average packet lengths. As a result, the interconnection EDP is improved by 20.4× to 30.6×, as shown in Fig. 8.
As shown in Fig. 7(b), for the control group with 4× weight replication, the network latency is improved by 4.6× to 6.5×, and the interconnection power consumption is improved by 5.6× to 6.5×. This is because weight replication introduces a larger number of active PEs, which results in a higher inter-tile data density. In this context, the ILV buses in the proposed architecture achieve higher utilization, leading to a larger reduction in NoC contention. The proposed architecture is well suited to high-data-density scenarios and gives a 3.8× speedup on average after applying 4× weight replication to the initial layers, while the traditional architecture gives only 3.2× on average, since the replication-introduced contention is non-negligible in traditional architectures. The proposed architecture improves the interconnection EDP by 26.4× to 40.8× after weight replication, as shown in Fig. 8. Adopting the power breakdown of a more complex system from [11], the system EDP is improved by 4.37× to 7.62× with the proposed architecture.

CONCLUSIONS
This work presents a spatial-designed CIM architecture based on M3D integration technology, together with a spatiality-aware DNN mapping method. The proposed architecture implements a staggered tier placement with shared PEs and ILV buses for inter-tile and intra-tile data movement, exploiting the high-bandwidth, low-power ILV interconnection. Moreover, the proposed mapping algorithm further enables critical paths to leverage the ILV buses, significantly reducing the contention and transmission overhead on the planar on-chip network. Experimental results show that the proposed design achieves up to a 6.52× latency improvement, a 40.84× interconnection EDP improvement, and a 7.62× system-level EDP improvement compared to a state-of-the-art CIM architecture.

Figure 1 :
Figure 1: (a) Contention introduced by complex dataflow in a traditional CIM architecture. (b) Proposed M3D-based architecture with staggered tier placement.

Figure 2 :
Figure 2: (a) Structure of a Res-block [1] and (b) its dataflow after mapping to 576×128 RRAM CIM arrays. (c) Impact of injection rate and packet length on the network latency of a 4×4 mesh, compared with a state-of-the-art PE [3].

Figure 4 :
Figure 4: (a) The proposed architecture. (b) Detailed structure of a memory tile. (c) Detailed structure of a PE, with STATE ILV buses for control signals and DATA ILV buses for intermediate results. (d) Relative locations of E-PE, 2s-PE, 4s-PE, and the inter-tile ILV bus.

Figure 5 :
Figure 5: (a) Detailed segmentation of the Control Data (CD); (b) Supported dataflow in the proposed architecture.

Figure 6 :
Figure 6: Mapped Res-block from Fig. 2(a). The allocation sequence is marked in circles. In particular, the PEs for Conv3 and Conv4 use the inter-tile WIB to send IR to downstream memory blocks, which requires CD to be passed through the planar NoC.

Figure 7 :
Figure 7: (a) Latency and (b) interconnection energy comparison with the baseline architecture for different DNN tasks and different acceleration rates. ×1 denotes the control group without weight replication, and ×4 denotes four-times weight replication.

Figure 8 :
Figure 8: Interconnection EDP improvement for different DNN tasks and different acceleration rates.
Algorithm 1 Spatiality-Aware DNN Mapping
for all layers L_i in the target DNN do
    // Allocate input memory M_i for L_i based on the previous layer's PEs P_{i-1}
    Connect to an SRAM block, priority from highest to lowest:
        1. A same-tile SRAM block, through the intra-tile WIB;
        2. An SRAM block adjacent to P_{i-1}, through the inter-tile WIB;
        3. The nearest SRAM block to P_{i-1}, through the inter-tile WIB;
    // Allocate PEs P_i for L_i
    for all child PEs of P_i do
        Choose the PE location, priority: E-PE > 2s-PE > 4s-PE;
    Fix the dataflow based on the locations of M_i and P_i, and on the ILV utilization;
    Update the ILV utilization based on the new dataflow;

Table 1 :
Summary of design parameters