Abstract
An emerging use case of machine learning (ML) is to train a model on a high-performance system and deploy the trained model on energy-constrained embedded systems. Neuromorphic hardware platforms, which operate on principles of the biological brain, can significantly lower the energy overhead of an ML inference task, making these platforms an attractive solution for embedded ML systems. We present a design-technology tradeoff analysis to implement such inference tasks on the processing elements (PEs) of a non-volatile memory (NVM)-based neuromorphic hardware. Through detailed circuit-level simulations at scaled process technology nodes, we show the negative impact of technology scaling on the information-processing latency, which impacts the quality of service of an embedded ML system. At a finer granularity, the latency inside a PE depends on (1) the delay introduced by parasitic components on its current paths, and (2) the varying delay to sense different resistance states of its NVM cells. Based on these two observations, we make the following three contributions. First, on the technology front, we propose an optimization scheme where the NVM resistance state that takes the longest time to sense is set on current paths having the least delay, and vice versa, reducing the average PE latency, which improves the quality of service. Second, on the architecture front, we introduce isolation transistors within each PE to partition it into regions that can be individually power-gated, reducing both latency and energy. Finally, on the system-software front, we propose a mechanism to leverage the proposed technological and architectural enhancements when implementing an ML inference task on neuromorphic PEs of the hardware. Evaluations with a recent neuromorphic hardware architecture show that our proposed design-technology co-optimization approach improves both performance and energy efficiency of ML inference tasks without incurring high cost-per-bit.
1 INTRODUCTION
Neuromorphic computing systems are integrated circuits that implement the architecture of the central nervous system in primates [14, 22, 65]. These systems facilitate energy-efficient computations using spiking neural networks (SNNs) [63] for power-constrained embedded devices. To this end, the design workflow is to train a machine learning (ML) model (commonly on a backend server) and subsequently convert the trained model to spike-based computations and deploy it on the neuromorphic hardware of an embedded system. The quality of inference (e.g., accuracy) is assessed in terms of the inter-spike interval (ISI) (see Appendix B). Therefore, any deviation from its expected value will lead to a degradation of the inference quality.
Typical neuromorphic systems such as Loihi [28], DYNAPs [66], and \(\mu\)Brain [93] consist of processing elements (PEs) that communicate spikes using a shared interconnect. Each PE implements neuron and synapse circuitries. A common technique to implement a neuromorphic PE is by using an analog crossbar where bitlines and wordlines are organized in a grid with memory cells connected at their crosspoints to store synaptic weights [2, 32, 40, 45, 46, 51, 61, 69, 100]. Neuron circuitries are implemented along bitlines and wordlines. Figure 1 (left) shows the architecture of an \(N \times N\) analog crossbar with N bitlines and N wordlines.
Fig. 1. An \(N \times N\) crossbar showing the parasitic components within.
We investigate the internal architecture of a crossbar and find that parasitic components introduce delay in propagating current from a pre-synaptic neuron to a post-synaptic neuron as illustrated in Figure 1 (right). This delay depends on the specific current path activated in a crossbar. The higher the number of parasitic components on a current path, the larger is its propagation delay [70, 86, 88, 89, 91]. Parasitic components on bitlines and wordlines are a major source of latency at scaled process technology nodes, and they create significant latency variation in a crossbar [33, 34, 49, 53, 56, 67, 79, 83, 87]. Such variation can introduce ISI distortion (see Appendix B), which may impact the quality of an inference task [7, 27].
To increase the energy efficiency of a neuromorphic system, non-volatile memory (NVM) such as oxide-based random access memory (OxRRAM), phase-change memory, ferroelectric RAM, and spin-based magnetic RAM is used to implement the memory cells in a crossbar [15, 64, 94, 95, 98]. An NVM cell can be programmed to a high-resistance state (HRS) or one of many low-resistance states (LRS), implementing multi-bit synaptic weights [59, 64, 97]. To implement a synaptic weight on a memory cell of a crossbar, the synaptic weight is programmed as the conductance of the cell.
A crossbar can accommodate only a fixed number of pre-synaptic connections per post-synaptic neuron. To give an example, the crossbar in Figure 1 (left) has N pre-synaptic neurons, N post-synaptic neurons, and \(N^2\) memory cells. Each post-synaptic neuron can have a maximum of N pre-synaptic connections. To mitigate the negative impact of technology scaling (e.g., increase in the value of parasitic components on current paths and increase in the power density), N is constrained to a lower value, typically between 128 and 256 (see our tradeoff analysis in Section 2). To map a large SNN model on multi-PE hardware, system software frameworks such as NEUTRAMS [50], NeuroXplorer [11], SentryOS [93], LAVA [60], and DFSynthesizer [76] are commonly used. These frameworks first partition a model into clusters, where a cluster is a subset of neurons and synapses of the model that can be implemented on the architecture of a crossbar. Subsequently, the partitioned clusters are implemented on different crossbars of a neuromorphic hardware.
We make the following two key observations related to a neuromorphic PE:
The latency within a crossbar is a function of the length (i.e., the number of parasitic components) of current paths and the delay to sense the NVM cell activated on a current path.
Due to how memory cells are organized in a crossbar, a significant fraction of these memory cells remains unutilized when implementing ML inference tasks.
Based on these two observations (which we elaborate in Sections 2 through 4), we present a design-technology tradeoff analysis to implement ML inference tasks on different PEs of an NVM-based neuromorphic system. We make the following four key contributions:
Through detailed circuit-level simulations at scaled process technology nodes, we show that bitline and wordline parasitics are the primary sources of long latency in a crossbar and that they create asymmetry in inference latency. With technology scaling, the absolute latency increases and the latency asymmetry becomes increasingly significant. In addition, different resistance states of a multi-level NVM cell take varying latencies to sense during an inference operation (see Section 2).
We propose to optimize the implementation of synaptic weights on NVM cells such that the resistance state that takes the longest time to sense is programmed on the NVM cell that has the least parasitic delay in a crossbar. This lowers the latency of a crossbar (see Section 3).
We propose an architectural change of introducing isolation transistors in a crossbar to partition it into regions that can be individually power-gated based on their utilization. In this way, we improve energy efficiency. In addition, by isolating the unutilized region of a crossbar from the active region, parasitics of only the active region contribute to latency rather than both as in a baseline non-partitioned crossbar architecture. This reduces the latency of a crossbar (see Section 4).
We show that our technological and architectural optimizations can only deliver on their latency and energy improvement promises if they are exploited efficiently by the system software. Therefore, we propose a mechanism to expose our proposed design changes to the system level, allowing the system software to improve both latency and energy when implementing ML inference tasks on hardware (see Section 5).
We evaluate our design-technology co-optimization approach for a recent neuromorphic hardware using 10 ML inference tasks. Results show 12% reduction in average PE latency and 22% lower application energy compared to current state of the art.
To the best of our knowledge, this is the first work that demonstrates the energy and latency improvement of power gating crossbar-based neuromorphic hardware designs.
2 DESIGN-TECHNOLOGY TRADEOFF ANALYSIS
Without loss of generality, we demonstrate the design-technology tradeoff for an OxRRAM-based neuromorphic PE, where each NVM cell can be programmed to the following four resistance levels (i.e., 2-bit per synapse): 1.5 k\({\Omega }\), 5.78 k\({\Omega }\), 13.6 k\({\Omega }\), and 73 k\({\Omega }\) [19, 20, 31, 64, 74]. Furthermore, we show our analysis for four process technology nodes: 16 nm, 22 nm, 32 nm, and 45 nm, which are obtained from our technology provider. The analysis can be easily extended to other NVM types and also to other process technology nodes.
2.1 Cost-per-Bit Analysis for a Neuromorphic PE
The computer memory industry has thus far been primarily driven by the cost-per-bit metric, which provides the maximum capacity for a given manufacturing cost. As shown in recent works [56, 67, 68, 79, 81, 82, 83], manufacturing cost can be estimated from the area overhead. To estimate the cost-per-bit of a neuromorphic PE, we investigate the internal architecture of a crossbar and find that a neuron circuit can be designed using 20 transistors and a capacitor [48], whereas an NVM cell is a 1T-1R arrangement with a transistor used as an access device for the cell. Within an \(N \times N\) crossbar, there are N pre-synaptic neurons, N post-synaptic neurons, and \(N^2\) synaptic cells. The total area of all neurons and synapses of a crossbar is (1) \(\begin{align} \text{Neuron area} &= 2N(20T + 1C)\\ \text{Synapse area} &= N^2(1T + 1R), \end{align}\) where T stands for transistor, C for capacitor, and R for NVM cell. The total synaptic cell capacity is \(N^2\), with each NVM cell implementing 2-bit per synapse. The total number of bits (i.e., synaptic capacity) in the crossbar is (2) \(\begin{equation} \text{Total bits} = 2N^2. \end{equation}\)
Therefore, the cost-per-bit of an \(N \times N\) crossbar is (3) \(\begin{equation} \text{Cost-per-bit} = \frac{2N(20T+1C)+N^2(1T+1R)}{2N^2} \approx \frac{F^2(27+2N)}{N}, \end{equation}\) where the cost-per-bit is expressed in terms of the crossbar dimension N and the feature size F. Equation (3) provides a back-of-the-envelope calculation of cost-per-bit. Figure 2 plots the normalized cost-per-bit for four different process technology nodes, with the crossbar dimension ranging from 16 to 256. We make the following two observations.
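Equation (3) can serve as a quick sizing aid. The following sketch evaluates the approximate cost-per-bit for the crossbar dimensions and process nodes used in Figure 2 (the unit-area approximation of Equation (3) is taken as-is; function name and absolute values are illustrative):

```python
def cost_per_bit(N, F):
    """Approximate cost-per-bit of an N x N crossbar, per Equation (3):
    cost-per-bit ~ F^2 * (27 + 2N) / N, with the feature size F in nm."""
    return F**2 * (27 + 2 * N) / N

# Larger crossbars amortize the neuron periphery over more bits, so
# cost-per-bit falls as N grows and as the technology (F) scales down.
for F in (45, 32, 22, 16):                  # process nodes in nm
    row = [round(cost_per_bit(N, F), 1) for N in (16, 64, 128, 256)]
    print(F, "nm:", row)
```

This reproduces the two trends observed in Figure 2: the curve flattens toward \(2F^2\) for large N, and each smaller node shifts the whole curve down.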
Fig. 2. Cost-per-bit analysis of a crossbar.
First, the cost-per-bit reduces with increase in the dimension of a crossbar—that is, larger-sized crossbars can accommodate more bits for a given cost. However, both the absolute latency and the latency variation increase significantly for larger-sized crossbars, which increases inference latency and reduces the quality of ML inference due to an increase in ISI distortion (see our analysis in Section 2.2). Second, the cost-per-bit reduces considerably with technology scaling. This is due to higher integration density at smaller process technology nodes.
The formulation for the cost-per-bit (Equation (3)) depends on the specific neuron architecture of Indiveri [48] and the one transistor (1T)-based OxRRAM design of Mallik et al. [64]. This formulation can be easily extended to other neuron and synapse designs. Furthermore, system designers can use our formulation to configure their neuromorphic hardware, without having to access and plug in technology-related data.
2.2 Latency Variation in a Neuromorphic PE
Figure 3 shows the difference between the best-case and worst-case latency in a crossbar (expressed as a fraction of the 1-\(\mu\)s spike duration) for five different crossbar configurations at 45-nm, 32-nm, 22-nm, and 16-nm process technology nodes. Our experimental setup, which uses NeuroXplorer [11] and incorporates software, architecture, circuit, and technology, is described in Section 6.1. All NVM cells are programmed to the HRS—that is, 73 k\(\Omega\) (see Section 2.3 for the dependency on resistance states).
Fig. 3. Variance in latency within a crossbar, expressed as a fraction of a single spike duration.
We make two key observations. First, the latency difference increases with crossbar size due to an increase in the number of parasitic components on current paths. The average latency difference for the \(256 \times 256\) crossbar is higher than \(16 \times 16\), \(64 \times 64\), and \(128 \times 128\) crossbars by 16.5×, 13.4×, and 4.5×, respectively. This average is computed across the four process technology nodes. Therefore, smaller-sized crossbars lead to a smaller variation in latency, which is good for performance. However, smaller-sized crossbars also lead to higher cost-per-bit, which we have analyzed in Section 2.1. For most neuromorphic PE designs, \(128 \times 128\) crossbars achieve the best tradeoff in terms of latency variation and cost-per-bit [42, 57, 64, 66, 70, 102].
Second, the latency difference increases significantly for scaled process technology nodes due to an increase in the value of the parasitic component. The average latency difference for 32-nm, 22-nm, and 16-nm process technology nodes is higher than 45 nm by 1.3×, 3×, and 6.6×, respectively. The unit wordline (bitline) parasitic resistance ranges from approximately 2.5\(\Omega\) (1\(\Omega\)) at the 45-nm node to 10\(\Omega\) (3.8\(\Omega\)) at the 16-nm node. The value of these unit parasitic resistances is expected to scale further reaching \(\approx 25\Omega\) at the 5-nm node [33, 35, 36, 73, 92]. The unit wordline and bitline capacitance values also scale proportionately with technology. Latency variation increases ISI distortion, which degrades the quality of ML inference.
2.3 Varying Latency to Sense NVM Resistance States
The latency (i.e., the delay) on a current path from a pre-synaptic neuron to a post-synaptic neuron within a crossbar is proportional to \(R_{eff}\cdot C_{eff}\), where \(R_{eff}\) (\(C_{eff}\)) is the effective resistance (capacitance) on the path. This delay increases the time it takes for the membrane potential of a post-synaptic neuron to rise above the threshold voltage, causing a delay in spike generation.
The effective resistance on a current path depends on the value of parasitic resistances and the resistance of the NVM cell. We analyze the latency impact due to different resistance states. Figure 4 plots the increase in latency (expressed as a fraction of the 1-\(\mu\)s spike duration) to sense three NVM resistance states (LRS2, LRS3, and HRS) with respect to LRS1 at 45-nm, 32-nm, 22-nm, and 16-nm process technology nodes. These results are for a neuromorphic PE with a \(128 \times 128\) crossbar.
Fig. 4. Latency to sense various NVM resistance states, expressed as a fraction of a single spike duration.
We observe that the latency to sense the HRS is considerably higher than all three LRS at all process technology nodes (consistent with other works [18, 64, 99]). The latency difference increases with technology scaling due to an increase in the size of parasitic components on bitlines and wordlines of a crossbar, which we analyzed in Section 2.2.
3 PROPOSED TECHNOLOGICAL IMPROVEMENTS
Based on the design-technology tradeoff analysis of Section 2, we now present our technology-related optimization. Without loss of generality, we present our optimization for a 128 × 128 crossbar-based neuromorphic hardware designed at the 16-nm node. We exploit the following two observations from Section 2: (1) HRS in an NVM cell takes higher latency to sense than LRS, and (2) spike propagation latency in a crossbar depends on the number of parasitic components on its current path. The left side of Figure 5 shows the proposed technological changes. A crossbar is partitioned into three regions. The number of parasitic components on current paths in region A is considerably lower than in the rest of the crossbar. Therefore, all NVM cells in this region (four in this example) implement only the HRS, which takes the longest time to sense. Conversely, NVM cells in region B have longer propagation delay due to the higher number of parasitic components. Therefore, all NVM cells in this region (nine in this example) implement only the LRS, which takes the shortest time to sense. Finally, all other NVM cells (i.e., those in region C) are programmable (i.e., these cells can implement all four resistance states). The overall objective is to balance the latency on different current paths within a crossbar. This minimizes the latency variation in a crossbar, which reduces ISI distortion and improves the quality of ML inference tasks.
Fig. 5. Our proposed technological change.
The right side of Figure 5 shows a pre- and a post-synaptic neuron connected via a synapse that is programmed to the LRS. The synaptic connection can be implemented on NVM cells in region B (with only LRS1) and region C (with programmable states). The figure illustrates two alternative implementations of these neurons. If the pre-synaptic neuron is implemented on wordline 0, then the post-synaptic neuron cannot be implemented on bitlines 0 and 1. This is because NVM cells in region A are all in HRS. In this example, we show the implementation on bitline 2 (see the blue implementation). Conversely, if the post-synaptic neuron is implemented on bitline 0, then the pre-synaptic neuron cannot be implemented on wordline 0 and 1 (to avoid using region A). We show the implementation on wordline 2 (see the red implementation).
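The placement constraints of Figure 5 can be captured in a few lines (a sketch; we assume, following the figure, that region A is the \(N_h \times N_h\) corner nearest the drivers and region B the \(N_l \times N_l\) corner farthest from them; the function names are illustrative):

```python
def cell_region(w, b, N, Nh, Nl):
    """Classify the crossbar cell at (wordline w, bitline b).

    Assumed geometry (our reading of Figure 5): region A is the Nh x Nh
    corner nearest the drivers (fewest parasitics, HRS-only); region B is
    the Nl x Nl corner farthest away (most parasitics, LRS-only);
    everything else is region C (freely programmable).
    """
    if w < Nh and b < Nh:
        return "A"   # HRS-only
    if w >= N - Nl and b >= N - Nl:
        return "B"   # LRS-only
    return "C"

def can_place(state, w, b, N, Nh, Nl):
    """Check whether resistance state 'HRS' or 'LRS' may be programmed at (w, b)."""
    region = cell_region(w, b, N, Nh, Nl)
    if region == "A":
        return state == "HRS"
    if region == "B":
        return state == "LRS"
    return True  # region C accepts any of the four resistance states

# The example of Figure 5: an LRS synapse cannot use region A,
# so a pre-synaptic neuron on wordline 0 forces the post-synaptic
# neuron onto a bitline outside region A.
assert not can_place("LRS", 0, 0, N=128, Nh=64, Nl=64)
assert can_place("HRS", 0, 0, N=128, Nh=64, Nl=64)
```

The mapping step of the system software (Section 5) would consult such a predicate when choosing wordlines and bitlines for a synaptic connection.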
Formally, the proposed neuromorphic PE is represented by the tuple \(\langle N, N_h, N_l\rangle\), where \(N_h\) and \(N_l\) are the dimensions of regions A and B, respectively.
Figure 6 plots the variation of latency in the proposed 128 × 128 crossbar, normalized to a baseline architecture [7], where any NVM cell can be programmed to any of the four resistance states. See Section 6.4 for a description of this baseline architecture and Section 6.1 for the simulation setup. The variation in latency is measured as the ratio of the best-case to the worst-case latency in the crossbar. The figure reports latency variation for \(N_h\) ranging from 2 to 64 with \(N_l\) set to 16, 32, and 64.
Fig. 6. Latency variation in the proposed crossbar architecture for different settings of \(N_h\) and \(N_l\) .
We observe that latency variation decreases with an increase in \(N_h\). This is due to an increase in the size of region A, which increases the (worst-case) latency due to an increase in the number of parasitic components on current paths via the HRS. However, the (best-case) latency of current paths via the LRS remains the same. Therefore, the latency variation reduces, which improves inference quality by lowering the ISI distortion. To illustrate this concept, Figure 7 provides an example where two synapses are mapped to a 4 × 4 crossbar. In Figure 7(a), the red synapse (in HRS) is mapped to the bottom left corner of the crossbar, whereas the blue synapse (in LRS) is mapped to the top right corner. The figure shows the timing of two spikes. The input spikes on the red and blue synapses arrive at \({t_1}\) and \({t_2}\), respectively. Without loss of generality, let \(t_2 \gt t_1\). The ISI of these two spikes is \(t_2 - t_1\). Due to the delay in current propagation through bitlines and wordlines, these two spikes arrive at the output terminal at different times: the red synapse with a delay of x and the blue synapse with a delay of y. Here, \(y \gt x\). Therefore, the ISI of the output spikes is \((t_2 + y) - (t_1 + x)\). The ISI distortion (difference of ISI between input and output) is (4) \(\begin{equation} \text{ISI distortion} = \Big ((t_2 + y) - (t_1 + x)\Big) - (t_2 - t_1) = y - x. \end{equation}\)
Fig. 7. ISI improvement due to increase in the size of region A.
Figure 7(b) illustrates a scenario where region A is increased to include more cells that are programmed to the HRS. The mapping process will map the red synapse using the farthest cell of region A. The delay on this synapse is \(x+ \Delta\), where \(\Delta\) is the additional delay due to routing spikes on the red synapse via a longer route compared to that in Figure 7(a). Therefore, ISI of the output spikes is (\(t_2 + y\)) – (\(t_1 + x + \Delta\)). The ISI distortion is (5) \(\begin{equation} \text{ISI distortion} = \Big ((t_2 + y) - (t_1 + x + \Delta)\Big) - (t_2 - t_1) = y - x - \Delta . \end{equation}\)
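The arithmetic of Equations (4) and (5) is easy to verify numerically (a sketch; the delay values are illustrative fractions of a spike duration, not measured data):

```python
def isi_distortion(t1, t2, d_red, d_blue):
    """ISI distortion for two spikes entering at t1 < t2 and experiencing
    path delays d_red and d_blue (Equations (4) and (5))."""
    isi_in = t2 - t1
    isi_out = (t2 + d_blue) - (t1 + d_red)
    return isi_out - isi_in  # simplifies to d_blue - d_red

# Figure 7(a): delays x and y on the red and blue synapses (made-up values).
x, y, delta = 0.10, 0.35, 0.15
print(round(isi_distortion(0.0, 1.0, x, y), 2))            # Equation (4): y - x
# Figure 7(b): the red synapse re-mapped deeper into region A, delay x + delta.
print(round(isi_distortion(0.0, 1.0, x + delta, y), 2))    # Equation (5): y - x - delta
```

As in the text, enlarging region A adds \(\Delta\) to the shorter (red) path only, shrinking the distortion from \(y - x\) to \(y - x - \Delta\).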
Comparing Equations (4) and (5), we observe that the ISI distortion reduces due to an increase in the size of region A. ISI distortion also reduces with an increase in \({N_l}\) due to a reduction in the worst-case latency. We also note that a large \({N_h}\) may lead to higher average crossbar latency, which impacts real-time performance. Finally, going from \({N_l}\) = 16 to 32 yields no significant reduction in the latency variation: although the size of region B increases with \({N_l}\), we observe only a marginal reduction of the best-case latency. Overall, with \({N_h}\) = \({N_l}\) = 64, the latency variation is 74% lower than the baseline. This setting is chosen based on the tradeoff between latency variation and average latency for a 128 × 128 crossbar at 16 nm. The tradeoff point can change for other technology nodes and for other crossbar configurations.
3.1 Reduction in Latency Variation
To understand the reduction of latency variation within a crossbar as a result of our technological changes, we provide a simple example. Consider that there are only two current paths in a crossbar. The parasitic delays on the shortest and longest current paths are D and \((D+\Delta)\), respectively. The times to sense the LRS and HRS NVM states are S and \((S+\delta)\), respectively. Without any optimization, the worst-case condition is triggered when the HRS is programmed on the longest path and the LRS on the shortest path. The minimum and maximum latencies are \((D+S)\) and \((D+S+\Delta +\delta)\), respectively. The latency variation is \((\Delta +\delta)\). Using our technology optimization, the HRS is programmed on the shortest path and the LRS on the longest path. The two latencies are \((D+S+\Delta)\) and \((D+S+\delta)\). The latency variation reduces to \((|\Delta -\delta |)\).
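This two-path example can be checked directly (a sketch; all numeric values are illustrative):

```python
def path_latencies(D, Delta, S, delta, optimized):
    """Latencies of the two current paths in the example above.

    D: parasitic delay of the shortest path; D + Delta: of the longest.
    S: sensing time of the LRS; S + delta: of the HRS.
    """
    if optimized:
        # Our optimization: HRS on the shortest path, LRS on the longest.
        return (D + S + delta, D + Delta + S)
    # Worst case: HRS on the longest path, LRS on the shortest.
    return (D + S, D + Delta + S + delta)

D, Delta, S, delta = 1.0, 0.4, 2.0, 0.3     # made-up delay values
lo, hi = path_latencies(D, Delta, S, delta, optimized=False)
print(round(hi - lo, 6))                    # baseline variation: Delta + delta
lats = path_latencies(D, Delta, S, delta, optimized=True)
print(round(max(lats) - min(lats), 6))      # optimized: |Delta - delta|
```

With these numbers, the variation drops from \(\Delta + \delta = 0.7\) to \(|\Delta - \delta| = 0.1\), matching the closed-form result.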
Within a crossbar, there are many current paths (\(N^2\) current paths in an \(N \times N\) crossbar). The precise reduction in latency variation depends on the specific current paths activated for a synaptic connection, which is controlled during the mapping of an ML application to the crossbars of the hardware. In Figure 6, we show a 74% reduction comparing only the shortest and the longest paths in a \(128\, \times \, 128\) crossbar. In Section 7.2, we evaluate the general case considering the mapping process. We report an average 22% reduction of latency variation.
Reducing the latency variation helps reduce the ISI distortion, which improves the inference quality. In Section 7.4, we report an average 4% increase of inference quality.
3.2 Impact on Latency
Although latency variation impacts inference quality, the average crossbar latency impacts the real-time performance. To understand the impact of our technological optimization on the average crossbar latency, we consider the same example of two current paths. Consider that there are m synapses with the LRS and n synapses with the HRS. The average latency in the worst-case condition is \(\frac{m\cdot (D+S) + n\cdot (D+S+\Delta +\delta)}{m+n}\). Using the technological improvement, the average latency is \(\frac{m\cdot (D+S+\Delta) + n\cdot (D+S+\delta)}{m+n}\). Therefore, the change in latency is \((\frac{n-m}{n+m})\Delta\). This change in latency depends on (1) current paths activated in a crossbar and (2) the value of n and m—that is, the number of synaptic connections with the HRS and LRS, respectively. In Section 7.3, we show an average 3% reduction of the average crossbar latency for the evaluated applications.
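The same example extends to the average latency (a sketch; the synapse counts m and n and the delay values are illustrative):

```python
def average_latency(m, n, D, Delta, S, delta, optimized):
    """Average latency over m LRS synapses and n HRS synapses (Section 3.2)."""
    if optimized:
        total = m * (D + S + Delta) + n * (D + S + delta)
    else:
        total = m * (D + S) + n * (D + S + Delta + delta)
    return total / (m + n)

m, n = 40, 60                        # made-up synapse counts, n > m
D, Delta, S, delta = 1.0, 0.4, 2.0, 0.3
base = average_latency(m, n, D, Delta, S, delta, optimized=False)
opt = average_latency(m, n, D, Delta, S, delta, optimized=True)
# The reduction equals ((n - m) / (n + m)) * Delta, per the text.
print(round(base - opt, 6))
```

With n = 60 and m = 40, the reduction is \((20/100)\cdot 0.4 = 0.08\); for n < m, the same formula predicts an increase, which is why the benefit is workload-dependent.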
4 ARCHITECTURAL ENHANCEMENTS TO NEUROMORPHIC PE
To understand the motivation of the proposed architectural changes, Figure 8 reports the average synapse utilization of \(128 \times 128\) crossbars in neuromorphic PEs for 10 ML models implemented using the spatial decomposition technique of Balaji et al. [10], which is a best-effort approach to improve the utilization of crossbars in neuromorphic hardware.
Fig. 8. Average synapse utilization of neuromorphic PEs.
We observe that the average synapse utilization is only 0.9%. This is because a crossbar can accommodate only a limited number of pre-synaptic connections per post-synaptic neuron. To illustrate this, Figure 9 shows three examples of implementing neurons on a \(4 \times 4\) crossbar. The synapse utilization of the three example scenarios are (a) 25% (4 out of 16), (b) 18.75% (3 out of 16), and (c) 25% (4 out of 16). As the crossbar dimension increases, the utilization drops significantly. For instance, if a 128 × 128 crossbar is used to implement a single 128-input neuron (i.e., generalization of Figure 9(a)), the utilization is only 0.78% (128 utilized synapses out of a total of \(128^2 = \text{16,384}\) synapses). Lower synapse utilization leads to lower energy efficiency.
Fig. 9. Implementation of one 4-input (a), one 3-input (b), and two 2-input (c) neurons to a \(4 \times 4\) crossbar.
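The utilization numbers above follow directly from counting mapped synapses (a minimal sketch; `synapse_utilization` is a hypothetical helper, not part of any framework named in this article):

```python
def synapse_utilization(fan_ins, N):
    """Fraction of the N x N cells of a crossbar that implement synapses,
    given the fan-in of each post-synaptic neuron mapped to it."""
    used = sum(fan_ins)
    return used / (N * N)

# Figure 9: one 4-input, one 3-input, and two 2-input neurons on a 4 x 4 crossbar.
print(synapse_utilization([4], 4))       # (a) 0.25
print(synapse_utilization([3], 4))       # (b) 0.1875
print(synapse_utilization([2, 2], 4))    # (c) 0.25
# A single 128-input neuron on a 128 x 128 crossbar.
print(round(100 * synapse_utilization([128], 128), 2), "%")
```

The last line reproduces the 0.78% figure: fan-in grows linearly with N, while cell count grows quadratically, so utilization of a fully dedicated crossbar falls as 1/N.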
To improve energy efficiency, we propose to partition a neuromorphic PE into regions that can be dynamically power-gated based on their utilization for a given ML inference task. Figure 10 shows the use of isolation transistors in a neuromorphic PE to partition a \(4 \times 4\) crossbar into active and unutilized regions. Figure 10(a) illustrates the implementation of only a single neuron function \(y_1\) in the crossbar. To improve energy efficiency, isolation transistors are needed on every bitline (between wordlines 3 and 4) and on every wordline (between bitlines 1 and 2). Figure 10(b) illustrates the implementation of two neuron functions \(y_1\) and \(y_2\) in the crossbar. In this scenario, isolation transistors are only needed on every wordline (between bitlines 2 and 3). When an inference task is implemented on a neuromorphic system, each crossbar may have a different utilization of its memory cells. Therefore, to improve energy efficiency in every crossbar, isolation transistors are needed on every bitline (between every pair of wordlines) and on every wordline (between every pair of bitlines)—a total of 24 isolation transistors for this example \(4 \times 4\) crossbar (in general, \({2N(N-1)}\) for an \(N \times N\) crossbar). This fine-grained partitioned PE architecture offers flexibility in energy management incorporating crossbar utilization but leads to a significant increase in the area, latency, and system overhead to control the isolation transistors.
Fig. 10. Proposed neuromorphic PE architecture partitioned using isolation transistors.
To overcome these limitations while improving energy efficiency, we enable a coarse-grained partitioning in a crossbar as illustrated in Figure 10(c). In this example, isolation transistors are inserted selectively on every bitline (between wordlines 3 and 4) and on every wordline (between bitlines 2 and 3). This coarse-grained partitioned PE architecture requires a total of eight isolation transistors (in general, 2N for an \(N \times N\) crossbar). To reduce the control overhead, isolation transistors on the wordlines of a crossbar are controlled using a single control signal, and those on the bitlines using a second control signal. Table 1 lists the PE configurations enabled by these two control signals.
Table 1. Different PE Configurations Enabled Using the Two New Crossbar Control Signals
In a baseline PE architecture, the crossbar dimension is fixed at 4 × 4. Its static energy is proportional to the number of memory cells, which is 4 × 4 = 16 in this example. Latency in the crossbar varies from \(t_{1,1}\) (nearest cell, or best case) to \(t_{4,4}\) (farthest cell, or worst case).
In the proposed partitioned PE architecture, there are four configurations.
In configuration ‘00,’ the crossbar collapses to a 3 × 2 array; only these 6 memory cells consume static energy, and the worst-case latency reduces to \(t_{3,2}\).
In configuration ‘01,’ the crossbar operates as a 4 × 2 array with 8 active memory cells and a worst-case latency of \(t_{4,2}\).
In configuration ‘10,’ the crossbar operates as a 3 × 4 array with 12 active memory cells and a worst-case latency of \(t_{3,4}\).
In configuration ‘11,’ the crossbar expands to the full 4 × 4 array; all 16 memory cells consume static energy, and the worst-case latency is \(t_{4,4}\) plus the additional delay of the isolation transistors on the current paths.
Our proposed system software (which we discuss in Section 5) minimizes the use of configuration ‘11,’ improving both performance and energy efficiency.
Single control. The proposed partitioned PE architecture also supports using a single control signal for all isolation transistors in a crossbar. When using a single control, only the configurations ‘00’ and ‘11’ are used, implementing a \(3 \times 2\) and a \(4 \times 4\) array, respectively.
To generalize the discussion for an \(N \times N\) crossbar, assume that isolation transistors are inserted on every bitline (between wordlines P and \(P + 1\)) and on every wordline (between bitlines Q and \(Q + 1\)). Then, the four configurations are ‘00,’ a P × Q array; ‘01,’ an N × Q array; ‘10,’ a P × N array; and ‘11,’ an N × N array. Formally, \(\langle N, N_h, N_l, P, Q\rangle\) represents the proposed partitioned PE architecture. Equations (6) through (8) summarize the notations. (6) \(\begin{align} \langle N\rangle &= \text{a baseline } N\times N \text{ crossbar} \end{align}\) (7) \(\begin{align} \langle N,N_h,N_l\rangle &= N\times N \text{ crossbar with tech. enhancement (Section 3)} \end{align}\) (8) \(\begin{align} \langle N,N_h,N_l,P,Q\rangle &= N\times N \text{ crossbar with tech. and arch. enhancements (Sections 3 and 4)} \end{align}\)
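The mapping from the two-bit control signal to the active array dimensions can be sketched as follows (the function name and the assignment of bits to the wordline/bitline signals are our illustrative choices, made to reproduce the four configurations listed above):

```python
def active_array(signal, N, P, Q):
    """Active crossbar dimensions for a two-bit control signal:
    '00' -> P x Q, '01' -> N x Q, '10' -> P x N, '11' -> N x N.

    Isolation transistors sit between wordlines P and P+1 and between
    bitlines Q and Q+1; which bit drives which set of transistors is an
    assumption consistent with the configurations in the text.
    """
    rows = P if signal[1] == "0" else N   # wordline side
    cols = Q if signal[0] == "0" else N   # bitline side
    return rows, cols

N, P, Q = 4, 3, 2   # the example crossbar of Figure 10(c)
for s in ("00", "01", "10", "11"):
    rows, cols = active_array(s, N, P, Q)
    print(s, rows, "x", cols, "->", rows * cols, "powered cells")
```

Static energy scales with the powered-cell count (6, 8, 12, and 16 here), which is why the system software minimizes the use of configuration ‘11.’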
We introduce the following four terminologies: (1) expanded mode: in this mode, a crossbar is operated in configuration ‘11’; (2) collapsed mode: in this mode, a crossbar is operated in configurations ‘00,’ ‘01,’ and ‘10’; (3) collapsed region: this is the reduced dimension of the crossbar when operating in configurations ‘00,’ ‘01,’ and ‘10’; and (4) far region: this is the region of the crossbar excluding the collapsed region.
In our design methodology, the far region of a crossbar is power-gated using the two control signals at design-time considering the crossbar’s utilization. This is achieved during mapping of neurons and synapses to the hardware. Since neuron and synapse mapping does not change during inference, there is no dynamic power management needed. Consequently, there is also no latency and energy overhead involved in switching the far region on/off at runtime.
4.1 Placing Isolation Transistors in a Crossbar
To illustrate the design space exploration involved in placing isolation transistors in a crossbar, Figure 11(a) illustrates a baseline crossbar with four current paths that are activated during mapping of neurons and synapses. Figure 11(b) through (d) show three alternative placements of isolation transistors in the crossbar. In Figure 11(b), the P and Q values are kept small, so the far region is large. In this figure, only two of the current paths (1 and 2) stay within the collapsed region of the crossbar, whereas the other two current paths (3 and 4) traverse the far region. This means that the latency of paths 3 and 4 increases due to the delay of the isolation transistors on these paths. Additionally, the far region cannot be power-gated, so there is limited scope for energy reduction using power gating. Increasing the P and Q values (Figure 11(c)) shrinks the far region as illustrated in the figure. Although three of the four current paths now stay in the collapsed region, the far region still cannot be power-gated due to the presence of path 4 in this region. Finally, Figure 11(d) illustrates a placement where all current paths stay in the collapsed region. The far region can therefore be power-gated. However, because of the small size of the far region, the energy benefits may not be significant. We explore these latency and energy tradeoffs next.
Fig. 11. Placing isolation transistors in a crossbar.
Figure 12 shows the latency and energy tradeoffs in selecting the values of P and Q for the ResNet inference workload implemented on \(128 \times 128\) crossbars in neuromorphic hardware. Latency and energy numbers are normalized to Baseline. We make the following two key observations.
Fig. 12. Selecting P and Q values for the ResNet application.
First, energy is lower for smaller P and Q values. This is because by reducing P and Q, the size of the collapsed region of a crossbar reduces. Therefore, there are more memory cells in the far region that can be power-gated to lower energy.
Second, latency also reduces with a reduction in P and Q values (until P = Q = 80). This is due to the shorter bitlines and wordlines of the collapsed region. However, with P = Q = 64 or 72, more clusters of ResNet need crossbars in the expanded mode of operation, because the synapses in these clusters no longer fit onto the reduced dimension of a collapsed crossbar. This increases latency due to the isolation transistors on current paths. For ResNet, P = Q = 80 is the tradeoff point; the tradeoff point differs across applications. To select a single crossbar configuration that gives good results for all applications, we perform a similar analysis for all evaluated applications (see Section 6.3). Based on this analysis, P = Q = 96 is the selected configuration for the \(128 \times 128\) crossbar at the 16-nm technology node.
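The sweep described above can be sketched as follows. The cost model here is a deliberately simplified, hypothetical stand-in for the circuit-level numbers reported in Figure 12; the function name, the constants, and the candidate list are our own illustration.

```python
# Sketch of the P = Q design-space sweep. A cluster of size (rows, cols)
# fits the collapsed region iff rows <= P and cols <= Q; otherwise the
# crossbar must operate in expanded mode.
def sweep_pq(clusters, n=128, candidates=(64, 72, 80, 96, 104, 112, 120)):
    """Return (P, latency, energy) estimates for each candidate P = Q.

    Illustrative cost model only: collapsed crossbars see shorter
    bitlines/wordlines (latency ~ P/n), while expanded crossbars pay the
    isolation-transistor delay on top of full-length lines (here 1.1x).
    Only the far region of a collapsed crossbar can be power-gated.
    """
    results = []
    for p in candidates:
        collapsed = sum(1 for r, c in clusters if r <= p and c <= p)
        expanded = len(clusters) - collapsed
        latency = (collapsed * (p / n) + expanded * 1.1) / len(clusters)
        energy = (collapsed * (p / n) ** 2 + expanded * 1.0) / len(clusters)
        results.append((p, latency, energy))
    return results
```

In practice the per-configuration latency and energy come from circuit simulation, but the structure of the search, counting clusters forced into expanded mode at each candidate P = Q and weighing the two cost curves, is as shown.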
5 EXPLOITING TECHNOLOGICAL AND ARCHITECTURAL IMPROVEMENTS VIA THE SYSTEM SOFTWARE
To describe the system software, the left side of Figure 13 shows the final crossbar design with isolation transistors that allow each neuromorphic PE to operate in a collapsed or expanded mode. The right side shows control signals for these transistors generated from a centralized controller implemented inside the system software.
Fig. 13. Final crossbar design using the isolation transistors. The right side shows the control signals generated from the controller when using the proposed partitioned PE architecture in a neuromorphic system.
Without loss of generality, Figure 14 shows modifications to the baseline system software [60] to exploit the proposed design changes. A trained ML model is first partitioned to generate clusters, where each cluster can fit onto a crossbar. These clusters are stored in a cluster queue.
Fig. 14. Proposed system software. All changes are indicated in red.
In selecting the final mapping, the configuration selector first checks to see if a cluster can be mapped to a \(P \times Q\) array. If this is possible, then the mapping to the \(P \times Q\) array is selected as the final mapping for the cluster, and the corresponding PE is set to operate in configuration ‘00’ (collapsed mode). Otherwise, the configuration selector checks to see if the cluster can be mapped to the \(N \times Q\) or \(P \times N\) array. If so, the corresponding mapping is selected, and the PE is set to operate in configurations ‘01’ or ‘10,’ respectively. If the cluster cannot be mapped to either \(N\times Q\) or \(P\times N\) arrays, the mapping to the \(N \times N\) array is selected as the final mapping of the cluster with the PE set to operate in configuration ‘11’ (expanded mode). In this way, the proposed system software uses expanded mode only when it is absolutely necessary to do so. Otherwise, it selects the collapsed region to map synapses, improving both latency and energy.
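The configuration-selector logic above can be sketched as a small function. The array shapes and control codes follow the text; the function name and interface are our own illustration.

```python
def select_configuration(rows, cols, P, Q, N=128):
    """Pick the smallest array a (rows x cols) cluster fits into.

    Returns the isolation-transistor control code used in the text:
    '00' = P x Q (collapsed), '01' = N x Q, '10' = P x N,
    '11' = N x N (expanded). Raises if the cluster exceeds the crossbar.
    """
    if rows <= P and cols <= Q:
        return '00'   # fully collapsed: far region can be power-gated
    if rows <= N and cols <= Q:
        return '01'   # collapsed along one dimension only
    if rows <= P and cols <= N:
        return '10'   # collapsed along the other dimension only
    if rows <= N and cols <= N:
        return '11'   # expanded mode, used only when unavoidable
    raise ValueError("cluster does not fit a single crossbar")
```

With the P = Q = 96 configuration selected in Section 4.1, for example, a 80 × 80 cluster would map to configuration ‘00,’ whereas a 120 × 120 cluster would force configuration ‘11.’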
6 EVALUATION METHODOLOGY
6.1 Simulation Framework
We evaluate the proposed design-technology co-optimization approach for OxRRAM-based neuromorphic PEs. Our simulation framework is built on NeuroXplorer [11], a cycle-level in-house neuromorphic simulator with programmable crossbar parameters. We configure this framework to simulate crossbars with the parameters listed in Table 2.
| Parameter | Value |
|---|---|
| Neuron Technology | 16-nm CMOS (original design is at 14-nm FinFET) |
| Synapse Technology | HfO\({}_2\)-based OxRRAM [64] |
| Supply Voltage | 1.0 V |
| Energy per Spike | 23.6 pJ at 30-Hz spike frequency |
| Energy per Routing | 3 pJ |
| Switch Bandwidth | 3.44 G events/s |
Table 2. Major Simulation Parameters Extracted from the Work of Davies et al. [28]
Circuit-level simulations are performed with technology parameters from the predictive technology model (PTM) [101] and OxRRAM-specific parameters from Chen and Yu [18]. We note that comparing different chip technologies or recommending one technology node over another is not the focus of this work. Instead, we show that for a given process technology node, design optimizations can reduce energy and latency variations. Furthermore, the proposed design-technology co-optimization methodology can be used by system designers to choose the best technology node for their neuromorphic designs by exploring the energy-performance tradeoffs.
Neuromorphic simulations are performed on a Lambda workstation with an AMD Threadripper 3960X (24 cores), 128-MB cache, 128 GB of RAM, and two RTX 3090 GPUs. Figure 15(a) shows the design pipeline implemented using NeuroXplorer. An ML model is first trained using frameworks such as Keras and PyTorch. Subsequently, the trained model is converted into an SNN using [4, 76]. The trained model is also simulated using an SNN simulator such as CARLsim [21]. NeuroXplorer integrates PyCARL [3], which allows the SNN model to be simulated using other SNN simulators such as Nengo [13], NEURON [43], and Brian [39]. Keras [41] and CARLsim [21] use the two GPUs to accelerate model training and SNN functional simulation, respectively.
Fig. 15. Design pipeline using NeuroXplorer.
The SNN simulated model is clustered using the best-effort technique of Balaji et al. [10], which maximizes cluster utilization. Clusters of the SNN are mapped to the hardware using the SpineMap technique [7]. Finally, we perform cycle-accurate simulation of the clusters using NeuroXplorer [11].
Figure 15(b) shows the modeling hierarchy of the simulator. At the highest level is the many-core design, which is a tile-based architecture similar to Loihi [28]. Each PE consists of a crossbar, which is an organization of neurons and synapses. A neuron is modeled using the work of Indiveri [48] and a synaptic circuit using the work of Mallik et al. [64]. At the lowest level are the technology models (see Table 2).
Finally, Figure 15(c) shows the statistics collection framework in NeuroXplorer. It facilitates global statistics collection, where spike arrival times are recorded for each PE (shown as C in the figure). These spike times are then used to compute the ISI distortion (see Appendix B).
6.2 Power Consideration for Isolation Transistors
The additional power required to control the isolation transistors when accessing the RRAM cells in the far region is approximately 3× that of raising a wordline: raising a wordline requires driving one access transistor per bitline, whereas accessing the RRAM cells in the far region requires driving two isolation transistors and one access transistor per bitline. The power overhead for accessing RRAM cells in the collapsed modes ‘01’ and ‘10’ is approximately 2× (one isolation and one access transistor) [56, 79, 83]. The energy numbers reported in Section 7.1 incorporate these overheads.
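The per-mode drive costs quoted above can be captured in a tiny helper. This is a back-of-the-envelope accounting model, not the circuit simulation used in the evaluation; the function name is our own.

```python
def drive_cost(config):
    """Relative gate-drive cost per bitline for each operating mode.

    Cost = one access transistor plus however many isolation transistors
    must also be driven to reach the addressed region: none in fully
    collapsed mode '00', one in '01'/'10', and two when crossing into
    the far region in expanded mode '11'.
    """
    iso_transistors = {'00': 0, '01': 1, '10': 1, '11': 2}[config]
    access_transistors = 1
    return access_transistors + iso_transistors
```

This reproduces the 1×/2×/3× overheads relative to raising a wordline in the baseline (non-partitioned) crossbar.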
6.3 Evaluated Workloads
We select 10 ML inference programs that are representative of the three most commonly used neural network classes: convolutional neural network (CNN), multi-layer perceptron (MLP), and recurrent neural network (RNN). Table 3 summarizes the topology, number of neurons and synapses, number of spikes per image, and baseline quality of these applications on hardware.
| Class | Application | Dataset | Neurons | Synapses | Avg. Spikes/Frame | Baseline Quality | Obtained Quality |
|---|---|---|---|---|---|---|---|
| CNN | LeNet | CIFAR-10 | 80,271 | 275,110 | 724,565 | 86.3% | 87.1% |
| CNN | AlexNet | CIFAR-10 | 127,894 | 3,873,222 | 7,055,109 | 66.4% | 66.9% |
| CNN | ResNet | CIFAR-10 | 266,799 | 5,391,616 | 7,339,322 | 57.4% | 58.0% |
| CNN | DenseNet | CIFAR-10 | 365,200 | 11,198,470 | 1,250,976 | 46.3% | 46.5% |
| CNN | VGG | CIFAR-10 | 448,484 | 22,215,209 | 12,826,673 | 81.4% | 81.6% |
| CNN | HeartClass [24] | PhysioNet | 170,292 | 1,049,249 | 2,771,634 | 63.7% | 63.9% |
| MLP | MLPDigit | MNIST | 894 | 79,400 | 26,563 | 91.6% | 96.4% |
| MLP | EdgeDet [21] | CARLsim | 7,268 | 114,057 | 248,603 | SSIM = 0.89 | SSIM = 0.99 |
| MLP | ImgSmooth [21] | CARLsim | 5,120 | 9,025 | 174,872 | PSNR = 19 | PSNR = 22.2 |
| RNN | RNNDigit [30] | MNIST | 1,191 | 11,442 | 30,508 | 83.6% | 83.7% |
Table 3. Applications Used to Evaluate the Proposed Approach
6.4 Evaluated Approaches
We evaluate the following techniques:
- Baseline [7]: The Baseline approach first clusters an ML inference model to minimize inter-cluster spike communication. Clusters are then mapped to neuromorphic PEs of the hardware, with synapses of each cluster implemented on memory cells of a crossbar without accounting for latency variation. Neuromorphic PEs are not optimized to reduce latency variation; that is, any resistance state (LRS or HRS) can be programmed on any current path (long or short). Unused crossbars are power-gated to reduce energy consumption. This is the coarse-grained power management technique implemented in many state-of-the-art many-core neuromorphic designs such as Loihi [28], DYNAPs [66], and \(\mu\)Brain [93].
- Baseline + Design Changes: This is the Baseline mapping approach implemented on the proposed latency-optimized partitioned neuromorphic PE design. In the proposed design, the HRS, which takes a long time to sense, is used only on shorter current paths, i.e., those with lower parasitic delays; similarly, the LRS is used only on longer current paths. In addition to coarse-grained power management, the proposed design facilitates power gating at a finer granularity: by controlling the isolation transistors, we power-gate unused resources within each crossbar.
- Proposed: This is the proposed solution, where the system software is optimized to exploit the design changes.
7 RESULTS AND DISCUSSIONS
7.1 Energy Efficiency
Figure 16 plots the energy efficiency of the evaluated techniques normalized to Baseline. We make the following two key observations.
Fig. 16. Energy consumption normalized to Baseline.
First, with the proposed design changes, energy reduces by only 7% compared to Baseline. This is because both in Baseline and Baseline with the proposed design changes, synapses of a cluster are implemented randomly on NVM cells of a crossbar, causing them to be distributed across the crossbar dimension. Therefore, there remains a limited scope to collapse the crossbar and use power gating to save energy. Second, the proposed design-technology co-optimization approach has the lowest energy (22% lower than Baseline and 16% lower than Baseline with the proposed design changes). This improvement is due to the proposed system software, which exploits the design changes in implementing ML inference on neuromorphic PEs. In particular, synapses are implemented to maximize the utilization of the collapsed region in each crossbar of the hardware. If all of a cluster’s synapses fit into the collapsed region, then the far region can be isolated from the collapsed region using isolation transistors and power-gated to save energy.
7.2 Latency Variation
Figure 17 plots the latency variation normalized to Baseline. We make the following three key observations.
Fig. 17. Latency variation normalized to Baseline.
First, with the proposed design changes, latency variation increases compared to Baseline by an average of 1%. This is because of the increase in latency associated with the delay of isolation transistors on current paths. Second, the latency variation using the proposed approach is 30% lower than Baseline and 32% lower than Baseline with the proposed design changes. The reason for these improvements is threefold: (1) optimizing NVM resistance states in a crossbar such that the state that takes the longest time to sense is programmed on current paths that have the least propagation delay; (2) isolating the collapsed region of a crossbar from the far region to reduce current propagation delay; and (3) exploiting these changes during the implementation of an ML inference using the proposed system software, which uses the far region of a crossbar only when it is absolutely necessary to do so. Otherwise, it improves both latency and energy by operating the crossbar in the collapsed mode.
Finally, the latency variation using the proposed approach varies across different applications. This is because the proposed approach exploits the latency and energy tradeoffs differently for different applications. The latency variation is similar to the Baseline for ResNet, whereas it is significantly lower than the Baseline for HeartClass.
Using the results from Sections 7.1 and 7.2, we conclude that the proposed approach introduces maximum gain for applications where the latency and energy tradeoffs can be better exploited. For all other applications, it either minimizes energy or minimizes latency variation.
7.3 Real-Time Performance
One of the key hardware performance metrics for neuromorphic computing is real-time performance, which is a function of the crossbar latency. To evaluate real-time performance, Figure 18 plots the crossbar latency of the proposed approach and the Baseline for the evaluated applications. Results are normalized to the Baseline.
Fig. 18. Crossbar latency normalized to Baseline.
We observe that the crossbar latency using the proposed approach is on average 4.5% lower than the Baseline. This reduction is because the proposed approach places synapses with the HRS on shorter current paths, which lowers the overall spike latency on those synapses, as elaborated in Section 3.2.
7.4 Inference Quality
Figure 19 shows the improvement in inference quality using the proposed approach, normalized to Baseline. We observe that the inference quality improves by an average of 4%. This is due to the reduction in ISI distortion caused by the reduction of latency variation in neuromorphic PEs using the proposed changes, which we analyzed in Section 7.2. In addition, the improvement in inference quality for EdgeDet and ImgSmooth, measured with the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) metrics, is higher than for the other inference tasks, which use accuracy metrics. This is because the PSNR and SSIM metrics are computed on individual images, where we see a large improvement in quality. For accuracy-based tasks, we observe that the feature representation in the hidden layers of these models changes due to ISI distortion, but not all such changes lead to misclassification. Thus, the accuracy of these inference tasks is comparable to Baseline.
Fig. 19. Inference quality normalized to Baseline.
7.5 Single vs. Double Control Design
Figure 20 plots the energy efficiency of the proposed design with a single control signal and the default design, which uses two control signals per PE. We observe that with a single control signal, energy reduces by only 2% compared to Baseline. This is because most crossbars are operated in the expanded mode due to the limited scope to collapse the crossbar. Our default design leads to 14.4% lower energy than the single-control design. This is because in the default design, a crossbar can be collapsed along the X- and Y-dimensions independently, leading to three collapsed array configurations. Therefore, the system software has a higher probability of using the collapsed mode, leading to a reduction in energy.
Fig. 20. Partitioned PE architecture with single and double control.
7.6 Die Area Analysis
Adding an isolation transistor to a bitline increases the height of the crossbar, whereas adding one to a wordline increases its width. Without the isolation transistors, the height of a baseline crossbar is equal to the sum of the heights of the memory cells and the sense amplifier, whereas the width is equal to the sum of the widths of the memory cells. For RRAM-based neuromorphic PEs, a sense amplifier in the peripheral circuit and an isolation transistor are approximately 384× and 9.6× taller than an individual RRAM cell, respectively [17, 64, 96]. In terms of width, an isolation transistor is only 1.3× wider than an RRAM cell. Therefore, for a crossbar with 128 RRAM cells per bitline and wordline (i.e., a \(128 \times 128\) array), the overhead along the height of the crossbar is \(\frac{9.6}{384 + 128} \approx 1.88\%\), and the overhead along the width is \(\frac{1.3}{128} \approx 1.02\%\).
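The overhead arithmetic above can be reproduced in a few lines, with all dimensions expressed in units of one RRAM cell (the function name and parameterization are our own illustration of the calculation):

```python
def crossbar_area_overhead(n=128, sa_h=384.0, iso_h=9.6, iso_w=1.3):
    """Area overhead of one row/column of isolation transistors.

    n      -- RRAM cells per bitline and wordline
    sa_h   -- sense-amplifier height, relative to one RRAM cell
    iso_h  -- isolation-transistor height, relative to one RRAM cell
    iso_w  -- isolation-transistor width, relative to one RRAM cell
    """
    height_overhead = iso_h / (sa_h + n)  # extra row along the bitlines
    width_overhead = iso_w / n            # extra column along the wordlines
    return height_overhead, width_overhead

h, w = crossbar_area_overhead()
# h = 0.01875 (about 1.9%), w ~ 0.0102 (about 1.0%)
```

The same function also makes the scaling behavior explicit: the relative overhead shrinks as the array dimension n grows, since the fixed-size isolation transistors are amortized over more cells.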
8 CONCLUSION
We present a design-technology co-optimization approach to implement energy-efficient ML inference on NVM-based neuromorphic PEs. First, we optimize the NVM resistance state such that the state that takes the longest time to sense is placed on current paths with fewer parasitics and hence incurs lower propagation delay, and vice versa. Second, we use isolation transistors to partition a PE into collapsed and far regions such that the NVM cells of the far region can be opportunistically power-gated to save both energy and latency. Finally, we use the system software to exploit the design changes, maximizing the utilization of the collapsed region of each PE in the hardware. Our system software uses the far region only when it is absolutely necessary to do so; otherwise, it improves both latency and energy by operating the PE in the collapsed mode. We evaluate our design-technology co-optimization approach for a state-of-the-art neuromorphic architecture. Evaluations with different ML inference tasks show that the proposed approach improves both latency and energy without incurring significant cost-per-bit.
APPENDICES
A SPIKING NEURAL NETWORKS
SNNs enable powerful computations due to their spatio-temporal information encoding capabilities [63]. An SNN consists of neurons, which are connected via synapses. A neuron can be implemented using integrate-and-fire (IF) logic, as illustrated in Figure 21 (left). Here, an input current \(U(t)\) (i.e., a spike from a pre-synaptic neuron) raises the membrane voltage of the neuron. When this voltage crosses a threshold \(V_{th}\), the IF logic emits an output spike, which propagates to its post-synaptic neurons. Figure 21 (middle) illustrates the membrane voltage of the IF neuron due to an input spike train. The moments of threshold crossing are illustrated in Figure 21 (right); these are the firing times of the output spike train of the neuron.
Fig. 21. A leaky IF neuron with current input \(U(t)\) (left). The membrane potential over time of the neuron (middle). The spike output of the neuron representing its firing time (right).
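The leaky IF behavior in Figure 21 can be sketched in a few lines. The threshold, leak, and time-step values here are illustrative placeholders, not parameters of the cited circuit [48]:

```python
def simulate_if(input_current, v_th=1.0, leak=0.05, dt=1.0):
    """Integrate a discretized input current U(t); return firing times.

    At each step the membrane voltage leaks toward zero, then integrates
    the input. Crossing v_th emits a spike and resets the voltage.
    """
    v, spikes = 0.0, []
    for step, u in enumerate(input_current):
        v = max(0.0, v - leak * dt) + u * dt  # leak, then integrate
        if v >= v_th:
            spikes.append(step * dt)          # record the firing time
            v = 0.0                           # reset after the spike
    return spikes
```

For example, a constant sub-threshold input accumulates over several steps before the first spike, mirroring the membrane-potential trace in Figure 21 (middle).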
SNNs can implement many ML approaches such as supervised learning, unsupervised learning, reinforcement learning, and lifelong learning. We focus on supervised ML, where an SNN is pre-trained with representative data. ML inference refers to feeding live data points to this trained SNN to generate the corresponding output.
B QUALITY OF INFERENCE
The quality of ML inference can be expressed in terms of accuracy [4], mean square error [26], PSNR [21], and SSIM [44]. Although accuracy is commonly used for assessing the quality of supervised learning (e.g., using CNNs), there are also applications such as edge detection, where the quality is assessed using other metrics such as SSIM. In our prior work [7], we showed that these quality metrics are a function of the inter-spike interval (ISI) between neurons. Therefore, any deviation of the ISI (called ISI distortion) from its trained value may lead to quality loss. To define the ISI, let \(\lbrace t_1, t_2, \ldots , t_{K}\rbrace\) denote a neuron’s firing times in the time interval \([0,T]\). The average ISI of this spike train is (10) \(\begin{equation} \mathcal {I} = \sum _{i=2}^K (t_i - t_{i-1})/(K-1). \end{equation}\)
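Equation (10) translates directly into code: the average ISI is the mean gap between consecutive firing times.

```python
def average_isi(firing_times):
    """Average inter-spike interval per Equation (10).

    firing_times -- sorted spike times t_1 ... t_K of one neuron.
    """
    k = len(firing_times)
    if k < 2:
        raise ValueError("need at least two spikes to compute an ISI")
    return sum(firing_times[i] - firing_times[i - 1]
               for i in range(1, k)) / (k - 1)
```

Note that the sum telescopes, so the average ISI equals \((t_K - t_1)/(K-1)\); the explicit per-interval form above is kept because ISI distortion is defined on the individual intervals.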
To illustrate how a change in ISI, called ISI distortion, impacts inference quality, we use a small SNN in which three input neurons are connected to an output neuron. Figure 22 illustrates the impact of ISI distortion on the output spike. In the top part of the figure, a spike is generated at the output neuron at 22\(\mu\)s due to spikes from the input neurons. In the bottom part of the figure, the second spike from input 3 is delayed (i.e., it has an ISI distortion). Due to this distortion, there is no output spike generated. Missing spikes can impact inference quality, as spikes encode information in SNNs.
Fig. 22. Impact of ISI distortion on accuracy [3]. Top: A scenario where an output spike is generated based on the spikes received from the three input neurons. Bottom: A scenario where the second spike from neuron 3 is delayed. There are no output spikes generated.
Figure 23 shows the impact of ISI distortion on the quality of image smoothing implemented using an SNN [21]. Figure 23(a) shows the input image, which is fed to the SNN. Figure 23(b) shows the output of the image smoothing application with no ISI distortion. PSNR of the output with reference to the input is 20. Figure 23(c) shows the output with ISI distortion due to variation in latency within neuromorphic PEs of the hardware. PSNR of this output with respect to the input is 19. A reduction in PSNR indicates that the output image quality with ISI distortion is lower than the one without distortion. In fact, image quality deteriorates with an increase in ISI distortion. We use ISI distortion as a measure of the quality of ML inference [7]. Our aim is to improve this inference quality via technological and architectural enhancements that reduce ISI distortion when the inference task is implemented on neuromorphic PEs of hardware.
Fig. 23. Impact of ISI distortion on image smoothing.
C HARDWARE IMPLEMENTATION OF ML INFERENCE
Most neuromorphic hardware platforms are implemented as tile-based architectures [16, 28, 29, 37, 72, 93], where the tiles are interconnected via a shared interconnect such as a network-on-chip [62] or a segmented bus [12]. Figure 24 illustrates a tile-based neuromorphic hardware platform, where the tiles can communicate concurrently. Each tile includes (1) a neuromorphic PE, which consists of neuron and synapse circuitries, and (2) a network interface, which encodes spikes into AER (address event representation) packets and communicates these packets to the switch for routing to their destination tiles. A common design practice is to use analog crossbars to implement a neuromorphic PE [2, 7, 45, 52, 55, 58, 61, 100]. Within a crossbar, a pre-synaptic neuron circuit acts as a current driver and is placed on a wordline, whereas a post-synaptic neuron circuit acts as a current sink and is placed on a bitline, as illustrated in Figure 1 (left).
Since a crossbar can accommodate only a limited number of neurons and synapses, an ML model is first partitioned into clusters, where each cluster can be implemented on a crossbar of the hardware. Partitioned clusters are then mapped to different crossbars when admitting the model to the hardware platform. To this end, several heuristic approaches are proposed in the literature. PSOPART [27] minimizes spike latency on the shared interconnect, SpiNeMap [7] minimizes interconnect energy, DFSynthesizer [76] maximizes throughput, DecomposedSNN [10] maximizes crossbar utilization, EaNC [90] minimizes overall energy of an ML task by targeting both computation and communication energy, TaNC [89] minimizes the average temperature of each crossbar, eSpine [91] maximizes NVM endurance in a crossbar, RENEU [80] minimizes the circuit aging in a crossbar’s peripheral circuits, and NCil [86] reduces read disturb issues in a crossbar, improving the inference lifetime. Besides these techniques, there are other software frameworks [1, 5, 6, 9, 11, 23, 25, 38, 47, 50, 54, 60, 71, 75, 77, 78, 85, 88] and runtime approaches [8, 84] addressing one or more of these optimization objectives.
We investigate the internal architecture of a crossbar and find that parasitic components introduce delay in propagating current from a pre-synaptic neuron to a post-synaptic neuron, as illustrated in Figure 1 (right). This delay depends on the specific current path used in the mapping: the higher the number of parasitic components on a current path, the larger its propagation delay. Parasitic components on bitlines and wordlines are a major source of latency at scaled process technology nodes, and they create significant latency variation in a crossbar. Specifically, the latency of a synaptic connection in an SNN depends on the specific memory cell in the crossbar used to implement it. Such latency variation can introduce ISI distortion (see Appendix B), which may impact the quality of an inference task.
D NVM TECHNOLOGY
RRAM technology presents an attractive option for implementing the memory cells of a crossbar due to its demonstrated potential for low-power multi-level operation and high integration density [64]. An RRAM cell is composed of an insulating film sandwiched between conducting electrodes, forming a metal-insulator-metal (MIM) structure (Figure 25). Recently, conducting filament-based metal-oxide RRAM implemented with transition-metal oxides such as HfO\({}_2\), ZrO\({}_2\), and TiO\({}_2\) has received considerable attention due to its low-power operation and CMOS-compatible scaling.
Fig. 25. Operation of an RRAM cell with the \(\text{HfO}_2\) layer sandwiched between the metals Ti (top electrode) and TiN (bottom electrode). The right side shows the formation of LRS/SET. The left side shows HRS/RESET.
Synaptic weights are represented as the conductance of the insulating layer within each RRAM cell. To program an RRAM cell, elevated voltages are applied at the top and bottom electrodes, which rearranges the atomic structure of the insulating layer. Figure 25 shows the HRS and LRS of an RRAM cell. An RRAM cell can also be programmed into intermediate low-resistance states, enabling multi-level operation [18].
- [1] 2013. Cognitive computing programming paradigm: A corelet language for composing networks of neurosynaptic cores. In Proceedings of IJCNN.
- [2] 2017. TraNNsformer: Neural network transformation for memristive crossbar based neuromorphic system design. In Proceedings of ICCAD.
- [3] 2020. PyCARL: A PyNN interface for hardware-software co-simulation of spiking neural network. In Proceedings of IJCNN.
- [4] 2018. Power-accuracy trade-offs for heartbeat classification on neural networks hardware. Journal of Low Power Electronics 14, 4 (2018), 508–519.
- [5] 2019. A framework for the analysis of throughput-constraints of SNNs on neuromorphic hardware. In Proceedings of ISVLSI.
- [6] 2020. Compiling spiking neural networks to mitigate neuromorphic hardware constraints. In Proceedings of IGSC Workshops.
- [7] 2020. Mapping spiking neural networks to neuromorphic hardware. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 1 (2020), 76–86.
- [8] 2020. Run-time mapping of spiking neural networks to neuromorphic hardware. Journal of Signal Processing Systems 92, 11 (2020), 1293–1302.
- [9] 2019. A framework to explore workload-specific performance and lifetime trade-offs in neuromorphic computing. IEEE Computer Architecture Letters 18, 2 (2019), 149–152.
- [10] 2020. Enabling resource-aware mapping of spiking neural networks via spatial decomposition. IEEE Embedded Systems Letters 13, 3 (2020), 142–145.
- [11] 2021. NeuroXplorer 1.0: An extensible framework for architectural exploration with spiking neural networks. In Proceedings of ICONS.
- [12] 2019. Exploration of segmented bus as scalable global interconnect for neuromorphic computing. In Proceedings of GLSVLSI.
- [13] 2014. Nengo: A Python tool for building large-scale functional brain models. Frontiers in Neuroinformatics 7 (2014), 48.
- [14] 2019. Is my neural network neuromorphic? Taxonomy, recent trends and future directions in neuromorphic engineering. In Proceedings of ACSSC.
- [15] 2017. Neuromorphic computing using non-volatile memory. Advances in Physics: X 2, 1 (2017), 89–124.
- [16] 2018. Very large-scale neuromorphic systems for biological signal processing. In CMOS Circuits for Biological Sensing and Processing, Srinjoy Mitra and David R. S. Cumming (Eds.). Springer, 315–340.
- [17] 2016. Design tradeoffs of vertical RRAM-based 3-D cross-point array. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 24, 12 (2016), 3460–3467.
- [18] 2015. Compact modeling of RRAM devices and its applications in 1T1R and 1S1R array design. IEEE Transactions on Electron Devices 62, 12 (2015), 4022–4028.
- [19] 2020. ReRAM: History, status, and future. IEEE Transactions on Electron Devices 67, 4 (2020), 1420–1433.
- [20] 2010. Impact of resistance drift on multilevel PCM design. In Proceedings of ICDT.
- [21] 2018. CARLsim 4: An open source library for large scale, biologically detailed spiking neural network simulation using heterogeneous clusters. In Proceedings of IJCNN.
- [22] 2021. 2021 roadmap on neuromorphic computing and engineering. arXiv preprint arXiv:2105.05956 (2021).
- [23] 2021. Automated generation of integrated digital and spiking neuromorphic machine learning accelerators. In Proceedings of ICCAD.
- [24] 2018. Heartbeat classification in wearables using multi-layer perceptron and time-frequency joint distribution of ECG. In Proceedings of CHASE.
- [25] 2018. Dataflow-based mapping of spiking neural networks on neuromorphic hardware. In Proceedings of GLSVLSI.
- [26] 2018. Unsupervised heart-rate estimation in wearables with liquid states and a probabilistic readout. Neural Networks 99 (2018), 134–147.
- [27] 2018. Mapping of local and global synapses on spiking neuromorphic hardware. In Proceedings of DATE.
- [28] 2018. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 1 (2018), 82–99.
- [29] 2019. TrueNorth: Accelerating from zero to 64 million neurons in 10 years. Computer 52, 5 (2019), 20–29.
- [30] 2015. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in Computational Neuroscience 9 (2015), 99.
- [31] 2021. OxRRAM-based analog in-memory computing for deep neural network inference: A conductance variability study. IEEE Transactions on Electron Devices 68, 5 (2021), 2301–2305.
- [32] 2020. 3D memristor crossbar architecture for a multicore neuromorphic system. In Proceedings of IJCNN.
- [33] 2017. Modeling and analysis of passive switching crossbar arrays. IEEE Transactions on Circuits and Systems I: Regular Papers 65, 1 (2017), 270–282.
- [34] 2018. Overcoming crossbar nonidealities in binary neural networks through learning. In Proceedings of NANOARCH.
- [35] 2020. IR-QNN framework: An IR drop-aware offline training of quantized crossbar arrays. IEEE Access 8 (2020), 228392–228408.
- [36] 2019. Effect of asymmetric nonlinearity dynamics in RRAMs on spiking neural network performance. In Proceedings of ACSSC.
- [37] 2020. Bottom-Up and Top-Down Neuromorphic Processor Design: Unveiling Roads to Embedded Cognition. Ph.D. Dissertation. UCL-Université Catholique de Louvain.
- [38] 2012. A hierarchical configuration system for a massively parallel neural hardware platform. In Proceedings of CF.
- [39] 2009. The Brian simulator. Frontiers in Neuroscience 3, 2 (2009), 192–197.
- [40] 2020. HFNet: A CNN architecture co-designed for neuromorphic hardware with a crossbar array of synapses. Frontiers in Neuroscience 14 (2020), 907.
- [41] 2017. Deep Learning with Keras. Packt Publishing.
- [42] 2020. Towards state-aware computation in ReRAM neural networks. In Proceedings of DAC.
- [43] 1997. The NEURON simulation environment. Neural Computation 9, 6 (1997), 1179–1209.
- [44] 2010. Image quality metrics: PSNR vs. SSIM. In Proceedings of ICPR.
- [45] 2014. Memristor crossbar-based neuromorphic computing system: A case study. IEEE Transactions on Neural Networks and Learning Systems 25, 10 (2014), 1864–1878.
- [46] 2016. Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication. In Proceedings of DAC.
- [47] 2022. Implementing spiking neural networks on neuromorphic architectures: A review. arXiv:2202.08897 (2022).
- [48] 2003. A low-power adaptive integrate-and-fire neuron circuit. In Proceedings of ISCAS.
- [49] 2017. Parasitic effect analysis in memristor-array-based neuromorphic systems. IEEE Transactions on Nanotechnology 17, 1 (2017), 184–193.
- [50] 2016. NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints. In Proceedings of MICRO.
- [51] 2012. A digital neuromorphic VLSI architecture with memristor crossbar synaptic array for machine learning. In Proceedings of SOCC.
- [52] 2015. A reconfigurable digital neuromorphic processor with memristive synaptic crossbar for cognitive computing. ACM Journal on Emerging Technologies 11, 4 (2015), Article 38, 25 pages.
- [53] . 2019. Memristive non-idealities: Is there any practical implications for designing neural network chips? In Proceedings of ISCAS.Google Scholar
Cross Ref
- [54] . 2021. Special session: Reliability analysis for ML/AI hardware. In Proceedings of VTS.Google Scholar
Cross Ref
- [55] . 2021. Memristor crossbar circuits for neuromorphic pattern recognition. In Proceedings of ISOCC.Google Scholar
Cross Ref
- [56] . 2013. Tiered-latency DRAM: A low latency and low cost DRAM architecture. In Proceedings of HPCA.Google Scholar
- [57] . 2017. Sneak-path based test and diagnosis for 1R RRAM crossbar using voltage bias technique. In Proceedings of DAC.Google Scholar
Digital Library
- [58] . 2021. Hardware implementation of neuromorphic computing using large-scale memristor crossbar arrays. Advanced Intelligent Systems 3, 1 (2021), Article 2000137, 26 pages.Google Scholar
Cross Ref
- [59] . 2021. Multibit ferroelectric FET based on nonidentical double \(HfZrO_2\) for high-density nonvolatile memory. IEEE Electron Device Letters 42, 4 (2021), 617–620.Google Scholar
Cross Ref
- [60] . 2018. Mapping spiking neural networks onto a manycore neuromorphic architecture. In Proceedings of PLDI.Google Scholar
Digital Library
- [61] . 2015. A spiking neuromorphic design with resistive crossbar. In Proceedings of DAC.Google Scholar
Digital Library
- [62] . 2018. Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems. In Proceedings of ASP-DAC.Google Scholar
Digital Library
- [63] . 1997. Networks of spiking neurons: The third generation of neural network models. Neural Networks 10, 9 (1997), 1659–1671.Google Scholar
Cross Ref
- [64] . 2017. Design-technology co-optimization for OxRRAM-based synaptic processing unit. In Proceedings of VLSIT.Google Scholar
Cross Ref
- [65] . 1990. Neuromorphic electronic systems. Proceedings of the IEEE 78, 10 (1990), 1629–1636.Google Scholar
Cross Ref
- [66] . 2018. A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs). IEEE Transactions on Biomedical Circuits and Systems 12, 1 (2018), 106–122.Google Scholar
Cross Ref
- [67] . 2013. Memory scaling: A systems architecture perspective. In Proceedings of IMW.Google Scholar
Cross Ref
- [68] . 2015. Research problems and opportunities in memory systems. Supercomputing Frontiers and Innovations 1, 3 (2015), 19–55.Google Scholar
- [69] . 2014. Spintronic threshold logic array (STLA)–A compact, low leakage, non-volatile gate array architecture. Journal of Parallel and Distributed Computing 74, 6 (2014), 2452–2460.Google Scholar
Cross Ref
- [70] . 2021. Design technology co-optimization for neuromorphic computing. In Proceedings of IGSC Workshops.Google Scholar
Cross Ref
- [71] . 2022. On the mitigation of read disturbances in neuromorphic inference hardware. arXiv:2201.11527 (2022).Google Scholar
- [72] . 2019. Low-power neuromorphic hardware for signal processing applications: A review of architectural and system-level design approaches. IEEE Signal Processing Magazine 36, 6 (2019), 97–110.Google Scholar
Cross Ref
- [73] . 2020. Design exploration of sensing techniques in 2T-2R resistive ternary CAMs. IEEE Transactions on Circuits and Systems II: Express Briefs 68, 2 (2020), 762–766.Google Scholar
Cross Ref
- [74] . 2020. Impact of read disturb on multilevel RRAM based inference engine: Experiments and model prediction. In Proceedings of IRPS.Google Scholar
Digital Library
- [75] . 2020. Compiling spiking neural networks to neuromorphic hardware. In Proceedings of LCTES.Google Scholar
Digital Library
- [76] . 2021. DFSynthesizer: Dataflow-based synthesis of spiking neural networks to neuromorphic hardware. arXiv:2108.02023 (2021).Google Scholar
- [77] . 2020. A case for lifetime reliability-aware neuromorphic computing. In Proceedings of MWSCAS.Google Scholar
Cross Ref
- [78] . 2020. Design methodologies for reliable and energy-efficient PCM systems. In Proceedings of IGSC Workshops.Google Scholar
Cross Ref
- [79] . 2020. Exploiting inter- and intra-memory asymmetries for data mapping in hybrid tiered-memories. In Proceedings of ISMM.Google Scholar
Digital Library
- [80] . 2020. Improving dependability of neuromorphic computing with non-volatile memory. In Proceedings of EDCC.Google Scholar
Cross Ref
- [81] . 2019. Enabling and exploiting partition-level parallelism (PALP) in phase change memories. ACM Transactions on Embedded Computing Systems 18, 5s (2019), 1–25.Google Scholar
Digital Library
- [82] . 2020. Improving phase change memory performance with data content aware access. In Proceedings of ISMM.Google Scholar
Digital Library
- [83] . 2021. Aging-aware request scheduling for non-volatile main memory. In Proceedings of ASP-DAC.Google Scholar
Digital Library
- [84] . 2021. Dynamic reliability management in neuromorphic computing. ACM Journal on Emerging Technologies in Computing Systems 17, 4 (2021), Article 63, 27 pages.Google Scholar
Digital Library
- [85] . 2021. A design flow for mapping spiking neural networks to many-core neuromorphic hardware. In Proceedings of ICCAD.Google Scholar
Digital Library
- [86] . 2021. Improving inference lifetime of neuromorphic systems via intelligent synapse mapping. In Proceedings of ASAP.Google Scholar
Cross Ref
- [87] . 2021. Analysis of parasitics on CMOS based memristor crossbar array for neuromorphic systems. In Proceedings of MWSCAS.Google Scholar
Cross Ref
- [88] . 2020. Reliability-performance trade-offs in neuromorphic computing. In Proceedings of IGSC Workshops.Google Scholar
Cross Ref
- [89] . 2020. Thermal-aware compilation of spiking neural networks to neuromorphic hardware. In Proceedings of LCPC.Google Scholar
- [90] . 2021. On the role of system software in energy management of neuromorphic computing. In Proceedings of CF.Google Scholar
Digital Library
- [91] . 2021. Endurance-aware mapping of spiking neural networks to neuromorphic hardware. IEEE Transactions on Parallel and Distributed Systems 33, 2 (2021), 288–301.Google Scholar
Digital Library
- [92] . 2020. RRAM-VAC: A variability-aware controller for RRAM-based memory architectures. In Proceedings of ASP-DAC.Google Scholar
Digital Library
- [93] . 2022. Design of many-core big little \(\mu\)Brains for energy-efficient embedded neuromorphic computing. In Proceedings of DATE.Google Scholar
- [94] . 2020. NCPower: Power modelling for NVM-based neuromorphic chip. In Proceedings of ICONS.Google Scholar
Digital Library
- [95] . 2018. An all-memristor deep spiking neural computing system: A step toward realizing the low-power stochastic brain. IEEE Transactions on Emerging Topics in Computational Intelligence 2, 5 (2018), 345–358.Google Scholar
Cross Ref
- [96] . 2011. Design implications of memristor-based RRAM cross-point structures. In Proceedings of DATE.Google Scholar
- [97] . 2019. 24.1 a 1Mb multibit ReRAM computing-in-memory macro with 14.6 ns parallel MAC computing time for CNN based AI edge processors. In Proceedings of ISSCC.Google Scholar
- [98] . 2019. Evolving energy efficient convolutional neural networks. In Proceedings of Big Data.Google Scholar
Cross Ref
- [99] . 2014. Design guidelines for 3D RRAM cross-point architecture. In Proceedings of ISCAS.Google Scholar
Cross Ref
- [100] . 2018. Neuromorphic computing with memristor crossbar. Physica Status Solidi (a) 215, 13 (2018), Article 1700875.Google Scholar
Cross Ref
- [101] . 2007. Predictive technology model for nano-CMOS design exploration. ACM Journal on Emerging Technologies in Computing Systems3, 1 (2007), 1–es.Google Scholar
- [102] . 2018. Mixed size crossbar based RRAM CNN accelerator with overlapped mapping method. In Proceedings of ICCAD.Google Scholar
Digital Library