research-article | Public Access

Design-Technology Co-Optimization for NVM-Based Neuromorphic Processing Elements

Published: 12 December 2022


Abstract

An emerging use case of machine learning (ML) is to train a model on a high-performance system and deploy the trained model on energy-constrained embedded systems. Neuromorphic hardware platforms, which operate on principles of the biological brain, can significantly lower the energy overhead of an ML inference task, making these platforms an attractive solution for embedded ML systems. We present a design-technology tradeoff analysis to implement such inference tasks on the processing elements (PEs) of a non-volatile memory (NVM)-based neuromorphic hardware. Through detailed circuit-level simulations at scaled process technology nodes, we show the negative impact of technology scaling on the information-processing latency, which impacts the quality of service of an embedded ML system. At a finer granularity, the latency inside a PE depends on (1) the delay introduced by parasitic components on its current paths, and (2) the varying delay to sense different resistance states of its NVM cells. Based on these two observations, we make the following three contributions. First, on the technology front, we propose an optimization scheme where the NVM resistance state that takes the longest time to sense is set on current paths having the least delay, and vice versa, reducing the average PE latency, which improves the quality of service. Second, on the architecture front, we introduce isolation transistors within each PE to partition it into regions that can be individually power-gated, reducing both latency and energy. Finally, on the system-software front, we propose a mechanism to leverage the proposed technological and architectural enhancements when implementing an ML inference task on neuromorphic PEs of the hardware. Evaluations with a recent neuromorphic hardware architecture show that our proposed design-technology co-optimization approach improves both performance and energy efficiency of ML inference tasks without incurring high cost-per-bit.


1 INTRODUCTION

Neuromorphic computing systems are integrated circuits that implement the architecture of the central nervous system in primates [14, 22, 65]. These systems facilitate energy-efficient computations using spiking neural networks (SNNs) [63] for power-constrained embedded devices. To this end, the design workflow is to train a machine learning (ML) model (commonly on a backend server) and subsequently convert the trained model to spike-based computations and deploy it on the neuromorphic hardware of an embedded system. The quality of inference (e.g., accuracy) is assessed in terms of the inter-spike interval (ISI) (see Appendix B). Therefore, any deviation from its expected value will lead to a degradation of the inference quality.

Typical neuromorphic systems such as Loihi [28], DYNAPs [66], and \(\mu\)Brain [93] consist of processing elements (PEs) that communicate spikes using a shared interconnect. Each PE implements neuron and synapse circuitries. A common technique to implement a neuromorphic PE is by using an analog crossbar where bitlines and wordlines are organized in a grid with memory cells connected at their crosspoints to store synaptic weights [2, 32, 40, 45, 46, 51, 61, 69, 100]. Neuron circuitries are implemented along bitlines and wordlines. Figure 1 (left) shows the architecture of an \(N \times N\) analog crossbar with N bitlines and N wordlines.

Fig. 1.

Fig. 1. An \(N \times N\) crossbar showing the parasitic components within.

We investigate the internal architecture of a crossbar and find that parasitic components introduce delay in propagating current from a pre-synaptic neuron to a post-synaptic neuron as illustrated in Figure 1 (right). This delay depends on the specific current path activated in a crossbar. The higher the number of parasitic components on a current path, the larger is its propagation delay [70, 86, 88, 89, 91]. Parasitic components on bitlines and wordlines are a major source of latency at scaled process technology nodes, and they create significant latency variation in a crossbar [33, 34, 49, 53, 56, 67, 79, 83, 87]. Such variation can introduce ISI distortion (see Appendix B), which may impact the quality of an inference task [7, 27].

To increase the energy efficiency of a neuromorphic system, non-volatile memory (NVM) such as oxide-based random access memory (OxRRAM), phase-change memory, ferroelectric RAM, and spin-based magnetic RAM is used to implement the memory cells in a crossbar [15, 64, 94, 95, 98]. An NVM cell can be programmed to a high-resistance state (HRS) or one of many low-resistance states (LRS), implementing multi-bit synaptic weights [59, 64, 97]. To implement a synaptic weight on a memory cell of a crossbar, the synaptic weight is programmed as the conductance of the cell.

A crossbar can accommodate only a fixed number of pre-synaptic connections per post-synaptic neuron. To give an example, the crossbar in Figure 1 (left) has N pre-synaptic neurons, N post-synaptic neurons, and \(N^2\) memory cells. Each post-synaptic neuron can have a maximum of N pre-synaptic connections. To mitigate the negative impact of technology scaling (e.g., increase in the value of parasitic components on current paths and increase in the power density), N is constrained to a lower value, typically between 128 and 256 (see our tradeoff analysis in Section 2). To map a large SNN model on multi-PE hardware, system software frameworks such as NEUTRAMS [50], NeuroXplorer [11], SentryOS [93], LAVA [60], and DFSynthesizer [76] are commonly used. These frameworks first partition a model into clusters, where a cluster is a subset of neurons and synapses of the model that can be implemented on the architecture of a crossbar. Subsequently, the partitioned clusters are implemented on different crossbars of a neuromorphic hardware.

We make the following two key observations related to a neuromorphic PE:

Observation 1.

The latency within a crossbar is a function of the length (i.e., the number of parasitic components) of current paths and the delay to sense the NVM cell activated on a current path.

Observation 2.

Due to how memory cells are organized in a crossbar, a significant fraction of these memory cells remains unutilized when implementing ML inference tasks.

Based on these two observations (which we elaborate in Sections 2 through 4), we present a design-technology tradeoff analysis to implement ML inference tasks on different PEs of an NVM-based neuromorphic system. We make the following four key contributions:

  • Through detailed circuit-level simulations at scaled process technology nodes, we show that bitline and wordline parasitics are the primary sources of long latency in a crossbar and that they create asymmetry in inference latency. With technology scaling, the absolute latency increases and the latency asymmetry becomes increasingly significant. In addition, different resistance states of a multi-level NVM cell take varying latencies to sense during an inference operation (see Section 2).

  • We propose to optimize the implementation of synaptic weights on NVM cells such that the resistance state that takes the longest time to sense is programmed on the NVM cell that has the least parasitic delay in a crossbar. This lowers the latency of a crossbar (see Section 3).

  • We propose an architectural change of introducing isolation transistors in a crossbar to partition it into regions that can be individually power-gated based on their utilization. In this way, we improve energy efficiency. In addition, by isolating the unutilized region of a crossbar from the active region, parasitics of only the active region contribute to latency rather than both as in a baseline non-partitioned crossbar architecture. This reduces the latency of a crossbar (see Section 4).

  • We show that our technological and architectural optimizations can only deliver on their latency and energy improvement promises if they are exploited efficiently by the system software. Therefore, we propose a mechanism to expose our proposed design changes to the system level, allowing the system software to improve both latency and energy when implementing ML inference tasks on hardware (see Section 5).

We evaluate our design-technology co-optimization approach for a recent neuromorphic hardware using 10 ML inference tasks. Results show 12% reduction in average PE latency and 22% lower application energy compared to current state of the art.

To the best of our knowledge, this is the first work that demonstrates the energy and latency improvement of power gating crossbar-based neuromorphic hardware designs.


2 DESIGN-TECHNOLOGY TRADEOFF ANALYSIS

Without loss of generality, we demonstrate the design-technology tradeoff for an OxRRAM-based neuromorphic PE, where each NVM cell can be programmed to the following four resistance levels (i.e., 2-bit per synapse): 1.5 K\({\Omega }\), 5.78 K\({\Omega }\), 13.6 K\({\Omega }\), and 73 K\({\Omega }\) [19, 20, 31, 64, 74]. Furthermore, we show our analysis for four process technology nodes: 16 nm, 22 nm, 32 nm, and 45 nm, which are obtained from our technology provider. The analysis can be easily extended to other NVM types and also to other process technology nodes.

2.1 Cost-per-Bit Analysis for a Neuromorphic PE

The computer memory industry has thus far been primarily driven by the cost-per-bit metric, which provides the maximum capacity for a given manufacturing cost. As shown in recent works [56, 67, 68, 79, 81, 82, 83], manufacturing cost can be estimated from the area overhead. To estimate the cost-per-bit of a neuromorphic PE, we investigate the internal architecture of a crossbar and find that a neuron circuit can be designed using 20 transistors and a capacitor [48], whereas an NVM cell is a 1T-1R arrangement with a transistor used as an access device for the cell. Within an \(N \times N\) crossbar, there are N pre-synaptic neurons, N post-synaptic neurons, and \(N^2\) synaptic cells. The total area of all neurons and synapses of a crossbar is (1) \(\begin{align} \text{Neuron area} &= 2N(20T + 1C) \nonumber \\ \text{Synapse area} &= N^2(1T + 1R), \end{align}\) where T stands for transistor, C for capacitor, and R for NVM cell. The total synaptic cell capacity is \(N^2\), with each NVM cell implementing 2-bit per synapse. The total number of bits (i.e., synaptic capacity) in the crossbar is (2) \(\begin{equation} \text{Total bits} = 2N^2. \end{equation}\)

Therefore, the cost-per-bit of an \(N \times N\) crossbar is (3) \(\begin{equation} \text{Cost-per-bit} = \frac{2N(20T+1C)+N^2(1T+1R)}{2N^2} \approx \frac{F^2(27+2N)}{N}, \end{equation}\) where the cost-per-bit is represented in terms of the crossbar dimension N and the feature size F. Equation (3) provides a back-of-the-envelope calculation of cost-per-bit. Figure 2 plots the normalized cost-per-bit for four different process technology nodes, with the crossbar dimension ranging from 16 to 256. We make the following two observations.

Fig. 2.

Fig. 2. Cost-per-bit analysis of a crossbar.

First, the cost-per-bit reduces with increase in the dimension of a crossbar—that is, larger-sized crossbars can accommodate more bits for a given cost. However, both the absolute latency and latency variation increase significantly for larger-sized crossbars, which increases inference latency and reduces the quality of ML inference due to an increase in the ISI distortion (see our analysis in Section 2.2). Second, the cost-per-bit reduces considerably with technology scaling. This is due to higher integration density at smaller process technology nodes.
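Both observations follow directly from Equation (3). A minimal sketch in Python (the function name and the choice of expressing cost in units of \(F^2\) are ours, not from the paper):

```python
def cost_per_bit(N: int, F: float) -> float:
    """Approximate cost-per-bit of an N x N crossbar per Equation (3),
    with N the crossbar dimension and F the feature size (e.g., in nm).
    The result is in arbitrary area units (F^2 per bit)."""
    return F**2 * (27 + 2 * N) / N

# Larger crossbars amortize the fixed neuron area over more cells, so the
# cost-per-bit falls toward 2*F^2 as N grows; a smaller F lowers it further.
larger_is_cheaper = cost_per_bit(256, 16) < cost_per_bit(16, 16)
scaling_is_cheaper = cost_per_bit(128, 16) < cost_per_bit(128, 45)
```

As N grows, the \(27/N\) term vanishes and the cost-per-bit approaches \(2F^2\), consistent with both observations above.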

The formulation for the cost-per-bit (Equation (3)) depends on the specific neuron architecture of Indiveri [48] and the one transistor (1T)-based OxRRAM design of Mallik et al. [64]. This formulation can be easily extended to other neuron and synapse designs. Furthermore, system designers can use our formulation to configure their neuromorphic hardware, without having to access and plug in technology-related data.

2.2 Latency Variation in a Neuromorphic PE

Figure 3 shows the difference between the best-case and worst-case latency in a crossbar (expressed as a fraction of the 1-\(\mu\)s spike duration) for five different crossbar configurations at 45-nm, 32-nm, 22-nm, and 16-nm process technology nodes. See our experimental setup using NeuroXplorer [11] in Section 6.1, which incorporates software, architecture, circuit, and technology. All NVM cells are programmed to the HRS—that is, 73 K\(\Omega\) (see Section 2.3 for the dependency on resistance states).

Fig. 3.

Fig. 3. Variance in latency within a crossbar, expressed as a fraction of a single spike duration.

We make two key observations. First, the latency difference increases with crossbar size due to an increase in the number of parasitic components on current paths. The average latency difference for the \(256 \times 256\) crossbar is higher than \(16 \times 16\), \(64 \times 64\), and \(128 \times 128\) crossbars by 16.5×, 13.4×, and 4.5×, respectively. This average is computed across the four process technology nodes. Therefore, smaller-sized crossbars lead to a smaller variation in latency, which is good for performance. However, smaller-sized crossbars also lead to higher cost-per-bit, which we have analyzed in Section 2.1. For most neuromorphic PE designs, \(128 \times 128\) crossbars achieve the best tradeoff in terms of latency variation and cost-per-bit [42, 57, 64, 66, 70, 102].

Second, the latency difference increases significantly for scaled process technology nodes due to an increase in the value of the parasitic component. The average latency difference for 32-nm, 22-nm, and 16-nm process technology nodes is higher than 45 nm by 1.3×, 3×, and 6.6×, respectively. The unit wordline (bitline) parasitic resistance ranges from approximately 2.5\(\Omega\) (1\(\Omega\)) at the 45-nm node to 10\(\Omega\) (3.8\(\Omega\)) at the 16-nm node. The value of these unit parasitic resistances is expected to scale further reaching \(\approx 25\Omega\) at the 5-nm node [33, 35, 36, 73, 92]. The unit wordline and bitline capacitance values also scale proportionately with technology. Latency variation increases ISI distortion, which degrades the quality of ML inference.

2.3 Varying Latency to Sense NVM Resistance States

The latency (i.e., the delay) on a current path from a pre-synaptic neuron to a post-synaptic neuron within a crossbar is proportional to \(R_{eff}\cdot C_{eff}\), where \(R_{eff}\) (\(C_{eff}\)) is the effective resistance (capacitance) on the path. This delay increases the time it takes for the membrane potential of a post-synaptic neuron to rise above the threshold voltage, causing a delay in spike generation.
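A first-order way to see this position dependence is an Elmore-style sum of the unit parasitics along the activated wordline and bitline segments. The sketch below is illustrative, not the paper's circuit model: the function name, the lumped-capacitance treatment, and the 1-fF unit capacitance are our assumptions, with unit resistances borrowed from the 16-nm values quoted in Section 2.2:

```python
def path_delay(row, col, N, r_wl, c_wl, r_bl, c_bl, r_cell):
    """First-order R_eff * C_eff delay estimate (illustrative sketch) for
    the current path entering on wordline `row`, crossing the NVM cell at
    column `col`, and exiting at the post-synaptic neuron at the bottom of
    bitline `col`. r_*/c_* are per-crosspoint unit parasitics; r_cell is
    the programmed NVM resistance."""
    r_eff = (col + 1) * r_wl + r_cell + (N - row) * r_bl  # series resistance
    c_eff = (col + 1) * c_wl + (N - row) * c_bl           # lumped capacitance
    return r_eff * c_eff

# 16-nm unit resistances from Section 2.2; 1 fF per unit is an assumption.
d_far = path_delay(0, 127, 128, 10.0, 1e-15, 3.8, 1e-15, 73e3)
d_near = path_delay(127, 0, 128, 10.0, 1e-15, 3.8, 1e-15, 73e3)
```

The farthest crosspoint (`d_far`) accumulates 128 wordline and 128 bitline segments versus one of each for the nearest (`d_near`), reproducing the latency asymmetry of Section 2.2; raising `r_cell` from an LRS to the HRS value likewise raises the delay, as in Section 2.3.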

The effective resistance on a current path depends on the value of parasitic resistances and the resistance of the NVM cell. We analyze the latency impact due to different resistance states. Figure 4 plots the increase in latency (expressed as a fraction of 1–\(\mu\)s spike duration) to sense three NVM resistance states (LRS2, LRS3, and HRS) with respect to LRS1 at 45-nm, 32-nm, 22-nm, and 16-nm process technology nodes. These results are for a neuromorphic PE with a \(128 \times 128\) crossbar.

Fig. 4.

Fig. 4. Latency to sense various NVM resistance states, expressed as a fraction of a single spike duration.

We observe that the latency to sense the HRS is considerably higher than all three LRS at all process technology nodes (consistent with other works [18, 64, 99]). The latency difference increases with technology scaling due to an increase in the size of parasitic components on bitlines and wordlines of a crossbar, which we analyzed in Section 2.2.


3 PROPOSED TECHNOLOGICAL IMPROVEMENTS

Based on the design-technology tradeoff analysis of Section 2, we now present our technology-related optimization. Without loss of generality, we present our optimization for a 128 × 128 crossbar-based neuromorphic hardware designed at the 16-nm node. We exploit the following two observations from Section 2: (1) the HRS of an NVM cell takes longer to sense than an LRS, and (2) spike propagation latency in a crossbar depends on the number of parasitic components on its current path. The left side of Figure 5 shows the proposed technological changes. A crossbar is partitioned into three regions. The number of parasitic components on current paths in region A is considerably lower than in the rest of the crossbar. Therefore, all NVM cells in this region (four in this example) implement only the HRS, which takes the longest time to sense. Conversely, NVM cells in region B have longer propagation delay due to the higher number of parasitic components. Therefore, all NVM cells in this region (nine in this example) implement only the LRS, which takes the shortest time to sense. Finally, all other NVM cells (i.e., those in region C) are programmable (i.e., these cells can implement all four resistance states). The overall objective is to balance the latency on different current paths within a crossbar. This minimizes the latency variation in a crossbar, which reduces ISI distortion and improves the quality of ML inference tasks.

Fig. 5.

Fig. 5. Our proposed technological change.

The right side of Figure 5 shows a pre- and a post-synaptic neuron connected via a synapse that is programmed to the LRS. The synaptic connection can be implemented on NVM cells in region B (with only LRS1) and region C (with programmable states). The figure illustrates two alternative implementations of these neurons. If the pre-synaptic neuron is implemented on wordline 0, then the post-synaptic neuron cannot be implemented on bitlines 0 and 1. This is because NVM cells in region A are all in HRS. In this example, we show the implementation on bitline 2 (see the blue implementation). Conversely, if the post-synaptic neuron is implemented on bitline 0, then the pre-synaptic neuron cannot be implemented on wordlines 0 and 1 (to avoid using region A). We show the implementation on wordline 2 (see the red implementation).

Formally, the proposed neuromorphic PE is represented by a tuple \({\langle N, N_h, N_l\rangle }\), where N is the dimension of its crossbar. All NVM cells at crosspoints of wordlines \({0, 1, \ldots , N_h-1}\) and bitlines \(0, 1, \ldots , N_h-1\) (i.e., region A) can implement only HRS. All NVM cells at crosspoints of wordlines \({N-N_l,N-N_l+1,\ldots ,N-1}\) and bitlines \(N-N_l ,N-N_l + 1, \ldots , N-1\) (i.e., region B) can implement only LRS. All other NVM cells in the PE’s crossbar can implement all four resistance states.
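The region boundaries of the \(\langle N, N_h, N_l\rangle\) tuple reduce to a simple lookup. A sketch (function and state names are ours; region B is restricted to LRS1 as in the Figure 5 description):

```python
def allowed_states(wl: int, bl: int, N: int, N_h: int, N_l: int) -> set:
    """Resistance states permitted at crosspoint (wl, bl) of the proposed
    <N, N_h, N_l> PE: region A (near corner) is HRS-only, region B (far
    corner) is LRS1-only, and region C is fully programmable."""
    if wl < N_h and bl < N_h:            # region A: wordlines/bitlines 0..N_h-1
        return {"HRS"}
    if wl >= N - N_l and bl >= N - N_l:  # region B: wordlines/bitlines N-N_l..N-1
        return {"LRS1"}
    return {"LRS1", "LRS2", "LRS3", "HRS"}  # region C: all four states
```

For the 128 × 128 crossbar with \(N_h = N_l = 64\) chosen later in this section, crosspoint (0, 0) is HRS-only, (127, 127) is LRS1-only, and a mixed crosspoint such as (0, 127) remains fully programmable.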

Figure 6 plots the variation of latency in the proposed 128 × 128 crossbar, normalized to a baseline architecture [7], where any NVM cell can be programmed to any of the four resistance states. See Section 6.4 for a description of this baseline architecture and Section 6.1 for the simulation setup. The variation in latency is measured as the ratio of the best-case to the worst-case latency in the crossbar. The figure reports latency variation for \(N_h\) ranging from 2 to 64 with \(N_l\) set to 16, 32, and 64.

Fig. 6.

Fig. 6. Latency variation in the proposed crossbar architecture for different settings of \(N_h\) and \(N_l\) .

We observe that latency variation decreases with an increase in \(N_h\). This is due to an increase in the size of region A, which increases the (worst-case) latency due to an increase in the number of parasitic components on current paths via the HRS. However, the (best-case) latency of current paths via the LRS remains the same. Therefore, the latency variation reduces, which improves inference quality by lowering the ISI distortion. To illustrate this concept, Figure 7 provides an example where two synapses are mapped to a 4 × 4 crossbar. In Figure 7(a), the red synapse (in HRS) is mapped to the bottom left corner of the crossbar, whereas the blue synapse (in LRS) is mapped to the top right corner. The figure shows the timing of two spikes. The input spikes on the red and blue synapses arrive at \({t_1}\) and \({t_2}\), respectively. Without loss of generality, let \(t_2 \gt t_1\). The ISI of these two spikes is \(t_2 - t_1\). Due to the delay in current propagation through bitlines and wordlines, these two spikes arrive at the output terminal at different times: the red synapse with a delay of x and the blue synapse with a delay of y. Here, \(y \gt x\). Therefore, the ISI of the output spikes is \((t_2 + y) - (t_1 + x)\). The ISI distortion (difference of ISI between input and output) is (4) \(\begin{equation} \text{ISI distortion} = \Big ((t_2 + y) - (t_1 + x)\Big) - (t_2 - t_1) = y - x. \end{equation}\)

Fig. 7.

Fig. 7. ISI improvement due to increase in the size of region A.

Figure 7(b) illustrates a scenario where region A is increased to include more cells that are programmed to the HRS. The mapping process will map the red synapse using the farthest cell of region A. The delay on this synapse is \(x+ \Delta\), where \(\Delta\) is the additional delay due to routing spikes on the red synapse via a longer route compared to that in Figure 7(a). Therefore, the ISI of the output spikes is \((t_2 + y) - (t_1 + x + \Delta)\). The ISI distortion is (5) \(\begin{equation} \text{ISI distortion} = \Big ((t_2 + y) - (t_1 + x + \Delta)\Big) - (t_2 - t_1) = y - x - \Delta . \end{equation}\)
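Equations (4) and (5) amount to simple spike-time arithmetic and can be checked directly (the function name is ours; `delta` models the extra routing delay from the enlarged region A):

```python
def isi_distortion(t1, t2, x, y, delta=0.0):
    """ISI distortion per Equations (4) and (5): input spikes at t1 < t2,
    output delays x (red/HRS synapse) and y (blue/LRS synapse), plus an
    optional extra routing delay `delta` on the red synapse when region A
    is enlarged as in Figure 7(b)."""
    isi_in = t2 - t1
    isi_out = (t2 + y) - (t1 + x + delta)
    return isi_out - isi_in  # equals y - x - delta
```

With `delta=0` this reproduces Equation (4); any positive `delta` reduces the distortion \(y - x\), which is the mechanism behind the improvement in Figure 7(b).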

Comparing Equations (4) and (5), we observe that the ISI distortion reduces due to an increase in the size of region A. ISI distortion also reduces with an increase in \({N_l}\) due to a reduction in the worst-case latency. We also note that a large \({N_h}\) may lead to higher average crossbar latency, which impacts real-time performance. Finally, we see that going from \({N_l}\) = 16 to 32, there is no significant reduction in the latency variation. Although the size of region B increases with an increase in \({N_l}\), we observe only marginal reduction of the best-case latency. Overall, with \({N_h}\) = \({N_l}\) = 64, the latency variation is 74% lower than baseline. This operating point is chosen based on the tradeoff between latency variation and average latency for a 128 × 128 crossbar at 16 nm. The tradeoff point can change for other technology nodes and for other crossbar configurations.

3.1 Reduction in Latency Variation

To understand the reduction of latency variation within a crossbar as a result of our technological changes, we provide a simple example. Consider that there are only two current paths in a crossbar. The parasitic delay on the shortest and longest current paths are D and \((D+\Delta)\), respectively. The time to sense LRS and HRS NVM states are S and \((S+\delta)\), respectively. Without any optimization, the worst-case condition is triggered when the HRS is programmed on the longest path and the LRS on the shortest path. The minimum and maximum latencies are \((D+S)\) and \((D+S+\Delta +\delta)\), respectively. The latency variation is \((\Delta +\delta)\). Using our technology optimization, the HRS is programmed on the shortest path and the LRS on the longest path. The two latencies are \((D+S+\Delta)\) and \((D+S+\delta)\). The latency variation reduces to \((|\Delta -\delta |)\).
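The two-path argument can be replayed numerically (names are ours):

```python
def latency_spread(D, Delta, S, delta, optimized):
    """Latency spread between two current paths with parasitic delays D
    and D + Delta, and sense times S (LRS) and S + delta (HRS). Without
    optimization, the HRS sits on the longest path; with it, the HRS is
    moved to the shortest path."""
    if not optimized:
        lats = (D + S, D + S + Delta + delta)          # worst-case pairing
    else:
        lats = (D + S + Delta, D + S + delta)          # proposed pairing
    return max(lats) - min(lats)
```

For example, with \(D = 10\), \(\Delta = 4\), \(S = 2\), \(\delta = 3\), the spread drops from \(\Delta + \delta = 7\) to \(|\Delta - \delta| = 1\).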

Within a crossbar, there are many current paths (\(N^2\) current paths in an \(N \times N\) crossbar). The precise reduction in latency variation depends on the specific current paths activated for a synaptic connection, which is controlled during the mapping of an ML application to the crossbars of the hardware. In Figure 6, we show a 74% reduction comparing only the shortest and the longest paths in a \(128\, \times \, 128\) crossbar. In Section 7.2, we evaluate the general case considering the mapping process. We report an average 22% reduction of latency variation.

Reducing the latency variation helps reduce the ISI distortion, which improves the inference quality. In Section 7.4, we report an average 4% increase of inference quality.

3.2 Impact on Latency

Although latency variation impacts inference quality, the average crossbar latency impacts the real-time performance. To understand the impact of our technological optimization on the average crossbar latency, we consider the same example of two current paths. Consider that there are m synapses with the LRS and n synapses with the HRS. The average latency in the worst-case condition is \(\frac{m\cdot (D+S) + n\cdot (D+S+\Delta +\delta)}{m+n}\). Using the technological improvement, the average latency is \(\frac{m\cdot (D+S+\Delta) + n\cdot (D+S+\delta)}{m+n}\). Therefore, the change in latency is \((\frac{n-m}{n+m})\Delta\). This change in latency depends on (1) the current paths activated in a crossbar and (2) the values of n and m—that is, the number of synaptic connections with the HRS and LRS, respectively. In Section 7.3, we show an average 3% reduction of the average crossbar latency for the evaluated applications.
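The average-latency expressions can likewise be checked numerically; the reduction equals \(\frac{n-m}{n+m}\Delta\) (function name ours):

```python
def avg_latency_reduction(m, n, D, S, Delta, delta):
    """Reduction in average latency (baseline worst case minus optimized)
    for m LRS synapses and n HRS synapses on the two-path example."""
    baseline = (m * (D + S) + n * (D + S + Delta + delta)) / (m + n)
    optimized = (m * (D + S + Delta) + n * (D + S + delta)) / (m + n)
    return baseline - optimized  # algebraically ((n - m) / (n + m)) * Delta
```

For instance, with \(m = 2\), \(n = 6\), and \(\Delta = 4\), the reduction is \(\frac{6-2}{8}\cdot 4 = 2\) time units; when \(m = n\), the average latency is unchanged.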


4 ARCHITECTURAL ENHANCEMENTS TO NEUROMORPHIC PE

To understand the motivation of the proposed architectural changes, Figure 8 reports the average synapse utilization of \(128 \times 128\) crossbars in neuromorphic PEs for 10 ML models implemented using the spatial decomposition technique of Balaji et al. [10], which is a best-effort approach to improve the utilization of crossbars in neuromorphic hardware.

Fig. 8.

Fig. 8. Average synapse utilization of neuromorphic PEs.

We observe that the average synapse utilization is only 0.9%. This is because a crossbar can accommodate only a limited number of pre-synaptic connections per post-synaptic neuron. To illustrate this, Figure 9 shows three examples of implementing neurons on a \(4 \times 4\) crossbar. The synapse utilization of the three example scenarios are (a) 25% (4 out of 16), (b) 18.75% (3 out of 16), and (c) 25% (4 out of 16). As the crossbar dimension increases, the utilization drops significantly. For instance, if a 128 × 128 crossbar is used to implement a single 128-input neuron (i.e., generalization of Figure 9(a)), the utilization is only 0.78% (128 utilized synapses out of a total of \(128^2 = \hbox{16,384}\) synapses). Lower synapse utilization leads to lower energy efficiency.

Fig. 9.

Fig. 9. Implementation of one 4-input (a), one 3-input (b), and two 2-input (c) neurons to a \(4 \times 4\) crossbar.
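Utilization in these examples reduces to counting mapped synapses against the \(N^2\) cells (function name ours):

```python
def synapse_utilization(fanins, N: int) -> float:
    """Fraction of an N x N crossbar's N^2 cells that is utilized when
    the mapped post-synaptic neurons have the given fan-ins (number of
    pre-synaptic connections each)."""
    return sum(fanins) / (N * N)
```

`synapse_utilization([128], 128)` reproduces the 0.78% figure for a single 128-input neuron, and the three \(4 \times 4\) scenarios of Figure 9 give 0.25, 0.1875, and 0.25.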

To improve energy efficiency, we propose to partition a neuromorphic PE into regions that can be dynamically power-gated based on their utilization for a given ML inference task. Figure 10 shows the use of isolation transistors in a neuromorphic PE to partition a \(4 \times 4\) crossbar into active and unutilized regions. Figure 10(a) illustrates the implementation of only a single neuron function \(y_1\) in the crossbar. To improve energy efficiency, isolation transistors are needed on every bitline (between wordlines 3 and 4) and on every wordline (between bitlines 1 and 2). Figure 10(b) illustrates the implementation of two neuron functions \(y_1\) and \(y_2\) in the crossbar. In this scenario, isolation transistors are only needed on every wordline (between bitlines 2 and 3). To implement inference on a neuromorphic system, each crossbar may have a different utilization of its memory cells. Therefore, to improve energy efficiency in every crossbar, isolation transistors are needed on every bitline (between every pair of wordlines) and on every wordline (between every pair of bitlines)—a total of 24 isolation transistors for this example \(4 \times 4\) crossbar (in general, \({2N(N-1)}\) for an \(N \times N\) crossbar). This fine-grained partitioned PE architecture offers flexibility in energy management incorporating crossbar utilization but leads to a significant increase in the area, latency, and system overhead to control the isolation transistors.

Fig. 10.

Fig. 10. Proposed neuromorphic PE architecture partitioned using isolation transistors.

To overcome these limitations while improving energy efficiency, we enable a coarse-grained partitioning in a crossbar as illustrated in Figure 10(c). In this example, isolation transistors are inserted selectively on every bitline (between wordlines 3 and 4) and on every wordline (between bitlines 2 and 3). This coarse-grained partitioned PE architecture requires a total of eight isolation transistors (in general, 2N for an \(N \times N\) crossbar). To reduce the control overhead, isolation transistors on wordlines of a crossbar are controlled using a single control signal wl_iso_ctrl and those on bitlines using the signal bl_iso_ctrl. Through these two control signals, we enable four distinct configurations of the crossbar, which are summarized in Table 1.
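The transistor-count tradeoff between the two partitioning schemes is easy to tabulate (function name ours):

```python
def iso_transistor_count(N: int, fine_grained: bool) -> int:
    """Number of isolation transistors to partition an N x N crossbar:
    2N(N-1) for fine-grained cuts between every adjacent wordline/bitline
    pair, versus 2N for the coarse-grained single-cut scheme."""
    return 2 * N * (N - 1) if fine_grained else 2 * N
```

For the \(4 \times 4\) example this gives 24 versus 8 transistors; at \(N = 128\), the gap is 32,512 versus 256, which is why the coarse-grained scheme is the practical choice.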

Table 1.

| Configuration | wl_iso_ctrl | bl_iso_ctrl | Dimension | Energy | Latency (best case; worst case) |
|---|---|---|---|---|---|
| Baseline PE architecture | — | — | \(4\times 4\) | \(\propto\) 4*4 | \(t_{1,1}\); \(t_{4,4}\) |
| Proposed partitioned PE, ‘00’ | 0 | 0 | \(3\times 2\) | \(\propto\) 3*2 | \(t_{1,1}\); \(t_{3,2}-\Delta\) |
| Proposed partitioned PE, ‘01’ | 0 | 1 | \(4\times 2\) | \(\propto\) 4*2 | \(t_{1,1}\); \(t_{4,2}-\Delta + t_{ON}\) |
| Proposed partitioned PE, ‘10’ | 1 | 0 | \(3\times 4\) | \(\propto\) 3*4 | \(t_{1,1}\); \(t_{3,4}-\Delta + t_{ON}\) |
| Proposed partitioned PE, ‘11’ | 1 | 1 | \(4\times 4\) | \(\propto\) 4*4 | \(t_{1,1}\); \(t_{4,4} + 2\cdot t_{ON}\) |

Table 1. Different PE Configurations Enabled Using the Two New Crossbar Control Signals

In a baseline PE architecture, a crossbar dimension is fixed to 4 × 4. Its static energy is proportional to the number of memory cells, which is 4*4 = 16 in this example. Latency in the crossbar varies from \(t_{1,1}\) (nearest cell or best case) to \(t_{4,4}\) (farthest cell or worst case).

In the proposed partitioned PE architecture, there are four configurations.

In configuration ‘00,’ the crossbar is configured as a 3 × 2 array with its static energy proportional to 3 × 2 = 6 memory cells. This is when the unutilized region is power-gated. The best-case latency is \(t_{1,1}\), and the worst-case latency is \(t_{3,2}-\Delta\), where \(\Delta\) is the reduction in parasitic delay due to shorter bitlines and wordlines.

In configuration ‘01,’ the crossbar is configured as a 4 × 2 array with its static energy proportional to 4 × 2 = 8 memory cells. The best-case latency is \(t_{1,1}\), and worst-case latency is \(t_{4,2}\ -\ \Delta \ +\ t_{ON}\), where \(t_{ON}\) is the delay of the isolation transistor on current paths.

In configuration ‘10,’ the crossbar is configured as a 3 × 4 array with its static energy proportional to 3 × 4 = 12 memory cells. The best-case latency is \(t_{1,1}\), and the worst-case latency is \(t_{3,4}-\Delta + t_{ON}\).

In configuration ‘11,’ the crossbar is configured as the baseline 4 × 4 array with its static energy proportional to 4 × 4 = 16 memory cells. The best-case latency is \(t_{1,1}\), and the worst-case latency is \(t_{4,4} + 2\cdot t_{ON}\). Observe that on the longest current path, there are now two isolation transistors, resulting in higher worst-case latency than in the baseline design.

Our proposed system software (which we discuss in Section 5) minimizes the use of configuration ‘11,’ improving both performance and energy efficiency.

Single control. The proposed partitioned PE architecture also supports using a single control signal for all isolation transistors in a crossbar. When using a single control, only the configurations ‘00’ and ‘11’ are used, implementing a \(3 \times 2\) and a \(4 \times 4\) array, respectively.

To generalize the discussion for an \(N \times N\) crossbar, assume that isolation transistors are inserted on every bitline (between wordlines P and \(P + 1\)) and on every wordline (between bitlines Q and \(Q + 1\)). Then, the four configurations are ‘00,’ a P × Q array; ‘01,’ an N × Q array; ‘10,’ a P × N array; and ‘11,’ an N × N array. Formally, \(\langle N, N_h, N_l, P, Q\rangle\) represents the proposed partitioned PE architecture. Equations (6) through (8) summarize the notations. (6) \(\begin{align} \langle N\rangle &= \text{a baseline } N\times N \text{ crossbar} \end{align}\) (7) \(\begin{align} \langle N,N_h,N_l\rangle &= N\times N \text{ crossbar with tech. enhancement (Section~3)} \end{align}\) (8) \(\begin{align} \langle N,N_h,N_l,P,Q\rangle &= N\times N \text{ crossbar with tech. and arch. enhancements (see Sections~3 and 4)} \end{align}\)
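The two control signals thus select the active array size; the transistors on the bitlines (driven by bl_iso_ctrl) cut off the wordlines beyond P, and those on the wordlines (wl_iso_ctrl) cut off the bitlines beyond Q. A sketch (function name ours):

```python
def effective_dimension(wl_iso: int, bl_iso: int, N: int, P: int, Q: int):
    """Active rows x columns for the four settings of the partitioned PE:
    '00' -> P x Q, '01' -> N x Q, '10' -> P x N, '11' -> N x N."""
    rows = P if bl_iso == 0 else N  # bitline cut limits usable wordlines
    cols = Q if wl_iso == 0 else N  # wordline cut limits usable bitlines
    return rows, cols
```

With \(N = 4\), \(P = 3\), \(Q = 2\), this reproduces the dimensions of Table 1; since static energy scales with rows × columns, the system software favors any setting other than ‘11.’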

We introduce the following four terms: (1) expanded mode: the crossbar operates in configuration ‘11’; (2) collapsed mode: the crossbar operates in configuration ‘00,’ ‘01,’ or ‘10’; (3) collapsed region: the reduced dimensions of the crossbar when operating in configuration ‘00,’ ‘01,’ or ‘10’; and (4) far region: the region of the crossbar excluding the collapsed region.

In our design methodology, the far region of a crossbar is power-gated using the two control signals at design time, considering the crossbar’s utilization. This is achieved during the mapping of neurons and synapses to the hardware. Since the neuron and synapse mapping does not change during inference, no dynamic power management is needed. Consequently, there is also no latency or energy overhead involved in switching the far region on/off at runtime.

4.1 Placing Isolation Transistors in a Crossbar

To illustrate the design space exploration involved in placing isolation transistors in a crossbar, Figure 11(a) shows a baseline crossbar with four current paths that are activated during the mapping of neurons and synapses. Figure 11(b) through (d) show three alternative placements of isolation transistors in the crossbar. In Figure 11(b), the P and Q values are kept small, so the far region is large. Only two of the current paths (1 and 2) stay within the collapsed region of the crossbar, whereas the other two (3 and 4) traverse the far region. This means that the latency of paths 3 and 4 increases due to the delay of the isolation transistors on these paths. Additionally, the far region cannot be power-gated, so there is limited scope for energy reduction using power gating. Increasing the P and Q values (Figure 11(c)) shrinks the far region. Although three of the four current paths now stay in the collapsed region, the far region still cannot be power-gated due to the presence of path 4 in this region. Finally, Figure 11(d) illustrates a placement where all current paths stay in the collapsed region, so the far region can be power-gated. However, because the far region is now small, the energy benefits may not be significant. We explore these latency and energy tradeoffs next.
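The placement scenarios of Figure 11 can be checked programmatically. The following sketch (a hypothetical helper, not part of the paper’s toolchain) classifies a mapping’s active current paths against a candidate (P, Q) partition:

```python
def evaluate_placement(paths, P, Q):
    """For a candidate partition (P, Q), classify active current paths.

    paths: iterable of (wordline, bitline) cells used by the mapping,
    1-indexed as in the t_{i,j} notation.  Returns the number of paths
    confined to the collapsed P x Q region and whether the far region
    can be power-gated (i.e., no active path enters it).
    """
    inside = sum(1 for (i, j) in paths if i <= P and j <= Q)
    return inside, inside == len(paths)
```

With four diagonal paths ending at cells (1,1) through (4,4), P = Q = 2 leaves two paths in the far region (Figure 11(b)), P = Q = 3 leaves one (Figure 11(c)), and only P = Q = 4 allows the far region to be power-gated (Figure 11(d)).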


Fig. 11. Placing isolation transistors in a crossbar.

Figure 12 shows the latency and energy tradeoffs in selecting the values of P and Q for the ResNet inference workload implemented on \(128 \times 128\) crossbars of a neuromorphic hardware. Latency and energy numbers are normalized to the baseline. We make the following two key observations.


Fig. 12. Selecting P and Q values for the ResNet application.

First, energy is lower for smaller P and Q values. This is because by reducing P and Q, the size of the collapsed region of a crossbar reduces. Therefore, there are more memory cells in the far region that can be power-gated to lower energy.

Second, latency also reduces with a reduction in P and Q values (until P = Q = 80), due to the shorter bitlines and wordlines of the collapsed region. However, with P = Q = 64 or 72, more clusters of ResNet need crossbars in the expanded mode of operation, because the synapses in these clusters no longer fit within the reduced dimensions of a collapsed crossbar. This increases latency due to the isolation transistors on current paths. For ResNet, P = Q = 80 is the tradeoff point. The tradeoff point differs across applications. To select a single crossbar configuration that gives good results for all applications, we perform a similar analysis for all evaluated applications (see Section 6.3). Based on this analysis, P = Q = 96 is the selected configuration for the \(128 \times 128\) crossbar at the 16-nm technology node.
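A minimal sweep over candidate P = Q values illustrates this tradeoff. The proxies below are illustrative assumptions, not the paper’s evaluation flow: the latency cost is approximated by the number of clusters forced into expanded mode, and static energy by the total number of active cells:

```python
def sweep_partition(cluster_dims, N, candidates):
    """Hypothetical sweep: for each candidate P (= Q), report how many
    clusters no longer fit the collapsed P x P region (and so must use
    the expanded N x N mode, paying the isolation-transistor delay),
    together with a static-energy proxy = total active cells."""
    results = {}
    for P in candidates:
        expanded = sum(1 for (r, c) in cluster_dims if r > P or c > P)
        energy = sum(N * N if (r > P or c > P) else P * P
                     for (r, c) in cluster_dims)
        results[P] = (expanded, energy)
    return results
```

For example, with two clusters of dimensions 60 × 60 and 90 × 90 on a 128 × 128 crossbar, shrinking P from 96 to 64 pushes the larger cluster into expanded mode and raises the active-cell count, mirroring the latency increase observed below P = Q = 80.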


5 EXPLOITING TECHNOLOGICAL AND ARCHITECTURAL IMPROVEMENTS VIA THE SYSTEM SOFTWARE

To describe the system software, the left side of Figure 13 shows the final crossbar design with isolation transistors that allow each neuromorphic PE to operate in a collapsed or expanded mode. The right side shows control signals for these transistors generated from a centralized controller implemented inside the system software.


Fig. 13. Final crossbar design using the isolation transistors. The right side shows the control signals generated from the controller when using the proposed partitioned PE architecture in a neuromorphic system.

Without loss of generality, Figure 14 shows modifications to the baseline system software [60] to exploit the proposed design changes. A trained ML model is first partitioned to generate clusters, where each cluster can fit onto a crossbar. These clusters are stored in a cluster queue (clQ). In the baseline design, each cluster from the clQ is mapped to an \(N \times N\) array (exactly replicating the crossbar dimension of the hardware). The mapping is programmed to the hardware using the cluster placement block. In the proposed design, each cluster of clQ is mapped on four separate arrays: a \(P \times Q\) array, an \(N \times Q\) array, a \(P \times N\) array, and an \(N \times N\) array. These mappings go to a configuration selection block, which selects the final mapping for the cluster and the configuration of the corresponding PE based on energy-latency tradeoffs. The configuration is programmed to the hardware by setting the two control signals wl_iso_ctrl and bl_iso_ctrl. This allows the far region of the crossbar to be power-gated. It is important to note that since we power-gate unused resources of a crossbar only at design time when admitting an application, we minimize the switching overhead. In the future, we will extend this work to also consider dynamic power management by dynamically controlling the isolation transistors.


Fig. 14. Proposed system software. All changes are indicated in red.

In selecting the final mapping, the configuration selector first checks to see if a cluster can be mapped to a \(P \times Q\) array. If this is possible, then the mapping to the \(P \times Q\) array is selected as the final mapping for the cluster, and the corresponding PE is set to operate in configuration ‘00’ (collapsed mode). Otherwise, the configuration selector checks to see if the cluster can be mapped to the \(N \times Q\) or \(P \times N\) array. If so, the corresponding mapping is selected, and the PE is set to operate in configurations ‘01’ or ‘10,’ respectively. If the cluster cannot be mapped to either \(N\times Q\) or \(P\times N\) arrays, the mapping to the \(N \times N\) array is selected as the final mapping of the cluster with the PE set to operate in configuration ‘11’ (expanded mode). In this way, the proposed system software uses expanded mode only when it is absolutely necessary to do so. Otherwise, it selects the collapsed region to map synapses, improving both latency and energy.
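The selection policy above can be sketched as a simple fall-through check. This is a sketch of the described policy only; the cluster dimensions and helper name are hypothetical, and the real configuration selector additionally weighs energy-latency tradeoffs:

```python
def select_configuration(rows, cols, N, P, Q):
    """Select the PE configuration for a cluster of size rows x cols,
    preferring the smallest array that fits, per the described policy."""
    if rows <= P and cols <= Q:
        return '00'   # collapsed P x Q array
    if cols <= Q:
        return '01'   # N x Q array
    if rows <= P:
        return '10'   # P x N array
    return '11'       # expanded N x N array (used only when necessary)
```

With N = 128 and the selected P = Q = 96, an 80 × 80 cluster maps to the collapsed mode, a 120 × 90 cluster to ‘01,’ a 90 × 120 cluster to ‘10,’ and only a 120 × 120 cluster forces the expanded mode.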


6 EVALUATION METHODOLOGY

6.1 Simulation Framework

We evaluate the proposed design-technology co-optimization approach for OxRRAM-based neuromorphic PEs. Our simulation framework includes NeuroXplorer [11], a cycle-level in-house neuromorphic simulator with programmable crossbar parameters. We configure this framework to simulate crossbars with the parameters listed in Table 2.

Neuron Technology: 16-nm CMOS (original design is at 14-nm FinFET)
Synapse Technology: HfO\({}_2\)-based OxRRAM [64]
Supply Voltage: 1.0 V
Energy per Spike: 23.6 pJ at 30-Hz spike frequency
Energy per Routing: 3 pJ
Switch Bandwidth: 3.44 G events/s

Table 2. Major Simulation Parameters Extracted from the Work of Davies et al. [28]

Circuit-level simulations are performed with technology parameters from the predictive technology model (PTM) [101] and OxRRAM-specific parameters from Chen and Yu [18]. We note that comparing different chip technologies or recommending one technology node over another is not the focus of this work. Instead, we show that for a given process technology node, design optimizations can reduce energy and latency variations. Furthermore, the proposed design-technology co-optimization methodology can be used by system designers to choose the best technology node for their neuromorphic designs by exploring the energy-performance tradeoffs.

Neuromorphic simulations are performed on a Lambda workstation with an AMD Threadripper 3960X (24 cores), 128-MB cache, 128 GB of RAM, and two RTX 3090 GPUs. Figure 15(a) shows the design pipeline implemented using NeuroXplorer. An ML model is first trained using frameworks such as Keras and PyTorch. Subsequently, the trained model is converted into an SNN using the techniques of [4, 76]. The converted model is then simulated using an SNN simulator such as CARLsim [21]. NeuroXplorer integrates PyCARL [3], which allows the SNN model to be simulated using other SNN simulators such as Nengo [13], Neuron [43], and Brian [39]. Keras [41] and CARLsim [21] use the two GPUs to accelerate model training and SNN functional simulation, respectively.


Fig. 15. Design pipeline using NeuroXplorer.

The simulated SNN model is clustered using the best-effort technique of Balaji et al. [10], which maximizes cluster utilization. Clusters of the SNN are mapped to the hardware using the SpiNeMap technique [7]. Finally, we perform cycle-accurate simulation of the clusters using NeuroXplorer [11].

Figure 15(b) shows the modeling hierarchy of the simulator. At the highest level is the many-core design, which is a tile-based architecture similar to Loihi [28]. Each PE consists of a crossbar, which is an organization of neurons and synapses. A neuron is modeled using the work of Indiveri [48], and a synaptic circuit using the work of Mallik et al. [64]. At the lowest level are the technology models (see Table 2).

Finally, Figure 15(c) shows the statistics collection framework in NeuroXplorer. It facilitates global statistics collection, where spike arrival times are recorded for each PE (shown as C in the figure). These spike times are then used to compute the ISI distortion (see Appendix B).

6.2 Power Consideration for Isolation Transistors

The additional power required to control the isolation transistors when accessing the RRAM cells in the far region is approximately 3× that of raising a wordline, since raising a wordline requires driving one access transistor per bitline, whereas accessing the RRAM cells in the far region requires driving two isolation transistors and one access transistor per bitline. The power overhead for accessing RRAM cells in the collapsed modes ‘01’ and ‘10’ is approximately 2× (one isolation and one access transistor) [56, 79, 83]. The energy numbers reported in Section 7.1 incorporate these overheads.
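The per-bitline transistor counts behind these overhead factors can be made explicit in a small sketch (the dictionary and helper are illustrative, derived directly from the counts quoted above):

```python
# Transistors driven per bitline for each access mode, as quoted above:
# baseline = one access transistor; modes '01'/'10' add one isolation
# transistor; reaching the far region in mode '11' adds two.
TRANSISTORS_DRIVEN = {'baseline': 1, '01': 2, '10': 2, '11': 3}

def drive_power_overhead(mode):
    """Relative wordline-drive power versus the baseline access."""
    return TRANSISTORS_DRIVEN[mode] / TRANSISTORS_DRIVEN['baseline']
```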

6.3 Evaluated Workloads

We select 10 ML inference programs that are representative of the three most commonly used neural network classes: convolutional neural networks (CNNs), multi-layer perceptrons (MLPs), and recurrent neural networks (RNNs). Table 3 summarizes the topology, number of neurons and synapses, number of spikes per image, and baseline quality of these applications on hardware.

Class | Application | Dataset | Neurons | Synapses | Avg. Spikes/Frame | Baseline Quality | Obtained Quality
CNN | LeNet | CIFAR-10 | 80,271 | 275,110 | 724,565 | 86.3% | 87.1%
CNN | AlexNet | CIFAR-10 | 127,894 | 3,873,222 | 7,055,109 | 66.4% | 66.9%
CNN | ResNet | CIFAR-10 | 266,799 | 5,391,616 | 7,339,322 | 57.4% | 58.0%
CNN | DenseNet | CIFAR-10 | 365,200 | 11,198,470 | 1,250,976 | 46.3% | 46.5%
CNN | VGG | CIFAR-10 | 448,484 | 22,215,209 | 12,826,673 | 81.4% | 81.6%
CNN | HeartClass [24] | PhysioNet | 170,292 | 1,049,249 | 2,771,634 | 63.7% | 63.9%
MLP | MLPDigit | MNIST | 894 | 79,400 | 26,563 | 91.6% | 96.4%
MLP | EdgeDet [21] | CARLsim | 7,268 | 114,057 | 248,603 | SSIM = 0.89 | 0.99
MLP | ImgSmooth [21] | CARLsim | 5,120 | 9,025 | 174,872 | PSNR = 19 | 22.2
RNN | RNNDigit [30] | MNIST | 1,191 | 11,442 | 30,508 | 83.6% | 83.7%

Table 3. Applications Used to Evaluate the Proposed Approach

6.4 Evaluated Approaches

We evaluate the following techniques:

  • Baseline [7]: The Baseline approach first clusters an ML inference model to minimize the inter-cluster spike communication. Clusters are then mapped to neuromorphic PEs of the hardware with synapses of each cluster implemented on memory cells of a crossbar without incorporating latency variation. Neuromorphic PEs are not optimized to reduce latency variation—that is, any resistance states (LRS or HRS) can be programmed on any current path (long or short). Unused crossbars are power-gated to reduce energy consumption. This is the coarse-grained power management technique implemented in many state-of-the-art many-core neuromorphic designs such as Loihi [28], DYNAPs [66], and \(\mu\)Brain [93].

  • Baseline + Design Changes: This is the Baseline mapping approach implemented on the proposed latency-optimized partitioned neuromorphic PE design. In the proposed design, the HRS, which takes a long time to sense, is used only on shorter current paths, ones that have lower parasitic delays. Similarly, the LRS is used only on longer current paths. In addition to coarse-grained power management, we facilitate power gating at a finer granularity in the proposed design. Specifically, by controlling the isolation transistors, we power-gate unused resources within each crossbar.

  • Proposed: This is the proposed solution where the system software is optimized to exploit the design changes.


7 RESULTS AND DISCUSSIONS

7.1 Energy Efficiency

Figure 16 plots the energy efficiency of the evaluated techniques normalized to Baseline. We make the following two key observations.


Fig. 16. Energy consumption normalized to Baseline.

First, with the proposed design changes, energy reduces by only 7% compared to Baseline. This is because both in Baseline and Baseline with the proposed design changes, synapses of a cluster are implemented randomly on NVM cells of a crossbar, causing them to be distributed across the crossbar dimension. Therefore, there remains a limited scope to collapse the crossbar and use power gating to save energy. Second, the proposed design-technology co-optimization approach has the lowest energy (22% lower than Baseline and 16% lower than Baseline with the proposed design changes). This improvement is due to the proposed system software, which exploits the design changes in implementing ML inference on neuromorphic PEs. In particular, synapses are implemented to maximize the utilization of the collapsed region in each crossbar of the hardware. If all of a cluster’s synapses fit into the collapsed region, then the far region can be isolated from the collapsed region using isolation transistors and power-gated to save energy.

7.2 Latency Variation

Figure 17 plots the latency variation normalized to Baseline. We make the following three key observations.


Fig. 17. Latency variation normalized to Baseline.

First, with the proposed design changes, latency variation increases compared to Baseline by an average of 1%. This is because of the increase in latency associated with the delay of isolation transistors on current paths. Second, the latency variation using the proposed approach is 30% lower than Baseline and 32% lower than Baseline with the proposed design changes. The reason for these improvements is threefold: (1) optimizing NVM resistance states in a crossbar such that the state that takes the longest time to sense is programmed on current paths that have the least propagation delay; (2) isolating the collapsed region of a crossbar from the far region to reduce current propagation delay; and (3) exploiting these changes during the implementation of an ML inference using the proposed system software, which uses the far region of a crossbar only when it is absolutely necessary to do so. Otherwise, it improves both latency and energy by operating the crossbar in the collapsed mode.

Finally, the latency variation using the proposed approach varies across different applications. This is because the proposed approach exploits the latency and energy tradeoffs differently for different applications. The latency variation is similar to the Baseline for ResNet, whereas it is significantly lower than the Baseline for HeartClass.

Using the results from Sections 7.1 and 7.2, we conclude that the proposed approach introduces maximum gain for applications where the latency and energy tradeoffs can be better exploited. For all other applications, it either minimizes energy or minimizes latency variation.

7.3 Real-Time Performance

One of the key hardware performance metrics for neuromorphic computing is real-time performance, which is a function of the crossbar latency. To evaluate real-time performance, Figure 18 plots the crossbar latency of the proposed approach and the Baseline for the evaluated applications. Results are normalized to the Baseline.


Fig. 18. Crossbar latency normalized to Baseline.

We observe that the crossbar latency using the proposed approach is on average 4.5% lower than the Baseline. This reduction is because the proposed approach places synapses with the HRS on shorter current paths, which lowers the overall spike latency on those synapses, as elaborated in Section 3.2.

7.4 Inference Quality

Figure 19 shows the improvement in inference quality using the proposed approach, normalized to Baseline. We observe that the inference quality improves by an average of 4%. This is due to the reduction in ISI distortion caused by the reduction of latency variation in neuromorphic PEs using the proposed changes, which we analyzed in Section 7.2. In addition, the improvement in inference quality for EdgeDet and ImgSmooth, which use the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) metrics, is higher than for the inference tasks that use accuracy metrics. This is because the PSNR and SSIM metrics are computed on individual images, where we see a large improvement in quality. For accuracy-based tasks, we observe that the feature representation in the hidden layers of these models changes due to ISI distortion, but not all such changes lead to misclassification. Thus, the accuracy of these inference tasks is comparable to Baseline.


Fig. 19. Inference quality normalized to Baseline.

7.5 Single vs. Double Control Design

Figure 20 plots the energy efficiency of the proposed design with a single control signal and of the default design, which uses two control signals for each PE. We observe that with a single control signal, energy reduces by only 2% compared to Baseline. This is because most crossbars are operated in the expanded mode due to the limited scope to collapse the crossbar. Our default design leads to 14.4% lower energy than the single-control design. This is because in the default design, a crossbar can be collapsed along the X- and Y-dimensions independently, leading to three collapsed array configurations. Therefore, the system software has a higher probability of using the collapsed mode, leading to a reduction in energy.


Fig. 20. Partitioned PE architecture with single and double control.

7.6 Die Area Analysis

Adding an isolation transistor to the bitline increases the height of the crossbar, whereas adding one to the wordline increases its width. Without the isolation transistors, the height of a baseline crossbar is equal to the sum of the heights of the memory cells and the sense amplifier, whereas the width is equal to the sum of the widths of the memory cells. For RRAM-based neuromorphic PEs, a sense amplifier in the peripheral circuit and an isolation transistor are approximately 384× and 9.6× taller than an individual RRAM cell, respectively [17, 64, 96]. In terms of width, an isolation transistor is only 1.3× wider than an RRAM cell. Therefore, for a crossbar with 128 RRAM cells per bitline and wordline (i.e., a \(128 \times 128\) array), the overhead along the height of the crossbar is \(\frac{9.6}{384 + 128} \approx 1.88\%\), and the overhead along the width is \(\frac{1.3}{128} \approx 1.02\%\).
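The overhead arithmetic can be reproduced directly (all dimensions are normalized to the height/width of one RRAM cell, using the relative sizes quoted above; the helper name is illustrative):

```python
def crossbar_area_overhead(cells=128, sa_height=384.0,
                           iso_height=9.6, iso_width=1.3):
    """Fractional area overhead of one isolation transistor per dimension.

    Baseline height = cells + sense-amplifier height; baseline width =
    cells.  Returns (height_overhead, width_overhead) as fractions.
    """
    height = cells + sa_height
    return iso_height / height, iso_width / cells
```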


8 CONCLUSION

We present a design-technology co-optimization approach to implement energy-efficient ML inference on NVM-based neuromorphic PEs. First, we optimize the NVM resistance state such that the state that takes the longest time to sense is placed on current paths with fewer parasitics and hence incurs lower propagation delay, and vice versa. Second, we use isolation transistors to partition a PE into collapsed and far regions such that the NVM cells of the far region can be opportunistically power-gated to save both energy and latency. Finally, we use the system software to exploit the design changes, maximizing the utilization of the collapsed region of each PE in the hardware. Our system software uses the far region only when it is absolutely necessary to do so; otherwise, it improves both latency and energy by operating the PE in the collapsed mode. We evaluate our design-technology co-optimization approach for a state-of-the-art neuromorphic architecture. Evaluations with different ML inference tasks show that the proposed approach improves both latency and energy without incurring significant cost-per-bit.

APPENDICES

A SPIKING NEURAL NETWORKS

SNNs enable powerful computations due to their spatio-temporal information encoding capabilities [63]. An SNN consists of neurons connected via synapses. A neuron can be implemented as integrate-and-fire (IF) logic, which is illustrated in Figure 21 (left). Here, an input current \(U(t)\) (i.e., a spike from a pre-synaptic neuron) raises the membrane voltage of the neuron. When this voltage crosses a threshold \(V_{th}\), the IF logic emits an output spike, which propagates to its post-synaptic neurons. Figure 21 (middle) illustrates the membrane voltage of the IF neuron due to an input spike train. The moments of threshold crossing are illustrated in Figure 21 (right); these are the firing times of the output spike train of the neuron.
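The IF dynamics can be sketched in a few lines of discrete-time code (an illustrative model with an assumed leak factor and reset-to-zero behavior; the parameters are not taken from the referenced neuron designs):

```python
def integrate_and_fire(input_current, v_th=1.0, leak=0.1, dt=1.0):
    """Minimal leaky IF sketch: integrate U(t) into the membrane
    potential, emit a spike and reset to zero when v crosses v_th.
    Returns the list of firing time steps."""
    v, spikes = 0.0, []
    for t, u in enumerate(input_current):
        v = max(0.0, v * (1.0 - leak) + u * dt)  # leaky integration
        if v >= v_th:
            spikes.append(t)  # threshold crossing = firing time
            v = 0.0           # reset after the output spike
    return spikes
```

For example, a constant sub-threshold input charges the membrane over several steps before each threshold crossing, reproducing the behavior sketched in Figure 21.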


Fig. 21. A leaky IF neuron with current input \(U(t)\) (left). The membrane potential over time of the neuron (middle). The spike output of the neuron representing its firing time (right).

SNNs can implement many ML approaches such as supervised learning, unsupervised learning, reinforcement learning, and lifelong learning. We focus on supervised ML, where an SNN is pre-trained with representative data. ML inference refers to feeding live data points to this trained SNN to generate the corresponding output.

B QUALITY OF INFERENCE

The quality of ML inference can be expressed in terms of accuracy [4], mean square error [26], PSNR [21], and SSIM [44]. Although accuracy is commonly used for assessing the quality of supervised learning (e.g., using CNNs), there are also applications, such as edge detection, where quality is assessed using other metrics such as SSIM. In our prior work [7], we showed that these quality metrics are a function of the inter-spike interval (ISI) between neurons. Therefore, any deviation of the ISI (called ISI distortion) from its trained value may lead to quality loss. To define the ISI, let \(\lbrace t_1, t_2, \ldots , t_{K}\rbrace\) denote a neuron’s firing times in the time interval \([0,T]\). The average ISI of this spike train is (10) \(\begin{equation} \mathcal {I} = \sum _{i=2}^K (t_i - t_{i-1})/(K-1). \end{equation}\)
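Equation (10) translates directly into code (a small helper; note that the sum telescopes to \((t_K - t_1)/(K-1)\)):

```python
def average_isi(spike_times):
    """Average inter-spike interval of a spike train {t_1, ..., t_K},
    per Equation (10): I = sum_{i=2}^{K} (t_i - t_{i-1}) / (K - 1)."""
    gaps = [b - a for a, b in zip(spike_times, spike_times[1:])]
    return sum(gaps) / len(gaps)
```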

To illustrate how a change in ISI, called ISI distortion, impacts inference quality, we use a small SNN in which three input neurons are connected to an output neuron. Figure 22 illustrates the impact of ISI distortion on the output spike. In the top part of the figure, a spike is generated at the output neuron at 22\(\mu\)s due to spikes from the input neurons. In the bottom part of the figure, the second spike from input 3 is delayed (i.e., it has an ISI distortion). Due to this distortion, there is no output spike generated. Missing spikes can impact inference quality, as spikes encode information in SNNs.


Fig. 22. Impact of ISI distortion on accuracy [3]. Top: A scenario where an output spike is generated based on the spikes received from the three input neurons. Bottom: A scenario where the second spike from neuron 3 is delayed. There are no output spikes generated.

Figure 23 shows the impact of ISI distortion on the quality of image smoothing implemented using an SNN [21]. Figure 23(a) shows the input image, which is fed to the SNN. Figure 23(b) shows the output of the image smoothing application with no ISI distortion. PSNR of the output with reference to the input is 20. Figure 23(c) shows the output with ISI distortion due to variation in latency within neuromorphic PEs of the hardware. PSNR of this output with respect to the input is 19. A reduction in PSNR indicates that the output image quality with ISI distortion is lower than the one without distortion. In fact, image quality deteriorates with an increase in ISI distortion. We use ISI distortion as a measure of the quality of ML inference [7]. Our aim is to improve this inference quality via technological and architectural enhancements that reduce ISI distortion when the inference task is implemented on neuromorphic PEs of hardware.


Fig. 23. Impact of ISI distortion on image smoothing.

C HARDWARE IMPLEMENTATION OF ML INFERENCE

Most neuromorphic hardware platforms are implemented as tile-based architectures [16, 28, 29, 37, 72, 93], where the tiles are interconnected via a shared interconnect such as a network-on-chip [62] or a segmented bus [12]. Figure 24 illustrates a tile-based neuromorphic hardware platform, where the tiles can communicate concurrently. Each tile includes (1) a neuromorphic PE, which consists of neuron and synapse circuitries, and (2) a network interface, which encodes spikes into AER (address event representation) packets and communicates them to the switch for routing to their destination tiles. A common design practice is to use analog crossbars to implement a neuromorphic PE [2, 7, 45, 52, 55, 58, 61, 100]. Within a crossbar, a pre-synaptic neuron circuit acts as a current driver and is placed on a wordline, whereas a post-synaptic neuron circuit acts as a current sink and is placed on a bitline, as illustrated in Figure 1 (left).
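As background, the ideal operation of such an analog crossbar is a vector-matrix product: each wordline voltage \(V_i\) contributes a current \(V_i G_{ij}\) to bitline j, and each bitline sums its column current. The sketch below is an idealized illustration that ignores the parasitic effects this paper studies:

```python
def crossbar_currents(voltages, conductances):
    """Ideal crossbar sketch: pre-synaptic neurons drive wordline
    voltages V_i, each cell contributes I = V_i * G_ij (Ohm's law),
    and each bitline j sums its column current (Kirchhoff's law)."""
    return [sum(v * g_row[j] for v, g_row in zip(voltages, conductances))
            for j in range(len(conductances[0]))]
```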


Fig. 24. Tile-based neuromorphic hardware, representative of hardware platforms such as TrueNorth [29], Loihi [28], DYNAPs [66], and \(\mu\) Brain [93].

Since a crossbar can accommodate only a limited number of neurons and synapses, an ML model is first partitioned into clusters, where each cluster can be implemented on a crossbar of the hardware. Partitioned clusters are then mapped to different crossbars when admitting the model to the hardware platform. To this end, several heuristic approaches are proposed in the literature. PSOPART [27] minimizes spike latency on the shared interconnect, SpiNeMap [7] minimizes interconnect energy, DFSynthesizer [76] maximizes throughput, DecomposedSNN [10] maximizes crossbar utilization, EaNC [90] minimizes overall energy of an ML task by targeting both computation and communication energy, TaNC [89] minimizes the average temperature of each crossbar, eSpine [91] maximizes NVM endurance in a crossbar, RENEU [80] minimizes the circuit aging in a crossbar’s peripheral circuits, and NCil [86] reduces read disturb issues in a crossbar, improving the inference lifetime. Besides these techniques, there are other software frameworks [1, 5, 6, 9, 11, 23, 25, 38, 47, 50, 54, 60, 71, 75, 77, 78, 85, 88] and runtime approaches [8, 84] addressing one or more of these optimization objectives.

We investigate the internal architecture of a crossbar and find that the parasitic components introduce delay in propagating current from a pre-synaptic neuron to a post-synaptic neuron as illustrated in Figure 1 (right). This delay depends on the specific current path used in the mapping. The higher the number of parasitic components on a current path, the larger is its propagation delay. Parasitic components on bitlines and wordlines are a major source of latency at scaled process technology nodes, and they create significant latency variation in a crossbar. Specifically, the latency of a synaptic connection in an SNN depends precisely on the memory cell in the crossbar that is used to implement it. Such latency variation can introduce ISI distortion (see Appendix B), which may impact the quality of an inference task.
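As a first-order illustration of this path dependence, the sketch below models each wordline/bitline segment as a lumped RC and uses an Elmore-style delay. The model and its parameters are assumptions for illustration only, not the paper’s circuit-level simulation:

```python
def path_delay(i, j, r_seg, c_seg, t_sense):
    """First-order delay of the current path through cell (i, j), in the
    t_{i,j} notation: the path traverses i bitline and j wordline
    segments, each adding parasitic RC, plus the state-dependent sensing
    time t_sense (HRS takes longer to sense than LRS)."""
    segments = i + j
    # Elmore delay of a uniform RC line with `segments` sections.
    rc = r_seg * c_seg * segments * (segments + 1) / 2
    return rc + t_sense
```

Under this model, the delay grows super-linearly with the path length, which is why cells far from the drivers dominate the worst-case latency, and why mapping the slow-to-sense HRS onto short paths reduces latency variation.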

D NVM TECHNOLOGY

RRAM technology presents an attractive option for implementing the memory cells of a crossbar due to its demonstrated potential for low-power multi-level operation and high integration density [64]. An RRAM cell is composed of an insulating film sandwiched between conducting electrodes, forming a metal-insulator-metal (MIM) structure (Figure 25). Recently, conducting filament-based metal-oxide RRAM implemented with transition metal oxides such as HfO\({}_2\), ZrO\({}_2\), and TiO\({}_2\) has received considerable attention due to its low-power operation and CMOS-compatible scaling.


Fig. 25. Operation of an RRAM cell with the \(\text{HfO}_2\) layer sandwiched between the metals Ti (top electrode) and TiN (bottom electrode). The right side shows the formation of LRS/SET. The left side shows HRS/RESET.

Synaptic weights are represented as the conductance of the insulating layer within each RRAM cell. To program an RRAM cell, elevated voltages are applied at the top and bottom electrodes, which rearranges the atomic structure of the insulating layer. Figure 25 shows the HRS and LRS of an RRAM cell. An RRAM cell can also be programmed into intermediate low-resistance states, enabling multi-level operation [18].

REFERENCES

  [1] Arnon Amir, Pallab Datta, William P. Risk, Andrew S. Cassidy, Jeffrey A. Kusnitz, Steve K. Esser, Alexander Andreopoulos, et al. 2013. Cognitive computing programming paradigm: A corelet language for composing networks of neurosynaptic cores. In Proceedings of IJCNN.
  [2] Aayush Ankit, Abhronil Sengupta, and Kaushik Roy. 2017. TraNNsformer: Neural network transformation for memristive crossbar based neuromorphic system design. In Proceedings of ICCAD.
  [3] Adarsha Balaji, Prathyusha Adiraju, Hirak J. Kashyap, Anup Das, Jeffrey L. Krichmar, Nikil D. Dutt, and Francky Catthoor. 2020. PyCARL: A PyNN interface for hardware-software co-simulation of spiking neural network. In Proceedings of IJCNN.
  [4] Adarsha Balaji, Federico Corradi, Anup Das, Sandeep Pande, Siebren Schaafsma, and Francky Catthoor. 2018. Power-accuracy trade-offs for heartbeat classification on neural networks hardware. Journal of Low Power Electronics 14, 4 (2018), 508–519.
  [5] Adarsha Balaji and Anup Das. 2019. A framework for the analysis of throughput-constraints of SNNs on neuromorphic hardware. In Proceedings of ISVLSI.
  [6] Adarsha Balaji and Anup Das. 2020. Compiling spiking neural networks to mitigate neuromorphic hardware constraints. In Proceedings of IGSC Workshops.
  [7] Adarsha Balaji, Anup Das, Yuefeng Wu, Khanh Huynh, Francesco G. Dell’Anna, Giacomo Indiveri, Jeffrey L. Krichmar, Nikil D. Dutt, Siebren Schaafsma, and Francky Catthoor. 2020. Mapping spiking neural networks to neuromorphic hardware. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 1 (2020), 76–86.
  [8] Adarsha Balaji, Thibaut Marty, Anup Das, and Francky Catthoor. 2020. Run-time mapping of spiking neural networks to neuromorphic hardware. Journal of Signal Processing Systems 92, 11 (2020), 1293–1302.
  [9] Adarsha Balaji, Shihao Song, Anup Das, Nikil Dutt, Jeff Krichmar, Nagarajan Kandasamy, and Francky Catthoor. 2019. A framework to explore workload-specific performance and lifetime trade-offs in neuromorphic computing. IEEE Computer Architecture Letters 18, 2 (2019), 149–152.
  [10] Adarsha Balaji, Shihao Song, Anup Das, Jeffrey Krichmar, Nikil Dutt, James Shackleford, Nagarajan Kandasamy, and Francky Catthoor. 2020. Enabling resource-aware mapping of spiking neural networks via spatial decomposition. IEEE Embedded Systems Letters 13, 3 (2020), 142–145.
  [11] Adarsha Balaji, Shihao Song, Twisha Titirsha, Anup Das, Jeffrey Krichmar, Nikil Dutt, James Shackleford, Nagarajan Kandasamy, and Francky Catthoor. 2021. NeuroXplorer 1.0: An extensible framework for architectural exploration with spiking neural networks. In Proceedings of ICONS.
  [12] Adarsha Balaji, Yuefeng Wu, Anup Das, Francky Catthoor, and Siebren Schaafsma. 2019. Exploration of segmented bus as scalable global interconnect for neuromorphic computing. In Proceedings of GLSVLSI.
  [13] Trevor Bekolay, James Bergstra, Eric Hunsberger, Travis DeWolf, Terrence C. Stewart, Daniel Rasmussen, Xuan Choo, Aaron Voelker, and Chris Eliasmith. 2014. Nengo: A Python tool for building large-scale functional brain models. Frontiers in Neuroinformatics 7 (2014), 48.
  [14] Sumon Bose, Jyotibdha Acharya, and Arindam Basu. 2019. Is my neural network neuromorphic? Taxonomy, recent trends and future directions in neuromorphic engineering. In Proceedings of ACSSC.
  [15] Geoffrey W. Burr, Robert M. Shelby, Abu Sebastian, Sangbum Kim, Seyoung Kim, Severin Sidler, Kumar Virwani, et al. 2017. Neuromorphic computing using non-volatile memory. Advances in Physics: X 2, 1 (2017), 89–124.
  [16] Francky Catthoor, Srinjoy Mitra, Anup Das, and Siebren Schaafsma. 2018. Very large-scale neuromorphic systems for biological signal processing. In CMOS Circuits for Biological Sensing and Processing, Srinjoy Mitra and David R. S. Cumming (Eds.). Springer, 315–340.
  17. [17] Chen Pai-Yu, Li Zhiwei, and Yu Shimeng. 2016. Design tradeoffs of vertical RRAM-based 3-D cross-point array. IEEE Transactions on Very Large Scale (VLSI) Systems 24, 12 (2016), 3460–3467.Google ScholarGoogle Scholar
  18. [18] Chen Pai-Yu and Yu Shimeng. 2015. Compact modeling of RRAM devices and its applications in 1T1R and 1S1R array design. IEEE Transactions on Electron Devices 62, 12 (2015), 4022–4028.Google ScholarGoogle ScholarCross RefCross Ref
  19. [19] Chen Yangyin. 2020. ReRAM: History, status, and future. IEEE Transactions on Electron Devices 67, 4 (2020), 1420–1433.Google ScholarGoogle ScholarCross RefCross Ref
  20. [20] Chiu Yi-Hsuan, Liao Yi-Bo, Chiang Meng-Hsueh, Lin Chia-Long, Hsu Wei-Chou, Chiang Pei-Chia, Hsu Yen-Ya, et al. 2010. Impact of resistance drift on multilevel PCM design. In Proceedings of ICDT.Google ScholarGoogle ScholarCross RefCross Ref
  21. [21] Chou Ting, Kashyap Hirak, Xing Jinwei, Listopad Stanislav, Rounds Emily, Beyeler Michael, Dutt Nikil, and Krichmar Jeffrey. 2018. CARLsim 4: An open source library for large scale, biologically detailed spiking neural network simulation using heterogeneous clusters. In Proceedings of IJCNN.Google ScholarGoogle ScholarCross RefCross Ref
  22. [22] Christensen Dennis V., Dittmann Regina, Linares-Barranco Bernabé, Sebastian Abu, Gallo Manuel Le, Redaelli Andrea, Slesazeck Stefan, et al. 2021. 2021 roadmap on neuromorphic computing and engineering. arXiv preprint arXiv:2105.05956 (2021).Google ScholarGoogle Scholar
  23. [23] Curzel Serena, Agostini Nicolas Bohm, Song Shihao, Dagli Ismet, Limaye Ankur, Tan Cheng, et al. 2021. Automated generation of integrated digital and spiking neuromorphic machine learning accelerators. In Proceedings of ICCAD.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. [24] Das Anup, Catthoor Francky, and Schaafsma Siebren. 2018. Heartbeat classification in wearables using multi-layer perceptron and time-frequency joint distribution of ECG. In Proceedings of CHASE.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. [25] Das Anup and Kumar Akash. 2018. Dataflow-based mapping of spiking neural networks on neuromorphic hardware. In Proceedings of GLSVLSI.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. [26] Das A., Pradhapan P., Groenendaal W., Adiraju P., Rajan R. T., Catthoor F., Schaafsma S., Krichmar J. L., Dutt N., and Hoof C. Van. 2018. Unsupervised heart-rate estimation in wearables with Liquid states and a probabilistic readout. Neural Networks 99 (2018), 134–147.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. [27] Das Anup, Wu Yuefeng, Huynh Khanh, Dell’Anna Francesco, Catthoor Francky, and Schaafsma Siebren. 2018. Mapping of local and global synapses on spiking neuromorphic hardware. In Proceedings of DATE.Google ScholarGoogle ScholarCross RefCross Ref
  28. [28] Davies Mike, Srinivasa Narayan, Lin Tsung Han, Chinya Gautham, Cao Yongqiang, Choday Sri Harsha, Dimou Georgios, et al. 2018. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 1 (2018), 82–99.Google ScholarGoogle ScholarCross RefCross Ref
  29. [29] Debole Michael V., Taba Brian, Amir Arnon, Akopyan Filipp, Andreopoulos Alexander, Risk William P., Kusnitz Jeff, et al. 2019. TrueNorth: Accelerating from zero to 64 million neurons in 10 years. Computer 52, 5 (2019), 20–29.Google ScholarGoogle ScholarCross RefCross Ref
  30. [30] Diehl Peter and Cook Matthew. 2015. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in Computational Neuroscience 9 (2015), 99.Google ScholarGoogle ScholarCross RefCross Ref
  31. [31] Doevenspeck J., Degraeve R., Fantini A., Cosemans S., Mallik A., Debacker P., Verkest D., Lauwereins R., and Dehaene W.. 2021. OxRRAM-based analog in-memory computing for deep neural network inference: A conductance variability study. IEEE Transactions on Electron Devices 68, 5 (2021), 2301–2305.Google ScholarGoogle ScholarCross RefCross Ref
  32. [32] Fernando B. Rasitha, Qi Yangjie, Yakopcic Chris, and Taha Tarek M.. 2020. 3D memristor crossbar architecture for a multicore neuromorphic system. In Proceedings of IJCNN.Google ScholarGoogle ScholarCross RefCross Ref
  33. [33] Fouda Mohammed E., Eltawil Ahmed M., and Kurdahi Fadi. 2017. Modeling and analysis of passive switching crossbar arrays. IEEE Transactions on Circuits and Systems I: Regular Papers 65, 1 (2017), 270–282.Google ScholarGoogle ScholarCross RefCross Ref
  34. [34] Fouda Mohammed E., Lee Jongeun, Eltawil Ahmed M., and Kurdahi Fadi. 2018. Overcoming crossbar nonidealities in binary neural networks through learning. In Proceedings of NANOARCH.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. [35] Fouda Mohammed E., Lee Sugil, Lee Jongeun, Kim Gun Hwan, Kurdahi Fadi, and Eltawil Ahmed. 2020. IR-QNN framework: An IR drop-aware offline training of quantized crossbar arrays. IEEE Access 8 (2020), 228392–228408.Google ScholarGoogle ScholarCross RefCross Ref
  36. [36] Fouda Mohammed E., Neftci E., Eltawil Ahmed, and Kurdahi F.. 2019. Effect of asymmetric nonlinearity dynamics in RRAMs on spiking neural network performance. In Proceedings of ACSSC.Google ScholarGoogle ScholarCross RefCross Ref
  37. [37] Frenkel Charlotte. 2020. Bottom-Up and Top-Down Neuromorphic Processor Design: Unveiling Roads to Embedded Cognition. Ph.D. Dissertation. UCL-Université Catholique de Louvain.Google ScholarGoogle Scholar
  38. [38] Galluppi Francesco, Davies Sergio, Rast Alexander, Sharp Thomas, Plana Luis A., and Furber Steve. 2012. A hierachical configuration system for a massively parallel neural hardware platform. In Proceedings of CF.Google ScholarGoogle Scholar
  39. [39] Goodman Dan F. M. and Brette Romain. 2009. The brian simulator. Frontiers in Neuroscience 3, 2 (2009), 192–197.Google ScholarGoogle ScholarCross RefCross Ref
  40. [40] Gopalakrishnan Roshan, Chua Yansong, Sun Pengfei, Kumar Ashish Jith Sreejith, and Basu Arindam. 2020. HFNet: A CNN architecture co-designed for neuromorphic hardware with a crossbar array of synapses. Frontiers in Neuroscience 14 (2020), 907.Google ScholarGoogle ScholarCross RefCross Ref
  41. [41] Gulli Antonio and Pal Sujit. 2017. Deep Learning with Keras. Packt Publishing.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. [42] He Yintao, Wang Ying, Zhao Xiandong, Li Huawei, and Li Xiaowei. 2020. Towards state-aware computation in ReRAM neural networks. In Proceedings of DAC.Google ScholarGoogle ScholarCross RefCross Ref
  43. [43] Hines Michael L. and Carnevale Nicholas T.. 1997. The NEURON simulation environment. Neural Computation 9, 6 (1997), 1179–1209.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. [44] Hore Alain and Ziou Djemel. 2010. Image quality metrics: PSNR vs. SSIM. In Proceedings of ICPR.Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. [45] Hu Miao, Li Hai, Chen Yiran, Wu Qing, Rose Garrett S., and Linderman Richard W.. 2014. Memristor crossbar-based neuromorphic computing system: A case study. IEEE Transactions on Neural Networks and Learning Systems 25, 10 (2014), 1864–1878.Google ScholarGoogle ScholarCross RefCross Ref
  46. [46] Hu Miao, Strachan John Paul, Li Zhiyong, Grafals Emmanuelle M., Davila Noraica, Graves Catherine, Lam Sity, Ge Ning, Yang Jianhua Joshua, and Williams R. Stanley. 2016. Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication. In Proceedings of DAC.Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. [47] Huynh Phu Khanh, Varshika M. Lakshmi, Paul Ankita, Isik Murat, Balaji Adarsha, and Das Anup. 2022. Implementing spiking neural networks on neuromorphic architectures: A review. arXiv:2202.08897 (2022).Google ScholarGoogle Scholar
  48. [48] Indiveri Giacomo. 2003. A low-power adaptive integrate-and-fire neuron circuit. In Proceedings of ISCAS.Google ScholarGoogle ScholarCross RefCross Ref
  49. [49] Jeong YeonJoo, Zidan Mohammed A., and Lu Wei D.. 2017. Parasitic effect analysis in memristor-array-based neuromorphic systems. IEEE Transactions on Nanotechnology 17, 1 (2017), 184–193.Google ScholarGoogle ScholarCross RefCross Ref
  50. [50] Ji Yu, Zhang Youhui, Li Shuangchen, Chi Ping, Jiang Cihang, Qu Peng, Xie Yuan, and Chen Wenguang. 2016. NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints. In Proceedings of MICRO.Google ScholarGoogle ScholarCross RefCross Ref
  51. [51] Kim Yongtae, Zhang Yong, and Li Peng. 2012. A digital neuromorphic VLSI architecture with memristor crossbar synaptic array for machine learning. In Proceedings of SOCC.Google ScholarGoogle ScholarCross RefCross Ref
  52. [52] Kim Yongtae, Zhang Yong, and Li Peng. 2015. A reconfigurable digital neuromorphic processor with memristive synaptic crossbar for cognitive computing. ACM Journal on Emerging Technologies 11, 4 (2015), Article 38, 25 pages.Google ScholarGoogle Scholar
  53. [53] Krestinskaya Olga, Irmanova Aidana, and James Alex Pappachen. 2019. Memristive non-idealities: Is there any practical implications for designing neural network chips? In Proceedings of ISCAS.Google ScholarGoogle ScholarCross RefCross Ref
  54. [54] Kundu Shamik, Basu Kanad, Sadi Mehdi, Titirsha Twisha, Song Shihao, Das Anup, and Guin Ujjwal. 2021. Special session: Reliability analysis for ML/AI hardware. In Proceedings of VTS.Google ScholarGoogle ScholarCross RefCross Ref
  55. [55] Le Minh and Truong Son Ngoc. 2021. Memristor crossbar circuits for neuromorphic pattern recognition. In Proceedings of ISOCC.Google ScholarGoogle ScholarCross RefCross Ref
  56. [56] Lee Donghyuk, Kim Yoongu, Seshadri Vivek, Liu Jamie, Subramanian Lavanya, and Mutlu Onur. 2013. Tiered-latency DRAM: A low latency and low cost DRAM architecture. In Proceedings of HPCA.Google ScholarGoogle Scholar
  57. [57] Li Tianjian, Bi Xiangyu, Jing Naifeng, Liang Xiaoyao, and Jiang Li. 2017. Sneak-path based test and diagnosis for 1R RRAM crossbar using voltage bias technique. In Proceedings of DAC.Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. [58] Li Yesheng and Ang Kah-Wee. 2021. Hardware implementation of neuromorphic computing using large-scale memristor crossbar arrays. Advanced Intelligent Systems 3, 1 (2021), Article 2000137, 26 pages.Google ScholarGoogle ScholarCross RefCross Ref
  59. [59] Liao C.-Y., Hsiang K.-Y., Hsieh F.-C., Chiang S.-H., Chang S.-H., Liu J.-H., Lou C.-F., et al. 2021. Multibit ferroelectric FET based on nonidentical double \(HfZrO_2\) for high-density nonvolatile memory. IEEE Electron Device Letters 42, 4 (2021), 617–620.Google ScholarGoogle ScholarCross RefCross Ref
  60. [60] Lin Chit-Kwan, Wild Andreas, Chinya Gautham N., Lin Tsung-Han, Davies Mike, and Wang Hong. 2018. Mapping spiking neural networks onto a manycore neuromorphic architecture. In Proceedings of PLDI.Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. [61] Liu Chenchen, Yan Bonan, Yang Chaofei, Song Linghao, Li Zheng, Liu Beiye, Chen Yiran, Li Hai, Wu Qing, and Jiang Hao. 2015. A spiking neuromorphic design with resistive crossbar. In Proceedings of DAC.Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. [62] Liu Xiaoxiao, Wen Wei, Qian Xuehai, Li Hai, and Chen Yiran. 2018. Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems. In Proceedings of ASP-DAC.Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. [63] Maass Wolfgang. 1997. Networks of spiking neurons: The third generation of neural network models. Neural Networks 10, 9 (1997), 1659–1671.Google ScholarGoogle ScholarCross RefCross Ref
  64. [64] Mallik A., Garbin D., Fantini A., Rodopoulos D., Degraeve R., Stuijt J., Das A. K., et al. 2017. Design-technology co-optimization for OxRRAM-based synaptic processing unit. In Proceedings of VLSIT.Google ScholarGoogle ScholarCross RefCross Ref
  65. [65] Mead Carver. 1990. Neuromorphic electronic systems. Proceedings of the IEEE 78, 10 (1990), 1629–1636.Google ScholarGoogle ScholarCross RefCross Ref
  66. [66] Moradi Saber, Qiao Ning, Stefanini Fabio, and Indiveri Giacomo. 2018. A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs). IEEE Transactions on Biomedical Circuits and Systems 12, 1 (2018), 106–122.Google ScholarGoogle ScholarCross RefCross Ref
  67. [67] Mutlu Onur. 2013. Memory scaling: A systems architecture perspective. In Proceedings of IMW.Google ScholarGoogle ScholarCross RefCross Ref
  68. [68] Mutlu Onur and Subramanian Lavanya. 2015. Research problems and opportunities in memory systems. Supercomputing Frontiers and Innovations 1, 3 (2015), 19–55.Google ScholarGoogle Scholar
  69. [69] Nukala Nishant S., Kulkarni Niranjan, and Vrudhula Sarma. 2014. Spintronic threshold logic array (STLA)–A compact, low leakage, non-volatile gate array architecture. Journal of Parallel and Distributed Computing 74, 6 (2014), 2452–2460.Google ScholarGoogle ScholarCross RefCross Ref
  70. [70] Paul Ankita, Song Shihao, and Das Anup. 2021. Design technology co-optimization for neuromorphic computing. In Proceedings of IGSC Workshops.Google ScholarGoogle ScholarCross RefCross Ref
  71. [71] Paul Ankita, Song Shihao, Titirsha Twisha, and Das Anup. 2022. On the mitigation of read disturbances in neuromorphic inference hardware. arXiv:2201.11527 (2022).Google ScholarGoogle Scholar
  72. [72] Rajendran Bipin, Sebastian Abu, Schmuker Michael, Srinivasa Narayan, and Eleftheriou Evangelos. 2019. Low-power neuromorphic hardware for signal processing applications: A review of architectural and system-level design approaches. IEEE Signal Processing Magazine 36, 6 (2019), 97–110.Google ScholarGoogle ScholarCross RefCross Ref
  73. [73] Rakka M., Fouda M. E., Kanj R., Eltawil Ahmed, and Kurdahi F. J.. 2020. Design exploration of sensing techniques in 2T-2R resistive ternary CAMs. IEEE Transactions on Circuits and Systems II: Express Briefs 68, 2 (2020), 762–766.Google ScholarGoogle ScholarCross RefCross Ref
  74. [74] Shim Wonbo, Luo Yandong, Seo Jae-Sun, and Yu Shimeng. 2020. Impact of read disturb on multilevel RRAM based inference engine: Experiments and model prediction. In Proceedings of IRPS.Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. [75] Song Shihao, Balaji Adarsha, Das Anup, Kandasamy Nagarajan, and Shackleford James. 2020. Compiling spiking neural networks to neuromorphic hardware. In Proceedings of LCTES.Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. [76] Song Shihao, Chong Harry, Balaji Adarsha, Das Anup, Shackleford James, and Kandasamy Nagarajan. 2021. DFSynthesizer: Dataflow-based synthesis of spiking neural networks to neuromorphic hardware. arXiv:2108.02023 (2021).Google ScholarGoogle Scholar
  77. [77] Song Shihao and Das Anup. 2020. A case for lifetime reliability-aware neuromorphic computing. In Proceedings of MWSCAS.Google ScholarGoogle ScholarCross RefCross Ref
  78. [78] Song Shihao and Das Anup. 2020. Design methodologies for reliable and energy-efficient PCM systems. In Proceedings of IGSC Workshops.Google ScholarGoogle ScholarCross RefCross Ref
  79. [79] Song Shihao, Das Anup, and Kandasamy Nagarajan. 2020. Exploiting inter- and intra-memory asymmetries for data mapping in hybrid tiered-memories. In Proceedings of ISMM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. [80] Song Shihao, Das Anup, and Kandasamy Nagarajan. 2020. Improving dependability of neuromorphic computing with non-volatile memory. In Proceedings of EDCC.Google ScholarGoogle ScholarCross RefCross Ref
  81. [81] Song Shihao, Das Anup, Mutlu Onur, and Kandasamy Nagarajan. 2019. Enabling and exploiting partition-level parallelism (PALP) in phase change memories. ACM Transactions on Embedded Computing Systems 18, 5s (2019), 1–25.Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. [82] Song Shihao, Das Anup, Mutlu Onur, and Kandasamy Nagarajan. 2020. Improving phase change memory performance with data content aware access. In Proceedings of ISMM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. [83] Song Shihao, Das Anup, Mutlu Onur, and Kandasamy Nagarajan. 2021. Aging-aware request scheduling for non-volatile main memory. In Proceedings of ASP-DAC.Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. [84] Song Shihao, Hanamshet Jui, Balaji Adarsha, Das Anup, Krichmar Jeff, Dutt Nikil, Kandasamy Nagarajan, and Catthoor Francky. 2021. Dynamic reliability management in neuromorphic computing. ACM Journal on Emerging Technologies in Computing Systems 17, 4 (2021), Article 63, 27 pages.Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. [85] Song Shihao, Mirtinti Lakshmi Varshika, Das Anup, and Kandasamy Nagarajan. 2021. A design flow for mapping spiking neural networks to many-core neuromorphic hardware. In Proceedings of ICCAD.Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. [86] Song Shihao, Titirsha Twisha, and Das Anup. 2021. Improving inference lifetime of neuromorphic systems via intelligent synapse mapping. In Proceedings of ASAP.Google ScholarGoogle ScholarCross RefCross Ref
  87. [87] Thomas Sherin A., Vohra Sahibia Kaur, Kumar Rahul, Sharma Rohit, and Das Devarshi Mrinal. 2021. Analysis of parasitics on CMOS based memristor crossbar array for neuromorphic systems. In Proceedings of MWSCAS.Google ScholarGoogle ScholarCross RefCross Ref
  88. [88] Titirsha Twisha and Das Anup. 2020. Reliability-performance trade-offs in neuromorphic computing. In Proceedings of IGSC Workshops.Google ScholarGoogle ScholarCross RefCross Ref
  89. [89] Titirsha Twisha and Das Anup. 2020. Thermal-aware compilation of spiking neural networks to neuromorphic hardware. In Proceedings of LCPC.Google ScholarGoogle Scholar
  90. [90] Titirsha Twisha, Song Shihao, Balaji Adarsha, and Das Anup. 2021. On the role of system software in energy management of neuromorphic computing. In Proceedings of CF.Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. [91] Titirsha Twisha, Song Shihao, Das Anup, Krichmar Jeffrey, Dutt Nikil, Kandasamy Nagarajan, and Catthoor Francky. 2021. Endurance-aware mapping of spiking neural networks to neuromorphic hardware. IEEE Transactions on Parallel and Distributed Systems 33, 2 (2021), 288–301.Google ScholarGoogle ScholarDigital LibraryDigital Library
  92. [92] Tuli Shikhar, Rios Marco, Levisse Alexandre, and Atienza David. 2020. RRAM-VAC: A variability-aware controller for RRAM-based memory architectures. In Proceedings of ASP-DAC.Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. [93] Varshika M. Lakshmi, Balaji Adarsha, Corradi Federico, Das Anup, Stuijt Jan, and Catthoor Francky. 2022. Design of many-core big little \(\mu\)Brains for energy-efficient embedded neuromorphic computing. In Proceedings of DATE.Google ScholarGoogle Scholar
  94. [94] Wang Zhehui, Zhang Huaipeng, Luo Tao, Wong Weng-Fai, Do Anh Tuan, Vishnu Paramasivam, Zhang Wei, and Goh Rick Siow Mong. 2020. NCPower: Power modelling for NVM-based neuromorphic chip. In Proceedings of ICONS.Google ScholarGoogle ScholarDigital LibraryDigital Library
  95. [95] Wijesinghe Parami, Ankit Aayush, Sengupta Abhronil, and Roy Kaushik. 2018. An all-memristor deep spiking neural computing system: A step toward realizing the low-power stochastic brain. IEEE Transactions on Emerging Topics in Computational Intelligence 2, 5 (2018), 345–358.Google ScholarGoogle ScholarCross RefCross Ref
  96. [96] Xu Cong, Dong Xiangyu, Jouppi Norman P., and Xie Yuan. 2011. Design implications of memristor-based RRAM cross-point structures. In Proceedings of DATE.Google ScholarGoogle Scholar
  97. [97] Xue Cheng-Xin, Chen Wei-Hao, Liu Je-Syu, Li Jia-Fang, Lin Wei-Yu, Lin Wei-En, Wang Jing-Hong, et al. 2019. 24.1 a 1Mb multibit ReRAM computing-in-memory macro with 14.6 ns parallel MAC computing time for CNN based AI edge processors. In Proceedings of ISSCC.Google ScholarGoogle Scholar
  98. [98] Young Steven R., Devineni Pravallika, Parsa Maryam, Johnston J. Travis, Kay Bill, Patton Robert M., Schuman Catherine D., Rose Derek C., and Potok Thomas E.. 2019. Evolving energy efficient convolutional neural networks. In Proceedings of Big Data.Google ScholarGoogle ScholarCross RefCross Ref
  99. [99] Yu Shimeng, Deng Yexin, Gao Bin, Huang Peng, Chen Bing, Liu Xiaoyan, Kang Jinfeng, Chen Hong-Yu, Jiang Zizhen, and Wong H.-S. Philip. 2014. Design guidelines for 3D RRAM cross-point architecture. In Proceedings of ISCAS.Google ScholarGoogle ScholarCross RefCross Ref
  100. [100] Zhang Xinjiang, Huang Anping, Hu Qi, Xiao Zhisong, and Chu Paul K.. 2018. Neuromorphic computing with memristor crossbar. Physica Status Solidi (a) 215, 13 (2018), Article 1700875.Google ScholarGoogle ScholarCross RefCross Ref
  101. [101] Zhao Wei and Cao Yu. 2007. Predictive technology model for nano-CMOS design exploration. ACM Journal on Emerging Technologies in Computing Systems3, 1 (2007), 1–es.Google ScholarGoogle Scholar
  102. [102] Zhu Zhenhua, Lin Jilan, Cheng Ming, Xia Lixue, Sun Hanbo, Chen Xiaoming, Wang Yu, and Yang Huazhong. 2018. Mixed size crossbar based RRAM CNN accelerator with overlapped mapping method. In Proceedings of ICCAD.Google ScholarGoogle ScholarDigital LibraryDigital Library

Published in

ACM Transactions on Embedded Computing Systems, Volume 21, Issue 6
November 2022, 498 pages
ISSN: 1539-9087, EISSN: 1558-3465
DOI: 10.1145/3561948
Editor: Tulika Mitra

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

              Publication History

              • Published: 12 December 2022
              • Online AM: 21 March 2022
              • Accepted: 3 March 2022
              • Revised: 21 February 2022
              • Received: 15 July 2021
Published in TECS Volume 21, Issue 6

              Qualifiers

              • research-article
              • Refereed
