EPHA: An Energy-efficient Parallel Hybrid Architecture for ANNs and SNNs

Artificial neural networks (ANNs) and spiking neural networks (SNNs) are two general approaches to achieving artificial intelligence (AI). The former have been widely used in academia and industry; the latter are more similar to biological neural networks and can realize ultra-low power consumption, and have thus received widespread research attention. However, due to their fundamental differences in computation formulas and information coding, the two approaches often require different and incompatible platforms. Alongside the development of AI, a general platform that can support both ANNs and SNNs is necessary. Moreover, there are some similarities between ANNs and SNNs, which leaves room to deploy different networks on the same architecture. However, there is little related research on this topic. Accordingly, this article presents an energy-efficient, scalable, and non-Von Neumann architecture (EPHA) for ANNs and SNNs. Our study combines device-, circuit-, architecture-, and algorithm-level innovations to achieve a parallel architecture with ultra-low power consumption. We use the compensated ferrimagnet to act as both synapses and neurons to store weights and perform dot-product operations, respectively. Moreover, we propose a novel computing flow to reduce the operations across multiple crossbar arrays, which enables our design to conduct large and complex tasks. On a suite of ANN and SNN workloads, the EPHA is 1.6× more power-efficient than a state-of-the-art design, NEBULA, in the ANN mode. In the SNN mode, our design is 4 orders of magnitude more power-efficient than the Loihi.


Moreover, the NEBULA is not an ideal unified architecture that meets AI requirements. All of these factors significantly limit its application and popularization. Therefore, it is necessary to design a new emerging-device-based architecture that can deploy both ANNs and SNNs.
Inspired by this observation, we propose a novel non-Von Neumann architecture for both ANNs and SNNs, the Energy-efficient Parallel Hybrid Architecture (EPHA). The main contributions of our study are as follows:
- We use the compensated ferrimagnetic device to design a special crossbar-based computing circuit. Based on the verification of Beyond-CMOS benchmarking, we show that ferrimagnetic-based neurons and synapses are superior in terms of energy consumption and computational delay.
- We design the neuro-synaptic array to be shared by the ANN and SNN modes, while peripheral circuits can be reconfigured independently to support different modes. To the best of our knowledge, this is the first emerging-device-based architecture that can support both ANNs and SNNs in the same neuro-synaptic array.
- We propose a layer-wise mapping scheme to flexibly assign the computations of a vast network to different crossbar arrays, reducing cross-NC operations and realizing higher computation parallelism.
- We deploy several ANNs and SNNs on the EPHA and perform an in-depth study of power, area, and computational efficiency. Our evaluation results show that our design in the ANN mode is over 1.6× better than the NEBULA in power efficiency. In the SNN mode, the EPHA is 4 orders of magnitude more power-efficient than the Loihi.

BACKGROUND
This section provides an overview of the primitives of the ANN and SNN, the architecture of the NEBULA, and the functional characteristics of the compensated ferrimagnet device.

The Similarities between ANN and SNN Primitives
The core primitives in a neural network are neurons and the associated synapses. We first analyze the computing framework of ANNs. In a conventional ANN with multiple layers, the weight matrix W^l constructs the connection between two adjacent layers, with biases b^l. Synapses in layer l receive activations from the previous layer and perform the weighted sum of the activations with the corresponding synaptic weights. Neurons in layer l receive the result of the weighted sum and apply the nonlinear transfer function to obtain the activation value. The rectified linear unit (ReLU) activation of neuron i in layer l is computed as in Equation (1):

a_i^l = max(0, Σ_j W_ij^l · a_j^(l-1) + b_i^l).    (1)

Here, a_j^(l-1) is the activation output from the previous layer. Different from ANNs, the formulation of SNNs draws inspiration from biological neurons. The SNN processes information by using sparse spike trains. Synapses of the SNN perform the weighted sum of the synaptic weights W_j with the input current i_j(t) at the current timestep t. Neurons of the SNN receive the output of the associated synapses and memorize the historical membrane potential. We use the simple Linear-Integrate-Fire (LIF) framework as the spiking neuron model, which is represented by Equation (2):

u(t) = u(t-1) + Σ_j W_j · i_j(t).    (2)

Here, u(t) is the neuron membrane potential at timestep t. Neurons create spikes to communicate information to subsequent layers if the membrane potential exceeds a given threshold potential, and then reset to the resting potential until receiving new stimuli. Inference using LIF neurons is based on a rate-encoding framework to store and process the spike information in a modern computing system. The SNN can realize a higher degree of bio-fidelity due to rich spatiotemporal dynamics, the event-based mode of operation, and various learning rules [17]. Moreover, the SNN can potentially offer an efficient way of doing inference due to sparse spike trains and event-driven computations.
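To make the two primitives concrete, the following minimal NumPy sketch implements Equations (1) and (2). The function names, array shapes, and reset-to-rest behavior are illustrative assumptions for this sketch, not the EPHA circuit itself:

```python
import numpy as np

def ann_neuron(a_prev, W, b):
    """ReLU activation of one ANN layer, as in Equation (1)."""
    return np.maximum(0.0, W @ a_prev + b)

def lif_step(u, W, i_t, v_thr, u_rest=0.0):
    """One LIF timestep, as in Equation (2): integrate the weighted
    input current into the membrane potential, fire where the
    threshold is exceeded, then reset fired neurons to rest."""
    u = u + W @ i_t                    # integrate
    spikes = u >= v_thr                # fire
    u = np.where(spikes, u_rest, u)    # reset
    return u, spikes.astype(np.float64)
```

Both primitives reduce to a weighted sum followed by a simple per-neuron nonlinearity, which is exactly the structural overlap the next paragraph exploits.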
Even though ANNs and SNNs have fundamental differences in their computation formulas, there are still some similarities between the two approaches. Specifically, as shown in Figure 1(a), both ANN and SNN synapses perform a weighted calculation. Even though they are implemented differently in digital circuits, they are the same in the analog domain according to Kirchhoff's law [8]. Moreover, neurons in both ANNs and SNNs receive output from the associated synapses and integrate it. These similarities leave room for deploying different networks on the same architecture. To deploy ANNs and SNNs on the same architecture, we need to design an emerging-device-based synaptic array to perform the dot-product operation and design the neuron to perform the nonlinear activation in the ANN mode or the membrane-potential integration in the SNN mode.

The NEBULA Architecture
The NEBULA [67] is the first and one of the most influential holistic architecture designs for both ANNs and SNNs with spintronic device-based neurons and synapses. As shown in Figure 1(c), the NEBULA is mainly composed of ANN/SNN neural cores (A_NC/S_NC) and accumulator units (AUs).
The MTJ, as the basic building block, is composed of two nanomagnets with a sandwiched insulator layer. The domain wall-based MTJ adds an elongated domain wall layer between two magnetic domains. The programming current flows through the heavy metal layer (HML), resulting in the continuous movement of the domain wall. The movement of the domain wall results in a continuous conductance change of the device, allowing multiple resistive states to be encoded. Since the domain wall (DW) MTJ only shows linear device behavior, the NEBULA has to modify the device structure to realize different neuron functions (shown in Figures 1(b) and (d)). The SNN NC uses a spiking neuron to integrate and fire a binary spike train, while the ANN NC uses a non-spiking neuron to output continuous values. As shown in Figure 1(b), the integrate-fire spiking neuron comprises the reference MTJ and the neuron MTJ. The input spike flows through the HML and results in the continuous movement of the domain wall of the MTJ. The displacement of the DW is proportional to the incoming current pulses through the heavy metal, which provides an intrinsic correspondence to the integrate-and-fire neuron function. The change in conductance of the MTJ corresponds to the membrane-potential accumulation in the SNN. As a result, the conductance, as the membrane potential, is stored in the domain wall layer of the MTJ. The neuron MTJ is situated at one edge of the ferromagnet. The output of the inverter is driven HIGH to generate a spike when the DW reaches the right edge of the magnet [61]. Similarly, the non-spiking neuron, as shown in Figure 1(d), comprises the reference MTJ and a transistor acting as a Saturating Rectified Linear unit in the A_NC. The non-spiking neuron receives the input current and changes the domain wall position. A higher current leads to a higher displacement of the domain wall. The reference MTJ serves to produce a resistive divider network such that the gate voltage of the PMOS transistor decreases with an increase in the magnitude of the input current. The output current is proportional to the input programming current.
In this way, the A_NC and S_NC are distinct and incompatible platforms corresponding to different networks. The MTJ-based synapse arrays [2,29] in different types of NCs cannot be shared even though they have the same structure, which limits the flexibility of a NEBULA chip. Since the crossbar array takes the largest proportion of the power and area in an NC (the corresponding proportions are 73% and 86% in the ANN NC and 87% and 98% in the SNN NC), this leads to a huge waste of hardware resources and reduces the energy efficiency per unit area. In addition, the NEBULA shows the great potential of the SNN-ANN hybrid structure to trade off between performance and energy efficiency. Specifically, the NEBULA uses the ANN NC and SNN NC to deploy the ANN layers and SNN layers of a hybrid network, respectively. We can control the delay and energy consumption by controlling the number of ANN layers in the hybrid network. This shows the unique advantage of the SNN-ANN hybrid structure. However, since the NEBULA uses different types of NCs to deploy ANN/SNN layers, the proportion of ANN NCs and SNN NCs needs to be weighed according to specific requirements. Moreover, when one type of NC is working, the other type is in a waiting state, leading to low hardware utilization. These issues motivate the study of an architecture that can deploy both ANNs and SNNs. Furthermore, the structure of specialized neurons based on MTJs is intricate due to their nanoscale dimensions and sensitivity to fabrication [62]. Additionally, the movement of domain walls in conventional DW-MTJs typically occurs at the millisecond level [31,78,86], which is 10^4× slower than CMOS. Finally, the computing flow in the NEBULA leads to a large number of computations across multiple NCs, which increases the data movement and control logic (detailed in Section 4.4). These problems inspire us to design a novel spin-based hybrid architecture for SNNs and ANNs.

The Ferrimagnetic Device
Recently, spintronic devices based on the spin-orbit torque (SOT) have emerged as candidates for neuromorphic computing due to their high energy efficiency and non-volatility. Magnetic devices are important representatives of spintronic devices, including ferromagnets (FM), ferrimagnets (FIM), and antiferromagnets (AFM). However, some devices have intrinsic limitations for neuromorphic computing due to their inherent attributes. For example, because the frequency of magnetization dynamics is in the gigahertz range, the ferromagnet needs a high current density to realize a fast system. This means that we need to trade off between the operation speed and energy efficiency of the FM. Different from the FM, the antiferromagnet can offer faster operation with a lower current due to its fast spin dynamics. However, it is difficult to detect the magnetization, as the AFM is not magnetic to the outside.
Several previous studies [67,85] have reported that the ferrimagnetic CoGd device, as a representative of ferrimagnetic devices, exhibits superior performance and higher energy efficiency. Moreover, the CoGd film shows great potential to address the limitations of the DW-MTJ used in the NEBULA for the following reasons: First, the compensated film based on the SOT exhibits different device behaviors, which brings an opportunity to use the compensated FIM to simulate synaptic and neuronal functions in different modes. Second, as shown in Figure 2(a), the compensated FIM CoGd is a rare-earth-transition-metal alloy with a cubic structure. Compared with the MTJ, the FIM structure is simpler. Third, the switching time of FIM films can be reduced to the subnanosecond level [3,41], as the antiferromagnetically coupled Co-Gd links accelerate the spin momentum transfer. Finally, the FIM film can realize 10^6× higher writing efficiency than its FM-based counterpart [6,46] and is one to two orders of magnitude more energy-efficient than the AFM [42,61,70]. Therefore, the compensated FIM has great potential to achieve a highly efficient neuromorphic computing system.

The Compensated FIM-based Synapse and Neuron
As shown in Figure 2(b), the Anomalous Hall Effect (AHE) resistance curve of the compensated FIM can be divided into the linear region and the SIGMOID region (similar to the SIGMOID activation function in the ANN). Our verification process indicates that FIM devices exhibit distinct characteristics when subjected to varying pulse intensities (as discussed in Section 6.1). This property highlights the enormous potential of the FIM for neural network applications, as it can simulate multiple functions to cater to their diverse needs. As discussed in Section 2.1, there are similarities between ANNs and SNNs, which allow us to utilize the resistance characteristic of the FIM for the computations common to ANNs and SNNs.
Specifically, as shown in Figure 2(c), the AHE conductance (G) of the SOT-based switching, acting as the synaptic weight, is adjusted by the programming voltage. After programming the conductances of the FIM-based cross-points, the input voltage, as the activation data, is applied to the corresponding synapse through the word line (WL). The output current of each cell in the same column, as the result of the dot-product operation, is transmitted to the compensated FIM neuron through the bit line (BL) and changes its AHE resistance. In this way, we can reduce the computational complexity of the dot-product operation from O(n^2) to O(1) [64] according to Kirchhoff's law, i.e., I = V × G, by using the linear region of the FIM resistance. The difference between the ANN and SNN modes is the AHE resistance region used by the neurons. In the ANN mode, we can effortlessly achieve the nonlinear activation function of ANNs by utilizing the SIGMOID region of the compensated FIM. In the SNN mode, the neurons can additionally carry out the membrane-potential integration by utilizing the linear region of the FIM resistance.
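A minimal sketch of this analog dot product follows. The conductance range and the linear weight-to-conductance mapping are assumptions chosen for illustration, not calibrated device values:

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 8e-6   # hypothetical AHE conductance range (S)

def weights_to_conductances(W):
    """Linearly map a weight matrix into the linear region of the
    FIM conductance range (illustrative, uncalibrated mapping)."""
    w_min, w_max = W.min(), W.max()
    return G_MIN + (W - w_min) / (w_max - w_min) * (G_MAX - G_MIN)

def crossbar_dot(v_in, G):
    """Kirchhoff's law, I = V x G: each column current is the sum of
    v_i * G_ij, so the whole matrix-vector product finishes in one
    read step rather than O(n^2) multiply-accumulates."""
    return v_in @ G    # v_in: (rows,), G: (rows, cols) -> (cols,)
```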
Even though the compensated FIM has been proven to be energy-efficient for neuromorphic computing, previous studies [3,41] only focus on the compensated FIM properties or use simple applications to verify the device function, without delving deeply into its application to achieving AI. There is a lack of a holistic compensated-FIM-based neuromorphic system including device-, architecture-, and algorithm-level exploration. At the same time, several factors also need to be considered in the design of a neuromorphic system. First, the neuromorphic system needs to minimize the conversions between digital and analog signals to decrease power consumption; second, it needs a novel computation flow to decrease the computation across multiple arrays with great flexibility and scalability to support vast and complex neural networks; third, since fast and efficient real-time processing is essential for neural networks, it needs to realize high computation parallelism to reduce latency.

RELATED WORK
In recent years, many researchers [27,58,68,74,76] have designed ANN hardware architectures for higher performance and better energy efficiency. Some earlier research [11,20,75,84] aimed at optimizing the algorithms and dataflow modes in ANN accelerators to reduce power consumption. To avoid the memory wall, the ISAAC [64] and Zhang et al. [83] use memristor-based crossbar arrays [34,47,77] to improve computation throughput and energy efficiency. However, these studies are limited to deploying ANNs and do not support SNNs. To reduce power consumption, some studies have focused on emulating multi-layer bio-inspired SNNs on software or hardware platforms. Some neuromorphic platforms, including the TrueNorth [12,45], Neurogrid [1], and Loihi [14], can efficiently simulate a wide variety of large-scale SNNs. In summary, these designs can only support SNNs or ANNs, which cannot meet the requirements of flexible AI deployment. In this context, some general platforms have been proposed to support different modes and neural networks. Loihi 2 [13] and SpiNNaker 2 [44] provide support for developing hybrid models. However, these designs use different functional units to perform different operations. In other words, they do not process both ANNs and SNNs in the same neuro-synaptic array. The TianJic [55] proposes a multi-core architecture to support both ANNs and SNNs by integrating reconfigurable building blocks and a streamlined dataflow with hybrid coding schemes. Even so, it cannot overcome the energy consumption bottleneck of computing and data transmission. The NEBULA [67] is a novel design that combines ANN-SNN support with MTJ-based neurons and synapses, resulting in superior energy efficiency compared to previous designs. However, due to the distinct and incompatible nature of its different NCs, the NEBULA cannot fulfill the requirements of AI development. Additionally, Section 2.2 highlights several limitations of the NEBULA in terms of delay and energy efficiency. Our study maintains the low-power benefits while reducing computation across multiple arrays. Moreover, our design enables the deployment of both ANNs and SNNs in the same neuro-synaptic array and reduces computations across NCs.

OVERVIEW OF THE EPHA DESIGN

Overall Design
The numerous differences between ANNs and SNNs (such as data representation, storage, and computing concepts) bring great challenges to deploying different types of networks on the same architecture. As complex problems require the collaboration of different networks [30], an architecture capable of accommodating different types of networks is imperative and essential. To this end, we propose the Energy-efficient Parallel Hybrid Architecture (EPHA) for ANNs and SNNs. Our approach leverages the commonalities between different networks and the multiple device behaviors of the compensated FIM.
We first introduce the overview of the EPHA architecture, followed by discussions of its different components. Figure 3(a) illustrates the overall structure of the EPHA. The EPHA consists of four building blocks: the axon, the ANN/SNN NCs (A/S), the soma, and the router. To deploy different networks in the same neuro-synaptic array, the ANN/SNN NC is shared to perform the calculation for both ANNs and SNNs, while the soma can be reconfigured independently to perform the remaining operations.
The ANN/SNN NC, as the basic building block of the EPHA, is composed of 4 crossbar-based tiles, an ADC, a number of DACs/SNN drivers, and an input/output buffer. To amortize the energy consumption of the peripheral circuits, each EPHA chip has 6 ANN/SNN NCs, and each NC has 16 crossbar arrays (Figure 3(b)). We set the crossbar array size to 128 × 128, although the crossbar can be scaled to a larger size. A larger crossbar array can potentially increase the density of the synaptic arrays and allow for a larger receptive field size. However, it also means that more rows in the array need to be activated in parallel, thereby generating a larger output current in the SLs, which requires a higher driving voltage and consumes more energy [59,67]. Conversely, a smaller array size can reduce energy consumption while achieving higher hardware utilization. Nonetheless, reducing the size of the synaptic array results in more calculations across multiple arrays, leading to additional data movement and calculation delay due to the large number of output/input channels. Previous studies [64,67] illustrate that the ADC/DAC consumes a significant portion of power. As shown in Figure 3(d), the EPHA uses hierarchical neuron units to amortize the high power consumption of the ADCs. Each ANN/SNN NC has 23 neuron arrays corresponding to four hierarchy levels (H0-H3), which is similar to the hierarchical neuron design in the NEBULA [67]. The ANN/SNN NC is capable of performing the computations of both ANNs and SNNs with the assistance of the crossbar peripherals. This is largely attributed to the FIM-based neuro-synaptic array, which can execute the dot-product operation on continuous values in the ANN mode (as described in Section 2.4) and can additionally carry out the weighted calculation and membrane-potential integration of digital spikes in the SNN mode.
More specifically, in the ANN mode of an ANN/SNN NC, the weight is converted into the corresponding analog programming voltage to adjust the conductance of the synapse with the help of digital-to-analog converters (DACs). Then, the input feature map is converted into the corresponding analog voltage and weighted by the conductance programmed at each synapse. The output current of the synapses accumulates in the neuron array, modulates the neuron output, and completes the dot-product operation. In the SNN mode of an ANN/SNN NC, the weight is also programmed as the conductance of the synapse by the programming input voltage with the help of the SNN driver. Then, the binary spike sequence is converted into the corresponding analog voltage and weighted by the corresponding synaptic device. The output current of each synapse in the same column is transmitted to the corresponding neuron and changes its AHE resistance. Due to the non-volatile nature of the FIM device, the FIM neuron can realize the integration and memorization of the membrane potential.
The biological axon [82], as the output channel of the neuron, is used to transmit nerve impulses. Similar to the biological axon, the axon block in the EPHA is primarily responsible for buffering the input/output data and encoding/decoding the data. The biological soma [82], as the information processing center, responds to received nerve impulses and conducts nerve impulses. Similarly, the soma in the EPHA is designed to receive and process data from the ANN/SNN NC. The soma is composed of six accumulation units, a spike generator with threshold comparison, and a max-pooling function block. Thus, the soma can be reconfigured either as a threshold comparator and spike generator in the SNN mode, or as a pooling function block in the ANN mode. The router can transmit data between multiple EPHAs, which enables us to build a highly parallel architecture to deploy large or composite neural networks. In the following sections, we explain the details of each component of the EPHA.

The CoGd-FIM-based Synaptic Crossbar Array and Neuron Array for ANNs and SNNs

As shown in Figure 3(c), a neuron and its associated synapses constitute the basic functional unit of a neural network. Each input is weighted by a synapse, and the result is then sent to the neuron for accumulation. A set of such layers forms a highly parallel structure. The most compact and simplest way to form this structure is the crossbar array, as shown in Figure 3(d). In the crossbar array structure, each weight is encoded by the synaptic AHE resistance at a cross-point, while the input vector is encoded by an analog current/voltage signal. Unlike conventional memristor devices, the CoGd-FIM is a four-terminal device; it also faces the sneak-path problem [36] and increased energy consumption during weight updating [8]. As shown in Figure 3(e), we design a special CoGd-FIM-based crossbar to solve these problems.
First, we add two selection transistors in series with the FIM-based device, forming the two-transistor one-resistor (2T1R) array architecture. The transistors can be viewed as switches that isolate different cells and avoid the sneak-path problem of the unselected cells. Meanwhile, in this way, no voltage is applied to the unselected rows, which avoids write disturbance on the unselected cells and also reduces the energy consumption during weight updates.
Second, we rotate the bit line (BL) by 90° to supply the activations in the form of analog currents along the BLs in parallel, which is similar to the structure of a pseudo-crossbar array. When used in combination with the special mapping method (detailed in Section 4.4), this design helps us maximize the data reuse of weights and activations.
Third, we divide the word line into the write word line (WWL) and the read word line (RWL) to control the weight update and the dot-product computation, respectively. This does not increase the control difficulty but makes it more convenient to perform the weight update operation and the dot-product computation. In the weight update operation, all RWLs are turned off, and only the WWL corresponding to the selected row is turned on. Then, the cells on the selected row are driven with the programming pulses from the source lines (SLs) in parallel. In the dot-product computation, all FIM-based cross-points are transparent when all WWLs are turned off and all RWLs are turned on. The input vector currents are weighted by the conductance values programmed at each cross-point along the BLs, and the output currents, as the results of the multiplications, are summed along the SLs. The total current along the SLs flows to the corresponding neurons and changes their AHE resistance to modulate the outputs. Thereby, we effectively implement the parallel dot-product and partial-sum accumulation operations by using simple Kirchhoff's law. We can also use the AHE resistance characteristics of the CoGd-based neurons to implement the SIGMOID activation function. Our verification results indicate that FIM devices exhibit distinct characteristics when subjected to varying pulse intensities (as discussed in Section 6.1). This property aligns with the functional requirements of different activation functions in neural networks, consequently enhancing the flexibility of the EPHA.
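The two phases can be summarized in a behavioral sketch of the 2T1R array; the device physics is abstracted away, and the class and method names are illustrative assumptions:

```python
import numpy as np

class Crossbar2T1R:
    """Behavioral sketch of the 2T1R array described above."""

    def __init__(self, rows, cols):
        self.G = np.zeros((rows, cols))  # programmed AHE conductances

    def write_row(self, row, g_values):
        # Weight update phase: all RWLs off, only the selected WWL on;
        # the row is programmed in parallel from the SLs, and the
        # unselected rows see no voltage (no write disturbance).
        self.G[row, :] = g_values

    def dot_product(self, v_in):
        # Dot-product phase: all WWLs off, all RWLs on; inputs drive
        # the rotated BLs and partial sums accumulate along the SLs.
        return v_in @ self.G
```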
Compared to CMOS-based neuromorphic structures, CoGd-FIM-based devices can simulate the functions of synapses and neurons without the need to fetch synaptic weights from an SRAM bank to a computing unit. Additionally, this approach provides an efficient means of achieving compact and area-efficient neuro-inspired hardware through dot-product computing. Nevertheless, due to the non-volatile nature of the AHE resistance, it is necessary to reset the AHE resistance at the end of the timestep set in the controller, which has minimal impact on performance. Compared with a memristor crossbar-based architecture, the FIM-based crossbar has the following advantages. First, the FIM has rich properties and can be engineered to emulate both the synaptic and neuronal functions of neural networks. Second, the FIM uses the electron spin instead of charge to carry information, which allows a lower terminal voltage and lower energy consumption. Third, the FIM has more intermediate resistive states, enabling a more compact and energy-efficient neuromorphic computing engine.

Crossbar Peripherals for ANNs and SNNs
A neural network consists of multiple layers, such as convolutional layers, rectified linear units, pooling layers, and fully connected layers. Even though FIM-based crossbar arrays are effective at performing multiple parallel dot-product operations, a full-fledged crossbar-based CNN accelerator must integrate several digital components to complete the remaining functions. Figure 4 shows the implementation diagrams for an ANN or SNN with the cooperation of the different parts of the EPHA.
As shown in Figure 4(a), after the axon buffers the input/output data from the eDRAM, the decoder converts the activation to the ANN or SNN format, depending on the configuration, and transforms the weight into an analog voltage while programming the AHE resistance of the CoGd-FIM-based synapses via the WWLs and SLs. Similarly, the DACs convert the activation into an analog voltage and supply it to the FIM-based synapses via the RWLs and BLs (see Figures 4(b) and (c)). As shown in Figure 4(d), after the final result is output from the neurons, the soma is reconfigured for the threshold comparison (SNN mode) or a pooling operation (ANN mode). In this way, the crossbar is only used to complete the dot-product and partial-sum accumulation. Therefore, the crossbar arrays can be shared by different modes.
In addition, the EPHA can easily deploy combined ANN-SNN hybrid networks, such as ANN-input-SNN-output or SNN-input-ANN-output, by configuring the crossbar peripherals. This is similar to the cross-paradigm scheme mentioned in the TianJic [55]. The router contains a reconfigurable routing table to support an arbitrary topology connecting NCs inside or outside the EPHA, which allows the EPHA to support complex neural networks efficiently. Previous studies [64,67] illustrate that the ADC/DAC consumes a significant portion of power. Therefore, to minimize the overall power consumption, we use the SNN driver and the spike generator instead of DACs/ADCs to realize the conversion between digital and analog signals in the SNN mode, which helps the EPHA achieve lower power consumption. The spike generator is composed of a comparator that quantizes analog signals into two digital levels.
In the EPHA, the area and power consumption of the peripheral circuits account for 19% and 9.7%, respectively. Compared with the NEBULA, the EPHA has a larger peripheral circuit size. This is mainly because the EPHA adds supporting digital units (such as the MaxPool and the Spike Generator) and improves the resolution of the ADCs. The area and power consumption of the peripheral circuits are acceptable for deploying both ANNs and SNNs on the same crossbar. We conduct a detailed analysis of the power/area consumption in Section 5.2.

Computing Flow for Convolutions with Different Structures
Different applications require different CNNs with varying structures. Further, the structure of the convolutions in one CNN also varies layer by layer. Thus, it is crucial for designs to support varying-structure CNNs.
The NEBULA employs a unique computation flow that utilizes a hierarchical neuron design, similar to the structure of adder trees, to support different convolutions. Due to the non-volatility of the new devices, the weight data is fixed on the cross-points. Consequently, the primary energy consumption for data movement is on the activation data. However, the NEBULA does not analyze activation reuse or parallel computation. Moreover, as shown in the dotted box of Figure 5, the NEBULA uses the kernel size (R_f = k × k × C1) as the mapping condition, which limits the flexibility of the hierarchical neuron design (corresponding to H0-H3). Since the number of channels in a kernel is close to or even larger than the array size, the NEBULA requires a high hierarchical level of the neuron unit to participate in the calculation. It may even require computations across multiple NCs. Even though the hierarchical neuron architecture reduces the energy consumption of the partial sums in the current domain, this increases the energy consumption and the complexity of the controller. As shown in Table 1, the NEBULA needs a high hierarchical level of the neuron unit to participate in the calculation when deploying several popular neural networks (H4 stands for computation across multiple NCs).
Inspired by these observations, we propose a novel mapping method for varying convolution structures in the EPHA. In the weight update operation, as shown in step 1 of Figure 5, we split each of the C2 filters into k × k sub-filters of size 1 × 1 × C1. Then, we group the sub-filters according to their spatial location in each filter (as shown in step 2 of Figure 5), which can be achieved by operations similar to matrix transposition. Since the channel dimension of the kernels can be huge, the sub-filters are further divided and mapped into different sub-matrices according to the size of the crossbar array (step 3 of Figure 5). The convolution, whatever its dimension sizes, is thus decomposed into sub-matrices of the same size, which is friendly to the hardware design. As shown in step 4 of Figure 5, after splitting and grouping, the sub-matrices are allocated to different crossbars depending on the size of the channel (C1) and filter (k × k). The weights in each row of a crossbar array come from the same group and conduct the dot-product operations with the same activations along the BLs. This is the reason why we rotate the BLs by 90°. In this way, the activation is shared by the synaptic weights in the same row. This means that we change the mapping condition from the kernel size (R_f = k × k × C1) to the channel size (C1). After the dot-product operations, we activate the neuron level (H0-H3) according to step 3 of Figure 5 to accumulate the partial sums of the sub-matrices from different crossbar arrays and then obtain the final output feature map (OFM). It should be noted that the resulting voltage of the sliced convolutions is accumulated by the neuron array according to Kirchhoff's voltage law to reduce the energy consumption of the ADC.
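The splitting and grouping steps reduce to a reshaping of the kernel tensor; the following minimal sketch illustrates steps 1-3 under the stated dimensions, with the tile bookkeeping being an assumption of this sketch:

```python
import numpy as np

def map_conv_to_crossbars(W, M=128):
    """Sketch of steps 1-3 for a conv kernel W of shape (k, k, C1, C2):
    split each filter into k*k sub-filters of size 1 x 1 x C1, group
    them by spatial location, then slice each group along C1 into
    sub-matrices of at most M rows (the crossbar height)."""
    k, _, C1, C2 = W.shape
    tiles = []
    for x in range(k):                 # spatial location (steps 1-2)
        for y in range(k):
            group = W[x, y]            # shape (C1, C2): one group
            for r0 in range(0, C1, M): # slice along C1 (step 3)
                tiles.append(((x, y, r0), group[r0:r0 + M, :]))
    return tiles
```

Because every tile row within a group multiplies the same activation, rotating the BLs lets one activation broadcast across a whole crossbar row, which is the data-reuse property exploited above.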
Based on the neuron hierarchy-level design, we can accumulate the partial sums in the analog domain at low energy cost. However, since the mapping method differs from the NEBULA's, we also need to adjust the mapping conditions. For C1 ≤ M, the sub-filter size (1 × 1 × C1) is smaller than the size of the crossbar array, which means we do not have to generate sub-matrices. Neurons at hierarchy level H0 are activated to complete the computation, while neurons at the other levels are turned off. For M ≤ C1 ≤ 4M, we need two crossbar arrays to participate in the calculation, as shown in the green box of Figure 5, which means neurons at level H1 are activated to sum the currents that come from neurons at level H0. Similarly, if 4M ≤ C1 ≤ 8M, then we activate neurons at level H2 to sum the currents flowing from neurons at level H1. In this way, we can schedule a crossbar array group of size 4M × 8M (shown in the yellow box of Figure 5). If 8M ≤ C1 ≤ 16M, then we need to activate neurons at level H3 to schedule a tile. Notice that if 16M ≤ C1, then computations are conducted across multiple ANN/SNN NCs. In this case, the accumulation of the final OFM is augmented with the help of the router and the AU.
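These conditions amount to a simple threshold test on the channel size; a sketch mirroring them (with "H4" standing in for accumulation across multiple NCs via the router/AU, as in Table 1):

```python
def hierarchy_level(C1, M=128):
    """Pick the lowest neuron hierarchy level that can accumulate
    the partial sums for channel size C1, per the thresholds above."""
    if C1 <= M:
        return "H0"
    if C1 <= 4 * M:
        return "H1"
    if C1 <= 8 * M:
        return "H2"
    if C1 <= 16 * M:
        return "H3"
    return "H4"   # across multiple ANN/SNN NCs
```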
Compared with the NEBULA, the EPHA achieves higher parallelism and flexibility, enabling the deployment of larger networks by using the new data mapping method. As shown in Table 1, we can effectively reduce the hierarchical level, thereby reducing the complexity and the energy consumption of the control logic.

The Precision of EPHA
Reducing the data precision (normally called quantization in neural networks) is a general method to further reduce power consumption and is also popular for resource-constrained analog devices. The data precision of the EPHA is set to 4 bits for two reasons. First, previous studies [67,88] show that lower resolutions can achieve accuracy similar to the floating-point counterparts of the evaluated models. Second, the crossbar circuit is limited by the resistance range of the CoGd-FIM-based devices, which also limits the data precision. The 4-bit precision may decrease the accuracy, so we use post-training quantization [17] to reduce the data resolution while retaining performance. Specifically, we train the models with floating-point precision. Then, we determine the maximum activation value of each layer by using a subset of the training data and clip all the activations according to this maximum. The maximum activation is empirically decided for each layer to minimize the loss of accuracy. After clipping the activation values, we quantize them to 16 levels according to the maximum conductance range by using the neural network distiller [88]. It should be noted that although the quantization level remains constant when the activations and weights of each layer are quantized by the peripheral circuit, the scaling factor varies from layer to layer.
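A minimal sketch of this clip-and-quantize step follows; the function name and the dequantized return value are illustrative choices, and the per-layer scale a_max is assumed to come from the calibration pass described above:

```python
import numpy as np

def quantize_activations(a, a_max, levels=16):
    """Post-training quantization: clip to the per-layer maximum
    activation (calibrated on a subset of training data), then round
    to 16 levels (4 bits). The scale a_max varies layer by layer."""
    a = np.clip(a, 0.0, a_max)
    step = a_max / (levels - 1)
    return np.round(a / step) * step   # dequantized 4-bit values
```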

EVALUATION METHODOLOGY
This section briefly introduces a method for converting ANNs to SNNs, adapted from Reference [4]. We then present some details related to the modeling of the EPHA.

ANN to SNN Conversion
Due to the non-differentiable nature of spikes, there is a lack of direct and effective methods for training SNNs. In the EPHA, we train the original ANN and transform it into an SNN version with the same network structure [5,17]. We train the VGG-16 [66] and the Inception-V3 [69] using back-propagation on image recognition datasets including MNIST [16], CIFAR-10 [35], and ImageNet [15]. We then use the following modifications to convert ANNs to their corresponding SNNs: (1) Conversion of Inputs. Benchmark datasets for evaluating SNNs are currently rare, so it is important to convert frame-based image databases into event-based datasets to evaluate the recognition accuracy of SNNs. According to the equation z_i^l := V_thr · (Σ_{j=1}^{M_0} W_ij^l · x_j + b_i^l), we convert the input feature map into rate-encoded spike trains and obtain the regular charge value z_i^l of neuron i by weighting the spike trains with the corresponding weights W_ij^l, represented as conductances.
(2) Conversion of Weights and Biases. The linear rescaling of all weights and biases is based on the linear activation function. To preserve the information encoded within a layer, it is necessary to jointly scale the parameters of that layer. Denoting the p-th percentile of the activations in layer l as λ^l = p[a^l], the weights W^l and biases b^l are converted to W^l · λ^(l-1)/λ^l and b^l/λ^l, where p is the percentage of the converting scale (a minimal sketch of this rescaling follows this list). It should be noted that the bias is applied to the synapses in the form of an analog bias current. Besides, we set and analyze the thresholds and firing rates by using the data-based normalization conversion method suggested in Reference [4].
(3) Conversion of Max-pooling Layers. In conventional methods of converting an ANN to an SNN, it is typical to use the spike of the maximally firing neuron as the max-pooling outcome. However, this practice results in a loss of accuracy and poses challenges for hardware implementation. Therefore, we adopt an alternative approach that employs a pooling module in the peripheral digital circuits to achieve max-pooling without sacrificing accuracy. Specifically, in the SNN mode, the spike train output by the spike generator is converted to an integer by using the adder. Then, the pooling operation can be completed by the max-pooling unit as in the ANN mode.
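As promised under step (2), a minimal sketch of the data-based rescaling; the function name and the assumption that λ values are precomputed percentiles are ours:

```python
def rescale_layer(W, b, lam_prev, lam_cur):
    """Data-based rescaling from step (2): with lam_l = p[a^l] the
    p-th percentile of layer-l activations, the weights become
    W * lam_{l-1} / lam_l and the biases b / lam_l, preserving the
    information encoded within the layer after conversion."""
    return W * (lam_prev / lam_cur), b / lam_cur
```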
Our transformation method results in a timestep count that is proportional to the depth of the network. In our experimental setup, we set the timesteps of the three-layer MLP to 30, the converted VGG-16 to 400, the eight-layer binary network to 80, and the Inception-V3 to 550. Figure 6 shows the accuracy of the conversion algorithm without taking non-ideal physical characteristics (including variation, noise, or non-ideal circuit linearity) into account. This is mainly because the non-ideal physical characteristics are out of the scope of our work and can be addressed by previous studies [33,37]. We will further study this issue in subsequent work. These evaluations show that the algorithm for ANN-to-SNN conversion is nearly lossless. Generally, a lot of overhead is associated with converting an ANN to an SNN. Other surrogate-gradient approaches, such as e-prop and eRBP (event-driven random backprop), can be used to obtain backprop-like performance on SNNs. In addition, other studies, such as SSTDP [39], DCT-SNN [19], TA-SNN [80], NeuSpike-Net [87], and SpikeConverter [38], can realize shorter timesteps even on large datasets. We consider our conversion a proof of concept to show the performance of the architecture with SNNs.

Power, Area, and Performance Modeling
To facilitate architecture-level analysis, we use a multi-level co-simulation framework to evaluate the performance of the EPHA. We capture the underlying physics of the CoGd-FIM by using mumax3 [72]. The spin dynamics of each sublattice are computed with the atomistic Landau-Lifshitz-Gilbert equation. Based on the analysis of the spintronic device [53,67], we opt to use the dimensions 320 nm × 20 nm × 7.5 nm as the minimum device size. This not only meets the resistance range requirement but also takes advantage of the low power consumption of spintronic devices. The energy and area of the max-pooling circuit, eDRAM, and router are adapted from the analysis in the ISAAC [64]. The ADC data comes from the ADC performance survey 1997-2015 [48]. We use an 8-bit ADC instead of a 4-bit ADC, since the partial sums of the crossbar array may exceed 4 bits. To ensure computation accuracy, we have to sacrifice power and area. The power consumption and area of the DAC, SNN driver, and AU are adapted from the NEBULA [67]. The power consumption and area data of the spike generator are sourced from Reference [73], and we have scaled them from the 45 nm to the 32 nm process node. Specifically, the energy consumption of each part is computed by multiplying the energy consumed per bit/operation per unit with the total number of data bits/operations [79]. The energy consumption of writing, reading, and resetting one bit of an FIM device is 0.17 fJ, 0.15 fJ, and 0.007 fJ, respectively [63]. The energy consumption per conversion step of the DAC, ADC, SNN driver, and spike generator is 1.79 fJ [60], 59.4 fJ [9], 0.0044 fJ [48], and 0.259 fJ [73], respectively. The energy consumption of the input/output buffer and the eDRAM is 0.064 pJ/bit [56] and 0.075 pJ/bit [10], respectively. The energy consumption of the WL switch matrix is 0.27 pJ per operation [56].
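The accounting reduces to multiply-and-sum over the per-unit costs quoted above; in the sketch below, the operation counts per layer and the choice of which terms apply are assumptions of this sketch and would come from the workload mapping of Section 4:

```python
# Per-unit energy figures quoted above, in joules.
E_FIM_WRITE, E_FIM_READ = 0.17e-15, 0.15e-15     # J/bit
E_FIM_RESET = 0.007e-15                          # J/bit
E_DAC, E_ADC = 1.79e-15, 59.4e-15                # J/conversion step
E_SNN_DRV, E_SPIKE_GEN = 0.0044e-15, 0.259e-15   # J/conversion step
E_BUF, E_EDRAM = 0.064e-12, 0.075e-12            # J/bit

def layer_energy(w_bits, r_bits, dac_steps, adc_steps, buf_bits):
    """Energy of one ANN-mode layer = sum of (per-unit cost x count);
    the SNN mode would swap the DAC/ADC terms for the SNN driver and
    spike generator terms."""
    return (w_bits * E_FIM_WRITE + r_bits * E_FIM_READ
            + dac_steps * E_DAC + adc_steps * E_ADC
            + buf_bits * (E_BUF + E_EDRAM))
```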
We evaluate the EPHA in the ANN and SNN modes by using the parameterized analytical model of the Bitlet [59]. Specifically, the system-level energy consumption is obtained as the sum of the energy consumption of all parts. We map different workloads to the different nodes of the EPHA according to the computing flow in Section 4. This provides a deterministic execution model that helps us analyze the energy consumption and the latency/throughput of the EPHA for a given neural network. We assume there are no structural hazards between layers or within a layer. We also assume that there are no data conflicts between different components. Note that cycle-accurate simulations would not capture any phenomena beyond those already captured by this model.

EXPERIMENTAL RESULTS
We train neural networks on common datasets and then convert the ANNs to the corresponding multi-layered SNN benchmarks. The CMOS transistors, which are used to construct the crossbar peripherals, are evaluated in a 32 nm process node. We evaluate the ANN mode and SNN mode of the EPHA by deploying different benchmark networks (such as the VGG-16 [66] and the Inception-V3 [69]). We also compare our design with state-of-the-art accelerators. We deploy the VGG-16 on the various designs to ensure fair comparisons. Table 2 shows the layer-wise parameters of deploying the VGG-16 in the ANN mode [21] and the SNN mode.

The Performance of FIM-based Devices
In the architecture of neuromorphic computing, the number of artificial neural synapses is vast, and the dot-product operation is the primary power-consuming bottleneck of the entire system. Thus, the selection of artificial neural synaptic devices is critical for the design. We evaluate the switching process of the FIM-based device by using mumax3 [72]. Figure 7(a) intuitively reflects the flipping of the upward magnetization to downward as the result of the current-induced SOT, and Figure 7(b) shows the AHE resistance curve of the compensated FIM with increasing current pulses, which verifies the multiple device behaviors of the FIM device. The main parameters used in our simulation are summarized as follows [3]: the exchange constant is 240 × 10^-12 pJ m^-1, the saturation magnetization 7.0 × 10^5 A m^-1, the damping coefficient 0.03, and the vertical anisotropy coefficient 3 × 10^5 kJ m^-3. Due to the antiferromagnetic exchange coupling of the Co(Gd) sublattices, the FIM-based switching has significant energy efficiency and a small delay. Specifically, based on our simulation and previous studies [3,41], the FIM device is one to two orders of magnitude faster than an FM-based device.

Fig. 8. Energy vs. delay for a synapse in ANN (magenta), cellular neural network (green), SNN (gold), and oscillator neural network (blue). Labels for architectures can be found in Reference [53].
To further verify the advantages of the FIM device in neuromorphic computing, we compare the FIM with other emerging devices in terms of delay and energy consumption. Figure 8 shows the energy consumption and delay of a CoGd-FIM-based synapse and other beyond-CMOS devices (FET corresponding to ferroelectrics; SOT/STT corresponding to magnetoelectrics based on the SOT/STT) obtained by using the Beyond-CMOS Benchmarking [49,52]. Here, we compare the device parameters of the CoGd-FIM in the SNN mode. As shown in Figure 8, the CoGd-FIM has lower energy consumption within a shorter delay. Moreover, we can draw the conclusion that, compared with other neural networks, the emerging devices show better performance as synapses for SNNs. Moreover, due to the non-volatility of the FIM switching, the FIM device can retain its state while no current flows through the HML. Therefore, the FIM switching has no standby power and is also suitable for asynchronous event-based computation. These results reveal the great potential of the CoGd-FIM as an artificial synapse. However, device variations, including device-to-device variation and cycle-to-cycle variation, may deviate the device resistance from the target state and degrade the inference accuracy. Specifically, the random and independent formation of conduction filaments may lead to different final resistances under the same programming flow. Fluctuations during fabrication may also lead to a similar phenomenon. A large number of previous studies [33,37] have demonstrated that on-chip tuning and off-chip pre-processing can be used to tackle device variation. For example, constant-signal pulse programming [33] applies constant pulses until the resistance state reaches the target state. The Vortex [37] proposes a variation-aware off-device training method to recover the accuracy drop.

The Power and Area of EPHA
Table 3 presents the parameters of the EPHA in both the ANN and SNN modes. The peak power consumption of an EPHA chip in the ANN mode is approximately 794.156 mW, with an area of 3.804 mm². Although the EPHA can deploy ANNs and SNNs on the same crossbar array, the performance of the ANN/SNN NC differs between the ANN and SNN modes. In the ANN mode, the power consumption of an ANN/SNN NC is 119.776 mW, which is approximately 9 times higher than that in the SNN mode. The reason for the difference is that the ANN mode necessitates a longer duration to attain a broader range of AHE resistance under the same pulse current density, leading to increased energy consumption during weight updating to surmount the demagnetization energy barrier. Furthermore, in the dot-product operation, a higher AHE resistance generates more heat owing to the Joule heating of the current pulses. Figure 9 depicts the power and area consumption of the various components of the EPHA. The FIM-based crossbar array occupies the largest area of the entire chip, accounting for 59.3%. Since the array can be shared by different modes, it enhances the architecture's flexibility and hardware utilization rate. The power consumption breakdown of the EPHA differs between the ANN and SNN modes. In the ANN mode, the power consumption of the crossbar array is the highest, exceeding the overall consumption of the peripheral circuits. Additionally, the ADCs/DACs contribute 32.2% of the total power. This is primarily because the ADC integrates multiple complex modules, such as the sample-and-hold circuit, the capacitive DAC, the comparator, and the digital logic. Additionally, our design employs a higher-precision ADC to ensure the accuracy of the accumulation calculation, which increases power consumption. Reducing the power of the ADCs/DACs is crucial to further reducing power dissipation. In contrast, as shown in Figure 9, the most significant power consumption in the SNN mode is for the conversion between digital and analog signals. Furthermore, traditional CMOS-based circuits exhibit advantages in specific applications. For example, the max pooling consumes only 0.4 mW in an area of 0.00024 mm², which is very small for the whole chip. This demonstrates that a hybrid circuit based on CMOS and spin devices can maximize both benefits. Additionally, this approach avoids the conversion of the max-pooling layer in the ANN, reducing the accuracy loss of the conversion between the ANN and the SNN. Compared with other architectures, the EPHA performs various functions beyond the synaptic and neuronal functions of ANNs and SNNs by using peripheral circuits, realizing higher flexibility. Compared with the NEBULA, the peripheral power consumption of the EPHA is a bit higher, mainly due to the added digital functional units (0.4 mW for the MaxPool and 1.2 mW for the Spike Generator) and the higher-resolution ADC (0.43 mW in the NEBULA versus 16 mW in the EPHA).

The Performance and Energy Consumption of EPHA
The performance and energy consumption of the EPHA depend on the workload. To estimate them, we use the evaluation method in Section 5.2. When deploying the VGG-16, the EPHA in the ANN mode can classify one image within 15 ms, with an energy consumption of 15 mJ. In the SNN mode, the latency and energy consumption are 55 ms and 10 mJ, respectively. The performance of the EPHA in the ANN mode and SNN mode reaches 2.03 TFLOPS and 557 GFLOPS, respectively. The energy efficiency in the ANN and SNN modes reaches 623 Gops/J and 6.19 Tops/J, respectively. Because the EPHA can deploy both ANNs and SNNs, we can achieve a tradeoff between energy efficiency and performance by using the SNN-ANN hybrid mode. Figure 10 shows the layer-wise energy consumption and latency of deploying the VGG-16 in both the ANN mode and the SNN mode. The energy consumption of the 1st layer to the 5th layer in the ANN mode is significantly higher than that in the SNN mode. We therefore divide the VGG-16 into two parts. The first part (the 1st layer to the 5th layer) is in the SNN mode and is converted using the method mentioned in Section 5.1. The second part (the other layers) is in the ANN mode. The input of the 6th layer is aggregated from the output spikes of the 5th layer by the AU. Compared with the ANN mode, the SNN-ANN hybrid mode reduces the energy consumption by 1.6× and increases the energy efficiency from 623 Gops/J to 999 Gops/J.
Figure 11 presents the layer-wise energy breakdown of the significant components of the EPHA in different modes and neural networks. It should be noted that our comparison assumes no sparsity in the activations and weights, which represents a limiting case; the real energy consumption is expected to be lower. The energy consumption of the EPHA varies across networks and modes. For example, the maximum energy consumption of one layer when deploying the VGG-16 [66] in the ANN mode is over 6× that of the Inception-V3. This is mainly because of the huge differences in convolution size between different networks or different layers. In particular, the largest convolution size in the Inception-V3 is 3 × 3 × 192 × 384, while the smallest is 1 × 1 × 24 × 64; the largest is 432× larger than the smallest. The largest convolution size in the VGG-16 is 3 × 3 × 512 × 512, which is 1,563× larger than the smallest size in the Inception-V3. Therefore, it is important for an architecture to support convolutions with different structures. Similarly, since the energy required by the FIM-based crossbar array to overcome the demagnetization energy barrier differs between modes, the energy consumption of one layer varies greatly under different modes with the same network. Taking the VGG-16 as an example, the energy consumption in the ANN mode is nearly 40× that in the SNN mode. This is due to the higher sparsity of the binary data and the event-driven computing method used in the SNN.

Comparison with Other Designs
Table 4 compares several state-of-the-art hardware platforms in terms of their computational and power efficiency. The computational efficiency (CE) metric refers to the number of 16-bit operations performed per second per mm². The power efficiency (PE) is measured by the number of 16-bit operations performed per watt. To standardize the energy consumption of 16-bit computations across different designs, we made adjustments to the other platforms. Specifically, we increased the number of cycles required for computations by four times in the NEBULA and EPHA and reduced the throughput of the TrueNorth, Loihi, SATO, SpinalFlow, and Peng by one-half. Furthermore, we estimated the PE of the Eyeriss [11], TrueNorth [45], and Loihi [14] based on their throughput using the conversion method described in Reference [43]. It is important to note that the TrueNorth and Loihi in Table 4 are SNN accelerators, and the indicated weight accuracy represents the accuracy of the original network. During actual deployment, the input data undergoes conversion into a sequence of spikes using rate coding. Our focus is solely on the performance of the inference process; we do not consider the performance of training. The data of the Loihi and TrueNorth presented here were obtained from Reference [20]. The NEBULA, a typical spin-based design for ANNs and SNNs, demonstrates competitive performance and energy efficiency. To provide a direct architectural comparison, we assume that the energy consumption of the CoGd-based device is identical to that of the MTJ used in the NEBULA. Note that the CoGd-based device actually exhibits superior performance with smaller energy consumption [3]. As shown in Table 4, the power efficiency of the EPHA is 1.6× higher than the NEBULA's in the ANN mode and achieves a 16× improvement in the SNN mode. The CE of the EPHA is 27× higher than the NEBULA's in the ANN mode and achieves a 68× improvement in the SNN mode. This is mainly due to three reasons. First, the EPHA boasts a unique computation dataflow that avoids computations across multiple NCs. Specifically, the EPHA effectively reduces the hierarchical neuron level to avoid computations across multiple NCs, as explained in Section 4.2. In contrast, the NEBULA needs across-NC computations in the 6th layer to the 13th layer when deploying the VGG-16. In this case, the intermediate data needs to be converted into digital signals and then moved to other NCs through various memories/storages, which leads to additional energy consumption; this is usually the most costly part of the energy consumption and delay of spin-based architectures. Second, the weights in the same crossbar array perform the dot-product operation with the same activation data, thanks to the unique crossbar array structure and computation flow. This leads to better parallel computing and higher activation reuse. Third, the EPHA enables the deployment of ANNs and SNNs in the same NC, which achieves higher utilization of hardware resources. In summary, the EPHA's device-level, circuit-level, and computation-flow innovations enable it to outperform the NEBULA. Both the NEBULA and the EPHA can deploy ANNs, SNNs, and even the SNN-ANN hybrid mode. The NEBULA shows that the ANN-SNN hybrid can reduce the energy consumption of ANNs and reduce the latency of SNNs. In Section 6.3, we also prove this point. As shown in Figure 12(a), the energy consumption increases as more ANN layers are added to the network. The CE shows the same trend (shown in Figure 12(b)). This is mainly because the latency increases with more SNN layers, reducing performance.
Figure 12(c) shows that the PE decreases with more ANN layers. This is mainly because the energy consumption increases with more ANN layers, resulting in lower energy efficiency. We can realize a tradeoff between energy consumption and latency by using the SNN-ANN hybrid mode.

Comparison with the ISAAC.
The ISAAC is an advanced ANN accelerator design that utilizes crossbar memory to enhance throughput, energy efficiency, and computational density. We increased the array size of the EPHA to match that of the ISAAC, but the EPHA still exhibits lower power consumption and higher computational density. This can be attributed to our use of the multi-bit FIM device as the synapse and neuron, which achieves higher integration and lower power consumption. Additionally, the FIM device in the EPHA can perform the dot-product in a single cycle, eliminating pipeline stages and enabling higher parallelism. As shown in Table 4, the EPHA outperforms the ISAAC in terms of area and power and has great potential as an ultra-low-power design. The EPHA improves the PE and CE by 108× and 1.8×, respectively. This is mainly because the novel dataflow based on the hierarchical neurons of the EPHA reduces the energy consumption of the ADC and avoids cross-array calculations. Finally, our design uses the same crossbar-based array to deploy both ANNs and SNNs, making it more suitable for AI development.

Comparison with Other Emerging Device-based Designs.
Peng et al. [56] optimize the weight mapping and dataflow for neural networks based on the ISAAC. Compared with this design, the EPHA improves the PE by 8×, reaching 41.1 Tops/W. The system of Yao et al. [81], one of the state-of-the-art memristor-based hardware systems, shows great potential for realizing energy-efficient CNN neuromorphic computing. Compared with it, the EPHA improves the PE and CE by 60× and 7.3×, respectively. This is mainly because the spin-based device of the EPHA realizes more energy-efficient computing: the energy consumption for a 1-bit computation in Reference [81] is 371.89 pJ, about 465× (nearly three orders of magnitude) higher than that of the EPHA (0.8 pJ/bit).
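The per-bit gap can be checked directly from the two figures quoted above; the short snippet below is only that arithmetic.

```python
# Back-of-the-envelope check of the per-bit energy gap quoted above.
import math

e_memristor = 371.89e-12  # J per 1-bit computation, Reference [81]
e_epha = 0.8e-12          # J per bit in the EPHA
ratio = e_memristor / e_epha
print(f"{ratio:.0f}x, ~{math.log10(ratio):.1f} orders of magnitude")
# -> ~465x, i.e., nearly three orders of magnitude
```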

Comparison with Traditional Device-based Architectures
The advantages of the EPHA become apparent when it is compared with traditional computing platforms, demonstrating the great potential of CoGd-based devices for energy-efficient neuromorphic computing. The Eyeriss [11], a representative CNN inference engine, delivers very high hardware performance. Compared with the Eyeriss, the EPHA improves the PE and CE by 118× and 24×, respectively; the energy efficiency of the EPHA is also 2.9× higher than that of the Eyeriss.

The TrueNorth and the Loihi are state-of-the-art non-Von Neumann architectures for SNNs. As shown in Table 4, the EPHA improves both the PE and the CE by more than 5 orders of magnitude compared with the TrueNorth; compared with the Loihi, it improves the PE by over 4 orders of magnitude and the CE by 11×. This is mainly due to the following reasons. First, the spin-based neuromorphic device in the EPHA reduces the energy consumption per synaptic event (0.15 fJ) by about 5 orders of magnitude compared with a corresponding digital/analog CMOS synapse or neuron implementation in the TrueNorth and Loihi (23.6 pJ). Second, the multi-bit device avoids traditional Boolean computing and reduces the computational complexity of a dot-product to O(1), realizing higher performance. Third, the non-volatile FIM device consumes zero standby power, which suits event-driven computing; in contrast, the inactive PEs in the Loihi and TrueNorth consume additional energy to maintain data. Fourth, the Loihi must transport a large volume of inter-core communication in the form of packetized messages to realize higher scalability and flexibility, and this communication increases the latency and energy consumption of the whole system; the hierarchical neurons and novel dataflow of the EPHA reduce the energy consumption and latency caused by inter-core communication. Fifth, the variable synaptic formats in the Loihi decrease its area and energy efficiency. Moreover, the EPHA realizes better scalability and flexibility, since it avoids specially designing and training the network model to be compatible with the architecture.

As shown in Table 4, compared with the latest SNN accelerators, such as the SATO [40] and SpinalFlow [50], the EPHA also shows an advantage in power consumption. The ultra-high power efficiency of the EPHA demonstrates its great potential to reduce packaging cost and realize low-cost edge engines. However, the CE of the EPHA is lower than that of both the SATO and SpinalFlow, mainly for the following reasons. First, the SATO [40], SpinalFlow [50], and Parallel Time Batching [32] are temporal-parallel SNN accelerators that accumulate timesteps in parallel. Second, these designs decouple the chronological dependence in the spiking operations and maximize data locality to increase parallelism and decrease data movement. It should be noted that the EPHA could also further improve its performance and energy efficiency with these optimizations; however, they would enlarge the peripheral circuits and might reduce the performance of the EPHA in the ANN mode, so the optimization of the EPHA is a tradeoff in performance between the ANN mode and the SNN mode. Third, these designs use more advanced CMOS technology. In addition, we are conservative in selecting the area of one FIM device (320 nm × 20 nm). Over the next decade, the number of intermediate resistive states of spin-based devices could increase by a factor of 10, achieving higher area efficiency [26].
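The zero-standby-power argument can be illustrated with a simple energy model: total energy is a per-spike dynamic term plus a standby term that vanishes for the non-volatile device. In the sketch below, the per-event energies are the figures quoted above, while the spike count, runtime, and CMOS standby power are hypothetical.

```python
# A small model of why non-volatility helps event-driven workloads:
# energy scales only with spike events, with no standby term for
# idle, non-volatile synapses.
import math

E_SYN_FIM = 0.15e-15   # J per synaptic event (spin-based, from the text)
E_SYN_CMOS = 23.6e-12  # J per synaptic event (CMOS, from the text)
print(f"per-event gap: ~{math.log10(E_SYN_CMOS / E_SYN_FIM):.1f} "
      "orders of magnitude")  # -> ~5.2

def event_driven_energy(spike_events: int, e_per_event: float,
                        standby_w: float, runtime_s: float) -> float:
    """Total energy = dynamic (per spike) + static (standby leakage)."""
    return spike_events * e_per_event + standby_w * runtime_s

# Hypothetical workload: 1e6 spikes over 10 ms; 50 mW standby for CMOS.
e_fim = event_driven_energy(10**6, E_SYN_FIM, standby_w=0.0, runtime_s=0.01)
e_cmos = event_driven_energy(10**6, E_SYN_CMOS, standby_w=0.05, runtime_s=0.01)
```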

CONCLUSION
This article proposes the EPHA, a non-Von Neumann, modular, parallel, scalable, multi-model architecture that supports both ANNs and SNNs. We use an ultra-low-power, ultra-fast-response CoGd-FIM-based device to design a morphable neuron core circuit that forms the building block of the EPHA. The EPHA exhibits competitive performance and energy efficiency in both the ANN and SNN modes on a suite of workloads. It allows ANNs and SNNs to be deployed on the same array, which is made possible by the similarity of the analog computations of the two models. Our spin-based design supports spiking, continuous-value, and mixed models in the same synapse array, marking a significant breakthrough. Furthermore, the design concept presented in this article can be extended to other emerging device-based architectures, thereby increasing the flexibility of the architecture. Additionally, the high writing efficiency of the FIM device illustrates its potential for training chips.

Fig. 1. (a) Primitives of the ANN and SNN, (b) structure of the SNN NC, (c) the architecture of the NEBULA, (d) structure of the ANN NC.

Fig. 2. (a) Schematic illustration of the compensated FIM, (b) the resistance curve of the compensated FIM, (c) an FIM crossbar array acting as a neuromorphic computing array for ANNs.

Fig. 3. (a) The EPHA architecture, (b) an ANN/SNN neural core (A/S), (c) a neuron receives the weighted synaptic summation of inputs, (d) structure diagram of the crossbar array, (e) structural diagram of the CoGd-FIM-based neuromorphic crossbar array.

Fig. 7. (a) The simulation diagram of the FIM-based device, (b) the AHE resistance curve of the compensated FIM with the increase of different current pulses.

Fig. 9. The power consumption and area of the EPHA in different modes.

Fig. 11. The layer-wise energy consumption of the ANN mode compared to the SNN mode on the EPHA. The line chart represents loading VGG-16, corresponding to the left ordinate axis. The bar chart represents loading Inception V3, corresponding to the right ordinate axis.

Table 1. The Hierarchy Level of Neuron Units by Mapping Different Networks

Table 2. The Layer-wise Parameters in the ANN Mode and the SNN Mode of the VGG-16