Supporting Energy-Based Learning With An Ising Machine Substrate: A Case Study on RBM

Nature apparently does a lot of computation constantly. If we can harness some of that computation at an appropriate level, we can potentially perform certain types of computation (much) faster and more efficiently than we can with a von Neumann computer. Indeed, many powerful algorithms are inspired by nature and are thus prime candidates for nature-based computation. One particular branch of this effort that has seen rapid recent advances is Ising machines. Some Ising machines are already showing better performance and energy efficiency for optimization problems. Through design iterations and co-evolution between hardware and algorithm, we expect more benefits from nature-based computing systems. In this paper, we make a case for an augmented Ising machine suitable for both training and inference using an energy-based machine learning algorithm. We show that with a small change, the Ising substrate can accelerate key parts of the algorithm and achieve non-trivial speedup and efficiency gain. With a more substantial change, we can turn the machine into a self-sufficient gradient follower that virtually completes training entirely in hardware. This can bring about 29x speedup and about 1000x reduction in energy compared to a Tensor Processing Unit (TPU) host.


Introduction
Nature apparently does a lot of computation all the time, solving differential equations, performing random sampling, and so on. We have harnessed some of it, of course. Transistors, for example, can be turned on and off based on the laws of nature. They are the foundation for most of our computers today. But this is different from harnessing nature's computational capability at some higher level, for example, to solve an entire problem. Indeed, some very powerful algorithms are inspired by nature [2,41,72]. It is increasingly evident that certain nature-based systems with fast dynamics can solve a certain set of problems much more quickly and efficiently than by mapping them to the von Neumann interface [3].
Among various critical real-world problems, Machine Learning (ML) training is known to be extremely computationally expensive, which calls for a breakthrough solution to reduce the computational time and energy cost. We believe that a nature-based computing substrate could provide a potential fundamental solution to this issue. This idea is actually not entirely new -- recently, Hinton discussed a similar perspective in [29]. Specifically, he emphasized the significance of utilizing the nature of electronic components (such as capacitors, resistors, and conductors) in analog hardware to enable efficient ML with "mortal computation". However, he also acknowledged that implementing backpropagation using nature-based computing substrates, as opposed to forward propagation, is a significant challenge.
This motivates us to investigate whether recently emerging high-performance nature-based computing substrates have the potential for efficient ML. One particular branch of such computing substrates that has seen rapid recent advances is Ising machines. In a nutshell, Ising machines leverage nature to seek low-energy states for a system of coupled spins. A number of problems (in fact, all NP-complete problems) can be expressed as equivalent optimization problems of the Ising formula (more on that in Sec. 2). Though existing Ising machines are largely in the form of prototypes and concepts, they already show promise of (much) better performance and energy efficiency for optimization problems. But the real appeal is in their future opportunities. First, through design iterations, their computational capacity and efficiency will continue to improve, hopefully quickly and significantly as they have in the recent past. Second, with novel hardware, the design of algorithms (especially those inspired by nature) will co-evolve with the hardware and lead to a richer combination of problem-solving modalities.
In this paper, we make a concrete case for this second point by demonstrating that an augmented Ising machine design can bring a 29× speedup with a 1000× energy cost reduction over TPUs to the training of Restricted Boltzmann Machines (RBM) and variants, a machine learning modality that has drawn considerable attention recently. It is worth noting that Boltzmann-Machine-based ML has emerged as a promising approach for solving various real-world networking problems including recommendation systems [49] and traffic prediction [53], outperforming traditional neural network solutions in terms of both inference speed and accuracy. However, the training of Boltzmann-Machine-based models remains prohibitively expensive, thus hindering the realization of their full potential. The proposed training approach based on (augmented) Ising machines will fundamentally address this problem.
It is also important to note that we believe the true appeal lies in what we can do when we co-design the hardware and the algorithm. In this paper, we mostly follow existing practice in terms of the training algorithm, as this is merely the first step of our exploration. In the future, we will investigate how to extend the proposed solution to general ML.
Background and Related Work

Principles of Ising machines
The Ising model is used to describe the Hamiltonian of a system of coupled spins. The spins have one degree of freedom and take one of two values (+1, −1). The energy of the system is a function of the pair-wise coupling of the spins (J_ij) and the interaction (h_i) of an external field with each spin. The resulting Hamiltonian is as follows:

H = −Σ_{i<j} J_ij σ_i σ_j − Σ_i h_i σ_i    (1)

A physical system with such a Hamiltonian naturally tends towards low-energy states. It is as if nature tries to solve an optimization problem with Eq. 1 as the objective function, which is not a trivial task. Indeed, the cardinality of the state space grows exponentially with the number of spins, and the optimization problem is NP-complete: it is easily convertible to and from a generalized max-cut problem, which is part of the original list of NP-complete problems [38].
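The Hamiltonian above can be sketched numerically. A minimal example, assuming the sign convention of Eq. 1 (aligned spins have lower energy under positive coupling):

```python
import numpy as np

def ising_energy(sigma, J, h):
    """H = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i (Eq. 1)."""
    # Count each spin pair once by using the upper triangle of J.
    pair = -np.sum(np.triu(J, k=1) * np.outer(sigma, sigma))
    field = -np.dot(h, sigma)
    return pair + field

# Two coupled spins that prefer alignment (J_01 > 0).
J = np.array([[0.0, 1.0],
              [0.0, 0.0]])
h = np.zeros(2)
aligned = ising_energy(np.array([1, 1]), J, h)   # -1.0: lower energy
anti    = ising_energy(np.array([1, -1]), J, h)  # +1.0: higher energy
```

An optimizer (or a physical spin system) seeking the minimum of this function over the 2^n spin configurations is solving the NP-complete problem described above.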
Thus, if a physical system of spins offers programmable coupling parameters (J_ij and h_i in Eq. 1), it can be used as a special-purpose computer to solve optimization problems that can be expressed in the Ising formula (Eq. 1). In fact, all problems in the Karp NP-complete set have their Ising formulas derived [50]. Also, if a problem already has a QUBO (quadratic unconstrained binary optimization) formulation, mapping to the Ising formula is as easy as substituting spins for bits: σ_i = 2b_i − 1.
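The substitution σ_i = 2b_i − 1 can be carried out mechanically. The sketch below (the function name and test matrix are ours, not from the paper) converts a small QUBO matrix into Ising couplings and fields, and verifies that the two energy functions agree up to a constant offset:

```python
import itertools
import numpy as np

def qubo_to_ising(Q):
    """Rewrite b^T Q b (b_i in {0,1}) in spin variables sigma_i = 2*b_i - 1.
    Returns (J, h, offset) such that
    b^T Q b = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i + offset."""
    n = Q.shape[0]
    J = np.zeros((n, n))
    h = np.zeros(n)
    offset = 0.0
    for i in range(n):
        # Diagonal term: Q_ii * b_i = Q_ii * (s_i + 1) / 2  (b_i^2 = b_i).
        h[i] += -Q[i, i] / 2.0
        offset += Q[i, i] / 2.0
        for j in range(i + 1, n):
            q = Q[i, j] + Q[j, i]        # combined off-diagonal weight
            J[i, j] = -q / 4.0
            h[i] -= q / 4.0
            h[j] -= q / 4.0
            offset += q / 4.0
    return J, h, offset

# Check on a small instance: both forms give identical energies.
Q = np.array([[1.0, -2.0],
              [0.0,  1.0]])
J, h, c = qubo_to_ising(Q)
for bits in itertools.product([0, 1], repeat=2):
    b = np.array(bits, dtype=float)
    s = 2 * b - 1
    ising = -np.sum(np.triu(J, 1) * np.outer(s, s)) - h @ s + c
    assert abs(ising - b @ Q @ b) < 1e-9
```

The constant offset does not affect which configuration minimizes the energy, so the QUBO and Ising problems share the same optimum.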
Because of the broad class of problems that can map to the Ising formula, building nature-based computing systems that solve these problems has attracted significant attention [6,7,9,10,26,39,40,54,71].
It is important to note that different approaches may offer different fundamental tradeoffs and go through varying gestation periods. Thus it could be premature to evaluate a general approach based on observed instances of prototypes. Nevertheless, we provide a broad-brushed characterization, which can help researchers get a basic sense of the landscape -- as long as the caveats are properly understood.

The three generations of Ising machines
Some of the earliest and perhaps the best-known Ising machines are the quantum annealers marketed by D-Wave. At the moment, quantum annealers are extremely sensitive to noise, necessitating cryogenic operating conditions that consume much power (25 kW for the D-Wave 2000Q). Also, they use a local coupling network, which means the nominal 2000 nodes on the D-Wave 2000Q are equivalent to only about 64 effective nodes [24,25].
Coherent Ising Machines (CIM) can be thought of as a second-generation design in which some of these issues are addressed [8,35,51,61,70]. However, besides their own technical challenges, current CIMs all use computed rather than physical coupling. Such a design is unlikely to be energy-efficient for fundamental reasons.
Since the operating principle of CIM can be viewed through a model that describes coupled oscillators [62], using other oscillators can in theory achieve a similar goal. This led to a number of electronic Oscillator-based Ising Machines (OIM), which can be considered a third generation. These systems use electronic oscillators (LC tanks) [66] as spins and (programmable) resistors as coupling units. However, for on-chip integration, inductors are often a source of practical challenges: they are area-intensive and have undesirable parasitics, with reduced quality factor and increased phase noise, all of which pose practical challenges in maintaining frequency uniformity and phase synchronicity among thousands of on-chip oscillators.
Another electronic design with a different architecture is the Bistable Resistively-coupled Ising Machine (BRIM) [3]. In BRIM, each spin is implemented as a capacitor voltage controlled by a feedback circuit, making it bistable. The design is CMOS-compatible, and since it uses voltage (as opposed to phase) to represent spin, it enables a straightforward interface to additional architectural support for computational tasks. We therefore use a baseline substrate similar to BRIM. Note that the same principle discussed in this paper could apply directly to other Ising machines, with differing amounts of glue logic.

Energy-based models
The concept of energy is used not only in traditional optimization algorithms [41] but also in a number of machine learning algorithms collectively referred to as Energy-Based Models (EBM) [45]. The system usually consists of two sets of variables x and y (as a concrete example, imagine for now that x represents the pixels of an image, and y Boolean variables classifying the image). If the energy of the state, E(x, y), is low, then the classification is good. In many EBMs, the energy is similar to the Ising formula. In the well-known Boltzmann machine [32], for example, if we ignore the distinction between the two sets of variables and just refer to each variable as s_i, the energy is equivalent to the Ising model:

E = −Σ_{i<j} w_ij s_i s_j − Σ_i b_i s_i

When using Boltzmann machines for inference, the system is also similar to an Ising machine, but with a difference in the meaning of the variables/spins. By and large, Ising machines can accelerate inference of Boltzmann machines in a straightforward manner.
Training a Boltzmann machine is an entirely different matter. Unlike in an optimization problem where the weights are inputs, in training the weights are the output. In other words, training is the inverse of a combinatorial optimization problem. As in many machine learning algorithms, training is done by using a gradient descent approach to lower the loss function while iterating over a set of training samples. While we will get into the details later, a key point to emphasize here is that such a gradient descent algorithm often involves terms that are computationally intractable, necessitating approximation algorithms. Here, a nature-based computing substrate allows us to adopt approaches convenient or efficient for that substrate, without the need to follow exactly the prevailing von Neumann algorithms.
In this paper, we will show how a physical Ising machine can help accelerate an EBM, both in training and in inference, in a number of different ways. For this purpose, we choose a special case of Boltzmann machines called the Restricted Boltzmann Machine (RBM), as it is a widely used algorithm that is heavily optimized for von Neumann architectures. RBMs (and their multi-layer variants) have found applications in specialized learning [1,4,15,18,33,56,65,73] and unsupervised learning [19-21, 23, 28, 31, 44, 57, 60, 67, 69]. An RBM has only connections between a visible node and a hidden node, and no connections between two visible nodes or between two hidden nodes, as shown in Figure 1. As a result, the energy function is as follows:

E(v, h) = −Σ_i Σ_j w_ij v_i h_j − Σ_i b_{v_i} v_i − Σ_j b_{h_j} h_j

where w_ij is the coupling weight between visible unit v_i and hidden unit h_j, and b_{v_i} and b_{h_j} are the biases for the corresponding visible and hidden units. Similar to other neural networks, RBMs can be stacked into a multi-layer configuration to form a deep network. Specifically, two common variants are Deep Belief Networks (DBN) and Deep Boltzmann Machines (DBM) [34,56]. There are subtle differences between these variants and the simpler RBM. For the sake of clarity, we will focus on discussing RBM and follow conventional approaches when stacking multiple layers together. In other words, DBN/DBM-specific optimizations are outside the scope of this paper.
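The bipartite energy function above can be written directly in code. A sketch with made-up numbers (not one of the benchmark configurations):

```python
import numpy as np

def rbm_energy(v, h, W, b_v, b_h):
    """E(v, h) = -v^T W h - b_v^T v - b_h^T h.
    Bipartite: no v-v or h-h coupling terms appear."""
    return -v @ W @ h - b_v @ v - b_h @ h

# Toy RBM: 3 visible units, 2 hidden units.
W = np.array([[ 0.5, -0.2],
              [ 0.1,  0.4],
              [-0.3,  0.2]])
b_v = np.zeros(3)
b_h = np.zeros(2)
e = rbm_energy(np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0]), W, b_v, b_h)
# Only terms where both v_i and h_j are 1 contribute: e = -0.2
```

Low energy corresponds to (v, h) configurations the model considers likely, which is what ties this formula to the Ising substrate.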

Architecture
In this section, we discuss the architecture design issues. We start with our Ising machine substrate, which will be the foundation for additional architectural support (Sec. 3.1). Next, we discuss a design where our modified Ising machine serves to accelerate key computation in an otherwise conventional approach to the training of RBMs (Sec. 3.2). Finally, we show that if we start from first principles, we do not need to follow exactly the conventional training algorithm and can better leverage the strength of our Ising substrate as a practical Boltzmann sampler (Sec. 3.3). The result is more flexibility in system design that ultimately leads to a more efficient design.

Ising machine substrate
As we discussed earlier, a number of physical substrates can leverage nature to perform optimization of an Ising formula. In principle, any such substrate can be used for our purpose of accelerating a machine learning algorithm. In practice, we aim for an integrated electronic design. This makes the BRIM design [3] a much more convenient baseline to use. Figure 2 shows a high-level diagram of its architecture. The coupling between each pair of nodes is symmetric (J_ij = J_ji), resulting in an upper-triangular coupling network. In an equivalent implementation, the coupling unit may consist of two uni-directional parts, forming a symmetric layout.
In this machine, each node consists of a capacitor and a feedback unit that makes the capacitor bistable (1 or −1). A mesh of all-to-all programmable resistors serves to express the Ising formula the system is trying to optimize. When the machine is treated as a dynamical system, a Lyapunov analysis can be applied to the differential equations governing the nodal voltages. It can be shown that local minima of the Ising energy landscape are all stable states of the system [3]. Put more simply, starting from a given initial condition, this system of coupled nodes will seek a local minimum without explicit guidance from a controller. Extra annealing control is needed to inject random "spin flips" to escape a local minimum. This is analogous to accepting a state of worse energy with non-zero probability in simulated annealing [41].
In an RBM, the nodes are separated into a bipartite graph. For such a special graph, the architecture of our Ising machine substrate can be slightly modified to have nodes on two edges of the coupling network. In this structure, a visible node (v_i) can only be coupled to a hidden node (h_j), as shown in Figure 3. This layout significantly improves space efficiency compared to one that allows connections between all nodes. As a concrete example, one of our benchmarks uses 784 visible nodes and 200 hidden nodes. Mapping them onto a generic all-to-all Ising substrate would need about 6 times more coupling units ((784 + 200)^2 vs 784 × 200).
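The savings are simple arithmetic:

```python
# Coupling units needed for the 784x200 RBM benchmark from the text.
n_v, n_h = 784, 200
all_to_all = (n_v + n_h) ** 2     # generic substrate: (784 + 200)^2 = 968,256
bipartite = n_v * n_h             # bipartite layout:   784 * 200   = 156,800
ratio = all_to_all / bipartite    # ~6.2x more units for the generic layout
```

Since coupling units dominate chip area, this ratio translates almost directly into area savings for the bipartite design.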
Finally, the nodes of the Ising machine can be augmented in the following ways to support operation of the RBM algorithm. In both training and inference, it is common to clamp the visible nodes or the hidden nodes to certain values. This can be easily achieved with a buffer connected to the nodal capacitor. One important detail is that the inputs to the visible nodes are usually multi-bit values and thus require digital-to-analog conversion. Note that in the baseline Ising machine, the vast majority of the area is devoted to the coupling units, as their number scales with n^2 (n being the number of nodes). Thus most additions to the node structure have a small impact on the system's complexity and chip area.

Figure 3: High-level RBM showing visible and hidden nodes, with clamping units to drive node biases, coupling mesh, and programming logic.

Gibbs sampler architecture
We first discuss a more traditional variant of RBM accelerator: we simply leverage the Ising substrate to accelerate the portion of the software algorithm that naturally suits the hardware. For convenience, we name this accelerator design the Gibbs sampler (GS), as it follows (a variant of) the traditional Gibbs-sampling-based algorithm [22]. The overall algorithm is shown as Algorithm 1.
The algorithm: Here we see that in the training loop (lines 8 to 16), there are repeated calculations of ⟨v_d⊤ h_d⟩ and ⟨v_m⊤ h_m⟩. In a nutshell, the algorithm is a stochastic gradient descent, and the training loop calculates the stochastic gradient for every weight w_ij. In the so-called positive phase (lines 9, 10), a training sample v_d is clamped to the visible nodes, and a corresponding sample for the hidden nodes h_d is generated based on the conditional probability formula:

p(h_j = 1 | v) = σ(Σ_i w_ij v_i + b_{h_j})

where σ(x) is the logistic function 1/(1 + e^{−x}).
In the negative phase (lines 13, 14), a k-step Gibbs sampling is performed. Keeping the current hidden node values, we probabilistically generate a set of new visible nodes, with probabilities as follows:

p(v_i = 1 | h) = σ(Σ_j w_ij h_j + b_{v_i})
From there, the updated visible nodes project back to generate updated hidden nodes, forming one complete step of the Markov Chain Monte Carlo (MCMC) algorithm. In principle, one such step in the MCMC algorithm would make a rather poor sample. In practice, a small number of k steps (e.g., 5) is chosen to balance the cost and the quality of the sampling. Such a k-step contrastive divergence algorithm is often referred to as CD-k.
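The CD-k loop just described can be sketched in NumPy. Names, shapes, and hyperparameters here are our own illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(v_data, W, b_v, b_h, k=5, lr=0.1):
    """One CD-k update for a single training sample.
    Shapes (our choice): v_data (nv,), W (nv, nh), b_v (nv,), b_h (nh,)."""
    # Positive phase: clamp the data, sample the hidden units.
    h_data = (rng.random(b_h.shape) < sigmoid(v_data @ W + b_h)).astype(float)
    # Negative phase: k steps of alternating Gibbs sampling.
    h = h_data
    for _ in range(k):
        v = (rng.random(b_v.shape) < sigmoid(W @ h + b_v)).astype(float)
        h = (rng.random(b_h.shape) < sigmoid(v @ W + b_h)).astype(float)
    # Stochastic gradient estimate: <v_d h_d> - <v_m h_m>.
    W = W + lr * (np.outer(v_data, h_data) - np.outer(v, h))
    b_v = b_v + lr * (v_data - v)
    b_h = b_h + lr * (h_data - h)
    return W, b_v, b_h

# Toy usage on a 6x4 RBM.
nv, nh = 6, 4
W = rng.normal(scale=0.1, size=(nv, nh))
b_v, b_h = np.zeros(nv), np.zeros(nh)
v0 = (rng.random(nv) < 0.5).astype(float)
W, b_v, b_h = cd_k(v0, W, b_v, b_h, k=5, lr=0.1)
```

The two matrix-vector products per Gibbs step (v @ W and W @ h) are exactly the operations the coupling array performs in the analog domain in the GS design.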

Algorithm 1 Contrastive divergence algorithm for training
1: Input: training_data, learning_rate (ε), k, batch_size (B)
2: Output: weights (W), visible bias (b_v), hidden bias (b_h)
3: Initialize W with small random values
4: Initialize b_v and b_h to zero
5: Split training_data into minibatches of size B
6: for epoch = 1, 2, ..., N do
7:   for b = 1, 2, ..., n_batch do
8:     for s = 1, 2, ..., B do
9:       clamp the visible units to training sample v_d
10:      sample h_d ~ p(h | v_d)               ⊲ positive phase
11:      accumulate ⟨v_d⊤ h_d⟩
12:      for t = 1, 2, ..., k do
13:        sample v_m ~ p(v | h)
14:        sample h_m ~ p(h | v_m)             ⊲ negative phase
15:      end for
16:      accumulate ⟨v_m⊤ h_m⟩
17:    end for
18:    W ← W + ε(⟨v_d⊤ h_d⟩ − ⟨v_m⊤ h_m⟩)
19:    b_v ← b_v + ε(⟨v_d⟩ − ⟨v_m⟩); b_h ← b_h + ε(⟨h_d⟩ − ⟨h_m⟩)
20:  end for
21: end for

Architectural support: At every learning step, the current weight matrix [w_ij]_{m×n} is programmed into the coupling array such that the resistance of each unit R_ij is proportional to 1/w_ij. This step is just like programming the optimization formula in a standalone Ising machine. If we further clamp one set of nodes (say, visible) to fixed values, each coupling unit produces a current equal to the voltage of the visible node divided by the resistance of the programmable coupling unit, which is equivalent to multiplying by the corresponding weight in the matrix. Each hidden node therefore sees the sum of the currents in the entire column. Used this way, the coupling array effectively performs a vector-matrix multiplication (in the analog domain).
At this stage, rather than reading out the resulting currents, we opt to feed the current through a non-linear circuit that produces the effect of the logistic function. In fact, a modified inverter can approximate the function admirably. Finally, the output of the logistic function is the probability of the node being 1. This can also be supported with a relatively straightforward circuit: a comparator whose other input is fed with a pseudo-random voltage level. Details of the circuit design and related issues can be found in Appendix B.
Operation: With the architectural support described, much of the training loop in Algorithm 1 is offloaded to the hardware. The remaining operations are carried out on digital functional units such as a TPU. The overall operation sequence is as follows: (1) Initialization on the TPU; (2) Programming the (initially random) coupling matrix and biases to the Ising substrate; (3) Clamping the visible units to a training sample (v_d); (4) Reading out the hidden units (h_d) after the Ising substrate finishes operation; (5) Performing the equivalent of k-step Gibbs sampling by initializing the hidden units and allowing the Ising substrate to evolve and produce state samples; (6) Reading out the final values from the visible (v_m) and hidden (h_m) units; (7) Repeating steps 3 to 6 to collect the samples for calculating the gradient; (8) Computing (on the TPU) the new coupling matrix and biases by calculating the gradient from the collected samples according to the formula (⟨v_d⊤ h_d⟩ − ⟨v_m⊤ h_m⟩); (9) Repeating from step 2 for subsequent minibatches.

Boltzmann gradient follower architecture
The system described in Sec. 3.2 represents an improvement over digital units, as we shall see later. But this is largely due to the efficiency gain from approximate analog implementations. The benefit of nature-based computing is often much greater when we avoid frequent interactions between specialized hardware and a more general-purpose computing substrate. For this to happen, we need a deeper understanding of the intention of the algorithm.
The goal of the MCMC algorithm: With RBM, the goal is to capture the training data with a probability distribution model, i.e., of all possible inputs, those in the training set should have high probabilities. The probability of an input is exponentially related to the energy of that input, as in the Boltzmann distribution (hence the name):

p(v, h) ∝ e^{−E(v,h)}

Clearly the probabilities need to sum up to one, thus the equation for the probability of a particular state (v, h) is:

p(v, h) = e^{−E(v,h)} / Σ_{v',h'} e^{−E(v',h')}

When we say the machine captures the training data, we mean the machine's model (weights and biases) maximizes the probability of all S training samples. In other words, it maximizes Π_{s=1}^{S} p(v^(s)), where v^(s) is the s-th training sample, or equivalently the sum of the log probabilities: Σ_{s=1}^{S} log(p(v^(s))). Note that p(v) = Σ_h p(v, h). Since the probability is a function of the parameters (coupling weights and biases), we follow the gradient of each parameter. For the coupling parameter w_ij, the gradient is as follows:

∂ log p(v) / ∂w_ij = ⟨v_i h_j⟩_d − ⟨v_i h_j⟩_m    (8)

For notational clarity, we now look only at the contribution of one training sample (v = v^(s)) to the gradient and focus on the first part of the RHS of Eq. 8:

⟨v_i h_j⟩_d = Σ_h p(h | v) v_i h_j

Here the notation ⟨•⟩_d means the expectation with respect to the data, i.e., keeping the data constant (v) and averaging over all possible h.
If we follow the same steps, the second part of the gradient of log p(v) gives us:

⟨v_i h_j⟩_m = Σ_{v,h} p(v, h) v_i h_j

Here the notation ⟨•⟩_m means the expectation with respect to the entire state space given by the current model (coupling parameters and biases).
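The two expectations can be checked exactly on a tiny RBM by brute-force enumeration. This is our own illustration (biases omitted for brevity), comparing the data and model expectations against a numerical derivative of log p(v):

```python
import itertools
import numpy as np

def energy(v, h, W):
    return -v @ W @ h                 # RBM energy, biases omitted

def log_p_v(v, W):
    """log p(v) for an exhaustively enumerated 2x2 RBM."""
    states = [np.array(s, float) for s in itertools.product([0, 1], repeat=2)]
    Z = sum(np.exp(-energy(vv, hh, W)) for vv in states for hh in states)
    return np.log(sum(np.exp(-energy(v, hh, W)) for hh in states) / Z)

W = np.array([[0.3, -0.1],
              [0.2,  0.4]])
v = np.array([1.0, 0.0])
states = [np.array(s, float) for s in itertools.product([0, 1], repeat=2)]

# <v_i h_j>_d : clamp v, average over p(h | v).
w_h = np.array([np.exp(-energy(v, hh, W)) for hh in states])
p_h_given_v = w_h / w_h.sum()
pos = sum(p * np.outer(v, hh) for p, hh in zip(p_h_given_v, states))

# <v_i h_j>_m : average over the full model distribution p(v, h).
w_all = np.array([np.exp(-energy(vv, hh, W)) for vv in states for hh in states])
p_all = w_all / w_all.sum()
neg = sum(p * np.outer(vv, hh)
          for p, (vv, hh) in zip(p_all, itertools.product(states, states)))

# The difference of expectations matches d log p(v) / dW numerically.
eps = 1e-6
dW = np.zeros_like(W)
for i in range(2):
    for j in range(2):
        Wp = W.copy()
        Wp[i, j] += eps
        dW[i, j] = (log_p_v(v, Wp) - log_p_v(v, W)) / eps
assert np.allclose(dW, pos - neg, atol=1e-4)
```

At realistic sizes the sums over states are intractable, which is exactly why the sampling-based approaches below are needed.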
As we can see, calculating any parameter's gradient requires the expectation over a large number of states, which is impractical. The common solution is an MCMC algorithm (e.g., CD-k), just like simulated annealing in solving an Ising-formulation problem. Both algorithms are inspired by Boltzmann statistics. The Ising machine substrate that we use directly embodies such statistics and can thus sample the state space with extraordinary efficiency. To see this efficiency, let us look at a concrete example. For a 784×200 RBM, the k-iteration software Gibbs sampling (lines 12 to 15 in Algorithm 1) takes n = k × (784 + 200) steps on the Markov chain and roughly k × 10^6 instructions. This is equivalent to going through a trajectory on our dynamical system with roughly n phase points, each taking roughly a dozen picoseconds on average.
Architectural support: When we initialize the Ising substrate to some initial condition, it proceeds to traverse the energy landscape, directed by both the system's governing differential equations and the annealing control. This has the effect of "sampling" the state space and arguably produces better samples than the algorithmic random walk in CD-k. However, there is a problem: the samples are produced much faster than the host computer can access and post-process them to obtain the expectations. We therefore propose a more direct approach, where the sampled expectations (⟨v_i h_j⟩_d or ⟨v_i h_j⟩_m) are directly added to or subtracted from the model parameter (e.g., w_ij) inside the Ising substrate, without involving the host.
In the CD-k algorithm, learning is as follows, where ⟨•⟩_{S±} indicates the expectation over a set of positive (S+) or negative (S−) phase samples:

w_ij^(t+1) = w_ij^(t) + ε(⟨v_i h_j⟩_{S+} − ⟨v_i h_j⟩_{S−})    (11)

Here the expectations are accumulated over a minibatch of (say, B = 500) samples before being used to update the parameters to the next value. The choice of B is usually a matter of convenience for the software. For us, the minibatch size is effectively one (B = 1), which requires a correspondingly smaller ε (roughly 500× less than that needed for B = 500) for a similar effective learning rate. In this setup, the hardware takes one positive phase sample and increments w_ij based on it, then takes a negative phase sample and decrements w_ij based on that sample. This is ideal for hardware implementation, as no expensive storage is needed to accumulate statistics for calculating expectations. In our baseline Ising machine, each w_ij is physically expressed by the conductance of a configurable resistor, which is implemented by a transistor with a variable gate-source voltage. Increasing and decreasing w_ij can thus be achieved by raising or lowering the gate voltage using the charge-redistribution circuit shown in Fig. 4, with more details in Appendix B.
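The ε rescaling can be seen with a toy calculation: summing per-sample steps of size ε/B matches one minibatch step of size ε on the averaged gradient. This is an idealization of our own (it ignores that the hardware updates parameters between samples, so the gradients it sees change slightly sample to sample):

```python
import numpy as np

rng = np.random.default_rng(1)
grads = rng.normal(size=(500, 4))    # per-sample gradient estimates, B = 500
eps = 0.1

# Minibatch update: average the B gradients, take one step of size eps.
batch_update = eps * grads.mean(axis=0)

# Streaming (B = 1) updates: one step per sample with step size eps / B.
stream_update = sum((eps / 500) * g for g in grads)

assert np.allclose(batch_update, stream_update)
```

This is why an effective minibatch of one, with ε scaled down by roughly the batch size, yields a similar effective learning rate.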
A fine point can be made here as to whether our learning algorithm produces a biased estimator. Our empirical analysis shows that estimation bias appears to have no effect on the ultimate accuracy measures. If anything, our modifications seem to reduce the bias of the commonly used algorithms, as detailed in Appendix A.
For negative phase samples, we use p different initial conditions for the hidden units (h^(s), s = 1..p) to allow p independent random walks. These are often referred to as p particles. After each positive phase, we load one of the particles (say, h^(3)); perform annealing of the Ising machine (equivalent to the random walk of the von Neumann algorithm); take a sample of v_i h_j (negative phase); and store the resulting hidden unit values back to h^(3) for persistence [63].
The resulting operation can thus be characterized by the following equations:

w_ij^(t+1/2) = w_ij^(t) + f_ij(ε ⟨v_i h_j⟩_{S+})
w_ij^(t+1) = w_ij^(t+1/2) − f_ij(ε ⟨v_i h_j⟩_{S−})    (12)

Contrasting Eq. 12 with Eq. 11, we can summarize the effective changes to the original algorithm as follows: (1) our parameters are updated mid-step: positive phase samples (S+) are taken under w^(t), resulting in updated parameters w^(t+1/2), under which negative phase samples (S−) are taken. In the original algorithm, both samples are taken under the same parameters (w^(t)).
(2) the increments are achieved by hardware adjustment, which has both non-linearity and variations. This effect is captured by the function f_ij(•). (3) for the convenience of the hardware, the effective mini-batch size is 1. Finally, since the entire learning is now conducted inside the (augmented) Ising substrate, the coupling unit is larger (Fig. 4). Furthermore, the trained results need to be read out at the end of the learning process, which requires extra analog-to-digital converters (ADCs), which are expensive. Nevertheless, keep in mind that they are only used once, at the end of the entire algorithm.
In short, though the architecture needs some non-trivial new circuits and small modifications to the operating algorithm, it carries out the intention of a traditional software implementation. As we will see, the resulting quality is no different from a software implementation, even under non-trivial noise considerations.

Operation: Figure 4 shows the proposed Boltzmann gradient follower architecture. The key addition is to the coupling unit, where the programmable resistor serving as the weight can be adjusted in place. With this architecture, the digital (host) computer takes a much more peripheral role, mostly setting the system up, feeding data at a fixed frequency, and finally reading the results. The operation can be described as follows: (1) Initialize the weights and biases. (2) The host sends training samples to latches at the visible units.
(3) The machine clamps the data and waits for a predetermined time for the hidden units to settle. The resulting sample ⟨v_i h_j⟩_{S+} will increment w_ij. (4) The machine then loads one of the p particles and starts the annealing process. (5) After annealing, the resulting sample ⟨v_i h_j⟩_{S−} will decrement w_ij. (6) The process (2-5) is repeated for a programmable number of learning steps. Then the ADCs read out the coupler control voltages one column at a time.

Experimental Analysis
To better understand the characteristics of our augmented Ising substrate for EBMs, we compare various metrics between our proposed design and a TPU [37]. It is worth noting up front that an evolutionary perspective is needed when viewing the results. Both digital numerical architectures and nature-based computing systems will evolve. Design refinement can bring significant changes to these metrics. More importantly, the line between the two groups will continue to blur, and cross-pollination is very much an intended consequence.
We next describe the experimental setup, including benchmark RBM and DBN models; then show quantitative comparisons between a TPU and a noiseless model of our analog design in terms of energy efficiency and throughput; and finally present a few in-depth analyses to help understand how the system behaves.
To train an RBM as a recommendation system [64] and as an anomaly detector [55], we use the MovieLens 100k dataset [27] and the "European Credit Card Fraud Detection" dataset [16], respectively. The learning rate used to train these models is 0.1, and the sizes of the RBM and DBN configurations are shown in Table 1. Since RBMs are unsupervised models, one way to quantify the quality of training is the average log probability of the training samples, which can be measured using annealed importance sampling [58]. We also report some common metrics: classification accuracy using a final logistic regression layer for image classification, mean absolute error (MAE) of test data and projected data for recommendation systems, and area under the receiver operating characteristic (ROC) curve for anomaly detection.
Modeling: Behavioral models of the Gibbs sampler (GS) and the Boltzmann gradient follower (BGF) were developed in Matlab. All circuits are modeled in Cadence at 45 nm using the Generic Process Design Kit (GPDK045) to obtain power and area. We designed a 32×32 array. We used these behavioral models to obtain performance and noise analysis. We assume the system has enough nodes to fit the largest problems in the set. Thus execution time is just the product of the number of iterations and the cycle count per iteration. Anything not carried out on our hardware is done on the host machine, which we take to be the same TPU as the baseline.
The area and power consumed by the TPU were obtained from [37], which uses a 28 nm technology. We used 8-bit ADCs to read out the final trained weights. To input training data, we used 8-bit DTCs. We adopt DTC and ADC designs based on the measurement-validated designs in [47] and [43], respectively.

Execution Speed
We first consider the execution speed of the two architectures discussed: the Gibbs sampler (GS) and the Boltzmann gradient follower (BGF). The operating frequency is 1 GHz for the digital portions of both. For comparison, we include a TPU (v1) and a GPU. Fig. 5 shows the execution time of different benchmarks, all normalized to that of BGF. As we can see, BGF has a 29x geometric-mean speedup over the TPU, whereas GS has 2x. However, this is not meant as an apples-to-apples comparison, and we need to contextualize the results: First, the TPU/GPU are meant to be more general-purpose and can handle arbitrarily sized workloads. In their current incarnation, BGF and GS have a fixed capacity and are limited in the problem sizes they can process. That said, most datasets we used comfortably fit inside a small die. Other than increasing die size for larger capacity, we note that scaling beyond a single chip's capacity is feasible and is part of the community's ongoing research [59].
Second, communication between the Ising substrate and the host is fully accounted for, and amounts to about a quarter of the time GS spends waiting for the host. Indeed, the advantage of BGF over GS is precisely the removal of the Amdahl's-law bottleneck of computing on, and communicating with, the host: BGF automatically follows the gradient and only communicates end results to the host. Nevertheless, the fundamental efficiency of the Ising substrate is still an important factor, as we show in Table 3.

Energy Consumption
A primary source of the fundamental efficiency of an Ising substrate is that many algorithms mimic nature, whereas our hardware directly embodies it. For instance, in a typical step of MCMC, flipping one node requires roughly O(N) multiply-accumulate (MAC) operations followed by some probability sampling. Ignoring the probability sampling, which may be much more costly, just a MAC operation costs on the order of a pJ. For the problems discussed, N ≈ 1000, so one such flip requires on the order of a nJ using conventional digital computation. In contrast, in BRIM, flipping a node involves a distributed network of currents charging and discharging a nodal capacitor. With nodal capacitors on the order of 50 fF and a voltage of roughly 1 V, flipping a node takes on the order of 100 fJ. Thus, an Ising substrate has the potential to be about 4 orders of magnitude more efficient than a conventional computational substrate. For the entire system, the energy savings depend on many factors not included in this simplified analysis. With the technology difference in mind, we can compare the energy consumption of different benchmarks, shown in Figure 6. All energy results are normalized to that of the Boltzmann gradient follower. Our accelerators are more energy efficient in the effective operations they carry out than the digital TPU operations prescribed by the algorithm. Overall, we see improvements of around 1000x for BGF. Two qualitative points are worth mentioning. On the one hand, digital circuits can still improve, and more efficient designs can further reduce per-operation cost. On the other hand, the type of short random walks performed on our Boltzmann gradient follower architecture is not the most efficient use of the Ising substrate. Further algorithmic innovation may well find applications that fully utilize the abilities of ours and similar nature-based systems.

Finally, we look at the chip area estimates. Again, given the technology difference, this is also not a direct comparison. Nevertheless, our baseline 28 nm TPU takes about 330 mm², with 24% of it being the MAC array. In comparison, the area and power of BGF and its building blocks are shown in Table 2. Because the coupling units are by far the most numerous elements (roughly O(N²) of them vs O(N) for the other units), the area is largely determined by the coupling units. Assuming a 1600 × 1600 array, a Boltzmann gradient follower costs about 21 mm², which is small compared to the TPU area. As shown in Table 3, BGF is much more efficient, in this specialized algorithm, than the TPU or state-of-the-art computational accelerators.
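The per-flip energy comparison above can be checked with back-of-envelope arithmetic. The sketch below simply restates the text's figures (about 1 pJ per MAC, N ≈ 1000, a 50 fF capacitor at roughly 1 V); it is not a circuit simulation.

```python
# Back-of-envelope energy comparison using the figures from the text.
n = 1000                       # couplings per node for the problems discussed
e_mac = 1e-12                  # ~1 pJ per digital multiply-accumulate
e_digital_flip = n * e_mac     # energy to evaluate one node flip digitally (~1 nJ)

C, V = 50e-15, 1.0             # nodal capacitor ~50 fF, ~1 V swing
e_analog_flip = C * V**2       # energy drawn from the supply to charge the node
                               # (CV^2: half stored, half dissipated)

ratio = e_digital_flip / e_analog_flip
print(f"digital: {e_digital_flip:.1e} J, analog: {e_analog_flip:.1e} J, "
      f"ratio ~{ratio:.0f}x")
```

The resulting ratio is on the order of 10^4, consistent with the "about 4 orders of magnitude" estimate in the text.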

Impact of algorithmic change
From a first-principles analysis, our Boltzmann gradient follower architecture simply implements a different style of stochastic gradient descent. It will not give the exact same trained weights, but should provide similar solution quality to the end user. We now analyze numerically the changes resulting from the algorithmic modification.
We use the two metrics discussed before: the average log probability of the training samples and classification accuracy. Fig. 7 shows the average log probability of models obtained using different methods (CD-1, CD-10, and our modified version for the Boltzmann gradient follower) over a period of training with different data sets. Note

Figure 7: Average log probability of different models (higher is better) for the conventional algorithms (CD-1 and CD-10) and our modified algorithm used for the Boltzmann gradient follower (BGF), shown for KMNIST, FMNIST, and EMNIST. The thumbnails show the general trend of other workloads.
that the log probability is computationally intractable and is thus approximated with AIS [58], as already mentioned in Sec. 3.3. Indeed, for some data sets (e.g., CIFAR10), the same AIS mechanism fails altogether to produce a finite estimate of the partition function.
The result therefore should not be read with much precision beyond the general trend. We show a few example data sets.
As we can see, the trajectories of log probability generally increase over time, often quite substantially. This means the trained models approximate the probability distribution of the training data better over time. The exact trajectory, however, is highly variable even under common practices of CD-k with different k values. Our modified algorithm understandably produces its own trajectory. Compared to CD-10, the difference in trajectory is often less pronounced than that resulting from choosing a small k for expediency. In addition to the uncertainty in the log probability estimate itself, keep in mind that beyond a certain point, one could argue that better log probability is simply a result of overfitting. Thus we also show classification errors in Table 4. Note that for the first 6 tests, the tool by default reports only two digits of significance, which masks any difference between using CD-k and BGF; we artificially increased the reported precision. We can see that although the results are clearly different, both methods provide essentially the same accuracy.
Overall, the main takeaway is this: we made small modifications to the standard algorithm for the convenience of implementation. From a first-principles perspective, these changes are akin to choosing CD-k over the impractical maximum-likelihood learning, or simply selecting a different k value in CD-k.

Impact of noise on training
Finally, we look at the impact of noise and variation on solution quality. Until this point, the results have assumed that the analog hardware suffers from no noise or variation. To simulate process variation and circuit noise, we inject both static variation on the resistance of the coupling units and dynamic noise at both the nodes and the coupling units. The noise and variation are generated from Gaussian distributions with root-mean-square (RMS) values between 3% and 30%, for a total of 25 combinations; each result is thus characterized by a pair of (variation RMS, noise RMS) values. For clarity, Fig. 8 shows smoothed trajectories of the mean log probability under these conditions. In general, the message is perhaps somewhat predictable: when the combination of noise and variation is not too extreme (e.g., < 10% each), the impact on log probability is negligible. In many cases, the loss is smaller than the difference introduced by modifying the algorithm to suit our Boltzmann gradient follower. Even for the more extreme variation and noise configurations, the impact on log probability does not appear significant, and the final inference accuracy is essentially unchanged for all image-based benchmarks. For the recommendation system and anomaly detection benchmarks, the final results do show a little variation, as shown in Figs. 9 and 10. To sum up, while the analog circuits are subject to noise, they can competently perform the gradient descent function under even significant amounts of noise without affecting the quality of the overall training process.
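This injection methodology can be mimicked in a behavioral model along the following lines. This is only a sketch with hypothetical RMS values and a hypothetical scaling convention for the dynamic noise, not the exact Matlab model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def apply_noise(W, static_rms=0.10, dynamic_rms=0.10):
    """Model hardware imperfections on a coupling matrix W:
    a static multiplicative per-device variation (fixed at fabrication
    time) plus a fresh additive dynamic noise sample per evaluation."""
    static = rng.normal(0.0, static_rms, W.shape)    # process variation
    dynamic = rng.normal(0.0, dynamic_rms, W.shape)  # thermal/circuit noise
    # Scale the additive noise by the mean coupling magnitude so the
    # RMS value is interpreted relative to typical weight strength.
    return W * (1.0 + static) + dynamic * np.abs(W).mean()

W = rng.standard_normal((16, 16))
W_noisy = apply_noise(W, static_rms=0.10, dynamic_rms=0.10)
```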

Discussions
Overall, experimental analysis of the two architectures provides some evidence that even without new algorithms specifically exploiting the capability of the hardware, we can obtain substantial performance and energy benefits with a very small additional chip footprint. On the other hand, computational substrates such as TPUs are clearly much more general-purpose. At this stage, our analysis does not yet suggest that our designs are compelling per se, but we believe it shows that there is potential in this general direction. To compare the bias characteristics of the algorithms, each run of the algorithm produces one KL divergence measure. We repeat 400 runs for each algorithm from different random initial conditions. The resulting 400 measures are plotted as a cumulative probability distribution in Fig. 11.
We can see that all the algorithms have fairly similar bias characteristics. This is not surprising, since BGF is really a modified CD algorithm. However, because of its inherent speed in exploring the phase space, BGF can be thought of as CD-k with a very large k. As k → ∞, CD-k effectively becomes maximum-likelihood (ML) learning. As a result, we see that BGF indeed offers a lower chance of a large KL divergence than conventional CD-k. Overall, the takeaway is clear: our BGF does not create a problem of biased estimation. If anything, it improves on the bias characteristics of the commonly used von Neumann algorithm.
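The per-run measure and its cumulative plot follow standard definitions. The sketch below uses a hypothetical set of 400 gamma-distributed KL values purely to illustrate the bookkeeping; it is not data from our experiments.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as arrays."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def empirical_cdf(samples):
    """Sorted values and their cumulative probabilities, as in Fig. 11:
    a point (x, y) means a fraction y of runs scored x or less."""
    x = np.sort(np.asarray(samples))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# 400 hypothetical per-run KL measures -> one CDF curve.
rng = np.random.default_rng(2)
kls = rng.gamma(2.0, 0.05, size=400)
x, y = empirical_cdf(kls)
```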

Appendix B Circuit
In addition to the circuits needed to implement the baseline Ising substrate, we need extra circuits in the nodes to make them probabilistic according to the RBM algorithm; these are summarized in Fig. 12. Specifically, we need a current summing circuit (Sec. B.1), a sigmoid function (Sec. B.2), and a random number generator (Sec. B.3). Finally, for the Boltzmann gradient follower architecture, the coupling unit is discussed in Sec. B.4.

B.1 Current summation
The summation process illustrated in Fig. 12(a) consists of distinct operating phases. During the clamp phase (when the clamp switch is closed), the visible nodes (v_i) are clamped to the training samples, while the hidden nodes (h_j) are clamped to Vdd/2 for initialization and reset purposes. During the free-running phase (when the switch is open), the current flowing from the coupling units begins to integrate on the capacitor, resulting in a gradual voltage shift. The bias values of the nodes are set by biasing one terminal of the capacitor to Vdd/2.

B.2 Sigmoid unit
A sigmoid function, commonly formulated as σ(x) = 1/(1 + e^(−c1(x−c2))), serves as an activation function that introduces nonlinearity in neural networks, or converts the output of a linear model to a probability for certain applications. The two hyper-parameters, c1 and c2, allow fine-tuning of the sigmoid curve, adapting it to diverse situations. This same functionality can be achieved using a differential-to-single-ended amplifier, as illustrated in Fig. 13(a). In this configuration, the amplifier's gain is intentionally set to a low level, making the transfer function closely resemble a sigmoid shape. An advantage of this approach lies in its flexibility, as the shape of the transfer function can be adjusted through a current-bias control. By choosing different values for the bias current, the gain of the amplifier can be modified accordingly, which is analogous to tuning the hyper-parameters of a sigmoid function.
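In software terms, the parameterized curve the amplifier approximates can be written as follows; the names c1 and c2 mirror the hyper-parameters above, with c1 playing the role of the amplifier gain and c2 the bias point.

```python
import numpy as np

def sigmoid(x, c1=1.0, c2=0.0):
    """Parameterized sigmoid 1 / (1 + exp(-c1 * (x - c2))).
    c1 sets the slope (the amplifier gain in the circuit analogy);
    c2 shifts the midpoint (the bias point, e.g. Vdd/2)."""
    return 1.0 / (1.0 + np.exp(-c1 * (x - c2)))

# A lower "gain" c1 stretches the transition region, as with the
# intentionally low-gain differential amplifier.
steep = sigmoid(0.1, c1=20.0)
gentle = sigmoid(0.1, c1=1.0)
```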

B.3 Random number generator
Noise generators and chaotic-signal generators are the two approaches to producing random numbers most common in the literature. In this study, we leverage the thermal noise of a diode to create a random number generator, as shown in Fig. 13(b). Two diodes, D1 and D2, act as noise sources at the input of a variable-gain noise amplifier. The amplifier is biased at Vb = Vdd/2, so the noise is amplified to a random voltage distributed around Vb with a swing set by the amplifier gain. The generated noise signal is subsequently applied to the input of a standard dynamic comparator [5] and compared against the output of the sigmoid unit, thus enabling probabilistic sampling. The output buffer (typically a latch, following [5]) serves to digitize the output voltage and ensure sufficient drive capability. Several devices can serve as the diode, such as a gate-drain-shorted MOSFET or a base-collector-shorted BJT, to reduce the chip area.
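Behaviorally, the comparator-plus-latch chain reduces to sampling a Bernoulli bit. The sketch below idealizes the amplified noise as uniform on (0, 1), so that the node fires with exactly the sigmoid probability; this is an idealization, not a model of the Gaussian diode noise.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_node(p_on):
    """Idealized comparator: a uniform noise sample stands in for the
    amplified diode noise; the latch outputs 1 when the noise voltage
    falls below the sigmoid output p_on, so P(output = 1) = p_on."""
    return int(rng.random() < p_on)

# The empirical firing rate tracks the sigmoid output.
hits = sum(sample_node(0.3) for _ in range(10_000)) / 10_000
```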

B.4 Coupling unit for Boltzmann gradient follower
For each coupling unit representing a coupling parameter w_{i,j}, a training circuit is introduced to allow the Boltzmann gradient follower algorithm to adjust w_{i,j}, as shown in Fig. 14(a). This adjustment is achieved through a charge-redistribution-based CMOS charge pump circuit, depicted in Fig. 14(b). The charge pump consists of two pre-charging transistors and two charge-transfer transistors, driven by two non-overlapping control signals, and is regulated by phase control to differentiate between incrementing and decrementing w_{i,j} during the different phases (positive/negative). Parasitic capacitances are present at the internal nodes. Note that we use complementary phase control for the charge-transfer transistors, enabling the updates of w+_{i,j} and w-_{i,j} in opposite directions. The operation of the coupler, including the training circuit, is as follows; to be explicit, we only discuss the increment and decrement of w+_{i,j} here. During the pre-charge phase (when the pre-charge clock is high), the pump capacitor is precharged to the supply voltage, storing a fixed amount of charge on it; concurrently, the transfer node is discharged to ground. Subsequently, during the charge-transfer phase (when the pre-charge clock is low): Positive sample phase (ph = 1): if v_i · h_j = 1, the charging transistor is activated, redistributing charge onto the weight capacitor and hence increasing w+_{i,j}; if v_i · h_j = 0, it remains off, preserving w+_{i,j}. The discharging path is kept off throughout the positive sample phase to ensure that w+_{i,j} is not decremented.
Negative sample phase (ph = 0): if v_i · h_j = 1, the discharging transistor is activated to redistribute charge off the weight capacitor, resulting in a decrease in w+_{i,j}; if v_i · h_j = 0, it is kept off, maintaining the value of w+_{i,j}. w+_{i,j} cannot be incremented because the charging path is kept off throughout the negative sample phase.
The timing diagram in Fig. 14(c) illustrates one possible system state of the coupler. Due to the non-overlapping control signals, there is no current path directly connecting the supply to the weight capacitor or to ground, so the charge transferred onto the weight capacitor per cycle is determined by the ratio of the pump and parasitic capacitances. By circuit design, this charge transfer can be accurately controlled to achieve a step size of only a small number of electrons.

Figure 5: Execution time normalized to that of BGF for different RBMs and an image batch size of 500.

Figure 6: Energy consumption of the TPU and GS over various benchmarks, normalized to that of BGF, for an image batch size of 500.

Figure 8: Moving average of the mean log probability of different models under varying amounts of injected noise and variation. The data are smoothed using a 10-point moving average.

Figure 9: Mean absolute error (MAE) of different models under varying amounts of injected noise and variation. The final MAE ranges between 0.709 and 0.7258.

Figure 10: ROC curves of different models under varying amounts of injected noise and variation. The final AUC ranges between 0.957 and 0.963.

Figure 12: Component-level diagram for a node. Detailed implementations of (b), (c), and (d) are depicted in Fig. 13.

Table 1: Dataset parameters of the different types of neural networks used in evaluation.

Table 2: Area and power of sub-units of the Gibbs sampler and Boltzmann gradient follower at different numbers of nodes (N). CU = coupling unit; SU = sigmoid unit; RNG = random number generator; DTC = digital-to-time converter.

Table 3: Comparison between different accelerators.

Table 4: Test accuracy obtained by the different types of neural network models for each data set using different algorithms. The accuracy is rounded by an "sklearn" (Python library) function.
Statistical physics has inspired many effective algorithms. Physical Ising machines are dynamical systems that embody the physics underlying popular algorithms such as simulated annealing, RBM, and their derivatives. As a result, an Ising machine substrate (when done right) can be used as an extreme accelerator for certain parts of these physics-inspired algorithms. In this paper, we showcased two different designs that augment an Ising substrate with extra circuitry to support RBM training. We showed that with some small changes, an Ising machine can easily serve as a Gibbs sampler to accelerate the RBM algorithm. With more substantial changes, the substrate can serve as a Boltzmann sampler while following the gradient to train an RBM without any additional host computation. Compared to an idealized TPU host significantly larger in chip area, such a Boltzmann gradient follower can achieve some 29x speedup and 1000x energy savings, though these systems are far less general-purpose than TPUs. With further research, we believe that hardware-software co-designed nature-based computing systems can become an important new computation modality.

Figure 11: Cumulative probability distribution of the KL divergence of CD and BGF training results against the ground truth. In other words, every point (x, y) on the curve shows that a fraction y of training runs have a final KL divergence of x or less from the ground truth.