SuperBP: Design Space Exploration of Perceptron-Based Branch Predictors for Superconducting CPUs

ABSTRACT

Single Flux Quantum (SFQ) superconducting technology has a considerable advantage over CMOS in power and performance. SFQ CPUs can also help scale quantum computing technologies, as SFQ circuits can be integrated with qubits due to their amenability to a cryogenic environment. Recently, there have been significant developments in VLSI design automation tools, making it feasible to design pipelined SFQ CPUs. SFQ technology, however, is constrained by the number of Josephson Junctions (JJs) that can be integrated into a single chip. Prior works focused on JJ-efficient SFQ datapath designs. Pipelined SFQ CPUs also require branch predictors that provide the best prediction accuracy for a given JJ budget. In this paper, we design and evaluate the original Perceptron branch predictor and a later variant, the Hashed Perceptron predictor, in terms of their accuracy and JJ usage.

Since branch predictors, to date, have not been designed for SFQ CPUs, we first design a baseline predictor built using non-destructive readout (NDRO) cells for storing the perceptron weights. Given that NDRO cells are JJ intensive, we propose an enhanced JJ-efficient design, called SuperBP, that uses high-capacity destructive readout (HC-DRO) cells to store perceptron weights. HC-DRO is a recently introduced multi-bit fluxon storage cell that stores 2 bits per cell. HC-DRO cells double the weight storage density over basic DRO cells, improving prediction accuracy for a given JJ count. However, naive integration of HC-DRO with SFQ logic is inefficient: because HC-DRO cells store multiple fluxons in a single cell, they need a decoding step on a read and an encoding step on a write. SuperBP presents novel inference and prediction update circuits for the Perceptron predictor that operate directly on the native 2-bit HC-DRO weights without decoding and encoding, thereby reducing JJ use. SuperBP reduces the JJ count by 39% compared to the NDRO-based design.

We evaluate the performance of Perceptron and its hashed variants with the HC-DRO cell design using a range of benchmarks, including SPEC CPU 2017, mobile, and server traces from the 5th Championship Branch Prediction competition. Our evaluation shows that for a given JJ count, the basic Perceptron variant of SuperBP provides better accuracy than the hashed variant. The hashed variant uses multiple weight tables, each of which needs its own access decoder, and decoder designs in SFQ consume a significant number of JJs. Thus, the hashed variant of SuperBP spends part of the JJ budget on accessing multiple tables, leaving less weight storage capacity and compromising prediction accuracy. The basic Perceptron variant of SuperBP improves prediction accuracy by 13.6% over the hashed perceptron variant for an exemplar 30K JJ budget.

CCS CONCEPTS

• Hardware → Superconducting circuits; • Computer systems organization → Architectures.



INTRODUCTION
Single Flux Quantum (SFQ) based CPU designs have recently been introduced in the literature [5, 6, 8, 21, 24, 46]. SFQ CPUs are gate-pipelined designs due to the inherent nature of magnetic pulse storage and movement (more details in the background). Deeply pipelined CPUs suffer from two constraints: long delays for Read-After-Write (RAW) hazards and a large branch misprediction penalty. Prior works have focused on building critical microarchitecture blocks such as fast register files [5, 8, 46] and execution units [11, 21, 39]. These design innovations help reduce the impact of RAW hazards.
As we look into future SFQ CPU designs, it is important for our research community to explore efficient branch predictors for SFQ CPUs, which will in turn enable speculative execution microarchitectures. One consequential limitation when exploring branch predictor designs for SFQ technology is the Josephson Junction (JJ) count that can be integrated into a single chip in the current fabrication process [40]. Hence, predictor designs must consider the total JJ count, accounting for both the predictor storage and the logic used to access and update the predictors.
In this work, we explore two variants of the well-known perceptron-based branch predictor [14] for SFQ CPUs. We design the original perceptron [14] and the hashed perceptron variant [42]. While these predictor implementations in CMOS designs are well known, there is a plethora of challenges in implementing the predictors in SFQ designs. For instance, implementing demultiplexer (DEMUX) designs in SFQ technology is JJ-intensive. Hence, designs that need a large number of DEMUX circuits will compromise the size of the perceptron weight storage. Thus, a careful JJ-neutral evaluation of branch predictors is critical for properly comparing branch predictor performance.
To this end, this work aims to design, implement, and refine a perceptron-based predictor using SFQ-based logic cells. We first design a baseline predictor that uses non-destructive readout (NDRO) cells for weight storage. NDRO cells enable the weights to be preserved even after reading the cell but come at the expense of nearly 7× the JJs of destructive readout (DRO) cells. We then present SuperBP (superconducting branch predictor), which uses high-capacity DRO (HC-DRO) cells to increase weight table storage capacity without increasing JJ counts. Each HC-DRO cell is designed to store two bits [19, 46] using the same number of JJs as a DRO cell. Thus, HC-DRO cells provide a unique opportunity to double the size of the branch predictor. The 2-bit values are encoded as up to 3 pulses in a single cell. Prior work that used HC-DRO cells to design register files [19, 46] required decoder and encoder circuits to transform pulse counts into 2 bits (MSB and LSB) of data, which is then processed by the execution units. Adding decoder and encoder circuits reduces the JJ efficiency of an HC-DRO-based branch predictor design. SuperBP presents a series of novel SFQ circuit-level adaptations that directly operate on the multiple pulses from the HC-DRO cells and store them back into the perceptron weight tables without any decoding and encoding overheads. Additionally, HC-DRO cells suffer from the destructive readout property; thus, it is critical to restore the values of a predictor entry after each read. We present a simple loopback buffer design to retain the values even after an entry is read from the history tables.
We implemented the original perceptron and the hashed perceptron variant designs [14, 42], using all the innovations mentioned above. The hashed perceptron predictor uses multiple weight tables, and each table needs a decoder to access the weight entry. Unfortunately, the decoder design in SFQ technology requires a substantial number of JJs. Hence, the need for multiple decoders leads to JJ overheads, reducing the JJ availability for weight storage. Thus, for a given JJ budget, our results show that the original perceptron predictor outperforms the hashed perceptron.
The primary contributions of this paper are as follows:

• We present an optimized NDRO-based perceptron branch predictor design as a baseline. As SFQ-based branch predictors have not been proposed in the literature, our goal here is to highlight some of the unique challenges in building this baseline, such as path balancing and reset port requirements. This design uses NDRO memory cells and SFQ logic gates to achieve perceptron functionality.

• We then present the design of SuperBP, the original perceptron branch predictor built with HC-DRO cells. We first describe the design of the inference and training units of the predictor, which compute directly on multi-bit cells without extra encoding and decoding circuits. These circuits are then integrated to create a fully operational perceptron predictor that can be included in an SFQ CPU.

• We then present the design of the hashed perceptron version of SuperBP.

• We evaluate the MPKI (branch predictor misses per thousand instructions) of the NDRO-based perceptron, SuperBP, and hashed perceptron SuperBP for a given JJ budget using a range of benchmarks: SPEC CPU 2017, mobile, and server application traces from the 5th Championship Branch Prediction [12]. Compared to the NDRO baseline, SuperBP can reduce the MPKI by 13.6% for the same JJ count.

• We evaluate the IPC improvements of SuperBP and show that for the same JJ budget, SuperBP achieves up to 10% higher performance than the hashed version.

BACKGROUND

Why build SFQ CPUs?
Single Flux Quantum (SFQ) circuits use Josephson Junction (JJ) devices, which can be clocked at 10-50 GHz to enable high-performance and energy-efficient computation. Advances in SFQ-based VLSI design tools provide well-characterized cell libraries [7], place-and-route algorithms [17], and more, paving the way for CPU designs. More recently, large investments [26] in SFQ-based computing have bet on building SFQ-based CPUs that can help scale quantum technologies. Their ability to perform highly power-efficient computations at low temperatures can be leveraged to control qubits and quantum sensors. So far, most SFQ-based qubit control architectures focus on building specialized units [10, 15, 41].
Unfortunately, specialized accelerators are not enough. We must design general-purpose SFQ CPUs to enable programmability and broad functionality. For example, calibrating a qubit device requires a series of steps that involves traversing a tree of functions iteratively [20]. This is often done in software to enable flexibility and modularity. Unfortunately, calibration sequences often result in complex control flow; supporting this operation with fixed-function qubit control hardware alone is not practical. We expect our branch predictor design to be embedded into in-order CPUs initially. Current roadmap predictions show that approximately 300K SFQ gates (approximately a million JJs) can be fabricated in the near term [1, 2, 32]. Given these predictions, we believe that most SFQ CPUs will start as in-order designs. In-order CPUs in CMOS generally have shallow pipelines, so they do not need complex branch predictors. However, in SFQ technology, even in-order CPUs will have very deep pipelines due to gate-level clocking needs. Hence, this work focuses on building a branch predictor for SFQ CPUs.

SFQ Logic and Branch Prediction
In CMOS technology, "1" is represented as a high voltage level and "0" as a low voltage level. SFQ logic, however, uses magnetic pulses to represent "1" and "0". The magnetic pulse is stored in the form of a single flux quantum, or fluxon. If a memory cell stores a fluxon, it stores a "1"; if it stores nothing, it stores a "0". Once the fluxons are read from the memory cells, they are transmitted between logic gates in the form of SFQ pulses. The existence of an SFQ pulse represents a "1", and the absence of a pulse represents a "0". However, most SFQ logic gates need a clock input to perform their logic function once the input pulse arrives. Hence, SFQ logic uses gate-level clocking to solve this issue. Since each logic gate has a clock and works like a pipeline stage, SFQ designs are inherently gate-level pipelined.

Clock Distribution
Each gate in SFQ logic requires a clock, which carries significant overhead. Clock distribution is an active research area, and several works aim to reduce clock overheads [16, 36, 38]. In particular, using dynamic SFQ (DSFQ) technology [30], researchers have successfully designed gates that do not need a clock; these gates are self-timed and self-resetting. This paper uses this technology to reduce clock distribution demands, but a complete clocking analysis is outside the scope of this paper.

DRO Memory Cell
In SFQ technology, the Destructive ReadOut (DRO) cell [24] is one of the most important cells. It is a basic memory cell that stores an SFQ fluxon. It can also be used as a buffer cell for path balancing [18] in a circuit (more details on path balancing later). DRO cells are also sometimes known as D flip-flops. Figure 1(a) shows the schematic of a DRO cell. It receives an SFQ pulse at input D. If it does not already have a fluxon (SFQ pulse) stored in it, it will store the fluxon in the superconducting loop J1-L2-J2. Otherwise, the incoming pulse is dissipated through the buffer junction J0. When we read from the DRO cell by sending a pulse to input CLK, the superconducting loop is reset and releases an SFQ pulse at the output Q. Since the loop is reset after each read operation, the read is destructive for a one-bit DRO cell.

HC-DRO Memory Cell
It is possible to store more than one fluxon in a memory cell. Such a cell is called the High Capacity Destructive ReadOut (HC-DRO) cell. By removing the J0 shown in Figure 1(a) and increasing the L2 inductance, one can design the HC-DRO cell shown in Figure 1(b). Now the J1-L2-J2 loop can hold multiple pulses. The design also needs increased critical currents for J1 and J2 to stabilize reads of the multiple pulses.
Our own prior work [19] showed that a 2-bit HC-DRO cell can be built robustly with careful inductor sizing and adjustment of the JJs' critical currents. We have designed and verified the operation of a robust 2-bit HC-DRO cell using JoSim, a simulation program with integrated circuit emphasis for superconducting designs [4]. In our design, the parameters are L1 ≈ 6 pH, L2 ≈ 20 pH, L3 ≈ 4 pH, and critical currents J1 ≈ 115 µA, J2 ≈ 111 µA, J3 ≈ 80 µA.
Since the HC-DRO cell can hold 0 to 3 pulses, it can represent 2 bits of information (00 to 11). To read all the pulses stored in the HC-DRO cell, we need to send three consecutive pulses to input CLK. The number of output pulses released under these three clock pulses corresponds to the stored value 00, 01, 10, or 11. After each read, the fluxons are released. Hence, the read operation of the HC-DRO cell is also destructive, the same as for the DRO cell.
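The read protocol above can be mimicked with a small behavioral model (the class and method names are ours, purely illustrative, not the cell's netlist):

```python
class HCDRO:
    """Behavioral model of a 2-bit HC-DRO cell (names are illustrative)."""

    def __init__(self):
        self.fluxons = 0  # 0..3 stored fluxons, encoding the values 00..11

    def write_pulse(self):
        # Each incoming SFQ pulse adds one fluxon, up to the 3-fluxon capacity.
        if self.fluxons < 3:
            self.fluxons += 1

    def read(self):
        # Three consecutive CLK pulses; each releases one stored fluxon (if
        # any) as an output pulse. Afterwards the cell is empty (destructive).
        out = []
        for _ in range(3):
            out.append(1 if self.fluxons > 0 else 0)
            self.fluxons = max(0, self.fluxons - 1)
        return out

cell = HCDRO()
cell.write_pulse()
cell.write_pulse()          # store the value 2 (two fluxons)
pulses = cell.read()        # two output pulses, then nothing: [1, 1, 0]
```

A second read of the same cell returns no pulses at all, which is exactly why the predictor storage later needs a restore path.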

NDRO Memory Cell
In SFQ technology, another important memory cell is the Non-Destructive ReadOut (NDRO) cell [31]. After a read operation, the fluxon is still stored in the NDRO cell; hence, an NDRO cell can be read as many times as needed. The schematic of an NDRO cell is shown in Figure 1(c). The read operation is non-destructive. However, an NDRO cell costs 11 JJs. To store 2 bits, NDRO needs a total of 22 JJs, while a single HC-DRO cell needs only 3 JJs. Hence, HC-DRO cells have a 7.3× density advantage. The total number of JJs is a critical limitation of SFQ fabrication; hence, reducing JJ count is the key metric for improvement in SFQ CPUs.

Splitters and Mergers
Fan-out is expensive in SFQ logic, as a single pulse can only drive one gate. An SFQ pulse needs to be duplicated to drive two SFQ gates. A splitter [24] is used for this purpose. The splitter cell is shown in Figure 1(d). When an SFQ pulse arrives at input A, a pulse is generated at outputs B and C.
Driving the same pin with two SFQ pulses needs a merger cell [24] (see Figure 1(e)). If the merger cell receives a pulse on either of its inputs A or B, it passes the pulse to its output C. If the two pulses are too close to each other, only one output pulse is generated.

Path Balancing
As described before, each SFQ logic gate is controlled by a clock. To ensure a correct result, all signals at the input of a gate must be synchronized: the pipeline depth should be the same from any primary input to any primary output. We call this synchronizing procedure Path Balancing [23]. The most common way to balance paths is to add DFFs (DRO cells) to the paths with shorter pipeline depths. Figure 2 shows an example computing Y=A•B•C. While the CMOS design in Figure 2(a) faces no timing challenges, adopting the same design in SFQ will cause failures. Assume A, B, and C all have a logic "1" pulse in the same cycle. In the first cycle, AND1 receives its two "1" inputs and produces A•B one cycle later. However, AND2 has one input as "1" (C) and one input as "0" (A•B, not yet available), so it generates a "0" output. In the second cycle, AND2 again has one input as "1" (A•B) and one input as "0" (C, whose pulse has already passed). As a result, AND2 cannot generate the correct output "1". To solve this issue, a DFF is added to the shorter path to balance the paths, as shown in Figure 2(b). The input pulse C is delayed by one cycle and arrives at AND2 in the same cycle as the output pulse of AND1. As a result, AND2 generates the correct output "1" in the third cycle.
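The failure and its DFF fix can be checked with a toy cycle-level model (a simplified sketch with our own cycle accounting, not a circuit simulation):

```python
def simulate(insert_dff):
    """Toy cycle model of Y = A AND B AND C with gate-clocked SFQ ANDs.
    A, B, and C each fire a single '1' pulse in cycle 0."""
    A = B = C = [1, 0, 0]
    ab = 0           # AND1 output, available one cycle after its inputs
    c_delayed = 0    # optional path-balancing DFF on the C input of AND2
    y = []
    for cyc in range(3):
        c_in = c_delayed if insert_dff else C[cyc]
        y.append(ab & c_in)          # AND2 consumes this cycle's inputs
        ab = A[cyc] & B[cyc]         # AND1 result arrives next cycle
        c_delayed = C[cyc]           # the DFF delays C by exactly one cycle
    return y

# Unbalanced, AND2 never sees A*B and C in the same cycle, so Y stays 0;
# balanced, the delayed C pulse meets AND1's output and Y fires exactly once.
unbalanced = simulate(False)   # [0, 0, 0]
balanced = simulate(True)      # a single '1' appears in the pulse train
```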

Recap of Perceptron Branch Predictor
The perceptron branch predictor uses a single perceptron weight table to predict branch outcomes [14], as shown in Figure 3.

In the perceptron weight table, there are N rows. Each row holds (n + 1) signed integer weights (w0, w1, ..., wn), where n is the length of the branch history. To predict a branch, we choose a row of weights based on the branch PC address and perform an element-wise weighted summation with the branch history, as shown in the equation below. The branch history is represented as (x1, x2, ..., xn), where xi is -1 when the branch associated with that history bit was not taken and +1 when it was taken. The prediction result y is computed as

y = w0 + Σ_{i=1..n} wi · xi    (1)

When y ≥ 0, the branch is predicted as taken, and when y < 0, the branch is predicted as not taken.

Later, when the branch is resolved and the direction t is known, t is set to -1 for a not-taken branch and +1 for a taken branch. The branch outcome t is used to train the perceptron and update the weights.
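The inference and training rules above can be summarized in a short Python sketch (function names are ours; weight saturation, handled later in the SFQ design, is omitted here):

```python
def predict(weights, history):
    """Perceptron inference: weights = [w0, w1, ..., wn] (w0 is the bias),
    history = [x1, ..., xn] with xi in {-1, +1}. Predict taken iff y >= 0."""
    y = weights[0] + sum(w * x for w, x in zip(weights[1:], history))
    return y, y >= 0

def train(weights, history, taken, y, theta):
    """On a resolved branch, t = +1 (taken) or -1 (not taken). Train when
    the prediction was wrong or |y| did not exceed the threshold theta."""
    t = 1 if taken else -1
    if (y >= 0) != taken or abs(y) <= theta:
        weights[0] += t
        for i, x in enumerate(history):
            weights[i + 1] += t * x   # wi = wi + t*xi

history = [1, -1, 1]                  # last three outcomes as +/-1
weights = [0, 0, 0, 0]
y, taken = predict(weights, history)  # all-zero weights give y = 0, "taken"
train(weights, history, True, y, theta=4)
```

After one taken-branch update, predicting with the same history yields y = 4, since every weight now agrees with its history bit.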

Recap of Hashed Perceptron Branch Predictor
The perceptron branch predictor uses only one weight table, so its performance may suffer from aliasing. To improve performance further, the hashed perceptron branch predictor was proposed [42].
Instead of using a single weight table, the hashed perceptron predictor has multiple weight tables, each indexed with a different hash function, as shown in Figure 4. Since it utilizes varied path information through multiple weight tables, the hashed perceptron predictor (and other variants) shows performance improvements [13, 33]. In this work, we design both a perceptron and a hashed perceptron BP to evaluate their performance under a limited JJ budget.
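A minimal sketch of the multi-table scheme follows; the hash functions here are illustrative stand-ins (actual index functions vary across hashed perceptron implementations), and per-table saturation is omitted:

```python
def hashed_predict(tables, pc, ghr):
    """Hashed-perceptron sketch: each table is indexed by a different hash
    of the PC and a different-length slice of the global history register
    (ghr, packed into an int). Hash functions are illustrative stand-ins."""
    y, idxs = 0, []
    for t, table in enumerate(tables):
        hist = ghr & ((1 << (4 * (t + 1))) - 1)      # growing history lengths
        idx = (pc ^ hist ^ (pc >> (t + 1))) % len(table)
        idxs.append(idx)
        y += table[idx]                              # sum the selected weights
    return y, y >= 0, idxs

def hashed_train(tables, idxs, taken, y, theta=8):
    t_val = 1 if taken else -1
    if (y >= 0) != taken or abs(y) <= theta:
        for t, idx in enumerate(idxs):
            tables[t][idx] += t_val                  # one counter per table

tables = [[0] * 16 for _ in range(4)]                # four small weight tables
y, taken, idxs = hashed_predict(tables, 0x40, 0b1011)
hashed_train(tables, idxs, False, y)                 # branch resolved not-taken
```

Note that every one of the four tables needs its own index computation; in SFQ, each of those turns into a separate JJ-hungry decoder, which is the crux of the JJ-budget argument later in the paper.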

RELATED WORK
There is a plethora of branch predictor designs for CMOS CPUs [22, 25, 45]. In this work, we focus on the perceptron and hashed perceptron branch predictors [14, 42]. The primary goal of this work is to design predictors using HC-DRO cells without the need for encoding and decoding circuits. Thus, our predictor design operates natively on the HC-DRO cell storage, significantly reducing JJ counts. Prior work [46] demonstrated the use of HC-DRO cells in the SFQ register file design context. That work used HC-DRO cells purely to store data in a denser format in an SFQ register file. In their design, a decoder circuit must follow each register read to translate the multi-fluxon pulses into 2-bit values. Similarly, each register write is preceded by an encoder circuit that translates a 2-bit value into an equivalent number of pulses to store. In this paper, we treat the HC-DRO data as a native data type where all the pulses stored in an HC-DRO cell are processed simultaneously. We thus eliminate the need for encoding and decoding the data, making the branch predictor more latency- and JJ-efficient.
In [19], the authors describe a high-level overview of a 2-bit branch predictor [34] built with 2-bit fluxon storage. Their work describes the operation of the 2-bit branch predictor. Their design uses two special ports called TAKEN and NOT_TAKEN. They describe the 2-bit update mechanism as sending a pulse to either of these ports depending on the branch outcome and then incrementing/decrementing the 2-bit value. The description provided focuses primarily on CMOS-style functionality while ignoring the critical limitations of SFQ designs. They do not address key challenges in operating the saturating 2-bit counter. For instance, during the training phase, the 2-bit prediction counter has to be decremented when the branch is not taken. However, decrementing an HC-DRO cell releases one fluxon stored in the cell. Unlike in CMOS, a stray pulse must be properly dissipated in SFQ; otherwise, such pulses may move through the circuit, wreaking havoc on the operation. In particular, the stray pulse may appear as a "taken branch" prediction to the fetch controller. This is just one specific example of how a functional description of a predictor is inadequate for making the predictor work in the SFQ regime.
Our work targets the perceptron and hashed perceptron predictors instead of the 2-bit branch predictor. We aim to design a fully operational perceptron predictor, including accurately handling multiple pulses in an HC-DRO cell without encoding/decoding and designing efficient predictor update circuits that operate directly on the HC-DRO pulses. As such, this paper presents a detailed circuit-level design and implementation of the predictor using HC-DRO cells. It proposes to use HC-DRO cells not just for storage but to treat the 2-bit value as a native data representation within the microarchitecture. This work also proposes non-intuitive data representations and circuit implementations that work on a sign+magnitude representation of the predictor storage values. We also provide detailed circuit- and architecture-level evaluations to help guide the microarchitectural progression of SFQ branch predictor designs.

NDRO BASELINE BRANCH PREDICTOR
While designing a perceptron predictor in CMOS circuits is well studied, its design in the SFQ logic family presents unique challenges. This section describes our baseline NDRO-based perceptron branch predictor design. Our goal here is to design the inference and training units considering the constraints of SFQ logic. Hence, the focus is on highlighting how an NDRO branch predictor can be designed: specifically, the need for a reset port, the need for path balancing, and the need for concurrent overflow detection, which is uniquely required for an SFQ-based perceptron. We also describe the hashed perceptron design at the end of this section. Figure 5 shows the NDRO perceptron design, which includes three main components: perceptron weight storage, training, and inference units.

Perceptron Weight Storage Design
Each rectangular box in the perceptron weight storage consists of a row of NDRO cells storing one perceptron weight entry. Each row has 3(n + 1) bits, where n is the length of the global branch history. We use 3 bits (one sign bit and two weight-value bits) to represent each weight. Keeping the weight magnitude a multiple of 2 bits allows an easy comparison against the HC-DRO design, since each HC-DRO cell stores 2 bits.
There are three ports in the weight storage: a read, a reset, and a write port. Each port takes an enable signal and the hashed branch address as input. To read, reset, or write the target weight entry, a demultiplexer (DEMUX) is necessary to decode the address into a one-hot representation. In SFQ technology, the most common way to design a DEMUX is with Non-Destructive ReadOut cells with Complementary outputs (NDROC), as proposed in prior work [3, 37]. A DEMUX designed with NDROC requires 60% of the JJs of a CMOS-style design built from NOT and AND gates [44].
Figure 6 shows the NDROC DEMUX tree design; each box represents an NDROC cell. After setting the NDROCs with the hashed branch address S, the enable signal IN travels through the clock pins and arrives at one of the outputs of the bottom-level NDROC cells as a one-hot representation. Notice that the clock pins are re-purposed in this design to pass the enable signal for generating the one-hot address. Hence, clock distribution is eliminated in the NDROC-based DEMUX compared to an AND-based DEMUX. Even with these optimizations, it is important to note that a read/write port is JJ intensive. To access a predictor table of N entries, one needs a 1-to-N DEMUX tree. Such a tree uses (N-1) NDROC cells and up to 2N splitters, consuming a significant number of JJs. Hence, designs that require fewer access ports may be beneficial under stringent JJ limits.
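The port cost can be tallied with a one-line model of the tree structure described above ((N-1) NDROC cells plus up to 2N splitters). The default per-cell JJ counts below are assumed placeholders, not figures from this paper; substitute your cell library's actual numbers:

```python
def demux_tree_jj(n_entries, jj_per_ndroc=13, jj_per_splitter=3):
    """JJ cost of a 1-to-N NDROC DEMUX tree: (N-1) NDROC cells and up to
    2N splitters. The per-cell JJ counts are ASSUMED placeholder values."""
    return (n_entries - 1) * jj_per_ndroc + 2 * n_entries * jj_per_splitter

# The cost grows linearly with table size, and every additional access
# port on an N-entry table pays it again, which is why port-lean designs
# matter under a tight JJ budget.
```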
The write port is designed using Dynamic AND (DAND) gates [30] to avoid clock distribution, as proposed in [19]. Unlike a CMOS storage design, a reset port is necessary for this SFQ weight storage to work properly. A CMOS flip-flop storing a "1" can be overwritten with a "0", but an NDRO cell cannot be overwritten. To overwrite a "1" with a "0", we must reset the NDRO cell using the reset pin. Hence, a reset port is necessary for a correct write operation: before overwriting a weight entry with updated weights, we must first reset the target entry. After being read from the weight storage, the weights are sent to either the inference unit or the training unit through a row of NDROC 1-to-2 DEMUXes.

Training Unit Design
As described in Algorithm 1, the training unit is responsible for updating the weights. The weights are updated using the equation wi = wi + t·xi. Since t and xi are both -1 or +1, each weight wi only needs to be incremented or decremented by one, using a saturating counter design. Based on this saturation requirement, we designed an NDRO-based SFQ circuit as shown in Figure 8. The circuit consists of an increment/decrement circuit and an independent overflow detection circuit. Both are designed in behavioral Verilog and synthesized with the qPalace tool [7], an open-source SFQ synthesis tool.

The increment/decrement circuit has two inputs: the weight read from the weight storage, wi, and the control signal INC. When the update would overflow the representable weight range, SEL will be 0 and its complement SEL' will be 1; otherwise, SEL will be 1 and SEL' will be 0. We use SEL and SEL' with two AND gates to select the correct new weight. Since SEL and SEL' cannot both be 1 simultaneously, we can safely combine the outputs of the two AND gates with an SFQ merger circuit.

Notice that we detect overflow concurrently with the computation rather than after it, to give the training unit the shortest possible delay. Path-balancing DRO cells for xi are also added to match the delay of the increment/decrement circuit. After the updated weight w_new is computed, the value is written back to the weight storage through the write port, with a reset operation prior to the write. In total, this design needs (n + 1) training units, where n is the length of the global branch history.

Inference Unit Design
According to Equation 1, the inference unit needs to multiply wi and xi, then accumulate. Figure 9(a) shows an NDRO-based inference unit design for n = 7, where n is the length of the global branch history. There is one sign extension circuit and seven multiplication circuits. When xi = -1, the multiplication circuit generates -wi using the 2's complement formula; otherwise, the circuit only performs a sign extension. Although wi has only three bits, when wi = -4 and xi = -1, the product wi·xi = 4 needs 4 bits to represent. That is why the design performs a sign extension for w0, and the outputs of the multiply circuits are 4 bits wide. After computing the products wi·xi, the accumulation is performed. The design uses an adder tree built from Kogge-Stone adders. The adder tree has log2(n + 1) levels, which is 3 here; L1 to L3 in Figure 9(a) form the adder tree. Each level's adders are one bit wider than the previous level's to avoid overflow. Hence, y here has 7 bits. The sign bit serves as the branch prediction result, 0 for taken and 1 for not taken. However, we still need the rest of y: after the branch result comes out, even if the branch prediction is correct, we still need to compare y with θ (= 4) to determine whether we need to update the weights, according to Algorithm 1.
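The level count and operand widths of the adder tree follow directly from the description above (log2(n+1) levels, one extra bit per level); a small helper makes the arithmetic explicit (function name is ours):

```python
import math

def adder_tree_widths(n, product_bits):
    """Levels and per-level operand widths of the inference adder tree:
    log2(n+1) levels, each level one bit wider than the last to absorb
    the carry of its additions."""
    levels = math.ceil(math.log2(n + 1))
    widths = [product_bits + k for k in range(1, levels + 1)]
    return levels, widths

# n = 7 history bits with 4-bit products w_i*x_i -> 3 levels (L1..L3)
# and a 7-bit y, matching the tree in Figure 9(a).
```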

Optimization: Using a 3-bit Adder for Efficient Inference
In the inference unit, sign extension is performed because when wi = -4 and xi = -1, the product wi·xi needs 4 bits to represent; in all other situations, 3 bits suffice. The extra bit leads to a much larger adder-tree design. If we limit wi·xi to 3 bits, the total JJ cost of the Kogge-Stone adder tree is reduced by around 30%. To enable this limit, the design restricts the range of wi to -3 to +3. With the new wi range, wi·xi can only be -3 to +3, which needs only 3 bits to represent. This modification requires an updated training unit with a modified overflow circuit: when wi = +3 and INC = 1, or wi = -3 and INC = 0, SEL will be 0 and SEL' will be 1, as shown in Figure 8. In the inference unit, there is no longer a sign extension function at level 1; instead, we add a path-balancing DRO for w0. Each adder in the adder tree has one less bit, as shown in Figure 9(b).
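The saturating update with its concurrent overflow select can be sketched as follows (a behavioral model of the SEL/SEL' selection described above, not the SFQ netlist):

```python
def train_step(w, inc, wmax=3):
    """Saturating +/-1 weight update with concurrent overflow detection:
    SEL = 0 (and its complement SEL' = 1) when the update would leave
    [-wmax, +wmax], so the old weight passes through unchanged."""
    overflow = (inc and w == wmax) or (not inc and w == -wmax)
    sel = 0 if overflow else 1
    new_w = w + 1 if inc else w - 1
    # Exactly one of the two AND-gate outputs reaches the merger.
    return new_w if sel else w
```

With wmax=3 this models the optimized [-3, +3] range; the baseline design differs only in its saturation points.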

Hashed Perceptron Design
The weight table structure is one main difference between the perceptron and hashed perceptron BP designs. In the hashed perceptron BP design, each row of a weight table has only 3 bits. A typical design has 4 or 8 weight tables, each with its own training unit. The inference unit must be connected to all weight tables. For some hashed perceptron algorithms, the branch history xi may not be used during training and inference, but this does not fundamentally affect the training and inference unit design. For example, in Figure 9(b), we can remove the top-level circuits of the optimized inference unit and directly connect the weights to the L1 adders. Importantly, the hashed perceptron design requires multiple decoder circuits, one to access each weight table, and each decoder needs its own NDROC-based DEMUX. While these independent DEMUX circuits scale easily, each consumes a significant number of JJs.

SUPERBP: HC-DRO PERCEPTRON BRANCH PREDICTOR
While NDRO cells have the desirable property of retaining the weights after each access, NDRO designs consume a significant number of JJs compared to DRO cells. This section presents the design details of SuperBP, a predictor design that uses HC-DRO cells to reduce the JJ count.
Prior work used HC-DRO cells [46] to increase register file density. In their design, they treated the HC-DRO cells only as normal storage cells and used a decoder circuit that expands the 2-bit stored information into up to 3 consecutive SFQ pulses, which are then fed to the execution units for normal operation. Similarly, while writing data into HC-DRO cells, they used an encoder to encode up to 3 pulses into a 2-bit value. In their design, the decoder and encoder circuits were placed on the critical path of read and write operations, respectively. SuperBP takes an integrated storage and computation approach to branch predictor design. It proposes a novel approach to perform the perceptron predictor computations directly on the HC-DRO storage cells without needing to decode the data. For simplicity of presentation, we first discuss the design in the context of the original perceptron predictor. At the end of this section, we describe the changes needed for the hashed perceptron variant of SuperBP.
The SuperBP design is shown in Figure 10. Like the NDRO design, SuperBP consists of three main parts: perceptron weight storage, a training unit, and an inference unit. Our training and inference unit designs operate directly on the HC-DRO cells, eliminating the need for extra encode and decode circuits.

Perceptron Weight Storage Design
The HC-DRO perceptron weight storage design replaces the NDRO cells with HC-DRO and DRO cells to improve JJ usage. Each row has 3(n + 1) bits, where n is the length of the global branch history. Since an HC-DRO cell can store 2 bits of information, we use one DRO cell to store the sign bit and one HC-DRO cell to store the 2-bit weight value. Each entry thus has n + 1 DRO cells and n + 1 HC-DRO cells. However, using DRO and HC-DRO cells as storage leads to several issues.
The first issue is that HC-DRO cells need three consecutive clock pulses to be fully read out. Since there are at most 3 magnetic pulses in a single HC-DRO cell, we may need 3 consecutive clock pulses to push the pulses out. The generation of 3 consecutive clock pulses is done by the HC-CLK design shown in Figure 11, which duplicates the read enable (REN) pulse into three copies. This HC-CLK design takes a single input clock pulse and moves it through 3 different paths to generate 3 equally spaced clock pulses. The HC-CLK design uses splitters (labeled S in the figure), mergers (M), and JTLs (J), which do not require any clock distribution. JTL stands for Josephson transmission line.

The second issue is that DRO and HC-DRO cells cannot keep their data after being read. Hence, after each inference request, the row of weights that was read would be lost, which is a substantial hurdle to retraining the weights. To counter this challenge, we adapted the LoopBuffer design from prior work [46], where a LoopBuffer was attached to an HC-DRO register file. In that design, the LoopBuffer was a single-entry NDRO register storage that read a register and then recycled the content after the data was sent to the execution units. Thus, by backing up a large HC-DRO register file with a single NDRO register, the overall design enabled the register file to be read multiple times while preserving the HC-DRO cells' density advantage.
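The destructive three-clock readout can be captured by a small behavioral model. This is a sketch under our own naming, not the cell-level circuit:

```python
def read_hc_dro(stored_fluxons, read_clocks=3):
    """Behavioral model of reading an HC-DRO cell: each of the three
    duplicated clock pulses releases at most one stored fluxon, so a
    cell holding k fluxons (0..3) emits k pulses and is left empty."""
    pulses = []
    remaining = stored_fluxons
    for _ in range(read_clocks):
        if remaining > 0:
            pulses.append(1)       # this clock pushes one fluxon out
            remaining -= 1
        else:
            pulses.append(0)       # cell already empty this cycle
    return pulses, remaining       # readout is destructive: remaining == 0
```

The emptied cell after readout is exactly why the LoopBuffer write-back path described next is needed.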

Inference LoopBuffer
In SuperBP, we modified the basic LoopBuffer design to match the branch predictor functionality. In particular, we have an inference LoopBuffer and a training LoopBuffer, as shown in Figure 10, and each has a different purpose. When a weight is read for inference, it needs to be preserved so it can be written back to the weight storage. Hence, the Inference LoopBuffer is first set to "1", and the Training LoopBuffer is reset to "0". Then, the weight storage is read through the read port. Once the pulses are read from the HC-DRO cells, they arrive at the clock pins of both LoopBuffers. Since the Training LoopBuffer is set to "0", it does not generate any output. The readout pulses pass through the Inference LoopBuffer since it is set to "1". These pulses are then duplicated into two copies with splitters. One copy arrives at the input of the inference unit and produces the prediction result.
As discussed in the next section, our inference unit is designed to operate directly on multiple pulses without decoding the HC-DRO data into most and least significant bits. This optimization removes the need for an intervening decoding circuit on the inference critical path. The other copy of the weight goes through the write-back path and is written back into the original weight entry. This is how we preserve the data after reading from the DRO and HC-DRO cells for inference.

Training LoopBuffer
The second LoopBuffer, the training LoopBuffer, serves a very different purpose: it acts as a custom reset port for the branch predictor. Recall that in SFQ technology, a write operation cannot erase an existing magnetic flux. Hence, writing a "0" over an existing "1" is impossible unless the existing value is reset first. The training LoopBuffer provides very efficient reset-port functionality without needing an expensive port design. In a branch predictor, the weight is updated only during training. When updating the weight during the training cycle, the inference LoopBuffer is reset to "0", and the training LoopBuffer is set to "1". Then, the weight storage is read. Since only the training LoopBuffer is "1", the pulses arrive at the training unit. As discussed in the next section, our training unit design operates on multiple pulse inputs rather than decoding the pulses first. This optimization lets us feed multiple pulses from the weight storage directly to the training unit. The training unit then produces the new weight. Notice that this training LoopBuffer does not have the "loopback" path. Instead, we send the updated weight back to the weight storage directly from the training circuit. Thus, the training LoopBuffer utilizes the destructive readout property to avoid resetting the weight storage entry before writing into it.
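The two access modes can be summarized by a behavioral model of one weight row. This is a sketch with illustrative names, abstracting the LoopBuffers' pulse steering into two methods:

```python
class WeightRow:
    """Behavioral sketch of one weight row served by the two LoopBuffers.
    A read is destructive. The inference path restores the row via the
    loopback write, while the training path exploits the fact that the
    row is already empty after readout, so the new weight can be written
    without an explicit reset."""

    def __init__(self, weight):
        self.weight = weight

    def read_for_inference(self):
        w = self.weight
        self.weight = None       # destructive readout empties the cells
        self.weight = w          # loopback write-back restores the row
        return w

    def read_for_training(self, new_weight):
        self.weight = None       # destructive readout = implicit reset
        self.weight = new_weight # training unit writes the updated weight
        return self.weight
```

An inference read leaves the row unchanged, while a training read replaces the weight in one pass with no separate reset port.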

Eliminating the Decoder and Encoder Circuits
Next, we tackle the design of the training and inference units. Recall that the HC-DRO cells store 2 bits of information as multiple pulses. Hence, to enable traditional logic operations, prior works used extra encode and decode circuits [19] to transform 0-3 pulses into 2-bit values and vice versa. However, SuperBP uses a novel design that eliminates the cost of data decoding and encoding. As we describe next, our training and inference units can operate on the pulses read from the HC-DRO cells without encoding and decoding.

Training Unit Design
HC-DRO cells store at most three fluxon pulses, which can be treated as two bits of information. However, in a branch predictor design, each weight is only ever incremented by one or decremented by one. In other words, the weight update process does not write arbitrarily different values; it only adds or removes a single pulse.
To take advantage of this special property, we use a signed-magnitude representation of the weight (as opposed to a 2's complement representation). When the sign bit is "0", an HC-DRO cell storing zero, one, two, or three pulses represents the weight +0, +1, +2, or +3. When the sign bit is "1", an HC-DRO cell storing zero, one, two, or three pulses represents the weight -0, -1, -2, or -3. The range is -3 to +3, the same as our optimized NDRO design's weight range. Notice that both +0 and -0 mean the weight is 0. We can see that the HC-DRO cells here store the absolute value of w_i. Even though signed-magnitude representation is less efficient, it allows for an efficient training process.

Increment the weight's absolute value: When SIGN=0 and INC=1 (add one to a positive number) or SIGN=1 and INC=0 (subtract one from a negative number), we need to add one to the weight's absolute value. The XOR gate XOR1 will generate an extra pulse and send it to the weight absolute value path. We call this pulse the "G-pulse" for short. The G-pulse becomes part of the weight's absolute value, and we successfully add one to the weight's absolute value. When the weight's absolute value is 3, H1 will release three pulses in three consecutive cycles. The G-pulse arrives at the weight's absolute value path simultaneously with the third pulse released from H1. The OR gate on this path merges these two pulses to prevent overflow.
Decrement the weight's absolute value: When SIGN=0 and DEC=1 (subtract one from a positive number) or SIGN=1 and DEC=0 (add one to a negative number), we need to subtract one from the weight's absolute value. The XOR gate XOR2 will generate a pulse to destroy one pulse from the weight absolute value path. We call this pulse the "D-pulse" for short. Whether the weight's absolute value is 1, 2, or 3, H1 will always release one pulse in the very first cycle. This first pulse arrives at the XOR gate XOR3 simultaneously with the destroy pulse. XOR3 then destroys the first pulse, and we have successfully subtracted one from the weight's absolute value.
Encoding +0 and -0: When decrementing the weight's absolute value, if the value is 0, we are computing +0 − 1 = −1 or −0 + 1 = +1, so we need to change the absolute value to 1 and flip the sign bit. Since the weight's absolute value is 0, H1 will not release anything. The D-pulse will generate a pulse at the output of XOR3, meaning the weight's absolute value becomes 1. Meanwhile, since H1 did not release anything, the NOT gate will generate a pulse. This pulse arrives at the AND gate together with the D-pulse. We can use the AND gate output to flip the sign bit at XOR gate XOR4. As a result, +0 − 1 becomes -1, and −0 + 1 becomes +1.
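The three cases above can be summarized by a behavioral model of the training rule. This is a sketch of the update semantics, not the pulse circuit; the comments map each branch back to the gate behavior described above:

```python
def train_update(sign, magnitude, inc):
    """Behavioral sketch of the SuperBP training rule on a
    signed-magnitude weight.

    sign: 0 for +, 1 for -; magnitude: 0..3 (HC-DRO fluxon count);
    inc: True to add one to the weight, False to subtract one.
    """
    grow = (sign == 0) == inc          # magnitude grows when the update
                                       # pushes the weight away from zero
    if grow:
        magnitude = min(magnitude + 1, 3)  # OR gate merges G-pulse at +/-3
    elif magnitude == 0:
        magnitude, sign = 1, 1 - sign      # +0 - 1 -> -1, -0 + 1 -> +1
    else:
        magnitude -= 1                     # D-pulse destroys one pulse
    return sign, magnitude
```

The zero-crossing branch reproduces the NOT/AND/XOR4 sign flip, and saturation at magnitude 3 reproduces the overflow-merging OR gate.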

Inference Unit Design
Similar to the NDRO inference unit design, we need to compute each product w_i x_i before adding the products. Instead of decoding the data read from the HC-DRO cells, we operate on it directly. Figure 13 shows the inference unit design when n = 7, where n is the length of the global branch history.
Multiplication and 2's complement: We need to represent the numbers in 2's complement to add positive and negative numbers. However, we stored the sign and absolute value of w_i in the weight storage, so we need to translate them first. With careful design, we merge the multiplication w_i x_i and the 2's complement translation into the same circuit. We use the label M&T to represent this circuit in Figure 13. Figure 14 shows this design in detail. VALUE is the absolute value of the readout weight, SIGN is the sign bit of the readout weight, and SIGN_X is the sign of x_i ("0" is + and "1" is -).
This circuit is divided into two parts. The upper part computes the result: SIGN_OUT will be the sign bit of w_i x_i, and VALUE_OUT will be the corresponding lower 2 bits of the 2's complement of w_i x_i in serial-pulse form. For example, if w_i x_i = −3, the 2's complement of -3 is 101, so SIGN_OUT will be 1, and VALUE_OUT will carry one pulse over three cycles. If w_i x_i = +2, the 2's complement of +2 is 010, so SIGN_OUT will be 0, and VALUE_OUT will carry two pulses over three cycles. The VALUE_OUT pulses may not be consecutive, but this can be tolerated by our serial adder design described in the next part. The lower part is used to generate a correct form of 0. When the input is -0 or +0, the upper circuit cannot generate a correct 0 in 2's complement. So, if the NDRO in the lower part does not detect any pulses from the absolute value input, it will not generate any pulse. Hence, the AND gate at the output can generate a correct 0, which is no pulses. The two NDRO cells in this design need to be reset after each computation.
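The input/output behavior of the M&T circuit can be modeled compactly: the VALUE_OUT pulse count over three cycles equals the lower two 2's-complement bits of the product, which is the product modulo 4. This is a behavioral sketch with illustrative names, not the gate-level design:

```python
def multiply_and_translate(sign, magnitude, sign_x):
    """Sketch of the M&T circuit: multiply a signed-magnitude weight
    (sign, magnitude) by x (sign_x: 0 for +1, 1 for -1) and emit the
    product as (SIGN_OUT, pulse count), where the pulse count equals
    the lower two 2's-complement bits of the product (0..3 pulses)."""
    w = -magnitude if sign else magnitude
    p = w * (-1 if sign_x else 1)
    sign_out = 1 if p < 0 else 0
    # Python's % returns a non-negative result for a positive modulus,
    # so p % 4 is exactly the lower two 2's-complement bits of p.
    value_out_pulses = p % 4
    return sign_out, value_out_pulses
```

For w·x = -3 this yields (1, 1), matching 101 from the text, and for +2 it yields (0, 2), matching 010; a ±0 input yields (0, 0), the correct all-zero encoding produced by the lower part of the circuit.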
HC-DRO serial adder design: We designed a new adder that can directly add two numbers read from the HC-DRO cells. The design is shown in Figure 15(b). This design is based on a one-bit counter [27], represented as COUNT in the figure. The state machine of the counter is shown in Figure 15(a). Since the number of pulses read from an HC-DRO cell represents the number stored in that cell, counting the total number of pulses read from two HC-DRO cells yields the sum of the two stored numbers.
Assume HC-DRO A holds two fluxons and HC-DRO B holds three fluxons. We read only one fluxon per cycle. In the first cycle, both A and B release a pulse. We connect an AND gate directly to the higher-bit counter H to add two to the count. After the first cycle, the counter result is 010. In the second cycle, both A and B again release a pulse. H receives another pulse, flips to 0, and generates a carry pulse. We store this carry pulse in the DRO cell temporarily. After the second cycle, the counter result is 100. In the third cycle, only B releases a pulse. We connect an XOR gate directly to the lower-bit counter L to add one to the count. After the third cycle, the counter result is 101. After releasing and counting all pulses, we can read the result from the DRO and the two counters as parallel pulses: 101. Using this counter design, we complete the decoding and adding simultaneously. Since we count the total number of pulses, the adder operates correctly even for non-consecutive input pulses.
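The counting behavior can be sketched as a cycle-by-cycle model. This is a behavioral abstraction of the counter state machine, with the carry DRO modeled as a sticky bit:

```python
def serial_add(pulses_a, pulses_b):
    """Behavioral sketch of the HC-DRO serial adder: per cycle, two
    simultaneous pulses drive the high counter H via the AND gate
    (add 2), a single pulse toggles the low counter L via the XOR gate
    (add 1), and a carry out of H is parked in a DRO cell. Counting
    all pulses yields the 3-bit parallel sum (carry, H, L)."""
    low = high = carry_dro = 0
    for a, b in zip(pulses_a, pulses_b):
        if a and b:                    # AND gate: +2 goes straight to H
            if high == 1:
                carry_dro = 1          # H flips 1 -> 0 and emits a carry
            high ^= 1
        elif a or b:                   # XOR gate: +1 toggles L
            if low == 1:               # L flips 1 -> 0 and carries into H
                if high == 1:
                    carry_dro = 1
                high ^= 1
            low ^= 1
    return (carry_dro << 2) | (high << 1) | low
```

Running the worked example from the text, `serial_add([1, 1, 0], [1, 1, 1])` reproduces 101 (decimal 5), and non-consecutive pulse patterns with the same totals give the same sum.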
Adder tree design: Figure 13 shows the adder tree for the summation. The first level of the adder tree is the proposed serial adder. Since the inputs are 3 bits and the output is 4 bits, we need four counters here. The design is shown in Figure 16. Notice that the fourth counter is connected to an OR gate instead of an AND gate. We use this OR gate to perform sign extension so that the 4-bit output has a correct sign bit. After the first-level computation, all the numbers are in the form of parallel pulses. We can use the same Kogge-Stone adder tree as the NDRO design for the rest of the summation.

Hashed Perceptron Design
Similar to the NDRO baseline, the weight table structure is the main difference between the perceptron and hashed perceptron BP designs. In the hashed perceptron BP design, each row of the weight tables has only 3 bits (one DRO cell and one HC-DRO cell). The typical design has 4 or 8 weight tables. Each table has its own training unit. The inference unit needs to be connected to all weight tables. For some hashed perceptron algorithms, the branch history bits x_i may not be used during training and inference, but this does not fundamentally affect the training and inference unit design. For example, in Figure 13, we can remove the top level of multiplication and 2's complement translation circuits in the optimized inference unit and directly connect the weights to the L1 adders. However, our 3-bit HC-DRO serial adder design remains intact. The main challenge is building separate indexing circuits to access a given weight table entry. Each indexing circuit needs a DEMUX design, which consumes additional JJs.

EVALUATION METHODOLOGY
JJ Count: Some of the largest demonstrated SFQ chips currently have about 72K JJs [9]. In the near future, integrating a million JJs on a chip is expected to be feasible [32]. In our evaluations, we decided to vary the JJ budget allocated to the branch predictor to be under 10% of the total chip budget. Thus, we evaluated both perceptron and hashed perceptron SuperBP under the same JJ budget, ranging from 30K JJs to 90K JJs.

Hashed Perceptron Choice: The hashed indexing functions of the hashed perceptron branch predictor have been shown to affect the overall performance. So we chose five different indexing functions from [42]. We compared the performance of hashed perceptron with these varying hash functions against the traditional perceptron branch predictor. The hashed indexing functions follow the form (hist_hash[i] ⊕ pc_hash) mod #rows, where i denotes the i-th weight table and hist_hash[0] = pc_hash.

Performance Simulator: We built an SFQ-based gate-level pipelined CPU simulator to analyze the end-to-end application time of different SuperBP designs. The ISA we chose is RISC-V 32I. The simulator is based on the RISC-V ISA Simulator Spike [28] and written in C++. Our simulator takes the operation trace as input to simulate the overall performance of a gate-level pipelined in-order core. To obtain the depth of each gate-level pipeline stage, we synthesized an open-source 32-bit in-order CPU [43] with the qPalace tool [7]. The qPalace tool uses the SFQ cell library and supports path balancing, which generates the correct gate depths.

Benchmarks: We selected a wide range of applications from different benchmark suites. We evaluated mcf, leela, xz, deepsjeng (sjeng), nab, lbm, parest, and namd from the SPEC CPU 2017 benchmark suite [35]. All the benchmarks are from the SPECrate suite and tested with the ref dataset. In addition to SPEC, we use representative microbenchmarks from the RISC-V repository [29] to evaluate the performance of SuperBP: Vector Addition (vvadd), Median Filter (median), Multiply (intmul), Sparse Matrix-Vector Multiplication (spmv), and Dhrystone (dhstone). We also used all 440 mobile and server traces from the 5th Championship Branch Prediction competition [12]. For the SPEC and RISC-V benchmarks, due to slow simulation speed, we functionally skipped the first 10 billion instructions and gathered the branch information for the next 1 billion instructions for each benchmark. These are then used to evaluate the predictor performance.
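As a hedged sketch of this XOR-folding family of index functions, one table index can be computed as below. The names pc_hash, hist_hash_i, and num_rows are illustrative, not the exact symbols of [42]:

```python
def table_index(pc_hash, hist_hash_i, num_rows):
    """Sketch of one hashed-perceptron indexing function: fold a hash
    of table i's history segment into a hash of the PC, then reduce
    modulo the number of rows in the table. Table 0 uses the PC hash
    alone (its history hash term is taken as the PC hash itself)."""
    return (hist_hash_i ^ pc_hash) % num_rows
```

In hardware, each such index drives a separate DEMUX tree per table, which is where the JJ overhead discussed in the results comes from.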

Perceptron and Hashed Perceptron
We first compare the performance in terms of MPKI (mispredictions per 1000 instructions) for perceptron SuperBP and hashed perceptron SuperBP under varying JJ budgets. We ran simulations with different design parameters (number of weights, number of tables in hashed perceptron, and global history length) for a given JJ budget and picked the best-performing design for both predictors. The average MPKI of the perceptron and hashed perceptron designs is shown in Figure 17. These are the MPKI results averaged across all benchmarks for the best-performing predictor design. The MPKI results show that perceptron SuperBP achieves better accuracy than hashed perceptron SuperBP across all JJ budgets. This outcome is because the access-port DEMUX trees consume most of the JJs in the hashed perceptron SuperBP. In the hashed perceptron SuperBP design, the DEMUX costs around 70-85% of the JJs, and the actual weight storage costs only 3-4% of the JJs. In the perceptron SuperBP design, by contrast, the weights use up to 12% of the JJ budget, while only about 50% of the budget is used to index a given entry of the weight table. Hence, perceptron SuperBP has around 3 times the weight storage capacity of hashed perceptron SuperBP. This additional weight storage allows tracking more history, improving prediction accuracy. Thus, in a JJ-constrained environment, which most SFQ designs currently are, accounting for the table access overheads eats into the performance advantages of the hashed perceptron predictor. In the rest of this section, we only evaluate perceptron SuperBP across different designs.

Hardware Performance
We evaluate the hardware performance in terms of JJ count and static power. We built Verilog netlists using the publicly available cell libraries [31] for both the NDRO- and HC-DRO-based perceptron branch predictors. Furthermore, we have verified the functional correctness and timing with different inputs. We calculated the total JJ count and static power using the SFQ cell library provided by [31]. For the dynamic AND gate, which is not provided in the library, we derived the data from [30] and [44].

JJ Count: Table 1 shows the JJ count of the NDRO and SuperBP perceptron and hashed perceptron predictors for different sizes. The size of the branch predictor is represented using the notation N × (n + 1), where N is the number of entries in the perceptron weight storage and n is the length of the global history. For hashed perceptron, n + 1 is the number of weight tables. The data includes the JJ counts for splitters, mergers, and any splitters necessary for the clock distribution. The first row shows the JJ count of the NDRO-based design. The second row shows the JJ count of the SuperBP design and its JJ savings over the NDRO-based design in percentage. The third row shows the JJ count of the hashed perceptron predictor and its JJ savings over the NDRO-based design in percentage. A negative saving here means using more JJs than the NDRO-based design. Due to the overheads associated with addressing multiple tables in the hashed perceptron predictor, its JJ count is consistently higher than the NDRO baseline and significantly higher than the perceptron predictor JJ counts for a given predictor configuration. At size 16×8, the hashed SuperBP uses double the JJs of the perceptron SuperBP while having similar accuracy. The breakdown of JJ counts across different components of the branch predictor is shown in Table 2 for the two endpoints of our design space exploration. The 16×8 SuperBP costs 13,079 JJs, as shown in Table 1, while the NDRO design for the same branch predictor dimensions consumes 20,516 JJs. Interestingly, the division of JJs across the various parts of the branch predictor design remains roughly the same between the NDRO design and SuperBP. However, SuperBP uses nearly 36% fewer JJs to achieve identical accuracy to the NDRO design.
In Table 2(b), we show the same breakdown statistics for the 128×32 branch predictor configuration. Again, the division of JJ counts across different predictor circuit components remains roughly the same, but SuperBP uses 39% fewer JJs to achieve the same accuracy.

Power Consumption: Table 3 shows the static power for each predictor design (Table 3: total static power consumed by the BP only, in W, and the saving % over the baseline). Since static power is a function of the number of JJs used in a design, a 64×32 SuperBP consumes around 35.41% less static power than the NDRO-based design. Note that the higher static power consumption of the NDRO design also leads to higher cooling costs, since the additional power draw leads to higher heat extraction costs. In the results shown here, we did not count the additional cooling power needed for NDRO-based designs.

Impact on Latency: As for latency, although it takes some time for the LoopBack write to update the weight after each inference, the weight storage is only accessed when there is a branch instruction. The weight LoopBack update will finish before the next branch instruction for such designs. For the inference unit, the differences between the NDRO-based design and SuperBP are the w_i x_i multiplication unit and the first-level 3-bit adders. In the NDRO-based design, the gate depth of the multiplication unit is 3, and the gate depth of a 3-bit adder is 5. In the SuperBP design, the gate depth of the multiplication and 2's complement translation is 4, and the gate depth of a 3-bit adder is 2. However, the SuperBP design takes two more cycles to process all three serial signals. As a result, the NDRO-based design and SuperBP have the same inference gate latency (3 + 5 + adder tree versus 4 + 2 + 2 + adder tree).
Sensitivity to JJ Budget: We simulated both the NDRO-based design and SuperBP and compared the MPKI of both designs under the same JJ budget. We evaluate all benchmarks with JJ budgets from 30K to 90K. For a fair comparison, we ran the simulation multiple times for each hardware budget with different N (number of entries in the weight storage) and n (global history length). Figure 18 shows the MPKI for both the NDRO-based design and SuperBP with the optimal size. From this figure, we can see that SuperBP consistently outperforms the NDRO-based design.

Detailed Performance Evaluations
We performed performance simulations to measure end-to-end application performance, using 30K JJ budget designs. Figure 19 shows each benchmark's absolute MPKI and MPKI reduction %. For the CBP traces, we show the averages for mobile and server traces. lbm shows a 41% MPKI reduction. The average MPKI reduction is 13.6%. Figure 20 shows the IPC improvement for each benchmark. Note that the CBP traces contain branch instruction information but not all the instructions necessary to simulate performance. Hence, we did not use CBP traces for IPC measurements. As expected, each benchmark's overall execution time improvement is roughly proportional to the MPKI reduction experienced by that benchmark.
For instance, nab shows the largest IPC improvement (10.6%). This is because nab has a relatively high MPKI reduction as well as a relatively high MPKI, both of which contribute to a considerable reduction in branch penalty. In contrast, lbm has a low base MPKI; hence, branch misprediction is not a significant bottleneck for this benchmark. Thus, the IPC improvement is minimal, as expected.

CONCLUSION
Efficient CPUs built using SFQ can enable broad adoption across many domains, from energy-efficient data centers to qubit control. This work highlights the challenges in building branch predictors using SFQ technology. We proposed SuperBP, a novel predictor design built with High-Capacity Destructive ReadOut (HC-DRO) cells that operates directly on HC-DRO data without extra decoding and encoding. Using gate-level synthesis tools and gate-level simulations, we showed that SuperBP saves 39% of the JJ count and achieves a 34% reduction in power. We evaluated the performance of perceptron SuperBP and hashed perceptron SuperBP over the NDRO baseline using representative benchmarks under the same JJ budget. For example, when the JJ budget is 30K, our SuperBP design reduces the MPKI by 13.6% on average.

A.7 Experiment customization
We also provide the hashed perceptron version of SuperBP. You can replace "predictor.h" with "predictor_hashed.h", which is located in SuperBP-CBP-Simulation/cbp16sim/src/simnlog. To revert the predictor back to the perceptron version of SuperBP, replace "predictor.h" with "predictor_perceptron.h". You can also change TABLE_SIZE, TABLE_WIDTH, or TABLE_COUNT in "predictor.h" to get results for different sizes of the SuperBP and hashed perceptron predictors.

Figure 7: Write port and DAND gate

Figure 15: (a) State machine of the counter (b) HC-DRO serial adder

Total JJ count and the saving % over the baseline