Hardcaml MSM: A High-Performance Split CPU-FPGA Multi-Scalar Multiplication Engine

This paper presents a split CPU-FPGA Multi-Scalar Multiplication (MSM) engine written in Hardcaml. Hardcaml MSM was submitted to the 2022 ZPrize cryptography competition and won 1st place in the FPGA track. Hardcaml MSM targets the BLS12-377 elliptic curve and is currently the lowest-latency published FPGA implementation. For an MSM of order 2^26 we achieve a single-round MSM latency of 5.518s and average power of 52W, with our design running at 278MHz. When performing multiple rounds of MSM with the same base points but random scalars, we are able to further mask host I/O and memory latency and reduce latency to 5.083s. This is a latency improvement of 13% over the previously fastest reported FPGA solution, and an improvement of 472% when compared to the state-of-the-art open-source CPU library gnark-crypto.


INTRODUCTION
Zero-knowledge proofs [7] (ZKPs) are powerful cryptographic tools that allow a prover to prove that a certain statement is true without revealing any other information to the verifier. ZKPs are very attractive for applications where online privacy is paramount, for example digital signatures [10], online voting [19], blockchain [2], and distributed systems [23].
One class of ZKP receiving significant recent attention is the Zero-knowledge Succinct Non-interactive Argument of Knowledge (zk-SNARK) [6]. This type of ZKP requires no interaction between the prover and verifier, and is compact and quick to verify.
One of the most popular zk-SNARK implementations, Groth16 [12], requires a huge number of elliptic curve (EC) operations known as Multi-Scalar Multiplications (MSMs) and Number Theoretic Transforms (NTTs). Current systems that use zk-SNARKs tend to require MSMs with millions of inputs. In this paper we focus on accelerating these large-scale MSMs.
A crucial element of our success was making use of Hardcaml [21]. Hardcaml is an OCaml library that can be used to design and test hardware. Hardcaml leverages both the strong type system of OCaml and a verbose built-in circuit linter to increase hardware design productivity, reliability, and efficiency. A built-in cycle-accurate simulator allows unit-level tests to live alongside the Hardcaml source code, and can optionally print digital ASCII waveforms. These tests provide fast feedback on designs and help catch future bugs.
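For flavor, the following is a minimal sketch of this workflow: a 4-bit counter defined and stepped in the built-in simulator. This is our own illustration, not code from the Hardcaml MSM repository, and the API names follow the public Hardcaml library as we understand it:

```ocaml
(* A minimal Hardcaml sketch: a 4-bit free-running counter,
   exercised with the built-in cycle-accurate simulator. *)
open Hardcaml
open Signal

let circuit =
  let clock = input "clock" 1 in
  let spec = Reg_spec.create ~clock () in
  (* Register with feedback: count <- count + 1 every cycle. *)
  let count = reg_fb spec ~width:4 ~f:(fun d -> d +:. 1) in
  Circuit.create_exn ~name:"counter" [ output "count" count ]

let () =
  let sim = Cyclesim.create circuit in
  let count = Cyclesim.out_port sim "count" in
  for _ = 1 to 5 do
    Cyclesim.cycle sim;
    Printf.printf "count = %d\n" (Bits.to_int !count)
  done
```

Unit tests in this style run in seconds, which is what makes the fast design feedback loop described above possible.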
To compute large MSMs efficiently, we need to optimize a number of aspects of our system, including elliptic curve primitives, higher-level MSM algorithms, and embedded system architecture. While prior work has presented both MSM and ZKP engines implemented in isolation on CPUs [8, 16], ASICs [25], GPUs [14, 17], and FPGAs [1, 11, 24], Hardcaml's robust design and testing capabilities allowed us to integrate these optimizations into a single design. As a result, Hardcaml MSM won first place in the ZPrize FPGA track [9]. Hardcaml MSM implements the BLS12-377 elliptic curve, and is available on GitHub [22]. We have tested Hardcaml MSM on a VU9P FPGA in a split CPU-FPGA architecture that offloads some tasks to the host CPU.
In summary, our contributions in this paper include:
• A fully pipelined, strongly unified mixed point adder that also supports subtraction and only requires 7 multiplication and 6 addition operations.
• System architecture and techniques to mask PCIe latency and increase performance.
• A stall controller that impacts performance by only 0.543%, using simple heuristics that require only 4-deep FIFOs.
• A split CPU-FPGA architecture that allows for a more streamlined FPGA implementation achieving 278MHz.

PRIOR WORK
Performance of published ZKP accelerators has been steadily improving. One of the first FPGA solutions, available on Amazon's AWS FPGA cloud [11], was designed for the ZCash blockchain. It performs better than a CPU implementation, but implements point multiplication naively and does not support large-scale MSMs. PipeZK [25] presents an ASIC ZKP accelerator that targets both the MSM and NTT problems, but only shows results for MSMs up to order 2^20. They also do not attempt to map their accelerator to widely available FPGAs, which we believe poses a different set of challenges.
PipeMSM [24] presents an FPGA-based MSM accelerator, but likewise only shows results for MSMs up to order 2^20. In addition, its point adder is implemented using projective coordinates and requires a larger number of multipliers and adders.
CycloneMSM [1] presents an FPGA-based MSM accelerator for MSMs up to order 2^26. While this design includes several novel optimizations, including a multi-cycle pipelined point adder that also uses Twisted Edwards coordinates, we present precomputations and a new architecture that it does not take advantage of.
A recent GPU implementation [14] shows higher performance than Hardcaml MSM, but our architecture introduces several optimizations that are novel and beneficial.

THE MSM PROBLEM
In general, the Multi-Scalar Multiplication (MSM) problem is to take a list of scalars and points and compute the sum of each point scaled by its corresponding scalar, over a large $q$-bit prime field that determines the security level, as shown in (1). Here $N$ is the scale of the MSM, $k_i$ is a $\lambda$-bit scalar, and $P_i$ is a $q$-bit EC point:

$$\mathrm{MSM} = \sum_{i=1}^{N} k_i \cdot P_i \qquad (1)$$
Elliptic curve cryptography (ECC) allows for smaller keys than non-EC cryptography such as RSA, while providing the same level of security. For example, a small 228-bit ECC key requires as much time to crack as a much larger 2,380-bit RSA key.
Implementing an MSM requires two elliptic curve operations: point addition and point doubling. These operations can be optimized with efficient modular reduction algorithms such as Barrett [3] or Montgomery reduction, and better-than-$O(n^2)$ multiplication techniques such as the Karatsuba [13] algorithm.
A naive algorithm using repeated point addition of scalar-point products can compute MSMs with a few thousand inputs. When the scale of an MSM increases beyond a million points, other algorithms such as Pippenger's [20] provide much better performance. Pippenger's algorithm is discussed in more detail in the architecture section. Figure 1 shows how the MSM problem relies on optimized EC primitives, and feeds into the overall ZKP algorithm.
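For intuition, the naive approach can be sketched in OCaml as follows. This is an illustrative model, not code from our implementation; `zero`, `add`, and `double` stand in for abstract EC group operations, and `Z.t` is Zarith's arbitrary-precision integer:

```ocaml
(* Naive MSM: sum_i k_i * P_i, via per-term double-and-add. *)
let scalar_mul ~zero ~add ~double (k : Z.t) p =
  let rec go acc base k =
    if Z.equal k Z.zero then acc
    else
      (* Add the current base point if the low scalar bit is set. *)
      let acc = if Z.is_odd k then add acc base else acc in
      go acc (double base) (Z.shift_right k 1)
  in
  go zero p k

let naive_msm ~zero ~add ~double scalars points =
  List.fold_left2
    (fun acc k p -> add acc (scalar_mul ~zero ~add ~double k p))
    zero scalars points
```

Each term costs on the order of $\lambda$ doublings and additions, which is why this approach does not scale to millions of points.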

HARDCAML MSM
Hardcaml MSM is a split CPU-FPGA accelerator written in Hardcaml and integrated into an Amazon AWS FPGA cloud F1 instance using Vitis. Hardcaml MSM won first place in the recent ZPrize competition, and at this time is the fastest reported FPGA solution to the large-scale MSM problem in accelerating ZKPs. We target the BLS12-377 curve, which is popular in many ZKP systems due to its high security level, and which can be transformed into other coordinate systems that allow for faster hardware implementations. In BLS12-377, $q$ is 377 bits and $\lambda$ is 253 bits. We implement a version of Pippenger's algorithm to solve the MSM problem. Pippenger's algorithm reformulates the dot product from (1) into smaller dot products over buckets, where each bucket represents a small contiguous slice of the scalar:

$$\mathrm{MSM} = \sum_{j=0}^{s-1} 2^{jb} \sum_{i=1}^{N} k_i[j] \cdot P_i \qquad (2)$$

Here $b$ is the bucket size in bits, $s$ is the number of slices, and $k_i[j]$ is the $j$-th $b$-bit slice of scalar $k_i$. The product $s \cdot b$ must be greater than or equal to $\lambda$, the number of bits in the scalar field.

Top-Level Architecture
Figure 2 shows the top-level architecture of Hardcaml MSM.
The computational cost is broken down in (3) below, with the bucket sums, triangle sums, and final accumulation shown. Here $A$ and $D$ represent the cost of point additions and doubles respectively:

$$\underbrace{s \cdot N \cdot A}_{\text{bucket sums}} \;+\; \underbrace{s \cdot 2^{b+1} \cdot A}_{\text{triangle sums}} \;+\; \underbrace{(s-1)(b \cdot D + A)}_{\text{final accumulation}} \qquad (3)$$

To calculate the bucket sums for a given slice $j$, we create $2^b$ buckets, one for each possible value of $k_i[j]$. Then, into each bucket $B_m$, we sum all the $P_i$ such that $k_i[j] = m$. On the host CPU, we calculate the triangle sums by multiplying each $B_m$ by $m$ and adding them all together to get a sum for the slice. Finally, we combine the slice sums into the result of the MSM.
For large-scale MSMs ($N = 2^{20}$ and greater), the bulk of the computation lies in the bucket sums. For example, on the BLS12-377 curve with $N = 2^{26}$ and $b = 16$, the bucket sums require over 1 billion point additions, while the rest of the computation requires only about 8,000 operations. Because of this, we compute bucket sums on the FPGA, while in parallel performing triangle sums and final accumulations on the CPU.
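A software model of this split, for one slice, is sketched below. This is illustrative OCaml, not the hardware implementation; `slice`, `add`, and `zero` are hypothetical stand-ins for the slice-extraction and group operations:

```ocaml
(* Pippenger model for slice j: bucket accumulation (the part done
   on the FPGA) followed by the triangle sum (done on the host CPU). *)
let bucket_sums ~zero ~add ~slice ~b ~j scalars points =
  let buckets = Array.make (1 lsl b) zero in
  List.iter2
    (fun k p ->
      let m = slice k j in (* m = k[j], the j-th b-bit slice *)
      if m <> 0 then buckets.(m) <- add buckets.(m) p)
    scalars points;
  buckets

(* Triangle sum: computes sum_m (m * B_m) using only additions, by
   keeping a running suffix sum while scanning buckets downwards. *)
let triangle_sum ~zero ~add buckets =
  let running = ref zero and acc = ref zero in
  for m = Array.length buckets - 1 downto 1 do
    running := add !running buckets.(m); (* B_m + B_{m+1} + ... *)
    acc := add !acc !running
  done;
  !acc
```

The suffix-sum trick is what makes the triangle sum cost roughly $2 \cdot 2^b$ additions per slice rather than requiring scalar multiplications.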
Because the base points $P_i$ do not change over multiple rounds of a given ZKP, we can precompute curve and coordinate transformations on the $P_i$ and store them in DDR memory on the FPGA board. We then need to send only the scalars $k_i$ to the FPGA to compute the MSM.
In order to pick values of the bucket size $b$ and number of slices $s$, we take into account the amount of memory used on the FPGA and the amount of time required to compute the triangle sums and final accumulations on the CPU. Previous work [1] reduced the number of bucket sums required by using a small number of large slices ($b = 16$). However, all the buckets for a slice of this size cannot fit in the FPGA's URAM, and the values of $k_i[j]$ are not pre-sorted, as the scalars contain random values picked at runtime. Instead, Hardcaml MSM uses a larger number of small slices ($b = 13$). This maps very well to Xilinx URAM memory primitives: each bucket memory is only 2 URAMs deep, can be floorplanned in such a way as to allow a high Fmax, and all bucket URAMs fit onto the FPGA to be operated in parallel. On a VU9P this utilizes roughly 60% of the available URAM. Note that using $b = 16$ would mean that each bucket memory needs to be 16 URAMs deep, making it impossible to fit all bucket URAM on the FPGA, and limiting Fmax. Having all bucket URAM fit onto the FPGA allows us to completely parallelize the inner point additions, such that every $(P_i, k_i[j])$ we stream into the FPGA can be processed immediately, without requiring sorting.
Depending on the depth of our mixed point adder, we may submit an operation to a bucket that already has a pending operation in the adder pipeline. To solve this problem we implement a stall controller with FIFOs that hold a per-bucket $(P_i, k_i[j])$ until that bucket is ready. We have provided modeling tools and found the throughput impact of this is on average only 0.543%, allowing very shallow 4-deep FIFOs that can be implemented logically in BRAM memory primitives.
The board we targeted has 64GiB of DRAM, split over 4 DDR4 interfaces. Our point transformations only require 8.8GiB, which allows us to localize access to a single interface. We implement FPGA-CPU communication using OpenCL libraries in Vitis. For the interface between our Hardcaml MSM core and the AWS shell on the FPGA, we make use of HLS kernels that allow frequency scaling and merging of data streams as they arrive from DRAM and PCIe. Rather than using a fixed clock, Vitis will attempt to find the maximum frequency possible, and then scale the provided clock post-implementation to allow for the highest Fmax.

Twisted Edwards Coordinates and Precomputation
The BLS12-377 curve has fixed parameters $a$ and $b$. Affine points $(x, y)$ in Weierstrass form on the curve satisfy $y^2 = x^3 + ax + b$. Point addition in this form requires expensive inversions that we want to avoid. This section shows transformations onto a Scaled Twisted Edwards curve that lead to very efficient hardware point addition operations.
First, we convert points in Weierstrass form to points on a Montgomery [18] curve $B v^2 = u^3 + A u^2 + u$ ($A$ and $B$ are curve parameters), with the map $(x, y) \mapsto (u, v) = (s(x - \alpha),\; s y)$. An elliptic curve in Weierstrass form is equivalent to a Montgomery curve with $A = 3\alpha s$ and $B = s$, where $s = \left(\sqrt{3\alpha^2 + a}\right)^{-1}$ and $\alpha$ is one of the roots of $x^3 + ax + b = 0$. A Twisted Edwards curve has the formula $a x^2 + y^2 = 1 + d x^2 y^2$, where $a$ and $d$ are parameters of the curve.
Once we have this curve transformation, we can convert points onto the Twisted Edwards curve [4] by

$$(x, y) = \left(\frac{u}{v},\; \frac{u - 1}{u + 1}\right)$$

The catch here is that not all points can be mapped: points with $u = -1$ or $v = 0$ are not valid on a Twisted Edwards curve, and there are 5 such adversarial points on the BLS12-377 curve. Because of our split CPU-FPGA architecture, we are able to filter these out on the host; if we encounter them, the portion of the MSM result they contribute is calculated by a CPU side-band process without transformation.
While prior art [1] takes advantage of this transform, we take it further by transforming the curve into a Scaled Twisted Edwards curve [5] with $a = -1$, given by $-x^2 + y^2 = 1 + d' x^2 y^2$, via the scaling $(x, y) \mapsto (\sqrt{-a}\, x,\; y)$ with $d' = -d/a$. Another transformation unique to Hardcaml MSM is as follows. We take our points on the Scaled Twisted Edwards curve in affine coordinates and transform them into extended coordinates (8), $(x, y) \mapsto (X : Y : Z : T) = (x : y : 1 : x y)$, which contain redundant values $Z$ and $T$ but allow for a shorter point addition formula.
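A host-side sketch of this precomputation chain is shown below. This is illustrative OCaml using Zarith, not our implementation; `alpha`, `s`, and `sqrt_neg_a` are hypothetical names for the precomputed transformation constants:

```ocaml
(* Map an affine Weierstrass point into extended coordinates on the
   scaled twisted Edwards curve, working modulo the field prime p. *)
let to_extended ~p ~alpha ~s ~sqrt_neg_a (x, y) =
  (* Shadow arithmetic operators with modular versions. *)
  let ( * ) a b = Z.(rem (a * b) p) in
  let ( - ) a b = Z.(rem (a - b + p) p) in
  let ( + ) a b = Z.(rem (a + b) p) in
  let inv v = Z.invert v p in
  (* Weierstrass -> Montgomery: (x, y) -> (s(x - alpha), s y). *)
  let u = s * (x - alpha) in
  let v = s * y in
  (* Montgomery -> twisted Edwards: (u, v) -> (u/v, (u-1)/(u+1)).
     The 5 adversarial points with v = 0 or u = -1 are filtered out
     before this step and handled separately on the CPU. *)
  let xe = u * inv v in
  let ye = (u - Z.one) * inv (u + Z.one) in
  (* Scale to a = -1, then extended coordinates (X : Y : Z : T). *)
  let xs = sqrt_neg_a * xe in
  (xs, ye, Z.one, xs * ye)
```

Because this runs once per base point on the host and the results are stored in DDR, the field inversions here never appear on the FPGA datapath.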
Table 1 compares the number of field operations required after applying our transformations. Here M, S, D, and A represent the number of field multiplies, squares, multiplications by a constant, and additions or subtractions respectively.
When compared to prior work, our implementation achieves the smallest number of field operations, with only 7M and 6A. The previous best Scaled Twisted Edwards implementation, from the Explicit-Formulas Database [5] (EFD), achieves 7M + 8A + 2D, but it is not strongly unified, meaning doubling is not supported and special care would need to be taken to avoid the case where the two input points are identical.
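For reference, the strongly unified $a = -1$ mixed-addition formulas from the EFD [5] that such an operation count builds on are reproduced below. This is our reconstruction, not an excerpt from the Hardcaml source; we assume the second point's $(Y_2 - X_2)$, $(Y_2 + X_2)$, and $k T_2$ values (with $k = 2d'$ and $Z_2 = 1$) are precomputed on the host, and that $2Z_1$ is a free shift in hardware:

$$
\begin{aligned}
A &= (Y_1 - X_1)(Y_2 - X_2), \qquad & B &= (Y_1 + X_1)(Y_2 + X_2),\\
C &= T_1 \cdot (k T_2), & D &= 2 Z_1,\\
E &= B - A, \quad F = D - C, & G &= D + C, \quad H = B + A,\\
X_3 &= E F, \quad Y_3 = G H, & T_3 &= E H, \quad Z_3 = F G.
\end{aligned}
$$

Under these assumptions the datapath performs 7 multiplications ($A$, $B$, $C$, $X_3$, $Y_3$, $T_3$, $Z_3$) and 6 additions or subtractions, matching the count above.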

Triangle Sum Offloading
After our controller has detected that a given bucket has finished processing all of its $(P_i, k_i[j])$ operations, it starts streaming the bucket values back to the host CPU. As the host receives these, it starts both the triangle sums and final accumulations for each bucket, and begins sending new scalars $k_i$ to the FPGA for further MSM rounds. Transferring data while both the CPU and FPGA are doing meaningful work masks the PCIe I/O penalty. Because we use many small buckets rather than a few large buckets, we can start calculating a new MSM as soon as one bucket has been sent to the host, before all buckets have finished. We also do not rely on a fast CPU, as the triangle sums are much easier to calculate than the bucket sums. After the triangle sums are finished, the host CPU reverses the Scaled Twisted Edwards transforms and sums in any adversarial points that it has stored.

Streaming Scalar Transformation
Kernels written in C++ and synthesized via Vivado HLS combine the points $P_i$ stored in the FPGA's DDR memory with the scalars $k_i$ streaming from the CPU into $(P_i, k_i)$ pairs, allowing the FPGA to start work as soon as it receives the first input.
In order to reduce the number of buckets required, we perform a transformation on the scalars on the FPGA in real time. Each $k_i[j]$ represents an unsigned integer in the range $[0, 2^b)$. We instead map it to a signed integer in the range $[-2^{b-1}, 2^{b-1}]$, subtracting $2^b$ and propagating a carry into the next slice for each $k_i[j]$ as it streams into the FPGA. We can exploit this because our mixed point adder implements subtraction cheaply, and every $k_i[j] < 0$ can be added into a positive bucket URAM slot by instead subtracting its value, meaning $2^{13}$ buckets can be implemented by a $2^{12}$-deep URAM.
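A bit-accurate model of this recoding is sketched below in illustrative OCaml; the hardware performs the same digit-and-carry computation as slices stream in:

```ocaml
(* Signed-digit recoding of the b-bit slices of a scalar k: each
   unsigned digit in [0, 2^b) becomes a signed digit in
   [-2^(b-1), 2^(b-1)], with a carry propagated to the next slice. *)
let signed_slices ~b ~num_slices (k : Z.t) =
  let half = 1 lsl (b - 1) in
  let digits = Array.make num_slices 0 in
  let carry = ref 0 in
  for j = 0 to num_slices - 1 do
    (* Extract the j-th b-bit window and absorb the pending carry. *)
    let d = Z.to_int (Z.extract k (j * b) b) + !carry in
    if d > half then begin
      carry := 1;
      digits.(j) <- d - (1 lsl b) (* negative digit: adder subtracts *)
    end else begin
      carry := 0;
      digits.(j) <- d
    end
  done;
  (* A final carry of 1 would need one extra slice to absorb it. *)
  digits
```

Halving the bucket memory this way is what lets every slice's buckets fit in on-chip URAM simultaneously.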

Fully Pipelined Unified Point Adder
Our 4-stage unified mixed point adder, shown in Figure 3, takes advantage of the previous optimizations to allow a high Fmax of 278MHz. The fully pipelined adder can accept a new input every clock cycle and has a latency of 238 clock cycles. Both input points A and B are in extended coordinates; point B inputs originate from DRAM, converted from affine coordinates, and have a fixed initial $Z$ coordinate of one. Pipe blocks pipeline data that is not being modified in that stage. We provide a subtract input that, when set, performs a subtraction ($A - B$) instead of an addition, required by our signed scalar transformation. When performing subtraction, the first-stage multiplications and second-stage subtractions have their operands swapped. Instead of doing a full reduction to $[0, p - 1]$ after each multiplication, we selectively perform a cheaper coarse modular reduction using Barrett reduction after the first and third stages.

Stall Controller
Figure 4 shows our stall controller and hazard detection dataflow. The hazard we must avoid is that while a pair of points for a target bucket is being added together, any other point arriving for the same bucket would corrupt its data. We implemented a stall controller and scalar shift register to track which buckets can be processed, along with stall FIFOs to temporarily hold points when a bucket is busy. Our experiments showed that a 4-stage pipeline maximized our Fmax without significantly worsening latency or stall rate.
Our stall controller tries to keep the point adder as busy as possible with a few simple heuristics. We need to track whether data is present in any of the 238 pipeline stages and check whether a new coefficient would cause a hazard. Done naively, this would require 238 comparators in parallel, followed by a wide OR reduction over every pipeline stage of the adder that can potentially contain valid data, which would impact Fmax.

Scalar Tracking.
Our stall controller processes multiple slices on successive clock cycles rather than all in parallel. This means that for slice $j$, at a given clock cycle, we only check hazards for its buckets. The locations in the point adder that contain data from the same slice are deterministic. This architecture means we only need to compare and OR-reduce at most 12 of the 253-bit scalars in the pipeline, rather than all 238.

Stall Point FIFO.
When we detect a hazard, the point $P_i$ and $b$-bit scalar slice $k_i[j]$ are placed in a stalled point FIFO and we insert a bubble into the pipeline for that cycle. There are separate FIFOs for each bucket being processed, and because we only need to process one stall FIFO per clock cycle, all stall FIFOs can be logically mapped to a single wide BRAM.

Heuristics.

We modeled the stall controller and adder so that we could experiment with several algorithms to find the most efficient one. The best algorithm we found was:
(1) If all the stall FIFOs have at least one $(P_i, k_i[j])$, process them. Else,
(2) If any of the stalled point FIFOs are full, process them. Else,
(3) Process incoming $(P_i, k_i[j])$.
A sketch of this policy appears below. By processing full FIFOs first, we avoid overflow. When full, however, they must be flushed. We found that with only 4-deep FIFOs this was extremely rare. Modeling with this algorithm showed the performance drop due to stalls was on average only 0.543%.
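A behavioral sketch of this arbitration policy in illustrative OCaml follows; the FIFO representation and names here are hypothetical, not taken from the RTL:

```ocaml
(* One arbitration decision per clock cycle: drain all stall FIFOs,
   drain a single full FIFO, or accept the incoming (point, slice). *)
type 'a decision = Drain_all | Drain_full of 'a Queue.t | Accept

let arbitrate fifos ~fifo_capacity =
  (* Rule 1: every stall FIFO has pending work, so drain them all. *)
  if List.for_all (fun q -> not (Queue.is_empty q)) fifos then Drain_all
  else
    (* Rule 2: a FIFO is about to overflow, so flush it first. *)
    match
      List.find_opt (fun q -> Queue.length q >= fifo_capacity) fifos
    with
    | Some q -> Drain_full q
    | None -> Accept (* Rule 3: take the incoming pair. *)
```

The policy is cheap enough to evaluate in a single cycle, which is what keeps the stall controller off the critical path.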

Optimized Field Operations
The most costly operations inside our point adder are the field multiplications. To optimize these we use the Karatsuba algorithm, which requires $O(n^{\log_2 3}) \approx O(n^{1.585})$ single-digit multiplications to multiply $n$-digit numbers, rather than the naive long-multiplication algorithm, which requires $O(n^2)$ single-digit multiplications.
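A sketch of the recursion in illustrative OCaml over Zarith integers, assuming nonnegative inputs (the hardware version is a fixed-width datapath rather than a recursive function):

```ocaml
(* Karatsuba multiplication: three half-width multiplications
   instead of four, giving the O(n^log2(3)) recurrence. Below the
   bit threshold we fall back to the builtin multiply. *)
let rec karatsuba ?(threshold = 64) x y =
  let n = max (Z.numbits x) (Z.numbits y) in
  if n <= threshold then Z.mul x y
  else begin
    let h = n / 2 in
    (* Split both operands at bit h: x = x1 * 2^h + x0, etc. *)
    let x1 = Z.shift_right x h and x0 = Z.extract x 0 h in
    let y1 = Z.shift_right y h and y0 = Z.extract y 0 h in
    let z2 = karatsuba ~threshold x1 y1 in
    let z0 = karatsuba ~threshold x0 y0 in
    (* (x1 + x0)(y1 + y0) - z2 - z0 = x1*y0 + x0*y1 *)
    let z1 = Z.(karatsuba ~threshold (add x1 x0) (add y1 y0) - z2 - z0) in
    let hi = Z.shift_left z2 (2 * h) in
    let mid = Z.shift_left z1 h in
    Z.(hi + mid + z0)
  end
```

In hardware, each recursion level trades one DSP-heavy full multiplier for three smaller ones plus adders, which is a good fit for the FPGA's resource mix.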
All multiplications use modular arithmetic, so we chose Barrett's reduction algorithm, with slight modifications to improve FPGA resource usage and allow a higher Fmax. We split Barrett reduction into a coarse reduction used after multiplication stages, and a fine-grained reduction after addition or subtraction stages. We are able to avoid costly multipliers in the fine-grained reduction step by storing reduction values in BRAMs [15].
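For reference, a model of the textbook Barrett reduction we build on is given below in illustrative OCaml; the hardware splits this into the coarse and fine stages described above, with the fine step's multiply replaced by BRAM lookups:

```ocaml
(* Barrett reduction of x modulo p (for x < p^2), replacing division
   by p with two multiplications and shifts. Precompute:
     k  = Z.numbits p
     mu = Z.div (Z.shift_left Z.one (2 * k)) p   (* floor(2^2k / p) *)
*)
let barrett ~p ~k ~mu x =
  (* q ~= x / p, computed from the precomputed reciprocal mu. *)
  let q = Z.shift_right (Z.mul mu (Z.shift_right x (k - 1))) (k + 1) in
  let r = Z.sub x (Z.mul q p) in
  (* q underestimates x / p by at most 2, so at most two
     conditional subtractions complete the reduction. *)
  let r = if Z.geq r p then Z.sub r p else r in
  if Z.geq r p then Z.sub r p else r
```

The coarse variant stops before the final conditional subtractions, leaving a slightly out-of-range but congruent value for a later stage to tidy up.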
We also replace any multiply-by-constant with one represented in non-adjacent form (NAF). If the constant has a NAF Hamming weight larger than a certain threshold, we use DSP slices; otherwise, we use long multiplication with LUTs. We also found that congestion was often the cause of lower Fmax; by selecting Vivado place-and-route strategies that avoid congestion, and strategically applying location constraints to the point adder and bucket URAMs, we increased post-route Fmax by nearly 10%.
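The NAF recoding used to weigh these choices can be modeled as follows (illustrative OCaml, assuming a nonnegative constant):

```ocaml
(* Non-adjacent form: digits in {-1, 0, 1} with no two adjacent
   non-zeros, minimizing the number of shift-add terms needed for
   a multiply-by-constant. *)
let naf (c : Z.t) =
  let rec go c acc =
    if Z.equal c Z.zero then List.rev acc
    else if Z.is_odd c then begin
      (* digit = 2 - (c mod 4), i.e. +1 or -1 *)
      let d = 2 - Z.to_int (Z.rem c (Z.of_int 4)) in
      go Z.(shift_right (c - of_int d) 1) (d :: acc)
    end
    else go (Z.shift_right c 1) (0 :: acc)
  in
  go c []

(* Hamming weight of the NAF: the number of non-zero digits. *)
let naf_weight c = List.length (List.filter (fun d -> d <> 0) (naf c))
```

A low `naf_weight` means a constant can be realized as a handful of shifts and adds in LUTs, freeing DSP slices for the general multipliers.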

MEASUREMENT RESULTS
We measured Hardcaml MSM on an AWS f1.2xlarge instance, targeting the BLS12-377 curve with the G1 subgroup generator. This AWS instance contains an Intel Xeon E5-2686 v4 processor (2.3 GHz base, 2.7 GHz turbo) and an UltraScale+ VU9P FPGA. Table 2 shows resource usage compared to CycloneMSM, as it was the only other FPGA MSM accelerator implemented on AWS. Resource counts include the overhead from the AWS shell, roughly 20%. Our implementation favors DSPs and FFs for point addition, while CycloneMSM uses more LUTs. The increased URAM usage is due to Hardcaml MSM mapping all buckets onto the FPGA, while CycloneMSM chose a larger $b = 16$ but then only a single bucket mapped physically to the FPGA.
MSM performance was measured for both single-round and four-round power-of-2 scale MSMs, and is shown in Figure 5. We present Mop/s (input points processed per second) as a normalized performance number for comparison, although this does not take technology node into account. Here CycloneMSM is implemented on the same 16nm VU9P FPGA; its $2^{26}$ MSM is reported as 11.7 Mop/s, compared to 12.1 Mop/s for Hardcaml MSM when computing a single round. PipeZK is a 28nm ASIC and achieves 5.7 Mop/s for a $2^{20}$ MSM. PipeMSM is on a 16nm U55C FPGA and achieves 3.8 Mop/s for a $2^{20}$ MSM. When we measure the time taken over multiple MSM rounds, we see the benefits of our split architecture and our memory and PCIe I/O optimizations. A four-round $2^{26}$ MSM takes 20.331s, which per round gives a latency of 5.083s and performance of 13.2 Mop/s, a 13% improvement over previous state-of-the-art FPGA accelerators. As far as we can tell, prior work does not mask memory and PCIe I/O and does not gain any benefit when multiple rounds are performed.
It is not easy to do an apples-to-apples comparison with GPU implementations, as they are implemented on different technology nodes, which was also highlighted by ZPrize running separate FPGA and GPU tracks. Compared to the state-of-the-art GPU MSM accelerator by Matter Labs & Yrrid [14] running on an NVIDIA A40, we are 9x slower, although we use 2.8x less power. For comparison, the A40 is built on a process node four generations ahead of the FPGA we were given to implement against.
Although we implemented a split CPU-FPGA architecture, the bulk of the time is still spent on the FPGA. Table 3 shows the breakdown of the time taken in each step. Note that the total time is not the sum of these steps, as stages execute in parallel.

CONCLUSIONS
This paper presents Hardcaml MSM, a split CPU-FPGA MSM engine written in Hardcaml, with the highest performance currently published for an FPGA implementation targeting the BLS12-377 curve. Hardcaml MSM won first place in the FPGA track of the 2022 ZPrize cryptography competition. For an MSM of order 2^26 we achieve a single-round MSM latency of 5.518s and average power of 52W, with our design running at 278MHz. When performing multiple rounds of MSM with the same base points but random scalars, we are able to further mask host I/O and memory latency and reduce latency to 5.083s. This is a latency improvement of 13% over the previously fastest reported FPGA solution [1], and an improvement of 472% compared to the state-of-the-art open-source CPU library gnark-crypto [8].

Figure 2: Top-level architecture of Hardcaml MSM.

Figure 5: Performance to compute an MSM on BLS12-377 for different implementations.

Table 1: Number of field operations for a point addition.

Table 3: Breakdown of time taken in each step in Hardcaml MSM for $N = 2^{26}$.