A low-cost configurable hash computing circuit for PQC algorithm

The development of quantum computers poses a significant threat to the security of traditional cryptographic algorithms. To address this challenge, post-quantum cryptographic (PQC) algorithms have been developed to resist attacks from quantum computers. The hash operation plays a critical role in many lattice-based PQC algorithms and is a substantial resource-consuming component of their implementations. In this paper, we propose a novel circuit structure for hash implementation on an FPGA platform. Our design integrates the hash operations of Kyber and Dilithium by reusing a shared Keccak unit and achieves different hash operation modes through parameter configuration in the control unit. Furthermore, we introduce a novel pipeline structure that multiplexes two sets of pipeline registers with an unfolding factor of 1. This approach significantly reduces hardware resource consumption while satisfying the performance requirements of the algorithms. The proposed architecture, implemented on a Kintex-7 FPGA, uses 7376 LUTs, 3059 FFs, and 4 DSPs. Compared to existing state-of-the-art designs, our design uses about 40.2% fewer LUTs and 14.1% fewer flip-flops. It achieves a clock frequency of 391 MHz and completes a Keccak operation in 0.123 μs. As a result, our design offers a low-cost, configurable hash computing circuit architecture with good performance.


INTRODUCTION
Modern cryptosystems are mainly composed of symmetric cryptography and public-key cryptography. Traditional public-key cryptographic algorithms, such as RSA [1] and ECC [2], rely on problems such as integer factorization and the elliptic curve discrete logarithm, which conventional computers cannot solve efficiently. However, with the rapid development of quantum computing in the 21st century, the computing power of quantum computers poses a serious potential threat to traditional public-key cryptosystems. Post-Quantum Cryptography (PQC) is a new generation of cryptographic standards developed by the National Institute of Standards and Technology (NIST) to resist attacks by quantum computers on existing cryptography. Currently, several post-quantum cryptographic algorithms, such as CRYSTALS-Kyber [3] and CRYSTALS-Dilithium [4], have been selected as finalists in the third round of the NIST selection. In these lattice-based cryptographic algorithms, the hash series operation stands out as one of the most resource-intensive modules. Consequently, the design of the hash series circuit significantly impacts the performance and resource consumption of the overall algorithm.
In Dilithium and Kyber, a wide range of hash series algorithms, such as SHAKE-256, SHAKE-128, and SHA3-512, are used throughout key generation, encryption and decryption, signing, and signature verification. These algorithms all share a common Keccak algorithm [5]. In [7], the authors proposed a Dilithium hash computing circuit featuring three Keccak kernels, each with a 64-bit data path. Two of these kernels are dedicated to matrix sampling, while the third is responsible for other hashing operations. Similarly, in [8], the authors designed separate structures for each Dilithium hash operation, with a standard AXI interface to facilitate tuning and optimization. In [9], the authors presented a permutation kernel using two rounds of Keccak for SHAKE-128 and SHAKE-256, employing three large registers to update input, output, and internal state data. Hardware implementations of PQC algorithms often allocate more than 30% of the overall hardware resources to the hash series operations, which increases the cost of the entire algorithm. Therefore, it is imperative to design a hash operation circuit that effectively reduces hardware resource consumption without affecting the performance of other critical operations.
The main contribution of this paper can be summarized as follows: we propose a novel pipelined Keccak unit and integrate all the hash operations in Kyber and Dilithium into a unified hash module. This approach eliminates redundancy and maximizes the utilization of hardware resources. The hash module supports different parameters, enabling the implementation of various hash algorithms. Our design substantially reduces hardware resource consumption and supports hardware integration testing of at least two PQC algorithms while satisfying the performance requirements of the algorithms.

MATHEMATICAL ALGORITHM
Kyber and Dilithium, both NIST third-round finalists, offer strong security and excellent performance. They are based on lattice cryptography, rely on the MLWE problem [12], and have good application prospects. Both algorithms need to generate a large amount of pseudo-random data, such as matrix vectors and polynomial vectors, which requires hash series operations. A hash operation essentially maps an input message of arbitrary length onto a number field of a given form, producing a sequence from which the original value is difficult to recover. Its core is the Keccak algorithm with a sponge structure, as shown in Figure 1.
Here $b$ is the bit length of the sponge state $S$, which is 1600 in this algorithm; $r$ and $c$ are the bit lengths of the rate and the capacity, respectively, and each hash algorithm has its own value of $r$. $\{M_0, \ldots, M_{n-1}\}$ are the message blocks after padding and $\{S_0, \ldots, S_{n-1}\}$ are the output blocks, all of length $r$. The sponge structure operates in two phases, absorbing and squeezing. In the absorbing phase, the input message of arbitrary length is first padded: a minimum number of 0 and 1 bits are appended so that its length becomes an integer multiple of $r$, and the padded message $M$ is divided into $n$ segments of length $r$, i.e. $n = \mathrm{len}(M)/r$. Each $M_i$, $i = 0, 1, \ldots, n-1$, is then XORed in turn with the first $r$ bits of the internal state $S$, with Keccakf1600 applied to $S$ after each XOR, until all message segments are used up. In the squeezing phase, the function Keccakf1600 processes the internal state $S$ step by step, repeating as many times as the output length of the given algorithm type requires. Each step takes the first $r$ bits of the 1600-bit output of Keccakf1600 and concatenates them until the required output length is obtained.
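The padding and segmentation step can be sketched in Python as follows. This is a minimal sketch rather than the circuit itself: it assumes byte-aligned messages and the FIPS 202 domain-separation suffixes (0x1F for SHAKE, 0x06 for SHA3), which the description above does not spell out. The names `RATE_BITS`, `SUFFIX`, and `pad_and_split` are hypothetical.

```python
# Rate r in bits for the hash modes named in this paper (b = r + c = 1600).
RATE_BITS = {"shake128": 1344, "shake256": 1088, "sha3-512": 576}

# Assumption: FIPS 202 suffix bits merged with the first padding bit,
# valid for byte-aligned messages only.
SUFFIX = {"shake128": 0x1F, "shake256": 0x1F, "sha3-512": 0x06}


def pad_and_split(message: bytes, mode: str) -> list:
    """Pad the message to a multiple of r and return the n blocks M_0..M_{n-1}."""
    rate_bytes = RATE_BITS[mode] // 8
    # Between 1 and rate_bytes padding bytes are always appended.
    pad_len = rate_bytes - (len(message) % rate_bytes)
    padding = bytearray(pad_len)
    padding[0] ^= SUFFIX[mode]   # suffix and leading 1 bit of the padding
    padding[-1] ^= 0x80          # trailing 1 bit of the padding
    padded = message + bytes(padding)
    # n = len(M) / r, as in the text above.
    return [padded[i:i + rate_bytes] for i in range(0, len(padded), rate_bytes)]
```

For example, a 168-byte message in SHAKE-128 mode fills one block exactly, so the padding occupies a second block and n = 2.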
The Keccakf1600 function is the core of the hash series algorithms in Kyber and Dilithium and consists of five step mappings (θ, ρ, π, χ, and ι). It arranges the 1600-bit state as a 5*5*64 three-dimensional structure, and in every step mapping 0≤x<5 and 0≤y<5.
Algorithm 1: Adjusted Keccakf1600 algorithm
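For reference, the five step mappings of one Keccakf1600 round can be written in the standard form of the Keccak specification, shown below. This is the textbook formulation, not necessarily the exact notation of the original equations: A and B denote 5*5 arrays of 64-bit lanes (these two names are assumptions), index arithmetic on x and y is modulo 5, rot is a circular left shift, r[x][y] is the shift constant, and RC is the round constant of the current round.

```latex
\begin{aligned}
\theta:\quad & C[x] = A[x][0] \oplus A[x][1] \oplus A[x][2] \oplus A[x][3] \oplus A[x][4], \\
             & D[x] = C[x-1] \oplus \mathrm{rot}(C[x+1],\, 1), \\
             & A[x][y] \leftarrow A[x][y] \oplus D[x], \\
\rho,\ \pi:\quad & B[y][2x+3y] = \mathrm{rot}(A[x][y],\, r[x][y]), \\
\chi:\quad & A[x][y] \leftarrow B[x][y] \oplus \bigl(\lnot B[x+1][y] \wedge B[x+2][y]\bigr), \\
\iota:\quad & A[0][0] \leftarrow A[0][0] \oplus RC.
\end{aligned}
```

The ρ and π steps are shown merged because both only move and rotate lanes, which is what makes their offsets and coordinates suitable for precomputation.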
In Algorithm 1, RC[k] is the round constant of round k of the 24 rounds, m and n are the shift coordinates of the 25 internal state blocks, and sof is the offset of the 25 state blocks; the offsets and shift coordinates are obtained by precomputation. The adjusted Keccak algorithm precomputes and merges steps of the original algorithm, which reduces the operational complexity of the Keccakf1600 function to some extent.
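The idea of keeping the state in a flat set of 25 lanes S[0]∼S[24] and driving the ρ and π steps from precomputed tables can be illustrated with the Python sketch below. It is not a reconstruction of Algorithm 1: the tables `ROT` and `DEST` only play a role analogous to the offsets and shift coordinates, the lane indexing i = x + 5*y is an assumption, and the `shake128` wrapper exists only so that the result can be cross-checked against Python's `hashlib`.

```python
import hashlib  # used only to cross-check against a reference SHAKE-128

MASK = (1 << 64) - 1

# Precomputed tables for the flat state, lane index i = x + 5*y (assumed layout).
ROT = [0] * 25    # rho rotation offset of lane i
DEST = [0] * 25   # pi destination index of lane i
x, y = 1, 0
for t in range(24):
    ROT[x + 5 * y] = ((t + 1) * (t + 2) // 2) % 64
    x, y = y, (2 * x + 3 * y) % 5
for x in range(5):
    for y in range(5):
        DEST[x + 5 * y] = y + 5 * ((2 * x + 3 * y) % 5)

# The 24 round constants, generated with the standard 8-bit LFSR.
RC = []
lfsr = 1
for _ in range(24):
    rc = 0
    for j in range(7):
        lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
        if lfsr & 2:
            rc ^= 1 << ((1 << j) - 1)
    RC.append(rc)


def rol(v, n):
    """Rotate a 64-bit lane left by n bits (0 <= n < 64)."""
    return ((v << n) | (v >> (64 - n))) & MASK


def keccak_f1600(state):
    """Apply 24 rounds to a list of 25 lanes and return the new list."""
    s = list(state)
    for rnd in range(24):
        # theta: column parities
        c = [s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20] for i in range(5)]
        d = [c[(i + 4) % 5] ^ rol(c[(i + 1) % 5], 1) for i in range(5)]
        s = [s[i] ^ d[i % 5] for i in range(25)]
        # rho and pi: one table-driven rotate-and-move per lane
        b = [0] * 25
        for i in range(25):
            b[DEST[i]] = rol(s[i], ROT[i])
        # chi: nonlinear step along each row
        s = [b[i] ^ (~b[(i % 5 + 1) % 5 + 5 * (i // 5)] & b[(i % 5 + 2) % 5 + 5 * (i // 5)])
             for i in range(25)]
        # iota: add the round constant to lane 0
        s[0] ^= RC[rnd]
    return s


def shake128(msg: bytes, out_len: int) -> bytes:
    """Minimal byte-aligned SHAKE-128 sponge built on keccak_f1600."""
    rate = 168
    padded = bytearray(msg) + bytearray(rate - len(msg) % rate)
    padded[len(msg)] ^= 0x1F
    padded[-1] ^= 0x80
    s = [0] * 25
    for off in range(0, len(padded), rate):
        for i in range(rate // 8):
            s[i] ^= int.from_bytes(padded[off + 8 * i: off + 8 * i + 8], "little")
        s = keccak_f1600(s)
    out = b""
    while True:
        out += b"".join(lane.to_bytes(8, "little") for lane in s[:rate // 8])
        if len(out) >= out_len:
            return out[:out_len]
        s = keccak_f1600(s)
```

A quick check is `shake128(b"abc", 32) == hashlib.shake_128(b"abc").digest(32)`. In hardware the two tables reduce to fixed wiring, which is why precomputing them costs no logic.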

HARDWARE IMPLEMENTATION
In this section, we propose a scheme for implementing the hash operations of Kyber and Dilithium on an FPGA. We divide the hardware structure into four parts: the control unit, the Keccak operation unit, the absorb unit, and the storage unit, as shown in Figure 2. The control unit configures the start signals and parameters of the absorb unit and the Keccak unit, as well as the read and write enables and addresses of the storage unit. The absorb unit pads and absorbs the input data. The Keccak unit is mainly used to

Keccak Unit
The Keccak unit is the most complex and resource-consuming part of the whole module. Previous work on Keccakf1600 generally uses an implementation with an unfolding factor of 2. However, when the hash module is to be used directly in post-quantum cryptographic algorithms such as Kyber and Dilithium, its working frequency and hardware overhead must be weighed against the other modules of the overall algorithm. To meet the performance requirement while reducing hardware resource consumption as much as possible, we compared experimental results under different parameter settings and chose to implement the Keccak unit with an unfolding factor of 1 and two sets of pipeline registers. The structure is shown in Figure 3.
To improve the working frequency of the module, we insert one set of pipeline registers between two intermediate steps of the Keccak unit and one set at the output side to store the intermediate data. As a result, every critical path in the Keccak unit passes through at most 6 logic gates, which shortens the critical path and raises the system frequency. However, multiple register sets would bring a large additional hardware overhead. We therefore reuse a single set of 25*64bit registers to serve as both sets of pipeline registers across the 24 rounds of operations. The pipeline structure is shown in Figure 4.
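The register reuse can be illustrated with a small cycle-level behavioural model in Python. It is a sketch under assumptions, not the circuit: the split of a round into a first half (θ, ρ, π) and a second half (χ, ι) is chosen here only for illustration, the lane indexing i = x + 5*y is assumed, and all function names are hypothetical. One register bank `reg` is rewritten every clock cycle, alternately holding the intermediate state and the round output, so 24 rounds take 48 cycles.

```python
import hashlib  # used only to cross-check the model against a known digest

MASK = (1 << 64) - 1


def rol(v, n):
    """Rotate a 64-bit lane left by n bits (0 <= n < 64)."""
    return ((v << n) | (v >> (64 - n))) & MASK


def round_constants():
    """Generate the 24 round constants with the standard 8-bit LFSR."""
    rcs, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                rc ^= 1 << ((1 << j) - 1)
        rcs.append(rc)
    return rcs


RC = round_constants()


def stage_a(s):
    """First half-round (assumed split): theta, rho, pi."""
    c = [s[x] ^ s[x + 5] ^ s[x + 10] ^ s[x + 15] ^ s[x + 20] for x in range(5)]
    d = [c[(x + 4) % 5] ^ rol(c[(x + 1) % 5], 1) for x in range(5)]
    b = [0] * 25
    b[0] = s[0] ^ d[0]  # lane (0, 0) is neither rotated nor moved
    x, y = 1, 0
    for t in range(24):
        nx, ny = y, (2 * x + 3 * y) % 5
        b[nx + 5 * ny] = rol(s[x + 5 * y] ^ d[x], ((t + 1) * (t + 2) // 2) % 64)
        x, y = nx, ny
    return b


def stage_b(b, rc):
    """Second half-round (assumed split): chi, then iota."""
    s = [b[i] ^ (~b[(i % 5 + 1) % 5 + 5 * (i // 5)] & b[(i % 5 + 2) % 5 + 5 * (i // 5)])
         for i in range(25)]
    s[0] ^= rc
    return s


def keccak_f1600_shared_register(state):
    """Run 24 rounds with a single 25-lane register bank; return (state, cycles)."""
    reg, cycles = list(state), 0
    for rnd in range(24):
        reg = stage_a(reg)            # bank now holds the intermediate state
        cycles += 1
        reg = stage_b(reg, RC[rnd])   # bank now holds the round output
        cycles += 1
    return reg, cycles


def sha3_256_empty_via_model():
    """Hash the empty message with SHA3-256 using the model (single block)."""
    state = [0] * 25
    state[0] ^= 0x06          # SHA3 suffix and first padding bit at byte 0
    state[16] ^= 0x80 << 56   # last padding bit at byte 135 (rate = 136 bytes)
    out, cycles = keccak_f1600_shared_register(state)
    digest = b"".join(lane.to_bytes(8, "little") for lane in out[:4])
    return digest, cycles
```

Calling `sha3_256_empty_via_model()` returns the digest together with a cycle count of 48, and the digest can be compared with `hashlib.sha3_256(b"").digest()`. A design with two physical register banks would compute the same values; the single bank works because the two stages never need their registers in the same cycle for the same data set.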
The 25 registers operate in parallel, so the final 25*64bit output becomes available all at once after 48 clock cycles. When n sets of data require Keccakf1600 operations, only n+47 cycles are needed, giving a time complexity of O(n) and reducing the computation time. In addition, we precompute the offsets and rotations of steps ρ and π, and simplify the 24 round constants with a method similar to that proposed in [6], which further reduces the consumption of hardware resources.
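The cycle counts above translate into execution time through time = cycles/frequency. The short Python sketch below works through that arithmetic with the figures reported in this paper (48-cycle latency, 391 MHz); the function names are hypothetical.

```python
def keccak_cycles(n_sets: int, latency: int = 48) -> int:
    """Cycles to process n_sets data sets: n + 47 for a 48-cycle latency."""
    return n_sets + latency - 1


def time_us(cycles: int, freq_mhz: float) -> float:
    """Execution time in microseconds: cycles divided by frequency in MHz."""
    return cycles / freq_mhz


# One Keccakf1600 operation at the reported clock frequency of 391 MHz.
single = time_us(keccak_cycles(1), 391.0)
print(f"{single:.3f} us")
```

With one data set this gives 48 cycles and about 0.123 μs, matching the reported Keccak time; ten data sets would need 57 cycles under the same formula.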

Control Unit And Other Units
To further limit the hardware overhead of the pipeline registers in the Keccak unit, the absorb unit uses the same set of 25*64bit registers as the Keccak unit to store the input messages. The 25 registers hold the input data for message padding, and the control unit uses a multiplexer to select, according to the hash mode, the additional 0 and 1 padding bits of the required length and the positions where they are inserted. All the hash operations in the Kyber and Dilithium algorithms share the same Keccakf1600 function. Previous work usually designs the two algorithms separately, or even separates the different hash modes within one algorithm, which often leads to excessive consumption of hardware resources. We therefore reuse the Keccak unit to implement all the hash operations in key generation, encryption and decryption, and signing and signature verification in both algorithms. The control unit configures the final value of the counter according to the input parameters and outputs the valid_i and a_start signals to start the Keccak unit and the absorb unit, respectively. The control unit also contains the read-write logic that interacts with the external storage unit. Each round of Keccak operation runs in parallel with the output of the data from the previous round, which reduces the number of clock cycles needed for the input, run, and output process of the whole hash operation. We use a single-port RAM as the storage unit and use simple shift circuits and counters in the control unit to update the RAM read and write addresses and to support inputs and outputs of arbitrary length.
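How a mode parameter could drive the control unit's counters can be sketched in Python as below. This is an illustrative model only: the `HashMode` record, the `MODES` table, and `permutation_count` are hypothetical names, the suffix values are the FIPS 202 ones for byte-aligned input, and the actual counter encoding of the circuit is not described in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashMode:
    """Per-mode parameters that a control unit would select (assumed fields)."""
    name: str
    rate_bytes: int   # r / 8
    suffix: int       # FIPS 202 domain-separation suffix, byte-aligned input


MODES = {
    "shake128": HashMode("shake128", 168, 0x1F),
    "shake256": HashMode("shake256", 136, 0x1F),
    "sha3-512": HashMode("sha3-512", 72, 0x06),
}


def permutation_count(mode: HashMode, in_len: int, out_len: int) -> int:
    """Number of Keccakf1600 calls for in_len input bytes and out_len output bytes."""
    # Padding always adds at least one byte, so the input fills this many blocks.
    absorb_blocks = in_len // mode.rate_bytes + 1
    # The first output block is available after the last absorb permutation.
    squeeze_blocks = -(-out_len // mode.rate_bytes)  # ceiling division
    return absorb_blocks + max(squeeze_blocks - 1, 0)


# Illustrative case: a 34-byte input expanded to 504 bytes with SHAKE-128.
print(permutation_count(MODES["shake128"], 34, 504))
```

In this illustrative case the input fits in one block and the 504 output bytes span three blocks, so three Keccakf1600 calls are needed. A counter final value derived this way lets one Keccak unit serve every mode by changing parameters only.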
FPGA testing confirms that our module implements all the hash operations in the Dilithium and Kyber algorithms.

EXPERIMENTAL RESULTS
Our scheme is implemented on a Kintex-7 FPGA, and the experimental results are obtained with Vivado 2019.1 after place and route. We survey a variety of implementations of the Dilithium and Kyber hash operations and compare their resource consumption and performance at security level 2. The final results are shown in Table 1.
In general, Dilithium's hashing operation requires more hardware resources than Kyber's. Therefore, we primarily use the area and performance metrics of Dilithium's hashing part as our reference standard. Additionally, since the number of cycles varies between hashing operation modes, we take as reference the number of cycles for 24 rounds of the Keccak core and the number of cycles the hashing operation needs to generate a single matrix-vector element. Compared with implementations of individual algorithms, our design not only combines the hashing operations of Kyber and Dilithium but also substantially reduces the consumption of hardware resources.
Compared to [10], our scheme uses about 40.2% fewer LUTs and 14.1% fewer flip-flops, and the maximum clock frequency is 391 MHz. Using time = cycles/frequency, our design increases the execution time by about 17% and loses about 14.5% performance when generating a single matrix element by hashing, but improves the area × time efficiency by about 23.2%, achieving an effective balance of area and performance. In addition, the high frequency of the hash operation part reduces the performance impact on other critical circuit units of the algorithm, such as the NTT.

CONCLUSION
This work presents the implementation of a low-cost, high-performance hash computing circuit structure on the Kintex-7 FPGA. The design uses 7376 LUTs, 3059 FFs, and 4 DSPs, achieving a maximum clock frequency of 391 MHz. Notably, our design consumes the fewest hardware resources among existing circuits that implement the Dilithium hash operation. This attribute is particularly advantageous when hardware resources are limited or when area is prioritized over performance. Furthermore, the configurable structure of this design satisfies the hash function requirements of at least two PQC algorithms, improves the reusability of hardware circuits, and opens prospects for the integration and testing of multiple post-quantum cryptographic algorithms.

Figure 1: Sponge structure

In the step mappings of the Keccakf1600 function, rot is the circular shift, RC is the round constant of the 24 rounds, and r[x][y] is the shift constant. To suit the hardware design and reduce the consumption of hardware resources, we adjust the implementation of the Keccakf1600 function so that the internal state is stored in only one set of 25*64bit registers S[0]∼S[24]; the principle is shown in Algorithm 1.

Figure 2: Hardware structure of our scheme

Table 1: Comparison of our implementation with other designs in resources and max-frequency