LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics

This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors, delivering unprecedented low latency performance. Incorporating FPGA-based GNNs into particle detectors presents a unique challenge since it requires sub-microsecond latency to deploy the networks for online event selection with a data rate of hundreds of terabytes per second in the Level-1 triggers at the CERN Large Hadron Collider experiments. This article proposes a novel outer-product based matrix multiplication approach, which is enhanced by exploiting the structured adjacency matrix and a column-major data layout. In addition, we propose a custom code transformation for the matrix multiplication operations, which leverages the structured sparsity patterns and binary features of adjacency matrices to reduce latency and improve hardware efficiency. Moreover, a fusion step is introduced to further reduce the end-to-end design latency by eliminating unnecessary boundaries. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented which not only finds a design with a much better latency but also finds a high accuracy design under given latency constraints. To facilitate this, a customizable template for this low latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 9.0 times faster and achieves up to 13.1 times higher power efficiency than a GPU implementation. Compared to the previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy. The proposed LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.

which avoids irregular memory access for intermediate results in GNNs.Moreover, we present a custom code transformation for MMMs of GNN feature transformation based on the character of interaction-network based GNNs that utilize a fully connected graph as input.This approach avoids the costly matrix multiplications between the adjacency matrix and the input feature matrix, thereby enhancing computation efficiency.
Furthermore, instead of running GNNs in a coarse-grained pipeline as shown in our prior work [34,35], we propose fusion of sub-layers as much as possible, transforming multiple coarsegrained pipeline stages into a single stage with a fine-grained pipeline inside.The fusion eliminates unnecessary handshake acknowledgments and dual buffers between coarse-grained pipeline stages, effectively reducing end-to-end latency.To further optimize latency, a Finite State Machine (FSM)based code structure is proposed to handle imperfect loops that may emerge after fusion.Finally, we present a GNN-specific algorithm-hardware co-design approach to optimize the algorithm and hardware simultaneously, exploring the optimal configuration to enhance overall performance.Our LL-GNN design successfully achieves sub-microsecond latency, which makes the algorithm compatible with the HL-LHC conditions where there is a hard ultra-low latency constraint, while also enabling improved accuracy.
To the best of our knowledge, this is the first FPGA-based design of GNNs with sub-microsecond latency for particle identification for the detectors at the CERN HL-LHC experiments.This work would help to improve next-generation trigger systems, enabling powerful algorithms to process massive LHC data effectively.
We make the following contributions in this paper: • A low latency layer-wise hardware architecture for GNNs with many novel design optimizations, including outer-product based matrix multiplication, column-major order, custom code transformation and the fusion of sub-layers as well as a GNN-specific algorithm and hardware co-optimization approach, resulting in sub-microsecond latency and high hardware efficiency.• A scalable, efficient open-source template 1 for GNN-based JEDI-net, which enables the generation of low-latency FPGA designs with efficient resource utilization leveraging highlevel synthesis (HLS) tools.• A comprehensive evaluation of the proposed method and hardware architecture.
Although the GNN coefficients are HEP specific, the proposed optimizations and techniques have broader implications for other GNN applications beyond particle identification.Most of the techniques and optimizations introduced in Section 3 and Section 4.1∼4.4 are general and can be beneficial to a range of GNN applications, not just limited to the JEDI-net.These optimizations include strategies for streamlining matrix operations, hardware pipelining, and ways to optimize FPGA resources for maximum performance, minimizing design latency and maximizing acceleration efficiency.Moreover, the techniques devised for low latency optimization can benefit numerous other applications on FPGAs, especially those that require real-time processing.
Relationship to Prior Publications.This paper expands on our previous work [34,35] which primarily focuses on hardware optimizations and targets low initiation interval, but faces challenges with high latency.This work addresses two limitations of the prior work.First, we focus on optimizing end-to-end latency while prior work concerns with initiation interval.Our new approach fuses together as many coarse-grained pipeline stages as possible into a single fine-grained pipeline stage, resulting in low latency designs.Second, we cover co-optimization of both algorithm and hardware, in contrast to the prior work that only focuses on hardware optimizations.This work explores the design performance trade-off under both user-defined algorithm and hardware constraints, such as achieving the highest accuracy with a latency requirement of less than 1s on a given FPGA.The optimized design provides a significant reduction in end-to-end latency, achieving speedups of 6.51 to 16.7 times along with enhanced model accuracy when compared to our previous studies [34,35].For the first time, our novel architecture supports GNNs of JEDI-net to operate in less than 1s on FPGAs.To facilitate the deployment of these optimizations, this paper provides an open-source template for generating low-latency FPGA implementations of GNN-based JEDI-net with efficient resource utilizations, which has not been covered by previous work.

Graph Neural Network and Interaction Network
GNNs can adapt their graph-based structure to an input graph with an iterative process of information aggregation across nodes to learn the complex dependencies of a system.In essence, GNNs create new graph embedding and leverage these for making predictions at edge-level, node-level, or graph-level.The interaction network [3] is a powerful graph based framework for reasoning about objects and relations in complex and dynamic systems.The network takes a graph as input, where nodes represent objects and edges represent interactions between them.Each node and edge has an associated set of features that describe the state of the object or interaction.It learns to capture complex interactions that can be used to predict future states and abstract physical properties.
An interaction-network based GNN processes the input graph in three main steps in each iteration.First, it applies an edge function to each edge in the graph, taking as input the features of the edge and the features of the nodes it connects.This edge function models the interaction between the objects, producing an updated edge feature (often called a "message").Second, these messages are consolidated by an aggregation function over the nodes interconnected to each edge, providing a summary of the incoming messages.The process of producing and consolidating messages is known as a message-passing operation.The aggregation function, here assumed to be a summation, transposes the edge-specific information to node-specific outputs by gathering information based on the connected node indices.Third, a node function is then applied to each node, taking as input the node's original features and the aggregated messages.This node function models how the object's state is affected by its interactions, producing an updated node feature.An additional GNN-head function gives the final graph-level prediction.An example full forward pass, comprised of edge and node blocks used to conduct a graph-level prediction, is shown in Fig. 1(a).

JEDI-net for Particle Identification
Early trigger methodologies [6] relied solely on jet energy to make a decision.Recently, machine learning based strategies are being used for their superior accuracy.However, due to the low latency requirement, only simple Multi-Layer Perceptron (MLP) networks [8,11] on FPGAs are introduced.High accuracy in the trigger is crucial to keep only the most interesting events while keeping the output bandwidth low [8].As a solution, the study in [29] presents JEDI-net, a Graph Neural Network (GNN) based algorithm, that delivers state-of-the-art accuracy for particle identification.This development has led to a surge in the demand for GNNs in trigger systems.JEDI-net [29] is a fully connected GNN based on the interaction network architecture, which identifies the particles produced at the LHC by analyzing jets.A jet is a narrow cone of hadrons and other particles produced by the hadronization of a particle.The input of JEDI-net can be represented as a graph,  = ⟨, ⟩ with the nodes,  , corresponding to physics particles, and the edges, , to the relations.It is a fully connected GNN.The input of nodes ( ) is defined as a  ×   matrix, whose columns represent the node's -length feature vectors, and   is the number of particles in a jet.
The relations are represented by a triplet,  = ⟨  ,   ,    ⟩, where   and   are   ×   binary matrices which index the receiver and sender nodes, respectively.Each column of   and   is a one-hot vector and it indicates the receiver node's index;   indicates the sender similarly.The number of edges,   , is   × (  − 1) since the input graph is fully connected with directional edges.
Fig. 1(b) shows an overview of JEDI-net.In addition, Fig. 2 shows the   and   of this fully connected GNN for an example.To illustrate the idea, the number of particles (nodes) is 4 in this example.But please note a real case will have more particles.The input  matrix is multiplied by the   and   matrices and the results are then concatenated to form a  matrix, having dimension 2 ×   .Each column of the  matrix represents an edge, i.e. a particle-to-particle interaction.The 2 elements of each column are the features of the sending and receiving nodes for that edge.A trainable deep neural network (DNN) function   : R 2 → R   is then applied to each column of  and produces a matrix .Then Ē =    is computed in MMM3 (see Fig. 1(b)) to gather the cumulative effects of interactions received by a given node.Thus, the cumulative effects of the interactions at a given node are obtained by summing the   hidden features over the incoming edges, which is implemented by computing Ē =    in the MMM3 unit.The Ē and  are then concatenated to form the  matrix, working as a shortcut connection.Each column of the  matrix represents a constituent in the jet, which is expressed as a ( +   )-dimensional feature vector, containing  input features and   hidden features.It represents the combined effect of all the interactions between particles.Another trainable function   is presented to build a post-interaction representation of each jet constituent.It is performed on each column of  to produce the  matrix,

DESIGN AND OPTIMIZATION
This section introduces the hardware architecture and multiple optimizations to accelerate the interaction network based GNNs.We adopt a 3-step optimization strategy, named "divide, conquer and fusion", to achieve low latency for GNN inferences.During the "divide" and "conquer" steps, we split GNNs into multiple sub-layers and perform dedicated optimizations, such as outer-product based MMM (Section 3.1) approach and leveraging the structured sparsity of adjacency matrices (Section 3.3) for GNNs.Then, in the "fusion" step (Section 3.5), multiple sub-layers are fused together to eliminate boundaries and buffers between different pipeline stages, resulting in low latency and high hardware efficiency.

Outer-product Based Matrix Multiplication
Matrix multiplication is a fundamental operation in the calculation of GNNs.It typically involves the inner-product of rows from the first matrix and columns from the second matrix.For instance, to compute the GNN aggregate function Ē =    (MMM3), as shown in Fig. 1(a) and (b), one needs an entire row of the  matrix and a full column from the    matrix to perform the inner-product for each entry of Ē.However, as the  matrix is generated column by column, this introduces a long latency when using the inner-product based MMM since it must wait until an entire row of the  matrix is ready, leading to a considerable delay.To address this issue, this work proposes an outer-product based matrix multiplication to process the GNN aggregate function (MMM3).Instead of using a full row from  matrix, now a full column from the  matrix is multiplied by one element from    matrix to generate the partial result of the first column of the result matrix Ē as shown in Fig. 3(a).This partial result is then accumulated to form a complete column of the result matrix.Given that the  matrix is produced column by column, MMM3 can start as soon as the first column of the  is ready.To efficiently support the outer-product based MMM, the column-major order data layout (discussed in Section 3.2) is used for representing the intermediate results (i.e., 2D matrix arrays).Thus, input elements can be grouped as a vector (i.e., an entire column as a vector) and can be processed efficiently with high parallelism because the data can be fetched sequentially.It substantially minimizes the waiting time for the aggregate function in GNNs, thus reducing the overall design latency.

Results Generation
The calculation of the first vector of the resultant matrix.

X
Broadcasting The calculation of the second vector.Fig. 3. Outer-product based matrix multiplication with column-major order and structured sparsity.
In addition, the custom code transformation (discussed in Section 3.3), can be adapted to enhance the proposed outer-product based MMM, which exploits the sparsity patterns inherent in the    adjacency matrix and the binary features.It can avoid costly multiplications and involves only load and store operations with a small number of additions for this MMM unit.Detailed pseudocode of the enhanced outer-product based MMM is presented in Algorithm 1. Generally, in GNNs, the aggregation has a large amount of data access and relatively small amount of computation, resulting in low hardware efficiency [57].However, our approach takes advantage of the sparse structures and binary features in the adjacency matrix.The input   matrix is binary and each column is one-hot as explained in Section 2.2.Therefore, the multiplication operations are unnecessary since    is binary.Besides, only 1   of the total number of additions are required.Moreover, due to the structured pattern of the adjacency matrix, the value and pattern of adjacency matrices can be incorporated into the loop index to avoid irregular memory access.
One limitation of the outer-product based MMM is that it produces a full size resultant matrix with only partial results until the end [36].This process requires substantial memory write bandwidth.The AWB-GCN [17] presents a column-wise product architecture, a variant of outer-product MMM, which involves one column from the first matrix and only one element from the second matrix, such that the output only affects one resultant vector.In our design, each row is one-hot, which means that only one vector in the resultant matrix is valid at a time.Moreover, we further exploit the structured pattern of   so that the partial results only update the corresponding vectors in the resultant matrix sequentially, preventing irregular memory writes and thereby reducing latency.Furthermore, due to the structured pattern in   , the proposed approach only requires the  matrix to be read once, which also reduces the bandwidth of memory access.Fig. 3(b) shows how the design reads the vectors from column (  − 1) to 2 * (  − 1) in the  matrix and generates the second vector in the resultant matrix.The latency could be further reduced by involving multiple columns of input matrix.However, this requires multiple preceding hardware units to generate the equivalent number of columns during each cycle, thereby increasing hardware resource usage.Our proposed methods not only eliminate the costly matrix multiplication operations and reduce iterations, but also improve memory access efficiency by removing the necessity for adjacency matrix inputs.These techniques thus decrease design latency, and boost throughput as well as hardware efficiency.

Column-major Order
The intermediate results in the layer-wise GNN hardware architecture are captured using twodimensional (2D) arrays representing a matrix as shown in Fig. 1(b).Row-major and column-major orders (Fig. 4) are two data layout methods, which are critical for correctly passing arrays between hardware units.More importantly, choosing the right data layout method is also critical for hardware performance, which is the focus of this work.The difference between the two orders lies in which elements of an array are contiguous in memory.An appropriate data layout will have a significant impact on hardware performance.When mapping 2D data onto a one-dimensional (1D) structure (i.e.memory) using a high level synthesis tool (e.g., Xilinx Vivado / Vitis HLS), often the default data layout is row-major order, which follows a C language legacy.However, row-major order for GNN-based JEDI-net will lead to poor spatial locality and hinder parallelism, since the functions   and   are applied to each column of the input matrix, as shown in Fig. 1(b).The concatenation of two input matrices into one output matrix also works on columns, such as concatenating 1 and 2 to the  matrix.With a row-major order data layout, the input data of these functions do not sit in memory contiguously so it is very time-consuming to fetch all the elements in a column.However, if the data are represented using a column-major order, in which the consecutive elements of a column reside next to each other, iterating over columns becomes easy because the data are accessed sequentially.Thus, this work proposes the column-major order to increase the data spatial locality for accelerating layer-wise GNNs efficiently, leading to good hardware performance.

Custom MMMs for GNN Feature Transformation
The GNN design detailed in [29] relies on dense MMM operations to calculate the MMMs between the input feature matrix and adjacency matrices, specifically, 1 =   and 2 =   .However, this approach proves to be resource-intensive and time-consuming.Interestingly, we observe that a significant proportion of the computational operations in these MMMs, particularly in fullyconnected GNNs based on interaction networks, are redundant by leveraging the sparse structures and binary features within the adjacency matrices.In the case of interaction network-based GNNs with fully-connected graphs, every column within the receiving and sending matrices is one-hot.In addition, these matrices not only feature a binary structure but also have static patterns, as shown in Fig. 2. The element (  )   is set to 1 when the  ℎ node receives the  ℎ edge, and is 0 otherwise.Similarly, the element (  )   is set to 1 when the  ℎ node sends the  ℎ edge and is 0 otherwise.

Algorithm 2:
The pseudocode of the custom MMMs.
1 Function MMM_(  , 1, 2): 2 // The layout of matrices is column-wise Due to these unique characteristics, many computational operations are unnecessary.First, the multiplication operations are unnecessary because the   and   matrices only have binary values.Second, accumulation (addition) operations can be eliminated when doing the inner product of row vectors from the input feature  matrix and column vectors from   (or   ).This is because each column of   and   is one-hot, resulting in only one product being produced.As a result, the iteration factor is reduced from   × (  − 1) to   − 1.This simplifies the calculation of 1 =   and 2 =   to merely load and store operations.A detailed pseudocode of the custom code transformation for MMM1/2 is provided in Algorithm 2.
One challenges when dealing with GNNs is handling the irregular memory access of sparse adjacency matrices.Our approach can avoid such memory access by building the known patterns of sparsity directly into the loop index.In addition, the access of the input feature matrix for 1 and 2 is sequential in our designs since we adopt a column-major order data format as discussed in Section 3.2.The proposed approach not only eliminates the costly MMM operations to increase the computational efficiency, but also avoids the irregular access of the adjacency matrices, which greatly reduces the design latency.
Although the patterns presented here are specific to fully-connected GNNs based on interaction networks, the idea of exploiting the adjacency matrix pattern is highly adaptable.We believe that the method of code transformation we've employed could be adjusted to improve other GNN networks that have a useful pattern of sparsity, leading to low latency designs.

A Dataflow Architecture with Task-level Parallelism
To improve performance and reduce latency, our previous designs [34,35] partition the entire network into multiple sub-layers and utilize a dataflow architecture with task-level parallelism [5,55].All the sub-layers are mapped on-chip, with each one as a task within a coarse-grained pipeline, allowing simultaneous execution.These tasks run as soon as their inputs become valid, thereby enabling efficient data flow between them.Ping-pong buffers are used as the channels between various tasks.Signals indicating when the buffers are full or empty are used for handshaking.Fig. 5 shows the sub-layers (tasks) within in the GNN network, with blocks representing task operations and arrows representing data flow between tasks.The partitioning rule is based on the primary operations within the GNN network, as presented in Fig. 1.In addition, we implement dedicated optimizations, categorizing them into two classes: intra-task and inter-task optimizations.Given that each task operates independently, dedicated intra-task optimizations can be applied to each task, such as exploiting outer-product based MMM (discussed in Section 3.1) and structured sparsity of adjacency matrices (discussed in Section 3.3).Moreover, inter-task optimizations are also introduced to refine multiple tasks and their interactions, such as column-major order representation (discussed in Section 3.2) to increase the spatial locality of intermediate results transferred between tasks, and initiation interval balancing [37] for optimizing the overall initiation interval across multiple tasks, and a fusion technique (discussed in Section 3.5) which fuses multiple tasks to remove extra boundaries, reducing end to end latency.The proposed GNN layered structure and coefficients are CERN LHC specific, while the dataflow architecture with task-level parallelism and the optimizations, especially the inter-task ones, can be adapted to other GNN networks or even to other types of neural networks.
The initiation interval is decided by the largest initiation interval among all the units on the datapath.It is usually not necessary to unroll every unit in order to achieve the smallest initiation interval.Some hardware resources can be saved from the units that do not require full unrolling, and then these hardware resources can be reassigned to the bottleneck unit that dominates the whole design to reduce the initiation interval and latency [37] by an appropriate partitioning of hardware resources.While partitioning FPGA resources to enhance latency and initiation interval in a layer-wise architecture has been studied for CNNs [19,39,54,55] and RNNs [37], there is little work focusing on GNNs.This work addresses this gap by balancing the initiation intervals of the sub-layer units in GNNs by appropriate FPGA resources partitioning.

Divide, Conquer and Fuse
In the previous subsections, we "divide" the GNN model into multiple sub-layers and "conquer" them with tailed-made optimizations.Our previous designs [34,35] deploy these sub-layers in a coarse-grained pipeline, achieving a low initiation interval.Nevertheless, the end-to-end latency remains higher than the 1s which is required for the CERN HL-LHC.To address this, we introduce a succeeding step called "fuse" which combines multiple optimized sub-layers into a single unit and runs it in a fine-grained pipeline to reduce overall design latency.
Within GNNs, operations are typically performed repeatedly on graph nodes or edges.This implies that many processing units (or loops) share a common loop bound, that is, the number of nodes (or edges) in a graph, which makes many sequential units (loops) mergeable.Particularly, in our GNN designs, the matrix multiplication units are simplified after exploiting structure sparsity.Running them as individual stages within a coarse-grained pipeline not only leads to lots of wasted hardware resources but also increased latency due to the extra acknowledgments, including pingpong buffers that exist between stages in the pipeline.As multiple sequential loops share the same loop bound, these loops can be merged into one to reduce the latency.
For example, the MMM1/2, Concat1 and DNN1 units all have a loop bound of   (number of edges), as they perform calculations on the edges of the graph.They can be consolidated into one large loop, which can then run in a fine-grained pipeline to reduce overall latency, as merging loops allows the logic within the loops to be optimized together.The Concat2 and DNN2 units, on the other hand, have a loop bound of   (number of nodes), as they perform calculations on the nodes of the graph.An interesting aspect to consider is the aggregate unit (MMM3) which accumulates the effects of interactions from edges to nodes.This work proposes to divide this unit into two parts.The first part, covering lines 3 to 10 in Algorithm 1, has a loop bound of   and can be fused into a new unit called Edge unit, while the second part (lines 11 to 13 in Algorithm 1) can be fused into a Node unit stage.The entire dataflow is converted into Fig.6(a) from its previous design (Fig. 5), applying unique parallelism parameters for each task.Note that the fusion of compute units into a single loop is also data independent.
One drawback is that the initiation interval might slightly increase, due to the merging of multiple stages to form a large stage.Nevertheless, this is worthwhile as fewer stages and boundaries after fusion result in reduced latency and high hardware efficiency.

Handling Imperfect Loops
Fusion can potentially result in an imperfect loop, especially in the case of GNNs in which some units operate on edges using an edge-based loop bound while others operate on nodes with a node-based bound.Fig. 6(b) shows the final dataflow once the Edge unit and Node unit are fused into an Edge-Node unit.Algorithm 3 shows a typical example loop from lines 1 to 8 within an Edge-Node unit for GNNs of JEDI-net.The code_body_A (e.g., the Concat1 and DNN1 units) iterates with a bound of   , which is equal to   × (  − 1), while the code_body_B (e.g., the part of MMM3 from lines 11 to 13 in Algorithm 1 plus DNN2) only iterates   times.After the fusion, it becomes an imperfect loop where the code_body_B exists outside the inner loop.To pipeline code_body_B, one could set the #pragma of "PIPELINE" at line 2 in the outer loop, but it leads to automatically unrolling all loops in the hierarchy below.As a result, this would completely unroll the inner loop at line 3, resulting in (  − 1) hardware copies of code_body_A, significantly consuming hardware resources.If the required hardware resources exceed the given budget, one must limit the number of instances of the code_body_A.For example, if the total budget can support only   2 instances of code_body_A, this loop can only be unrolled by a factor of   2 .However, using a #pragma of unrolling with a factor for the inner loop will not manage to reduce the number of copies since the "PIPELINE" at line 2 has a priority and will force the full unrolling of the inner loop.Thus, one has to move the #pragma of "PIPELINE" to line 4 to only pipeline the inner loop, excluding pipelining the code_body_B, which leads to a poor design and large latency.
To solve this issue, this work proposes another code transformation which transforms an imperfect loop into a perfect one using a finite-state machine (FSM) based structure with a target initiation interval (  ), as shown from line 9 to 34 in Algorithm 3. If the number of instances of the code_body_A that can be deployed under the given resource budget is   , then   equals (   −1   ).Please note that the increase in the initiation interval is not due to the FSM itself, but is a consequence of limited hardware resources.The loop can now run with an initiation interval Algorithm 3: Code transformation with an FSM-based structure to transfer an imperfect loop into a perfect loop. of one, but the equivalent initiation interval is the target initiation interval.With this code transformation, we can deploy as many instances as permitted by the hardware budget, thereby improving the design performance after fusion and reducing the overall design latency.An automatic process can be developed to perform the fusion alongside code transformation, effectively handling the imperfect loop and enhancing performance.

IMPLEMENTATION AND CO-DESIGN FRAMEWORK
Based on the new low latency hardware architecture incorporating fusion, this section presents a two-level parallelism scheme (Section 4.1) to improve the parallelism and reduce latency.Furthermore, a GNN-specific co-design approach is introduced (Section 4.4) to optimize the algorithm and hardware simultaneously.By combining these different optimizations, our designs achieve good performance and efficiency.

Two-level Parallelism
The trade-off between latency, throughput and FPGA resource usage is determined by the parallelization of the design.This work exploits a two-level parallelism scheme.First, we utilize the reuse factor [11] to fine tune the parallelism of computational units within GNNs.The reuse factor controls the amount of parallelism: from sequentially re-using a single multiplier multiple times to using the maximum number of multipliers concurrently.With a reuse factor of one, the computation becomes entirely parallel, and all multiplications are performed simultaneously using a maximal number of multipliers.When the reuse factor is , only 1   of the computation is executed at a time, using 1   fewer multipliers.Fig. 7 shows different reuse factors for a fully-connected layer involving four multiplications.The custom code transformation (discussed in Section 3.3) is performed to optimize the matrix multiplications to avoid multiplications.Hence, only the three MLPs (  ,   ,   ) require multipliers in the JEDI-net design.We apply the reuse factors    ,    and   to these three MLPs.This work always tries to achieve extremely low latency by using as many hardware resources as feasible, such as unrolling all the layers in the MLPs by adopting a reuse factor value of 1.
In addition, we deploy multiple instances of the   unit to further increase the design parallelism.The   is applied to each column of the  matrix, as described in Section 2.2, resulting in a significant number of iterations since there are   × (  − 1) columns in the  matrix.Completely unrolling all the iterations requires thousands of hardware copies of   , leading to considerable hardware resource consumption that could easily exceed a given FPGA.Hence, this work partially unrolls it with a factor,    , producing    hardware instances of the   unit, each processing a vector of the  matrix.The preceding hardware units of   should be also updated so that they can produce    vectors in each cycle.Moreover, the succeeding hardware unit, the aggregate unit (MMM3), is also designed to be able to consume multiple result vectors as discussed in Section 3.1.

Resource Model
To explore design space and identify the best trade-off between hardware resources and design performance, a resource model for GNN-based JEDI-net has been developed.Since the DSPs are the most critical resources for FPGA-based neural networks, our resource model primarily focuses on DSPs.The DSP resource model is detailed in Equation (1): Here,   denotes the number of DSPs utilized for a single fully-connected (FC) layer, with   and   representing input and output sizes, respectively.The JEDI-net has three MLPs, each consisting of multiple FC layers.  represents the labels of these three MLPs, namely   ,   and   , as shown in Fig. 1(b).For simplicity, this work utilizes a uniform bitwidth (Q12.12,24 total bits with 12 fractional bits) for most of the design datapath.However, as the weight values of the MLPs fall in the range [0,1), approximately 13 effective bits are used.Hence, the Vivado HLS tool trims the remaining bits from one of the inputs for the multipliers, allowing a single Xilinx DSP to fully implement one multiplier.Furthermore, we use Q16.16 for the accumulators to maintain accuracy.With our proposed approach, there are no multipliers in MMM1/2/3 units.As a result, only the MLP units need multipliers which are implemented using DSPs.The total number of DSPs used in JEDI-net, as shown in Equation ( 1), should be less than the total number of DSPs available on the targeted FPGA.

Latency Model
We have managed to fuse most of the sub-layers, which leads to a lower latency architecture compared to the one proposed in our previous work [34].The latency model of JEDI-net based on this latency optimized hardware architecture is detailed in Equation (2).
=    ×  ( ( ),    ,   ) Here,   represents the initiation interval of the multiplier, which is one cycle in this work.  denotes the initiation interval of the fused loop, which depends on the maximum number of   units that can be deployed on a given FPGA with limited hardware resources, and reuse factors for    and   .Please note that    is always set to 1, as the execution of   is the bottleneck in the design.Meanwhile,   refers to the pipeline depth of the model while   is the depth of the logic outside the primary fused loop.For simplicity, both of them are constants based on the design architecture.
Equation ( 2) for the proposed hardware architecture is based on Fig. 6(b), incorporating all optimizations introduced in this work.Due to limited hardware resources, the top loop is not unrolled as shown in line 10 of Algorithm 3.Only the #pragma of "PIPELINE" is used.As a result, the Edge-Node unit contains a single node-processing engine, capable of handling one node at a time.The engine, corresponding to the loop body, is designed to be pipelined, allowing for the processing of a subsequent node every   (initiation interval) cycles.The proposed LL-GNN architecture aims to minimize the loop initiation interval,   , ideally bringing it down to 1 within the hardware resource budget using the proposed techniques.With increased hardware resources, e.g. a larger FPGA, our architecture can accommodate more engines by fully or partially unrolling the top loop in the Edge-Node unit, increasing the parallelism and further reducing the latency.Furthermore, With a fully unrolled top loop, the latency becomes independent of the number of nodes, given sufficient hardware resources.

Co-design of Algorithm and Hardware of GNNs for Low Latency
The study in [29] focuses on optimizing algorithm for JEDI-net GNNs.However, such strategies often encounter deployment issues where neural network models, optimized for high accuracy, cannot be deployed on a given FPGA due to the large model size or the limited hardware resources.As another line of research, our previous studies [34,35] focus on hardware optimizations.However, without algorithm optimization, it is difficult for these designs to achieve an end-to-end latency of less than 1s.Thus, existing GNN designs for particle identification either focus on the algorithm or the hardware, leading to sub-optimal designs when deploying GNNs on FPGAs.
To address this issue, our work proposes a co-design approach tailored for GNNs, allowing simultaneous optimization of both the algorithm and hardware.This strategy enables us to explore trade-offs within user-defined constraints, such as maximizing accuracy with a latency requirement of less than 1s on a given FPGA.An overview of our proposed co-design approach is shown in Fig. 8.The co-design approach consists of the following key steps: • Define Acceptable Latency Threshold: Given that this work focuses on achieving low latency on a resource-constrained hardware device with a strict latency requirement (  ), we further refine our design search space and exploration flow.We introduce a unique parameter, , to fine-tune an acceptable latency threshold.Models whose latencies () exceed this threshold (defined as  ×   ) are excluded from the training process.This early termination strategy avoids the unnecessary training of models that would be unfit for deployment due to latency constraints.By carefully choosing , we provide a balance between training time and exploration of the solution space.• Rebalance MLP Sizes: Recognizing that edge-related operations usually require more iterations than those involving nodes due to the larger number of edges in a graph, we rebalance the sizes of different MLPs when exploring network design space.For example, we assign a smaller value to the hidden layer size (   ) in the edge-related function   , while maintaining or increasing the size (   ) of the other two MLPs.It is because the   is required to run   × (  − 1) times, which is significantly larger than the   that runs only   times, and the   that runs only once per inference.This re-balancing reduces latency while maintaining model accuracy.• Estimate Latency of Design Candidates: Using equations ( 1) and ( 2), we estimate the latency of design candidates with various GNN configurations.The  parameter could be set larger than 1 to loosen the primary constraint and avoid missing potential solutions. could also be set to a very large value to conduct a full coverage.• Generate Optimal Design: Once we find the optimal design based on the user-defined metrics, the model hyperparameters, as well as weights and bias, are generated.A lowlatency FPGA design is then produced using our open-source HLS-based templates, thereby enhancing design productivity.
Although [29] has conducted a wide algorithmic search considering various model hyperparameters, it assigns uniform size to all three MLPs, which is not latency friendly for GNNs since edge-related operations usually require more iterations than node-related ones as discussed above.We mitigate this by rebalancing the sizes of different MLPs, thereby striking a balance between algorithm and hardware performance.In design space exploration (DSE), we also use grid search, like [29], to navigate both algorithmic and hardware design parameters, constrained by latency considerations.We recognize that more advanced DSE techniques, such as those leveraging reinforcement learning, might offer better efficiency over grid search.We leave this for our future work since it has limited impact on the conclusions in this paper.
While our primary objective in this work is to minimize latency on a resource-constrained device, the optimization mode of our co-design approach can easily be adapted to serve other user-defined metrics and constraints.Furthermore, we are investigating how this co-design approach can be automated by techniques such as meta-programming [33,45].

System Overview
The Compact Muon Solenoid (CMS) serves as one of the particle detectors at CERN LHC.The CMS Experiment, which is related to the triggering and collisions at this particle detector, utilizes two levels of real-time triggering.The data acquisition system, including the L1T, is shown in Fig. 9 (a).The L1T is responsible for making fast, real-time decisions about which events from the LHC should be kept for further examination.It works at an input rate of 40 MHz, matching the rate of proton-proton collisions at the LHC, resulting in a data rate of hundreds of terabytes per second.Given the limitations in readout bandwidth and storage capacity, it is not feasible to store all the data produced.
The aim of the L1T, therefore, is to decrease the data rate by an average of 400 times, from 40 MHz to approximately 100 KHz, a rate that can be efficiently transferred to the data storage for later analysis in detail.In order to provide the required scientific performance, the HL-LHC L1T system relies on the delivery of highest precision inputs representing the largest bandwidth ever handled at Level-1 (63 Terabits per second of information) [9].The significant enhancement of the L1T in the upgraded HL-LHC compared to the previous LHC is the implementation of the Particle Flow unit that aggregates the regionally processed information from various sub-detectors, including Calorimetry, Muon and Tracking, and facilitates sophisticated algorithms, such as GNN-based JEDI-net, to make a more accurate decision for L1T.If an accept decision is made, the data are then read out for further processing; otherwise, the data will be dropped.
The L1T has a total latency budget of 12.5s in the HL-LHC [9].The 12.5s is a fixed latency (there is no batching), with a hard limit (we lose the data if exceeded).It consists of numerous FPGA-based subsystems with around 730 FPGAs.They are designed without a CPU or PCIe in the data path to meet ultra-low latency requirements.Furthermore, no external memory is included in the data path.One of the candidate algorithms for the HL-LHC is the GNN-based particle identification algorithm.For a single algorithm the constraint of latency is 1s.The GNN designs accept AXI bus based streamed data which arrive via multiple parallel optical fibres, each running at 25 Gbps, as shown in Fig. 9   packets to and from the FPGA, ensuring that data flow is maintained at the speeds required for real-time processing in HL-LHC.Besides, the data links use a custom protocol with no forward error checking, hence they are able to run with a latency of ∼100 ns.Since the data links run with such low latency, there are no concerns about the latency of incoming data.Moreover, this link latency can often be effectively overlapped with the data processing, for example, in this work the initiation intervals of the proposed designs are between 150ns to 750ns.Therefore, the link latency is excluded in the evaluations of algorithms.This is compatible with the latency calculation for algorithms and applications in CERN HL-LHC experiments and aligning with other designs [8,11] of particle identification for the L1 trigger system.

EVALUATION AND ANALYSIS
This section presents the evaluation results of the GNN-based JEDI-net on FPGAs demonstrating the scalability of the proposed optimization for GNNs.

Experimental Setup
This study covers JEDI-net-30p [29] and JEDI-net-50p GNN models, targeting datasets of 30 particles [10] and 50 particles [12], respectively.To study the performance and limitations of the proposed optimizations and hardware architecture, the designs are implemented using Xilinx Vivado HLS 2019.2 on a Xilinx Alveo U250 board with an UltraScale+ XCU250 FPGA for the evaluation and comparison with other implementations.It offers 1.7M LUTs, 12.3k DSP slices and two 100Gbps network interfaces.It runs at 200MHz so each cycle is 5ns.The hardware datapath employs a Q12.12 fixed-point representation with 1 sign bit, 11 integer bits and 12 fractional bits.In contrast, the accumulator uses Q16.16 to keep accuracy.It achieves the same accuracy as the floating-point model.

Balancing Quantization and Accuracy
To find an optimal fixed-point representation that can achieve no reduction in the physics performance of the algorithm compared to the conventional floating-point representation, we analyze the fixed-point precision across a range from 16 to 26 bits in total bit widths and 6 to 13 integer bits (including the sign bit), as shown in Fig. 10(a).With 24 bits in total and 12 integer bits, the fixed-point model effectively achieves the same accuracy as the FP32 floating-point counterpart.In addition, JEDI-net achieves much higher accuracy than the previous work based on MLPs [8,11] which only has an accuracy of around 75%.
To further understand the performance of our optimized model, we also evaluate the Receiver Operating Characteristic (ROC) curves with the area under the curve (AUC) for the 5 jet classifiers, including gluon, light quarks, W boson, Z boson, and top quark.These results are shown in Fig. 10(b), with higher AUC value corresponding to superior classification performance.Although the AUC of the light quarks tagger (blue lines) using 24-bit fixed-point data representation seems different from the floating-point one, it's important to note the logarithmic scale on the x-axis of Fig. 10(b) and the AUC loss of the q tagger is less than 1%, thus demonstrating that the fixed-point representation has a negligible impact on the overall performance.In summary, we successfully balance the trade-off between quantization and accuracy, thereby maximizing the efficiency and performance of our optimized model.

Resource Utilization
Table 1 shows the resource utilization of our designs on the U250 FPGA with different configurations.Arrows in the table symbolize desired trends.In JEDI-net-30p models, the number of input particle   is set at 30.Each particle has a feature size  of 16 as defined by the dataset.For the JEDI-net-50p model,   is increased to 50 with the same feature size of .As   increases, the number of edges   also increases significantly, which is equal to   * (  − 1).This results in 870 edges for   = 30 and dramatically increases to 2450 for   = 50.Both models have various sizes of the hidden layers within   ,   and   MLPs.Table 1 shows that our newly introduced design with fusion is more resource-efficient than the previous design for the same JEDI-net 30p model.This can be seen clearly when comparing the designs labeled as J2 and J3 in Table 1, both of which target the same JEDI-net model.Further details about J2 and J3 are provided in Table 2. Our new design, based on the proposed low latency hardware architecture with fusion, has reduced the usage of DSPs, BRAM and LUTs.This not only results in lower resource consumption but also paves the way for a higher performance design.

Performance Analysis and Discussion
To show the advantages of our proposed optimizations, Table 2 presents a comparative analysis of various intermediate and final designs derived after applying the proposed optimizations one by one.The designs J1∼J5 and U1∼U5 correspond to the JEDI-net-30p and JEDI-net-50p networks, respectively, with J1 and U1 serving as the baselines.

5.4.1
Creating Baseline Designs.Designs J1 and U1, targeting the JEDI-net-30p and JEDInet-50p respectively, have been developed leveraging our proposed outer-product based MMM, custom code transformation of MMMs, and column-major order data representation.These designs are illustrated in Table 2 and serve as our baseline designs.To minimize latency and improve throughput, we further fully unroll the loops within each network layer in the   ,   , and   units of designs J1 and U1.However, the design latency still ranges between 12.56s and 32.60s, which significantly exceeds the latency constraint of less than 1s.Therefore, further optimizations must be explored to meet this hard constraint.

Parallelization and Resource Balancing.
To further reduce latency and initiation interval, we increase the design parallelism by deploying multiple DNN1 (  ) hardware units.This is particularly necessary since the   , which serves as the GNN edge function that is applied to each edge in the graph, needs to iterate   * (  − 1) times.As a result, as the number of deployed DNN1 units,    , increases from 1 to 13, the latency of design J2 drops from 12.56s (J2) to 1.91s, resulting in a 6.5 times reduction in latency.The latency reduction is achieved by allocating more hardware resources to improve performance (low latency).However naively increasing the number of DNN1 units does not guarantee an optimal design.It is illustrated in designs U1 and U2 for the large JEDI-net-50p model.Here, latency decreases from 32.60s to 12.47s as    increases from 1 to 3, shown in Table 2.However, the required number of DSPs has exceeded the total DSPs on this FPGA.To solve this issue, we strategically re-allocate some DSP blocks from DNN2 (  ) and DNN3 (  ) to DNN1 (  ) by increasing the reuse factors of DNN2 and DNN3, as well as deploying multiple DNN1 units to balance the overall design initiation interval as described in Section 3.4.This strategy leads to a reduced initiation interval and latency.A subsequent design space exploration  (4,4).As a result, we arrive at design U3 which has both better initiation interval and latency compared to the design U2.

5.4.3
Fusion.The latency can be further reduced by the newly proposed sub-layer fusion.The fusion approach eliminates the boundaries between the hardware units, thus enabling a finegrained pipeline instead of a coarse grained pipeline.A code transformation with an FSM structure is introduced to mitigate the potential pipeline issue arising from imperfect loops.For instance, design J3 targets the same neural network model as J2, but with the fusion, the latency is reduced from 1.91s to 0.62s.This is the first time that the JEDI-net-30p model operates within 1s, showing the advantages of the fusion technique in achieving low latency designs.Setting    to a number between 10 to 14 will lead to the same initiation interval and latency using the new low latency hardware architecture because the partition is performed on the inner loop which has a bound of 29 for JEDI-net-30p.Therefore, we select 10, which demands the least hardware resources.As compared to our previous work J2 [34], the new design J3 is 3.1 times faster as shown in Fig. 11 while consuming 22% less DSP blocks.These improvements are accompanied by a minor drawback; there is a small increase in the initiation interval from 0.40s to 0.45s since the pipeline depth of the fused stage increases after fusion.However, the small increase in the initiation interval is considered tolerable, especially when considering the significant enhancements in latency and resource efficiency.

5.4.4
Algorithm and Hardware Co-optimization.We introduce a co-design approach to optimize the GNNs both on model accuracy and hardware performance.It is worth noting that this is the first instance where the larger JEDI-net-50p models can run in 1s without sacrificing model accuracy.In [29] and our previous work [34,35], all the MLPs share the same size.However, our analysis of JEDI-net reveals that the   is iterated   times more than   and   * (  − 1) times more than   .As a result, if   is smaller, more copies of   units can be deployed on a given FPGA, leading to a faster design.We consider JEDI-net-30p models with the number of layer (NL) of   searched in (1, 2, 3, 4) and layer size in (8,16,24,32).The latency parameter  is set to 2 for JEDI-net-30p.As JEDI-net-50p is much larger, the size of   is searched in (8,16,32,48) with NL in (1, 2, 3, 4).The  is set to 4 for JEDI-net-50p.For simplicity, we keep the layer number and other configurations of   and   the same as in [29] but adjust only the size of their first layer to one of (16,32,48,64,96) for both JEDI-net-30p and 50p.
The results are shown in Fig. 13(a) and (b) using a U250 FPGA platform.Each blue dot in these figures represents an explored design.Fig. 13(a) shows the designs with a latency less than 2s.Since JEDI-net-50p is much larger than JEDI-net-30p, Fig. 13(b) shows the designs with a latency less than 4s.Designs J4 and U4 are selected because they have the lowest latency (Opt-Latn) among all the candidate designs and they have the highest model accuracy among the designs with the same latency.Despite a minor accuracy loss of 0.33%, the latency of JEDI-net-30p can be further reduced to 0.29s, which is 2.1 times faster than J3.The initiation interval can also be reduced to 0.15s.As for JEDI-net-50p, the latency of U4 is reduced from 10.66s to 0.65s, which is 16.7 times faster than U3 [34], as shown in Fig. 12.In addition, U4 maintains higher accuracy than U3. on the features of the task, the strengths and limitations of the devices as well as the connection structure of the devices.

Comparison with GPUs and CPUs
To compare the performance of the proposed design on FPGA with other platforms, we run the JEDI-net models implemented in [29] on Intel Xeon Gold 6154 CPU and NVIDIA GeForce RTX 3090 (CUDA 11.8) based on PyTorch (1.12.1) framework.The CuDNN libraries are used for optimizing the hardware performance on GPUs.Each batch has 1000 graph events (samples) according to [29], so we set the same batch size across all the hardware platforms for a fair comparison.CPU power consumption is measured by the pcm-power utility [20], excluding the DRAM power consumption.GPU power consumption is measured using nvidia-smi utility.The maximum total power consumption of the U250 board is used in comparison, which is 225W.We adopt KGPS (Kilo Graphs Per Second), which denotes the number of graph events inferences that run per second, as an indicator of throughput.Compared with the JEDI-net implementation on GPU, our FPGA design is 8.7∼9.0 times faster.In terms of the power efficiency given by KGPS per Watt, our design is 12.6∼13.1 times higher than the GPU implementation.When compared to the CPU implementation, our FPGA implementation is 282∼545 times faster.In addition, our design achieves 141∼296 times higher power efficiency than the CPU implementation.As for future improvements, we believe the proposed custom code transformation can also be applied to CPU and GPU implementations, but the latency profiling shows that the three MMMs cost less than 15% of the total latency.We leave that for future work since it has a limited impact on the conclusions in this paper.The FPGA implementation is faster and more efficient due to tailor-made optimizations and a co-design approach.

RELATED WORK
Several studies have explored the use of GNNs for particle physics applications, such as jet tagging (identification) [29], charged particle tracking [23], and calorimeter energy measurements [31].More can be found in a survey [43].To achieve low latency, FPGAs are utilized.The work in [13] extends the hls4ml [11] tool to translate GNNs into FPGA firmware automatically for charged particle tracking.GarNet [22], a GNN-based algorithm, is proposed for calorimeter energy regression. of input graphs.We propose multiple novel optimizations to achieve high throughput and submicrosecond level latency.These previous studies are orthogonal to our proposed approach and hardware architecture.Their techniques could be complementary to our approach, which could be extended in the future to achieve even lower latency.

CONCLUSIONS AND FUTURE WORK
This paper presents a novel approach for minimizing the latency for processing GNNs on FPGAs, using JEDI-net algorithm as an end-to-end application.It involves optimizing the matrix operations and hardware pipeline to support next-generation low-latency collider trigger systems, the key to many fundamental physics experiments including particle identification.Results show up to 9.0 times reduction in latency over the existing GPU-based JEDI-net implementation and up to 16.7 times reduction over state-of-the-art work.Future work includes exploring the use of new FPGA resources such as the AI Engines [1] and the AI Tensor Blocks [25], automating the proposed co-design approach using techniques such as meta-programming [33,45], and incorporating the proposed techniques into the design and implementation of the data processing architecture for next-generation collider trigger systems.
b) Fig. 1.(a) An example of interaction network-based GNN with edge block, aggregation, node block and MLP-head.(b) Overview of the JEDI-net architecture.)

Fig. 7 .
Fig. 7. Designs with different reuse factors for a example fully-connected layer which has  multiplications.In this example, the  equals 4. (a) With a reuse factor of 1, all the  multiplications are performed simultaneously using  multipliers, which is fully parallel.(b) With a reuse factor of 2, the 4 multiplications are performed using 2 multipliers.(c) With a reuse factor of 4, the 4 multiplications are performed sequentially using only 1 multiplier, which is fully serial.
(b).Dedicated transceiver units facilitate the reception and transmission of data

Fig. 9 .
Fig. 9. System Overview.(a) The CMS L1T as well as data acquisition system; (b) The FPGA system of JEDI-net for particle detection.

Fig. 10 .
Fig. 10.(a) Model accuracy of JEDI-net 50p network under different datapath bitwidth configurations.The x-axis represents the total number of bits, while the different colored lines with dots represent varying numbers of integer bits.The line without dots shows the model accuracy using 32-bit floating-point (FP32) representation.(b) The ROC curves with the AUC for the 5 jets, including gluon, light quarks, W boson, Z boson and top quark.

Table 2 .
Performance comparison of FPGA-based implementations of JEDI-net with various parameters.
helps determine the appropriate values of the parallelism parameters (   ,    ) to be

Table 3 .
Comparison of the FPGA, CPU and GPU designs.