research-article
Open Access

LW-GCN: A Lightweight FPGA-based Graph Convolutional Network Accelerator

Published: 22 December 2022


Abstract

Graph convolutional networks (GCNs) have been introduced to effectively process non-Euclidean graph data. However, GCNs incur large amounts of irregularity in computation and memory access, which prevents efficient use of traditional neural network accelerators. Moreover, existing dedicated GCN accelerators demand high memory volumes and are difficult to implement onto resource limited edge devices.

In this work, we propose LW-GCN, a lightweight FPGA-based accelerator with a software-hardware co-designed process to tackle irregularity in computation and memory access in GCN inference. LW-GCN decomposes the main GCN operations into Sparse Matrix-Matrix Multiplication (SpMM) and Matrix-Matrix Multiplication (MM). We propose a novel compression format to balance workload across PEs and prevent data hazards. Moreover, we apply data quantization and workload tiling, and map both SpMM and MM of GCN inference onto a uniform architecture on resource limited hardware. Evaluation on GCN and GraphSAGE are performed on Xilinx Kintex-7 FPGA with three popular datasets. Compared to existing CPU, GPU, and state-of-the-art FPGA-based accelerator, LW-GCN reduces latency by up to 60×, 12×, and 1.7× and increases power efficiency by up to 912×, 511×, and 3.87×, respectively. Furthermore, compared with NVIDIA’s latest edge GPU Jetson Xavier NX, LW-GCN achieves speedup and energy savings of 32× and 84×, respectively.


1 INTRODUCTION

Over recent years, deep learning paradigms such as convolutional neural networks (CNNs) and recurrent neural networks have shown great success in various families of tasks such as image and text processing [21, 26]. However, these paradigms rely heavily on structural properties of Euclidean data such as dense tensors and have trouble processing non-Euclidean data such as graphs. To tackle this problem, graph neural networks (GNNs) have been introduced and have demonstrated the ability to accurately process complex graph data [22]. Among numerous GNNs, graph convolutional networks (GCNs) [16], which borrow ideas from CNNs to aggregate neighbor data, have quickly attracted industrial attention as a popular solution to real-world problems [2, 8, 25, 34]. Since then, many other graph processing algorithms (e.g., GIN, GraphSAGE, and GAT) have been introduced to improve performance on existing problems and extend to new challenges [11, 14, 23, 28, 29, 32].

Similarly to CNNs, GCNs contain multiple layers, where the main operations of each layer are combination and aggregation. Combination is similar to a dense layer of a multi-layer perceptron, where a feature matrix is multiplied by a weight matrix. Aggregation is similar to a convolution operation of a standard CNN: the feature vector of each vertex is computed through a weighted aggregation of the feature vectors of all neighboring vertices, which can be represented as a matrix multiplication between the graph adjacency matrix and the feature matrix.
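As an illustration, the two steps above can be written in a few lines of NumPy. This is a dense reference implementation for clarity, not the accelerator's datapath; the 3-node graph and matrices are made up.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer: combination (X @ W) followed by aggregation (A @ ...).

    A: (N, N) normalized adjacency matrix, X: (N, F_in) feature matrix,
    W: (F_in, F_out) weight matrix. Returns (N, F_out) activated features.
    """
    combined = X @ W            # combination: dense-layer-style matmul
    aggregated = A @ combined   # aggregation: weighted sum over neighbors
    return np.maximum(aggregated, 0.0)  # ReLU

# Tiny 3-node example: nodes 0 and 1 connected, node 2 isolated
# (self-loops included, rows normalized).
A = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
W = np.eye(2)  # identity weights so the output equals relu(A @ X)
Y = gcn_layer(A, X, W)
```

With identity weights, nodes 0 and 1 end up with the average of their two feature vectors, while the isolated node 2 keeps its own.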

Despite the fact that the majority of GCN operations can be represented as matrix multiplication, existing matrix multiplication–oriented accelerators [12, 20, 35, 36] are unlikely to yield high throughput on GCN. These accelerators typically exploit the structured nature of dense tensors and apply data reuse techniques to achieve performance boosts. However, such techniques are ineffective in GCNs, because adjacency matrices in GCNs are often sparse, random, and irregular: the node degree distribution of random graphs follows the power-law distribution. Although existing works such as EIE and Cambricon-X [15, 38] tackle irregularity in computation and memory access in deeply compressed CNNs, the sparsity of deeply compressed CNNs is much lower (around 90%) than that of GCNs (over 99.9%). Due to the extreme sparseness of graph data, sparse CNN accelerators also fail to maximize computational efficiency, and thus a more effective approach is required.

There are existing GCN accelerators that overcome the sparseness challenges [13, 18, 19, 33]. EnGN [19] proposes a unified architecture for feature extraction, aggregation, and update operations. GCNAX also proposes a unified architecture while focusing specifically on loop rearrangement to improve the efficiency of loading data from off-chip. Cambricon-G [24] proposes a cuboid engine with multiple vertex processing units and hybrid on-chip memory to process sparse data and dynamically update the graph topology. Meanwhile, Rubik [7] develops a unified architecture that cooperates with graph reordering to support both node-level and graph-level computing. In contrast, the works in References [13, 33] assume combination and aggregation are structurally different. HyGCN designs an aggregation engine for irregular accesses and computations and a combination engine for regular accesses and computations. AWB-GCN uses TDQ-1 and TDQ-2 to perform general sparse (sparsity \(\lt\) 75%) matrix multiplication and ultra-sparse matrix multiplication, respectively. Although the above works achieve performance boosts, they either cache large amounts of data on-chip or rapidly load data from off-chip memory, requiring either large amounts of on-chip memory or huge off-chip memory bandwidth (i.e., high-bandwidth memory). Moreover, HyGCN and AWB-GCN deploy independent hardware modules for different operations, such as combination and aggregation. Despite efforts to balance computation in each module, the inherent workload differences across datasets make it difficult to keep both modules fully utilized. Furthermore, as GCN grows in popularity and supports numerous real-world applications, it is natural for its inference workload to see heavy demand on edge devices in the near future. For example, GCN is used for autonomous exploration under uncertainty in the robotic domain [6]. The authors of Reference [1] propose a GNN-based algorithm to optimize pose prediction in two-dimensional SLAM, which can be widely used in autonomous driving. Such resource limited devices are unlikely to provide powerful hardware resources; therefore, a more lightweight approach is required.

To this end, we propose a lightweight software-hardware co-optimized accelerator, named LW-GCN, to efficiently perform GCN inference. We first introduce the “packet” concept and compress the sparse matrix into a packet-level column-only coordinate-list (PCOO) format in software. The PCOO format is also easy to decompress in hardware. We then propose a unified micro-architecture to efficiently execute both combination and aggregation, whose main operations are Matrix-Matrix Multiplication (MM) and Sparse Matrix-Matrix Multiplication (SpMM). An optimized computation pipeline in each processing element (PE) copes with the irregularity in computation and memory access caused by SpMM. Due to limited hardware resources, we apply tiling to process a portion of MM/SpMM at a time, which lets us keep only a fraction of the matrices on-chip. Finally, our preprocessing procedure injects “empty elements” into PCOO to indicate idle cycles and prevent data collisions caused by the irregularity of the sparse matrix on the software side. The preprocessing algorithm has linear time and space complexity with respect to the number of elements in the sparse matrix.

We implement LW-GCN on the Xilinx Kintex-7 K325T FPGA, which simulates the limited resource availability of edge devices. We evaluate LW-GCN for GCN and GraphSAGE on three popular datasets: Cora [3], CiteSeer [4], and PubMed [9]. Compared to the state-of-the-art software framework PyTorch Geometric (PyG) running on an Intel Xeon Gold 5218 CPU, an NVIDIA Jetson Xavier NX edge GPU, an NVIDIA RTX 3090 GPU, and a prior FPGA-based GCN accelerator [13], LW-GCN achieves up to 60\(\times\), 32\(\times\), 12\(\times\), and 1.7\(\times\) lower latency, as well as 912\(\times\), 84\(\times\), 511\(\times\), and 3.87\(\times\) higher energy efficiency, respectively. To summarize, the main contributions of this work are listed as follows:

  • Software-Hardware Co-optimization. We propose a linear time and space preprocessing algorithm to compress the sparse matrix into PCOO format and optimize the GCN workload. In addition, the micro-architecture is designed to efficiently process the PCOO format, so that the GCN workload is also optimized on the hardware side.

  • High Computation Efficiency. We design a unified micro-architecture for MM and SpMM, which efficiently performs both combination and aggregation operations in GCN. Moreover, the PCOO format skips computation and storage of zeros in the sparse matrix, and the optimized architecture in each PE addresses the irregularity caused by the sparse matrix, further increasing the computation efficiency of LW-GCN.

  • Low Resource Requirement. The compression method in the preprocessing algorithm reduces both storage and bandwidth requirements. Moreover, LW-GCN uses tiling to process a portion of MM/SpMM at a time, further alleviating on-chip memory burdens. Different from prior works that rely heavily on large on-chip memory, LW-GCN works effectively on resource limited edge devices.

  • High Performance. We evaluate LW-GCN on a Kintex-7 FPGA with three popular datasets. Our work reduces latency by up to 60\(\times\), 32\(\times\), 12\(\times\), and 1.7\(\times\) and increases energy efficiency by up to 912\(\times\), 84\(\times\), 511\(\times\), and 3.87\(\times\), compared to an Intel CPU, an NVIDIA edge GPU, an NVIDIA server GPU, and a prior FPGA-based GCN accelerator, respectively.


2 CHALLENGES AND MOTIVATIONS

In this section, we briefly introduce the GCN algorithm, the challenges of mapping it onto hardware, and the motivation behind our accelerator design.

2.1 GCN Background

The forward propagation of the \(l\)th layer of a multi-layer GCN [16] is illustrated in Equation (1), (1) \(\begin{align} X_l = Relu(AX_{l-1}W_l), \end{align}\) where \(A, X_{l}\), and \(W_l\) indicate the adjacency matrix of the input graph, the feature matrix of the \(l\)th layer, and the weight matrix of the \(l\)th layer, respectively. \(Relu\) is the activation function and the input feature matrix of the graph is represented as \(X_0\).

Based on our analysis of widely used datasets, the adjacency matrices and the input feature matrix are often sparse, while the weight matrices are dense, as shown in Table 1. Therefore, when zeros are skipped, the computation order dramatically influences computational complexity. Following the analysis in Reference [13], we profile the required number of scalar operations and intermediate storage under different computation orders, as shown in Table 2. Accordingly, we perform \(A \times (X_{l-1} \times W_l)\), as it is much more efficient. This is also true for GraphSAGE. The computation of a layer in GraphSAGE can be expressed as follows: (2) \(\begin{equation} X_l = ReLU(X_{l-1}W_{l,1} + \hat{A}X_{l-1}W_{l,2}), \end{equation}\) where \(\hat{A}\) is preprocessed from the adjacency matrix \(A\) by dividing each element by the number of non-zero elements in its row. In this way, we can apply the same optimized compute order \(\hat{A}(X_{l-1} W_{l,2})\) as in GCN. Similarly, the first layer of GIN contains similar sparse-dense-dense matrix multiplication workloads and can potentially utilize this optimization. For simplicity, we refer to the step \(X_{l-1} \times W_l\) as combination and \(A \times (...)\) as aggregation of each GCN layer, following the conventions of Reference [33]. Moreover, we perform SpMM for aggregation and for the combination of the first layer, and MM for the combination of the other layers; this is because \(X_l\) is produced by the previous layer and is always dense except in the first layer. From here onward, for SpMM we refer to the left sparse input matrix as \(X\), the right dense input matrix as \(W\), and the output matrix as \(Y\).

Table 1. Dimensions and Densities of Widely Used Datasets

Datasets | Nodes  | Edges  | Input Features | Classes | Feature Density | Edge Density | Weight Density
Cora     | 2,708  | 10,556 | 1,433          | 7       | 1.27%           | 0.144%       | 100%
CiteSeer | 3,327  | 9,104  | 3,703          | 6       | 0.85%           | 0.0822%      | 100%
PubMed   | 19,717 | 88,648 | 500            | 3       | 10.0%           | 0.0228%      | 100%

Table 2. Required Computation and Storage under Different Computation Orders

Datasets | \((A \times X_{l-1}) \times W_l\) (ops / storage) | \(A \times (X_{l-1} \times W_l)\) (ops / storage)
Cora     | 18.7 M / 56.2 Mb | 1.33 M / 0.661 Mb
CiteSeer | 38.9 M / 188 Mb  | 2.23 M / 0.812 Mb
PubMed   | 118 M / 150 Mb   | 18.6 M / 4.81 Mb
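The effect of computation order can be approximated with a short operation-count sketch. The cost model below (multiply-accumulates only, zeros skipped in the left operand) and the 16 hidden units are our illustrative assumptions, not the paper's exact profiling methodology.

```python
def matmul_cost(rows, inner, cols, left_density=1.0):
    """Scalar multiply-accumulates for (rows x inner) @ (inner x cols),
    skipping zeros in the left operand."""
    return int(rows * inner * cols * left_density)

def gcn_layer_cost(n, f_in, f_out, adj_density, feat_density):
    # Order 1: (A @ X) @ W -- aggregation first produces a dense (n x f_in)
    # intermediate, so the second matmul pays the full f_in width.
    agg_first = (matmul_cost(n, n, f_in, adj_density)
                 + matmul_cost(n, f_in, f_out))
    # Order 2: A @ (X @ W) -- combination first shrinks f_in to f_out early.
    comb_first = (matmul_cost(n, f_in, f_out, feat_density)
                  + matmul_cost(n, n, f_out, adj_density))
    return agg_first, comb_first

# Cora-like dimensions/densities from Table 1; 16 hidden units assumed.
agg_first, comb_first = gcn_layer_cost(2708, 1433, 16, 0.00144, 0.0127)
```

For these numbers the combination-first order is roughly two orders of magnitude cheaper, matching the trend in Table 2.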

2.2 Challenges

As illustrated in previous sections, the main operations in GCN can be extracted as SpMM and MM. Therefore, the challenge is to accelerate SpMM and MM on resource limited devices.

2.2.1 Challenges on SpMM.

The computation of SpMM on one PE can effectively skip all zero elements of the sparse input \(X\), as shown in Algorithm 1. However, parallel computing with multiple PEs introduces new problems of Computation Imbalance and Memory Irregularity.

Computation Imbalance: To accelerate SpMM on multiple PEs, we first divide the workload and distribute portions to the PEs. Each PE processes only the non-zero elements from \(X\). Due to the irregularity of \(X\), it is difficult to allocate identical workloads to every PE, which leads to computation imbalance. This is challenging for the SpMM in GCN, as the matrices in combination and aggregation are extremely sparse (\(\gt 99\%\)). Moreover, real-world graphs follow the power-law distribution [30], which implies that a minority of rows (columns) in the adjacency matrix contain the majority of non-zeros, while the majority of rows contain only a few. Such irregularity further increases the difficulty of balancing the workload.

Memory Irregularity: Since the optimization of SpMM stores only the non-zero elements to reduce memory requirements, the data irregularity incurs several issues during computation. First, it is difficult to predict the position of the next non-zero element \(X_{i, j}\) to be processed in the left matrix. Since matrix multiplication matches \(X_{i,j}\) against \(W_j\) and \(j\) is unknown, the next non-zero \(X_{i,j}\) could require any row of \(W\). This uncertainty requires us to cache the entire \(W\) matrix on-chip, which is very expensive. Second, parallel computation of SpMM processes multiple non-zero elements of \(X\) simultaneously, thus requiring all corresponding data in \(W\) to be readily available; this introduces the problem of bank conflicts. For example, to process non-zero elements \(X_{i_a,j_a}\) and \(X_{i_b,j_b}\) simultaneously, the PEs must be provided with \(W_{j_a}\) and \(W_{j_b}\). However, memory resources on FPGA usually come with high depth and very limited (1 or 2) ports, where each port can access only a single depth of the memory bank at a time. In the scenario where \(W_{j_a}\) and \(W_{j_b}\) are stored on the same bank, which can supply only one of them at a time, we face a data conflict. Third, since the SpMM algorithm computes each row of the result as a sum of many scalar-vector multiplications, it introduces a read-after-write (RAW) conflict. This is because arithmetic operations tend to take multiple cycles on hardware. If we process non-zero element \(X_{i, j_a}\) followed by \(X_{i, j_b}\) in the immediately following cycle, then the multiplication and addition for \(X_{i, j_a}\) would not yet have finished; when the PE reads \(Y_i\) in the next cycle to process the addition for \(X_{i, j_b}\), it would inevitably read an incorrect result.
Finally, although the RAW conflict can be effectively resolved by utilizing multiply-accumulators (MACs) instead of individual multipliers and adders, doing so restricts the design to processing each row \(X_i\) on the same PE, which leads back to the issue of Computation Imbalance. As the node degrees in a random graph follow the power-law distribution, it is common for densities of individual rows of an adjacency matrix to differ by over 100\(\times\). Naively partitioning the sparse input \(X\) into row blocks and assigning row blocks to a PE group would result in unequal non-zero workloads across the PEs within the group. The latency of the group would be controlled solely by the input row with the highest density, vastly reducing efficiency.

2.2.2 Challenges on Resource Limited Devices.

Accelerating GCN inference requires accelerating both MM and SpMM. Although MM does not suffer from the Computation Imbalance and Memory Irregularity of SpMM, it requires storage of every element of the matrices, which raises the issue of Bandwidth Constraints. Therefore, designing a module that handles both MM and SpMM is challenging. Existing solutions such as References [13, 33] view MM and SpMM as inherently different workloads and therefore introduce dedicated modules to perform each independently. Although this allows each module to be tailored toward its workload, the resource allocation for each module raises a non-negligible concern. Since different problem settings come with different data dimensions and densities (examples shown in Table 1), the ratio between arithmetic operations required in combination and aggregation varies significantly across datasets. Moreover, the data dependency between combination and aggregation leaves one of the MM and SpMM modules idle, which wastes resources. These problems make accelerating GCN on resource limited devices even more challenging.

2.3 Motivation

Motivated by the above challenges, we propose a software-hardware co-optimization process to address each of them while staying within the resource budget of an edge device. We first define the PCOO format to compress the input sparse matrix, effectively eliminating zero elements to save both storage space and computation time. We then design a dedicated computation engine that efficiently processes multiple non-zero elements in parallel. Some key highlights of our design include the following:

  • Software Preprocessing: We first compress the sparse data into PCOO format and leverage the binary “edge-or-no-edge” nature of graph adjacency matrices to remove value data. Then, we search the space of the sparse matrix to balance the workload across PEs, resolving the issue of computation imbalance. Finally, idle data insertion is applied to solve the problem of bank conflicts at a small cost.

  • Dedicated Architecture Design: We design a dedicated architecture to decompress the PCOO format to further increase the computation efficiency. Moreover, a multi-port memory is applied in our design to resolve the issue of data conflict from the hardware side.

  • Unified Micro-architecture: We observe that MM is essentially a special case of SpMM where density is 1. Therefore, we design the algorithm to process SpMM by individual non-zero elements of the sparse matrix and apply the same algorithm to MM. Moreover, we design a unified PE architecture to process both MM and SpMM efficiently, which allows all computation resources to be fully utilized. This allows the full GCN workload to be deployed onto a unified module, resolving the resource allocation problem.

  • Flexible Design: Our design is not dedicated toward any specific GCN configuration; instead, it supports any number of layers of any size. Additionally, since MM and SpMM are widely used across GNNs, our design supports most operations needed for many other networks. In Section 5, we also evaluate our design on GraphSAGE in addition to GCN, as we support it out of the box.


3 SOFTWARE PREPROCESSING

The software preprocessing algorithm will first compress the input data and then allocate and schedule GCN workloads onto different PEs. We will explain these algorithms in detail in this section.

3.1 Data Compression

3.1.1 PCOO Format.

As shown in Table 1, the adjacency matrix and the input matrix of the first layer in GCNs are often extremely sparse. Therefore, we compress these matrices to process only valuable information (non-zero elements), saving storage and reducing computation complexity. We introduce the “packet” concept and propose the PCOO format to compress the sparse matrix (Figure 1). In detail, we treat all the elements in one row as one packet, and each non-zero element \(X_{i,j}\) in the row is encoded into a bitwise format. First, the leading two bits encode the row information of each non-zero element: start-of-row (SOR) for the first non-zero element and end-of-row (EOR) for the last non-zero element. Second, the following bit indicates valid (VLD), differentiating real non-zeros from injected empty elements (the injected empty elements are explained in detail in Section 3.2). These three bits act as the header of a packet, and the remaining bits form the payload, which carries the column information and the value of each non-zero element. We use \(log_2(T)\) bits, where \(T\) is the tile size, to represent the column position within the tile (\(j\) mod \(T\)) of \(X_{i, j}\). Finally, we use the remaining \(H\) bits to represent the value of the non-zero element. In the corner case where a row contains no non-zero elements, we set the header SOR = EOR = 1 and VLD = 0 with an empty payload to instruct the hardware to increment the row number without performing any calculation. In total, we need \(3+log_2(T)+H\) bits to represent each non-zero element in the sparse matrix. The algorithm for compressing a sparse matrix with PCOO is summarized in Algorithm 2.

Fig. 1.

Fig. 1. Packet-level column-only coordinate list format.
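A behavioral sketch of the encoder described above. The field widths follow the text (2-bit SOR/EOR, 1-bit VLD, \(log_2(T)\)-bit column, \(H\)-bit value), but the packing order of fields within a word is our assumption for illustration.

```python
import math

def pcoo_encode_row(row, tile_size, value_bits=4):
    """Encode one tile row into PCOO words laid out as [SOR|EOR|VLD|col|value].

    Each word is 3 + log2(T) + H bits. An all-zero row becomes a single
    header-only word (SOR = EOR = 1, VLD = 0) so hardware still advances
    its row counter without computing anything.
    """
    col_bits = int(math.log2(tile_size))
    nz = [(j, v) for j, v in enumerate(row) if v != 0]
    if not nz:
        return [(1 << (2 + col_bits + value_bits))      # SOR = 1
                | (1 << (1 + col_bits + value_bits))]   # EOR = 1, VLD = 0
    words = []
    for k, (j, v) in enumerate(nz):
        sor = 1 if k == 0 else 0
        eor = 1 if k == len(nz) - 1 else 0
        word = ((sor << (2 + col_bits + value_bits))
                | (eor << (1 + col_bits + value_bits))
                | (1 << (col_bits + value_bits))         # VLD = 1
                | ((j % tile_size) << value_bits)        # column within tile
                | (v & ((1 << value_bits) - 1)))         # H-bit value
        words.append(word)
    return words

# A 4-wide tile row with non-zeros in columns 1 and 3.
words = pcoo_encode_row([0, 1, 0, 1], tile_size=4)
```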

We evaluate the storage consumption of the commonly used compression formats (CSR/CSC/COO) vs. PCOO; the results are shown in Table 3. For all the datasets, the PCOO format is comparable in storage efficiency to CSR and CSC and more efficient than COO.

Table 3. Comparison of Storage Requirement among CSR, CSC, COO, and PCOO

Dataset           | Rows   | Cols   | Non-zeros | CSR     | CSC     | COO     | PCOO
Cora Features     | 2,708  | 1,433  | 49,216    | 781 Kb  | 810 Kb  | 1.33 Mb | 886 Kb
Cora Edges        | 2,708  | 2,708  | 10,556    | 207 Kb  | 207 Kb  | 296 Kb  | 201 Kb
CiteSeer Features | 3,327  | 3,703  | 105,165   | 1.74 Mb | 1.74 Mb | 2.94 Mb | 2.00 Mb
CiteSeer Edges    | 3,327  | 3,327  | 9,104     | 192 Kb  | 192 Kb  | 255 Kb  | 173 Kb
PubMed Features   | 19,717 | 500    | 105,165   | 13.2 Mb | 18.8 Mb | 27.7 Mb | 15.8 Mb
PubMed Edges      | 19,717 | 19,717 | 9,104     | 202 Kb  | 202 Kb  | 301 Kb  | 195 Kb

Since we treat MM as the same operation as SpMM, we format the left dense matrix in MM to fit the unified PE (as described in Section 4). The dense matrix is stored as normal, and all rows share the same column information in PCOO format. In this way, we only need an extra \((3+log_2(T)) \times Column\_Size\) bits to store the dense matrix in intermediate steps.

3.1.2 Quantization.

To further reduce memory consumption, we apply quantization to the values of all the matrices in GCNs. Quantization strategies for GNNs already exist. Degree-Quant [27] can quantize to 8-bit signed fixed point with negligible accuracy loss; however, its quantization strategy is applied during the GCN training process. SGQuant [10] proposes a GNN-tailored quantization algorithm to reduce GNN memory consumption; however, it requires a fine-tuning scheme to compensate for the accuracy loss caused by precision reduction. Our work targets only the inference phase, and we aim to keep preprocessing time low, so we adopt a post-training quantization strategy. To maintain accuracy, we select 16-bit signed fixed point (SINT16) to quantize the features and weights. Moreover, we exploit the data properties of the sparse matrices and use 4-bit signed fixed point (SINT4) to quantize their non-zero elements. In fact, for all adjacency matrices as well as two of the three feature matrices on the three popular datasets, the value of each \(X_{i,j}\) is binary (0 or 1), so there is no accuracy loss at all. During computation, we store all intermediate results as 32-bit signed fixed point (SINT32) to maintain accuracy. Evaluation of our quantization strategy on both GCN and GraphSAGE on all three datasets shows that the proposed approach incurs negligible accuracy loss (within 0.2%).
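A minimal post-training fixed-point quantizer in the spirit of the scheme above. The 8-bit fractional split for SINT16 is an illustrative choice of ours, not a value taken from the paper.

```python
def quantize_fixed(values, total_bits, frac_bits):
    """Round to signed fixed point with `frac_bits` fractional bits,
    saturating to the representable range of `total_bits`-bit integers."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    out = []
    for v in values:
        q = int(round(v * (1 << frac_bits)))
        out.append(max(lo, min(hi, q)))
    return out

def dequantize_fixed(codes, frac_bits):
    """Map fixed-point codes back to real values."""
    return [q / (1 << frac_bits) for q in codes]

# SINT16 with an assumed 8 fractional bits for weights/features; SINT4
# with no fractional bits trivially covers binary {0, 1} sparse values.
w_q = quantize_fixed([0.5, -1.25, 3.0], total_bits=16, frac_bits=8)
x_q = quantize_fixed([0.0, 1.0], total_bits=4, frac_bits=0)
```

Values that are exact multiples of the step size (as here) round-trip with no error; out-of-range values saturate rather than wrap.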

3.2 Assignment and Scheduling

To reduce memory consumption, we employ an outer-product tiling approach, as shown in Figure 2. We partition the inputs into \(T\)-column tiles for \(X\) and \(T\)-row tiles for \(W\). The hardware processes a pair of tiles at a time and produces the final result by accumulating all tile results. For each pair of tiles, we perform the following preprocessing steps to balance workload and reduce data volume.

Fig. 2.

Fig. 2. Outer product matrix multiplication.
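The tiling scheme above can be sketched in NumPy: the \(k\)-th \(T\)-column tile of \(X\) is paired with the \(k\)-th \(T\)-row tile of \(W\), and each pair's full-size partial product is accumulated into the result. This is a dense reference for the dataflow, standing in for the hardware's per-tile processing.

```python
import numpy as np

def tiled_matmul(X, W, T):
    """Outer-product tiled matrix multiplication with tile size T."""
    n, k = X.shape
    _, m = W.shape
    Y = np.zeros((n, m))
    for t0 in range(0, k, T):
        t1 = min(t0 + T, k)
        # Only one (X column-tile, W row-tile) pair is needed at a time;
        # its outer product contributes a partial sum to the whole of Y.
        Y += X[:, t0:t1] @ W[t0:t1, :]
    return Y

X = np.arange(12.0).reshape(3, 4)
W = np.arange(8.0).reshape(4, 2)
Y = tiled_matmul(X, W, T=2)
```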

3.2.1 Workload Assignment and Scheduling.

Multiplications of non-zero elements in one row of the sparse matrix \(X\) are assigned to the same PE, while different rows are assigned to different PEs in a round-robin fashion. This way, non-zero elements from each row are processed sequentially on the same PE and do not require the same accumulator simultaneously. However, different rows of a graph adjacency matrix can have extremely different densities (with relative differences \(\gt\) 100\(\times\)). If we naively tiled the workload further into row blocks, then the majority of PEs would finish execution and remain idle waiting for a single PE to finish processing a particularly dense row, shown as the assignment step in Figure 3. To increase PE efficiency, we design the PEs to work independently: each PE starts to compute a new row immediately after it finishes the previous one. Multiple rows are thus effectively concatenated before being assigned to one PE, eliminating idle time (shown as the concatenation step in Figure 3). Since the density of a row is unlikely to correlate with its row number, by the law of large numbers we expect the sums of densities of the rows assigned to each PE to be similar. In Section 5, we analyze examples in detail and compare the computation cost and idle time before and after the concatenation step. Finally, to ensure all PEs process the same number of elements, including zeros and non-zeros, we inject empty elements at the end of each concatenated row when necessary.

Fig. 3.

Fig. 3. Round-robin assignment of non-zero elements to four PEs.
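The benefit of concatenation over naive per-round synchronization can be illustrated with a small cycle-count model, assuming one non-zero per cycle; the row densities below are made up to mimic a power-law distribution.

```python
def assign_rows(row_nnz, num_pes):
    """Round-robin rows to PEs and compare two schedules: a naive
    row-block schedule, where every PE waits for the slowest row in each
    round, and the concatenated schedule, where each PE starts its next
    row immediately. Returns (naive_cycles, concat_cycles)."""
    queues = [[] for _ in range(num_pes)]
    for i, nnz in enumerate(row_nnz):
        queues[i % num_pes].append(nnz)
    rounds = max(len(q) for q in queues)
    # Naive: per round, all PEs stall until the densest row finishes.
    naive = sum(max((q[r] if r < len(q) else 0) for q in queues)
                for r in range(rounds))
    # Concatenated: each PE drains its own queue back-to-back.
    concat = max(sum(q) for q in queues)
    return naive, concat

# Two dense rows among many sparse ones, spread across 4 PEs.
naive, concat = assign_rows([10, 1, 1, 1, 1, 10, 1, 1], num_pes=4)
```

Here the naive schedule stalls twice on a dense row (20 cycles), while concatenation overlaps the dense rows with the sparse ones (11 cycles).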

3.2.2 Data Collision Resolution.

Due to constraints on on-chip memory, multiple rows of \(W\) are stored in the same memory slice, of which only a single row may be accessed at any time. However, the sparsity of \(X\) may cause two PEs to simultaneously access different depths of the same memory slice, which incurs a data collision. To resolve this problem, we first develop a multi-bank memory system with data replication to reduce the occurrence of such collisions (see details in Section 4.2). We then inject an empty element with VLD = 0 to prevent any data collision not resolved by the multi-port memory system. For example, a bank conflict occurs if \(N+1\) elements require access to the same \(N\)-port memory. In this case, we insert an empty element in place of the \((N+1)\)-th element and have it access one of the other \(N\) addresses, so that the bank conflict is avoided. Inserted elements incur extra inference latency, while more ports on a single memory incur larger on-chip memory usage. A tradeoff is therefore made between on-chip memory usage and the extra latency incurred by empty elements; detailed analysis is given in Section 5.3.
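A simplified model of the empty-element injection: requests that overflow a bank's ports in one cycle are deferred, which in hardware corresponds to an idle (VLD = 0) slot for the requesting PE. The greedy deferral policy is our simplification of the scheduling described above.

```python
def schedule_requests(col_requests, num_groups, ports):
    """Pack one batch of column requests into conflict-free cycles.

    Row W_j lives in bank (j % num_groups); each bank serves at most
    `ports` distinct addresses per cycle (identical addresses can be
    broadcast to any number of PEs)."""
    cycles = []
    pending = list(col_requests)
    while pending:
        this_cycle, leftover = [], []
        served = {}  # bank -> addresses granted this cycle
        for j in pending:
            grants = served.setdefault(j % num_groups, set())
            if j in grants or len(grants) < ports:
                grants.add(j)
                this_cycle.append(j)
            else:
                leftover.append(j)  # deferred: an idle slot for its PE
        cycles.append(this_cycle)
        pending = leftover
    return cycles

# Three PEs request rows 0, 4, 8: with 4 groups all three map to bank 0,
# so a 2-port bank serves only two of them in the first cycle.
cycles = schedule_requests([0, 4, 8], num_groups=4, ports=2)
```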

Preprocessing is summarized in two steps in Algorithms 2 and 3. Overall, the preprocessing algorithm has linear time and space complexity in the total number of non-zero elements in every unique sparse tile. The dense tile, by contrast, is quantized to SINT16 and passed to hardware without structural change. Finally, the preprocessor generates instructions to serialize the execution across layers and steps.


4 MICRO-ARCHITECTURE OF LW-GCN

As shown in Figure 4, the micro-architecture of LW-GCN is composed of Peripheral Interface, External Memory Interface, Top Control, PE Array for Sparse-Dense Matrix Multiplication, and on-chip buffers. The Top Control module fetches and decodes instructions, before passing them to individual modules. As mentioned above in Section 3, the micro-architecture processes a single tile at a time.

Fig. 4.

Fig. 4. The overall micro-architecture and workflow of LW-GCN.

4.1 Overall Workflow

The overall workflow of computing one tile of LW-GCN is shown in Figure 4. The dense input data is transferred from external memory to dense data memory (DDM) during the initial load step. During the Compute step, sparse input data are streamed onto the edge weights memory (EWM), and the PE array fetches \(X\) data from EWM and \(W\) data from DDM and performs MAC operations in parallel. The intermediate results will be stored in the output matrix memory backup (OMMB) and can be treated as the left matrix in the later computation. Upon finishing all pairs of tiles from each aggregation or combination step, we move output to the OMMB, and move a copy to DDM after combination and EWM after aggregation during the data move step.

4.2 Multi-Bank Dense Data Memory

As mentioned in Section 3, to compute different rows on different PEs in parallel, multiple non-zero elements from the sparse input are streamed on-chip during SpMM. Due to the sparseness and irregularity of \(X\), it is difficult to predict the column positions of the non-zero elements ahead of time. In particular, several PEs may require different addresses from the same DDM. Limited by the read capability of on-chip memory (a dual-port RAM supports reading from at most two ports, while the number of PEs is likely larger than two), such access restrictions lead to data collisions. In the micro-architecture of LW-GCN, we build a multi-port memory that stores the weights of one tile, using data replication and row grouping to reduce such collisions. In addition, we further reduce the occurrence of collisions during preprocessing, as mentioned in Section 3.

4.2.1 Data Replication.

We replicate the dense data into \(r\) replicas on different memory slices. Ideally, setting \(r\) equal to the number of PEs would avoid the aforementioned data collisions entirely, because each PE would have a dedicated replica of the dense data. However, this incurs a large on-chip memory requirement and is infeasible in practice. Therefore, we set a relatively small \(r\) to resolve part of the data collisions with acceptable resource utilization (the choice of \(r\) is explained in detail in Section 5.2), and we introduce row grouping to further reduce the occurrence of collisions.

4.2.2 Row Grouping.

We partition each dense data replica into \(g\) row groups, each of which is stored independently. Specifically, we store row \(W_j\) on group (\(j\) mod \(g\)), so that data collision can only occur between elements \(X_{i_a, j_a}\) and \(X_{i_b, j_b}\) if \((j_a\) mod \(g)\) = \((j_b\) mod \(g)\) and \(j_a \ne j_b\), which is significantly less likely compared with the undivided memory. Despite the fact that on-chip memory requires a minimum depth to be fully utilized, we are able to use high numbers of row groups to statistically reduce the probability of data collision. However, row grouping with large \(g\) leads to high complexity for data distribution to PEs, which results in complex placement and routing and increases resource consumption.

Both data replication and row grouping efficiently reduce data collisions. The remaining collisions are avoided by injecting empty elements and processing them as idle cycles, as mentioned in Section 3. We experiment with different \(r\) and \(g\) in Section 5.3, where we analyze the number of inserted idle cycles versus hardware resource consumption to determine the optimal number of memory replicas and row groups.
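The interplay of replication and row grouping can be illustrated with a small collision model. The Python sketch below is a simplification under our own assumptions (replica \(p \bmod r\) serves PE \(p\); each row group serves one read per cycle; every extra distinct row requested from the same replica and group costs one idle cycle), not the exact hardware behavior:

```python
from collections import defaultdict

def count_stall_cycles(requests_per_cycle, r, g):
    """Count idle cycles caused by data collisions.

    requests_per_cycle: list of cycles; each cycle is a list of column
    indices, one per PE. PE p reads from replica p % r; within a replica,
    row j lives in row group j % g, and a group serves one read per cycle.
    """
    stalls = 0
    for cycle in requests_per_cycle:
        # (replica, group) -> distinct rows requested in this cycle
        demand = defaultdict(set)
        for pe, j in enumerate(cycle):
            demand[(pe % r, j % g)].add(j)
        # each extra distinct row on the same (replica, group) costs a stall
        stalls += sum(len(rows) - 1 for rows in demand.values())
    return stalls
```

In this model, setting \(r\) equal to the number of PEs gives every PE its own replica and eliminates stalls, while increasing \(g\) spreads rows over more independently readable groups, matching the qualitative trends described above.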

4.3 Unified PE Architecture for MM and SpMM

As shown in Figure 5(a), the numbers of PE groups and memory banks are kept equal, so that each PE group can access its corresponding memory bank for dense data without collision. Based on addresses generated by each individual PE, the Memory Selector and Data Distributor dispatch the appropriate dense data. We use a priority decoder when distributing addresses to memory banks, which allows different PEs to fetch from the same address of the same memory bank.

Fig. 5. (a) The architecture of PE Array; (b) detailed architecture of a PE.

Note that data replication applies only to the dense input, not the sparse input; the compressed sparse data is streamed directly to each PE. As shown in Figure 5(b), data first passes through the PCOO Decoder, where the \(log_2(T)\)-bit column index is interpreted as the memory address from which to fetch dense data. If the valid bit is set (VLD = 1), then the PE routes the corresponding value to its multiplier; otherwise, it treats the current value as an injected empty element (i.e., a data collision, waiting for other PEs to finish, etc.) and routes 0 to the multiplier instead. Since multiple rows are concatenated into each PE's input stream, we use SOR and EOR to indicate the start and end of a row, respectively. At each computation step, SOR selects the accumulator's input: either its previous result (SOR = 0) or the intermediate result of the previous tile saved in OMMB (SOR = 1). Meanwhile, EOR controls address generation for storing the current result into the output buffer and increments the internally tracked row number (EOR = 1).
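The PE control flow described above can be captured in a short behavioral model. This Python sketch is illustrative only: the tuple layout and names are ours, and the 16-lane PE is simplified to a single multiply-accumulator.

```python
def pe_process(stream, dense_column, partial_in):
    """Behavioral model of one PE consuming a PCOO element stream.

    stream: (vld, sor, eor, col, val) tuples for concatenated rows.
    dense_column: dense tile data addressed by the decoded column index.
    partial_in: per-row intermediate results of the previous tile (OMMB).
    Returns the accumulated result of each finished row.
    """
    results = []
    row = 0  # internally tracked row number (no row index in the stream)
    acc = 0
    for vld, sor, eor, col, val in stream:
        operand = val if vld else 0   # VLD = 0: injected empty element
        if sor:                       # SOR = 1: load previous tile's partial
            acc = partial_in[row]
        acc += operand * dense_column[col]
        if eor:                       # EOR = 1: emit result, advance row
            results.append(acc)
            row += 1
    return results
```

For example, a stream of two concatenated rows accumulates each row on top of its saved partial result and emits one value per EOR flag.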

MM is performed in the same PE with the same dataflow. Since the left matrix is dense, all rows share the same row and column information, which also goes through the PCOO Decoder. The sparse_flag signal then indicates which data to select: when processing MM (sparse_flag = 0), the PE selects the edge weights stored in EWM; otherwise, it selects the value decoded by the PCOO Decoder. In this way, both MM and SpMM run on the unified PE, which improves PE efficiency for computing the combination and aggregation of GCNs.

5 EVALUATION

In this section, we evaluate LW-GCN on different configurations to identify the impact of each hardware resource. We then compare a final implementation against existing computing platforms on three popular datasets: Cora, CiteSeer, and PubMed. The dimensions and densities of each dataset are shown in Table 1.

5.1 Experiment Design

We evaluate LW-GCN on a two-layer GCN with a hidden size of 16, whose dense weights and biases are trained via the state-of-the-art framework PyG. Note that this setup is identical to the GCN used in Reference [13], against which we compare. In addition, to demonstrate the flexibility of our approach, we extend our evaluation to GraphSAGE on the same datasets.

LW-GCN is implemented in Verilog HDL and deployed onto a Xilinx Kintex-7 K325T FPGA, where we measure execution time and energy consumption. The DDM is implemented with LUT RAM while the other on-chip memories are implemented with Block RAM (BRAM), because the memory banks in the DDM require small depth and high bandwidth, for which LUT RAM is more suitable than BRAM. In this section, we first explore the impact of tile size and dense input replication on execution latency. Then, we present a breakdown of latency into the individual steps of loading, computation, and data movement. Finally, we present an overall performance comparison against existing platforms in terms of latency and energy efficiency.

The preprocessing time for all evaluated datasets is shown in Table 4. For reference, we also provide the time it takes to read the corresponding data from CSV files. The preprocessing time is comparable to the data loading time; moreover, preprocessing runs only once per dataset, so its cost is acceptable.

Dataset             CSV Load Time   Preprocess Time
Cora Features       95.9 ms         19.8 ms
Cora Edges          20.3 ms         8.67 ms
CiteSeer Features   177 ms          46.6 ms
CiteSeer Edges      25.4 ms         11.0 ms
PubMed Features     1.53 s          279 ms
PubMed Edges        167 ms          228 ms

Table 4. Preprocess Time

5.2 Hyper Parameter Impact

During each SpMM step, the dense input of one tile is stored in on-chip LUT RAM, where multiple rows are stored on the same memory slice to fully utilize it. Since only a single row can be read from each LUT RAM slice at a time, data collisions occur when multiple reads target the same RAM slice in the same cycle. As explained in Section 4.2, both data replication and row grouping effectively reduce data collisions, and fewer collisions in turn reduce SpMM latency. However, due to the irregular nature of graph adjacency matrices, individual rows have very different sparsity, which results in PE imbalance; we statistically minimize this effect by using larger tiles. As the GCN has a hidden size of 16, we set each PE to have 16 multiply-accumulators and fix the relationship between tile size \(T\) and row grouping \(g\) as \(T=16g\). We therefore evaluate the latency impact of dense data replication \(r\) and tile size \(T\), as shown in Figure 6. Computation latency decreases with more dense data replicas as well as with larger tile sizes. At eight replicas, LW-GCN's SpMM latency is reduced by up to 44.23% (on PubMed) compared to one replica under the same 512-row tile setup. At 4,096-row tiles, SpMM latency is reduced by up to 61.83% (on PubMed) with the same replication setup. The ideal cases in Figure 6 are estimated by summing the total workload and assuming every PE is fully utilized.

Fig. 6. Impact of (a) dense data replication with 512-row tiles and (b) tile size with one replica.

Due to resource limitations, it is infeasible to keep expanding tile sizes and replication numbers. Given the tile size \(T\) and data replication \(r\), the number of LUT RAMs needed can be expressed as Equation (3), (3) \(\begin{equation} \# LUTRAM = \frac{T \times 16 \times r}{16} = T \times r, \end{equation}\) where the 16 in the numerator indicates the data width. Equation (3) is divided by 16, because each LUT RAM can store 16 bits of data [31]. Since we insert two registers in each LUT RAM to achieve a higher working frequency, the number of flip-flops (FFs) can be expressed as Equation (4). (4) \(\begin{equation} \#FF = 2\times \#LUTRAM. \end{equation}\)

Since the dense data are distributed to each PE in one PE group, we need multi-bit multiplexers to select data from the appropriate memory. In Xilinx FPGAs, each 8-bit multiplexer is implemented with 1 F7 MUX and 2 LUTs [5]. The numbers of F7 MUXes and LUTs required are therefore, respectively, (5) \(\begin{equation} \#F7\ MUX = \frac{T}{16} \times g \times \frac{256}{8} = 2\times T \times g, \end{equation}\) (6) \(\begin{equation} \#LUT = 2\times \#F7\ MUX. \end{equation}\)

In Equation (5), \(\frac{T}{16} \times g\) indicates the number of multi-bit multiplexers. It is divided by 16, because the 16 multiply-accumulators in each PE can share the same multi-bit multiplexer. Since we use 256-bit multiplexers (each PE has 16 multiply-accumulators and each datum is in SINT16 format), the number of multi-bit multiplexers is multiplied by \(256/8\) to obtain the number of F7 MUXes.
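Equations (3)–(6) can be bundled into a small resource calculator. The Python sketch below is ours for illustration; it just evaluates the four equations for a given configuration:

```python
def dense_buffer_resources(T, r, g):
    """Resource estimate for storing/fetching the dense tile,
    following Equations (3)-(6): T = tile size, r = replicas,
    g = row groups (T = 16 g in the paper's configuration)."""
    lutram = T * 16 * r // 16           # Eq. (3): 16-bit data, 16 bits per LUT RAM
    ff = 2 * lutram                     # Eq. (4): two pipeline registers per LUT RAM
    f7mux = (T // 16) * g * (256 // 8)  # Eq. (5): 256-bit muxes, 1 F7 MUX per 8 bits
    lut = 2 * f7mux                     # Eq. (6): 2 LUTs per 8-bit multiplexer
    return lutram, ff, f7mux, lut
```

With \(T = 512\), \(r = 4\), and \(g = 32\), this yields 2,048 LUT RAMs, 4,096 FFs, 32,768 F7 MUXes, and 65,536 LUTs for the dense buffer; the F7 MUX count matches the usage reported in Table 5.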

We also evaluate resource utilization on the resource-limited device with respect to dense data replication \(r\) and tile size \(T\), as shown in Figure 7. According to the results in Figure 7, the resource utilization for storing and fetching dense input data follows our analysis in Equations (3)–(6). Moreover, \(r = 4\) and \(T = 512\, (g = 32)\) achieve the best balance between resources and performance on this FPGA and are used for the remainder of the experiments. When the FPGA platform varies, the hyperparameters can easily be re-derived from Equations (3)–(6) under the new resource constraints. Given these hyperparameters, the overall resource utilization on the Kintex-7 K325T FPGA is shown in Table 5.

Fig. 7. Resource consumption of (a) replication and (b) tile size.

Resource         LUT      LUT RAM  FF       F7 MUX   BRAM   DSP
Used             161,529  33,804   94,369   32,768   291.5  512
Available        203,800  64,000   407,600  101,900  445    840
Utilization (%)  79.26    52.82    23.15    32.16    65.51  60.95

Table 5. Resource Utilization on Kintex-7 325T FPGA

5.3 Latency Breakdown

During preprocessing, we inject empty elements (see Section 3.2) to handle the corner cases where a row \(X_i\) contains no non-zero elements or where a PE completes its execution. This enables each PE to internally track the current row \(i\), which allows us to remove the row number \(i\) from off-chip memory and reduce memory bandwidth consumption. We also inject empty-element symbols when two elements are to be read from different depths of the same memory, to prevent data collision. Figure 8 shows the latency breakdown for overall runtime (including MM/SpMM, memory load, and on-chip data movement) as well as for SpMM (including computation, PE imbalance, and data collision). The latency of MM is dominated by computation, since MM suffers from neither PE imbalance nor data collision; therefore, we do not list a latency breakdown for MM. In both cases, the time spent on computation (i.e., MM/SpMM for overall and computation for SpMM) is dominant. PubMed has a relatively larger PE imbalance because of its higher sparsity and irregularity (the edge density of PubMed is about \(1/5\) of that in Cora and CiteSeer). Under the round-robin workload assignment and scheduling scheme, the number of non-zero elements in one row can exceed the sum of the non-zero elements in the other rows, causing an imbalance. The higher PE imbalance on PubMed also indicates potential room for improvement through workload assignment and scheduling, which will be explored in the future. For example, we could assign non-zero elements of the same row to multiple PEs to balance the PEs, at the cost of extra buffers and adders on hardware to ensure correct computation.
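The round-robin effect described above can be shown with a short model. This Python sketch (our simplification: it ignores tiling and collisions and measures workload in non-zero counts) assigns rows to PEs round-robin and reports how far the slowest PE exceeds the average load:

```python
def round_robin_imbalance(nnz_per_row, num_pes):
    """Assign row i to PE (i % num_pes) and return the imbalance:
    the heaviest PE's workload minus the average workload,
    measured in non-zero elements (i.e., multiply-accumulate steps)."""
    loads = [0] * num_pes
    for i, nnz in enumerate(nnz_per_row):
        loads[i % num_pes] += nnz
    return max(loads) - sum(loads) / num_pes
```

A single dominant row (as occurs in the sparser, more irregular PubMed) leaves all other PEs idle while one PE works, which is exactly the imbalance visible in Figure 8(b).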

Fig. 8. Latency breakdown for (a) full execution and (b) SpMM.

We further evaluate the per-PE utilization rates with respect to the combination and aggregation operations. For simplicity, we only show the first layer on the Cora dataset in Figure 9. The idle time of each PE varies from 6% to 12% for combination and from 1% to 20% for aggregation. Overall, the least utilized PE is idle for less than 20% of the SpMM time.

Fig. 9. PE utilization during SpMM for Cora: (a) first combination tile and (b) first aggregation tile.

5.4 Overall Comparison

We evaluate the overall latency and energy efficiency of LW-GCN against the Intel Xeon Gold 5218 CPU, NVIDIA Xavier NX edge GPU, NVIDIA RTX3090 GPU, and the state-of-the-art FPGA-based GCN accelerator AWB-GCN [13], and report results in the top half (GCN) of Table 6. Note that AWB-GCN is implemented on an Intel Stratix 10 D5005 at a frequency of 330 MHz and uses 8,192 DSP slices; we normalize its reported latency and energy efficiency to our FPGA (200 MHz and 512 DSP slices) for a fair comparison. For energy efficiency, the Intel Stratix 10 D5005 uses 14-nm transistors while the Xilinx Kintex-7 325T uses 28-nm transistors; following the analysis in Reference [17], we normalize their power consumption by \((\frac{28}{14})^2=4\times\). As illustrated in Table 6, LW-GCN outperforms all the other platforms on GCN in terms of latency and energy efficiency. Specifically, LW-GCN achieves up to 60\(\times\), 32\(\times\), 12\(\times\), and 1.7\(\times\) speedup, as well as 2478\(\times\), 84\(\times\), 511\(\times\), and 3.88\(\times\) energy efficiency, compared with the CPU, edge GPU, GPU, and AWB-GCN, respectively. LW-GCN achieves these benchmarks on a small resource budget, thanks to the software preprocessing and micro-architecture techniques that reduce data collision and PE imbalance for SpMM, and to performing MM and SpMM on a unified architecture.
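The normalization used above can be reproduced with simple arithmetic. The sketch below (Python; the proportional-scaling model is our reading of the text, not a formula the authors give explicitly) scales reported latency by clock rate and DSP count, and power by the process factor from Reference [17]:

```python
def normalize_latency(lat_ms, freq_mhz, dsps, target_freq=200, target_dsps=512):
    """Scale a reported latency to our clock and DSP budget,
    assuming latency is inversely proportional to both."""
    return lat_ms * (freq_mhz / target_freq) * (dsps / target_dsps)

def normalize_power(power_w, src_nm=14, target_nm=28):
    """Scale power across process nodes by (target/src)^2, per Reference [17]."""
    return power_w * (target_nm / src_nm) ** 2
```

As a sanity check, applying the latency scaling to AWB-GCN's reported 31.81 ms on Reddit gives roughly 839.8 ms, consistent with the scaled figure quoted in Section 5.5.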

                               Latency (ms) [speedup]                          Energy efficiency (graph/kJ)
Platform (Clock rate: GHz)     Cora            CiteSeer        PubMed          Cora      CiteSeer  PubMed
GCN
Intel Xeon Gold 5218 (2.1)     1.89 [1×]       3.88 [1×]       12.5 [1×]       4.23E3    2.06E3    640
NVIDIA Xavier NX (1.1)         1.87 [1×]       1.88 [2.1×]     2.01 [6.2×]     3.57E4    3.55E4    3.32E4
NVIDIA RTX3090 (1.7)           0.492 [3.9×]    0.481 [8.1×]    0.491 [26×]     5.83E3    5.95E3    5.83E3
AWB-GCN (0.2)                  0.0613 [31×]    0.115 [35×]     0.791 [16×]     7.70E5    4.82E5    6.21E5
LW-GCN (0.2)                   0.0412 [46×]    0.0652 [60×]    0.571 [22×]     2.98E6    1.88E6    2.14E5
GraphSAGE
Intel Xeon Gold 5218 (2.1)     172 [1×]        385 [1×]        340 [1×]        46.5      20.8      23.5
NVIDIA Xavier NX (1.1)         10.6 [16.3×]    9.63 [40.0×]    10.8 [31.5×]    6.28E3    6.92E3    6.17E3
NVIDIA RTX3090 (1.7)           1.94 [89.0×]    1.88 [204.8×]   1.96 [173.6×]   1.47E3    1.52E3    1.46E3
AWB-GCN (0.2)                  NA              NA              NA              NA        NA        NA
LW-GCN (0.2)                   0.086 [2.01E3×] 0.14 [2.75E3×]  1.07 [318×]     1.42E6    8.77E5    1.72E4

Table 6. Comparison with CPU, Edge GPU, General GPU, and Existing FPGA Accelerator on GCN and GraphSAGE

5.5 Extending LW-GCN to Other Algorithms

Although LW-GCN is designed as a GCN accelerator, the underlying MM/SpMM acceleration is not limited to GCN and can be applied to any MM/SpMM-related GNN workload. In fact, due to the sparse nature of graph adjacency matrices and the dense nature of weight matrices, most GNN workloads involve MM/SpMM. As a proof of concept, we directly applied LW-GCN to GraphSAGE [14] on the same datasets and achieved an acceleration of up to 2750\(\times\), 123\(\times\), and 22.6\(\times\) and energy savings of up to 42200\(\times\), 226\(\times\), and 966\(\times\) over the CPU, edge GPU, and GPU, respectively, as shown in the bottom half of Table 6. Note that AWB-GCN results for GraphSAGE are not available in the literature. Additionally, note that the PyG implementation of GraphSAGE computes aggregation before concatenation, which on the three datasets we used involves a sparse-sparse matrix multiplication; its latency is therefore much higher than it could be.

We also evaluate the inference latency of a larger dataset (Reddit [14]) and a deeper GCN architecture (GraphSAINT [37]) on LW-GCN to validate its flexibility. First, we run GCN on the larger Reddit dataset, which has 232,965 nodes and 602 features. We partition Reddit into small tiles that fit LW-GCN and run all tiles iteratively to obtain the inference results. The total inference time for Reddit on LW-GCN is 1,249.6 ms. For reference, we also evaluate Reddit inference on the CPU and GPUs. The inference time on the CPU is 29,400 ms, which is 23.5\(\times\) slower than LW-GCN; the GPUs all run out of memory and cannot finish inference. Thus, our work still outperforms the CPU/GPU on large datasets. Moreover, the originally reported inference time of AWB-GCN [13] is 31.81 ms, which would scale to 839 ms given our frequency and DSP usage. This performance gap arises because LW-GCN targets an edge device whose bandwidth is constrained compared with the FPGA board used by AWB-GCN: 54.5% of inference time is spent on data loads and stores, whereas a typical value is 6–23%, as shown in Figure 8. Second, we run the Cora dataset with the GraphSAINT architecture, which has six graph convolutional layers. The inference latency is 0.225 ms, of which 21.5% is used for data communication and the rest for computation. This is quite similar to running GCN, because all layers in GraphSAINT share similar computation operations with GCN. These two examples show that LW-GCN also works on larger datasets and deeper GNNs, although for larger datasets the results are less favorable because of resource limitations and off-chip memory bandwidth.
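The tiled iteration over a large graph can be sketched as follows (an illustrative scheduling helper in Python, not the actual control logic; the 512-row tile size follows the configuration chosen in Section 5.2):

```python
def iter_tiles(num_rows, tile_size):
    """Yield (start, end) row ranges that partition a large graph
    into tiles processed iteratively by the accelerator."""
    for start in range(0, num_rows, tile_size):
        yield start, min(start + tile_size, num_rows)
```

For Reddit's 232,965 node rows with 512-row tiles, this produces 456 tiles, the last of which is a partial tile of 5 rows.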

6 CONCLUSIONS AND FUTURE WORK

GCN inference involves heavy multiplication of sparse and dense matrices, but most neural network accelerators target CNNs with dense matrix multiplication and are therefore inefficient for GCN. The recent FPGA-based AWB-GCN improves performance but still requires a large amount of on-chip memory, making it inapplicable to resource-limited hardware platforms such as edge devices.

In this article, we have proposed LW-GCN, a software-hardware co-designed accelerator for GCN inference. LW-GCN consists of a software preprocessing algorithm and an FPGA-based hardware accelerator. The core of LW-GCN is our SpMM design, which reduces memory needs through tiling, data quantization, sparse matrix compression, and workload assignment with data collision resolution. Experiments show that for GCN, LW-GCN reduces latency by up to 60\(\times\), 12\(\times\), and 1.7\(\times\) compared to CPU, GPU, and AWB-GCN and increases power efficiency by up to 912\(\times\), 511\(\times\), and 3.87\(\times\), respectively. Additionally, the underlying SpMM design of LW-GCN is not limited to GCN and is applicable to other graph neural network algorithms such as GraphSAGE.

REFERENCES

  [1] Azzam Rana, Kong Felix H., Taha Tarek, and Zweiri Yahya. 2021. Pose-graph neural network classifier for global optimality prediction in 2D SLAM. IEEE Access 9 (2021), 80466–80477.
  [2] Baldassarre Federico and Azizpour Hossein. 2019. Explainability techniques for graph convolutional networks. arXiv:1905.13686.
  [3] Cabanes Cecile, Grouazel Antoine, Schuckmann Karina von, Hamon Michel, Turpin Victor, Coatanoan Christine, Paris François, Guinehut Stephanie, Boone Cathy, Ferry Nicolas, and C. de Boyer Montégut. 2013. The CORA dataset: Validation and diagnostics of in-situ ocean temperature and salinity measurements. Ocean Sci. 9, 1 (2013), 1–18.
  [4] Caragea Cornelia, Wu Jian, Ciobanu Alina, Williams Kyle, Fernández-Ramírez Juan, Chen Hung-Hsuan, Wu Zhaohui, and Giles Lee. 2014. CiteSeerX: A scholarly big dataset. In Proceedings of the European Conference on Information Retrieval. Springer, Cham, 311–322.
  [5] Chapman Ken. 2014. Multiplexer design techniques for datapath performance with minimized routing resources. Application Note. http://www.xilinx.com.
  [6] Chen Fanfei, Wang Jinkun, Shan Tixiao, and Englot Brendan. 2019. Autonomous exploration under uncertainty via graph convolutional networks. In Proceedings of the International Symposium on Robotics Research. Springer, Cham, 676–691.
  [7] Chen Xiaobing, Wang Yuke, Xie Xinfeng, Hu Xing, Basak Abanti, Liang Ling, Yan Mingyu, Deng Lei, Ding Yufei, Du Zidong, and Y. Chen. 2020. Rubik: A hierarchical architecture for efficient graph learning. arXiv:2009.12495.
  [8] Chiang Wei-Lin, Liu Xuanqing, Si Si, Li Yang, Bengio Samy, and Hsieh Cho-Jui. 2019. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 257–266.
  [9] Dernoncourt Franck and Lee Ji Young. 2017. PubMed 200k RCT: A dataset for sequential sentence classification in medical abstracts. arXiv:1710.06071.
  [10] Feng Boyuan, Wang Yuke, Li Xu, Yang Shu, Peng Xueqiao, and Ding Yufei. 2020. SGQuant: Squeezing the last bit on graph neural networks with specialized quantization. In Proceedings of the IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI'20). IEEE, 1044–1052.
  [11] Fey Matthias, Lenssen Jan Eric, Weichert Frank, and Müller Heinrich. 2018. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 869–877.
  [12] Frick Achim and Rochman Arif. 2004. Characterization of TPU-elastomers by thermal analysis (DSC). Polymer Test. 23, 4 (2004), 413–417.
  [13] Geng Tong, Li Ang, Shi Runbin, Wu Chunshu, Wang Tianqi, Li Yanfei, Haghi Pouya, Tumeo Antonino, Che Shuai, Reinhardt Steve, and Herbordt Martin. 2020. AWB-GCN: A graph convolutional network accelerator with runtime workload rebalancing. In Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'20). IEEE, 922–936.
  [14] Hamilton Will, Ying Zhitao, and Leskovec Jure. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, Vol. 30.
  [15] Han Song, Liu Xingyu, Mao Huizi, Pu Jing, Pedram Ardavan, Horowitz Mark A., and Dally William J. 2016. EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Comput. Arch. News 44, 3 (2016), 243–254.
  [16] Kipf Thomas N. and Welling Max. 2016. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907.
  [17] Li Fei, Lin Yizhou, He Lei, Chen Deming, and Cong Jason. 2005. Power modeling and characteristics of field programmable gate arrays. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 24, 11 (2005), 1712–1724.
  [18] Li Jiajun, Louri Ahmed, Karanth Avinash, and Bunescu Razvan. 2021. GCNAX: A flexible and energy-efficient accelerator for graph convolutional neural networks. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA'21). IEEE, 775–788.
  [19] Liang Shengwen, Wang Ying, Liu Cheng, He Lei, Li Huawei, Xu Dawen, and Li Xiaowei. 2020. EnGN: A high-throughput and energy-efficient accelerator for large graph neural networks. IEEE Trans. Comput. 70, 9 (2020), 1511–1525.
  [20] Liu Shaoli, Du Zidong, Tao Jinhua, Han Dong, Luo Tao, Xie Yuan, Chen Yunji, and Chen Tianshi. 2016. Cambricon: An instruction set architecture for neural networks. In Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA'16). IEEE, 393–405.
  [21] O'Shea Keiron and Nash Ryan. 2015. An introduction to convolutional neural networks. arXiv:1511.08458.
  [22] Scarselli Franco, Gori Marco, Tsoi Ah Chung, Hagenbuchner Markus, and Monfardini Gabriele. 2008. The graph neural network model. IEEE Trans. Neural Netw. 20, 1 (2008), 61–80.
  [23] Schlichtkrull Michael, Kipf Thomas N., Bloem Peter, Berg Rianne Van Den, Titov Ivan, and Welling Max. 2018. Modeling relational data with graph convolutional networks. In Proceedings of the European Semantic Web Conference. Springer, Cham, 593–607.
  [24] Song Xinkai, Zhi Tian, Fan Zhe, Zhang Zhenxing, Zeng Xi, Li Wei, Hu Xing, Du Zidong, Guo Qi, and Chen Yunji. 2022. Cambricon-G: A polyvalent energy-efficient accelerator for dynamic graph neural networks. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 41, 1 (2022), 116–128.
  [25] Sun Mengying, Zhao Sendong, Gilvary Coryandar, Elemento Olivier, Zhou Jiayu, and Wang Fei. 2020. Graph convolutional networks for computational drug development and discovery. Brief. Bioinform. 21, 3 (2020), 919–935.
  [26] Sutskever Ilya, Martens James, and Hinton Geoffrey E. 2011. Generating text with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML'11).
  [27] Tailor Shyam A., Fernandez-Marques Javier, and Lane Nicholas D. 2020. Degree-Quant: Quantization-aware training for graph neural networks. arXiv:2008.05000.
  [28] Thekumparampil Kiran K., Wang Chong, Oh Sewoong, and Li Li-Jia. 2018. Attention-based graph neural network for semi-supervised learning. arXiv:1803.03735.
  [29] Veličković Petar, Cucurull Guillem, Casanova Arantxa, Romero Adriana, Lio Pietro, and Bengio Yoshua. 2017. Graph attention networks. arXiv:1710.10903.
  [30] Xie Cong, Yan Ling, Li Wu-Jun, and Zhang Zhihua. 2014. Distributed power-law graph computing: Theoretical and empirical analysis. In Advances in Neural Information Processing Systems, Ghahramani Z., Welling M., Cortes C., Lawrence N., and Weinberger K. Q. (Eds.), Vol. 27.
  [31] Xilinx. 2015. LogiCORE IP Distributed Memory Generator v8.0 Product Guide.
  [32] Xu Keyulu, Hu Weihua, Leskovec Jure, and Jegelka Stefanie. 2018. How powerful are graph neural networks? arXiv:1810.00826.
  [33] Yan Mingyu, Deng Lei, Hu Xing, Liang Ling, Feng Yujing, Ye Xiaochun, Zhang Zhimin, Fan Dongrui, and Xie Yuan. 2020. HyGCN: A GCN accelerator with hybrid architecture. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA'20). IEEE, 15–29.
  [34] Ying Rex, He Ruining, Chen Kaifeng, Eksombatchai Pong, Hamilton William L., and Leskovec Jure. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 974–983.
  [35] Yu Yunxuan, Wu Chen, Zhao Tiandong, Wang Kun, and He Lei. 2019. OPU: An FPGA-based overlay processor for convolutional neural networks. IEEE Trans. VLSI Syst. 28, 1 (2019), 35–47.
  [36] Yu Yunxuan, Zhao Tiandong, Wang Kun, and He Lei. 2020. Light-OPU: An FPGA-based overlay processor for lightweight convolutional neural networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 122–132.
  [37] Zeng Hanqing, Zhou Hongkuan, Srivastava Ajitesh, Kannan Rajgopal, and Prasanna Viktor. 2019. GraphSAINT: Graph sampling based inductive learning method. arXiv:1907.04931.
  [38] Zhang Shijin, Du Zidong, Zhang Lei, Lan Huiying, Liu Shaoli, Li Ling, Guo Qi, Chen Tianshi, and Chen Yunji. 2016. Cambricon-X: An accelerator for sparse neural networks. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'16). IEEE, 1–12.


      • Published in

        ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 1 (March 2023), 403 pages
        ISSN: 1936-7406
        EISSN: 1936-7414
        DOI: 10.1145/35733111
        • Editor: Deming Chen

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 December 2022
      • Online AM: 4 August 2022
      • Accepted: 1 July 2022
      • Revised: 28 June 2022
      • Received: 2 November 2021
      Published in trets Volume 16, Issue 1
