Large-Scale Simulation of Structural Dynamics Computing on GPU Clusters

Structural dynamics simulation plays an important role in research on reactor design and complex engineering. The Hybrid Total Finite Element Tearing and Interconnecting (HTFETI) method combined with the Newmark method is an efficient way to solve large-scale structural dynamics problems. However, the sparse direct solver and the load imbalance caused by inconsistent-density meshes are two critical issues limiting the performance and the scalability of structural dynamics computing. For the former, we propose an efficient variable-size batched method to accelerate sparse matrix-vector multiplication (SpMV) on GPUs. For the latter, we establish an online performance prediction model, based on which we design a novel inter-cluster subdomain fine-tuning algorithm to balance the workload of HTFETI parallel computing. We are the first to achieve a high-fidelity structural dynamics simulation of the China Experimental Fast Reactor core assembly with up to 53.4 billion grid elements. The weak and strong scalability efficiencies reach 91.77% and 86.13% on 12,800 GPUs, respectively.


INTRODUCTION
Structural dynamics simulation plays an important role in reactor design, transportation, aerospace, energy exploration and development, and major engineering systems. With the increasing complexity of large equipment and the continuous improvement of reliability requirements, there is an urgent need for large-scale, fine-grained structural dynamics computation. However, high-precision simulation of such problems often involves large-scale and complex finite element calculation, and meshes of more than 10 billion elements make high scalability difficult to achieve. For example, in the fluid-induced vibration (FIV) simulation of a single wire-wound fuel rod with one pitch, it takes several hours to simulate the structural dynamics process on one hundred thousand grid elements using the open-source software code_Aster [17]. For simulation of a whole reactor core assembly with millions or even billions of grid elements, conventional Finite Element Method (FEM) computing would become unacceptable in terms of memory space and time. Overcoming these challenges places high demands on both algorithms and computing power.
The Hybrid Total Finite Element Tearing and Interconnecting (HTFETI) method [41] combined with the Newmark method of time integration is an important approach to solving large-scale structural dynamics problems. The HTFETI method realizes efficient parallel computation of large-scale finite element simulation, while the Newmark method [30] advances the time stepping during structural dynamics computing. The HTFETI method is a domain decomposition method built on Krylov-subspace iterative methods for parallel computing. It divides the problem domain into non-overlapping subdomains, and a number of subdomains are then combined into clusters, which can be seen as a two-level domain decomposition approach. Continuity between the subdomains on the boundary is enforced by Lagrange Multipliers (LM). With this domain-cluster decomposition strategy, the HTFETI method has the potential to handle problems decomposed into a large number of subdomains thanks to the reduced coarse problem. Its lower memory consumption makes it a scalable method that allows the parallelization of the original problem to scale to thousands of cores. However, this decomposition strategy also brings new load imbalance challenges in large-scale calculations. The computing cost of the HTFETI method is sensitive to the number of grid boundary elements rather than the number of grid interior elements. This causes additional computing overhead due to load imbalance when dealing with inconsistent-density meshes, which are ubiquitous in the computation of complex mechanical systems. Meanwhile, current Top500 supercomputers [37] are usually equipped with many-core accelerators such as GPUs, so fully utilizing their computing power is crucial for HTFETI optimization. The computing overhead and load imbalance [8] make optimization a high-reward task.
In HTFETI, the most time-consuming operation is the solve routine of the sparse direct solver, which is called for all subdomain matrices repeatedly in every iteration of the Preconditioned Conjugate Projected Gradient (PCPG) solver [25]. These sparse matrix-vector multiplication (SpMV) operations can take more than 95% of the total execution time [39]. For the multiplication of a large set of sparse matrices with their corresponding vectors, handling the distinct problems in sequence uses only a fraction of the hardware and incurs a large kernel launch overhead [6]. An efficient implementation of batched SpMV can address these issues. Since there is currently no batched SpMV routine in the rocSPARSE library [5], it is necessary to propose a batch processing scheme. Existing SpMV performance prediction is used to select a better format, kernel, or parameter for a given sparse matrix (e.g., [14, 27, 40]). The output of the prediction model is an SpMV method [14] or the potential speedup of a given method [40]. WISE [40] selects a variety of sparse matrix features but lacks correlation with specific algorithms. These models are usually built offline and used as a preprocessing step for matrix acceleration. Hence, it remains a challenge to quickly establish an accurate online SpMV performance prediction model.
The load balancing of FEM parallel decomposition often depends on graph partitioning software such as METIS [23], SCOTCH [33], and KaHIP [35], while the partition of the HTFETI assembled matrix is based on mesh boundary elements. Although such graph partitioning software can evenly divide the mesh interior elements, it cannot guarantee the balance of mesh boundary elements for some complex models in HTFETI. For example, the maximum running time is up to 5 times the minimum running time and 2 times the average running time in the structural dynamics simulation of the China Experimental Fast Reactor (CEFR) whole-core assembly on thousands of compute nodes. It is necessary to develop an efficient and accurate load balancing algorithm for HTFETI before the CG iterations.
In this paper, we develop high-fidelity structural dynamics simulation software based on a performance-model-guided load balancing framework to overcome these challenges and achieve high efficiency. Our main contributions are summarized as follows:
• We propose a load balancing framework guided by the performance prediction model. This framework is easily embedded in large numerical computing processes that use the SpMV kernel on GPUs and can be flexibly applied in many application scenarios.
• We propose a variable-size batched processing scheme and use a multi-stream pipeline to accelerate SpMV for a large number of sparse matrices on GPUs. We establish an online performance model based on Ridge Regression to predict SpMV performance using several features closely related to the SpMV algorithm and the GPU architecture.
• We propose an Inter-Cluster Subdomain Fine-Tuning (ICSFT) algorithm to adjust the computing load of parallel HTFETI accurately based on the online performance prediction of SpMV. Both the computing task and memory usage are further optimized by fully utilizing our proposed performance model.
• We realize these optimizations on an AMD Radeon Instinct MI60 GPU cluster and achieve improvements in performance and load balancing. We complete the structural dynamics simulation of the CEFR whole-core fuel assembly with 53.4 billion mesh elements for the first time and obtain good scalability on 12,800 GPUs. All parameters are obtained from the real environment of the CEFR whole-core assembly. The FIV behavior of whole-core fuel assemblies under coolant flushing is finely simulated, which provides an important reference for reactor design.

BACKGROUND
In this section, we briefly introduce the structural dynamics simulation problem solved by the HTFETI method and the Newmark method, and then explain the motivation for load balancing optimization.

HTFETI for Structural Dynamics Simulation
The structural dynamics model in this paper comprises the global dynamics equation, which is advanced in time. The governing equations, derived by the principle of virtual work, are modeled from the principles of conservation of mass, momentum balance, and energy conservation. They are formulated as Equation 1:

M ü(t) + C u̇(t) + K u(t) = f(t).   (Equation 1)

In this equation, the unknown u represents displacement, M is the mass matrix, C is the damping matrix, and K is the stiffness matrix. f(t) is the external force acting on the dynamics system. The Newmark method [30] is used to discretize the dynamic Equation 1 in the time direction, which yields Equation 2. Rearranging Equation 2 for u_{t+1}, we obtain Equation 3, a linear system A u_{t+1} = b, where A is a matrix related to M, C, K and the Newmark parameters β and γ, and b is a vector related to the right-hand side (RHS) f_{t+1}, the matrices M and C, and the state at the t-th time step u_t.
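For reference, a standard statement of the Newmark-β update and the resulting effective linear system (our reconstruction of the pattern behind Equations 2 and 3 under the usual parameters β and γ, not necessarily the paper's exact expressions) reads:

```latex
% Standard Newmark-beta relations (assumed form of Equations 2-3).
\begin{aligned}
u_{t+1} &= u_t + \Delta t\,\dot u_t
  + \tfrac{\Delta t^{2}}{2}\big[(1-2\beta)\,\ddot u_t + 2\beta\,\ddot u_{t+1}\big],\\
\dot u_{t+1} &= \dot u_t + \Delta t\big[(1-\gamma)\,\ddot u_t + \gamma\,\ddot u_{t+1}\big],\\
A\,u_{t+1} &= b, \qquad
A = \tfrac{1}{\beta\Delta t^{2}}M + \tfrac{\gamma}{\beta\Delta t}C + K,\\
b &= f_{t+1}
  + M\Big(\tfrac{1}{\beta\Delta t^{2}}u_t + \tfrac{1}{\beta\Delta t}\dot u_t
          + \tfrac{1-2\beta}{2\beta}\ddot u_t\Big)
  + C\Big(\tfrac{\gamma}{\beta\Delta t}u_t + \big(\tfrac{\gamma}{\beta}-1\big)\dot u_t
          + \tfrac{\Delta t}{2}\big(\tfrac{\gamma}{\beta}-2\big)\ddot u_t\Big).
\end{aligned}
```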
The FETI method [16] is a non-overlapping domain decomposition method for large-scale FEM problems. It decomposes the linear system into several independent non-overlapping subdomains and introduces Lagrange multipliers (LM) to keep the mutual continuity of primal variables between neighboring subdomains. For large-scale problems, as the number of subdomains increases, the scale of the associated coarse problem increases proportionally. In this paper, we use the HTFETI method [41] to solve the linear system (Equation 3). As a two-level domain decomposition method, HTFETI decomposes the problem into subdomains, and the subdomains are then aggregated into clusters [24, 39, 42], which naturally results in a smaller coarse problem compared to the FETI method. In the HTFETI method, so-called corner nodes are selected to enforce the continuity of the solution within a cluster, and LM λ_0 are used to represent the connection relationship between subdomains in the cluster. The permutation and splitting of the matrix according to the connectivity of the subdomains in a cluster is given by Equation 4, with K = diag(K_1, ..., K_N) the block-diagonal matrix of subdomain stiffness matrices. Thanks to the block structure of K, Equation 4 can be reordered into Equation 5 according to clusters 1 to N_c. Equation 5 has a structure similar to the TFETI algorithm, so we can solve each system per cluster as the TFETI algorithm does per subdomain.
Equation 5 is simplified into Equation 6. With the Moore-Penrose inverse K^+ of the singular stiffness matrices, we get Equation 7 for the primal solution, where the vector α contains the amplitudes of the null-space modes and the matrix R is the known basis of the null space of K. Using this notation, we obtain the new Schur complement system (Equation 8).
The derived system of linear equations (Equation 8) can then be favorably solved by the PCPG method [25], and the displacement u_{t+1} at the current time step is obtained from Equation 7. Next, we update the state at time t+1 according to the displacement u_{t+1} and go to the next time-step solving loop, until the required time is reached. The computational accuracy of the Newmark method depends on the time step size Δt. The whole solving process is implemented in parallel, and iterations are completed by multiple MPI processes. Each MPI process corresponds to one cluster.
In the solver runtime stage, the PCPG method is executed, and its execution time is therefore given by the number of iterations and the time per iteration. The most time-consuming part of a single iteration is the execution of the solve routine of the sparse direct solver, which includes the most expensive operation, the application of the operator F (a matrix-vector product) assembled from the set of subdomain matrices belonging to each cluster (Equation 9). This operation can take up to 95% of the iteration time [39]. The right-hand side of Equation 9 is composed of the square matrices B K^+ B^⊤ of each subdomain. The number of matrix rows is equal to the number of dual unknowns λ, ranging from 800 to 8,000. For the simulation of fuel rods, these matrices are similar in skew and locality traits because they come from adjacent mesh elements. There are a large number of sparse matrix operations, which require calling the SpMV subroutine from a sparse BLAS. To reach the full potential of accelerators, it is crucial to speed up variable-size batched SpMV.
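To make this cost concrete, the sketch below (hypothetical names; scipy CSR matrices stand in for the device-resident B K^+ B^⊤ matrices) shows the per-cluster loop of independent small SpMVs that the variable-size batched scheme described later replaces:

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

# Hypothetical stand-ins for one cluster's subdomain matrices B_i K_i^+ B_i^T:
# square CSR matrices whose row count equals the number of dual unknowns.
rng = np.random.default_rng(0)
cluster_mats = [csr_matrix(sparse_random(n, n, density=0.01, random_state=rng))
                for n in (800, 1200, 3000, 8000)]

def apply_dual_operator(mats, lam_pieces):
    """One application of the operator F restricted to a cluster: every
    subdomain matrix multiplies its own slice of the dual vector.
    Handling the matrices one by one is what limits GPU utilization."""
    return [A @ lam for A, lam in zip(mats, lam_pieces)]

lam_pieces = [rng.standard_normal(A.shape[1]) for A in cluster_mats]
results = apply_dual_operator(cluster_mats, lam_pieces)
```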

Motivation
For general problems, we apply graph partitioning of the element connectivity graph to balance the number of mesh elements in each partition, so that the original unknowns are evenly distributed. However, the original graph partitioning method does not guarantee the balance of the partition boundary sizes required by HTFETI methods. It is difficult to balance the unknowns per cluster because the dual unknowns λ are defined on the boundaries. The size of the subdomain matrix B K^+ B^⊤ is related to λ, which makes the workload of the matrix-vector multiplications differ between clusters.
In complex mechanical systems, mesh density is ubiquitously inconsistent across irregular geometry models. For example, the fuel rod model [28] is composed of a hollow part and a solid part: mesh elements are densely connected in the solid part and sparsely distributed in the hollow part. After the two-level division of the HTFETI method, the model mesh elements are partitioned evenly into clusters, and the mesh elements in each cluster are then divided into multiple subdomains with the same number of elements. The subdomains and clusters in the solid part have more neighbors and therefore more contact area with their neighbors, whereas those in the hollow part have less. The connections between subdomains, and between subdomains and clusters, determine the workload of HTFETI. Therefore, the inconsistent distribution of model meshes leads to the load imbalance problem.

OVERVIEW
In this section, we introduce the overall framework of structural dynamics computing and discuss innovations and optimizations.
In structural dynamics simulation, two-level graph partitioning and the parallel assembly of the subdomain stiffness matrices (K) and gluing matrices (B) are done in the HTFETI preprocessing to ensure load balancing of the whole computing process. Then a time-stepping approach keeps time updated during dynamics computing; the Newmark integral method is used here to ensure accurate and stable time updates. After that, the PCPG solver is applied to realize the iterative solution of the HTFETI method efficiently. The basic computing process is shown in the left flow chart of Figure 1.
In massive computing, domain decomposition is a crucial factor in keeping the parallel computation load balanced. The original HTFETI [41] only roughly divides the static graph with METIS according to the number of domains before matrix assembly, without considering the actual computing load. We propose a model-guided load balancing framework to fine-tune the computing cost, shown in the right part of Figure 1.
The load balancing framework consists of a predictor and a balancer. Figure 1 shows the operations of the load balancing framework. First, several sparse matrices are sampled from the B K^+ B^⊤ matrices of each cluster after primary graph partitioning and matrix assembly (❷❸). Only 2-4 matrices are sampled randomly from each cluster because the B K^+ B^⊤ matrices within a cluster have similar features. These sampled matrices are then used for online construction of the performance prediction model through feature extraction (❹), and the batched SpMV library is called to ensure efficient sparse matrix-vector multiplication on GPUs during modeling. The performance prediction model accurately predicts the computing cost of each subdomain in the primary subdomain graph after the two-level graph partitioning, and then guides the assignment of subdomain weights according to the prediction result (❶❺❻). The balancer adopts the ICSFT method to move subdomains between clusters according to the subdomain weights predicted by the predictor (❼❽❾) and obtains a balanced subdomain graph for the HTFETI preprocessing. In the solving phase of PCPG, batched SpMV with the variable-size batched method and multi-stream pipeline routine is called to accelerate the sparse direct solver on GPUs.
The load balancing framework is easily embedded in other large-scale computing problems such as FEM simulation and CFD. Only the initial partition graph and the set of sparse matrices used in the matrix-vector multiplication need to be provided as input; the framework outputs a balanced domain decomposition graph for the subsequent solving process. Load balancing has a severe impact on many large-scale computing problems, and the given framework provides an efficient and accurate solution for this kind of problem.

VARIABLE-SIZE BATCHED SPMV
We focus on the acceleration of the SpMV for a large set of sparse matrices and the performance prediction model.

Acceleration
We use the Compressed Sparse Row (CSR) format to store the matrices, which is the most frequently used format. CSR uses three arrays to store a sparse matrix: values, column indices, and row pointers [5]. Greathouse and Daga [20] developed the CSR-Stream algorithm to achieve fine-grained load balancing and regular, coalesced memory accesses for sparse matrices with few non-zeros per row. The algorithm is implemented in rocSPARSE [5], and we use it to perform SpMV for our matrices. The algorithm consists of two steps. First, it streams values from memory to the local data share (LDS); the number of rows that fit into the LDS is determined dynamically according to the non-zeros. The second step is a per-row reduction, whose process rocSPARSE has optimized [5]. When the number of rows exceeds the number of threads in the work-group, each thread performs the reduction for a single row; otherwise, the work-group performs a tree-style reduction of each row.
Most batched operations assign a certain number of work-groups to each matrix of the batch, which still leaves several idle threads in the work-groups that handle the last few rows of each matrix. Obtaining the number of rows, columns, and non-zeros per matrix is also necessary for a variable-size batched kernel function, and since the SpMV kernel is memory-bound, this extra memory access can exacerbate the performance bottleneck. Instead, we package a set of matrices and place them within one kernel. Concretely, we modify each matrix's row pointer array and column index array, offsetting their values by the accumulated counts of the preceding matrices. The entire batch of matrices then conforms to the CSR storage format and can be processed by a single kernel function, while the features of each matrix are still preserved. Our method assigns the calculation work to every thread proportionately without extra memory access. It maximizes the use of hardware resources without numerous idle threads and avoids the kernel launch overhead at every iteration. A minimal host-side sketch of this packing step is given below.
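The packing sketch below uses scipy as a host-side stand-in; the names are ours and this is not the rocSPARSE API, only an illustration of the index-offset idea:

```python
import numpy as np
from scipy.sparse import csr_matrix

def pack_batch(mats):
    """Pack a list of CSR matrices into one block-diagonal CSR batch.
    Row pointers are shifted by the accumulated non-zero counts and
    column indices by the accumulated column counts, so the packed
    batch is itself a valid CSR matrix that one kernel launch can process."""
    vals, cols, rowptr = [], [], [0]
    nnz_off, col_off, n_rows, n_cols = 0, 0, 0, 0
    for A in mats:
        A = A.tocsr()
        vals.append(A.data)
        cols.append(A.indices + col_off)        # shift column indices
        rowptr.extend(A.indptr[1:] + nnz_off)   # shift row pointers
        nnz_off += A.nnz
        col_off += A.shape[1]
        n_rows += A.shape[0]
        n_cols += A.shape[1]
    return csr_matrix((np.concatenate(vals), np.concatenate(cols),
                       np.asarray(rowptr)), shape=(n_rows, n_cols))

# The packed batch multiplies the concatenation of the per-matrix input
# vectors; the output is the concatenation of the per-matrix results.
```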
Batching many small matrices into a larger one also eliminates most of the per-transfer overhead, which improves host-device data transfers. In addition to the batched method, we use pinned memory and a multi-stream pipeline schedule to optimize data transfers. Data in pinned host memory can be transferred to device memory without staging buffers or per-transfer pinning [34], so transfers between host and device can use DMA. The GPU module we use has two copy engines, responsible for sending and receiving data between CPU and GPU respectively, and one kernel engine that executes kernels. Putting the multiplications into multiple streams can overlap Host-to-Device (H2D) copies, Device-to-Host (D2H) copies, and kernel execution on the device, as shown in Figure 2. Multi-stream pipeline scheduling further improves parallelism [28]. These optimizations improve the total memory throughput.

Performance Prediction Model
In order to accurately predict the workload of the SpMV, the prediction model considers kernel execution time and data transfer time.
As shown in Figure 2, the three components H2D, D2H, and kernel execution can overlap with each other thanks to the two copy engines. Equation 10 estimates both the dominant-transfer and the dominant-kernel scenarios: the resulting execution time is modeled as the largest component plus the per-stream time spent on the other two components [38]. We use the PCIe transfer model [38] to predict the performance of data transfers and mainly focus on modeling kernel execution performance.
The PCIe transfer model [38] is established based on the LogGP model [1]. LogGP is an extension of the LogP model [13] that is suitable for predicting both long and short messages. LogGP [1] abstracts the communication through five parameters: the communication latency (L), the overhead (o), the gap per message (g), the number of processors (P), and the gap per byte for long messages (G). The PCIe transfer model [38] is expressed by Equation 11 and Equation 12, which give the time for transferring n bytes over DMA using s streams. For iterative solvers, the sparse matrices are pre-loaded on the device, so the transferred size n depends only on the sizes of the input and output vectors.
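A minimal sketch of how we read Equation 10 follows (the exact form in [38] may differ; the function name is ours):

```python
def pipelined_spmv_time(t_h2d, t_kernel, t_d2h, num_streams):
    """Assumed form of Equation 10: with the batch split across num_streams
    streams, the dominant component (transfers or kernel) runs back-to-back
    and hides most of the other two, which remain exposed for roughly one
    stream's share each."""
    parts = [t_h2d, t_kernel, t_d2h]
    dominant = max(parts)
    others = sum(parts) - dominant
    return dominant + others / num_streams
```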
As for the SpMV execution kernel, a sparse matrix's size, non-zero skew, and non-zero locality characteristics substantially impact performance [40]. Based on these insights and the CSR-Stream algorithm [20], we choose the most influential features to build the model.
Due to the irregular sparsity of the sparse matrix, the CSR-Stream algorithm still has performance bottlenecks. In the first step, the algorithm performs coalesced loads of the column indices and values of the non-zeros, which are stored consecutively. However, the input vector entries are accessed irregularly, depending on the column indices, so the memory access locality of the input vector is the main bottleneck of the first stage. In the second step, regardless of whether scalar reduction or tree-style reduction is used, there is a load imbalance problem between threads in a work-group.
Table 1 shows the features extracted from the sparse matrix, which mainly come from three aspects: workloads, memory access locality behavior, and load imbalance between threads.
(1) Workloads: We use the number of rows n_r and non-zeros n_nz, and the average number of non-zeros per row r_avg = n_nz / n_r (Equation 13) to measure the workload. The matrices we use are all square with the same number of rows n_r and columns n_c, so we only use one of them; n_c (n_r) represents the size of the input (output) vector. n_nz and r_avg indicate the total and the average amount of calculation.
(2) Memory access locality behavior: We use d_avg (Equation 14) and g_c (Equation 15) to capture the memory access locality behavior of the input vector. d_avg is the average column-to-column distance between each pair of consecutive non-zeros in a row. In Equation 14, n_pair_i is the number of consecutive non-zero pairs in row i, and d_{i,j} is the column-to-column distance of non-zero pair j in row i. d_avg indicates the interval in memory between vector data fetched by two consecutive threads. g_c is the Gini coefficient [26] of the non-zeros between column blocks. We divide every 8 columns into a column block starting from the first column, for the following reason. The L2 cache is 16-way associative with 64 B cache lines and LRU replacement [2]. For loads and stores, all accesses to any given aligned 64-byte cache line are coalesced into a single memory request [3], so 8 double-precision values can be fetched in one fully coalesced request. An extension of the Gini coefficient [26], g_c, is given by Equation 15, where n_cb is the number of column blocks and nnz_cb_k is the number of non-zeros in column block k. The Gini coefficient [26] represents the distribution imbalance of non-zeros between the column blocks and ranges between 0 and 1. A larger value means a more unbalanced distribution, that is, memory accesses are more likely to coalesce.
(3) Load imbalance between threads: The variance of the number of non-zeros per row, σ²_r, captures the spread of non-zeros between rows. In Equation 16, nnz_i is the number of non-zeros in row i. A balanced row distribution of non-zeros makes the reduction tasks evenly distributed among threads. A sketch of computing these features from a CSR matrix follows.
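The sketch below computes the features for one CSR matrix; the concrete forms of Equations 13-16, in particular the pairwise distance and the Gini coefficient over 8-column blocks, are our assumptions based on the descriptions above:

```python
import numpy as np
from scipy.sparse import csr_matrix

def extract_features(A: csr_matrix, block=8):
    """Features of Table 1 (assumed concrete forms of Equations 13-16)."""
    A = A.tocsr()
    n_r, n_c = A.shape
    nnz_per_row = np.diff(A.indptr)
    r_avg = A.nnz / n_r                      # Eq. 13: avg non-zeros per row
    var_r = nnz_per_row.var()                # Eq. 16: load imbalance proxy

    # Eq. 14 (assumed): average column distance between consecutive
    # non-zeros within each row, averaged over all such pairs.
    dists, pairs = 0.0, 0
    for i in range(n_r):
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
        if cols.size > 1:
            dists += np.diff(np.sort(cols)).sum()
            pairs += cols.size - 1
    d_avg = dists / pairs if pairs else 0.0

    # Eq. 15 (assumed): Gini coefficient of non-zero counts per 8-column block.
    blk_counts = np.bincount(A.indices // block,
                             minlength=(n_c + block - 1) // block).astype(float)
    x = np.sort(blk_counts)
    n = x.size
    gini = (np.sum((2 * np.arange(1, n + 1) - n - 1) * x)
            / (n * x.sum())) if x.sum() > 0 else 0.0

    return dict(n_r=n_r, n_nz=A.nnz, r_avg=r_avg,
                d_avg=d_avg, gini=gini, var_r=var_r)
```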
We use Ridge Regression (RR) to build the prediction model. RR introduces a regularization parameter to combat multicollinearity and overfitting [10], and it can be trained quickly without compromising accuracy. The training set consists of the matrix features shown in Table 1 and the floating-point performance under specific batched SpMV library parameters (the size of the batch and the number of streams). The SpMV floating-point performance P is given by Equation 17, P = c · Σ_{i=1}^{n_mat} nnz_i / t_k, where n_mat is the number of matrices, t_k is the SpMV kernel execution time, nnz_i is the number of non-zeros in matrix i, and the multiply-add operand count c for SpMV is 2. The floating-point performance predicted by the trained RR model is regarded as an approximation of P in Equation 17, and the variable t in Equation 10 is then obtained from Equation 17.
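A minimal sketch of the online training step, assuming scikit-learn's Ridge as the RR implementation (the function names and the alpha grid are ours):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

def spmv_gflops(nnz_list, kernel_time_s, c=2):
    """Equation 17: c * total non-zeros per batched run, divided by kernel time."""
    return c * sum(nnz_list) / kernel_time_s / 1e9

# X: one row of Table 1 features per sampled matrix (see extract_features above),
# y: measured per-matrix GFLOPS at a fixed batch size and stream count.
def train_perf_model(X, y):
    """Online Ridge Regression model; the regularization strength is chosen
    by cross-validation, mirroring the procedure described in the text."""
    search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=3)
    search.fit(np.asarray(X), np.asarray(y))
    return search.best_estimator_

# Predicted GFLOPS -> estimated kernel time for one subdomain matrix,
# by inverting Equation 17:  t_kernel ≈ 2 * nnz / (P_pred * 1e9).
```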

LOAD BALANCING
After the HTFETI two-level domain decomposition, each subdomain has been determined. A new graph of subdomains is also formed, where graph vertices represent subdomains and graph edges represent their adjacency relationships. Each cluster has a similar number of subdomains. However, the matrix-vector multiplication execution time of each subdomain differs, which results in a load imbalance problem between clusters. In this section, we propose the ICSFT algorithm to move subdomains between clusters according to the weights assigned to the subdomains in the graph. The details of the ICSFT algorithm are described in Algorithm 1, and a sketch is given after this paragraph. In line 5, each subdomain obtains its domain weight dw from the predicted running time returned by the performance prediction function domainSpmvPredict given in Equation 10. In line 6, the weight of the cluster in which the subdomain is located is updated by adding the corresponding subdomain weight dw, producing the cluster weight sum cw. The variable bw is the threshold for cluster balancing weights, which is adjusted by a balance parameter. When the maximum cluster weight cw_max is not greater than bw, the load is considered well balanced. We apply a marking method to adjust subdomains between clusters based on the predicted weights: once a cluster has been adjusted, it is marked and added to vis. In line 13, the function selectMaxUnvisited selects a cluster c that has not been adjusted and has a weight greater than bw. In line 17, a neighboring cluster cnei of cluster c is selected by the function selectMinNeighbor; it must not have been adjusted yet and must have the minimum weight. In line 18, the function selectFromBoundary selects a random subdomain within cluster c on the boundary between cluster c and cluster cnei. After that, the partition of the subdomain and the weights of cluster c and cluster cnei are updated in lines 19-21. Subdomains are moved until the weight of cluster c decreases to bw, after which cluster c no longer needs adjustment. The largest weight cw_max over all marked clusters is recorded in lines 25-27. When cw_max over all clusters decreases to bw, the clusters are considered balanced. If the load balancing state is not reached after one round of adjustment over all clusters, all of them are set to unmarked and another round of adjustment begins.
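A compact Python sketch of the balancing loop follows; the data-structure and function names are ours, and the extra termination guard (the `moved` flag) is a safeguard that is not part of Algorithm 1:

```python
def icsft(part, neighbors, dw, clusters, mu=1.05):
    """Sketch of ICSFT: `part` maps subdomain -> cluster, `neighbors` lists
    adjacent subdomains, `dw` holds predicted subdomain weights, and `mu`
    is the balance parameter used to set the threshold bw."""
    cw = {c: 0.0 for c in clusters}
    for s, c in part.items():
        cw[c] += dw[s]
    bw = mu * sum(cw.values()) / len(cw)          # balancing threshold
    while max(cw.values()) > bw:
        vis, moved = set(), False
        while True:
            # heaviest unvisited cluster still above the threshold
            cand = [c for c in clusters if c not in vis and cw[c] > bw]
            if not cand:
                break                              # round finished
            c = max(cand, key=lambda x: cw[x])
            while cw[c] > bw:
                # boundary subdomains of c paired with unvisited neighbor clusters
                pairs = [(s, part[t]) for s, cl in part.items() if cl == c
                         for t in neighbors[s]
                         if part[t] != c and part[t] not in vis]
                if not pairs:
                    break
                # move one boundary subdomain to the lightest neighbor cluster
                s, cnei = min(pairs, key=lambda p: cw[p[1]])
                part[s] = cnei
                cw[c] -= dw[s]
                cw[cnei] += dw[s]
                moved = True
            vis.add(c)
        if not moved:
            break                                  # no further progress possible
    return part
```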
Figure 3 shows the process of Algorithm 1. First, the weights of clusters 1-4 are greater than the threshold for cluster balancing weights in step (a). Cluster 1, with the largest weight, is selected and its subdomains are moved to the neighboring clusters 2-4 in step (b) until the weight of cluster 1 decreases to the threshold; cluster 1 is then marked as visited. In step (c), cluster 3 is the cluster with the largest weight, and subdomains on the boundary of cluster 3 are moved to the neighboring clusters 2 and 4, since cluster 1 is already marked. Cluster 4 does the same: its subdomains are moved to the unmarked clusters 5 and 6 in step (d) until it is balanced. In step (e), this round of adjustment ends because the neighboring clusters of cluster 2 are all marked, although the weight of cluster 2 is still greater than the threshold. We then go to the next round and continue the adjustment until all clusters are balanced in step (f). Because the degrees of mesh nodes are close to each other in numerical calculations, the ICSFT algorithm causes little increase in communication traffic. Compared with computation, communication still accounts for a relatively small proportion of HTFETI.

EVALUATION
In this section, we evaluate the acceleration and the performance prediction model. Both the evaluation and the application experiments are conducted on an AMD GPU cluster consisting of more than 4,000 compute nodes. The InfiniBand HDR interconnect between nodes adopts a fat-tree topology, and the transmission rate reaches 200 Gbps bidirectionally. Each compute node has 128 GB of DDR4 memory and is equipped with four AMD Instinct MI60 GPUs and one 32-core 2.50 GHz x86 CPU, interconnected by PCIe. Each GPU has 16 GB of HBM2 memory and uses the AMD Graphics Core Next (GCN) architecture [4], which is composed of 64 Compute Units (CUs). One CU has four SIMD units, each having 16 Vector Arithmetic Logic Units (VALUs). rocSPARSE version 2.2.0, included in ROCm version 5.2.0, is used for the implementation.

Acceleration Evaluation
This section presents the performance and throughput of the batched SpMV library. The experiments compare the performance and throughput of SpMV with different batch sizes and stream counts. The matrices used in the experiments are selected from the real-world structural dynamics simulation, and the experiment is run on one GPU. The kernel execution performance and the throughput are given by Equation 17 and Equation 18, respectively. The matrices B K^+ B^⊤ are pre-loaded on the GPU, so the data transferred in each iteration of the throughput evaluation consists of the input and output vectors. n_out_i and n_in_i are the sizes of the output vector and the input vector for matrix i, respectively. The variable t is the average time of 50 SpMV executions including data transfers. The byte size b for double precision is 8, and the unit of throughput is GB/s.
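A small sketch of how we read Equation 18 (the function name and exact form are our assumption):

```python
def transfer_throughput_gbs(n_in_sizes, n_out_sizes, t_avg_s, byte_size=8):
    """Assumed reading of Equation 18: per iteration only the input and output
    vectors cross PCIe (the matrices stay resident on the GPU), so throughput
    is the moved bytes divided by the average measured SpMV time."""
    moved_bytes = byte_size * (sum(n_in_sizes) + sum(n_out_sizes))
    return moved_bytes / t_avg_s / 1e9
```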
Figure 4 shows the kernel execution performance achieved by the batched SpMV library. In order to exercise multiple batches and streams, we use 4096 matrices in the test. The results reveal that the performance is less than 10 GFLOPS for these matrices when processed individually, although the SpMV kernels based on the CSR format are optimized at the thread level. The variable-size batched scheme brings an acceleration of up to 19.32 times, far exceeding the acceleration brought by the multi-stream method alone (batch size equal to 1). When the batch size exceeds 256, increasing the number of streams or batches has little effect on performance; in this case hardware resources are efficiently utilized, and the performance reaches around 90 GFLOPS. Figure 5 shows the throughput achieved by the batched SpMV library. We fix the number of matrices at 4096 to test the throughput with different numbers of streams and batches. For most cases, 4 streams bring approximately full acceleration. The amount of data processed in each iteration (data granularity) affects the throughput [29]. Reducing the data granularity reduces the non-dominant component overhead, but it increases the latency of data transmission and kernel invocation and degrades the performance of the batched SpMV kernel. How to select the optimal granularity for pipelining schedules is worth investigating, but is not covered in this paper.

Prediction Model Evaluation
In this section, we present performance model tests for the PCIe transfer, the kernel execution, and the complete iterative SpMV process.
To build the PCIe transfer model, we obtain the parameters in the following ways: L + o is approximated by measuring the transfer time of one byte [38]; G and g are estimated by measuring the transfer time of different data sizes using one and multiple streams, respectively. Table 2 shows the parameters for the test platform. We test the transfer time for 16 MB to 1 GB of data with the number of streams ranging from 1 to 16 to measure the accuracy of the model. For H2D data transfers, our model predicts performance within a maximum error of 4.31%; for D2H data transfers, the maximum error is 4.44%. This is accurate enough to predict the workload. Figure 6 and Figure 7 show the observed and estimated transfer times of 16 MB of data, for which the largest error is produced.

We randomly select 2 sparse matrices from each cluster as the training set to build the RR kernel performance prediction model. We perform multiple vector multiplications on the same sparse matrix to obtain per-matrix performance at large batch sizes. In the training process, we use cross-validation to test multiple values of the regularization parameter and find the ideal one. Equation 19 shows the kernel performance prediction model when the batch size is 1024 and the number of streams is 4: it is a linear combination of the Table 1 features with trained coefficients (for example, an intercept of 51.68462 and per-feature coefficients such as 0.00779 and -0.00051). The model uses the average performance of 50 SpMV runs for each matrix. The model training time is within 2 s when the SpMV of each cluster is executed in parallel, which is negligible in numerical simulation applications that require many executions of the PCPG. This ensures that the online preprocessing is fast and acceptable.

We test the model accuracy on 1600 matrix sets, each with 4096 matrices. A key factor affecting performance is the amount of computation in the batch, so we can use the estimated time as an approximation of the observed time regardless of whether the matrices within a batch are identical. Figure 8 shows the observed and estimated kernel execution time. The observed time is the average of 50 SpMV kernel executions for the matrix set; the estimated time of the matrix set is the sum of each matrix's estimated time, which is obtained from Equation 17. Based on the PCIe transfer model and the kernel execution performance model, the estimated SpMV execution time t is given by Equation 10. We use the same test set as in the kernel performance evaluation. Figure 9 shows the observed and estimated SpMV execution time. The mean error of the prediction is 4.33%, and the MSE and MAE are 2.40 × 10⁻⁷ and 4.10 × 10⁻⁴, respectively.

APPLICATION
Under the operating environment of high temperature, high pressure, and intense radiation in nuclear reactors, fuel assemblies are subjected to long-term coolant erosion and generate FIV. The FIV causes fretting wear and high-cycle fatigue [36], affecting the structural integrity of fuel assemblies under normal and accident conditions. In order to ensure the safe operation of the reactor core, it is necessary to clarify the structural dynamics of the vibration of the fuel rods in the actual environment. For a whole-core high-precision simulation of this process, the required grid size reaches hundreds of millions or even tens of billions of elements [22]. We apply HTFETI combined with the Newmark method to simulate and track the FIV of the whole core assembly.
We perform a multi-step structural dynamics simulation on the large-scale mesh of the CEFR whole-core assembly. The CEFR whole-core model consists of 712 core assemblies, including 81 fuel assemblies concentrated in the middle of the reactor core. Exploiting the isomorphism of the reactor assemblies, we model the CEFR whole-core assembly as shown in the bottom left of Figure 10. One end of the whole-core assembly is fixed. The coolant flows through the fuel assembly and interacts with the wire-wrapped fuel rods, causing the FIV problem. The flow acts on the outer surfaces of the wire-wrapped fuel rods in the form of a fluid pressure, which serves as the input function of the considered system. The experiment performs 400 time steps of numerical simulation with a time step of 0.01 s. The simulation results are shown in the top left and the right side of Figure 10. The maximum displacement of 9 microns occurs at the top of the fuel rod, which is of the same order of magnitude as the test results of the Westinghouse Test Center in the United States [21]. The wire-wrapped fuel rods keep reciprocating, and the simulation results are in line with the actual state [32].

We perform scalability tests to measure the parallel efficiency and the load imbalance ratio. The number of compute nodes ranges from 400 to 3200, and the number of GPUs ranges from 1600 to 12,800. In the weak scalability test, the number of subdomains in each cluster remains the same, while the problem size grows proportionally to the number of GPUs; the total number of subdomains grows from 823,296 to 6,561,792. In the strong scalability test, the problem size is fixed, while the number of subdomains in each cluster decreases proportionally to the number of GPUs, from 4096 to 512, and the total stays at 6,561,792.

Figures 11 and 12 show the execution time under no optimization (base), under the variable-size batched method and pipeline schedule (acceleration), and under the ICSFT load balancing method combined with the acceleration strategy (acceleration + ICSFT). The experiments use 4 streams and 4 batches as the parameters of the batched SpMV library, and the matrices are evenly distributed among the batches. Compared with the base, the acceleration strategy brings a maximum improvement of 60.56 times and an average of 38.60 times. After ICSFT, the improvement reaches up to 99.60 times at most and 62.44 times on average. The weak and strong scalability efficiencies reach 91.77% and 86.13% on 12,800 GPUs, respectively. Good scalability indicates that the system can solve larger engineering problems by using more processors without losing machine efficiency. We use the load imbalance ratio (LIR), the ratio of the maximum running time to the average running time, to evaluate the workload across GPUs. In Figure 13, the LIR of the ICSFT algorithm stays between 107.98% and 117.38% over an 8-fold increase in problem size. Figure 14 shows the corresponding load imbalance ratio for strong scaling.

RELATED WORK
For SpMV optimization, previous works propose numerous sparse matrix formats [11, 12, 18] and leverage various acceleration methods [14, 20]. Anzt et al. [6] propose a flexible batched SpMV kernel for concurrently processing a large collection of small sparse matrices with fewer than 1024 rows. We propose a variable-size batched scheme that accelerates SpMV without this restriction and without extra memory access.
For kernel performance prediction, SMAT [27] uses a black-box machine learning model trained on representative performance parameters extracted from the sparse matrix pattern and the architectural configuration. WISE [40] uses a novel sparse matrix feature set that summarizes a matrix's size, skew, and locality traits as the input to decision trees to estimate the speedup of different SpMV methods over a baseline. Nisa et al. [31] predict SpMV performance by training ML models (Decision Tree, MLP, SVM, XGBoost). These methods have many training parameters and long training times, which makes them unsuitable for online training scenarios. We construct an online performance prediction model that can be generated quickly and with high accuracy for sparse matrices with similar characteristics from numerical simulation applications.
For FETI-type load balancing, some previous works focus on the mesh generation of the subdomains [15, 19]; we mainly focus on the LM matrix generated by graph partitioning. Casadei et al. [9] propose greedy graph-growing approaches to balance halo and interior node sizes. Lin et al. [28] propose a multi-grained load balancing scheme based on graph repartitioning and work-stealing for general matrix-vector multiplication. However, these works still cannot balance the workload of the subdomains before PCPG execution. We propose the ICSFT algorithm to balance the parallel HTFETI load accurately using the SpMV performance model, and we combine the predictor and the balancer into a load balancing framework that can be flexibly applied in many application scenarios.
For FETI-type optimization and large-scale simulation, Říha et al. [43] implement the sparse solver on the Intel Xeon Phi (KNC) using MKL PARDISO and on NVIDIA GPUs using cuSolver. Barka et al. [7] apply FETI-2LM and DDM to the full-wave simulation of a large antenna array with 1600 radiating elements on 1764 Intel Xeon computing cores. We use the HTFETI method combined with the Newmark method to achieve a high-fidelity structural dynamics simulation of the CEFR core assembly with up to 53.4 billion grid elements.

CONCLUSIONS
In this paper, we develop a variable-size batched processing scheme and use a multi-stream pipeline to accelerate the SpMV solver for large batches of small sparse matrices on GPUs. Furthermore, we design a model-guided load balancing framework for massively parallel structural dynamics simulation. In this framework, the workload of the subdomains is predicted accurately by the performance prediction model, and subdomains are then moved between clusters by the ICSFT method to balance the parallel workload. Finally, we demonstrate the practical value of our high-fidelity structural dynamics simulation software in a non-trivial real-world application, the FIV analysis of the large-scale, high-fidelity simulation of the CEFR whole-core fuel assembly using 12,800 GPUs. Considering the generality of the model-guided load balancing framework, it can provide a scalable and efficient solution for CFD simulation and other complex engineering problems with massive numbers of grid elements.

ACKNOWLEDGMENTS
This research was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB0500000) and the Fundamental Research Funds for the Central Universities. The numerical calculations in this study were carried out on the ORISE Supercomputer. We gratefully acknowledge the support of the China Institute of Atomic Energy, which provided the models for this work.

Figure 1: Overview of load balancing framework.

Figure 2: Possible overlap of devices for 2 copy engines under multi-stream pipeline schedule when using 1 or 4 streams.

Figure 3: An example of ICSFT algorithm.

Figure 4: Performance of SpMV schemes comparing batches and streams.

Figure 10: Displacement response of the fuel rod.

Figure 14: Load imbalance ratio of strong scaling.

Table 1: Matrix features used in the performance model.
• n_r, n_c, n_nz: number of rows, columns, and non-zeros (workload)
• r_avg: average number of non-zeros per row, Equation 13 (workload)
• d_avg: average distance between each pair of consecutive non-zeros, Equation 14 (memory access locality of the input vector)
• g_c: Gini coefficient of the number of non-zeros between column blocks, Equation 15 (memory access locality of the input vector)
• σ²_r: variance of the number of non-zeros per row, Equation 16 (load imbalance between threads during the reduction process)

Table 3: Parameters of simulation.