ANT-MOC: Scalable Neutral Particle Transport Using 3D Method of Characteristics on Multi-GPU Systems

The Method of Characteristics (MOC) for solving the Neutron Transport Equation (NTE) is the core of full-core reactor simulation. High resolution is achieved by discretizing the NTE along massive numbers of tracks that traverse the 3D reactor geometry. However, 3D full-core simulation is prohibitively expensive because of its high memory consumption and severe load imbalance. To address these challenges, we develop ANT-MOC (the name is inspired by the cooperative transport behavior of ants, which allows them to efficiently exploit resources from their environment). Specifically, we build a performance model for memory footprint, computation, and communication, based on which a track management strategy is proposed to overcome the resolution bottleneck caused by limited GPU memory. Furthermore, we implement a novel multi-level load mapping strategy to ensure load balance among nodes, GPUs, and CUs. ANT-MOC enables a 3D full-core reactor simulation with 100 billion tracks on 16,000 GPUs, achieving 70.69% and 89.38% parallel efficiency for strong and weak scalability, respectively.


INTRODUCTION
Virtual reactors offer a safer and more cost-effective means of designing and maintaining nuclear reactors. Various research institutions and government agencies have initiated projects to promote their development, including VERA [31], NEAMS [4], NURE [5], and CVR [36]. The Neutron Transport Equation (NTE) serves as an important model for calculating various aspects of the reactor core, such as criticality, power distribution, and fission rate. Within virtual reactors, the intractable problem is obtaining an exact NTE solution. The Method of Characteristics (MOC) [1] is an efficient method for solving the NTE because of its adaptability to complex solid geometries and its high fidelity.
The MOC solver for the NTE is based on massive numbers of characteristic tracks that traverse the reactor geometry. Along these tracks, the angular variable of the NTE is discretized into linear tracks [32]. All tracks are divided into segments by reactor structural details such as fuel assemblies, control rods, and other structures. As the reactor core structure has become more complex, the simulation scale in geometry has evolved from single-assembly to full-core [10]. For the additional detail and accuracy in the axial direction provided by 3D tracks, the simulation scale in tracks has also evolved from 2D to 3D [16][9]. 3D full-core simulation has thus become a long-standing goal of the reactor physics community.
Extending the simulation to 3D full-core is challenging due to the large-scale computation and memory required by tracks and segments. Gunow [12] proposes a spatial decomposition method to map the large amounts of computation and memory to CPU nodes, but the computing capability of CPUs remains inadequate for 3D full-core simulations. With the development of heterogeneous computing, more supercomputers adopt accelerators such as GPUs for large-scale computing [30]. However, GPU memory capacity often becomes a bottleneck that limits scaling [37]. In our test case of 100 billion tracks, the memory required for 3D segments is 511.5 times that of the 2D case, reaching as high as 132.6 TB. Sciannandrone et al. [26] suggest an alternative approach for CPU solvers using a chord classification method for axially extruded geometries that stores only 2D segments. This method allows us to implement on-the-fly generation of 3D segments on the GPU. In addition, the reactor's multi-assembly structure exhibits significant geometric heterogeneity. State-of-the-art solution methods [33][3] use fine meshes in the reflective regions and coarse meshes in the central core region to improve accuracy and reduce computing cost. Nonetheless, these approaches lead to an uneven distribution of computational burden among the tasks produced by spatial partitioning. Therefore, two important challenges of 3D full-core reactor simulation need to be addressed: on the one hand, limited computing power and memory cannot meet the requirements of high-resolution simulation; on the other hand, the load imbalance introduced by the decomposition and mapping method severely limits the performance potential of the machine.
To address the above challenges, we develop ANT-MOC, a software package of 40,000 lines of code that solves the 3D full-core NTE using the MOC on GPU clusters. The following contributions are provided:
• We port track generation, ray tracing, and source computation to the GPU, which enables ANT-MOC to perform end-to-end neutron transport simulations on GPU clusters. A performance model is built to predict the memory usage, computation, and communication traffic of ANT-MOC.
• We propose a track management strategy based on the performance model to alleviate the capacity bottleneck of GPU memory. Track regeneration is minimized by tuning the ratio of temporary tracks to resident tracks.
The rest of this paper is organized as follows: Section 2 introduces the MOC solver and compares this work with the current state of the art. Section 3 presents the execution process of ANT-MOC and introduces the GPU kernels and the performance model. Section 4 introduces the track management strategy and the three-level load mapping strategy. Section 5 verifies the correctness, performance improvement, and scalability of ANT-MOC.

BACKGROUND

MOC Solver and Spatial Decomposition
MOC is widely used to solve the Boltzmann form of the NTE in various neutron transport software [2][13][6][14][18][38]. In ANT-MOC, an angular quadrature method [22] is used to enforce the NTE to be solved for a set of selected directions, to which a set of quadrature weights is associated [29]. Then, spatial discretization and the step approximation are applied to the geometry to obtain Flat Source Regions (FSRs), within which the material is homogeneous. Eventually, we obtain the angular flux along a single track in Equation (1),

$$\psi_{k,g}(s_k) = \psi_{k,g}(0)\, e^{-\Sigma_{t,g} s_k} + \frac{Q_g}{\Sigma_{t,g}} F_{k,g}, \qquad F_{k,g} = 1 - e^{-\Sigma_{t,g} s_k}, \tag{1}$$

where $\psi_{k,g}(0)$ and $\psi_{k,g}(s_k)$ are the fluxes at the entering and leaving positions of the track, $\Sigma_{t,g}$ is the total cross section of the region, $Q_g$ is the flat source, and $F_{k,g}$ is the escape probability for the source. Equation (1) is used to iteratively calculate the flux along a track from the starting segment to the ending one. As shown in Figure 1, the entire flow of a neutron transport simulation can be divided into four primary procedures: geometry construction, track generation, ray tracing, and transport solving. In the geometry construction phase, the Constructive Solid Geometry (CSG) method [2][24] is used to construct the 3D geometry. The CSG method is an intuitive and convenient modeling method widely used in mainstream reactor simulation programs. The geometry is then divided into FSRs by the SPH homogenization method [35]. The geometry construction phase requires little computation and memory.
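As an illustration, the per-segment update implied by Equation (1) can be sketched in a few lines of Python. This is a hedged toy: the segment list and the dictionary layout for cross sections and sources are invented for the sketch and are not ANT-MOC's actual data structures.

```python
import math

def sweep_track(psi_in, segments, sigma_t, q):
    """Propagate angular flux along one track (flat-source MOC, Equation (1)).

    psi_in   : angular flux entering the track
    segments : list of (region_id, length) pairs along the track
    sigma_t  : dict region_id -> total cross section
    q        : dict region_id -> flat source
    Returns the exiting angular flux.
    """
    psi = psi_in
    for region, length in segments:
        tau = sigma_t[region] * length          # optical thickness of the segment
        escape = 1.0 - math.exp(-tau)           # escape probability F
        # exponential attenuation plus source build-up, as in Equation (1)
        psi = psi * (1.0 - escape) + (q[region] / sigma_t[region]) * escape
    return psi
```

A quick sanity check: if the incoming flux already equals the equilibrium value Q/Σ, it is unchanged by the sweep; with a zero source, the flux simply decays.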
In the track generation and ray tracing phases, tracks and segments are generated based on the geometric structure [27]. ANT-MOC can use either the On-The-Fly (OTF) method [11] or the Chord Classification Method (CCM) [26] for axial track generation. After 2D tracks are laid on the radial plane, a group of 3D tracks distributed along the Z axis is expanded from them. The 3D tracks are divided into segments by the FSRs. The amount of computation is related to the number of track segments due to the iteration scheme of MOC. All 3D tracks are represented by their radial sections plus additional parameters and can be restored during transport solving, which enables us to adopt a track management strategy over 2D tracks with the OTF method.
In the transport solving phase, the flux energy distribution of the FSRs is first calculated by transport sweeping and source updating. Each transport sweep integrates the flux of each energy group along each track from the previous iteration. At the same time, each FSR's fission source and absorption are updated by accumulating the new flux contributions of the tracks. The fission source and absorption are used to calculate $k_{\text{eff}}$, which in reactor physics denotes the effective neutron multiplication factor of the core, one of the key physical parameters for evaluating nuclear reactors. These steps are repeated iteratively until the source converges in each region.
There are many techniques for decomposing the workload into multiple domains [19][41]. Spatial decomposition [21] is the most straightforward and efficient, and it has been extensively used in various MOC programs [28][13][20][25]. It requires only near-neighbor communication of angular fluxes along truncated tracks. Accordingly, dividing the geometry into spatial subdomains keeps a large granularity and low overhead. The implementation of spatial decomposition is strongly tied to the laydown of tracks, since tracks are supposed to connect at subdomain boundaries. The solution is affected by the number and size of the spatial subdomains. A subdomain only updates its incoming angular flux at the end of a source computation, which is similar to a Point Jacobi iteration [9]. The fission rates produced with and without spatial decomposition may differ, while the normalized fission rates are usually the same.
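The Jacobi-style deferred update can be mimicked with a toy 1D chain of subdomains: every subdomain sweeps with its old incoming boundary flux and publishes its outgoing flux to its neighbor only at the end of the iteration. The attenuation model and data layout below are invented purely for illustration.

```python
import math

def jacobi_exchange(n_domains, n_iters, sigma=0.5, source=1.0):
    """Point-Jacobi-style boundary flux exchange between subdomains.

    Each subdomain sweeps using the incoming flux from the *previous*
    iteration; outgoing fluxes become the right-hand neighbor's incoming
    flux only after the whole iteration finishes (deferred update).
    """
    incoming = [0.0] * n_domains              # boundary angular flux per domain
    att = math.exp(-sigma)                    # per-domain attenuation factor
    for _ in range(n_iters):
        outgoing = [psi * att + (source / sigma) * (1.0 - att)
                    for psi in incoming]      # all domains sweep independently
        # deferred update: neighbors only see the new flux next iteration
        incoming = [0.0] + outgoing[:-1]
    return incoming
```

Downstream domains converge monotonically toward the equilibrium flux, one domain per iteration, which is the Jacobi-like behavior the text describes.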

Current State of the Art and Challenges
Over the years, several neutron transport codes have emerged after numerous developments and improvements. They all employ MOC to solve neutron transport problems. However, most programs do not directly employ 3D MOC because of limitations in memory and computational capability. These programs more commonly use 2D/1D coupled characteristic calculations, or use MOC solely to calculate the homogenized constants of subassemblies.
We give an overview of some typical MOC software: DeCART [13], NECP-X [6], nTRACER [14], MPACT [18], ThorMOC [38], OpenMOC [2], and ANT-MOC. Cho [7] originally proposed the 2D/1D method [40] for solving the 3D NTE, which couples radial 2D equations [39] with axial 1D equations. Because it reduces computational complexity significantly compared with direct 3D solving, the method has been widely adopted. However, in the 2D/1D method, transverse leakage may result in a negative total source and computational instability; in contrast, the 3D method can handle axial leakage effectively. The 3D MOC/DD method [38] is based on axial linear approximations of the angular flux and source; it decomposes the 3D equation into a series of axial 2D equations coupled by surface angular fluxes. In Table 1, OpenMOC stands out as the most functional software: it is equipped with a 2D GPU solver and can perform 3D computation on multiple CPU nodes. The goal of many neutron transport code developments is an end-to-end, high-resolution, three-dimensional neutron transport solution for direct full-core simulation on GPUs. There are several practical obstacles to realizing this goal: (1) The amount of computation required for a direct 3D neutron transport solution is approximately 1,000 times greater than that of a 2D solution [19]. Furthermore, the increase in the scale of tracks requires a faster-than-linear increase in the amount of computation. (2) High-resolution neutron transport simulation places higher demands on the memory footprint, and the limited memory space of GPUs exacerbates this issue. Balancing efficient heterogeneous computing against the memory footprint is a significant problem.
(3) The division of a simulation task may result in load imbalance. This is often caused by the heterogeneous nature of reactor structures [34] and by the parallel spatial decomposition method [8] of the software. Such load imbalance can extend across the different levels of supercomputer hardware, including nodes, GPUs, and CUs, thereby creating significant obstacles to efficient and scalable programs. To the best of our knowledge, ANT-MOC is the first direct 3D neutron transport software on GPU clusters. ANT-MOC can significantly accelerate neutron transport simulations compared with CPU-based codes. A track management strategy is proposed to alleviate the memory footprint bottleneck. Furthermore, a three-level load mapping strategy partitions geometry, angles, and tracks onto nodes, GPUs, and CUs, respectively. These methods effectively boost hardware utilization and improve parallel scalability.

IMPLEMENTATION
This section gives an overview of ANT-MOC and its parallel implementation in detail. Furthermore, a performance model is established to predict the memory footprint, computation, and communication traffic of ANT-MOC.

Overview of ANT-MOC
ANT-MOC is equipped with a GPU solver and supports cross-node calculation via MPI. The execution flow of the parallel version of ANT-MOC is shown in Figure 2. For convenience of description, we divide the execution flow into five stages corresponding to Figure 1. In the early stages, each process generates 2D tracks for its assigned sub-geometry and maps the tracks to the GPU based on the policy in Section 4.2.2. Then, the connection information between geometries is synchronized through MPI. Materials, the FSR vector, and 2D track data are then copied to GPU memory. The GPU first carries out 3D track generation, after which the track manager covered in Section 4.1 divides the tracks into two categories: resident tracks and temporary tracks. Then, 3D ray tracing is conducted on the resident tracks; the ray tracing of temporary tracks is performed within the source computation process. (4) Transport solving: track mapping is carried out using the policy outlined in Section 4.2.3. Once a source computation is completed, the tail fluxes of tracks are transmitted to adjacent domains via MPI; each domain receives fluxes as new head fluxes and updates the source and $k_{\text{eff}}$. Iteration continues until the flux residual falls below a certain threshold. (5) Output generation: once the iteration converges and the transport is solved, ANT-MOC outputs the FSR fission rate data to a file.

Parallel Implementation
ANT-MOC divides tasks at different granularities. Spatial decomposition is carried out on the geometric structure, which is evenly divided into multiple cuboid sub-geometries arranged in 3D space.
The decomposed sub-geometries only exchange fluxes with their neighbors. In order to achieve the load-balanced mapping covered in Section 4.2.2, it is necessary to map multiple sub-geometries to a single GPU. We implement a geometry fusion method that merges multiple sub-geometries into a fusion-geometry. Specifically, the modular ray tracing method [19] ensures that each sub-geometry has identical dimensions and track distribution. Nevertheless, distinct sub-geometries carry distinct information, such as their neighboring geometries and the distributions of FSRs and materials. To address this, an additional dimension is added to store the per-subdomain information of the sub-geometries fused into one fusion-geometry. Upon completion of the transport kernel, track fluxes are transferred between GPUs via DMA within the same node; subsequently, track fluxes are transferred to adjacent fusion-geometries on other nodes.
We port the transport solver to the GPU. The relevant code is implemented in CUDA and adapted to the ROCm platform: the hipify-perl script performs automatic code conversion for ROCm. Therefore, the GPU solver supports both NVIDIA and AMD hardware. On the right side of Figure 2, the three algorithms of 3D track generation, 3D ray tracing, and source computation are the most time-consuming, accounting for 70% of the computational workload. Since the number of tracks is large enough to allow large-scale parallelism, we map 3D tracks to GPU threads to perform these algorithms. Algorithm 1 presents a high-level summary of how the 3D tracks are processed on a GPU. The loop hierarchies of the three algorithms are identical, as they all require a sweep over the 3D tracks. The algorithm takes two inputs, d_track_3D and num_3D_tracks, and outputs the 3D tracks generated by track generation, the segments calculated by ray tracing, or the fluxes computed by source computation. In line 1, each thread calculates its unique thread ID and processes a specific 3D track. In line 2, the while loop ensures that the available threads process all 3D tracks: the track processing workflow is executed for each track, and the thread ID is then advanced. The three algorithms differ in function: 3D track generation derives 3D tracks from 2D tracks; 3D ray tracing generates 3D track segments; source computation calculates the flux along each track.

3D Track Generation.
The processing flow of each track is illustrated in Figure 3(a). In a given stack of 3D tracks, all tracks share the same radial position, including the azimuthal angle and the x-y plane of the corresponding 2D track. A 3D track index is built from the 2D track chain and 2D track stack indexes, which allows direct access to linked tracks. Finally, the linking and indexing information of the 3D tracks is stored in GPU memory.
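The loop structure shared by the three kernels in Algorithm 1 is essentially a grid-stride loop. A host-side Python sketch follows; here the GPU threads are simulated sequentially and the per-track kernel body is left abstract, so this only illustrates the indexing scheme.

```python
def process_tracks(num_3d_tracks, num_threads, kernel):
    """Grid-stride mapping of 3D tracks to GPU threads (Algorithm 1 sketch).

    Each simulated thread starts at its unique thread ID and advances by
    the total thread count, so a fixed-size grid covers any number of
    tracks (the while loop in line 2 of Algorithm 1).
    """
    results = {}
    for tid in range(num_threads):          # one loop body == one GPU thread
        track_id = tid                      # line 1: unique starting track
        while track_id < num_3d_tracks:     # line 2: cover all 3D tracks
            results[track_id] = kernel(track_id)
            track_id += num_threads         # advance by the grid stride
    return results
```

On the GPU the outer `for` is replaced by the thread grid itself; the stride is `blockDim * gridDim` in CUDA/HIP terms.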

3D Ray Tracing.
3D ray tracing is executed to trace along the 2D base tracks and obtain 3D segments. The processing flow of each 3D track is shown in Figure 3(b). The generation of 3D segments involves two levels of loops: the outer loop iterates through all the required 2D segments, while the inner loop traces each 2D segment. Within the inner loop, the FSR-related information for the starting coordinate is obtained first; then the length of the 3D segment is calculated and the coordinate is updated. At the same time, it is checked whether the coordinate crosses a Z-axis boundary. If so, the loop is exited; otherwise, the above steps are repeated from the new coordinate.
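The inner loop can be sketched for the axial part alone. This is a simplified toy under stated assumptions: it splits one 2D base segment at axial FSR boundaries for an upward-going polar angle, the plane lookup is a linear scan rather than an indexed one, and radial FSR boundaries are ignored.

```python
def trace_axial(seg_len_2d, z0, tan_polar, z_planes):
    """Split one 2D base segment into 3D segments at axial (Z) crossings.

    seg_len_2d : radial length of the 2D segment
    z0         : starting axial coordinate of the 3D track
    tan_polar  : dz per unit radial distance for this polar angle (> 0)
    z_planes   : sorted axial FSR boundaries
    Returns a list of (radial_length, z_exit) pieces.
    """
    pieces, s, z = [], 0.0, z0
    while s < seg_len_2d:
        # axial planes strictly above the current z
        above = [zp for zp in z_planes if zp > z + 1e-12]
        if not above:
            # no more crossings: the rest of the segment is one piece
            pieces.append((seg_len_2d - s, z + (seg_len_2d - s) * tan_polar))
            break
        # radial distance to the next crossing, capped at the segment end
        ds = min((above[0] - z) / tan_polar, seg_len_2d - s)
        z += ds * tan_polar
        pieces.append((ds, z))
        s += ds
    return pieces
```

The pieces always sum to the 2D segment length, and each piece ends either on an axial plane or at the segment's end, mirroring the exit check described above.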

Source Computation.
The transport solver traverses all energy groups. We first calculate the flux for each energy group by traversing a given track in the forward direction. The scalar flux is calculated and accumulated for each segment into the corresponding FSR, and the remaining flux is saved for transmission to adjacent regions. We then transfer the boundary angular fluxes to the outgoing track. Following the same procedure, the segments are processed in the reverse direction. The one-to-many relationship between FSRs and tracks leads to many concurrent reads and writes of scalar fluxes during parallel transport solving. We therefore place the FSR-associated fluxes in shared memory and access them with atomic operations. By organizing the order of variables to align data, we save both memory access time and memory consumption.

Performance Model
During the simulation of reactors with ANT-MOC, the memory footprint, computation, and communication performance are largely determined by the initial input parameters. It is therefore necessary to identify the relationships between performance and input parameters. The input parameters and their derivations are listed in Table 2; the first four parameters are the initial inputs, and the others are derived from them. We present the relationships between these parameters in Equations (2), (3), and (4), and we further explore how they influence memory, computation, and communication performance in Equations (5), (6), and (7).
The number of 2D tracks $N_{2D}$ is calculated by Equation (2), where $N_a$ is the number of azimuthal angles for vector tracks (the number for scalar tracks is $N_a/2$). The function $f$ implements the track-laying algorithm, which returns the number of tracks passing through the x-y plane at given angles.
Each 2D track expands into a set of 3D tracks, and each 3D track takes one of $N_p$ distinct polar angles. The number of 3D tracks is calculated by Equation (3), where $g$ is a function that returns the number of 3D tracks sharing the same polar angle and 2D base track. If the FSRs are fixed, a linear relationship between the numbers of segments and tracks is given by Equation (4).
Equation (4) is used to estimate the number of segments; its coefficients are derived from a small test case, where $S_{2D}$ and $T_{2D}$ denote the numbers of segments and tracks in the small sample. When the track laydown is sufficiently dense, the number of segments increases proportionally with the number of tracks, so a small sample can be used to predict the segment count of a larger one. The same principle applies in the 3D case. Once the geometry is constructed, the number of FSRs is determined by the subdivision of the components and remains fixed throughout the simulation. We have examined the memory footprint of the main vectors during the ANT-MOC calculation, and the results are displayed in Table 3: 3D and 2D segments occupy up to 97% of the total memory, and this proportion increases with the number of tracks. The maximum memory footprint during the simulation can be calculated by Equation (5). All variables in Table 3 are vectors, each item of which is a structure type; $M(\cdot)$ is the mapping function from the size of a vector to its memory occupation, and $C$ is the memory occupation of the constants and fixed-size vectors in our code.
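The calibration argument behind Equation (4) fits in a few lines: a small run yields a segments-per-track coefficient, which then scales linearly to a larger track laydown. The function name and signature are ours, written only to make the scaling explicit.

```python
def estimate_segments(tracks_small, segments_small, tracks_large):
    """Linear segment-count model following Equation (4)'s scaling argument.

    With a fixed FSR mesh, the segment count grows proportionally with
    the track count, so a small calibration run predicts larger cases.
    """
    seg_per_track = segments_small / tracks_small   # calibration coefficient
    return int(round(seg_per_track * tracks_large))
```

For example, a sample with 25 segments per track predicts one million segments for a 40,000-track laydown. Section 5 reports that this kind of estimate stays within 1.1% of the measured count.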
According to the 3D ray tracing and source computation algorithms, the number of segments determines the computational workload of the solver: the workload correlates linearly with the number of segments in the source iteration. The computational workload can be evaluated by Equation (6).
The processes use a buffered synchronous algorithm [20] to synchronize the 3D track fluxes with their neighbors. Each 3D track maintains fluxes for both sweep directions and all energy groups, stored in single precision. The communication traffic can be calculated by Equation (7),
where the constant 2 represents the two sweep directions of the flux, $n_g$ represents the number of energy groups, and the constant 4 is the width of a single-precision float in bytes; the unit of the result is bytes. The performance model makes ANT-MOC more efficient: the communication and computation models are used to design the load mapping and track management strategies, while the memory model gives a reasonable memory estimate and avoids memory overflow. In Sections 4.1 and 4.2, we apply the performance model to develop the track management and load mapping strategies of ANT-MOC.
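Under these definitions, Equation (7) reduces to a one-line product. The function and parameter names below are ours, not ANT-MOC's API; the per-track count of boundary-crossing tracks is taken as an input.

```python
def comm_traffic_bytes(boundary_tracks, num_groups, bytes_per_flux=4):
    """Per-exchange communication volume following Equation (7).

    Each boundary 3D track carries fluxes for two sweep directions and
    every energy group, stored in single precision (4 bytes each).
    """
    return 2 * num_groups * bytes_per_flux * boundary_tracks
```

For 1,000 boundary tracks and the 7 energy groups of C5G7, this gives 56 kB per exchange.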

OPTIMIZATION
We propose two optimization strategies to achieve scalable 3D neutron transport solving: a track management strategy that trades off memory against computation, and a three-level load mapping strategy suited to supercomputers that reduces idle time.

Track management strategy
As the memory requirement of 3D segments is tens to hundreds of times that of 2D segments, the vast memory requirement is prohibitive [19] and poses an even greater challenge to the limited memory of a GPU. The OTF method [11] saves memory by expanding 3D tracks from 2D tracks in real time. ANT-MOC implements 3D track generation and 3D ray tracing based on the OTF method on GPUs, ensuring that 3D segments are generated and consumed immediately on the GPU. The OTF method significantly reduces the memory footprint but introduces a significant amount of computation. In order to maximize GPU memory utilization and minimize redundant track generation, we have developed a track management strategy, illustrated in Figure 4, that categorizes tracks into resident and temporary tracks. Resident tracks are stored in GPU memory, and their segments do not need to be re-expanded during transport solving; temporary track segments are discarded once the transport sweep has been executed. When selecting resident tracks, we rank them by the number of segments estimated by Equation (4), giving preference to tracks with more segments in order to reduce the amount of regeneration work during ray tracing. Furthermore, we have implemented a fused ray tracing and source computation kernel to minimize both kernel switching overhead and memory copies.
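The greedy split described above can be sketched as follows. This is a hedged toy: `bytes_per_segment` and the budget value are placeholders, and the real selection works on Equation (4) estimates rather than exact counts.

```python
def select_resident(tracks, mem_budget_bytes, bytes_per_segment):
    """Greedy split of tracks into resident and temporary sets.

    tracks : list of (track_id, segment_count). Tracks with more
    segments are kept resident first, so the most expensive ray tracing
    is never redone during transport sweeps; the rest become temporary
    tracks regenerated on the fly.
    """
    resident, temporary, used = [], [], 0
    for tid, nseg in sorted(tracks, key=lambda t: t[1], reverse=True):
        cost = nseg * bytes_per_segment
        if used + cost <= mem_budget_bytes:
            resident.append(tid)          # segments stay in GPU memory
            used += cost
        else:
            temporary.append(tid)         # re-expanded each sweep (OTF)
    return resident, temporary
```

With a 400-byte budget and 8-byte segments, a 30-segment and a 20-segment track fit and a 10-segment track falls back to OTF, matching the preference for segment-heavy tracks.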

Three-level load mapping strategy
As demonstrated by Equation (4), the number of segments reflects the computational cost of transport solving. However, due to non-uniform grid partitioning in geometry construction, the sub-geometries resulting from spatial decomposition may have different execution times during transport solving. This can delay communication synchronization, leaving faster processes idle and wasting computing resources. Moreover, variations in 3D track lengths lead to an imbalance in thread workload inside GPUs. To address these issues, we propose a three-level load mapping strategy that matches the hardware architecture of supercomputers. Figure 5 illustrates how the different partitioning granularities are mapped onto the hardware; the load mapping strategies for each level are described as follows.

L1.
The first-level load mapping strategy decomposes the geometry and maps it evenly to multiple compute nodes. As shown in Figure 5(1), a 3D spatial geometry representing the actual reactor structure is first decomposed into eight sub-geometries with a grid size of 2x2x2. Next, we estimate the computational load of each sub-geometry using the performance model in Equation (4) and construct a graph whose nodes carry the computational loads and whose edges represent the connections between sub-geometries. After processing by graph partitioning software such as ParMETIS [15], the eight sub-geometries are organized into four groups and assigned to four distinct nodes. We chose a grid size of 2x2x2 for ease of illustration; in practice, the number of sub-geometries resulting from spatial decomposition is usually about tenfold the number of nodes. The value of 10 is determined from empirical observations [23]: selecting a value that is too low would limit the potential load mapping gains, while a value that is excessively large would produce convoluted graph structures and fail to achieve the ideal partitioning. The determination of the specific value is worthy of detailed investigation.
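The grouping step can be illustrated with a toy stand-in for the partitioner. This sketch ignores the graph's edges entirely and only balances node weights greedily; the real pipeline uses ParMETIS on the connectivity graph, which also minimizes cut edges (i.e., communication between groups).

```python
def greedy_partition(loads, num_groups):
    """Toy weight-balancing stand-in for the L1 partitioning step.

    loads : estimated segment counts per sub-geometry (Equation (4)).
    Sub-geometries are assigned, heaviest first, to the currently
    lightest group (the classic longest-processing-time heuristic).
    """
    groups = [[] for _ in range(num_groups)]
    totals = [0] * num_groups
    order = sorted(range(len(loads)), key=lambda i: loads[i], reverse=True)
    for i in order:
        g = totals.index(min(totals))   # lightest group so far
        groups[g].append(i)
        totals[g] += loads[i]
    return groups, totals
```

Eight sub-geometries with loads 8..1 land in four groups of total load 9 each, mirroring the 8-into-4 example of Figure 5(1).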

L2.
The second-level load mapping strategy decomposes a sub-geometry group and maps it evenly to multiple GPUs. As shown in Figure 5(2), a sub-geometry group is merged into a fusion-geometry, which is then decomposed and assigned to different GPUs based on the azimuthal angles of the 2D tracks. The tracks are then copied to the respective GPUs for parallel processing. Since the number of azimuthal angles is defined as a multiple of 4 and the number of GPUs per node is usually even, this partitioning achieves reasonable parallelism. Furthermore, it enables efficient transfer of track fluxes between adjacent sub-geometries within each GPU.

L3.
The third-level load mapping strategy sorts the 3D tracks within a single GPU and maps them evenly to CUs. As shown in Figure 5(3), the tracks are first arranged in descending order of their segment counts (segment_num). After sorting, tracks are dealt to the CUs in sequence, and the remaining tracks are mapped to the CUs in the same sequence until all tracks have been assigned. Once the tracks are sorted and mapped onto the CUs, each CU executes distinct computational branches for the resident and temporary tracks defined by the track management strategy in Section 4.1. Finally, the source is updated.
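The dealing scheme above amounts to sorting followed by a round-robin assignment. A minimal sketch (illustrative only; the actual CU scheduling is done by the hardware, this merely fixes the track-to-CU assignment):

```python
def map_tracks_to_cus(track_segments, num_cus):
    """L3 mapping: sort tracks by segment count, then deal them in sequence.

    track_segments : dict track_id -> segment count.
    Dealing the descending-sorted list round-robin interleaves heavy and
    light tracks, keeping the summed segment load per CU close to uniform.
    """
    order = sorted(track_segments, key=track_segments.get, reverse=True)
    cus = [[] for _ in range(num_cus)]
    for rank, tid in enumerate(order):
        cus[rank % num_cus].append(tid)   # sequential, wrap-around dealing
    return cus
```

For four tracks with segment counts {0: 5, 1: 9, 2: 7, 3: 3} on two CUs, the heaviest and third-heaviest tracks share a CU, as do the second- and fourth-heaviest.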

EVALUATION
The OECD/NEA has published the C5G7 3D extension benchmark for performance tests of 3D neutron transport codes. In this section, we use ANT-MOC to solve the C5G7 3D extension benchmark to evaluate its correctness and efficiency. Figure 6 shows the radial arrangement of C5G7, a real-world case with strong inhomogeneity used to validate ANT-MOC. The case consists of two UO2 assemblies, two MOX assemblies, and five reflector assemblies. The UO2 and MOX assemblies have 17x17 pin cells, while the reflector assemblies are homogeneous. Each pin cell is square with a side length of 1.26 cm and a pin radius of 0.54 cm. The UO2 assemblies are composed of a fission chamber, UO2 fuel cells, and guide tubes. Besides a fission chamber and guide tubes, the MOX assemblies are composed of MOX fuel cells with three different enrichments (4.3%, 7.0%, and 8.7%).
The experiments are carried out on an AMD GPU cluster with more than 4,000 computing nodes. Each node includes one 32-core AMD Zen-based processor, four AMD Instinct MI60 GPUs, and 128 GB of host memory. The processor has four Core Complex Dies, each configured as a NUMA node. Each GPU has 64 CUs and 16 GB of global memory. The nodes are connected by an HDR InfiniBand network at 200 Gbps.

Correctness Validation
To verify the correctness of ANT-MOC, a comparison between ANT-MOC and OpenMOC is conducted with the same experimental parameters, shown in Table 4. The reasonable axial and radial

Performance Model Validation
As outlined in Equations (5)-(7), communication traffic is a quantifiable metric directly related to the number of 3D tracks, whereas the memory and computation requirements are determined by the number of segments, which is an estimated value. Therefore, the accuracy of the performance model can be evaluated by comparing the estimated segment count with the measured value. A set of experiments with different numbers of tracks is used to test the model, and the results are shown in Figure 8. The x-axis represents the number of tracks, the primary axis displays the segment count, and the secondary axis presents the relative error. Figure 8 shows that the estimated and measured segment counts match closely, with the relative error fluctuating within 1.1%. We therefore conclude that the segment estimation is accurate, which indirectly confirms the accuracy of the memory and computation estimates.

Computation and Memory Evaluation
There are three methods for storing tracks and segments in memory. The EXP method stores all tracks explicitly, resulting in faster computation without recomputation. The OTF method overcomes the memory bottleneck of EXP, while the Manager method strikes a balance between the two, achieving faster calculation while efficiently utilizing memory. To evaluate the performance and suitability of the three methods, we conducted experiments using five different numbers of tracks. Specifically, we measured the average solver runtime over 10 transport iterations and recorded memory utilization prior to the start of transport execution. We set the memory threshold for resident tracks at 6.144 GB using a greedy approach; this threshold varies with the hardware used. The time and memory requirements of all three methods are displayed in Figure 9, with time represented by a line graph and memory by a bar chart; the explicit method (EXP) serves as the baseline. The horizontal axis represents the track scale of the test case, the primary axis the solver runtime, and the secondary axis the memory utilization. M_OTF represents the peak memory usage when using OTF, and T_OTF its execution time. Based on the comparison in Figure 9, EXP provides the fastest solution, but its applicability is limited by GPU memory as the problem scale increases. The OTF method significantly reduces memory consumption through real-time track generation, but it introduces a track generation kernel whose cost is five times that of the source computation kernel. Finally, the track management strategy (Manager) balances the benefits of both methods, enabling high-resolution simulation while minimizing the computational overhead introduced by OTF. As a result, it delivers a 30% performance improvement over OTF.

Load Balance Evaluation
To evaluate the load balancing of parallel tasks, we define the load uniformity index as max load / average load, which is always at least 1; the closer the value is to 1, the more balanced the load mapping. Figure 10 compares the load imbalance of the baseline with that of the three-level load mapping strategy. The baseline is the partitioning method adopted by OpenMOC, referred to as "No balance". The x-axis is the number of GPUs, and the y-axis is the load uniformity index. According to the experimental results, the L1 load mapping reduces the load imbalance by 5%, L2 by 53%, and L3 by 8%. The limited gain from L3 can be attributed to the fact that L2 partitioning had already achieved a relatively high level of load balance across the GPUs within a node. Nevertheless, L3 still yields a significant performance increase on each individual GPU. Furthermore, load mapping at the sub-geometry granularity of L1 determines the maximum potential improvement in load balancing, so enhancing the L1 mapping strategy is crucial for further performance gains.
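The load uniformity index defined above is simple to compute from per-processor loads. A minimal sketch with illustrative load values:

```python
def load_uniformity(loads):
    """Load uniformity index: max load / average load.
    Always >= 1; exactly 1 means a perfectly balanced mapping."""
    return max(loads) / (sum(loads) / len(loads))

print(load_uniformity([10, 10, 10, 10]))  # 1.0  (perfectly balanced)
print(load_uniformity([20, 10, 10, 10]))  # 1.6  (one overloaded processor)
```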

Scalability
In this test, we performed a high-resolution simulation with up to 100 billion tracks and a trillion segments, a scale never before achieved by neutron transport software. Scalability tests were also conducted for the three-level load mapping strategy to evaluate the impact of the optimizations on scalability.
We evaluated the scalability of ANT-MOC on the AMD GPU cluster, using 1,000 GPUs (250 nodes) as the baseline throughout the tests. On average, each GPU was allocated 54,581,544 tracks in the strong scalability tests and 5,124,596 tracks in the weak scalability tests. We employed the C5G7 3D extension benchmark, scaling the number of GPUs from 1,000 to 16,000. In the weak scalability experiments, the number of segments fluctuated within 1% error. Figure 11 shows the strong scalability results of ANT-MOC. At 8,000 GPUs, the parallel efficiency is observed to increase: all tracks become resident because GPU memory is sufficient, which reduces the computations required for ray tracing, especially as the per-GPU task size decreases. The figure also shows the scalability achieved with the load balancing strategies, which significantly reduce computation time while preserving high parallel efficiency. Even in the largest test case, load balancing still provides a performance improvement of up to 12%. However, the load mapping strategies also slightly reduced parallel efficiency: as the per-processor load decreases during strong scalability testing, the degree of load imbalance shrinks, leaving less room for the load mapping strategies to optimize and thus reducing their effectiveness. With all optimizations enabled, ANT-MOC reached 71% parallel efficiency in strong scalability testing on 16,000 GPUs, indicating performance potential on even larger-scale hardware.
Figure 12 shows the weak scalability test results. The increase in computation time with scale stems from the spatial decomposition method used for task partitioning in ANT-MOC, which generates additional grids and thereby increases computational complexity. Despite adjusting the number of segments proportionally, the additional grids still exacerbated the load imbalance. In tests where a load mapping strategy was configured, however, the reduction in parallel efficiency caused by spatial decomposition was alleviated. With all optimizations enabled, the weak scalability parallel efficiency reached 89% on 16,000 GPUs with 174.66 billion tracks, suggesting that ANT-MOC is capable of handling large-scale problems.
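The two efficiency figures quoted above follow the standard definitions: strong scaling holds the total problem size fixed while adding GPUs, weak scaling holds the per-GPU problem size fixed. A sketch of both formulas (the timings below are illustrative, not the measured ANT-MOC runtimes):

```python
def strong_efficiency(t_base, n_base, t_n, n):
    """Strong-scaling efficiency: fixed total problem size, more GPUs."""
    return (t_base * n_base) / (t_n * n)

def weak_efficiency(t_base, t_n):
    """Weak-scaling efficiency: per-GPU problem size held constant."""
    return t_base / t_n

# Illustrative: scaling from the 1,000-GPU baseline to 16,000 GPUs.
print(f"{strong_efficiency(160.0, 1000, 14.1, 16000):.1%}")  # 70.9%
print(f"{weak_efficiency(10.0, 11.2):.1%}")                  # 89.3%
```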

CONCLUSION
In this work, we develop ANT-MOC, a software package that solves the 3D full-core NTE using the MOC on GPU clusters. We present a three-level load mapping strategy, adapted to the hardware structure, to alleviate the severe load imbalance: tasks are divided into different granularities and mapped to different hardware levels. To overcome the GPU memory bottleneck in large-scale simulations, we port the main transport kernel to the GPU and implement a real-time track generation method, and we further minimize the "time for space" cost through a track management strategy. For the first time, we reach a simulation scale of 100 billion characteristic tracks and achieve good scalability on up to 16,000 GPUs. More broadly, the widespread occurrence of load imbalance in parallel software often wastes energy and money, and this phenomenon merits further investigation. The multi-level load mapping strategy proposed in this paper has the potential to generalize beyond neutron transport software, helping to alleviate the waste caused by load imbalance.

Figure 1: The execution flow of typical MOC programs. Transport solve performs a transport sweep to iterate over all tracks, which is the most time-consuming part of the workflow.

Figure 2: The execution flow of the parallel version of ANT-MOC.

(1) Read Configuration: ANT-MOC reads a configuration file containing spatial decomposition parameters and track-generation-related parameters. (2) Geometry Construction: The geometry is divided into several small sub-geometries. After division, the performance model covered in Section 3.3 can estimate memory, computation, and communication. The sub-geometries are mapped to different nodes using the strategy in Section 4.2.1. (3) Track Generation and ray tracing: Each node generates 2D

Table 4: Experimental parameters of ANT-MOC and OpenMOC

The visualization of the fission rate distribution is conducted with ParaView [17], and its results are shown in Figure 7. The figure illustrates the distribution of the fission rate calculated for the C5G7 benchmark, demonstrating that the components at the central position exhibit a high fission rate that gradually decreases towards the periphery. The relative errors of the assembly pin-wise fission rate between ANT-MOC and OpenMOC are all zero, so it can be concluded that ANT-MOC has the same simulation accuracy as OpenMOC. We also compared the execution time of ANT-MOC (1 GPU) with OpenMOC-3D (8 CPU cores); ANT-MOC achieves up to 428 times performance improvement.

Figure 7: Visualization of the fission rate distribution.

Figure 8: Comparison between predicted and measured values of the number of segments.

Figure 9: Memory and time comparison of the three methods.

Figure 10: Comparison of the multi-level load mapping strategy.

• Based on the performance model, we propose a three-level load mapping strategy for GPU clusters. Computing resources are fully utilized by mapping tasks of different granularities among nodes, GPUs, and Compute Units (CUs, analogous to NVIDIA SMs).
• We are the first to achieve 100 billion tracks in a full-core 3D pressurized water reactor simulation. The efficiencies of strong scalability and weak scalability reach 70.69% and 89.38% on 16,000 GPUs, respectively.

Table 2: Performance-related parameters.

Table 3: Percentage of memory footprint for main variables.