5 ExaFlop/s HPL-MxP Benchmark with Linear Scalability on the 40-Million-Core Sunway Supercomputer

HPL-MxP is an emerging high performance benchmark used to measure the mixed-precision computing capability of leading supercomputers. In this work, we present our efforts on the new Sunway that linearly scale the benchmark to over 40 million cores, sustain an overall mixed-precision performance exceeding 5 ExaFlop/s, and achieve over 85% of peak performance, the highest efficiency among all heterogeneous systems on the HPL-MxP list. The optimizations of our HPL-MxP implementation include the following: (1) a Two-Direction Look-Ahead and Overlap algorithm that overlaps all communication with computation; (2) a multi-level process-mapping and communication scheduling method that uses the entire network as fully as possible while maintaining a conflict-free algorithm-flow; and (3) a CG-Fusion computing framework that eliminates up to 60% of inter-chip communication and removes the memory access bottleneck while serving both computation and communication simultaneously. This work also provides useful insights for tuning cutting-edge applications on Sunway supercomputers as well as other heterogeneous supercomputers.


INTRODUCTION
Recently, artificial intelligence (AI) applications, especially deep learning (DL), have been booming. The emerging need to pretrain large-scale models such as GPT-4 [5] has motivated the development of low-precision computing engines. An increasing number of leading supercomputers now provide substantial low-precision computing resources; such supercomputers include Frontier [10], Fugaku [31], Summit [33], and the new-generation Sunway supercomputer.
In the traditional area of high performance computing (HPC), the desire for low-precision hardware is also rising. The hope is to harness its superior performance over single- or double-precision arithmetic, at a lower energy cost, in scientific computing applications that typically require high-precision results. To measure a supercomputer's upper limit of mixed-precision utilization, HPL-MxP was created.
The HPL-MxP benchmark [15] was proposed in 2019 to measure the mixed-precision computing capability of supercomputers by solving dense linear equations with an iterative refinement (IR) method [20]. HPL-MxP uses the L and U factors obtained by low-precision LU factorization as preconditioners and guarantees FP64 precision through IR. Moreover, it inherits and extends HPL, the performance benchmark of the TOP500 [16]. Currently, a total of 26 supercomputers have participated in the HPL-MxP list [15], and seven of them are ranked among the top 10 of the latest TOP500.
In the domain of dense linear algebra, algorithms are usually expected to approach the peak performance of a machine. However, this is far from straightforward. Figure 1 shows the efficiency status quo of both HPL-MxP and the BLAS-3 function hgemm used in HPL-MxP, from 1 CPU to 64 CPUs, on the new-generation Sunway supercomputer (abbreviated as the new Sunway). Here, HPL-MxP is based on HPL 2.0 [23], which lays a good foundation with its 2D block data layout and "Look-Ahead" technique for the L-broadcast (denoted as L-LA) of LU factorization. We change the matrix generator, add the IR part, simplify panel U's swapping to broadcasting, and accelerate all kernel functions on the SW26010-Pro processor. Although all kernels are sufficiently accelerated, the gap is still more than 27% on one computing node. This gap mainly comes from communication overhead, and the algorithm of HPL-MxP urgently needs a smart redesign to narrow it. Even when only scaling from 1 to 64 CPUs, the gap already increases by 5%; this trend is untenable when scaling to the size of leading supercomputers. To approach the peak, the scalability issue of HPL-MxP must be addressed. Recently, Kudo [19] implemented HPL-MxP on Fugaku [31], and Lu [22] exploited HPL-MxP on two leading supercomputers, Frontier [10] and Summit [33]. Their contributions show the potential of mixed-precision capabilities for large-scale applications on platforms with homogeneous processors and heterogeneous accelerators.
Our work explores optimization techniques for HPL-MxP and measures its performance on the new Sunway with heterogeneous manycore processors, which differ substantially from homogeneous processors or heterogeneous devices such as GPUs. Hence, we cannot directly apply the ideas from previous studies on heterogeneous devices [2,4,9,18,21,34]. The new Sunway is built upon the heterogeneous manycore processor named SW26010-Pro. While the computing performance of the SW26010-Pro processor is among the top of existing heterogeneous devices, its on-chip memory and network bandwidth are relatively limited, which makes designing new techniques for HPL-MxP more challenging. In addition, as a new member of the Sunway family, the SW26010-Pro processor has many architecture-specific features, which require tailoring HPL-MxP to the architecture and fully leveraging all hardware components. Moreover, the new Sunway comprises over 40 million cores, connects over 100,000 processors, and has a pruned network. Therefore, communication must be carefully designed and optimized for high scalability.
From a performance point of view, the optimization of low-precision LU factorization is critical. Additionally, owing to the properties of the test matrix, HPL-MxP requires no pivoting during LU factorization, which opens an opportunity to further exploit the "Look-Ahead" technique of HPL.
To this end, the present work proposes several optimization algorithms for HPL-MxP tailored to the target hardware, solving a system of over 30 million equations on more than 100,000 processors. It achieves 5.048 Eflop/s of mixed-precision performance and 85.21% floating-point computing efficiency.
The main contributions of our work are summarized below:
• The proposed Two-Direction Look-Ahead and Overlap algorithm (2D-LAO), which extends "Look-Ahead" to two directions, completely overlaps communication with computation in both directions while maintaining high floating-point computing efficiency.
• A novel contention-avoiding optimization for inter-node communication. It includes (1) a multi-level process-mapping (MPM) method that best fits the communication topology of HPL-MxP to the hierarchical network topology, simplifying the contention-avoidance strategy by transforming the nodes' communication topology, with two actionable rules defined to guide communication in HPL-MxP and prevent contention at runtime; and (2) a Ring-Ring broadcast algorithm that follows the contention-free communication rules and further shortens the communication path of HPL-MxP while maintaining the algorithm-flow.
• The CG-Fusion computing framework, which exploits the features of the SW26010-Pro processor, removes the memory bottleneck while serving both computation and communication simultaneously; it also eliminates all intra-chip communication and up to 60% of inter-chip communication and reduces the number of MPI processes to 1/6.
The proposed optimization methods are in line with common design practices employed in leading supercomputing facilities, such as hierarchical network topologies and bandwidth pruning. Thus, the present work can shed light on algorithm design for scientific and engineering applications deployed on Sunway supercomputers as well as other heterogeneous supercomputers.

NEW-GENERATION SUNWAY SUPERCOMPUTER
As the successor of the Sunway TaihuLight supercomputer [11], the new Sunway shares many architectural similarities with it but improves on it in various aspects. The new Sunway is equipped with more than 100,000 SW26010-Pro chips, a new high-performance heterogeneous manycore processor. These chips are connected by a highly scalable, layered interconnect fabric based on a fat-tree topology.

Figure 2: Hardware architecture of the SW26010-Pro processor

SW26010-Pro Heterogeneous Processor
The general architecture of the SW26010-Pro processor is shown in Figure 2. Similar to the SW26010 [11], the computing resources are organized into core-groups (CGs). The SW26010-Pro processor contains six CGs. Each CG owns one management processing element (MPE), 64 computing processing elements (CPEs), and one memory controller (MC). The six CGs are connected via a network-on-chip (NoC). Each CG has 16 GB of DDR4 memory connected to the corresponding MPE and CPEs via the MC. The six CGs share the on-chip network interface controller (NIC) when exchanging data with other computing nodes.
The MPE is a full-featured 64-bit RISC core that handles management, I/O, communication, and spawning threads on the CPEs, and it can also perform computation. In contrast, the CPEs are simplified 64-bit RISC cores designed to provide highly aggregated computing capability. Each CPE has a user-controlled 256 KB local data memory (LDM) with low access latency. A direct memory access (DMA) mechanism transfers large amounts of data between the LDM and global memory with high bandwidth. Moreover, the 64 CPEs are organized into a two-level architecture, in which each 2 × 2 CPE group connects to a 4 × 4 mesh, supporting both peer-to-peer and broadcast data exchange through a remote memory access mechanism within the CPE cluster.
The SW26010-Pro processor provides a theoretical peak performance of 55.296 Tflop/s in half precision. Furthermore, it has a bidirectional network bandwidth of 56.25 GB/s shared by the six CGs.
The SW26010-Pro processor provides a unified address space for the MPE and CPEs. The total of 96 (6 × 16) GB of memory can be configured with (1) a share-segment mechanism, in which six independent memory spaces are provided to the corresponding CGs and the theoretical memory access bandwidth of each CG is 51.2 GB/s; (2) a cross-segment mechanism, in which the entire 96 GB memory space is organized as NUMA memory shared by all six CGs, for a total theoretical bandwidth of 307.2 GB/s; or (3) a mix of (1) and (2).
Relevant configurations of the SW26010-Pro processor are listed in Table 1.

Interconnection Network
The new Sunway has 107,520 SW26010-Pro CPUs, for a parallel scale of 41,932,800 cores. Each SW26010-Pro processor has a dedicated network connection to a leaf switch with 304 ports. Of these, 256 ports connect to nodes and 48 ports connect to the second-level switches. Every 256 CPUs connected to the same leaf switch form a supernode, which provides non-blocking communication bandwidth across its CPUs. All supernodes are connected via a 16:3 (256:48) oversubscribed multi-layer fat-tree network.

BASIC ALGORITHM OF HPL-MXP
The HPL-MxP benchmark [13] solves the dense linear equations Ax = b with IR and uses mixed-precision LU factorization as the preconditioner. The same convergence conditions as in the double-precision HPL benchmark need to be satisfied. The outline of HPL-MxP is summarized in Algorithm 1. HPL-MxP requires 2/3 N³ + O(N²) floating-point operations, its backward error should be consistent with HPL, and the number of IR iterations is limited to 50. As in the HPL benchmark, the LU factorization is the most computationally expensive part and determines the performance of HPL-MxP. The main difference lies in the weakly diagonally dominant property of the test matrix, which eliminates the need for pivoting while maintaining numerical stability [6,16]. Because of this, the behavior of HPL-MxP on rows and columns is symmetric, and the algorithm-flow is unified into diagonal-block factorization with block-LU, panel L and panel U's trsm, the corresponding L- and U-broadcasts, and the trailing matrix A's gemm (see Figure 3(a)).
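To make this structure concrete, below is a minimal numpy sketch of the low-precision-LU-plus-IR scheme described above; it is an illustration, not the Sunway implementation. Float32 stands in for BF16 (numpy has no native BF16 LU), pivoting is skipped because the test matrix is weakly diagonally dominant, and the stopping test is a simplified residual check rather than the exact HPL backward-error criterion.

```python
import numpy as np

def lu_nopivot(a):
    """In-place LU factorization without pivoting (safe here because the
    HPL-MxP test matrix is weakly diagonally dominant)."""
    n = a.shape[0]
    for k in range(n - 1):
        a[k+1:, k] /= a[k, k]                                  # panel L
        a[k+1:, k+1:] -= np.outer(a[k+1:, k], a[k, k+1:])      # trailing update
    return a

def hpl_mxp_sketch(A, b, max_iter=50, tol=1e-12):
    """Solve Ax = b with a low-precision LU preconditioner plus FP64 IR."""
    lu = lu_nopivot(A.astype(np.float32))          # float32 stands in for BF16
    L = np.tril(lu, -1).astype(np.float64) + np.eye(len(b))
    U = np.triu(lu).astype(np.float64)
    x = np.zeros_like(b)
    for _ in range(max_iter):                      # HPL-MxP caps IR at 50 iterations
        r = b - A @ x                              # residual in FP64
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            break
        x += np.linalg.solve(U, np.linalg.solve(L, r))   # apply LU preconditioner
    return x

n = 512
A = np.random.rand(n, n) + n * np.eye(n)           # weakly diagonally dominant test
b = np.random.rand(n)
print(np.linalg.norm(A @ hpl_mxp_sketch(A, b) - b, np.inf))
```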

OPTIMIZATION METHODOLOGY
In this section, we present the proposed optimization algorithms. We first introduce the 2D-LAO algorithm, which provides the algorithmic foundation for hiding communication. We then propose the MPM method and the Ring-Ring broadcast algorithm, which focus on reducing communication overhead; this inter-node optimization provides a new way to avoid communication contention, which is important for scalability. To address the memory bottleneck of the SW26010-Pro processor, we propose a new computing framework called CG-Fusion, which benefits HPL-MxP on the new Sunway.

2D-LAO Algorithm
In LU factorization, the most computationally expensive part of HPL-MxP, there are two main types of tasks: computation (A-gemm, L-trsm and U-trsm) and communication (L-broadcast and U-broadcast). In the base LU factorization algorithm shown in Figure 3(a), little overlap between communication and computation is exploited, which leads to the efficiency gap depicted in Figure 1 and further hinders the scalability of HPL-MxP. To narrow this gap, L-LA (Figure 3(b)) "looks ahead" to panel L's L-trsm and the corresponding L-broadcast of the next update step, so that this L-broadcast can run in parallel with the A-gemm of the current update step. After panel L is received, panel U can begin its U-trsm and U-broadcast. In L-LA, however, panel U's U-broadcast and the trailing matrix A's A-gemm are still executed sequentially. Hence, several pipeline algorithms were previously proposed that split panel U and the trailing matrix A piece by piece in the U direction to overlap communication and computation in a fine-grained way. However, this introduces scheduling and synchronization overhead as well as additional implementation effort.
Due to the pivoting-free property of LU factorization in HPL-MxP, panel U no longer depends on panel L. Based on this observation, we extend the "Look-Ahead" technique to U-broadcast and decouple data dependencies in both the L and U directions. We can then freely reschedule communication and computation and overlap them wherever possible (see Figure 3(c)). We name this algorithm 2D-LAO. In HPL, communication is offloaded to nonblocking MPI. On the new Sunway, we assign communication to the MPE and computation to the CPEs. Implementation details are provided below, followed by a schematic sketch of the overlap.
• The senders broadcast panel L and panel U before entering trsm, so both the senders and receivers can execute trsm in parallel. Therefore, the L- and U-broadcasts start earlier and are overlapped with both trsm and gemm.
• The senders execute one whole gemm for the entire trailing matrix A, including blocks related and unrelated to the next trsm. Only a lightweight memory write from the CPEs is needed to notify the MPE to start broadcasting, which lets the CPEs keep computing without interruption overhead.
• The CPEs first write the gemm results that need to be exchanged into the communication buffer instead of the trailing matrix A, so broadcasting can start earlier. The MPE then completes the copy from the buffer to the trailing matrix A without urgency. In addition, the MPE can probe messages at any time, promoting timely message delivery.
• The order of the L- and U-broadcasts in different processes is determined by the logical distance from the message senders, which helps processes handle earlier-arriving messages first.
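The following is a minimal sketch of this overlap pattern, not the Sunway MPE/CPE code: a helper thread stands in for the MPE that drives the look-ahead broadcasts of step k+1, the main thread stands in for the CPEs running the step-k trailing-matrix gemm, and the kernel names, matrix sizes, and the sleep-based "broadcast" are illustrative assumptions.

```python
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def broadcast_panels(step):            # stand-in for the L-/U-broadcasts driven by the MPE
    time.sleep(0.05)                   # pretend network latency
    return f"panels of step {step} delivered"

def trailing_gemm(A, L, U):            # stand-in for A-gemm executed by the CPEs
    return A - L @ U

n, nb = 2048, 256
A = np.random.rand(n, n).astype(np.float32)
L = np.random.rand(n, nb).astype(np.float32)
U = np.random.rand(nb, n).astype(np.float32)

with ThreadPoolExecutor(max_workers=1) as mpe:      # the "MPE" communication thread
    for k in range(4):
        fut = mpe.submit(broadcast_panels, k + 1)   # look-ahead broadcasts of step k+1
        A = trailing_gemm(A, L, U)                  # step-k update proceeds concurrently
        fut.result()                                # panels are ready before step k+1 starts
```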

Optimization for Inter-node Communication
2D-LAO provides a good foundation for overlapping communication and computation. To achieve better scalability, reducing communication overhead while maintaining the algorithm-flow of LU factorization becomes the next key, allowing us to reach higher computing performance. We start by identifying the sources of potential communication issues.
The algorithm-flow of LU factorization in HPL-MxP proceeds as follows: the MPI process owning the first diagonal block (called the root process) completes block-LU before the other MPI processes owning panel L or panel U can start trsm on the relevant blocks; after that, all MPI processes update their trailing matrix A with gemm. The propagation of the computation flow resembles throwing a stone into water, creating ripple after ripple. The built-in Ring broadcast of HPL [23] reflects this algorithm-flow exactly and broadcasts sequentially from the root process to the last process. We call this the Basic-Ring (see Figure 6). However, as the system scales up, the communication time becomes intolerable due to the large number of propagation steps introduced by the linear growth of the Basic-Ring path.
In addition, HPL-MxP requires each process to participate simultaneously in the broadcasts of its horizontal and vertical neighbors, thus forming a two-dimensional communication mesh. Once communication conflicts occur, they spread across the entire process grid. Network pruning, a widely adopted strategy for building today's supercomputing facilities, may further prevent the application from reaching full-scale bandwidth.
To address these issues, the communication method needs to be carefully designed to achieve two goals: (1) shortening the broadcast path while maintaining the algorithm-flow, and (2) avoiding contention due to routing conflicts and network pruning.
To reduce these communication issues, we propose two techniques. First, an MPM method that best fits the communication topology of HPL-MxP to the hierarchical network topology, simplifying the contention-avoidance strategy by transforming the nodes' communication topology; two actionable rules are defined to guide communication in HPL-MxP and prevent contention at runtime. Second, a Ring-Ring broadcast algorithm, which follows the conflict-free communication rules and further shortens the communication path of HPL-MxP while maintaining the original algorithm-flow.

4.2.1 Multi-level process-mapping method. The multi-level fat-tree network topology is a common choice among leading supercomputers such as Summit, Sierra, Tianhe-2A and the Sunway series supercomputers. This topology consists of multiple levels of fat-trees, and the computing nodes in each bottom tree are fully connected in most cases. We call a set of these fully connected nodes a supernode. The network topology includes connections within and between supernodes. Due to budget constraints, the network connecting different supernodes is often deliberately pruned. Source mod K (SMK) and destination mod K (DMK) constitute a broad family of deterministic routing algorithms commonly adopted in fat-tree topologies. To exploit the full network bandwidth and avoid contention on a pruned multi-level fat-tree network that employs SMK or DMK routing, we propose a static process-node mapping method named multi-level process-mapping (MPM). In this section, we present the details of this method by taking the routing conflicts of SMK as an example and assuming that one process is assigned to one node. The method extends to DMK and to multiple processes per node in a similar way.
According to the routing principles of the SMK algorithm, routing conflicts can occur when certain subsets of conditions (1a)-(1e) hold. In these conditions, src_i and dst_i represent the source and destination nodes of message i, and Ssrc_i and Sdst_i represent its source and destination supernodes; i and j index the messages; and K, P, Q, P_T and Q_T are configuration settings determined by the network topology and routing algorithm. Specifically, K is related to the network pruning ratio; P is the number of nodes within a bottom tree; Q reflects the connection topology within a supernode; and P_T and Q_T describe the connection topology of the top tree and the routing rules.
For intra-supernode communication, if condition (1a) holds, messages i and j select the same upward route, which may lead to a conflict. If both conditions (1b) and (1c) hold, messages i and j select the same downward route, which may also result in a conflict. Similarly, conditions (1d) and (1e) lead to conflicts between supernodes.
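As a toy illustration of condition (1a), assuming a simplified source-mod-K route selection rather than the real routing tables: the upward route out of a leaf switch is chosen by the source index modulo the number of usable uplinks, so two concurrent messages whose sources agree modulo K contend for the same route.

```python
K = 48   # usable uplinks per leaf switch, mirroring the 256:48 pruning in Section 2

def uplink(src_node):
    """Upward route chosen by a simplified source-mod-K (SMK) rule."""
    return src_node % K

def same_upward_route(src_i, src_j):
    """Condition (1a) in this toy model: sources that agree mod K collide."""
    return uplink(src_i) == uplink(src_j)

assert same_upward_route(5, 53)        # 5 mod 48 == 53 mod 48, same uplink
assert not same_upward_route(5, 6)     # different uplinks, no contention
```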
On this basis, we propose the MPM method to remap process i on the communication topology from node i to node MPM(i). Node MPM(i) then takes over the data blocks originally processed by node i and becomes engaged in the communication dependencies of process i in the original communication topology. By following two additional rules, conflict-free communication can be achieved both within and between each level of the mapping. The construction of MPM reorganizes the nodes and supernodes into multi-level structures. After following the staged MPM construction procedure, we obtain a one-to-one correspondence between each node i and its transformed index MPM(i). The process grid that represents the logical node layout is shown in Figure 4(a). By the definition of the terms Block and Group, conditions (1b) and (1d) no longer hold; thus there are no conflicts within Blocks and Groups. Under the MPM, conflicts between messages sent to different Blocks are resolved because, by the definition of the Blocks in Stage 1, condition (1c) no longer holds. Similarly, condition (1e) does not hold for any message sent to different Groups. Hence, only three routing conflict situations may still exist: cross-supernode, cross-Block, and cross-Group conflicts. These are shown in Figure 4(b), (c), and (d), which correspond to condition (1a), conditions (1b) and (1c), and conditions (1d) and (1e) holding, respectively.
To avoid these remaining conflict situations, we further propose two rules.
• Rule 1: Ensure that communication flows in one direction, i.e., all nodes either send messages to their preceding nodes or to their subsequent nodes at the same time. Communication with the same stride is contained in the same level (nodes, Blocks, supernodes and Groups, respectively).
• Rule 2: Ensure that only one row and one column of supernodes send messages out to other supernodes at the same time.
It can be shown that if the broadcast follows Rule 1, cross-Block conflicts (Figure 4(c)) are avoided, and if it follows Rule 2, cross-Group conflicts (Figure 4(d)) are avoided. It is also worth noting that it would be difficult to find a perfect mapping, if one exists at all, that achieves conflict-free cross-supernode communication from any node within a supernode. However, when restricted to Rule 2, it suffices to select one row and one column to avoid the conflict situation shown in Figure 4(b). We present the process grid diagram of 64 supernodes constructed by MPM on the new Sunway in Figure 5. The purple nodes in the process grid at the Block level can be designated as the only row and column that broadcast messages subject to Rule 2, and the construction of those purple nodes ensures that the conflict condition in Figure 4(b) does not hold.
DMK has routing conflict-inducing conditions similar to those of SMK, so MPM can be applied to it in the same way. To extend MPM to multiple processes per node, we make one copy of the process grid per on-node process and order the copies one after another to form the final process grid. The sub-grid of each on-node process then forms a top-level "Block" and maintains the conflict-free property.
It is worth noting that, since MPM is based on the multi-level fat-tree topology, it can provide valuable insights for large-scale applications running on similar network topologies, reducing communication overhead and contention and thus achieving better bandwidth utilization.

4.2.2 Ring-Ring broadcast algorithm. The Basic-Ring broadcast [23] conforms to the two rules above and fits the algorithm-flow of LU factorization. However, as mentioned above, its path at scale is excessively long. Note that the computation time of each step is relatively consistent and determined by the node's memory space. As HPL-MxP scales, it becomes difficult for nodes far from the root process to overlap the Ring broadcast with computation. To shorten the broadcast path, we propose a new broadcast algorithm called Ring-Ring. It efficiently reduces the broadcast path length while maintaining the algorithm-flow and conforming to the aforementioned two rules.
The new Ring-Ring broadcast algorithm is illustrated in Figure 6, taking the row direction as an example.The leading nodes of a Block/supernode/Group are responsible for broadcasting messages to their counterparts in the next Block/supernode/Group, after sending messages to their immediate neighbors within the supernode as in Basic-Ring.

Once the leading nodes receive messages from other Groups or supernodes, they start their ringlets within the Group or supernode, respectively. Ring-Ring maintains the correct topological order of the algorithm-flow over multiple hierarchies, i.e., both locally and globally. Moreover, it complies with the two rules specified for MPM. Conflict-free behavior is also guaranteed if the leading nodes are located in the purple row of our node layout (Figure 5), as they then do not satisfy conflict condition (1a). In addition, only leading nodes perform inter-supernode communication, and they therefore obtain full bandwidth instead of being limited by the pruned network.
On the new Sunway, we further improve the broadcast algorithm by adding an additional layer of grouping, for instance aggregating Blocks within a supernode, to further shorten the broadcast path without breaking the algorithm-flow or the contention-free property. Furthermore, we apply a special treatment of the algorithm-flow that first sends the message to the root process of the next update step and then forwards it to the main part of Ring-Ring. In a test at nearly full-machine scale on the new Sunway, the Ring-Ring algorithm reduces the broadcast path length from 431 to 27.
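The effect of the hierarchy on the path length can be sketched with a back-of-the-envelope model; the numbers below are generic and not the actual Sunway grid, which uses three ring levels plus the extra Block layer mentioned above.

```python
def basic_ring_path(n_procs):
    """In Basic-Ring the last process receives after n-1 sequential hops."""
    return n_procs - 1

def two_level_ring_path(n_groups, group_size):
    """Two-level ring: hops along the leading-node ring, then inside the last
    local ring; the local rings of earlier groups run concurrently."""
    return (n_groups - 1) + (group_size - 1)

N, G, S = 4096, 64, 64                 # N = G * S processes
print(basic_ring_path(N))              # 4095 hops
print(two_level_ring_path(G, S))       # 126 hops
```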

CG-Fusion Framework
On the new Sunway, memory access tends to be a bottleneck, especially when a single process is assigned to a single CG. Computation and communication may severely compete for the memory resources of the CG. As shown in Table 2, we measured the memory access bandwidth in HPL-MxP when the MPE, CPEs and NIC all access one CG's memory at the same time. Although the three components collectively reach 94% of the peak memory bandwidth, the NIC obtains less than 10 GB/s, far below its peak network bandwidth of 56.25 GB/s. This effectively starves the network.
In addition to the memory access bottleneck, load imbalance can still be an issue in the one-process-per-CG framework. Although the data blocks are evenly distributed to all processes, as LU factorization proceeds, the distribution of active data blocks alternates from one step to another, leading to an imbalanced task load. Note that in Figure 7(a), 12 blocks are handled on CG2 while only 6 blocks are handled on CG3. The CG-Fusion framework employs a two-pronged strategy: memory fusion (MF) and process fusion (PF). Figure 7(b) shows the MF and PF mechanisms of the CG-Fusion framework, taking 9 × 9 blocks as an example. For comparison, Figure 7(a) shows the original one-process-per-CG computing framework.
Through MF, each block is distributed by interleaving across the six on-chip memories at a 512 B granularity, which leads to better memory access balance. Furthermore, memory bandwidth is no longer a bottleneck for the NIC, since each matrix block is scattered across all six memories on a chip; fetching a matrix block then means accessing all six memories instead of just one. The data traffic required by HPL-MxP thus drives the MPEs, CPEs, and NIC to access the six memories simultaneously, which provides a combined bandwidth of 51.2 × 6 = 307.2 GB/s. Notice that this is much higher than the peak network bandwidth of 56.25 GB/s. Even taking the measured NIC bandwidth of 9.87 GB/s on a single CG, a combined bandwidth of 9.87 × 6 = 59.22 GB/s suffices to saturate the peak network bandwidth.
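A small sketch of the interleaving idea follows; the 512 B granularity is from the text, while the address arithmetic and the round-robin order are illustrative assumptions. Consecutive stripes of a block cycle through the six CG memories, so any consumer streaming a block draws bandwidth from all six.

```python
from collections import Counter

STRIPE = 512          # interleaving granularity in bytes
N_CG = 6              # CG memories on one SW26010-Pro chip

def owning_cg(byte_offset):
    """Which CG memory serves a given byte of an MF-interleaved block."""
    return (byte_offset // STRIPE) % N_CG

# A 1 MiB panel (2048 stripes) touches all six memories almost evenly.
hist = Counter(owning_cg(off) for off in range(0, 1 << 20, STRIPE))
print(hist)           # roughly 341-342 stripes per CG
```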
Through PF, all CGs work together on trsm and gemm, which leads to better task balance. For the same total memory size, CG-Fusion can afford a trsm of √6× the size while consuming 6× the resources; this shortens the execution of trsm, which promotes earlier broadcasts and hence shortens the critical path.
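One way to read the √6 figure, under the assumption that the fused process keeps the block size nb fixed while its local trailing-matrix dimension grows with the square root of its 6× memory:

```latex
% trsm work scales with the local panel length, which grows by sqrt(6) after
% fusion, while the fused process commands 6x the CPEs, so its trsm time drops.
W_{\mathrm{trsm}}^{\mathrm{fused}} \approx \sqrt{6}\, W_{\mathrm{trsm}}, \qquad
t_{\mathrm{trsm}}^{\mathrm{fused}} \approx
  \frac{\sqrt{6}\, W_{\mathrm{trsm}}}{6\, P_{\mathrm{CPE}}}
  = \frac{1}{\sqrt{6}} \cdot \frac{W_{\mathrm{trsm}}}{P_{\mathrm{CPE}}}
  \approx 0.41\, t_{\mathrm{trsm}}^{\mathrm{per\,CG}}.
```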
In addition, PF brings the following benefits. First, the number of MPI processes in the whole system is reduced to 1/6, making our implementation more scalable. Second, the number of messages per CPU decreases from 12 to 2 per update step, which alleviates potential communication contention on the NIC. Third, it eliminates up to 60% of the inter-chip communication volume, thereby also saving memory and reducing the memory footprint.
The CG-Fusion computing framework can be implemented with the cross-segment mechanism provided by NUMA memory architecture and a fusion-CPE-spawning mechanism that allows an MPE to manipulate all CPEs on a SW26010-Pro processor.

Discussion
4.4.1 Generality of optimizations. Although the optimizations presented in this paper focus on HPL-MxP on the new Sunway, several techniques may be generalized to other systems and applications. For example, 2D-LAO is an algorithm-level optimization that can be applied to tune HPL-MxP on other platforms, including those with accelerators. 2D-LAO can also be used in applications based on Cholesky decomposition. MPM may play a key role in resolving at-scale communication issues of popular network topologies (pruned and multi-level fat-tree topologies). Additionally, the idea of reconstructing communication dependencies tailored to the network topology and optimizing communication paths can serve as a general guideline for communication optimization of large-scale applications.

4.4.2 Stable run over the entire machine. For large-scale supercomputers, long-term stable operation of HPL-class benchmarks is a considerable challenge. Many factors contribute to stable operation, including hardware reliability, cooling system capacity, and the fluctuation of power consumption caused by application behavior. To address fluctuating power consumption in HPL-MxP while keeping performance high, we design a block-column-major data layout that reuses data in registers, which reduces the power consumption of gemm by 20%. For memory-intensive functions, we smooth out power jitter by adding redundant invalid floating-point operations. These measures effectively help the system run HPL-MxP continuously and stably for extended hours.
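One common form of such a block-column-major layout is sketched below; the indexing is an illustrative assumption, since the exact Sunway layout and its register-blocking details are not spelled out here. Each nb × nb block is stored contiguously in column-major order, and blocks are ordered down each block-column, so a gemm micro-kernel streams one compact block at a time instead of striding across the full leading dimension.

```python
def block_col_major_offset(i, j, n, nb):
    """Element offset of (i, j) in an n x n matrix stored block-column-major
    with nb x nb blocks (n assumed divisible by nb for simplicity)."""
    bi, bj = i // nb, j // nb          # block coordinates
    li, lj = i % nb, j % nb            # position inside the block
    blocks_per_col = n // nb
    block_id = bj * blocks_per_col + bi
    return block_id * nb * nb + lj * nb + li

# n = 8, nb = 4: element (5, 2) lives in block (1, 0) at local position (1, 2).
assert block_col_major_offset(5, 2, 8, 4) == 1 * 16 + 2 * 4 + 1
```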

PERFORMANCE RESULTS
We demonstrate HPL-MxP performance in this section. We choose BF16 for LU factorization, which has a larger range of representable values, albeit lower accuracy, than FP16; we also choose a BF16-FP64 mixed-precision configuration for IR, using the row-major process order and the above-mentioned L-LA as the baseline. We set the block size to 1024, balancing gemm computational efficiency, memory bandwidth occupancy, and parallel overhead. All kernel functions are accelerated by the CPEs for optimal baseline performance, and we arrange one process per CG in the baseline. All the optimizations proposed in this work also work with other precisions and block sizes. We implement HPL-MxP with the SWGCC compiler. The MPI implementation is based on MPICH 3.2.
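The range/precision trade-off behind the BF16 choice can be checked numerically. The snippet below relies on the ml_dtypes package for a bfloat16 type, which is an assumption of this sketch and not part of the toolchain described above.

```python
import numpy as np
import ml_dtypes   # assumption: supplies a bfloat16 type for illustration only

# FP16 has 5 exponent bits (narrow range, finer steps); BF16 has 8 exponent
# bits (FP32-like range, coarser steps), which is why it suits the LU panels.
print(np.finfo(np.float16).max, np.finfo(np.float16).eps)        # 65504.0, ~9.8e-04
print(ml_dtypes.finfo(ml_dtypes.bfloat16).max,
      ml_dtypes.finfo(ml_dtypes.bfloat16).eps)                   # ~3.4e+38, ~7.8e-03
```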
We summarize the HPL-MxP performance of seven leading supercomputers in Table 3, with data drawn from the latest official HPL-MxP list [15]. The theoretical peak FP16 performance of these systems can be found in their reports or official websites [3,25,26]. As shown in Table 3, the top-ranked HPL-MxP run achieved 9.95 Eflop/s and a mixed-precision floating-point computing efficiency of 74% on Frontier. Our HPL-MxP performance of 5.048 Eflop/s was achieved on the new Sunway. Moreover, we successfully scaled HPL-MxP to 41,140,224 cores while achieving a mixed-precision floating-point computing efficiency of 85.21%. Hence, in the HPL-MxP list, our efficiency ranks highest among all heterogeneous systems and second among all systems.
Moreover, we evaluate critical kernels and conduct extensive ablation tests to demonstrate the performance improvement of the proposed optimization methods, including the 2D-LAO algorithm, the CG-Fusion framework, the MPM method and the Ring-Ring broadcast algorithm. An evaluation of communication and weak scalability is also presented here. Due to computing budget constraints, we tested the MPM method and the Ring-Ring broadcast algorithm on large scales already optimized by the CG-Fusion framework. Thus, we present the performance of MPM and Ring-Ring after that of the CG-Fusion framework.

Performance Evaluation of Critical Kernels
We first present the performance of the critical kernels used in half-precision LU factorization on a single node, namely hgemm and htrsm. The experimental results are shown in Figure 8, where the abscissa is the matrix scale and the ordinate is the performance in Tflop/s. We obtain 53.49 Tflop/s for hgemm and 33.04 Tflop/s for htrsm, reaching 96.73% and 59.75% of the peak half-precision computing capability, respectively. The high performance of these critical kernels lays the foundation for the performance of our HPL-MxP.

Performance of the 2D-LAO Algorithm
We conducted LU factorization experiments from one CPU (2 × 3 process grid) to 64 CPUs (16 × 24 process grid), using the L-broadcast "Look-Ahead" baseline framework (L-LA), L-broadcast "Look-Ahead" and Overlap (denoted as L-LAO), and 2D-LAO. All tests employed HPL's built-in Ring broadcast. As shown in Figure 9, 2D-LAO achieves the highest performance at all scales compared with L-LA and L-LAO. Adding communication overlap in the L direction contributes a performance improvement of over 20%, and introducing "Look-Ahead" and overlap in the U direction provides an additional 10%. Compared with L-LA on a single CPU, L-LAO achieves 21.59% higher performance while 2D-LAO achieves 30.95% higher performance. On 64 CPUs, L-LAO and 2D-LAO obtain 25.47% and 36.20% performance gains, respectively. From 1 CPU to 64 CPUs, the gain of 2D-LAO increases from 30.95% to 36.20%. As the number of processes grows, so do the benefits of communication overlap. It can be expected that 2D-LAO will bring even larger performance gains at larger scales.

Performance of the CG-Fusion Framework
The CG-Fusion framework benefits from MF and PF. To demonstrate their effects separately, we compare share-segment-based 2D-LAO (denoted as SS), cross-segment-based 2D-LAO (denoted as CS), and CG-Fusion-based 2D-LAO (denoted as CG-Fusion) for LU factorization. Figure 10 shows the performance results when using 1 to 256 CPUs. Comparing CS and SS demonstrates the improvement brought by MF. Because MF makes better use of memory and the NIC, it brings benefits at all scales. For the 256-CPU case, we obtain over 5% improvement. Both CG-Fusion and CS distribute data with the cross-segment mechanism, so the gap between them shows the effect of PF. CG-Fusion achieves a 4.55% performance improvement over CS in the 256-CPU test. This benefit mainly comes from concentrating the entire chip's resources on accelerating critical paths, thus mitigating conflicts on the NIC, eliminating intra-chip communication, and reducing inter-chip communication. CG-Fusion is inferior to CS at small scales because the newly added level of parallelism for merging CGs introduces scheduling overhead, which offsets the aforementioned benefits.
As the scale increases from 1 CPU to 256 CPUs, the performance gain of CG-Fusion over SS increases from 5.69% to 9.79%. Additionally, CG-Fusion brings more performance improvements for larger scales.

Performance of Multi-Level Process-Mapping and the Ring-Ring Broadcast Algorithm
We perform parallel experiments with the MPM method and the Ring-Ring broadcast algorithm (RRB) on 240 CPUs to 15,872 CPUs, using CS and CG-Fusion. Performance results are reported in Figure 11. It can be seen that CG-Fusion with all optimizations achieves the highest efficiency. Moreover, the overall performance impact of MPM and RRB is similar across CS and CG-Fusion, with an increase of nearly 3%. However, the distribution of gains is slightly different. For CS, MPM provides an approximately 1.5% performance improvement, while RRB provides another 1.5%. In contrast, for CG-Fusion, the boost is 0.4% from MPM and 2.4% from RRB. The effect of MPM diminishes in CG-Fusion because both CG-Fusion and MPM are designed to resolve communication contention, and part of the gain shifts from MPM to CG-Fusion. Meanwhile, the effect of RRB is enhanced in CG-Fusion because there is a larger average volume per message to be transferred while the computing time is almost unchanged (compared with CS). As a result, performance is more sensitive to the broadcast path length.

Communication Evaluation
Figure 12 shows the time breakdown of the 64-CPU HPL-MxP run. In the L-LA baseline algorithm, the broadcast overhead accounts for 32.48% of the total time, which seriously hinders scalability. After applying 2D-LAO and CG-Fusion, the share of communication overhead drops to only 3.49%; almost all communication is overlapped with computation. The proportion of the main computing task (i.e., updating the trailing matrix) therefore increases from 62.77% to 91.04%, squeezing out more computing performance.

We also compare the efficiency of LU factorization (integrating all optimization methods mentioned in this paper) with that of hgemm from 240 CPUs to 15,872 CPUs. Only a fairly small gap remains with our optimizations (Figure 13), which means that communication is almost entirely hidden by computation. With our efforts, the performance of LU factorization comes close to that of hgemm, as one would expect.

Parallel Scalability
We finally demonstrate the parallel scalability of the optimized HPL-MxP implementation on the new Sunway in a weak scaling test. Results are shown in Figure 14, with both mixed-precision performance and floating-point computing efficiency reported. We achieve linear weak scalability, with the computing efficiency remaining almost constant as the scale increases up to N = 32,292,864. We end up with a score of 5.048 Eflop/s and a sustained computing efficiency of 85.21% as the number of CPUs exceeds 100,000. With these results, the new Sunway ranks among the top heterogeneous supercomputers in the latest HPL-MxP list.

RELATED WORK
The HPL-MxP benchmark was proposed to incorporate mixed-precision operations and IR into HPL, uniting these two realms and delivering a blend of modern algorithms and contemporary hardware to evaluate the mixed-precision computing capability of supercomputers. In recent years, there have been an increasing number of studies on implementations of HPL-MxP on leading supercomputers. ORNL [24] and RIKEN [29] implemented optimized HPL-MxP versions based on AMD MI250 GPUs, Volta/Ampere GPUs, and the ARM-based A64FX, putting Frontier [10], Summit [33], and Fugaku [31] at the top positions in different releases of the HPL-MxP list.

State-of-the-art work was conducted on Fugaku by Kudo [19]. They proposed three numerical techniques to reduce IR iterations while maintaining mixed-precision numerical stability. They also focused on reducing the memory footprint to enlarge the problem matrix and on reusing buffers to reduce overhead. Moreover, they exploited HPL's built-in Ring algorithm to implement the broadcasts in LU factorization. Lu [22] tuned HPL-MxP on Frontier and Summit; their work focused on intra-node optimization and provided valuable information about tuning parameters. In contrast, our work mainly focuses on improving the scalability of HPL-MxP. We enhance the overlap of computation and communication, avoid communication contention, reduce broadcast overhead, and take advantage of on-chip shared-memory mechanisms to eliminate the memory access bottleneck of the heterogeneous architecture.
LU factorization is the most critical part of HPL-MxP, and there have already been quite a few studies on optimizing it. Most of these studies have concentrated on optimizations within computing nodes [1,8,14,17,27,35,36]. As hardware evolved, the focus shifted from optimizing BLAS kernels to distributing tasks between CPUs and accelerators [2,4,9,12,28,30]. Regarding intra-node communication optimization, several pipeline algorithms have been developed to overlap communication with computation [7,21,34] using different scheduling strategies and task allocation methods; Tan [34] provides a thorough summary of these pipelining algorithms. Recently, Kim [18] developed an optimized HPL for modern heterogeneous GPU clusters, introducing a performance model and a simulator for better parameter searching and demonstrating their effectiveness on 1760 NVIDIA A100 GPUs. For inter-node optimization, Shi [32] proposed a method that reorders processes to reduce inter-node communication overhead and improves load balancing on a heterogeneous cluster with 32 GPUs and 128 CPUs.
In the present work, we pay more attention to scalability, with both intra-node and inter-node optimizations. For intra-node optimization, we extend the "Look-Ahead" technique to two directions and hide all of each step's communications behind the computation of the entire trailing matrix, which reduces the overhead of fine-grained panel scheduling and computation startup. For inter-node optimization, we propose a process-mapping method that best fits the communication topology of LU factorization to the hierarchical network topology, thereby minimizing possible communication contention. We reify the abstract goal of avoiding network conflicts into two actionable rules that guide communication and resolve runtime contention. We also propose a Ring-Ring broadcast algorithm that follows the contention-free communication rules and further shortens the communication path of HPL-MxP while maintaining the algorithm-flow.

CONCLUSIONS
This paper presents an optimized implementation of the HPL-MxP benchmark on a heterogeneous supercomputer with limited memory access and communication bandwidth, i.e., the new Sunway. Several optimization methods are proposed to scale HPL-MxP to over 40 million cores with linear scalability. The new Sunway achieved a mixed-precision performance of 5.048 Eflop/s on the HPL-MxP benchmark and the highest floating-point computing efficiency, 85.21%, among all heterogeneous manycore systems in the HPL-MxP list. Our work can serve as a test drive and provide insights for deploying large-scale scientific and engineering applications on leading heterogeneous supercomputers.

Figure 3: Different factoring strategies and algorithm-flows of LU factorization for the base, L-LA (built-in HPL 2.0), and 2D-LAO algorithms. Here, block-LU refers to the diagonal block's LU factorization; L-trsm and U-trsm denote panel L's and panel U's trsm, respectively; and A-gemm refers to the trailing matrix A's gemm: A := A − L × U.

Figure 5: The node layout obtained through MPM on the new Sunway (K = 48, P = Q = 8, P_T = Q_T = 16). It can be observed that intra-Block nodes have different IDs (ID mod 8), which maintains the conflict-avoiding feature of Blocks. We relabel all Blocks in a supernode by block-ID mod 4 and swap Block 0 with Block 3. Next, we reorder the nodes along the vertical direction and rotate the j-th column cyclically by ⌊j/2⌋ steps (j from 0 to 15). Then, the originally conflicting messages (as shown in Figure 4(b)) are resolved, since the nodes in the first row of the resulting layout (i.e., the purple nodes) all have different values of node-ID mod K (K = 48).

Figure 6: Ring-Ring broadcast algorithm in the row direction.Ring-Ring consists of Basic Rings within supernodes (denoted as Node-Ring), S-Node-Ring linking intra-Group supernodes and Group-Ring connecting all Groups.
Figure 7(a) shows a snapshot of the data layout during LU factorization.
In the one-process-per-CG framework, the six CGs on a chip act as six MPI processes, each individually completing the task blocks it owns, and the six memories serve their own CPEs and MPE separately. In the CG-Fusion computing framework, the six MPI processes originally assigned to the six CGs on a chip are combined into a single MPI process, and the CGs collaborate to carry out the computation in a fine-grained way, with the six memories serving all CPEs and MPEs together.

Figure 7: Descriptions of the one-process-per-CG computing framework and the CG-Fusion computing framework.

Figure 8: Performance of hgemm and htrsm with different matrix sizes

Figure 9: Performance of L-LA, L-LAO and 2D-LAO on 1 CPU to 64 CPUs with built-in Ring broadcast of HPL

Figure 10: Performance of SS, CS, and CG-Fusion when using 1 CPU to 256 CPUs with built-in Ring broadcast of HPL.

Figure 11: Performance of multi-level process-mapping and broadcast methods with CS and CG-Fusion. 1RB denotes the built-in Ring broadcast of HPL, and RRB denotes the Ring-Ring broadcast.

Figure 14: Weak scalability of HPL-MxP, from 240 to 107,136 CPUs, with the matrix reaching up to 32,292,864 in size

Table 1: Hardware configurations of the SW26010-Pro processor
• Stage 3: Group all Blocks in each supernode into rectangles of shape P_B × Q_B, where P_B and Q_B are determined by the scale of the 2D communication topology used in HPL-MxP.
• Stage 4: Combine each consecutive min(P_T, Q_T) supernodes into a Group.
• Stage 5: Sort each Group into a rectangle of size P_G × Q_G, where P_G and Q_G are determined by the scale of the 2D communication topology used in HPL-MxP and satisfy P_G × Q_G ≤ min(P_T, Q_T).
• Stage 6: Arrange the Groups in row order.
Figure 4: Process grid diagram obtained through MPM and the conflict situations. Each Block consists of P nodes, each supernode consists of multiple Blocks, and each Group consists of min(P_T, Q_T) supernodes. In (b), if the source nodes of the messages have the same ID (mod K), a conflict occurs because condition (1a) holds. In (c), messages sent from different Blocks with the same node ID (mod P) to the same Block/Group conflict because conditions (1b) and (1c) hold. Similar to (c), (d) shows the situation where conditions (1d) and (1e) hold.

Table 2: Bandwidth in simultaneous memory access

Table 3: HPL-MxP performance of leading supercomputers