Communication-Avoiding Optimizations for Large-Scale Unstructured-Mesh Applications with OP2

In this paper, we investigate data movement-reducing and communication-avoiding optimizations and their practicable implementation for large-scale unstructured-mesh applications. Utilizing the high-level abstraction of the OP2 DSL for the unstructured-mesh class of codes, we reason about techniques for reduced communications across a consecutive sequence of loops – a loop-chain. The careful trade-off with increased redundant computation in place of data movement is analyzed for distributed-memory parallelization. A new communication-avoiding (CA) back-end for OP2 is designed, codifying these techniques such that they can be applied automatically to any OP2 application. The back-end is extended to operate on a cluster of GPUs, integrating GPU-to-GPU communication with CUDA, in combination with MPI. The new CA back-end is applied automatically to two non-trivial applications, including the OP2 version of Rolls-Royce’s production CFD application, Hydra. Performance is investigated on both CPU and GPU clusters on representative problems of 8M and 24M node mesh sizes. Results demonstrate how for select configurations the new CA back-end provides between 30 – 65% runtime reductions for the loop-chains in these applications for the mesh sizes on both an HPE Cray EX system and an NVIDIA V100 GPU cluster. We model and examine the determinants and characteristics of a given unstructured-mesh loop-chain that can lead to performance benefits with CA techniques, providing insights into the general feasibility and profitability of using the optimizations for this class of applications.


INTRODUCTION
The end of frequency scaling in the middle of the last decade has led processor architectures to move towards increasingly massively parallel designs. As a result, modern processors feature a proliferation of arithmetic capability in the form of growing numbers of discrete processor cores, both on traditional CPUs and on accelerator devices such as GPUs. For example, a high-end CPU currently has close to 100 cores, while GPUs are designed with even more parallelism: over 2000 "cores", albeit cores that are comparatively much simpler. Large-scale clusters of these devices have reached hundreds of thousands to millions of cores, as can be seen from the recently unveiled exascale supercomputers. However, the speed of the memory and network channels interconnecting the processors and system/device memories has largely lagged behind, leading to significant memory bandwidth bottlenecks. As a result, the performance of many conventional algorithms, optimized for floating-point operations, has stalled. Developing algorithms with reduced data movement, or communication-avoiding (CA) algorithms, has therefore become an intense area of research (e.g. [14, 33]). The underlying motivation of these works is to exploit the significantly higher computing capability of these processors in place of communications, aiming to obtain higher performance.
The main challenge in adopting communication-avoiding optimizations or techniques in real-world applications is the significant effort and difficulty of implementing them and the subsequent maintenance of the code. The specific optimizations are highly complex and involved, usually obfuscating the source code with platform-specific low-level features. A large body of work has developed the underlying theory for CA techniques, including tiling [18, 27, 29] and reduced distributed-memory communications [13, 18]. Compile-time application of these optimizations, such as those based on the polyhedral model [16, 31], has been developed within compiler frameworks such as LLVM's Polly [12] and Pochoir [30]. Another strand of work, such as that by Demmel et al. [9, 10], has successfully developed libraries that applications can use to access CA-optimized numerical methods. More recently, domain-specific languages (DSLs) and similar high-level frameworks have demonstrated a pathway for applying these exotic optimizations to larger non-trivial applications. Key DSLs with CA capabilities include Firedrake [15] for unstructured-mesh-based FE applications, and OPS [27] and Devito [19] for structured-mesh-based applications.
The underlying goal of this paper is to examine and apply key data movement-reducing techniques to large-scale, real-world applications. To this end, we build upon previous work by Luporini et al. [17], bringing together techniques for reducing data movement for the unstructured-mesh application class and codifying them through the OP2 DSL [20], an embedded domain-specific language for developing unstructured-mesh applications. We begin with a focus on distributed-memory parallel execution and then its combination with on-node parallelization on many-core (GPU) processors. A new CA back-end for OP2 is designed such that any application using OP2 can utilize the optimizations. The new back-end is then applied to our main application of interest, Hydra, a production computational fluid dynamics (CFD) application used for aero-engine design at Rolls-Royce plc. More specifically, we make the following contributions:

• We design a new CA back-end for the OP2 DSL, focusing on its distributed-memory parallel operation, and reason about techniques for reducing the number of MPI messages exchanged during the execution of a sequence of consecutive loops, a loop-chain. A key new feature of the back-end is its ability to run standard loops over unstructured-mesh sets interspersed with selected loop-chains with CA, to obtain the best overall performance of the application (Section 3).

• The performance of loop-chains with CA is analytically modeled. We investigate the careful trade-off of increasing computations at shared MPI halos to satisfy loop dependencies in place of data movement via message passing. The analytic model provides insights to characterize loop-chains and reason about whether a given loop-chain will benefit from the CA back-end of OP2 (Section 3.2).

• The distributed-memory back-end is extended to work on GPU clusters, leveraging reduced MPI message passing when executing on a cluster of GPUs and enabling lower overheads in GPU-to-GPU communication with CUDA (Section 3.3).

• Finally, the CA back-end is applied to Hydra, using its recently re-engineered OP2 version [21], OP2-Hydra, for use cases with mesh sizes of 8M and 24M. Benchmarking is carried out on the ARCHER2 supercomputer, an HPE Cray EX system with AMD EPYC 7742 cores, and the Cirrus GPU cluster with NVIDIA V100 GPUs at EPCC (Section 4).
Results indicate significant performance gains: up to 65% on select node counts on the ARCHER2 (CPU) and Cirrus (GPU) clusters. However, they also point to the need to carefully select which loop-chains have CA optimizations turned on, as for some chains they lead to performance degradation over the non-CA version.
To our knowledge, practical implementations of CA techniques for large-scale applications, particularly for large production codes such as Hydra from the unstructured-mesh domain, are limited in the literature, not to mention benchmarking and performance analysis on both multi-core (CPU) and many-core (GPU) cluster systems.
Reasoning about the viability and profitability of these optimizations for real-world codes is also novel, particularly through the development of an analytic model. Our work provides insights into the key determinants and characteristics of a given general unstructured-mesh loop-chain that can lead to performance benefits with CA optimizations. These insights, we believe, are more broadly applicable, even to applications developed without OP2. On the other hand, the work also demonstrates how a high-level abstraction framework such as OP2 can seamlessly deliver these complex and exotic code transformations to real-world applications without affecting the science source, maintaining performance portability.

Communication-avoiding Algorithms
Communication-avoiding algorithms have a rich literature, with reducing data movement identified as a fundamental optimization. Among these techniques, improving data locality by restructuring loops or rescheduling loop iterations, generally known as tiling or loop-blocking, has long been well understood [32]. The theoretical underpinnings of these loop transformations have been comprehensively described through the polyhedral model [7].
Many frameworks and libraries have been developed based on the polyhedral model. Key works include PLuTo [8], a fully automatic polyhedral program optimization system; Polly [12], an LLVM framework for high-level loop and data locality optimizations; Pochoir [30], a compiler and runtime system for implementing stencil computations on multicore processors; and PolyMage [22] and Halide [25], which specifically target image processing pipelines. All these works create polyhedra (multi-dimensional sub-iteration spaces) within loops with static, regular access patterns/dependencies. Such loops can be readily viewed as iterations over a structured mesh with regular stencils defining the dependency neighborhood. The extension of these algorithms to distributed-memory systems requires them to account for the dependencies in the loop nests when the iteration space is spread over a number of processes, or disparate memory areas. This leads to the need for those dependencies to be satisfied through communications of extra halo layers. These extensions have been developed for several of the aforementioned frameworks: for PLuTo in [6] and for Distributed Halide in [11]. However, applications of these optimizations, particularly the ideas of the polyhedral model, to unstructured-mesh applications are limited in the literature as a consequence of the added complexity of managing the irregular dependencies of the unstructured mesh. The dependence structure arises due to their characteristic indirect memory accesses, specifically indirect increments (e.g. D[map[i]] += f(...)) and indirect reads, through explicit connectivity mappings [17]. The dependence analysis needs to be done via mappings, as opposed to the static dependence neighborhoods (e.g. specified by a stencil) commonly exploited in general static loop optimizations. As such, the analysis is significantly more involved and needs to be carried out dynamically at runtime, as demonstrated by Luporini et al. [17], based on the loop-chain abstraction introduced in [14].

Loop-chain Abstraction and Sparse Tiling
A loop-chain is a sequence of consecutive loops without any global synchronisation points (such as global reductions) between loops, specified or annotated with information to facilitate runtime dependency analysis. This information should be provided by the programmer or automatically derived from the code, either through a code parse or a DSL's API. It then enables us to reason about the dependencies of the sequence of loops collectively. For loop-chains over unstructured meshes, as shown in [17], a sparse tiling schedule can be created from the analysis, providing an execution ordering of the iterations over the mesh. This ordering, which is semantically equivalent to the original, can then be used to carry out the loop iterations. For example, consider the two sequential loops detailed in Figure 2, over the unstructured mesh shown in Figure 1. The mesh consists of nodes, edges and cells, where a loop over edges updating the nodes at each end of an edge would require explicit connectivity information specified by a mapping of edges-to-nodes, en. Two such loops, occurring within a larger time-marching iterative loop, are illustrated in lines 4-11 (update) and 14-29 (edge_flux) in Figure 2. Both loops increment data held on the nodes, res and flux respectively, indirectly via the mapping array en. The edge_flux loop indirectly reads data, cell weights (cw), held on the two cells next to an edge, via the edges-to-cells mapping, ec. The update and edge_flux loops taken consecutively can be viewed as a loop-chain with two parallel loops and specified using the OP2 DSL's API [20] as in Figure 3.
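The update/edge_flux access pattern just described can be sketched in plain Python (the real code uses the OP2 C API); the tiny ring mesh, kernel bodies and data values here are illustrative, not taken from the paper:

```python
# Plain-Python sketch of the Figure 2 loop pair: two consecutive loops
# over edges that indirectly increment node data through the en mapping.
nnodes, ncells = 4, 2
en = [(0, 1), (1, 2), (2, 3), (3, 0)]   # edges-to-nodes mapping
ec = [(0, 0), (0, 1), (1, 1), (1, 0)]   # edges-to-cells mapping

res  = [1.0] * nnodes
pres = [1.0, 2.0, 3.0, 4.0]             # data read directly on nodes
cw   = [1.0] * ncells                   # cell weights, read indirectly
flux = [0.0] * nnodes

# Loop 1 (update): iterate over edges, indirectly incrementing node
# data res via the en mapping.
for n0, n1 in en:
    contrib = 0.5 * (pres[n0] + pres[n1])
    res[n0] += contrib
    res[n1] += contrib

# Loop 2 (edge_flux): indirectly reads cell weights cw via ec and the
# freshly updated res via en, then indirectly increments flux on nodes.
for (n0, n1), (c0, c1) in zip(en, ec):
    f = (cw[c0] + cw[c1]) * (res[n0] - res[n1])
    flux[n0] += f
    flux[n1] -= f
```

The second loop consumes values the first loop produced through the same mapping, which is exactly the write-then-indirect-read dependency a loop-chain analysis must track.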
• Access descriptors: one or more 2-tuples of the form <map, mode> associated with a loop, where map is a mapping indicating indirect access, or ID (the identity mapping) indicating direct access, on data (specified by op_dats in OP2) defined over the iteration space of the loop. mode is the mode of data access: read (OP_READ), read/write (OP_RW), write (OP_WRITE) or increment (OP_INC). In OP2, the access descriptors are defined using the op_arg_dat API.

The above definition provides the information needed to carry out an inspection or analysis phase to create a set of tiles, i.e. the aforementioned sparse tiling schedule. The schedule guarantees that data dependencies are not violated, such that each tile can be executed in its entirety without any data accesses to/from outside the tile. Executing a tile $T_i$ entails executing all the iterations of $l_0$ belonging to that tile, followed by all the iterations in that tile for $l_1$, and so on up to $l_{n-1}$. Then the next tile $T_{i+1}$ is executed in a similar manner, continuing this pattern of execution until all the tiles have been completed. It is important to note that sparse tiling assumes that the order of execution of loop iterations does not affect the final result, at least within machine precision. This encompasses a large number of explicit numerical schemes, particularly when solving PDEs in numerical simulation applications. For the full description of the loop-chaining abstraction and the development of the sparse tiling inspect/execute phases, we refer the reader to Luporini et al. [17].
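The tile-by-tile execution order described above can be sketched as follows; the tiling schedule here is a toy, hand-built one rather than the output of a real inspector, which would derive it from the maps at runtime:

```python
# Toy sketch of sparse-tiled execution: each tile holds, for every loop
# l_0..l_{n-1} of the chain, the subset of that loop's iterations
# assigned to it.

def execute_loop_chain(tiles, kernels):
    """Within tile T_i, run all of T_i's iterations of l_0 first, then
    l_1, and so on; then move to tile T_{i+1}."""
    trace = []
    for t, tile in enumerate(tiles):
        for loop_idx, iters in enumerate(tile):
            for i in iters:
                kernels[loop_idx](i)
                trace.append((t, loop_idx, i))
    return trace

visited = []
kernels = [lambda i: visited.append(("l0", i)),
           lambda i: visited.append(("l1", i))]

# Two tiles over a 2-loop chain; tile T_0 owns a larger slice of l_0
# than of l_1, since extra l_0 iterations are needed to satisfy the
# dependencies of T_0's l_1 iterations.
tiles = [
    [[0, 1, 2], [0, 1]],    # T_0: l_0 iterations, then l_1 iterations
    [[3, 4],    [2, 3, 4]], # T_1
]
trace = execute_loop_chain(tiles, kernels)
```

Note that no iteration of a tile touches data produced by a later tile, which is the property the inspector's dependency analysis must guarantee.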
The dependency analysis allows communication avoidance to be exploited at two levels of parallelism. At the distributed-memory parallel level, as we will demonstrate, the idea is to move all communications to the beginning of the loop-chain, eliminating per-loop halo exchanges between neighboring processes in favor of a larger aggregated message. In this case, each mesh partition held by an MPI process can be thought of as a single "tile", which is executed without halo exchanges within the loop-chain. At the shared-memory multi-threaded level, the idea is to select sufficiently sized tiles that keep a working set of data in the fast cache memory of the processors, reducing the number of accesses from/to slower main memory per loop. The implementation of these ideas within OP2, creating a new communication-avoiding back-end, applying it to large-scale unstructured-mesh applications and investigating performance, forms the main contribution of this paper. We (1) initially codify sparse tiling for distributed-memory execution and then (2) extend it to GPU clusters, integrating the back-end with the CUDA code generated by OP2.

Related Work
Sparse tiling from [17] has previously been used in Firedrake [15], a high-level DSL framework for the automated solution of finite element computations. Firedrake requires problems to be specified in the Unified Form Language (UFL) [5]. The specification is then used to generate parallel executables for multi-core CPU nodes and clusters.
In [17], both shared-memory and distributed-memory sparse tiling are implemented in Firedrake, and the performance of a seismic exploration benchmark application, Seigen, is evaluated on a CPU cluster. In contrast, in this paper we demonstrate the performance of sparse tiling in a significantly larger (≈100K LoC F90, ≈500 loops) production application, Hydra, on both CPU and GPU cluster systems. A new analytic performance model is also developed to gain insights into the performance behaviour with CA.

A COMMUNICATION-AVOIDING BACK-END FOR OP2
OP2 uses an owner-compute model for parallelizing computations on a distributed-memory parallel system [20]. In this model, the unstructured mesh, defined by mappings and data, is partitioned among several processes so that each process owns some of the mesh elements. A process will only perform computations to update elements in its own partition, but will require data from elements in other partitions held on separate processes, specifically at the boundaries of the partitions. Thus, copies of data held in foreign partitions need to be communicated following the standard "halo" exchange mechanisms of a message passing parallel implementation. Figure 4 illustrates this halo setup and configuration in OP2, where the back-end segments the iteration space into: (1) a core set of iterations that do not need to access halo data, (2) an export halo consisting of mesh data to be sent from the local process to some foreign process, and (3) an import halo consisting of mesh data received from some foreign process. The elements in the import and export halos are further separated into two groups depending on whether redundant computations will be performed on them. For example, in the mesh in Figure 4, a loop over edges updating cells will require edges 5, 8 and 9 to be executed on rank X to update cells 4 and 5 (i.e. an import execute halo, ieh). However, a loop over edges reading data on cells will require cells 0, 4 and 5 to be imported onto rank Y (i.e. an import non-execute halo, inh). The inh is essentially a read-only halo. These have corresponding export execute (eeh) and export non-execute (enh) halos on each of the local processes.
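The execute/non-execute distinction can be sketched for one loop over a toy edges-to-cells map; the ownership arrays and classification rule below follow the owner-compute description in the text, but the details are illustrative, not OP2's actual data structures:

```python
# Hedged sketch of halo classification for a loop over edges, given an
# edges-to-cells map and per-element ownership.

def build_halos(edge_owner, cell_owner, e2c, my_rank, updates_cells):
    """Return (import execute, import non-execute) element lists.

    If the loop *updates* cells through the map, foreign edges touching
    my cells must be executed here redundantly (import execute halo).
    If the loop only *reads* cells, my rank instead imports read-only
    copies of the foreign cells its own edges touch (import non-execute).
    """
    ieh, inh = set(), set()
    for e, cells in enumerate(e2c):
        if updates_cells:
            if edge_owner[e] != my_rank and any(
                    cell_owner[c] == my_rank for c in cells):
                ieh.add(e)          # foreign edge updating a local cell
        else:
            if edge_owner[e] == my_rank:
                for c in cells:
                    if cell_owner[c] != my_rank:
                        inh.add(c)  # foreign cell read by a local edge
    return sorted(ieh), sorted(inh)

# Toy mesh: 4 edges, 3 cells, split across ranks 0 and 1.
e2c        = [(0, 1), (1, 2), (0, 2), (2, 2)]
edge_owner = [0, 0, 1, 1]
cell_owner = [0, 0, 1]

ieh0, _ = build_halos(edge_owner, cell_owner, e2c, 0, updates_cells=True)
_, inh0 = build_halos(edge_owner, cell_owner, e2c, 0, updates_cells=False)
ieh1, _ = build_halos(edge_owner, cell_owner, e2c, 1, updates_cells=True)
_, inh1 = build_halos(edge_owner, cell_owner, e2c, 1, updates_cells=False)
```

The same foreign element can thus land in the execute or the non-execute group depending on the access mode of the loop that touches it.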

Multi-layered Halo Data Structure
In OP2, the mesh data in op_maps and op_dats are held in 1D arrays, with the core, export and import halos ordered as illustrated in Figure 6(a). This ordering enables latency hiding, where the loop iterations in an op_par_loop corresponding to the core can be carried out while halo exchanges are in flight, as these iterations do not access halo data. This latency-hiding algorithm is detailed in Alg 1. OP2 exchanges the ieh and inh in separate messages. After all the messages have been sent/received, execution over the execute halos can be performed. In an op_par_loop, halos for an op_dat are exchanged only if (1) it is indirectly accessed as a read (OP_READ) or a read/write (OP_RW), and (2) it has been modified by a preceding loop, i.e. the halos need updating. A dirty-bit is used to keep track of when an op_dat is updated (by an OP_RW, OP_WRITE or an increment, OP_INC) in a loop.
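The latency-hiding pattern of Alg 1, together with the dirty-bit rule, can be sketched with sequential stand-ins for the MPI calls; the event names and the loop descriptions are illustrative, not OP2's internal API:

```python
# Sketch of one OP2 loop execution: start halo exchanges for dirty,
# indirectly-read dats, run core iterations while "messages" are in
# flight, then run the execute halo and update dirty-bits.

def run_op2_loop(loop, dats, trace):
    # 1. Initiate halo exchanges for dats that are indirectly read
    #    and whose halos are stale (dirty).
    in_flight = [d for d in loop["indirect_reads"] if dats[d]["dirty"]]
    for d in in_flight:
        trace.append(("isend/irecv", d))
    # 2. Execute core iterations (no halo data touched) to hide latency.
    trace.append(("compute_core", loop["name"]))
    # 3. Wait for messages, then execute over the execute halo.
    for d in in_flight:
        trace.append(("wait", d))
        dats[d]["dirty"] = False
    trace.append(("compute_exec_halo", loop["name"]))
    # 4. Dats written/incremented by this loop now have stale halos.
    for d in loop["modifies"]:
        dats[d]["dirty"] = True

dats = {"res": {"dirty": False}, "flux": {"dirty": False}}
loops = [
    {"name": "update",    "indirect_reads": [],      "modifies": ["res"]},
    {"name": "edge_flux", "indirect_reads": ["res"], "modifies": ["flux"]},
]
trace = []
for lp in loops:
    run_op2_loop(lp, dats, trace)
```

Running the two-loop chain shows update triggering no exchange, while edge_flux must exchange res because update left it dirty.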
The OP2 halos described above essentially maintain the data dependencies required to execute each process's partition independently in parallel. Explicit messages are exchanged to update/sync the halos when carrying out computations in each op_par_loop. The dependency neighborhood for a single loop can therefore be viewed as a halo layer with a depth of 1 (see Figure 5). As shown in [17], in a loop-chain with $n$ loops, the per-loop syncs can be eliminated if a larger dependency neighborhood, maximally a layer with a depth of $n$, is communicated at the start of the loop-chain and then redundantly computed over, updating the mesh elements that would otherwise have been updated by a halo exchange. Thus, for the loop-chain with 2 loops detailed in Figure 3, a halo depth of 2 needs to be maintained, as shown in Figure 7.
The maximum depth of $n$ is required only when, in a loop-chain $l_0, \ldots, l_{n-1}$, each loop $l_i$ updates an op_dat $d$ and the next loop $l_{i+1}$ reads or read/writes $d$. This leads to iteration spaces that decrease in size for each successive loop in the loop-chain. Specifically, to compute the owned set of iterations $S$ of the last loop in the chain, $l_{n-1}$, the loops $l_{n-1}, l_{n-2}, \ldots, l_0$ should iterate over $S$ plus halo depths of $1, 2, \ldots, n$ respectively. This algorithm is detailed in Alg 2. In the main inspection/setup phase, halo_exch_dats identifies the op_dats for halo exchange based on their access mode and dirty-bit value, as discussed above. Then, calc_halo_layers computes the number of halo layers required for each loop in the loop-chain L; this analysis is detailed in Alg 3. calc_iters (in Alg 2) computes the iteration counts within the core and halo layers, while restructure_elements separates each halo layer into core and eeh as shown in Figure 6(b) and renumbers the mappings of core, eeh and enh accordingly. OP2's halo data structure was extended to support this multi-layered halo setup. The loop-chain executes the core, followed by the eeh layers $h_0$ to $h_{k-1}$, then all the import halo layers $h_0$ to $h_{k-1}$ (see lines 8-18 in Alg 2). Given a loop-chain, Alg 3 calculates the minimum halo extension required for each op_dat in each loop according to its individual data access patterns. The maximum halo extension required by a loop is then obtained from the halo extensions calculated for the individual op_dats in the loop, finally making that loop's halo extension effective for all the op_dats in the loop.
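The per-loop halo depths can be computed with a simple backward pass over the chain. The sketch below is a hedged simplification of Alg 3 (the real analysis in OP2 is per-dat and per-access-mode); it only captures the write-then-indirect-read dependency described in the text:

```python
# Toy backward-pass version of calc_halo_layers: a loop that writes a
# dat indirectly read by a later loop must extend its halo one layer
# beyond that reader's depth.

def calc_halo_layers(chain):
    """chain: list of dicts with 'writes' and (indirect) 'reads' dat
    sets, in execution order. Returns the halo depth each loop must
    redundantly compute over so the whole chain runs with communication
    only at the start."""
    n = len(chain)
    depth = [1] * n          # every loop executes at least a depth-1 halo
    for i in range(n - 2, -1, -1):
        for j in range(i + 1, n):
            if chain[i]["writes"] & chain[j]["reads"]:
                depth[i] = max(depth[i], depth[j] + 1)
    return depth

chain = [
    {"writes": {"res"},  "reads": set()},    # l_0: update
    {"writes": {"flux"}, "reads": {"res"}},  # l_1: edge_flux
]
depths = calc_halo_layers(chain)   # l_0 needs 2 layers, l_1 needs 1
```

For a chain where every loop feeds the next, the depths grow to $n, n-1, \ldots, 1$, matching the worst case described above.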
To further reduce the number of messages exchanged for the whole loop-chain, the halos from multiple loops and op_dats are organized into a structure from which a single message per MPI rank can be constructed (line 5 in Alg 2), as illustrated in Figure 8. Alg 2 also performs latency hiding, where the core iterations of each loop in the chain are executed first while halos are in flight. After the sends/receives complete, the iterations in the multiple halo layers are executed.
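The grouping step amounts to concatenating the per-loop, per-dat halo sections destined for a given neighbour into one buffer, with an offset table for unpacking; the layout and names below are illustrative, not OP2's actual wire format:

```python
# Sketch of grouping many halo sections into one message per neighbour
# rank (cf. Figure 8).

def pack_grouped_message(halo_sections, neighbour):
    """halo_sections: list of (loop, dat, dest_rank, payload) entries.
    Returns one flat buffer for `neighbour` plus an offset table that a
    receiver could use to unpack each section back to its dat/level."""
    buf, offsets, pos = bytearray(), [], 0
    for loop, dat, rank, payload in halo_sections:
        if rank != neighbour:
            continue
        buf += payload
        offsets.append((loop, dat, pos, len(payload)))
        pos += len(payload)
    return bytes(buf), offsets

sections = [
    (0, "res", 1, b"\x01\x02"),   # loop 0, dat res, halo for rank 1
    (1, "res", 1, b"\x03"),       # loop 1 halo level, also for rank 1
    (0, "res", 2, b"\x04"),       # destined to a different neighbour
]
msg, table = pack_grouped_message(sections, neighbour=1)
```

One send per neighbour then replaces the two-messages-per-dat-per-loop pattern of the original execution, at the cost of the pack/unpack work.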

Analytic Model for Loop-chain Performance
An analytic performance model for reasoning about the runtimes of OP2 loops, and of an equivalent loop-chain executed with the communication-avoiding setup, can be developed based on a parameterization of loop characteristics, communication patterns and machine properties. Considering the execution of a single OP2 loop $l$ as detailed in Alg 1, its runtime can be modeled by (i) the time taken for computing the $n_{core}$ (core) and $n_{ieh} + n_{eeh}$ (execute halo) iterations, and (ii) the time taken to sync/communicate halos, minus any overlap of computation and communication. If we write $n_{ieh} + n_{eeh} = n_{eh} = n^{1}_{eh}$ to indicate that this is execution over a single halo layer, then the time taken by an OP2 loop is given by:

$$T_{OP2}(l) = \max\left(t_c\,n_{core},\; 2\,d_l\,p\left(\lambda + \frac{m^1}{\beta}\right)\right) + t_c\,n^{1}_{eh} \qquad (1)$$

Here, $t_c$ is the compute time for one iteration of the loop body, $d_l$ is the number of op_dats with halos to be synced, and $p$ is the maximum number of neighboring processes to communicate halos with per MPI process. $m^1$ is the maximum message size (in bytes) sent to a neighbor for either the eeh or enh halos; the superscript 1 indicates the number of halo layers exchanged (1 is the default for OP2 loops). The multiplier 2 accounts for sending a separate message for the eeh and enh. The maximum message size and the number of neighbors per MPI process are only known at runtime, after mesh partitioning [20]. The maximum is used for each of the components above to model the critical path of the runtime, where, for example, we assume that the halo exchange between processes only completes as fast as the slowest exchange between a pair of processes.
$\lambda$ and $\beta$ are the latency and bandwidth of the network, respectively. The time taken for the $n$ OP2 loops in a loop-chain $\mathcal{L}$ is then simply the sum of the times taken by the individual loops:

$$T_{OP2}(\mathcal{L}) = \sum_{l=0}^{n-1} T_{OP2}(l) \qquad (2)$$

When the loop-chain is executed with the communication-avoiding setup using the multi-layered halo data structures, a single grouped halo message is exchanged per neighbor process. This, together with the execution steps in Alg 2, leads to a total runtime for the full loop-chain of:

$$T_{CA}(\mathcal{L}) = \max\left(\sum_{l=0}^{n-1} t^{l}_{c}\,n^{l}_{core},\; p\left(\lambda + \frac{m_k}{\beta}\right) + t_g\right) + \sum_{l=0}^{n-1} t^{l}_{c}\,n^{l,k}_{eh} \qquad (3)$$

As detailed in Alg 2, $n^{l,k}_{eh}$ includes iterations from the execute halos of multiple levels for each loop $l$. The message size $m_k$ is the maximum grouped message size sent to each neighbor, combining both the eeh and enh into a single message as discussed before. $k$ (where $k \leq n$) indicates the maximum number of halo layers required to carry out the CA algorithm for the loop-chain. Thus, $m_k$ is given by:

$$m_k = \sum_{l=0}^{n-1} \sum_{d \in D_l} \left(s^{eeh}_{d}(h_l) + s^{enh}_{d}(h_l)\right) b_d \qquad (4)$$

Here, $D_l$ is the set of halo-syncing op_dats in loop $l$ (with $|D_l| = d_l$), $h_l$ is the halo extension for loop $l$, $s^{eeh}_{d}(h_l)$ is the eeh size up to level $h_l$ of the set on which the data set $d$ is defined, and $s^{enh}_{d}(h_l)$ is the enh size at level $h_l$. Not all enh levels up to $h_l$ are packed into the message; only the levels that have been updated are included. Since a larger number of messages are grouped for communication, an additional compute cost $t_g$ (a packing and unpacking cost) is added to the communication per neighbor. Finally, $p$ is the maximum number of neighbors a process communicates with when exchanging the grouped message, and $b_d$ is the size of a data element of the op_dat $d$ in bytes.
From equations (2) and (3), we can see that the CA version saves communication time with the single message sent per neighbor, but this gain only becomes applicable when the time to compute the core iterations, $t^{l}_{c}\,n^{l}_{core}$, is smaller than the communication time, due to the latency-hiding setup of the execution. Thus, we can hypothesize that any performance gains are more likely to appear at high processor counts (i.e. at large machine sizes) in a strong-scaling scenario, when the number of core iterations per partition (one partition is assigned per process in OP2) is smaller. The communication time further depends on the maximum number of halo layers $k$, which determines the message size $m_k$. If $k$ is significantly smaller than $n$, then the gains are likely to be large. However, any performance gains from faster communications can be diminished if the sum of the times to execute over the multiple halos, given by $\sum_{l=0}^{n-1} t^{l}_{c}\,n^{l,k}_{eh}$ in (3), is significantly larger than the sum of times to execute over a single halo region, given by $\sum_{l=0}^{n-1} t^{l}_{c}\,n^{l,1}_{eh}$ in (2). Hence, again, the maximum number of halo layers involved in the CA execution of the loop-chain has a significant impact.
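A small numerical sketch makes the qualitative behaviour concrete: per-loop halo exchanges with latency hiding versus one grouped exchange at the start of the chain, with redundant compute over deeper halos. All constants below are made up for illustration; they are not measurements from the paper:

```python
# Illustrative cost comparison between the original per-loop execution
# and the CA loop-chain execution.

def t_op2_chain(loops, p, lat, bw):
    """Per-loop cost: max(core compute, halo comms) + execute-halo
    compute. Each loop sends two messages (eeh and enh) per synced dat
    per neighbour."""
    total = 0.0
    for lp in loops:
        comm = 2 * lp["ndats"] * p * (lat + lp["msg_bytes"] / bw)
        total += max(lp["tc"] * lp["n_core"], comm) + lp["tc"] * lp["n_eh1"]
    return total

def t_ca_chain(loops, p, grouped_bytes, t_pack, lat, bw):
    """One grouped message per neighbour, exchanged while all core
    iterations run; then redundant compute over the multi-layer halos."""
    core = sum(lp["tc"] * lp["n_core"] for lp in loops)
    comm = p * (lat + grouped_bytes / bw) + t_pack
    halo = sum(lp["tc"] * lp["n_eh_k"] for lp in loops)
    return max(core, comm) + halo

# Eight identical latency-bound loops, as in a strong-scaling regime
# with small per-process partitions (where CA is predicted to win).
loops = [dict(tc=1e-8, n_core=1000, n_eh1=500, n_eh_k=1000,
              ndats=1, msg_bytes=4000) for _ in range(8)]
p, lat, bw = 8, 2e-6, 1e10            # neighbours, 2 us latency, 10 GB/s
t_op2 = t_op2_chain(loops, p, lat, bw)
t_ca = t_ca_chain(loops, p, 8 * 8000, 5e-6, lat, bw)  # one grouped msg
```

With these made-up numbers the per-loop exchanges dominate the OP2 time while the CA chain hides its single grouped exchange under the accumulated core compute, so the CA total comes out lower; raising n_core enough to hide the per-loop exchanges reverses the outcome, mirroring the discussion above.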

Cluster of GPUs
Extending the CA distributed-memory execution to a cluster of GPUs is also possible, given that OP2 code-generates CUDA with MPI. In the GPU CA version, the halos are transferred via MPI by first copying them to the host over the PCIe bus. This implementation does not utilize NVIDIA's GPUDirect [23] technology for transferring data between the GPUs. Instead, a communication pipeline is set up, allowing maximum overlap of kernels, memory copies, communication and core computations. We found that this performs better than GPUDirect, which in many cases did not run simultaneously with the compute kernels. The multi-layered halo setup remains the same, but an extra data copy from host to device and vice versa occurs during halo exchange, as detailed before. This cost can be approximated by a larger communication latency $\Lambda$ replacing $\lambda$ in equation (3). Additionally, $t^{l}_{c}$ will need to be estimated for a GPU.

Automatic Code-generation
The integration of the CA optimizations into OP2 is done such that any application developed with the OP2 API can immediately utilize the new capabilities without modifications to the high-level source. The only addition to OP2's existing automatic code-generation process [20] (as detailed in Figure 9) is a configuration file specifying the list of loops to be chained in the application. The file details the loop names, loop count and maximum halo extension of the loops. The OP2 code-generator was extended so that it can automatically generate the loop-chain execution template in Alg 2 for these selected loops. The generated code is human-readable, similar to all code generated by OP2. Finally, the code can be compiled using a conventional compiler, linking the new CA back-end library, to generate an executable for the selected hardware.
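The text does not give the configuration file's actual format; as a purely hypothetical illustration, a file carrying loop names, loop count and maximum halo extension per chain, together with a matching parser, might look like:

```python
# Hypothetical loop-chain configuration format and parser (not OP2's
# real file format; field names and layout are invented for this sketch).

CONFIG = """\
# chain_name : loop_count : max_halo_extension : loop_names
flux_chain : 2 : 2 : update,edge_flux
"""

def parse_chains(text):
    chains = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, count, halo, loops = [f.strip() for f in line.split(":")]
        chains.append({"name": name, "count": int(count),
                       "max_halo": int(halo), "loops": loops.split(",")})
    return chains

chains = parse_chains(CONFIG)
```

Whatever the concrete syntax, the code generator only needs this per-chain triple of names, count and halo extension to emit the Alg 2 template for the selected loops.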

PERFORMANCE
We now investigate the performance of the communication-avoiding back-end by applying it to two existing applications developed with OP2: (1) a representative CFD mini-app, MG-CFD [24], used for benchmarking and co-design, and (2) the recently developed OP2 version of Hydra, OP2-Hydra [21], a large-scale production CFD application used at Rolls-Royce plc.
Performance is benchmarked on two systems: the HPE Cray EX system ARCHER2, and the SGI/HPE 8600 GPU cluster Cirrus, both located at EPCC, UK. Table 1 briefly details the key hardware and system setup of these two systems. Each ARCHER2 node consists of two AMD EPYC 7742 processors, each with 64 cores (128 cores in total), arranged in eight NUMA regions per node (16 cores per NUMA region) [4]. Each node is equipped with 256 GB of memory. The nodes are interconnected by an HPE Cray Slingshot network with 2×100 Gb/s bi-directional bandwidth per node. The GNU compiler collection version 10.2.0 was used on ARCHER2, with the compiler flags noted in the table. The Cirrus GPU cluster has 4×V100 GPUs per node, with each node also containing 2×Intel Xeon Gold 6248 (Cascade Lake) processors, each with 20 cores (40 cores in total). A node has 384 GB of main memory, and a single V100 GPU has 16 GB of global memory.

MG-CFD
MG-CFD [24] is a 3D unstructured multi-grid, finite-volume computational fluid dynamics (CFD) mini-app for inviscid flow, developed by extending the CFD solver in the Rodinia benchmark suite. The mini-app implements a 3D finite-volume solution of the Euler equations for inviscid, compressible flow over an unstructured grid, and uses a multi-grid to accelerate the convergence of the solution. MG-CFD has been converted to use the OP2 API and its performance has previously been benchmarked in [28]. For our experiments, we use the NASA Rotor 37 meshes, which represent the geometry of a transonic axial compressor rotor widely used for validation in CFD.

4.1.1 Synthetic Loop-chains: As discussed in Section 3, an OP2 loop will exchange MPI halos for an op_dat if it is to be indirectly read (OP_READ) in a loop but has been modified (OP_WRITE, OP_INC or OP_RW) in a preceding loop. In this case, we note the op_dat's halos as dirty at the start of the loop, triggering an MPI halo exchange before halo values are accessed for computation. As such, two consecutive loops, the first modifying an op_dat followed by a second reading the same data indirectly, is our target access pattern (i.e. access descriptor in the loop-chain abstraction) for applying sparse tiling. Unfortunately, consecutive loops with the above access descriptor do not exist in MG-CFD. However, the relatively small size and simplicity of MG-CFD meant that we could add a synthetic sequence of loops to create the required setup. With such a configuration, we are able to create arbitrarily extendable loop-chains with the above target access pattern, allowing us to examine the limits of a single loop-chain and observe its consequent performance, reasoning with the performance model. The new code introduced to MG-CFD consists of two loops [1]. The first, ca_write_kernel, modifies an op_dat named var through an indirect increment (OP_INC) operation while iterating over the edges. Thus, var becomes dirty at the
end of this loop. The second loop, ca_read_kernel, indirectly reads var. ca_read_kernel is in fact a copy of the most time-consuming loop in MG-CFD, compute_flux_edge, which has the same access modes. This enables us to make an effective comparison between reducing communications, i.e. no halo exchanges in the second loop, and the consequent increase in the (redundant) computations over the larger-depth halos in the first loop. This 2-loop chain is enclosed within an outer loop; by setting its iteration count nchains we can create longer loop-chains to explore performance. With nchains = 1, the execution of the loop-chain according to Alg 2 will not result in a reduction in the number of MPI messages exchanged. However, for nchains > 1, taking the resulting sequence of loops as a single loop-chain and applying Alg 2, multiple halo exchanges can be combined into a single larger message. We use this configuration to explore the performance on the ARCHER2 and Cirrus clusters.

Table 2: MG-CFD on ARCHER2 (optimistic) – Model Components: OP2 comms vs. CA comms (in bytes), OP2 vs. CA core iterations, OP2 vs. CA halo iterations, and performance gain % of CA over OP2.

4.1.2 ARCHER2 Results: Figure 10 presents execution times (minimum of at least 5 runs each, CoV < 0.01) of MG-CFD on ARCHER2 for mesh sizes of 8M and 24M. The reported runtime is the time taken by the main iteration loop. We have not included the constant cost of the inspection phase, which is amortized (and becomes negligible) over a larger number of main iterations, as is typical for real-world applications. Note the log scale of the y-axis. For both cases, we compare the original OP2 runtimes and the CA runtimes for loop-chains with loop counts $n = 2, 4, \ldots, 32$. For each run, we utilized the full 128 cores/node (128 MPI processes/node). Additionally, to obtain the best partitions per process, i.e.
smallest MPI halos and the fewest neighbours per process, we used the k-way partitioner routine from ParMETIS. Increasing the loop count from 2 to 32 results in the original OP2 loops exchanging 16× more messages per neighbour, while only a single message is exchanged in the CA version. The grouped halo message for CA can, however, contain up to as many halo layers as the loop count. According to the data access patterns of the synthetic loop-chain, the required halo depth is 2 for our benchmarking, which bounds the grouped message size accordingly. We investigate the performance for the case where the number of op_dats exchanged remains constant at 2 per loop-chain. While this scenario is synthetic, as shown in [27] for structured-mesh codes, it is a prevalent case in real-world applications. The same was observed for Hydra, an unstructured-mesh code, but within much smaller loop-chains (see Section 4.2). For both the 8M and 24M meshes, better runtimes can be seen at higher node counts with CA. The performance gains are also larger for higher loop counts. This aligns with the insights from the model, where the CA version saves on the number of messages sent without an increase in the total message size. Up to 35% faster runtimes can be seen with CA compared to the original OP2. Empirical measurements for the above runs provide further evidence for the performance trends. Table 2 details the computation and communication model component factors for the 8M and 24M mesh executions with MG-CFD on ARCHER2. These were obtained by recording the message sizes and number of MPI neighbours for each tested configuration and substituting these values into the analytic model. Per node, the amount of data communicated among the neighbours increases in the OP2 version when increasing the loop count, but remains constant in the CA version. The amount of core computation, which is performed while the halo exchange is in progress, is always smaller for the CA execution compared to OP2. However, the number of
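The dirty-halo bookkeeping that drives these exchanges can be sketched concretely. The following is an illustrative Python model of the trigger logic described above, not OP2's implementation; the kernel and dataset names (ca_write_kernel, ca_read_kernel, var) follow the paper, while the simplified par_loop signature is our own:

```python
# Access modes as in the OP2 API.
OP_READ, OP_WRITE, OP_INC, OP_RW = "read", "write", "inc", "rw"

class Dat:
    """Minimal stand-in for an op_dat, tracking halo staleness."""
    def __init__(self, name):
        self.name = name
        self.halo_dirty = True  # halos are stale until first exchange

def par_loop(dat, access, exchanges):
    """Model one OP2 loop touching `dat` indirectly with `access`."""
    if access == OP_READ and dat.halo_dirty:
        exchanges.append(dat.name)  # MPI halo exchange triggered here
        dat.halo_dirty = False
    if access in (OP_WRITE, OP_INC, OP_RW):
        dat.halo_dirty = True       # halos stale for subsequent readers

def run_chain(nchains):
    """The synthetic 2-loop chain repeated nchains times, as in the text.

    Returns the number of halo exchanges the baseline OP2 scheme issues.
    """
    var = Dat("var")
    exchanges = []
    for _ in range(nchains):
        par_loop(var, OP_INC, exchanges)   # ca_write_kernel: dirties var
        par_loop(var, OP_READ, exchanges)  # ca_read_kernel: needs halos
    return len(exchanges)
```

Under this model the baseline issues one exchange per chain iteration (`run_chain(32)` gives 32 rounds of messages), which is exactly the sequence the CA back-end collapses into a single, deeper grouped exchange.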

4.2 OP2-Hydra
Our second application is the OP2 version of Rolls-Royce's Hydra CFD application [21, 26]. Hydra is a full-scale production application developed for modeling various aspects of turbomachinery design. It is an unstructured-mesh finite-volume solver for the compressible Reynolds-Averaged Navier-Stokes (RANS) equations in their steady and unsteady formulations (RANS/URANS). It uses a 5-step Runge-Kutta method for time-marching, with multi-grid and block-Jacobi preconditioning. Here, we again use the same 8M and 24M node NASA Rotor 37 meshes. OP2-Hydra consists of around 500 parallel loops with significantly more complex computations performed on the mesh than the loops in MG-CFD. A number of loop-chains were identified as targets for CA optimizations. Table 3 and Table 4 detail the six loop-chains selected for our benchmarking. The first three chains (Table 3), weight, period and gradl, are loop-chains which require multiple layers of halos for their execution. The constituent loops, their iteration sets, the access modes of op_dats that require halo exchanges and the maximum halo layers required for each loop are detailed in the tables. The last three loop-chains (Table 4) require only a single layer of halos; however, they consist of the most time-consuming loops in Hydra [26]. The relative costs, or proportional contributions, of the loop-chains to the total runtime of Hydra are: vflux 18%, iflux 5%, gradl 8% and jacob 2%. The loop-chains weight and period are inside the setup phase, outside the main time-marching loop.

4.2.1 ARCHER2 Results: The performance of each loop-chain on up to 128 nodes (16k cores) is detailed in Figure 12 (CoV < 0.08). This is the cumulative time taken by each loop-chain for 20 iterations of the main time-marching loop. Hydra's default partitioner, based on recursive inertial bisection of the mesh, is used in all these experiments. Results indicate that the loop-chains with the highest communication reduction, period and jacob, showed performance improvements with CA
- 42% and 40% on 64 nodes for the 8M problem, respectively. The weight loop-chain showed performance gains only with the 8M mesh, with a maximum of 14% on 128 nodes; its communication reduction was not sufficient to outweigh the computation increase on the 24M mesh. For the other loop-chains, executing them as individual OP2 loops gave the best performance. Again, the insights from the model in Table 5 align with these results: loop-chains with higher communication reduction, preferably with large loop counts, tend to tip the balance of computation vs. communication in favour of CA.

4.2.2 Cirrus Results: On Cirrus (Figure 13, CoV < 0.07), the gains from CA are larger, with loop-chains vflux, iflux and jacob performing up to 42%, 31% and 68% faster on the node counts benchmarked. A majority of loop-chains achieve a speedup with CA on the GPU cluster, compared to the CPU cluster ARCHER2.
Loop-chains whose communication reduction outweighs the redundant computation added by the halo extensions show performance gains in Hydra. The communication reduction and the total core size for latency hiding of period are significantly higher than those of weight, giving it a higher performance gain, as indicated through the model in Table 5. Loop-chains such as iflux and vflux, which reduce the number of messages with a grouped halo and perform latency hiding with core execution but achieve no reduction in communication volume, are unlikely to give performance gains on CPU clusters. However, they show improvements on GPU clusters due to cutting down on host-device communications. Loop-chains such as gradl, which cause an increase in both communication and computation, tend to degrade performance even with a sufficiently large number of latency-hiding core computations and grouped halos. On the other hand, loop-chains such as jacob, which reduce communication with latency hiding and no computation increase, always tend to give performance gains.
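A back-of-envelope version of this trade-off can be written down directly. The sketch below is our own simplification using a standard latency-bandwidth (alpha-beta) message cost model, not the paper's exact formulation: baseline OP2 sends one message per neighbour per loop in the chain, while CA sends one grouped message per neighbour carrying all the halo data, at the price of redundant computation over the extended halos.

```python
def ca_runtime_benefit(n_loops, n_neighbours, halo_bytes,
                       alpha, beta, redundant_compute_time):
    """Predicted time saved by CA over baseline OP2 for one loop-chain.

    alpha: per-message latency (s); beta: per-byte transfer cost (s/byte).
    A positive return value means the CA version is predicted to win.
    """
    # Baseline: every loop in the chain exchanges with every neighbour.
    op2_comm = n_loops * n_neighbours * (alpha + beta * halo_bytes)
    # CA: one grouped message per neighbour; same total bytes moved.
    ca_comm = n_neighbours * (alpha + beta * n_loops * halo_bytes)
    # CA wins only if the latency saving covers the redundant work
    # performed over the extended (deeper) halo region.
    return (op2_comm - ca_comm) - redundant_compute_time
```

Under these assumptions the bandwidth terms cancel, so the saving reduces to n_neighbours * alpha * (n_loops - 1) minus the redundant-compute cost. This is consistent with the observations above: long chains that add little computation (e.g. jacob) favour CA, while chains that add computation without reducing communication volume (e.g. gradl) do not.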
In general, the CA version increases the message size and decreases the message count. However, it does not change the load balance of the tasks divided among the processes compared to the OP2 version. Message exchange between processes is always associated with message packing and unpacking costs. The CA/OP2 message-packing cost ratio is equivalent to the ratio of the CA and OP2 total halo message sizes. The OP2 version does not incur a message-unpacking cost, since the messages related to a particular dataset are copied directly into the relevant dataset array on receipt. The CA version incurs an additional unpacking cost when copying the data elements of the multiple datasets, received in the same message, into the relevant dataset arrays. However, this unpacking cost is negligible: the CA version performs bulk memcpy operations on the received messages, which is far cheaper than the cost of the multiple message exchanges in the OP2 version.
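The grouped packing and bulk-copy unpacking described above can be illustrated with a small sketch. This is a hypothetical Python model, not OP2's actual buffer layout; it shows why a single grouped message still needs only one chunked copy per dataset on the receiving side.

```python
def pack_grouped(halo_regions):
    """Pack the export-halo bytes of several datasets into one message.

    halo_regions: list of (dataset_name, bytes) pairs, one per op_dat.
    Returns the grouped message plus (offset, length) metadata that the
    receiver uses to slice the message back apart.
    """
    offsets, buf = {}, bytearray()
    for name, region in halo_regions:
        offsets[name] = (len(buf), len(region))
        buf.extend(region)               # one chunk append per dataset
    return bytes(buf), offsets

def unpack_grouped(message, offsets):
    """One bulk slice copy per dataset out of the received message."""
    return {name: message[off:off + n]
            for name, (off, n) in offsets.items()}
```

A round trip (pack two datasets, send one message, unpack with two bulk copies) replaces what would otherwise be two separate sends and receives per neighbour.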

CONCLUSION
In this paper, we have presented the implementation of data movement-reducing and communication-avoiding optimizations for large-scale unstructured-mesh applications. The runtime analysis of a loop-chain for creating sparse-tiling schedules, and its extension to distributed-memory parallelization as detailed by Luporini et al. [17], is implemented through the OP2 DSL. The new back-end in OP2 for communication avoidance (CA) codifies these techniques such that they can be applied automatically to any OP2 application. The careful trade-off of increased redundant computation in place of data movement was analyzed for distributed-memory parallelization using a representative CFD mini-app and Hydra, a production CFD code from Rolls-Royce. The performance trade-offs were modeled analytically, examining the determinants and characteristics of a given unstructured-mesh loop-chain that can lead to performance benefits with CA techniques. Performance benchmarks were carried out on a large HPE Cray EX system and an NVIDIA V100 GPU cluster.
Results show that loop-chains whose communication reduction outweighs their computation increase, generally those with higher loop counts, provide increasingly larger performance gains. This is true for both CPU and GPU clusters. For such loop-chains, the balance of computation to communication also begins to favour the CA-optimized version at larger node counts. We see 30 - 65% runtime reductions. However, identifying such profitable loop-chains remains a challenge in real-world applications. As we see from Hydra, several such chains could be targeted. Nevertheless, the use of OP2's new CA back-end allowed us to elicit the best performance for Hydra without changing the large science source code, maintaining performance portability. Future work will explore the performance of further large-scale production applications on a wider range of hardware with code generated through OP2. We will also further automate the code-generation process with lazy evaluation [27]. The new CA back-end for OP2 is available as open-source software at [2] and the loop-chained MG-CFD is available at [1].

Figure 4 :
Figure 4: OP2 partitioning over two MPI ranks and the resulting halos on each rank [20]

Figure 6 :
Figure 6: op_dat and op_map data structures with (a) single and (b) multiple halo levels

halo data, (2) an export halo consisting of mesh data to be sent from the local process to some foreign process, and (3) an import halo consisting of mesh data received from some foreign process. The elements in the import and export halos are further separated into two groups depending on whether redundant computations will be performed on them. For example, in the mesh in Figure 4, a loop over edges updating cells will require edges 5, 8 and 9 to be executed on rank X to update cells 4 and 5 (i.e. an import execute halo, ieh). However, a loop over edges reading data on cells will require cells 0, 4 and 5 to be imported onto rank Y (i.e. an import non-execute halo, inh). The inh is essentially a read-only halo. These then have corresponding export execute (eeh) and export non-execute (enh) halos on the local processes.
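The import halo lists can be computed from ownership information alone. The sketch below is a hypothetical Python illustration on a tiny 1D mesh (not the mesh of Figure 4): for a loop over edges updating cells it derives the import execute halo (foreign edges that must be executed redundantly because they update locally owned cells), and for a loop over edges reading cells it derives the import non-execute halo (foreign cells that are only read). The eeh and enh lists are simply these same sets as seen from the owning rank's side.

```python
def build_import_halos(edges, edge2cells, edge_owner, cell_owner, my_rank):
    """Classify import halos for loops over edges touching cells."""
    ieh, inh = set(), set()
    for e in edges:
        cells = edge2cells[e]
        if edge_owner[e] != my_rank:
            # Foreign edge updating one of my cells: must be executed
            # redundantly here -> import execute halo.
            if any(cell_owner[c] == my_rank for c in cells):
                ieh.add(e)
        else:
            # My edge reading a foreign cell: a read-only copy is
            # needed -> import non-execute halo.
            for c in cells:
                if cell_owner[c] != my_rank:
                    inh.add(c)
    return ieh, inh

# Tiny hypothetical 1D mesh: 4 cells in a row, 3 edges between them.
edge2cells = {0: (0, 1), 1: (1, 2), 2: (2, 3)}
edge_owner = {0: 0, 1: 0, 2: 1}          # edge owned by its first cell's rank
cell_owner = {0: 0, 1: 0, 2: 1, 3: 1}    # cells split between ranks 0 and 1
```

On this mesh, rank 0 gets an empty ieh and inh = {2} (its edge 1 reads foreign cell 2), while rank 1 gets ieh = {1} (rank 0's edge 1 updates rank 1's cell 2) and an empty inh, mirroring the asymmetry between ranks X and Y in the Figure 4 example.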

Figure 9 :
Figure 9: OP2 Code Generation with CA