Parallelized Remapping Algorithms for km-scale Global Weather and Climate Simulations with Icosahedral Grid System

In weather and climate research, latitude–longitude grid data are typically used for analysis and visualization, and remapping from model native grids to latitude–longitude grids often requires a significant amount of time. Here, we developed a series of parallelized remapping algorithms for NICAM, a global weather and climate model with an icosahedral grid system, and demonstrated their performance with global 14–0.87-km mesh model data on the supercomputer Fugaku. The original remapping tool in NICAM parallelizes only the reading and interpolation of data. In our proposed algorithms, the data-writing process is parallelized by separating output files or by using the MPI-IO library, both of which enable us to remap 0.87-km mesh data with 670 million horizontal grid points and 94 vertical levels. The benchmark with 14-km mesh data shows that the developed algorithms significantly outperform the original algorithm in terms of elapsed time (by 7.4–8.7 times) and memory usage (by 2.8–5.0 times). Among the proposed algorithms, the separation of output files, together with a reduced MPI communication size, leads to better performance in elapsed time and its scalability, whereas the use of the MPI-IO library leads to better performance in memory usage. The remapping year per wall-clock day, assuming a six-hourly output interval, is up to 0.56 with 3.5-km mesh data, demonstrating the feasibility of handling global cloud-resolving climate simulation data in a practical time. This study demonstrates the importance of IO performance, including MPI-IO, in accelerating weather and climate research on future supercomputers.


INTRODUCTION
Approximately 20 years ago, the first-ever global cloud-resolving simulation was performed with the 3.5-km mesh NICAM, a nonhydrostatic global atmospheric model with an icosahedral grid system [12,13,20], on the Earth Simulator [9,19]. More than a decade after this pioneering study, recent advancements in supercomputers have encouraged several modeling centers to develop global atmospheric models with a km-scale mesh size [16]. Such a km-scale global atmospheric model enables us to explicitly simulate cloud and precipitation processes associated with deep convection and scale interactions in a physically consistent manner and to represent extreme phenomena such as tropical cyclones. Recently, km-scale models have come to be regarded as a core engine of the digital twin of the Earth [4,15,21], as their results are often indistinguishable from satellite images.
The use of a quasi-uniform grid system is a promising approach for km-scale global atmospheric models because of its locality of data access with massive parallelism during simulations [22]. Among the nine models participating in the project DYAMOND [16], the first global storm-resolving model intercomparison project, five models adopted a quasi-uniform grid, two models adopted a latitude-longitude grid, and the remaining two models adopted a conventional spectral method as the dynamical core. For meteorological analysis and visualization, data users typically need latitude-longitude grid data, and remapping from the quasi-uniform grid system to the latitude-longitude grid system is a necessary step during or after simulations. However, remapping is generally an input/output (IO)-intensive process; its cost relative to the overall workflow of pre-processing, simulation, and post-processing is likely to keep increasing as resolution and integration time increase, because IO performance is improving more slowly than computational and memory throughput. Specifically, a global 3.5-km mesh simulation with NICAM is now being extended to the climate scale [10,17], which motivates us to develop efficient parallelized remapping algorithms that are sufficiently fast to handle km-scale global data in a practical time.
In this study, we developed a series of remapping algorithms that convert icosahedral grid data to latitude-longitude grid data and demonstrated their performance using 14–0.87-km mesh NICAM simulation data on the supercomputer Fugaku. Some related works are reviewed in Section 2, and the data structure of the icosahedral grid system in NICAM is introduced in Section 3. The remapping algorithms developed in this study are described in Section 4, and the specifications of the data and experimental designs are presented in Section 5. The benchmark results are presented in Section 6, and the summary and discussion are presented in Section 7.

RELATED WORKS
The development of remapping algorithms (schemes) in the weather and climate field is typically motivated by (1) multimodel coupled simulations with different grid systems (e.g., atmosphere and ocean) and (2) analysis and visualization of the simulation output. For (1), remapping schemes that consider accuracy, conservation, and monotonicity have already been proposed [5,8]. In NICAM, [1,2] implemented a remapping scheme in the coupler "Jcup" and demonstrated its computational performance with 220–14-km mesh models. Here, we focus on (2), where accuracy, conservation, and monotonicity are not necessarily a high priority. In NICAM, [2] introduced "NICAMIO" as a part of Jcup to write latitude-longitude grid data, remapped from the icosahedral grid data during the simulation, with the MPI-IO library. This kind of online output is user-friendly; however, the high IO and communication loads on a shared supercomputer system may obscure the elapsed time attributable to the remapping alone. Therefore, in this study, the remapping task is treated as a post-process of the simulation to separate these concerns.
To the best of our knowledge, this is the first study to develop a remapping algorithm that supports global km-scale meteorological data with MPI parallelization. [1] did not use MPI parallelization in their benchmark measurements to remap 14–220-km mesh icosahedral grid data to latitude-longitude grid data. In the project DYAMOND [16], a tool called "CDO" (Climate Data Operators) was used to remap model native grid data to 0.1-degree latitude-longitude grid data. According to the CDO user guide [14], some operators are parallelized with OpenMP; however, the reading and writing processes are not parallelized at all.
In general, IO-intensive tasks with massive parallelism, more specifically, the output of large files whose data are distributed across many processes, can be a significant bottleneck for weather and climate simulations as well as for many other applications. Several weather and climate modeling groups have investigated this issue using MPI-IO and related techniques [6,24,25]. The implementation of MPI-IO itself significantly affects IO performance [3], and the selection and tuning of the MPI-IO library could be a next step of this study.

DATA STRUCTURE OF THE NICAM ICOSAHEDRAL GRID SYSTEM
The data structure of the icosahedral grid system in NICAM [7,13,18] is briefly introduced here. Figure 1 shows icosahedral grids. The horizontal resolution of NICAM is measured by the grid division level, denoted glevel (≥ 0) in this study. glevel = 0 corresponds to an icosahedron (Figure 1a), which has 12 vertices (horizontal grid points) and 10 diamonds, each consisting of 2 triangles. A finer mesh is generated by recursively dividing each diamond into four sub-diamonds (from Figure 1a to 1f), and glevel is incremented by one for each recursion. Therefore, the number of horizontal grid points over the globe is 10 × 4^glevel + 2. In this study, simulation data with 9 ≤ glevel ≤ 13, corresponding to mesh sizes of 14–0.87 km, were used for the remapping benchmark.
For MPI parallelization, NICAM introduces the concept of a "region," a collection of horizontal grid points, as a unit of calculation per MPI process. The number of regions is measured by the region division level, denoted rlevel (≥ 0) in this study. rlevel = 0 corresponds to the ten diamonds of the icosahedron, so the maximum possible number of MPI processes for the simulation is 10 for rlevel = 0. A larger rlevel is achieved by recursively dividing each diamond into four sub-diamonds, similar to the grid division level, and rlevel is incremented by one for each recursion. The number of regions (i.e., the maximum possible number of MPI processes for the simulation) is 10 × 4^rlevel.
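As a quick check of these formulas, the grid-point and region counts for the resolutions used in this study can be computed directly (a minimal sketch; the mesh sizes are the approximate values quoted in this paper):

```python
# Global grid-point and region counts for the NICAM icosahedral grid:
# 10 * 4**glevel + 2 horizontal grid points and 10 * 4**rlevel regions.
def n_grid_points(glevel: int) -> int:
    return 10 * 4**glevel + 2

def n_regions(rlevel: int) -> int:
    return 10 * 4**rlevel

for glevel, mesh_km in [(9, 14), (10, 7.0), (11, 3.5), (12, 1.7), (13, 0.87)]:
    print(f"glevel={glevel:2d}  ~{mesh_km:>4} km  {n_grid_points(glevel):>11,} grid points")
# glevel = 13 yields 671,088,642 points, i.e., the roughly 670 million
# horizontal grid points of the 0.87-km mesh data used in this study.
```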
The simulation results are output as one icosahedral grid data file per MPI process. Each icosahedral grid data file consists of metadata and actual-data parts, and the actual data for each variable form a four-dimensional array (two horizontal dimensions, the vertical dimension, and the region dimension). The arrays include a halo, the grid points surrounding each region, so that data can be interpolated without MPI communication.
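A schematic of this per-file layout is sketched below; the dimension ordering, array sizes, and halo width are placeholders chosen for illustration, not the values actually used in the NICAM files.

```python
import numpy as np

# One regional icosahedral-grid file stores, for each variable, a 4-D array:
# two horizontal dimensions within a region, the vertical dimension, and the
# region dimension. The horizontal extent includes the halo surrounding the
# region so that interpolation needs no MPI communication.
n_i = n_j = 130   # horizontal points per region side, halo included (illustrative)
n_k = 94          # vertical levels of the three-dimensional variables
n_l = 1           # regions per file (2 for the glevel = 13 data)

var3d = np.empty((n_i, n_j, n_k, n_l), dtype=np.float32)
print(f"{var3d.nbytes / 2**20:.1f} MiB for one 3-D variable in one regional file")
```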

REMAPPING ALGORITHMS
Algorithm-1: Reference
Figure 2 shows Algorithm-1, the reference algorithm used in the remapping tool implemented in NICAM. Let the number of MPI processes for remapping be denoted N_proc, so that each MPI process handles 10 × 4^rlevel / N_proc regions. Also, let the number of regional icosahedral grid data files per MPI process of remapping be denoted N_file. Pseudocode is shown in ALGORITHM 1. In Algorithm-1, the outermost loops, in outer-to-inner order, are over variable and time step. Inside the outermost loops (i.e., for each variable and time step), the icosahedral grid data containing both the horizontal and vertical dimensions are first read in parallel. Then, the latitude-longitude grid data are diagnosed through linear interpolation from the surrounding three icosahedral grid points [2]. A mapping table that contains the indices of the icosahedral grid points and their weights for the averaging is prepared in advance; this table is named "llmap." Each MPI process holds a full global latitude-longitude data array initialized to zero, and only the portion handled by that MPI process is filled with the interpolated values. After interpolation, the latitude-longitude grid data are sent to the process with MPI rank 0 using the MPI_Reduce function and written to file storage as a single NetCDF file for each variable. In summary, only the data reading and interpolation processes are parallelized, and the communication size of the MPI_Reduce operation is unnecessarily large. Algorithm-1 could therefore be improved by parallelizing the file output process. Here, we considered two types of algorithms: one separates the output files by vertical level (Algorithms-2 and -3), and the other uses the MPI-IO library (Algorithm-4).
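A minimal mpi4py sketch of this reference flow is given below. The helper functions, grid sizes, and variable list are placeholders standing in for the corresponding steps of the NICAM tool, not its actual interface; only the loop structure, the global-sized reduction, and the rank-0-only write reflect Algorithm-1.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()

NLAT, NLON, NLEV = 192, 384, 94            # illustrative lat-lon grid and levels
VARIABLES, NSTEP = ["ua", "va", "ta"], 4   # illustrative variables and time steps

def read_ico(var, step):
    """Stand-in for reading this rank's regional icosahedral-grid files."""
    return np.random.rand(NLEV, 1156).astype(np.float32)

def interpolate_to_ll(ico, out):
    """Stand-in for the llmap-driven 3-point interpolation; fills only the
    lat-lon columns owned by this rank and leaves the rest at zero."""
    out[:, :, rank::nproc] = ico.mean()

for var in VARIABLES:
    for step in range(NSTEP):
        ico = read_ico(var, step)                                   # parallel read
        ll_local = np.zeros((NLEV, NLAT, NLON), dtype=np.float32)   # full global array on every rank
        interpolate_to_ll(ico, ll_local)                            # parallel interpolation
        ll_global = np.empty_like(ll_local) if rank == 0 else None
        comm.Reduce(ll_local, ll_global, op=MPI.SUM, root=0)        # global-sized communication
        if rank == 0:
            ll_global.tofile(f"{var}_t{step}.bin")  # stand-in for the serial NetCDF write (the bottleneck)
```

Because every rank reduces a full global array and only rank 0 writes, both the communication volume and the output time scale poorly, which is what Algorithms-2 to -4 address.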

Algorithm-2: Parallelization of output by file separation
In Algorithm-2, all MPI processes receive latitude-longitude grid data in sequence and write the data at the same time, as depicted in Figure 3 and ALGORITHM 2. The outermost loops are over time step, variable, and vertical level in outer-to-inner order. Inside the outermost loops (i.e., for each time step, variable, and vertical level), the icosahedral grid data are first read in parallel, and the latitude-longitude grid data are then diagnosed through linear interpolation from the icosahedral grid data in the same way as in Algorithm-1. After interpolation, the latitude-longitude grid data are sent to the process with MPI rank r using the MPI_Reduce function, where r starts from 0 and is incremented as the outermost loops are cycled. When r reaches N_proc − 1 or the outermost loops reach their end, the global latitude-longitude grid data stored in the MPI processes are written simultaneously to their file storages. The output NetCDF files are separated by vertical level as well as by variable to homogenize the data size and reduce node imbalance in the output process. r is reset to 0 after the output process is completed. This algorithm has advantages over Algorithm-1 in terms of not only parallelizing the file output but also reducing memory usage through the separation by vertical levels. A possible caveat is node imbalance among MPI processes, which occurs if the number of time steps differs among the variables.
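The round-robin reduction and the simultaneous per-level writes can be sketched in the same placeholder style as the Algorithm-1 example (an illustration of the control flow, not the tool's actual code):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()

NLAT, NLON = 192, 384                     # illustrative lat-lon grid
LEVELS = {"ps": 1, "ta": 94, "ua": 94}    # illustrative variables and level counts
NSTEP = 4

def read_ico_level(var, step, k):
    """Stand-in for reading one vertical level of this rank's regional files."""
    return np.random.rand(1156).astype(np.float32)

def interpolate_to_ll(ico, out):
    """Stand-in for llmap-driven interpolation of this rank's portion."""
    out[:, rank::nproc] = ico.mean()

held = []      # (var, step, level, data) buffered on this rank until the round completes
target = 0     # rank that receives the next interpolated level
for step in range(NSTEP):
    for var, nlev in LEVELS.items():
        for k in range(nlev):
            ico = read_ico_level(var, step, k)
            ll_local = np.zeros((NLAT, NLON), dtype=np.float32)
            interpolate_to_ll(ico, ll_local)
            ll_level = np.empty_like(ll_local) if rank == target else None
            comm.Reduce(ll_local, ll_level, op=MPI.SUM, root=target)  # one 2-D level per reduction
            if rank == target:
                held.append((var, step, k, ll_level))
            target += 1
            if target == nproc:                       # every rank now buffers one level:
                for v, s, kk, data in held:           # all ranks write their own files at once
                    data.tofile(f"{v}_k{kk}_t{s}.bin") # stand-in for per-variable, per-level NetCDF output
                held, target = [], 0
for v, s, kk, data in held:                           # flush a partial final round
    data.tofile(f"{v}_k{kk}_t{s}.bin")
```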

Algorithm-3: Reduction in MPI communication size
In Algorithm-1 and Algorithm-2, the use of the MPI_Reduce function results in unnecessary MPI communication, especially when N_proc is large. To transfer only the necessary data among MPI processes while avoiding complex data handling, the order of interpolation and MPI communication in Algorithm-2 is exchanged in Algorithm-3, as depicted in Figure 4 and ALGORITHM 3. Here, the icosahedral grid data for each time step, variable, and vertical level are sent to the process with MPI rank r using the MPI_Gather function just after the icosahedral grid data are read. Similarly to Algorithm-2, r starts from 0 and is incremented as the outermost loops are cycled. When r reaches N_proc − 1 or the outermost loops reach their end, the global icosahedral grid data are linearly interpolated to diagnose the global latitude-longitude grid data. Then, they are written to file storage as a single NetCDF file for each variable and vertical level. r is reset to 0 after remapping and writing. A possible caveat of Algorithm-3 is the larger memory usage per MPI process required to store the global llmap data in each MPI process.
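The exchange of the interpolation and communication steps can be sketched as follows (placeholder helpers again); the point is that only icosahedral-grid data travel through the gather, while interpolation with the globally replicated llmap happens on the receiving rank:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()

NLAT, NLON = 192, 384          # illustrative lat-lon grid
NPTS = 1156                    # illustrative icosahedral points per rank
LEVELS = {"ps": 1, "ta": 94}   # illustrative variables and level counts
NSTEP = 4

def read_ico_level(var, step, k):
    """Stand-in for reading one vertical level of this rank's regional files."""
    return np.random.rand(NPTS).astype(np.float32)

def interpolate_global(ico_global):
    """Stand-in for interpolation with the *global* llmap table, which every rank
    must hold in Algorithm-3 (the main memory cost of this variant)."""
    return np.full((NLAT, NLON), ico_global.mean(), dtype=np.float32)

held = []      # gathered icosahedral levels buffered on this rank
target = 0
for step in range(NSTEP):
    for var, nlev in LEVELS.items():
        for k in range(nlev):
            ico_local = read_ico_level(var, step, k)
            # Only icosahedral-grid data are communicated, which is far smaller than
            # reducing a zero-padded global lat-lon array as in Algorithms-1 and -2.
            ico_global = np.empty((nproc, NPTS), dtype=np.float32) if rank == target else None
            comm.Gather(ico_local, ico_global, root=target)
            if rank == target:
                held.append((var, step, k, ico_global))
            target += 1
            if target == nproc:                       # round complete: interpolate and write everywhere
                for v, s, kk, ico in held:
                    interpolate_global(ico.ravel()).tofile(f"{v}_k{kk}_t{s}.bin")
                held, target = [], 0
for v, s, kk, ico in held:                            # flush a partial final round
    interpolate_global(ico.ravel()).tofile(f"{v}_k{kk}_t{s}.bin")
```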

Algorithm-4: Use of MPI-IO library for file output
In Algorithm-4, the MPI-IO library is used to write a NetCDF file for each variable, including the vertical dimension, as shown in Figure 5 and ALGORITHM 4. The outermost loops are over variable and time step. After linear interpolation, the latitude-longitude grid data are merged and written to file storage using the MPI-IO library to create a single NetCDF file for each variable. This algorithm does not produce many output files and is expected not to cause severe node imbalance even when the number of time steps differs among the variables. A possible caveat is the heavy load on the storage system caused by writing a single file from many MPI processes.
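The collective output pattern of Algorithm-4 can be illustrated with mpi4py's MPI-IO file interface. The sketch below writes each rank's interpolated slab of the global field into one shared binary file at computed offsets; the actual tool produces a NetCDF file per variable through the MPI-IO layer, and the slab decomposition shown here is an assumption for illustration.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()

NLAT, NLON, NLEV = 192, 384, 94     # illustrative global lat-lon grid and levels

# Assume each rank owns a contiguous band of latitudes of the interpolated field.
lat0 = rank * NLAT // nproc
lat1 = (rank + 1) * NLAT // nproc
my_slab = np.full((NLEV, lat1 - lat0, NLON), float(rank), dtype=np.float32)  # stand-in for interpolated data

# Every rank writes its slab into the single shared file at the byte offset that
# corresponds to its position in the global (lev, lat, lon) array.
fh = MPI.File.Open(comm, "ta.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)
itemsize = my_slab.itemsize
for k in range(NLEV):                                   # same loop count on all ranks
    offset = (k * NLAT * NLON + lat0 * NLON) * itemsize
    fh.Write_at_all(offset, my_slab[k])                 # collective MPI-IO write
fh.Close()
```

Because all ranks participate in the shared-file write, no rank sits idle waiting for an output owner, at the cost of concentrating the write load on a single file, which is the caveat noted above.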

DATA AND EXPERIMENTAL DESIGN
A series of icosahedral grid data simulated by NICAM [11,23] were used to evaluate the remapping performance of each algorithm. Table 1 shows the specifications of the icosahedral grid data, and Table 2 shows the experimental designs of the benchmark. The horizontal resolutions of the icosahedral grid data are between 14 km (glevel = 9) and 0.87 km (glevel = 13), and the number of time steps is 48. Each icosahedral grid data file contains 3 two-dimensional variables, 12 three-dimensional variables, and 1 two-dimensional cloud fraction with 49 cloud categories. The number of vertical levels for the three-dimensional variables is 94. Therefore, there are 1,180 global maps per time step on a two-dimensional data basis. Each icosahedral grid data file covers one region, except for glevel = 13, where it covers two regions. The number of horizontal grid points per icosahedral grid file, including the halo, is the same for glevel = 9 and 10.

The remapping tool was run on the supercomputer Fugaku, and the number of threads per MPI process was set to 8. The total number of MPI processes for remapping was determined such that the number of icosahedral grid files per MPI process for the remapping task was 16 or 4 for 9 ≤ glevel ≤ 12. Therefore, the weak scaling performance can be evaluated by comparing, for example, the results with glevel = 9 and 40 MPI processes against those with glevel = 10 and 160 MPI processes, and the strong scaling performance by comparing, for example, the results with glevel = 9 and 40 MPI processes against those with 160 MPI processes. Note that the benchmark results for glevel = 13 are not comparable with those for glevel < 13 because of the different experimental designs used to handle the extremely heavy workloads. Specifically, the number of icosahedral grid files per MPI process for the remapping task was 32, and the number of MPI processes per node was set to 2, not 4, because of the limited available memory per node (32 GB). Furthermore, almost the entire workflow in Algorithm-2 was repeated four times to split the latitude-longitude grid data into 2 × 2 chunks to reduce memory usage. The remapping task for glevel = 13 with Algorithm-3 failed because of memory consumption. For Algorithm-4, the number of slices in latitude and longitude was (4, 10), (8, 20), and (16, 40) for executions with 40, 160, and 640 MPI processes, respectively.

In this study, only the first four time steps were remapped to latitude-longitude grid data to measure the elapsed time and memory usage per node. Specifically, the elapsed real time was obtained by applying the time command to the execution of each remapping tool. Note that the time for the explicit data transfer between the first and second storage layers of the supercomputer Fugaku (similar to staging) was negligible. The maximum memory usage was obtained from the job statistical information. The measurements were repeated five times for each experimental setting (except for glevel = 13 with Algorithm-2, which was repeated only four times because of insufficient computational resources), and only the measurement with the shortest elapsed time was used for analysis. For an intuitive understanding for climate modeling researchers, the elapsed time was converted to the "remapping year per wall-clock day" (hereafter RYPD), assuming that the time interval of the data is 6 h, a standard output interval of typical climate simulations.
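The RYPD conversion used in the next section can be written out explicitly. A small sketch, assuming the four remapped six-hourly time steps described above (i.e., one simulated day per run):

```python
def rypd(elapsed_seconds: float, n_steps: int = 4, interval_hours: float = 6.0) -> float:
    """Remapping year per wall-clock day for a run that remaps n_steps outputs."""
    simulated_years = n_steps * interval_hours / 24.0 / 365.0
    wallclock_days = elapsed_seconds / 86400.0
    return simulated_years / wallclock_days

print(f"{rypd(3.2e3):.3f}")  # ~0.074, consistent with the 0.075 quoted for Algorithm-1 (14-km mesh)
print(f"{rypd(7.9e3):.3f}")  # ~0.030, consistent with Algorithm-4 on the 0.87-km mesh data
```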

BENCHMARK RESULTS
Remapping with 14-km mesh data
Figures 6 and 7 show the elapsed time and memory usage per node for the remapping of the 14–0.87-km mesh icosahedral grid data using different algorithms (colors) and experimental designs (symbols). In Algorithm-1, the elapsed time and the corresponding RYPD for the 14-km mesh icosahedral grid data are 3.2 × 10³ s and 0.075, respectively (gray rectangle in Figure 6). Detailed profiling (Figure 8a) shows significant node imbalance in the output process of the latitude-longitude grid file and a significant mean time of MPI communication. The output process is not parallelized in Algorithm-1, and all MPI processes except for that with MPI rank 0 must wait for the completion of the output process before proceeding to the MPI communication. Therefore, about 95% of the total elapsed time is attributable to the output process in Algorithm-1. The memory usage for remapping using Algorithm-1 reaches 12 GB per node (gray rectangle in Figure 7). This is mainly due to the allocation of two global three-dimensional arrays that reach 2.5 GB per MPI rank (10 GB per node).
Compared with Algorithm-1, Algorithm-2 (green rectangle in Figures 6 and 7) achieves a reduction in the elapsed time of remapping the 14-km mesh icosahedral grid data. The elapsed time and the corresponding RYPD are 4.3 × 10² s and 0.56, respectively; this means that Algorithm-2 achieves a 7.4-times speedup compared with Algorithm-1. As expected from Figure 3, the elapsed time of the file output is now 14% of the total (Figure 8b), which is significantly smaller than that in Algorithm-1 (Figure 8a).
The memory usage of remapping using Algorithm-2 (3.9 GB) is also smaller by a factor of 3.1 than that using Algorithm-1.
Algorithm-3 (blue rectangle in Figures 6 and 7) achieves a further reduction in the elapsed time of remapping the 14-km mesh icosahedral grid data compared with Algorithm-2. The elapsed time and the corresponding RYPD with Algorithm-3 are 3.6 × 10² s and 0.65, indicating speedups of 8.7 and 1.2 times compared with Algorithm-1 and -2, respectively. This speedup is, as expected, a result of the reduced MPI communication (Figure 8c). The remapping tool with Algorithm-3 consumes 4.4 GB of memory per node, which is 2.8 times smaller than that with Algorithm-1 and 1.1 times larger than that with Algorithm-2.
Algorithm-4 (red rectangle in Figures 6 and 7) also achieves a reduction in the elapsed time of remapping with significantly smaller memory usage than Algorithm-1. The elapsed time and the corresponding RYPD for remapping with Algorithm-4 are 3.8 × 10² s and 0.63, respectively. The memory usage of the remapping tool with Algorithm-4 is 2.4 GB per node, which is smaller than that of the other algorithms by a factor of 1.6–5.0.
In summary, the three improved algorithms significantly outperform the original algorithm, Algorithm-1, in terms of elapsed time and memory usage. Hereafter, we compare Algorithms-2, -3, and -4 for practical purposes.

Figure 8: The measurements with the shortest elapsed time for each experimental setting are broken down into the processes of opening, reading, and closing of icosahedral grid data ("open-ico", "read-ico", and "close-ico", respectively); opening, writing, and closing of latitude-longitude grid data ("open-ll", "write-ll", and "close-ll", respectively); opening, reading, and closing of llmap in total ("llmap"); interpolation ("interp."); MPI communication of icosahedral or latitude-longitude grid data ("comm."); and small residuals (not shown). Blue and light blue bars show the average and maximum elapsed time among MPI processes. The upper limit of each graph corresponds to the total elapsed time.

Remapping with 14–7.0-km mesh data
As described in the previous subsection, Algorithm-3 is the best in terms of elapsed time, with 3.6 × 10² s (0.65 RYPD), followed by Algorithm-4 with 3.8 × 10² s (0.63 RYPD) and Algorithm-2 with 4.3 × 10² s (0.56 RYPD). In terms of memory usage, Algorithm-4 exhibits the best performance with 2.4 GB per node, followed by Algorithm-2 with 3.9 GB and Algorithm-3 with 4.4 GB. These benchmark results were measured using the 14-km mesh icosahedral grid data with 40 MPI processes, each of which handles 16 icosahedral grid data files (Tables 1 and 2).
The advantages of each algorithm in terms of elapsed time and memory usage are qualitatively the same when the 14-km mesh icosahedral grid data are remapped with 160 MPI processes, each of which handles 4 icosahedral grid data files, as shown by the diamond symbols in Figures 6 and 7. The dotted lines in Figures 9 and 10 indicate the strong scaling performance of remapping with the 14-km mesh icosahedral grid data. Among the three algorithms, Algorithm-3 exhibits the best strong scaling performance in terms of elapsed time, even though its absolute memory usage is higher than that of the other two algorithms.
The above performance with the 14-km mesh data is qualitatively the same as that with the 7.0-km mesh data. Comparisons between remapping with the 14- and 7.0-km mesh data indicate that Algorithm-3 achieves the best weak scaling performance in terms of elapsed time (lines with rectangles and diamonds in Figure 6), whereas Algorithm-4 achieves almost perfect weak scaling performance in terms of memory usage (lines with rectangles and diamonds in Figure 7). The dashed lines in Figure 9 indicate that the strong scaling performance with the 7.0-km mesh data using Algorithm-3 remains good, whereas it is poorer with Algorithms-2 and -4 than in the 14-km mesh results.
In summary, Algorithm-3 is the most speed-oriented approach, whereas Algorithm-4 is the most memory-saving-oriented approach with good wall-clock time performance, particularly when 7.0-km mesh icosahedral grid data are used for remapping.
However, for the 1.7-km mesh data, Algorithm-3, in its present form, is not effective because of the memory limitation per node. Also, Algorithm-2 consumes about three times more memory per node with the 1.7-km mesh data (22 GB) than with the 3.5-km mesh data (7.4 GB) in the experiments with 16 icosahedral grid data files per MPI process (multiplication symbols in Figure 7). This increase in memory usage in Algorithm-2 primarily comes from the use of three "global" latitude-longitude data arrays per MPI process, which results in 10 GB of memory consumption per node. In Algorithm-3, the increase in memory usage primarily arises from allocating the global llmap data arrays in all the MPI processes (see Sections 4.1 and 4.3) in addition to the allocation of global icosahedral data arrays. At this resolution, Algorithm-4 performs best in terms of both the elapsed time and the memory usage per node on the supercomputer Fugaku. In the case of 16 icosahedral grid data files per MPI process (multiplication symbols in Figures 6-7), the elapsed time and memory usage per node with Algorithm-4 are 2.3 × 10³ s (0.10 RYPD) and 7.3 GB, respectively. Increasing the number of MPI processes with Algorithm-4 did not lead to a speedup of the remapping task (i.e., no scalability in terms of strong scaling).

Remapping with 0.87-km mesh data
It is still challenging to globally remap icosahedral grid data with a mesh size of less than 1 km. Algorithms-2 and -4 successfully achieve the remapping of the global 0.87-km mesh icosahedral grid data with 670 million horizontal grid points. The elapsed time and memory usage per node are 4.5 × 10⁴ s (0.0053 RYPD) and 23 GB with Algorithm-2, and 7.9 × 10³ s (0.030 RYPD) and 14 GB with Algorithm-4, respectively. This means that remapping 10 years of data with Algorithm-4 would take nearly one year (10 years / 0.030 RYPD ≈ 330 wall-clock days), demonstrating the feasibility of handling decadal simulation data with a sub-km mesh in a practical time.

SUMMARY AND DISCUSSION
We developed a series of parallelized, efficient remapping algorithms that convert the icosahedral grid data used in NICAM, a global nonhydrostatic atmospheric model, to latitude-longitude grid data. The reference algorithm parallelizes only the data reading and interpolation processes but not the data writing process, which is a major bottleneck of the remapping tool. In this study, parallelization is achieved by separating the output files by vertical level (Algorithms-2 and -3) and by using the MPI-IO library (Algorithm-4). Furthermore, unnecessary MPI communication is reduced by replacing the MPI_Reduce function with the MPI_Gather function (Algorithm-3) and by using the MPI-IO library (Algorithm-4). We evaluated the impact of the improved algorithms on performance using the supercomputer Fugaku. As expected, they significantly outperform the reference algorithm in terms of elapsed time and memory usage per node.
For the 14–3.5-km mesh data, Algorithm-3 provides the best solution for reducing the elapsed time of remapping. Algorithm-3 not only achieves the shortest elapsed time among the proposed algorithms but also exhibits the best performance in terms of both weak and strong scaling. The remapping year per wall-clock day (RYPD) with the 3.5-km mesh (cloud-resolving scale) data reaches 0.56, which is sufficiently fast to handle 100-year climate simulation data. The caveats of Algorithm-3 are its large memory consumption and a possible decrease in efficiency for inhomogeneous data. The former caveat prevented us from remapping icosahedral data with a mesh finer than 3.5 km, even though memory usage might be reduced by slicing the global data as in Algorithm-4 while collecting and outputting the global data as in Algorithm-3. The latter caveat, which also applies to Algorithm-2, arises if the number of time steps differs among variables. In such a situation, MPI processes that handle variables with a smaller number of time steps remain idle after those variables are output. In addition to the above caveats, the separation of output files may be inconvenient for data users, although it could rather be a desirable characteristic if the number of vertical levels increases significantly in the future.
Algorithm-4 is a well-balanced solution for all of the 14–0.87-km mesh data in terms of elapsed time and memory usage per node, at least on the supercomputer Fugaku. For the smallest mesh size of 0.87 km, the RYPD with Algorithm-4 is 0.030, which is sufficiently fast in practice to handle decadal simulation data. Another advantage of Algorithm-4 is that it outputs one file per variable. In addition, its efficiency is theoretically expected to be maintained even if the number of time steps differs among variables. The caveat of Algorithm-4 is its poor strong scaling performance associated with MPI-IO. In particular, we found that the elapsed time for closing and opening NetCDF files rapidly increases as the number of MPI processes increases (not shown), leading to the poor strong scaling performance of Algorithm-4. This may suggest an advantage of Algorithm-4 in handling large amounts of data with many time steps.
In summary, we have proposed three remapping algorithms with different characteristics: Algorithm-3 as a speed-oriented algorithm with file separation and Algorithm-4 as a memory-saving-oriented algorithm with the MPI-IO library. These performance evaluations could strongly depend on the IO performance of the system, and further improvements in IO performance, including MPI-IO, in parallel with computational performance are desirable to accelerate weather and climate research on future supercomputers.

Figure 2: Diagram of Algorithm-1. ICO and LL represent icosahedral and latitude-longitude grids, respectively. The dots indicate horizontal data, and the arrows indicate data flow. Only one region is shown per MPI process for simplicity. See the body text for more explanation.

Figure 3: Same as Figure 2 but for Algorithm-2. The data flows for the first and second rounds of the outermost loops are highlighted by the blue and orange arrows, respectively. The right side of the wavy line is executed when global latitude-longitude grid data are stored in memory in all MPI processes or when the outermost loops are terminated.

Figure 5: Same as Figures 3-4 but for Algorithm-4. The dashed arrows indicate data flow that may occur if necessary.

Figure 6: Elapsed time of remapping using Algorithm-1 (gray), -2 (green), -3 (blue), and -4 (red). The horizontal axis shows the total number of horizontal grid points over the globe and their mesh size in brackets. The same symbol indicates the same load per MPI process (Tables 1 and 2), and the same symbol with the same color is comparable in the sense of weak scaling, as connected by lines. Only the shortest elapsed time among the five measurements for each experimental setting (except for the green circle, which is the result of four measurements) is shown.

Figure 7: Same as Figure 6 but for the memory usage per node for the measurement with the shortest elapsed time for each experimental setting.

Figure 9: Elapsed time of remapping using Algorithm-2 (green), -3 (blue), and -4 (red). The horizontal axis shows the number of MPI processes for remapping. The diamond, rectangle, and addition symbols indicate Algorithms-2, -3, and -4, respectively. The definition of the symbols is the same as that in Figures 6-7, and the dotted, dashed, and solid lines indicate strong scaling for the remapping of the icosahedral grid data with 2.6, 10, and 42 million grid points across the globe, respectively. Only the shortest elapsed time among the five measurements for each experimental design is shown. The gray lines correspond to perfect strong scaling performance with respect to Algorithm-3.

Figure 10: Same as Figure 9 but for the memory usage per node for the measurement with the shortest elapsed time for each experimental setting.

Table 1: Specifications of the icosahedral grid data (glevel, rlevel)

Table 2: Experimental designs of the benchmark (glevel, rlevel)