Adaptive Workload-Balanced Scheduling Strategy for Global Ocean Data Assimilation on Massive GPUs

Global ocean data assimilation is a crucial technique that estimates the actual oceanic state by combining numerical model outcomes and observation data, and it is widely used in climate research. Due to the imbalanced distribution of observation data in the global ocean, the parallel efficiency of recent methods suffers from workload imbalance. When massive GPUs are applied to global ocean data assimilation, the workload imbalance becomes more severe, resulting in poor scalability. In this work, we propose a novel adaptive workload-balanced scheduling strategy, Bassimilation, which successfully estimates the total workload prior to execution and ensures a balanced workload assignment. Further, we design a parallel dynamic programming approach to accelerate the scheduling decision, and develop a factored dataflow to exploit the parallel potential of GPUs. Evaluation demonstrates that our algorithm outperforms the state-of-the-art method by up to a 9.1× speedup. This work is the first to scale global ocean data assimilation to 4,000 GPUs.


INTRODUCTION
Data assimilation is a prediction correction technique developed in the field of numerical climate prediction, and it has been widely used in fields such as meteorology [13], oceanology [7,15,16,34], and climatology [22,30]. To improve the forecast of a climate model, data assimilation combines observation data and prior model states to estimate the real state of a system as accurately as possible. During a climate simulation, data assimilation provides initial conditions and corrects intermediate model results. Thus, the performance of data assimilation is essential for the overall speed of the numerical simulation. To enable high-fidelity assimilation for realistic large-scale problems, improving the performance of data assimilation has become an urgent demand.
To implement high-performance data assimilation, various methods have been developed [6,12,20,33], among which the variation-based [12,21,33] and the ensemble-based [29] algorithms are most commonly used. Variation-based algorithms obtain results with higher relative accuracy within an assimilation interval or window [6,20] than ensemble-based algorithms. However, variation-based algorithms only provide the mean analysis result, while ensemble-based algorithms not only estimate the posterior state using the flow-dependent covariance but also compute the mean analysis result and the uncertainty of the estimation [35], which is crucial to many applications.
Ensemble-based algorithms usually apply Ensemble Kalman Filter (EnKF) methods to obtain an unbiased optimal estimation for a linear/nonlinear system [4,10,31], and they have recently attracted a lot of attention [8,24,25]. In recent years, many variants of EnKF have been proposed to improve the analysis quality and the computational efficiency [1,9,17,18,29]. Among these works, parallel EnKF methods, e.g., LETKF [9,19] and P-EnKF [17,18], fully exploit the locality of physical properties [14], and achieve both high analysis accuracy and computational efficiency by suppressing spurious correlations and by domain decomposition, respectively. In the parallelization of EnKF, a uniform domain decomposition is usually used to divide the computational region into multiple domains with an identical number of grid points, and the data assimilation over the whole region is split into multiple workloads that can be executed in parallel. Furthermore, multi-stage strategies for parallel EnKF have been proposed [23,26,29,35] to overlap the local computation with the data movement between processors. In related work [29], each workload is split further and processors are given different workloads at different stages, which provides more opportunities to achieve higher overlapping.
Recent parallel EnKF methods have achieved good performance in common cases, in which the workloads are evenly distributed across domains (e.g., data assimilation for regional atmospheric/oceanic models), or the computation time is significantly longer than the data acquisition time (e.g., data assimilation on CPU-based systems [17,18]). However, almost all of them become inefficient for global oceanic data assimilation on GPUs, largely for two reasons. First, the workload assignment across domains is highly imbalanced due to the imbalanced distribution of observation data in the global ocean. Second, compared to CPU-based assimilation, the use of GPUs greatly reduces the computational time, which becomes too short to hide the overhead of data transfer and results in poor scalability.
To address the performance bottleneck of global oceanic data assimilation on GPUs, we review the current data assimilation workflow in detail and find that changing the order of the main phases in the current workflow provides the chance to achieve workload balance. Motivated by this basic idea, we propose a novel parallel strategy called Bassimilation, which supports adaptive workload-balanced scheduling for global ocean data assimilation on massive GPUs. Furthermore, a parallel dynamic programming method is designed to efficiently solve the partitioning problem in our strategy. Meanwhile, a factored dataflow implementation is developed to further avoid unnecessary communication and to overlap data movement with computation.
Bassimilation is a novel adaptive scheduling strategy that performs a flexible workload assignment according to the need for balance at different assimilation stages among different GPUs. Bassimilation redesigns the workflow of data assimilation in order to estimate the workload distribution before workload assignment, which is impossible in the current workflow followed by previous work. Bassimilation features the following aspects that differ from related work. (1) The redesigned workflow disables the parallel execution of the observation counting. To handle this, Bassimilation transforms the observation counting process from grid-point centered to observation centered, which makes it possible to estimate the total workload efficiently and sequentially. (2) Instead of the uniform decomposition employed by the current workflow, Bassimilation establishes an adaptive domain decomposition to ensure spatiotemporal workload balance. (3) Bassimilation decouples the file input/output (I/O) and inter-node communication to exploit the parallelism of data assimilation. Specifically, our contributions can be summarized as follows:
• Propose an adaptive workload-balanced scheduling strategy for global ocean data assimilation on massive GPUs. The strategy achieves the reordering of the main phases in the current workflow, efficient workload estimation, and flexible workload-balanced scheduling in the time and space dimensions (Section 3).
• Design a parallel dynamic programming method for efficiently solving an optimal partitioning problem to establish an adaptive domain decomposition, and develop a factored dataflow implementation for maximal overlapping of three main processes: I/O, inter-node communication, and local computation (Section 4).
• Demonstrate the high performance of the proposed strategy, and scale global ocean data assimilation to 4,000 GPUs for the first time (Section 5).

BACKGROUND AND MOTIVATION
In this section, we provide an overview of parallel data assimilation based on EnKF, and present a detailed exposition of our motivation.

Parallel Data Assimilation
Data assimilation. Data assimilation usually uses EnKF to correct the model-generated error by combining the observation data with an ensemble of model states, called the background ensemble members, and EnKF gives the assimilation result as

$$X^a = X^b + \Delta X^a, \qquad X^a, X^b, \Delta X^a \in \mathbb{R}^{n \times N}, \qquad (1)$$

where $n$ is the number of model components and $N$ is the ensemble size. $X^a$ is the analysis result, $\Delta X^a$ is the analysis increment, and $X^b$ is a given background ensemble taken from the model integration.
Localization. To avoid spurious correlations and realize efficient parallel computation on distributed-memory platforms, localization methods are utilized, which consider the data assimilation only within a local region around a grid point, e.g., a rectangular scope with the dimension of $(2r+1) \times (2r+1)$ grid points for a localization radius $r$. The local analysis at each grid point is independent of the others and can be executed in parallel efficiently, which leads to local EnKF methods such as LETKF [9,19], P-EnKF [17,18], and S-EnKF [29].

Domain decomposition.
Based on the localization, domain decomposition is proposed to achieve parallel data assimilation. In the beginning, the $n_{lat} \times n_{lon}$ latitude-longitude mesh ($n = n_{lat} \times n_{lon}$ grid points) is split into $n_d$ non-overlapping domains along the longitude and latitude directions. Without loss of generality, we assume that there are $d_{lon}$ and $d_{lat}$ domains along the longitude and latitude directions respectively, and that $n_{lon}$ (or $n_{lat}$) is a multiple of $d_{lon}$ (or $d_{lat}$).
In local EnKF, the update at each point depends on all the points within a local scope around it. Hence, each domain needs some additional model grid points for local analysis. We define the expansion $\bar{D}_{i,j}$ of a domain $D_{i,j}$ as the set that contains $D_{i,j}$ and all the additional points needed for the local analysis at $D_{i,j}$, and denote by $X^b[i,j]$ the data of a background ensemble $X^b$ on $\bar{D}_{i,j}$. The data assimilation at a domain $D_{i,j}$ is

$$X^a[i,j] = P_{i,j}\Big(X^b[i,j] + \big(B^{-1}[i,j] + H[i,j]^T R^{-1}[i,j]\, H[i,j]\big)^{-1} H[i,j]^T R^{-1}[i,j]\,\big(Y^o[i,j] - H[i,j]\,X^b[i,j]\big)\Big), \qquad (2)$$

where $X^a[i,j]$ is the analysis result at $D_{i,j}$, $H[i,j] \in \mathbb{R}^{m \times n'}$ is the observational operator on $\bar{D}_{i,j}$, $n'$ is the number of model components on $\bar{D}_{i,j}$, and $m$ is the number of observed data in $\bar{D}_{i,j}$. $Y^o[i,j] \in \mathbb{R}^{m \times N}$ is the subset of perturbed observations. $B^{-1}[i,j] \in \mathbb{R}^{n' \times n'}$ is the local inverse estimation of the background error covariance matrix. $R[i,j] \in \mathbb{R}^{m \times m}$ is the local data-error covariance information. $P_{i,j} \in \mathbb{R}^{n_{i,j} \times n'}$ is a matrix that projects the model components on $\bar{D}_{i,j}$ onto $D_{i,j}$, where $n_{i,j}$ is the number of model components in $D_{i,j}$. In the real implementation, $P_{i,j}$ does not need to be constructed, and we only need to focus on the update of the model components in the domain $D_{i,j}$.
Multi-stage strategy. The state-of-the-art data assimilation systems utilize a multi-stage strategy with uniform domain decomposition [29,35], as shown in Figure 1. The computational region is evenly split into multiple zones, and the workload of each zone is executed by a processor group. Furthermore, each zone is uniformly decomposed into multiple bands along the latitudinal direction, and each band is uniformly decomposed into multiple domains along the longitudinal direction, resulting in a uniform domain decomposition. During the $s$-th stage, one processor in a processor group reads the required data, e.g., the $s$-th band, while the other processors execute local analysis at different domains in parallel. Notably, the multi-stage strategy has two key advantages: (1) the uniform domain decomposition ensures an even distribution of grid points across different domains, and (2) the multi-stage computation enables the data acquisition of the next stage to be efficiently overlapped with the computation of the current stage.
General workflow. Current multi-stage data assimilation implementations can be generally described as a five-phase workflow, as shown in Figure 2. ❶ Data reading: the necessary data is read for each stage, including the required information for data assimilation in a given band. ❷ Domain decomposition: the data is partitioned into local components specific to domains. ❸ Workload assignment: local data and workloads are assigned to processors. ❹ Observation counting: the number of observation data within each grid point's local scope is counted. ❺ Local analysis: the model state is updated at each grid point whose scope includes at least one observation datum. The first three phases are typically handled by a specific processor, while the last two phases are executed by multiple processors.

Motivation
The effectiveness of the multi-stage strategy with uniform domain decomposition depends on two essential conditions: (1) observation data is evenly distributed around each grid point, and (2) the local analysis time is significantly longer than the data acquisition time. These two conditions do not hold for global oceanic data assimilation on GPUs, due to two key facts. Fact 1: workload imbalance. In global oceanic data assimilation, the distribution of observation data is highly irregular across computational domains. This irregularity manifests as a disparity in the quantity of observation data surrounding individual grid points. Consequently, certain grid points lack observation data within their local scopes, meaning that adjustments to their model states are unnecessary according to local EnKF. Given this variability in workloads across grid points, the uniform domain decomposition based on an even division of grid points often results in an imbalanced distribution of workloads across processors. As shown in Figure 3(a), the uniform domain decomposition for global ocean data assimilation engenders substantial disparities in GPU utilization (i.e., busy time), even up to 314×. Thus, it is crucial to find fundamental solutions to the issue of workload imbalance.
Fact 2: overlap invalidation. Compared to CPU-based data assimilation, the utilization of GPUs greatly reduces the local analysis time, while the time for data movement remains unchanged. Our tests reveal that as the number of GPUs increases, the time for data movement begins to dominate the runtime, as illustrated in Figure 3(b). This phenomenon indicates that the pipeline designed for CPU-based systems may yield inadequate scalability on GPU-based systems, which is also demonstrated by the further experiments in Figure 3(b). When the number of GPUs exceeds 2,000, the runtime ceases to diminish. The decreasing local analysis time puts forward higher requirements for a fine-grained pipeline design that overlaps data movement and local analysis, thereby achieving high performance and scalability on large-scale GPU systems.
Basic idea. To address the aforementioned bottlenecks, we propose Bassimilation, an adaptive workload-balanced scheduling strategy for global ocean data assimilation on massive GPUs. The key idea of our design is to realize a workload-balanced domain decomposition by reordering the main phases of the current workflow, in which the domain decomposition (Phase ❷) is executed prematurely, without access to information about the actual workload distribution. More specifically, we strategically advance the execution of Phase ❹ prior to Phase ❷, thereby enabling the acquisition of precise counts of the observation data involved in the update of each grid point before the domain decomposition commences (depicted in Figure 2(b)). Further, we develop a cost model to estimate the computational workload of each local analysis, based on the number of observation data around individual grid points. With the workload estimation results, we propose an adaptive domain decomposition for the dual purpose of ensuring a balanced workload distribution among distinct processors and harmonizing the workload equilibrium between different stages of the assimilation process (Section 3). Moreover, this adaptive decomposition further exploits the available overlap between data movement and computation for high performance and scalability (Section 4).

WORKLOAD-BALANCED SCHEDULING
Based on the reordered workflow in Figure 2(b), we propose Bassimilation, an adaptive workload-balanced scheduling strategy comprising three main steps: observation centered counting, workload estimation, and adaptive workload allocation.

Observation Centered Counting
As discussed above, reordering Phase ❹ prior to Phase ❷ creates an opportunity to achieve workload-balanced scheduling. However, this reconfiguration may result in serious performance degradation, since the parallel execution of observation counting becomes infeasible within the modified workflow, as shown in Figure 2(b). According to our experiments, the parallel observation counting takes about 1.5% of the total execution time; however, about $1.5\% \times n_p$ of the runtime would be spent on it in the revised workflow, which is unbearable ($n_p$ represents the number of processors). Therefore, a low-overhead observation counting approach is required to fully harness the potential advantages of the new workflow.
With an in-depth analysis, we have discovered that the existing observation counting adheres to a grid-point centered traversal for identifying the pertinent observation data surrounding individual grid points. As shown in Figure 4, the conventional observation counting for a given grid point consists of two processes: (1) determining the range of candidate rows in the observation array, which is stored in the compressed sparse row (CSR) format, and (2) executing a binary search along each candidate row to find the relevant observation data within the grid point's scope. Although this approach is straightforward to implement, it exhibits high time complexity since time-consuming searches have to be applied repeatedly. Specifically, if the grid-point centered counting process is executed sequentially, multiple redundant and fruitless searches cannot be avoided, as depicted in Figure 4.
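To make the two search processes concrete, the following minimal sketch counts the observations around a single grid point from a CSR-stored observation array; the variable names (row_ptr, col_idx, radius) are illustrative and do not correspond to our actual interface.

```python
import bisect

def count_obs_grid_centered(gx, gy, row_ptr, col_idx, radius):
    """Count observations whose (row, col) position lies inside the local
    scope of grid point (gx, gy), given a CSR-stored observation array.
    row_ptr[r]..row_ptr[r+1] delimit the sorted column indices of the
    observations on row r. All names are illustrative."""
    count = 0
    # Process 1: determine the candidate row range around the grid point.
    row_lo, row_hi = gy - radius, gy + radius
    for r in range(max(row_lo, 0), min(row_hi + 1, len(row_ptr) - 1)):
        cols = col_idx[row_ptr[r]:row_ptr[r + 1]]
        # Process 2: binary search along each candidate row for the columns
        # that fall inside [gx - radius, gx + radius].
        lo = bisect.bisect_left(cols, gx - radius)
        hi = bisect.bisect_right(cols, gx + radius)
        count += hi - lo
    return count
```

Executed once per grid point, the binary searches are repeated for the overlapping scopes of neighboring points, which is exactly the redundancy illustrated in Figure 4.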
From grid-point centered to observation centered. By shifting our focus from a grid-point centered perspective to an observation centered one, we open up avenues for a novel design of Phase ❹. In particular, instead of navigating through grid points to tally observation data, we traverse the observation data and record which grid points are influenced by each observation. Leveraging the results of the observation centered traversal, we can directly deduce which observation data surrounds each grid point. This new approach is expected to be more efficient since the number of observation data is significantly smaller than the number of grid points.
Coordinate hierarchy. To efficiently record the observation data around grid points, we adopt a hierarchical coordinate representation for each grid point. This hierarchical structure takes the form of a tree, where each node contains one coordinate of the grid point at a specific level (Figure 5(b)). The sequence of levels ($y \to x$) indicates the order of each grid point's coordinates in storage. The root node of a hierarchy tree is at the first level and follows an uncompressed format that stores the $y$ coordinates of grid points. The corresponding $x$ coordinates are stored in the nodes of the second level, which adopts a compressed format. The leaf node at the last level counts the number of observation data impacting this grid point. The union of multiple trees is called a tree set. For each observation datum, a tree set can be used to record all the oceanic grid points within its influence range. Grid points beyond the oceanic realm are exempt from consideration since the assimilation process exclusively centers on oceanic data. Additionally, we define the summation of two tree sets as the operation of putting all the trees of the two sets together and fusing every pair of trees sharing identical root nodes to construct a new tree set. Fusing two trees requires aggregating the results stored in leaf nodes that share the same $y$ and $x$ coordinate nodes, as shown in Figure 5(c).
Using tree sets, we can efficiently obtain the observation data within the scopes of all the grid points in each zone, due to three facts. First, it takes minimal overhead to generate a tree set that encompasses all the oceanic grid points within the influence range of a given observation datum, because the coordinates of each grid point can be calculated directly, without any search process. Second, the summation of two tree sets can be accelerated by using multiple threads on a multi-core CPU to fuse different pairs of trees in parallel, which is fast since the intersection of the influence ranges of two observation data is limited in size. Third, due to the sparse distribution of observation data, the sequential traversal of all the observation data in each zone remains time-efficient.
Observation centered counting. Based on the investigation above, we propose an observation centered counting approach. As shown in Figure 5, an individual processor traverses all the observation data that may impact the data assimilation in a given zone and uses a tree set to record the grid points within the influence range of each observation datum. The details are as follows (a minimal sketch is given after the remark below). Setup: Let $S_k$ be a tree set recording the coordinates of the $k$-th zone's oceanic grid points. Initialize $S_k$ as an empty set, i.e., $S_k = \emptyset$. Construct a map $M$ such that $M(x, y) = 1$ if $(x, y)$ represents an oceanic grid point, and $M(x, y) = -1$ otherwise.
Step 1: Pick all the observation data that may impact the data assimilation of the $k$-th zone from the observation array (Figure 5(a)) and store them in a queue. Set the observation index $i = 1$.
Step 2: Pop the $i$-th observation datum from the queue and generate a tree set $T_i$ that records all the grid points within its influence range. Further, apply the map $M$ to prune each tree by deleting the nodes that do not correspond to oceanic grid points (Figure 5(b)).
Step 3: Update the tree set $S_k$ by summing it with the newly generated $T_i$ (Figure 5(c)). Set $i = i + 1$ and go to Step 2 unless the queue is empty.
For a given oceanic grid point with coordinates $(x, y)$, we denote by $p_{(x,y)}$ the number of observation data within the scope of that grid point. After the observation centered counting, $p_{(x,y)}$ is obtained in the leaf node of the tree whose root node (first level) is $y$ and whose middle node (second level) is $x$.
Remark. Compared with the common grid-point centered method, the observation centered counting approach has two advantages. (1) The three steps above can be executed rapidly, because they rely only on straightforward counting and calculations, obviating any searching process. (2) It is efficient even in sequential execution and avoids redundant and fruitless searches.
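As a concrete illustration, the following minimal sketch implements the three steps with nested Python dictionaries standing in for the hierarchical tree set (outer key: $y$ coordinate at the root level; inner key: $x$ coordinate at the second level; value: leaf count); the function names and the is_ocean mask are illustrative placeholders rather than our actual implementation.

```python
from collections import defaultdict

def obs_tree_set(ox, oy, radius, is_ocean, nlat, nlon):
    """Tree set of one observation at (ox, oy): every oceanic grid point
    inside its influence range receives a leaf with count 1 (Step 2)."""
    tree = defaultdict(dict)                              # {y: {x: count}}
    for y in range(max(oy - radius, 0), min(oy + radius + 1, nlat)):
        for x in range(max(ox - radius, 0), min(ox + radius + 1, nlon)):
            if is_ocean[y][x]:                            # prune land points
                tree[y][x] = 1
    return tree

def fuse(S, T):
    """Sum two tree sets: fuse trees with the same root y and add the counts
    of leaves that share the same (y, x) coordinates (Step 3)."""
    for y, leaves in T.items():
        for x, c in leaves.items():
            S[y][x] = S[y].get(x, 0) + c
    return S

def observation_centered_counting(observations, radius, is_ocean, nlat, nlon):
    """Sequentially traverse the zone's observations (Step 1) and accumulate
    the tree set S, so that S[y][x] ends up holding p_(x,y)."""
    S = defaultdict(dict)
    for (ox, oy) in observations:
        S = fuse(S, obs_tree_set(ox, oy, radius, is_ocean, nlat, nlon))
    return S
```

In the real system the fusion of different pairs of trees is executed by multiple CPU threads, and the leaf values are later replaced by estimated workloads (Section 3.2).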

Workload Estimation
After obtaining the number $p_{(x,y)}$ of observation data around each oceanic grid point $(x, y)$, we estimate the workload of the local analysis that occurs at $(x, y)$ by establishing a cost model.
Figure 6 illustrates how the execution time of the local analysis (referring to Equation (2)) at $(x, y)$, implemented with the CUDA library on an NVIDIA V100 GPU, changes as $p_{(x,y)}$ increases. The execution time of the local analysis can be fitted by a cubic function with $p_{(x,y)}$ as the variable. This is because the matrix inversion in Equation (2) has a complexity of $O((p_{(x,y)})^3)$, while the other parts only involve matrix-vector multiplications, whose computational complexity is trivial compared to the matrix inversion. Based on this observation, the cubic function obtained by regression is selected as the cost model to estimate the workload of the local analysis at each oceanic grid point. Subsequently, we update the value within each leaf node of all the trees in $S_k$ with the estimated workload of the corresponding oceanic grid point. This adjustment yields a workload distribution specific to the $k$-th zone, which is denoted as $\hat{S}_k$.
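The cost model can be reproduced with an ordinary least-squares fit of a cubic polynomial. The sketch below assumes two measured arrays, obs_counts and measured_times (the numbers shown are placeholders, not measurements from this paper), and is only a minimal sketch of the fitting procedure, not our exact implementation.

```python
import numpy as np

# obs_counts[i]: number of observations p around a grid point.
# measured_times[i]: measured local-analysis time for that p (placeholder values).
obs_counts = np.array([16, 32, 64, 128, 256, 512], dtype=float)
measured_times = np.array([0.02, 0.03, 0.08, 0.31, 1.9, 13.6])

# Cubic fit: time(p) ~= a3*p^3 + a2*p^2 + a1*p + a0.
coeffs = np.polyfit(obs_counts, measured_times, deg=3)
cost_model = np.poly1d(coeffs)

def estimate_workload(p):
    """Estimated local-analysis cost for a grid point with p observations."""
    return float(cost_model(p))
```

The fitted polynomial is then evaluated once per leaf of the tree set to replace the raw counts with estimated workloads.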

Adaptive Workload Allocation
Based on the estimated workload distribution, we develop an adaptive workload allocation approach, which consists of two pivotal mechanisms, i.e., an adaptive GPU assignment mechanism and an adaptive domain decomposition mechanism.
Adaptive GPU assignment mechanism. Similar to the current multi-stage strategy, we partition the computational region into uniform zones in Phase ❶ to ensure a balanced distribution of I/O tasks across distinct processors. However, unlike the common workflow that uses a uniform GPU assignment for the local analysis within each zone, we adaptively allocate varying numbers of GPUs on demand to balance the computational workloads between different zones, which maximizes the utilization of all GPUs.
By aggregating the values of the leaf nodes across all trees in $\hat{S}_k$, we obtain the workload $W_k$ of the local analysis within the $k$-th zone. Further, the total workload across all $n_z$ zones is $W = \sum_{k=1}^{n_z} W_k$. Assuming $n_g$ GPUs are available, we calculate the number $g_k$ of GPUs allocated to the $k$-th zone as

$$g_k = \mathrm{round}\left(n_g \cdot \frac{W_k}{W}\right). \qquad (3)$$

That is, the adaptive GPU assignment is determined by the proportion of the total workload attributed to the $k$-th zone.
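A sketch of the proportional GPU assignment expressed by Equation (3); the rounding correction that keeps the allocations summing to $n_g$ is one reasonable way to discretize the proportions, shown here for illustration.

```python
def assign_gpus(zone_workloads, n_gpus):
    """Allocate n_gpus across zones in proportion to their estimated workloads W_k."""
    total = sum(zone_workloads)
    # Ideal proportional share of each zone, with at least one GPU per zone.
    g = [max(1, round(n_gpus * w / total)) for w in zone_workloads]
    # Correct rounding drift so the allocations sum exactly to n_gpus.
    while sum(g) > n_gpus:
        g[g.index(max(g))] -= 1
    while sum(g) < n_gpus:
        deficit = [n_gpus * w / total - gi for w, gi in zip(zone_workloads, g)]
        g[deficit.index(max(deficit))] += 1
    return g

# e.g., assign_gpus([10.0, 3.0, 1.0], 8) -> [5, 2, 1]
```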
Adaptive domain decomposition mechanism. For the local analysis of the $k$-th zone on $g_k$ GPUs, the adaptive domain decomposition mechanism is proposed to achieve workload-balanced scheduling in both the time dimension (across different stages) and the space dimension (across different GPUs).
Workload balance in the time dimension: For workload balance across the $n_s$ stages of data assimilation, we flexibly partition the $k$-th zone along the $y$ (latitude) direction into $n_s$ bands with even workloads. At the $s$-th stage, the computational load of the $s$-th band is distributed to the $g_k$ GPUs that have been assigned to the $k$-th zone. The workload-balanced scheduling for the $n_s$ stages is as follows. First, within the set $\hat{S}_k$, a specific tree denoted as $t_y$ has $y$ as its root node. This tree records the oceanic grid points located at the latitude index $y$ whose data need to be updated in local analysis. By summing the values of all the leaf nodes in the tree $t_y$, we obtain the total workload $w_y$ of all the local analyses occurring at latitude $y$ (Figure 7(a)-(b)).
Next, assume that the $k$-th zone contains latitudes ranging from $y_k$ to $y_{k+1}$, so the number of latitudes within the zone is $l_k = y_{k+1} - y_k + 1$. For the $k$-th zone, we create a one-dimensional array $A_k$ of size $l_k$ whose elements are $w_{y_k}, w_{y_k+1}, \cdots, w_{y_{k+1}}$ (Figure 7(b)). Achieving workload balance across the $n_s$ stages can then be reduced to an optimal partitioning problem of splitting the array $A_k$ into $n_s$ non-overlapping sub-arrays $A^*_1, \ldots, A^*_{n_s}$ such that

$$\min_{A^*_1, \ldots, A^*_{n_s}} \; \max_{1 \le s \le n_s} \left| \Sigma(A^*_s) - \bar{W} \right|, \qquad (4)$$

where $\Sigma(A^*_s)$ represents the summation of all elements in $A^*_s$ and $\bar{W} = W_k / n_s$. Consequently, by solving the optimization problem (4), the elements in $A^*_s$ determine the width of the $s$-th band and give the workload distributed to the $s$-th stage (Figure 7(b)). Section 4.2 discusses how to solve problem (4) efficiently.
Workload balance in the space dimension: For workload balance across the given $g_k$ GPUs, we need to divide the $s$-th band along the $x$ (longitude) direction into $g_k$ domains with even workloads, and assign the computational loads of different domains to individual GPUs. To achieve the adaptive domain decomposition, the workload-balanced scheduling for the $g_k$ GPUs is as follows.
First, assuming that the $s$-th band encompasses latitudes spanning from $y_s$ to $y_{s+1}$, we define $B_s = \{ t_{y_s}, t_{y_s+1}, \cdots, t_{y_{s+1}} \}$, where each tree $t_y \in \hat{S}_k$ has $y$ as its root node. The set $B_s$ includes all the trees that record the grid points involved in the local analysis within the $s$-th band (Figure 7(c)).
Second, we construct a one-dimensional array $A_s$ of size $n_{lon}$, where $n_{lon}$ is the number of grid points along the longitude direction. The $x$-th element of $A_s$ represents the total workload of all the local analyses along the $x$-th longitude within the $s$-th band, and it is calculated by summing all the values of the leaf nodes of the trees in $B_s$ that share the $x$ coordinate, as shown in Figure 7(c).
Finally, we divide the one-dimensional array $A_s$ into $g_k$ non-overlapping sub-arrays with even or similar sums, in the same way as solving problem (4). The solution leads to a workload-balanced domain decomposition (Figure 7(d)).
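Both partitions operate on one-dimensional workload arrays that can be read directly off the tree set. The sketch below, reusing the nested-dictionary representation from the sketch in Section 3.1 (symbol names are illustrative), derives the per-latitude array for the time-dimension split and the per-longitude array for the space-dimension split.

```python
def latitude_workloads(S_hat, y_lo, y_hi):
    """Per-latitude workload w_y for y in [y_lo, y_hi]: sum the leaves of the
    tree rooted at y (the array A_k used in problem (4))."""
    return [sum(S_hat.get(y, {}).values()) for y in range(y_lo, y_hi + 1)]

def longitude_workloads(S_hat, y_lo, y_hi, nlon):
    """Per-longitude workload within the band [y_lo, y_hi]: sum the leaves of
    the band's trees that share the same x coordinate (the array A_s)."""
    A = [0.0] * nlon
    for y in range(y_lo, y_hi + 1):
        for x, w in S_hat.get(y, {}).items():
            A[x] += w
    return A
```

Each array is then split into contiguous sub-arrays with near-equal sums by the partitioning routine of Section 4.2.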
Remark. The efficient workload-balanced scheduling strategy can be established thanks to the hierarchical structure of the trees in $\hat{S}_k$, which enables flexible workload counting along both the $y$ and $x$ directions. The strategy partitions each zone along the two directions to achieve workload balance in both the time and space dimensions.
With the adaptive domain decomposition, neither the correctness nor the accuracy of data assimilation is affected, due to two facts.
(1) In local EnKF, the local analysis/computation at any grid point is independent of the others. (2) The adaptive domain decomposition for data assimilation only involves the workload assignment to GPUs, unlike the decomposition for a partial differential equation (PDE) solver, which requires an iterative refinement of the solutions on subdomains.

IMPLEMENTATION
In this section, we present the efficient implementation of our adaptive workload-balanced scheduling strategy for global ocean data assimilation on massive GPUs.

Challenges
Challenge 1: How to efficiently solve the optimal partitioning problem (4). For solving problem (4), brute-force enumeration, whose time complexity is $O\big((l_k - 1)(l_k - 2) \cdots (l_k - n_s + 1)/(n_s - 1)!\big)$, is only affordable for $n_s \leq 3$; the enumeration becomes quite time-consuming for $n_s > 8$ and $l_k > 400$. A more efficient alternative is dynamic programming, whose time complexity is $O(l_k^2 n_s)$. However, a generic dynamic programming method is not well-suited to parallel execution, because its steps typically follow a sequentially dependent order, making it challenging to leverage multiple threads for acceleration. In order to minimize the execution time of the decision process for adaptive workload-balanced scheduling, we aim to design a parallel dynamic programming method to solve problem (4) efficiently.
Challenge 2: How to effectively overlap the computation and data movement. As discussed in Section 2.2, GPUs are much faster than CPUs for local analysis, which makes it harder to hide data movement behind local analysis and results in poor scalability. Although the multi-stage strategy provides an opportunity to overlap the data acquisition of the next stage with the local analysis of the current stage, it is still hard to avoid computation stalls, because the time for data movement (I/O and communication) dominates the runtime as the number of GPUs increases (Figure 3(b)). Therefore, to achieve high scalability, the adaptive workload-balanced scheduling strategy requires a new dataflow that effectively overlaps data movement with local analysis.

Parallel Dynamic Programming Method
To solve the optimization problem (4) in a time-efficient manner, we propose a parallel dynamic programming (DP) method to achieve a fast decision-making process.
Finally, after obtaining $f(l_k, n_s)$, the optimal sub-arrays can be identified by backtracking recursively through $f(i, j)$ with a time complexity of $O(n_s)$.
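A sequential reference version of the dynamic program is sketched below. It splits a workload array into a given number of contiguous sub-arrays whose sums stay close to the per-part average and records the split points so that backtracking is cheap; the exact cost function of problem (4) and the table layout shown here are simplified for illustration. In the parallel variant, the entries of one column of the table $f$ are independent of each other and are computed by separate threads (Figure 8).

```python
def balanced_partition(A, parts):
    """Split A into `parts` contiguous sub-arrays whose sums deviate as little
    as possible from sum(A)/parts (cf. problem (4)). O(len(A)^2 * parts)."""
    n = len(A)
    prefix = [0.0] * (n + 1)
    for i, a in enumerate(A):
        prefix[i + 1] = prefix[i] + a
    target = prefix[n] / parts

    INF = float("inf")
    # f[i][j]: best achievable maximum deviation when A[0:i] is cut into j parts.
    f = [[INF] * (parts + 1) for _ in range(n + 1)]
    cut = [[0] * (parts + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for j in range(1, parts + 1):
        # The entries of column j are mutually independent, so this i-loop is
        # what the parallel variant distributes over multiple threads.
        for i in range(j, n + 1):
            for t in range(j - 1, i):
                cost = max(f[t][j - 1], abs(prefix[i] - prefix[t] - target))
                if cost < f[i][j]:
                    f[i][j], cut[i][j] = cost, t
    # Backtrack the recorded cuts to recover the sub-array boundaries.
    bounds, i = [], n
    for j in range(parts, 0, -1):
        bounds.append((cut[i][j], i))   # half-open index range [start, end)
        i = cut[i][j]
    return list(reversed(bounds))

# e.g., balanced_partition([5, 1, 1, 5, 4, 4], 3) -> [(0, 2), (2, 4), (4, 6)]
```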

Factored Dataflow Implementation
To optimize the coordination between data movement and local analysis, we propose a new dataflow implementation inspired by the factored operating system [26]. Our basic idea is to separate the I/O and inter-node communication processes, which are coupled in the current workflow (Figure 2(a)), and to offload the inter-node communication tasks from the I/O processor to other processors. This means that we utilize individual processors for different tasks, which provides an opportunity to overlap the three main processes, i.e., I/O, inter-node communication, and local analysis. In contrast, the current workflow can only overlap data transfer (I/O and inter-node communication) with local analysis.
Motivated by the idea above, the processors are divided into three categories: Reader, Extractor, and Analyzer (Figure 9). Readers and Extractors are bound to CPU processors, and Analyzers to GPUs. For file reading, we propose an OST-based I/O scheme that enables each Reader to efficiently load data from disks into its memory. Moreover, to maximize the overlapping of data movement and local analysis, we design a factored dataflow pipeline.
OST-based I/O scheme. In the file storage systems of distributed-memory clusters, such as Lustre [11], each file is decomposed into multiple parts, and each part is stored on an object storage target (OST) in a shared manner. By examining the I/O procedure of data assimilation, we find that all the background ensemble members are commonly stored on about 25% of the OSTs, and most of the OSTs are idle during the I/O process, which yields less than 10% utilization of the maximum I/O bandwidth of the platform.
To fully utilize the concurrency potential of the file system, we design an OST-based I/O scheme. First, for all the OSTs in the platform, we leverage one Reader per OST for the data reading process, so that each Reader is paired with a dedicated OST to avoid I/O request conflicts. Second, the Readers are assigned to the Master nodes respectively, one per node group (Figure 9). Third, each observation data file or background ensemble member is split into strips along the latitude ($y$) direction, one strip per zone of the computational region, and each strip is stored on a dedicated OST. At the $s$-th stage, each Reader reads the $s$-th part of its strip from the paired OST. When the data assimilation finishes, all the final results of the $k$-th zone are gathered by the Reader of the $k$-th node group.
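For illustration, a Reader paired with one OST can locate and read the part of its strip needed at a given stage as in the sketch below; the fixed-size binary layout and the file naming are assumptions for illustration only, not our actual file format.

```python
import numpy as np

def read_stage_part(strip_path, stage, rec_bytes):
    """Reader of zone k: read the `stage`-th part of its strip from the paired
    OST. The strip is assumed to be a flat binary file evenly divided into
    fixed-size parts of rec_bytes bytes each (illustrative layout)."""
    with open(strip_path, "rb") as f:
        f.seek(stage * rec_bytes)                  # jump to the stage's part
        buf = f.read(rec_bytes)
    return np.frombuffer(buf, dtype=np.float32)    # decode into model fields

# e.g. (hypothetical path): data = read_stage_part("/lustre/ens/zone_3.strip", s, rec_bytes)
```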
It is worth mentioning that the final results contain two parts. The first part consists of the data that do not need assimilation, which can be obtained directly from the Extractors. As soon as each Reader accomplishes all its data reading tasks, the first part is written back to the file system, which can be overlapped with the local analysis (Figure 10). The second part consists of the results of local analysis, which are gathered from the Analyzers through inter-node communication (Figure 10). Since each Reader writes the data to its paired OST individually, the OST-based I/O scheme avoids the I/O request conflicts of the conventional approaches.
Factored dataflow pipeline. Based on the OST-based I/O process, we assign all the nodes of the platform to $n_z$ groups, where $n_z$ is the number of OSTs and equals the number of Readers. The $k$-th node group is responsible for the local analysis within the $k$-th zone, and it consists of one Master node and $n_w$ Worker nodes ($n_w \geq 1$). The Master node contains one Reader and several Extractors, while each Worker node includes the same number of Analyzers. Figure 9 presents the proposed implementation. Pre-process: The Reader in the $k$-th group executes the observation centered counting and estimates the workload distribution $\hat{S}_k$ of the $k$-th zone using the fitted cost model. Based on the adaptive GPU assignment mechanism (Section 3.3), we determine the number $g_k$ of GPUs for the $k$-th zone, and set these $g_k$ GPUs as the Analyzers of the $k$-th group. Each Worker node is assigned $g_k / n_w$ Analyzers. The workloads of each Analyzer at different stages are determined by the adaptive domain decomposition mechanism (Section 3.3).
Iteration: At the $s$-th stage, the $j$-th Analyzer $A_{k,j}$ in the $k$-th group executes the local analysis in the domain $D(s, j)$ of the $k$-th zone. Meanwhile, the Extractors in the $k$-th group pick up the necessary data for the local analysis of domain $D(s+1, j)$, which will occur at the $(s+1)$-th stage, and send them to the node containing the $j$-th Analyzer, which involves inter-node communication. At the same time, the Reader of the $k$-th group reads the data for the local analysis of the $(s+2)$-th stage using the OST-based I/O scheme, and scatters them to the Extractors on the same Master node through intra-node communication. The pipeline is shown in Figure 10.
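The stage offsets of the three roles form a simple software pipeline. The sketch below only illustrates the scheduling skeleton with placeholder callables for the OST-based I/O, the inter-node communication, and the GPU local analysis; the real system runs the roles on separate processors rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(n_stages, read_stage, prepare_stage, analyze_stage):
    """One node group's factored dataflow pipeline: at stage s the Analyzers
    compute stage s, the Extractors prepare and send stage s+1, and the Reader
    loads stage s+2 from its paired OST."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Prologue: fill the pipeline before the first local analysis.
        read_stage(0)
        prepare_stage(0)
        read_stage(1)
        for s in range(n_stages):
            futures = [pool.submit(analyze_stage, s)]              # local analysis of stage s
            if s + 1 < n_stages:
                futures.append(pool.submit(prepare_stage, s + 1))  # inter-node comm for s+1
            if s + 2 < n_stages:
                futures.append(pool.submit(read_stage, s + 2))     # OST-based I/O for s+2
            for f in futures:
                f.result()   # per-stage barrier; the three tasks overlap within a stage
```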
Remark. The factored dataflow pipeline successfully overlaps data reading, inter-node communication, and local analysis. Based on the adaptive workload-balanced scheduling strategy, it further avoids unnecessary occupation of inter-node communication bandwidth.

EVALUATION
In this section, we evaluate the performance of Bassimilation on clusters by conducting a series of experiments drawn from global oceanic environment prediction applications.

Computational Environment and Datasets
All the experiments are run on a GPU cluster. Each node is equipped with a 32-core Hygon Dhyana x86 CPU (AMD ZEN architecture), whose cores are divided into four 8-core dies connected by high-speed links. Each 8-core die is connected to an AMD Vega20 GPU through a 16× PCIe 3.0 link. The system software is a custom Linux OS based on CentOS 7.6, with GCC 7.2.1 and ROCm 2.9.
In our tests, we select the ocean temperature and salinity datasets from the Met Office Hadley Centre [3], which produces and maintains a range of datasets of meteorological variables for use in climate monitoring and climate research. Both the temperature and salinity datasets contain 120 background ensemble members ($N = 120$), about 648 GB in total, taken from a long ocean model integration with a 0.1° spatial resolution and 55 vertical levels. In the 2-dimensional latitude-longitude mesh, $n_{lat} \times n_{lon} = 1800 \times 3600$. As the salinity observation data is typically much sparser than the temperature observation data, the global ocean data assimilations on these two datasets entail distinct workload distributions, which allows us to evaluate the sensitivity of Bassimilation in various scenarios.
The baseline is the multi-stage strategy (MSS), which has been demonstrated to be the state-of-the-art method for data assimilation in climate research [29,35]. The comparison of MSS and Bassimilation focuses on workload allocation, data movement, effectiveness of overlapping, sensitivity to datasets, and parallel scalability.

Evaluation of Workload Balance
As shown in Figure 11(a), Bassimilation achieves an average speedup of 4.2× in the local analysis compared with MSS, which comes from the well-balanced workload distribution. Specifically, Figure 12(a) and (b) present the workload allocations of MSS and Bassimilation on 1,000 GPUs, respectively. It is evident that the workload distribution in MSS is extremely skewed under the uniform domain decomposition: some GPUs have to process up to 21× the average workload, while over 50% of the GPUs get trivial tasks. In contrast, Bassimilation detects and evens out the workload distribution across GPUs using the adaptive domain decomposition. With Bassimilation, almost all of the workload is spread over 96% of the GPUs in a well-balanced manner, as demonstrated by Figure 12(b). While some imbalance still exists due to the nature of oceanic data, it has little impact on performance.
Workload distributions. The workload balance of Bassimilation is evaluated in both the time and space dimensions, as shown in Figure 13. The results indicate three facts. First, different numbers of GPUs are assigned to different node groups to allocate the total workload adaptively, as determined by the adaptive GPU assignment mechanism. Second, the workloads of different stages are balanced within a fixed node group, due to the workload-balanced partition along the $y$ direction in the adaptive domain decomposition mechanism. Third, for a given stage and a fixed node group, balanced workloads are assigned to different GPUs, guided by the workload-balanced partition along the $x$ direction in the adaptive domain decomposition mechanism. These facts demonstrate that Bassimilation achieves workload balance in both time and space.
Pre-process cost. Table 1 presents the time complexities of the three main steps of the adaptive workload-balanced scheduling. The overheads of these three steps constitute the main part of the pre-process cost. It is clear that, if we choose $n_{th1} = m_k$ and $n_{th2} = q_k$, the complexities of the observation centered counting and workload estimation processes become $O(1)$ respectively. Furthermore, for the adaptive workload allocation, the proposed parallel DP method uses $l_k - n_s + 1$ and $n_{lon} - g_k + 1$ threads respectively to solve the two optimal partitioning problems for workload balance in the time and space dimensions, which reduces the pre-process time significantly.
Figure 11(b) shows the percentage of the pre-process time over the total runtime. The pre-process overhead accounts for less than 5% of the total execution time, thanks to the flexible observation centered counting (rather than the grid-point centered one), the fitted cost model, and the efficient DP method. This minor overhead is acceptable since it brings a significant performance improvement in return.

Evaluation of Data Movement
Table 2 shows the comparison of I/O and communication costs between MSS and Bassimilation. Based on the factored dataflow implementation, Bassimilation achieves at least a 13.4× speedup in data reading, a 6.8× speedup in inter-node communication, and a 58.2× speedup in data writing. The I/O performance is improved significantly due to the OST-based I/O scheme, which avoids request conflicts and fully exploits the potential of parallel file systems. Moreover, Bassimilation improves the communication efficiency by avoiding unnecessary inter-node communication: over 52% of the data volume is identified and filtered out by the Extractors during the adaptive workload-balanced scheduling.
From Table 2, it is also clear that both the I/O (data reading and writing) and inter-node communication costs of MSS grow as the number of GPUs increases. With Bassimilation, however, the I/O cost remains stable because the number of Readers is fixed at the number of OSTs. The inter-node communication cost of Bassimilation grows slowly with an increasing number of GPUs, due to the avoidance of unnecessary inter-node data movement. These results verify the outstanding performance of the factored dataflow implementation.

Effectiveness of Overlapping
We define the overlapped time as the time for I/O and inter-node communication that overlaps with the time for local analysis. The overlapping ratio is calculated as the percentage of the overlapped time over the total runtime. Figure 14 shows the distribution of the number of GPUs at various overlapping ratios. The results show that the overlapping ratios of over 60% of the GPUs exceed 40%, which is sustained as the number of GPUs increases. Although the time for data movement comes to dominate the total runtime as the number of GPUs increases, the factored dataflow pipeline of Bassimilation fully exploits the overlapping opportunities and achieves excellent overlapping effectiveness.

Sensitivity for Different Workload Cases
To evaluate the portability of Bassimilation across different workload cases, the ocean temperature and salinity datasets are chosen, which provide disparate workload distributions. As shown in Figure 15, Bassimilation achieves significant performance improvement over MSS on both datasets. On the temperature dataset, Bassimilation outperforms MSS by 4.7× to 5.9×. As Figure 15(a) shows, the proposed approach significantly reduces the time for local analysis, data reading, communication, and data writing, with average speedups of 3.6×, 12.8×, 11.6×, and 121.4×, respectively. On the salinity dataset, Bassimilation achieves an average speedup of 5.1× over MSS, as shown in Figure 15(b). Similar performance improvements are observed in all evaluated terms.

Strong Scaling Evaluation
In the scaling tests, we fix the total problem size and increase the number of GPUs. The total runtime should decrease as more GPUs are utilized; however, the computation-to-communication ratio becomes smaller at higher GPU counts, which eventually limits the scalability. In this evaluation, the global ocean data assimilation focuses on the sea surface temperature. The dataset comprises approximately 648 GB. The background ensemble members are generated by an ocean simulation of 10 model years, spanning from 1984 to 1993, sourced from the Met Office Hadley Centre [3]. As shown in Figure 16, the runtime of MSS stops dropping once the number of GPUs exceeds 2,000, whereas Bassimilation maintains a higher scaling efficiency even with 4,000 GPUs. Overall, Bassimilation outperforms MSS by 5.8× to 9.1×, because it makes full use of the computing power of all GPUs through well-balanced workloads, efficient data movement, and effective overlapping. Although there is a slight loss of scalability because data movement gradually takes longer than local analysis, the parallel efficiency of Bassimilation remains remarkable. To the best of our knowledge, Bassimilation is the first to assimilate 648 GB of oceanic data in less than 100 seconds.

RELATED WORK
The recent decade has witnessed the development of data assimilation on GPUs. To the best of our knowledge, the local ensemble transform Kalman filter algorithm was the first to be studied for data assimilation on CUDA GPUs [2]. Although that study focused only on exploiting the parallel potential of a single GPU, a high speedup was achieved. Afterward, GPU-based algorithms were proposed for the matrix inversion in climate data assimilation [24,28]. Furthermore, a parallel algorithm was developed for nonlinear data assimilation of ocean drift trajectories using multiple GPUs [8]. In well-known climate models, i.e., the Weather Research and Forecasting (WRF) model [27] and the Community Earth System Model (CESM) [32], the data assimilation systems have been ported to GPUs by applying GPU-based libraries such as cuBLAS [5] to accelerate matrix operations. In recent years, distributed GPU parallelization has been studied to enable high-resolution, real-time ensemble simulation and data assimilation of flood inundation [25]. The fine-grained analysis [29] of the operations in the assimilation procedure suggested that maximizing the overlap of data movement and local analysis is critical to achieving high scalability on distributed-memory platforms. Related work has developed the multi-stage strategy [29,35], which represents the state-of-the-art implementation for parallel data assimilation. Since it is difficult to estimate the workload before execution, almost all recent work applies a uniform domain decomposition for workload distribution, resulting in workload imbalance. The imbalance becomes more prominent in global ocean data assimilation due to the sparse distribution of observation data, which hinders the performance improvement of GPU parallelization. In this work, we propose a novel adaptive workload-balanced strategy and resolve this performance bottleneck. To the best of our knowledge, the maximum number of GPUs is less than 1,000 in recent reports [8,25,35]. In contrast, this work is the first to discuss the performance of parallel implementations with up to 4,000 GPUs, which demonstrates that the proposed Bassimilation offers strong scalability for large-scale, high-resolution data assimilation.

CONCLUSION
This work presents an in-depth analysis of the performance of global ocean data assimilation on massive GPUs. Based on the basic idea of reordering the main phases of the current workflow, we develop Bassimilation, an adaptive workload-balanced scheduling strategy for global ocean data assimilation. To push the envelope of performance further, we design a parallel dynamic programming method to accelerate the scheduling decision, and propose a factored dataflow implementation for maximal overlapping of data movement and local analysis. The evaluation results confirm the high performance of Bassimilation.

The submitted artifacts include the source archives "Bassimilation.zip" and "Bassimilation_svd.zip" and an attachment file named "attachment.docx". After the hardware resources and software dependencies have been prepared, please use the following commands to install and deploy our source codes.

D.1 Artifact installation
(1) Install the artifacts in a GPU cluster platform.
- Run unzip Bassimilation.zip. There will be three folders named "src_code", "data", and "scripts". All the test scripts are in the "scripts" folder.
(2) Install the artifacts in an NVIDIA GPU platform.
- Run unzip Bassimilation_svd.zip to get the script that tests the execution time of local analysis (Fig. 6).

D.2 Deployment process
(1) Reproduce the experiments in a GPU cluster platform.
Step 1: Prepare the data before running the tests. Since the two original datasets are too large, we only upload one static-ensemble file. Run sh preparedata.sh to make enough copies of the static-ensemble file.
Step 2: Following the experiment list in Section G, please use commands like sh test_busytime.sh to run specific experiments. All shell scripts can be found in the folder named "scripts".
As we use Slurm to submit tasks to the cluster, the running tasks and the login process are asynchronous. The results will be written into record files, whose names are specified by the test shell scripts.
(2) Reproduce the experiments in an NVIDIA GPU platform.
In the NVIDIA GPU platform, run sh test_svd.sh to test the execution time of local analysis as the matrix scale increases. This experiment reproduces the results of Fig. 6 in the article.

OTHER NOTES

E EXPERIMENT REPRODUCTION IN OUR CLUSTER
The software dependencies used in our artifacts have specific versions, and the hardware of the cluster is also difficult to configure. Hence, we advise the reviewers to use our environment to reproduce all the experiments. We provide a way to log in to our cluster. The steps to log in and run the tests are given below.

E.1 How to log in to our cluster
Please use the following command to log in: ssh ncic_test@60.245.128.10 -p 65010. The login password is composed of two parts, i.e., a static part and a dynamic part. Please input the static part first, and then input the dynamic part.
(i) The static part of the login password is "ncic@cas".
(ii) The dynamic part of the login password is generated by a token APP. Please follow the three steps below: Step 1. Download the token APP.
Please use the QR code for "Android APP" or "IOS APP" (provided in the file named "attachment.docx" of the submitted artifacts) to download the token APP. The APP's default language is Chinese, which can be changed to English after activating the token APP.
Step 2. Activate the token APP.
If the token APP is opened for the first time, there will be a camera view and advice to activate the token APP by using "the activation QR code", which is also provided in the file named "attachment.docx" of the submitted artifacts.
Step 3. Use the token APP.
After scanning "the activation QR code" once, the APP generates the dynamic part of the login password per minute.The language can be changed to English by clicking the setting icon and choosing the 7-th option.

E.2 How to run the tests
Please follow the steps below to reproduce the test results.
Step 1. Run: cd ./pyf_folder/licom
Step 2. Run: source loadmoulde.sh
Step 3. Run: cd ./scripts
Step 4. Follow the experiment list in Section G, and use the corresponding commands, such as sh test1.sh, to run specific experiments. After the corresponding tasks are submitted to the cluster, please run the command "squeue" to check the job status. After all the tasks have finished, the results will be written to the corresponding files.

Figure 1: Multi-stage strategy for data assimilation with the uniform domain decomposition.

Figure 2: Current workflow and our basic idea. In the current workflow, the data assimilation in each zone is executed by a processor group, where a single processor is assigned for I/O, and multiple processors are used for local analysis.

Figure 3: Evaluation of the multi-stage strategy on GPUs.

Figure 7: Adaptive domain decomposition. Panel (a) presents the value update in the leaf nodes of all the trees for the workload estimation. Panels (b) and (c) show the flexible partition of a zone for workload balance in the time and space dimensions, respectively. Panel (d) gives the adaptive domain decomposition result for that zone.

Figure 8: Parallel dynamic programming method. At each step, the $l_k - n_s + 1$ elements in the same column are calculated in parallel. Optimal sub-arrays are identified by backtracking.

Figure 10: Factored dataflow pipeline that overlaps I/O, inter-node communication, and local analysis.
Figure 12: Bassimilation for workload balance. Panel (a): workload allocation of MSS; panel (b): workload allocation of Bassimilation. Each pixel represents a single GPU and its color gradation indicates the workload assigned to that GPU. The workload is evaluated based on the total runtime of the GPU during local analysis.
Figure 13: Workload distributions of Bassimilation in the time and space dimensions.
Figure 15: Sensitivity to different workload cases. Panel (a): on the temperature dataset; panel (b): on the salinity dataset.

Figure 4: Observation counting based on the grid-point centered traversal. This approach comprises two processes, i.e., ascertaining the candidate row range in the observation array and applying a binary search along each row to find the relevant observation data. Redundant and fruitless searches occur in the sequential execution.


Table 1: Time complexities of the three main steps in the adaptive workload-balanced scheduling strategy. * $m_k$ is the number of observation data within the $k$-th zone, and $q_k$ is the number of leaf nodes in the tree set $S_k$ (or $\hat{S}_k$). $n_{th1}$ and $n_{th2}$ are the numbers of threads applied by the Reader (a multi-core CPU) in the $k$-th node group for the observation centered counting and the workload estimation, respectively.

Table 2: Comparison of I/O and communication overheads between MSS and Bassimilation.