Rapid simulations of atmospheric data assimilation of hourly-scale phenomena with modern neural networks

Atmospheric data assimilation is essential for numerical weather prediction. Ensemble data assimilation connects multiple instances of an atmospheric model through a Kalman filter-based algorithm, a computationally challenging task on today's systems. In this work, we build a fast, low-cost, and scalable atmospheric data assimilation prototype, DIDA, for the new-generation Sunway supercomputer, including: (1) a framework that enables flexible deployment of components, and manages and optimizes data communication among modules, achieving maximum resource efficiency; (2) an accurate, robust, UNet-based surrogate model for atmospheric dynamic simulation to generate the background ensemble; (3) a batch-LETKF algorithm with high-performance eigenvalue decomposition, which is up to 7.37 times faster than existing numerical libraries while exhibiting almost linear scalability. Experimental evaluations show that our AI-integrated ensemble data assimilation prototype can complete hour-cycle assimilation in minutes, maintain linear scalability, and save an order of magnitude of computing resources compared with the traditional method.

INTRODUCTION
Numerical weather prediction (NWP) plays a vital role in protecting lives and property from extreme weather events. High-resolution, fast-updating forecasting is crucial for accurate predictions, making NWP an important application of high-performance computing. Despite the great success of numerical weather prediction in recent years [3], improving the prediction of extreme weather events and lengthening forecast lead times remain urgent needs.
NWP is essentially an initial value problem in numerical partial differential equations, since future weather is determined by integrating an atmospheric model forward from current observations. Data assimilation is a technique that provides accurate initial values by combining observations with atmospheric dynamics, and is therefore critical to NWP. Traditional methods include variational and ensemble DA. Ensemble DA estimates the flow-dependent background error covariance, which varies as the system evolves, and does not require the linearized or adjoint versions of the forecast model needed by 4D-Var [2]. In addition, [6, 37] demonstrated that ensemble DA can achieve accuracy comparable to 4D-Var. However, due to the ever-increasing variety and number of observations, ensemble DA requires far more computing resources than a single model run to achieve rapid updates at similar speeds. Owing to its "big computation" and "big data" features, data assimilation has become an emerging application for today's supercomputers [20, 21]. For example, the NICAM+LETKF weather prediction system [30, 32], one of the 10 target applications for Japan's current top supercomputer Fugaku, exploits its full computing resources to conduct a thousand-member ensemble data assimilation experiment on global weather simulations at a resolution of 3.5 km. In addition, petabyte-sized data are exchanged in one assimilation cycle [38], and the model run can take over 90% of the time consumed in one assimilation cycle.
Reducing computational costs while maintaining accuracy is a promising research direction in large-size ensemble data assimilation. Recently, artificial intelligence (AI)-enabled solutions have made rapid progress in weather prediction. The FourCastNet model [27] showed the potential to surpass the ECMWF IFS in the accuracy of 48-hour forecasts while being faster by several orders of magnitude. Pangu-Weather [4] realized one-day to one-week forecasts at 0.25° × 0.25° resolution by employing a 3D Earth-Specific Transformer (3DEST) architecture, outperforming cutting-edge numerical weather prediction techniques in accuracy for all variables and across all time scales. GraphCast [15] builds its prediction model on graph neural networks and can make 10-day forecasts on a 0.25° latitude-longitude grid in less than 60 seconds on Cloud TPU v4 hardware.
Inspired by these successes, we seek to create a highly parallel deep neural network model capable of capturing continuous changes in complex atmospheric model behaviors and generating a flow-dependent background ensemble, and thereby to significantly reduce the computational cost of ensemble data assimilation. This is a challenging task with at least two difficulties. The first lies in the specificity of the physical fields we consider: we take all horizontal layers of each physical variable as network input, whereas most previous work selected only some of them. The increased number of horizontal layers makes it harder for a neural network to capture cross-layer features. The second difficulty is due to the specificity of the data assimilation problem. When the neural network is coupled with LETKF: (1) observations introduce additional noise, which the neural network must adapt to; and (2) the continuous simulation process of data assimilation requires stable, continuous predictions.
Another challenge is maintaining high computational performance for an AI-integrated data assimilation application on a modern parallel supercomputer while using resources efficiently. For an application with multiple tasks, previous deployment schemes [1, 7, 25] fall into two categories. In the first [5, 40], multiple components are merged into one executable and deployed on the same set of resources, treating the whole system as a single task. Each component fulfills a sub-function and is called by a main driver in a specific order. Components are bound to the same set of process groups, which limits the achievable level of parallelism and requires additional effort for data exchange when data partition patterns differ between modules. The second category [10, 18] assigns tasks to different resources and removes the binding between components, but leads to poor resource utilization.
In this work, we present our progress in building a low-cost, high-performance ensemble data assimilation prototype, DNN Integrated Data Assimilation (DIDA), on the new-generation Sunway supercomputer. DIDA has three major components: (1) an improved local ensemble transform Kalman filter (LETKF) algorithm, designed for low cost and scalability and built mostly on top of the BLAS and LAPACK libraries; (2) a highly parallel convolutional neural network (UNet)-based model that generates background ensembles of different sizes; (3) a high-throughput data backplane that makes periodic data assimilation easy to support. We conducted experiments and performance evaluations on the Sunway supercomputer. We make the following contributions in this work:

• An accurate, robust, UNet-based surrogate model for background ensemble generation that saves hundreds of times of computing resources under the same calculation time;

• A high-performance parallel LETKF implementation with nearly linear strong scalability for ensemble data assimilation, which (1) uses a blocking strategy to reduce computation and (2) utilizes batch eigenvalue decomposition to significantly increase speed, employing a greedy algorithm with dynamic ordering, RMA communication across CPEs, and the SIMD method on the SW26010-Pro to implement a block Jacobi transformation;

• A framework supporting various ensemble assimilation workflows, with flexible scheduling and coupling ability and a fast data-staging strategy, achieving low cost, a balanced computational load, and high scalability.
Our approach and results contribute to the progress of AI-based surrogate models for PDE-related problems, and reveal a promising future for real-world weather forecasting. Although this work concentrates on optimization for the Sunway architecture, some insights may benefit optimization on other many-core architectures.

THE NEW-GENERATION SUNWAY SUPERCOMPUTER
We introduce the new-generation Sunway supercomputer, a recently built heterogeneous many-core system and the platform on which DIDA is developed and deployed. It is the successor of the Sunway TaihuLight [12], inheriting its design concept with an upgraded processor, network, storage, and system scale.
The system includes tens of thousands of 390-core SW26010-Pro chips connected to a high-performance proprietary network.The storage subsystem adopts a fusion architecture that supports both I/O forwarding and burst buffer architectures.

SW26010-Pro Heterogeneous Many-core Processor
Fig. 1 shows an overview of the SW26010-Pro processor. Each processor is composed of six core groups (CGs), which are connected through a ring network on chip and can execute independently or work together like an SMP node. Similar to its predecessor, the SW26010 [36], each CG of the SW26010-Pro has one management processing element (MPE), 64 computing processing elements (CPEs), and a memory controller (MC), with a peak performance of over 2.5 TFLOPS in double precision and a theoretical memory bandwidth of 51.2 GB/s. During program execution, one thread occupies one CPE, and programmers can launch up to 64 threads per CG to maximize CPE usage. Athread is a low-level interface for setting up threads on the SW26010-Pro. Both the GNU-compliant compiler and the Sunway OpenACC (SWACC) source-to-source compiler can help exploit the computing power of the 390-core SW26010-Pro processor. In addition, various numerical and AI operator libraries, such as xMath2.0 (which includes BLAS and LAPACK) and swDNN (similar to cuDNN), have been refactored for the SW26010-Pro to improve portability and performance. However, given the obvious differences between a commodity architecture and the SW26010-Pro, developing parallel applications on the new-generation Sunway supercomputer remains a challenge, and effective solutions are urgently needed.

Network and Storage of new-generation Sunway Supercomputer
Each computing node, with one SW26010-Pro processor, connects to the supercomputer through one proprietary network interface with a bidirectional bandwidth of 56.3 GB/s. One super-node is composed of 256 computing nodes with a fully connected, non-blocking network. Hundreds of super-nodes form the entire system via a multilayer, oversubscribed fat-tree network. MPI is the dominant way to achieve parallelism between computing nodes; the implementation follows the MPI-3 specification and is tuned for large-scale parallelism.
The back-end storage employs the Lustre parallel filesystem. Between the computing nodes and the Lustre back end is a layer of hundreds of I/O nodes, each playing multiple roles: a Lightweight File System (LWFS) server to the computing nodes, a client to the Lustre back end, and a burst buffer filesystem server equipped with two NVMe SSDs [34]. The I/O forwarding architecture and burst buffer filesystem are both widely adopted by other supercomputers operating at extreme scale. This multilayer network and storage design is conducive to building a system at such scale, but may introduce contention issues and high latency, hindering performance gains in massive data processing applications such as DIDA.

ARCHITECTURE OF DIDA
We present the basic components and data flow of the DIDA system, along with the framework design, which aims for load balance and scalability and supports high-performance communication and data storage. The DIDA system has several major parts: GMCORE (Grid-point Model dynamical CORE) [17], developed by research scientists from the Institute of Atmospheric Physics, Chinese Academy of Sciences, a dry dynamical core of the baroclinic atmosphere with hydrostatic equilibrium; the UNet-based generative model component (details in section 4); the LETKF data assimilation algorithm component (details in section 5); and the communication and data service that serves as the data backplane.
Fig. 2 shows the workflows of the traditional ensemble DA scheme and our DNN-integrated scheme (which can be switched between straightforwardly), both supported by the DIDA system. The two schemes share a common workflow: a simulation component (GMCORE in the traditional scheme, UNet in the DNN-integrated scheme) and the LETKF component form an assimilation cycle that runs in a ceaseless loop.
In the traditional DA scheme, GMCORE is initialized with data read in parallel from a global filesystem. After initialization, the system enters an online working state that periodically transfers data between components and fetches observations. Several GMCORE instances carry out model integration within each assimilation cycle and send the results (as the background) to LETKF, which completes the data assimilation computation using the ensemble background and the current cycle's observations, and delivers the analysis back to the GMCORE component before the next cycle.
With the traditional DA scheme, the GMCORE component consumes the majority of running time and resources, leading to a requirement for abundant computing resources and a severe load imbalance between components, especially in high-resolution and large-member cases. To address this issue, we propose the DNN-integrated scheme, in which the UNet component acts as a surrogate model for GMCORE and, together with the LETKF component, forms the online assimilation cycle. To avoid degrading the inference accuracy of our neural network due to the numerical shock of initialization, the backgrounds of the first few assimilation cycles are still prepared by the traditional GMCORE model. During the online workflow, the UNet component receives the analysis field of the last cycle, performs inference with our UNet-based model to generate the ensemble, and transfers the background fields to the LETKF component for assimilation.

Infrastructure of DIDA
According to the above DIDA workflow, the core components, including simulation (GMCORE or UNet) and LETKF, run in an interleaved order. Most previously developed DA systems, such as the work on Fugaku [38], use file I/O for data exchange between components to achieve easy and independent development and deployment. This is regarded as the offline operation mode of data assimilation, and it may be time-consuming owing to the high latency and limited bandwidth of a supercomputer I/O system. For performance, DIDA uses the online operation mode, in which components exchange data through high-bandwidth communication and remote memory access instead of I/O. This relieves the performance impact of frequent data access and better satisfies the requirements of high efficiency and flexibility.

The DIDA infrastructure takes charge of several tasks. First is the flexible scheduling of components, so that they can be configured to achieve load balancing and optimize the utilization of computing resources. Second is parallel data exchange between core components with various data layouts. Third is efficient observation acquisition for LETKF. The DIDA infrastructure uses an independent component for observation acquisition; to minimize the performance impact, this component can choose an appropriate I/O mode, such as the number of I/O processes and their placement scheme. In addition, overlapping of I/O and computation must be exploited to hide the I/O overhead.

Flexible Scheduling of Components.
Our infrastructure allows users to easily configure a proper scheduling scheme for the DIDA components, whose parallelism can be set independently according to their performance behavior, and which can be arranged to fully or partially share computing resources in one experiment. We provide a unified set of interfaces for the components; each component coupled into the system is treated as a sub-module and scheduled as a complete unit. A manager takes charge of allocating resources and scheduling components according to specified rules: a maximum set of computing resources is allocated to components that share resources, and each component is then arranged and executed according to its parallelism and domain partition.

High-performance Communication.
To ensure efficient data exchange and flexible exchange patterns between components, we build a high-performance communication service based on the Model Coupling Toolkit (MCT) [16] and further improve its most complicated part, initialization.
Originally designed for coupling earth system models, MCT provides a comprehensive set of interfaces for data exchange between components with different parallelism scales. We utilize MCT for inter-component communication, and improve its performance by resolving an issue in the original MCT, which gets stuck during initialization when the receiving processes want to obtain a domain with halos. This functionality is widely used in DIDA to reduce the number of halo updates on the receiving side. We implement an improved segment rearrangement to fix this.

In-memory Data Service.
We require a high-speed data service to store and organize observations by location and time, so as to support efficient parallel access to observations during data assimilation. LETKF must search observations based on the domain of interest, and attributes such as position and time can serve as indices. We choose the widely used in-memory database Redis [8] to address the storage requirements of observations. To provide efficient storage and parallel access of observation data during data assimilation, we employ a hash table to store values with indices related to their attributes, and a sorted set to maintain their order. During a search, clients query the sorted set to obtain the attributes and their hash indices in the desired range, then access the hash table to retrieve the corresponding values. We adopt multiple Redis servers and achieve load balance by distributing observations among them according to their attributes.
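As an illustration, the following minimal sketch (not DIDA's actual schema; the key names "obs:data" and "obs:index" and the time-based index are our assumptions) shows how such a hash-table-plus-sorted-set layout can be driven through redis-py:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def put_obs(obs_id: str, t: int, payload: bytes) -> None:
    # Hash table: observation payload stored under its id.
    r.hset("obs:data", obs_id, payload)
    # Sorted set: ids ordered by an attribute (time here) to support range queries.
    r.zadd("obs:index", {obs_id: t})

def get_obs_window(t0: int, t1: int) -> list:
    # Step 1: query the sorted set for ids whose attribute falls in [t0, t1].
    ids = r.zrangebyscore("obs:index", t0, t1)
    # Step 2: fetch the corresponding payloads from the hash table.
    return [r.hget("obs:data", i) for i in ids]
```

Position attributes can be indexed with additional sorted sets in the same fashion, which is also how observations can be hashed across multiple server instances.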

Data Staging for Observation Acquisition.
We propose a data staging strategy to solve the load imbalance and performance problems caused by the irregular distribution of observation points. Compared to the background data field, the observation data are small, usually below 1 GB in our experiments. Optimum I/O performance is reached when the I/O parallelism matches that of the I/O nodes, whose number may be much smaller than the parallelism of component computation. If all the computational nodes requiring observations performed I/O, performance could degrade because of the relatively small data size and the contention caused by oversubscribed parallelism. Therefore, we independently set the number of I/O processes in the observation acquisition component, which is responsible for fetching observations and distributing them to all LETKF computation processes. This data-staging strategy directly accelerates observation I/O access and increases resource efficiency; it decreases the magnitude of I/O load imbalance through relatively coarse-grained access, and reduces synchronization time in communication.
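One possible shape of such a staging scheme, sketched with mpi4py (the staging-rank count, the subgroup construction, and the read_obs_shard helper are all hypothetical, not DIDA's implementation):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
N_IO = 8                                # illustrative number of staging ranks

if rank < N_IO:
    obs = read_obs_shard(rank, N_IO)    # hypothetical helper: one rank reads one shard
else:
    obs = None

# Each staging rank becomes root of a subgroup of compute ranks and broadcasts
# its shard over the network, so compute ranks never touch the filesystem.
sub = comm.Split(color=rank % N_IO, key=rank)
obs = sub.bcast(obs, root=0)
```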
The DIDA infrastructure is implemented with MPI and Fortran, which can be ported to the new Sunway architecture. Moreover, the in-memory data service Redis has satisfactory portability. As such, we can easily port and deploy the service to the new-generation Sunway supercomputer.

UNET-BASED GENERATIVE MODEL FOR ENSEMBLE MEMBERS
In traditional ensemble data assimilation, ensemble members are generated by integrating an atmospheric model. As the ensemble size increases, the total cost may become prohibitive due to the high computational cost of model integration, which hinders the development of such algorithms. In recent years, AI-based generative algorithms have achieved significant success, inspiring us to employ related AI technologies. Our goal is to develop AI models that surrogate large-scale atmospheric model instances and reduce the computational cost of integration while maintaining accuracy.

Overview of Challenges and Solutions
Ensemble generation in DIDA refers to generating a set of background fields at the next moment from a set of analysis fields at the current moment. Instead of model integration, we train a set of neural networks for this purpose. As mentioned before, we face two challenges in building such networks. The input of our generative model is one analysis field of an assimilation cycle, and the output is the corresponding background field of the next cycle. In GMCORE, an analysis or background field consists of four variables: zonal wind (u), meridional wind (v), potential temperature (θ), and surface hydrostatic pressure (p_h). Among them, p_h is a 2D variable with shape (1, I, J); u, v, and θ are 3D variables with shape (K, I, J), where K is the number of vertical layers, and I and J correspond to the size of the horizontal mesh of the model. We stack the four variables along the first dimension, and store the background field in a 3D matrix of size (3K + 1, I, J). Since K can be as large as 32, the total layer dimension can be up to 97, which is much larger than in traditional computer vision, where inputs typically have only three channels. To better extract cross-layer characteristics and capture large-scale features of the input, our kernel model is based on a deep UNet-like structure, which has exhibited strong fitting ability (see section 4.3).
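Concretely, the stacking described above amounts to the following (a sketch with hypothetical grid sizes; u, v, pt, ph stand for u, v, θ, and p_h):

```python
import numpy as np

K, I, J = 32, 180, 360                               # hypothetical grid dimensions
u, v, pt = (np.zeros((K, I, J)) for _ in range(3))   # 3D variables
ph = np.zeros((1, I, J))                             # 2D surface variable

field = np.concatenate([u, v, pt, ph], axis=0)       # stacked along the layer axis
assert field.shape == (3 * K + 1, I, J)              # 97 layers when K = 32
```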
When performing data assimilation, our model must be coupled with the LETKF assimilation algorithm and perform iterative predictions; to keep these predictions stable, we take advantage of online assimilation data to construct the dataset for the neural networks (see section 4.4).
Moreover, to increase the parallelism of the model, we deploy 3K + 1 neural networks to surrogate the integration of the traditional model. Each neural network predicts one variable at one horizontal layer, and by sequentially concatenating all the predicted fields, we obtain the complete background field at the next time step.
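In sketch form (PyTorch-style; `nets` is assumed to be a list of the 3K + 1 trained per-layer models, ordered to match the stacking):

```python
import torch

def generate_background(nets, x):
    # x: padded analysis field of shape (1, 3K+1, I', J').
    # Each network emits one (variable, level) slice of shape (1, 1, I', J').
    with torch.no_grad():
        slices = [net(x) for net in nets]
    return torch.cat(slices, dim=1)      # full field, (1, 3K+1, I', J')
```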

Overall Network Framework
Our network framework has four steps, as shown in Fig. 3. First is padding. Since I and J are generally not divisible by 16 or 32, they must be separately padded to such multiples before being fed into the neural network. In the longitude direction, since the longitude forms a ring, we use the pixels on the opposite side for padding. In the latitude direction, we use the nearest pixels to pad the polar regions. Padding is essential, because the deep ResUNet [41] structure, which we apply later, requires the size of the input to be divisible by a power of 2. We pad instead of reshaping because experiments have shown that reshaping can lead to discontinuities in the generated field, especially for long-term predictions. We denote the shape after padding as (3K + 1, 16s, 32t), for integers s and t.
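A NumPy sketch of this padding step (we assume latitude is padded to a multiple of 16 and longitude to a multiple of 32, with padding appended on one side; the paper's exact arrangement may differ):

```python
import numpy as np

def pad_field(x: np.ndarray) -> np.ndarray:
    # x: stacked field of shape (3K+1, I, J), J being the longitude ring.
    _, I, J = x.shape
    pad_i = (-I) % 16                    # rows needed to reach a multiple of 16
    pad_j = (-J) % 32                    # columns needed to reach a multiple of 32
    x = np.pad(x, ((0, 0), (0, 0), (0, pad_j)), mode="wrap")  # opposite-side pixels
    x = np.pad(x, ((0, 0), (0, pad_i), (0, 0)), mode="edge")  # nearest pixels at poles
    return x
```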
We employ a kernel UNet-like neural network for prediction. As mentioned above, one neural network predicts only one variable at one horizontal layer, so the output shape is (1, 16s, 32t). Moreover, since the integration step is small, the output field is usually similar to the input. For more accurate predictions, instead of forecasting the field directly, we require the neural network to learn the incremental change from input to output. That is, after we get the output of the neural network, we add the corresponding horizontal layer of the input to the network output, so as to obtain the background field of the next time step.
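The residual formulation can be sketched as follows (a hypothetical helper; `layer` selects the (variable, level) slice the network is trained for):

```python
import torch

def predict_layer(net, x_padded, layer):
    # The network outputs only the increment; adding back the matching
    # input layer yields the background field of the next time step.
    delta = net(x_padded)                          # shape (1, 1, I', J')
    return x_padded[:, layer:layer + 1] + delta
```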
In the last step, we slice the background field from shape (1, 16s, 32t) back to (1, I, J), so as to maintain consistency with the input.

UNet-like Structure
Our network structure design is inspired by UNet [29]. Specifically, the deep ResUNet [41], a well-known variant of the traditional UNet, is adopted as the fundamental structure, and we modify it to make it more suitable for our problem setting, as shown in Fig. 4. We adopt the deep ResUNet structure for two reasons. First, the hierarchical structure of UNet allows the neural network to capture both large- and small-scale information of input physical fields. Additionally, copying encoded features to the corresponding decoder layer (gray arrows in Fig. 4) creates an information propagation path that allows signals to propagate more easily between the encoder and decoder, which facilitates backpropagation during training. Second, the residual structure (blue and red arrows in Fig. 4) helps alleviate the vanishing gradient problem [13], thus allowing the design of UNet models with deeper layers.
We also modify the deep ResUNet structure. For example, when mapping features from lower to higher levels (orange arrows in Fig. 4), instead of using the transposed convolution operator, we use a combination of the upsample and convolution operators. This modification helps avoid the checkerboard effect [26]. Moreover, in a classic deep ResUNet structure, the activation function is a rectified linear unit (ReLU) [22], but recent advances in deep learning suggest that the sigmoid linear unit (SiLU) [11] can be a better choice, owing to its self-stabilizing property. Therefore, we adopt SiLU as the activation function in our network structure. It is also worth noting that, while the original deep ResUNet is a four-layer structure, we add a layer to further enhance its fitting ability.
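A PyTorch sketch of the modified decoder step (the channel widths and the nearest-neighbor upsampling mode are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Upsample + convolution in place of transposed convolution, with SiLU."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),  # avoids checkerboard effect
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        )
        self.body = nn.Sequential(                        # residual refinement path
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)
        h = torch.cat([x, skip], dim=1)   # encoder features copied to the decoder
        return x + self.body(h)           # residual connection
```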

Dataset Construction
Considering that observations introduce noise into the background field during assimilation, directly using a GMCORE integration sequence for training could result in neural networks that fail to cope with the noise in the analysis field. To accustom the networks to noisy analysis inputs, we construct dataset samples by running simulations of GMCORE ensemble data assimilation: the analysis fields are selected as the dataset inputs, and the outputs obtained by the GMCORE model, i.e., the background fields at the next time step, are selected as the corresponding labels. This ensures that the constructed dataset is consistent with the application scenario of the model, which helps to improve the stability of long-term prediction when the neural network is integrated into the ensemble data assimilation system.
We set the assimilation length to 3 hours and train the neural networks to learn the 3-hour integration. To construct a dataset that meets the requirements, we perform a 120-hour simulation of data assimilation for 128 members starting from June 7, 2019. Since the assimilation interval is 3 hours, this series of simulations consists of 40 time steps; each member thus generates 40 samples, and by simulating 128 members we obtain a total of 5120 samples. Taking a subset of these samples yields a smaller dataset; in our case, the dataset size is set to 1280. To alleviate overfitting, we randomly divide each dataset into training and validation sets in an 80:20 ratio. We train the neural networks on the training set, choose the network with the smallest validation loss as the best model, and deploy it for the online data assimilation test.
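The subset selection and split can be sketched as (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)                        # illustrative seed
subset = rng.choice(5120, size=1280, replace=False)   # 1280 of the 5120 samples
train_idx, val_idx = subset[:1024], subset[1024:]     # random 80:20 division
```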
As mentioned in section 4.1, the physical field in GMCORE consists of four variables. It is common to perform variable-wise normalization of physical fields before feeding them into a neural network. However, this may not lead to satisfactory forecast accuracy in our case, because even for one variable, the data ranges of different horizontal layers can be very different, especially for θ. We deal with this problem through layer-wise normalization: we calculate the mean and variance of one variable at one horizontal layer, and use them to normalize the corresponding layer. A total of 3K + 1 layers are normalized.
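In sketch form, layer-wise normalization computes one mean and one standard deviation per stacked layer:

```python
import numpy as np

def layerwise_stats(samples: np.ndarray):
    # samples: (N, 3K+1, I, J) stack of training fields.
    mean = samples.mean(axis=(0, 2, 3))   # one statistic per layer
    std = samples.std(axis=(0, 2, 3))
    return mean, std

def normalize(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # x: a single field (3K+1, I, J); broadcast the per-layer statistics.
    return (x - mean[:, None, None]) / (std[:, None, None] + 1e-8)
```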

Network Training
The root mean square error (RMSE) of the normalized field is used as the loss for network training. Following the standard deep learning paradigm, we use the ADAM optimizer [14] for parameter optimization. The learning rate is set to 0.0003, smaller than the classical learning rate of 0.001, which would likely cause instability during training on our dataset. The batch size is set to 16, and the number of training epochs is set to 200.
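A minimal training-loop sketch consistent with the stated hyperparameters (the model and data loader are assumed given):

```python
import torch

def rmse_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.sqrt(torch.mean((pred - target) ** 2))

def train(model, loader, epochs=200, lr=3e-4, device="cpu"):
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # lr = 0.0003
    model.to(device).train()
    for _ in range(epochs):
        for x, y in loader:         # x: analysis layer stack, y: next background
            x, y = x.to(device), y.to(device)
            loss = rmse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
```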

LOW COST AND SCALABLE LETKF
We choose the local ensemble transform Kalman filter (LETKF) [32] for data assimilation, owing to its computational efficiency and scalability. LETKF operates in the low-dimensional ensemble space instead of the high-dimensional model or observation space, which greatly reduces computational complexity. Furthermore, the parallel implementation of LETKF demonstrates great potential for scalability by leveraging the inherent independence of calculations at different grid points.

Workflow of LETKF
Calculation and observation selection at different grid points, which have multiple vertical levels at one horizontal location, can be repetitive in LETKF. To achieve high performance, we introduce a vertical blocking strategy: the weights of the ensemble members used to calculate the analysis states are computed using only the horizontal distance between grid points and observation locations, and are shared vertically among grid points.
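The idea can be sketched as follows (the haversine distance and a Gaussian taper are our illustrative choices; DIDA's localization function may differ):

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, R=6371.0):
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2.0 * R * np.arcsin(np.sqrt(a))

def column_weights(col_lon, col_lat, obs_lon, obs_lat, loc_km):
    # Horizontal-only localization weights: computed once per column,
    # then shared by every vertical level at this horizontal location.
    d = haversine_km(col_lon, col_lat, obs_lon, obs_lat)
    return np.exp(-0.5 * (d / loc_km) ** 2)
```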
Fig. 5 illustrates the parallel implementation of LETKF with vertical blocking. After observation selection, processes compute the probabilistically optimal weights of the ensemble members in the LETKF core and calculate the analysis states of the related grid points.
The LETKF core mainly comprises several matrix-vector and matrix-matrix multiplications and one eigenvalue decomposition of a symmetric matrix (to obtain the estimate of the ensemble covariance). We use xMath2.0 on the SW26010-Pro, an extended math library supporting BLAS and LAPACK, for high-performance matrix-vector and matrix-matrix multiplications. However, its eigenvalue decomposition performance is unsatisfactory, usually taking over half the runtime of LETKF. To better utilize the computational resources of the SW26010-Pro, we implement an eigenvalue decomposition algorithm that simultaneously decomposes the matrices of a batch of grid points.
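For reference, a compact NumPy sketch of the textbook LETKF core (not DIDA's optimized SW26010-Pro code) shows where the symmetric eigenvalue decomposition enters:

```python
import numpy as np

def letkf_core(Xb, Hxb, yo, r_var, rho=1.0):
    # Xb: (n, k) background ensemble; Hxb: (m, k) ensemble in observation space;
    # yo: (m,) observations; r_var: (m,) observation-error variances; rho: inflation.
    k = Xb.shape[1]
    xb, hx = Xb.mean(axis=1), Hxb.mean(axis=1)
    Xp, Yp = Xb - xb[:, None], Hxb - hx[:, None]        # perturbations
    C = Yp.T / r_var                                    # Y^T R^{-1} (diagonal R)
    A = (k - 1) / rho * np.eye(k) + C @ Yp              # symmetric k x k matrix
    lam, Q = np.linalg.eigh(A)                          # the eigenvalue decomposition
    Pa = Q @ np.diag(1.0 / lam) @ Q.T                   # analysis covariance (ensemble space)
    Wa = Q @ np.diag(np.sqrt((k - 1) / lam)) @ Q.T      # symmetric square-root weights
    wa = Pa @ (C @ (yo - hx))                           # mean-update weights
    return xb[:, None] + Xp @ (Wa + wa[:, None])        # analysis ensemble, (n, k)
```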

Batch-evd on SW26010-pro
Our batch eigenvalue decomposition algorithm (batch-evd) on the SW26010-Pro is based on the Jacobi method, which has natural parallelism. By performing a series of Jacobi transformations, each acting on only two rows and two columns, the off-diagonal elements are gradually eliminated, and the resulting diagonal elements are the eigenvalues. Since the matrices to be decomposed in LETKF are all symmetric, each Jacobi transformation eliminates two symmetric off-diagonal elements.
To reduce the time to solution, we use the block Jacobi method [33] instead of the original element-based method, performing parallel block Jacobi transformations to eliminate the off-diagonal blocks. In our implementation, the size of the CPE set that works together to solve the eigenvalue problem of one matrix is determined by the matrix scale and the limit of LDM volume; batch-evd supports batch sizes of 1, 4, 16, and 64. As shown in Fig. 6(a), when the matrix scale n = 128, we support 16 batches, and one CPE set comprises 4 CPEs. Optimizations for the parallel block Jacobi transformation on the SW26010-Pro are as follows.
(1) We use a greedy algorithm with dynamic ordering for better convergence. Let c denote the number of CPEs in a CPE set. Since each CPE is responsible for eliminating two symmetric matrix blocks in parallel, we dynamically select 2c off-diagonal matrix blocks to eliminate in each block Jacobi transformation, namely those with the largest sums of squares of all elements in the block, for faster convergence. During this dynamic ordering, we ensure there are no conflicting rows or columns among the selected blocks, so we can reorganize them and evenly distribute them, with their corresponding diagonal blocks, to the c CPEs. As shown in Fig. 6(b), the 128 × 128 matrix is divided into 8 × 8 blocks and distributed over the LDM of four CPEs; the block Jacobi transformations of the eight largest blocks, A01 and A10, A23 and A32, A45 and A54, and A67 and A76, can be conducted in parallel by the four CPEs.

(2) We utilize the characteristics of the Sunway architecture to vectorize the block Jacobi transformation. As shown in Fig. 6(c), the current CPE reads the diagonal vectors of A66 and A77, performs eight independent Jacobi transformations in parallel to eliminate the eight diagonal elements of A67 and A76, and records the corresponding Jacobi transformation matrix. It then cyclically shifts the diagonal vector of A66 by one stride, eliminates the next eight elements (in purple) of A67 and A76, and gradually updates the Jacobi transformation matrix. Finally, A67 and A76 are eliminated with vectorized computations.

(3) We use RMA communication across CPEs to reorganize matrix blocks and broadcast the Jacobi transformation matrices. After each block Jacobi transformation, each CPE broadcasts its Jacobi transformation matrix to the other CPEs in its set so that they can update their local blocks.
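To make the underlying idea concrete, here is a serial, element-level NumPy sketch of a Jacobi eigensolver with greedy ordering (the real batch-evd works block-wise, in parallel across CPEs, with SIMD vectorization and RMA):

```python
import numpy as np

def jacobi_eigh(A: np.ndarray, tol: float = 1e-12):
    """Greedy Jacobi: repeatedly rotate away the largest off-diagonal entry."""
    A = A.copy()
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(50 * n * n):                              # generous iteration cap
        off = np.abs(A - np.diag(np.diag(A)))
        p, q = np.unravel_index(np.argmax(off), off.shape)   # greedy ordering
        if off[p, q] < tol:
            break
        # Rotation angle that zeroes A[p, q] (and A[q, p], by symmetry).
        theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
        c, s = np.cos(theta), np.sin(theta)
        J = np.eye(n)
        J[p, p] = J[q, q] = c
        J[p, q], J[q, p] = s, -s
        A = J.T @ A @ J                                      # one Jacobi transformation
        V = V @ J
    return np.diag(A), V                                     # eigenvalues, eigenvectors
```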

Design of Experiments
To validate the effectiveness of the proposed UNet-based generative model and the DIDA system, we conducted two kinds of experiments on the new-generation Sunway supercomputer: (1) twin experiments to evaluate the capability of the UNet-based generative model to construct background ensembles, by comparing its accuracy with that of the conventional ensemble data assimilation algorithm; and (2) experiments to measure the performance of the DIDA system with the above data assimilation experiments under various degrees of parallelism.
A twin experiment framework was built to simulate real-world data assimilation scenarios. The nature run, which started on June 1, 2019, was produced with the Model for Prediction Across Scales (MPAS-A), whose initial conditions were provided by the final operational global analysis data from the National Centers for Environmental Prediction (NCEP-FNL, ds083.2 [28]). Note that the physical processes were turned off and the moisture variable was set to zero in the nature run. The initial ensemble conditions were chosen from the results of the nature run from June 12-19, 2019, at equal time intervals, covering the states five days before and after the data assimilation start time (June 7, 2019), so that the weather process could be captured precisely. Conventional ensemble data assimilation was used as the baseline to validate the results of UNet. In the experiments, the model was initialized with the results of the nature run on June 2, 2019. In each assimilation cycle, our model generated the background fields from the analysis fields of the last cycle. The observations were assimilated every 3 hours from June 7-12, 2019; they were sampled and interpolated from the nature run on the horizontal GMCORE grids and on every fourth vertical MPAS-A layer. The observed variables included zonal wind (u), meridional wind (v), and potential temperature (θ), whose values were obtained by adding zero-mean Gaussian noise with a prescribed standard deviation to the nature run results from June 1-12, 2019.
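The observation generation step of the twin experiment can be sketched as follows (the per-variable sigma values are not reproduced here):

```python
import numpy as np

def synth_obs(nature, obs_index, sigma, rng=np.random.default_rng(0)):
    # nature: a nature-run field; obs_index: sampled/interpolated observation points.
    truth = nature[obs_index]
    return truth + rng.normal(0.0, sigma, size=truth.shape)  # zero-mean Gaussian noise
```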

Scientific Results of Ensemble Data Assimilation with UNet-based Generative Model
To test our UNet structure, we conducted two types of experiments: so-called "offline DA mode" experiments and "online DA mode" experiments. In the "offline DA mode" experiments, the UNet was not yet coupled to the data assimilation system, and we only tested its accuracy on the validation dataset.
In the "online DA mode" experiments, we integrated the UNet into DIDA to test its online accuracy.The main metric we considered was the root-mean-square-error (RMSE) of each physical variable between the generated field and the ground truth, which is defined as where x,, and x ,, correspond to the generated field and ground truth field, respectively; the subscript  indexes the layer number of certain physical variable,  and  index respective the longitude and latitude location of a field, and summation is performed on all horizontal layers, longitudes and latitudes.All performance tests were conducted on the new generation Sunway supercomputer, with each process mapped to a core group.We first trained our neural networks on the training set and tested the accuracy in "offline DA mode".Fig. 7 shows the RMSE results of our trained model on the validation dataset, where "UNet" denotes to the RMSE between the ground truth and the constructed field by our model, and "Input" denotes to the RMSE between the ground truth and the input.It can be found that our network successfully reduces the RMSE of the input by about an order of magnitude.We then conducted experiments on the "online DA mode", where the generated field is defined as the ensemble mean of all the analysis fields.Specifically, our trained neural networks were tested under four different experimental settings, with ensembles size of 16, 32, 64, and 128.Fig. 8 shows the results of the analysis fields of the last step of the experiments with 128 members.The patterns are from the near-surface model level.The patterns of our proposed model (UNetDA) are similar to those of the traditional method (TradDA) on both potential temperature and wind.Further, the change in RMSE of UNetDA is very close to that of TradDA, as shown in Fig. 9.In the first seven steps, the RMSE of our model is smaller than that of the traditional method; after seven steps, it becomes larger than that of the traditional method, but the rate of growth is very low, and the RMSE difference between the two models is always less than 0.2.These results suggest that our model may be capable of realizing a relatively stable long-term forecast when coupled with the LETKF algorithm.
We designed two further experiments to test the generalization ability of our neural network. In the first, we removed one of the variables u, v, and θ from the observed variables during assimilation. When u or v was missing, our trained model was directly integrated into the assimilation system without fine-tuning; when θ was missing, we fine-tuned our model on the newly generated assimilation dataset before integrating it into the assimilation system. This is because, when u or v is missing, the other variable can compensate for the missing information, since u and v are strongly correlated; whereas, if θ is missing, there is no similar variable that can directly fill in the lost information. We note that the number of epochs for fine-tuning was only 10, much less than the 200 used for training. The change in the RMSE is shown in Fig. 10, where the three rows correspond to the experiments with u, v, and θ missing, respectively. When one of the three observed variables is missing, the assimilation accuracy of our model is only slightly affected, and the RMSE is still very similar to that of the traditional method.

In the second experiment, we increased the data assimilation cycle length to 6 and 12 hours. Since our model is trained for 3-hour predictions, we used the neural network to perform two and four consecutive inferences for the 6-hour and 12-hour forecasts, respectively. It is worth noting that no fine-tuning was done before our network was coupled to the assimilation system. The number of time steps was adjusted accordingly, so as to maintain identical total time lengths across the different experiments. The change in RMSE of our model and the traditional method is shown in Fig. 11. Experimental results show that the assimilation accuracy is still close to that of traditional assimilation, and is sometimes even better, which suggests that our neural network has strong fitting ability and can generalize to new experimental settings with different time steps.
Considering that our neural network reduces computational resource usage by more than an order of magnitude, our UNet-based ensemble generation strategy has great potential for practical application in large-size, fast-updating ensemble data assimilation scenarios.

Performance Results of DIDA
In the LETKF algorithm, the matrices that require eigenvalue decomposition are all of size k × k, where k is the ensemble size. We conducted experiments to compare the performance of sw-evd and xMath2.0, determining the maximum number of batches of sw-evd by the dynamic mapping strategy of CPEs on the SW26010-Pro, as used in the experiments on the DIDA system. Since xMath2.0 can only perform one eigenvalue decomposition at a time, Fig. 12 presents the average time required to solve a single matrix. Our sw-evd achieves a speedup of up to 7.37 with 128 ensemble members, with an average speedup of 4.89 (arithmetic mean).
To evaluate the performance of the DIDA system when deployed on the new-generation Sunway supercomputer, we conducted experiments in "online DA mode" (LETKF and UNet share the same computing nodes) under different degrees of parallelism. Fig. 13 shows a bar chart of the performance results of the three major components of the DIDA system in "online DA mode", where bars represent a component's time per DA cycle (three hours) under different parallelism and ensemble sizes. Fig. 13(a) breaks down the performance of each component with 512 ensemble members. The observation I/O time can be totally overlapped with computation using the data staging strategy proposed in section 3.2. Fig. 13(b) depicts the performance of UNet, which shows nearly linear scalability with increasing parallelism; this aligns with our expectations, given the parallel implementation based on model inference. Fig. 13(c) illustrates the overall performance of LETKF. Eigenvalue decomposition, which accounts for the majority of LETKF, has cubic computational complexity in the ensemble size, resulting in a corresponding cubic increase in the time per DA cycle of LETKF. There is also a close-to-linear reduction in time as the number of processes increases, demonstrating satisfactory scalability of LETKF.

Fig. 14 shows the performance of online data assimilation in one DA cycle with the DIDA system at 100-km resolution. In the traditional DA, all members are derived from GMCORE, while for UNet DA, all are generated by the UNet model. The results show that both the traditional method and the UNet model have excellent scalability, achieving over 92% parallel efficiency at the maximum ensemble size, which confirms the effectiveness of the DIDA infrastructure. Moreover, in the traditional DA, GMCORE computation accounts for most of the time in each DA cycle, 95% on average and up to 98% (in the 128-member case). After switching to the UNet model, the time consumption is reduced by at least 98%. Using similar numbers of processes (UNet using 256, 512, and 768 processes, compared to GMCORE using 240, 480, and 960 processes) across different ensemble sizes, UNet DA achieves an average speedup of 85.9 and a maximum speedup of 106.2. Even with ideal linear scalability, the traditional DA would require hundreds of thousands of processes to achieve performance similar to UNet DA. Having significantly reduced computation time and resource requirements, our DIDA system with the UNet model can easily scale to larger ensemble sizes. With an equal number of processes (768), our DIDA system with the UNet model can finish 512-member data assimilation in less than 20% of the time consumed by 32-member data assimilation with the traditional method, which demonstrates the potential of AI-integrated solutions for ensemble data assimilation.

RELATED WORK
In the traditional NWP framework, the forecast model and the data assimilation system operate separately, generally using file I/O to exchange data. Examples include the Integrated Forecasting System (IFS) [9] of the European Centre for Medium-Range Weather Forecasts (ECMWF), the Global Forecast System (GFS) and Global Data Assimilation System (GDAS) of the National Centers for Environmental Prediction (NCEP), and the Global/Regional Assimilation and Prediction System (GRAPES) [31] of the China Meteorological Administration (CMA). However, when a large number of ensemble members are used, I/O becomes a performance bottleneck [38], so DIDA uses MCT to transfer the background and analysis fields.
A Kalman filter is usually applied in ensemble data assimilation [24]. Several DA system frameworks and their coupled experiments with atmospheric models have been presented in previous studies. LETKF was utilized to assimilate real-world observation data with the Non-hydrostatic Icosahedral Atmospheric Model (NICAM) [32], and its performance, evaluated on the K computer, could scale up to 10,000 nodes (80,000 cores) [39]. S-EnKF exhibited a co-design for a scalable ensemble Kalman filter and had nearly ideal scalability up to 12,000 processors [35]. As for offline frameworks, a dynamically running job scheme and a parallel I/O algorithm were proposed to reduce the time to solution for high-resolution cases [42].
In our DIDA system, we explore the vertical parallelism of the LETKF implementation and construct a high-performance communication and data service to improve the scalability of the DA system. A previous study with up to 10,240 members [19] found that increasing the ensemble size enormously improved prediction accuracy, while a limited number of members subjected the ensemble PDF to significant sampling error. A Big Data Assimilation system [21] demonstrated strong scalability up to 14,400 nodes of the K computer and achieved a 1,000-member ensemble size [23]. Another state-of-the-art work [38] utilized the entire Fugaku supercomputer to conduct 1,024-member ensemble data assimilation experiments with the high-resolution NICAM model. Inspired by the growing success of deep learning, we designed, for the first time, a UNet-based model to generate a large ensemble with reasonable features, saving significant computational resources in ensemble DA experiments.

CONCLUSION AND FUTURE WORK
We built a low-cost, high-performance atmospheric data assimilation prototype for the new-generation Sunway supercomputer, using for the first time a UNet-based surrogate model to generate the background ensembles. Experimental results showed that, compared with the traditional method, our prototype can save an order of magnitude of computing time and resources while maintaining accuracy. Future work will extend our framework to scenarios with higher resolutions.

APPENDIX: ARTIFACT DESCRIPTION/ARTIFACT EVALUATION A ARTIFACT IDENTIFICATION
In this work, we present our progress in building a low-cost and high-performance ensemble data assimilation prototype (called DNN Integrated Data Assimilation, abbreviated DIDA) on the new-generation Sunway supercomputer. We provide the complete code of DIDA on GitHub, including the three major components: GMCORE, LETKF, and UNet. There are two versions of the code. One is based on SW (/code/sw), where the UNet is implemented in C. The other is based on x86 (/code/x86), where the UNet is implemented in Python. Both include the traditional data assimilation code (/code/sw/TradDA; /code/x86/TradDA) and the UNet data assimilation code (/code/sw/UNetDA; /code/x86/UNetDA).

B REPRODUCIBILITY OF EXPERIMENTS
Access to the new-generation Sunway supercomputer, on which the DIDA system has been ported and tuned, is currently granted only to a limited group of people, and we have no authority to open the access or provide a simulator. Furthermore, according to the confidentiality agreement with the supercomputer's vendor, the SW26010-Pro CPE version of the code is subject to confidentiality obligations. Therefore, we apologize that neither the CPE version of the code nor the artifact evaluation environment can be made publicly available, and access restrictions are imposed on reproducing any experiments carried out under our program; hence the code on GitHub cannot be executed directly. However, we provide full logs to demonstrate the reproducibility of the work, and we explain below how to run the program to obtain these logs.

For both platforms, the code can be compiled using "sh compile.sh intel" or "sh compile.sh sw" under "/SCRIPTS". For the x86 platform, the executable can be submitted using "sbatch run_dida.sbatch" under "code/x86/TradDA/SCRIPTS" for the traditional data assimilation code, or using "sh run.sh" under "code/x86/UNetDA/tests" for the UNet data assimilation code. For the SW platform, the executable can be submitted using "sh run.sh work-dir namelist ATM_proc DA_proc" under "/SCRIPTS" for both data assimilation codes. Here, work-dir is the working directory used to store logs and background field files of the DIDA system; namelist is the input configuration of the DIDA system; ATM_proc is the number of processes of the ATM module (ATM_proc equals zero for UNetDA); and DA_proc is the number of processes of both the DA module and the UNet module.

The "log" directory contains all the running logs corresponding to the experiment section of our paper. Logs under "/log/ScientificResults" correspond to section 6.2, with a one-to-one correspondence between file names and the legends of Figs. 8, 9, and 10. Logs under "/log/PerformanceResults" correspond to section 6.3, with a one-to-one correspondence between file names and the legends of Fig. 13.

Figure 2 :
Figure 2: Architecture of the DIDA System.

Figure 5 :
Figure 5: Parallel implementation of LETKF with vertical blocking

Figure 6 :
Figure 6: Optimizations for the parallel block Jacobi transformation on SW26010-pro

Figure 7 :
Figure 7: Results of RMSE for "offline DA mode" on validation set

Figure 8 :Figure 9 :
Figure 8: The analysis fields of the last step of experiments with 128 members

Figure 10 :
Figure 10: Same as Fig. 9, but for the experiments with different missing observed variables.

Figure 11 :Figure 12 :
Figure 11: Same as Fig. 9, but for the experiments with different data assimilation cycle lengths.

Figure 13 :
Figure 13: Performance of different components of DIDA

Figure 14 :
Figure 14: Strong scaling performance of online data assimilation with our DIDA system