Training one DeePMD Model in Minutes: a Step towards Online Learning

Neural Network Molecular Dynamics (NNMD) has become a major approach in material simulations. It can speed up molecular dynamics (MD) simulations by thousands of times while maintaining ab initio accuracy, and thus has the potential to fundamentally change the paradigm of material simulations. However, NNMD development suffers from two time-consuming bottlenecks. One is access to ab initio calculation results as labeling data. The other, which is the focus of the current work, is the training time of the NNMD model. Training an NNMD model differs from most other neural network training because the atomic force (which is related to the gradient of the network output) is an important physical property to be fitted. Our tests show that traditional stochastic gradient methods, such as the Adam algorithm, cannot efficiently exploit multi-sample minibatches. As a result, a typical training run (taking Deep Potential Molecular Dynamics (DeePMD) as an example) can take many hours. In this work, we design a heuristic minibatch quasi-Newtonian optimizer based on the Extended Kalman Filter method. An early reduction of gradients and errors is adopted to reduce the memory footprint and communication. The memory footprint, communication, and hyper-parameter settings of this new method are analyzed in detail. Computational innovations, such as customized kernels for the symmetry-preserving descriptor, are applied to exploit the computing power of the heterogeneous architecture. Experiments are performed on 8 different datasets representing different real-world situations, and numerical results show that our new method achieves an average speedup of 32.2x compared to the Reorganized Layer-wised Extended Kalman Filter on 1 GPU, reducing the absolute training time of one DeePMD model from hours to several minutes and making it one step toward online training.


Introduction
Molecular Dynamics (MD) with ab initio accuracy is the method of choice in theoretical studies of many microscopic phenomena, such as material defects [2], phase transitions [38], nanotechnology [41], and numerous other topics. The recently developed Neural Network Molecular Dynamics (NNMD) approach has advanced both numerical accuracy and computational efficiency, and it has resulted in a number of methods and packages such as SNAP [44], SIMPLE-NN [29], HDNNP [4][5][6], BIM-NN [48], CabanaMD-NNP [10], SPONGE [24], DeePMD-kit [46], SchNet [42], ACE [12,33], NequIP [3], DimeNet++ [16,17], and SpookyNet [45]. Although high-performance computing has greatly sped up the inference efficiency of NNMD models (for example, DeePMD, a state-of-the-art NNMD package, can simulate billions of atoms at nanoseconds per day on top supercomputers [19,26]), the model training process remains a major issue in practical NNMD development and usage. More specifically, there are two challenges in NNMD training: 1) the absolute time for training one NN model with thousands of ab initio labeling data points can be several hours; 2) since the labeling data cannot cover all of chemical space a priori, the training procedure is invoked repetitively to obtain a well-trained NN model. As shown in Figure 1(a), for the same system, samples with different configurations are incorporated for repetitive training. Furthermore, each system needs to be trained individually due to inherent differences between systems, as shown in Figure 1(b)(c). In one particular NNMD development, this retraining can occur 20 to 100 times, as depicted in Figure 1(d). Thus the practical use of NNMD suffers significantly from the extended training time. This is common to many NNMD packages, such as SchNet [42], NequIP [3], and DeePMD-kit [46]. In this work, we use DeePMD as our example, as it is a state-of-the-art NNMD method that combines a physical symmetry-preserving descriptor with a deep neural network model. A broad spectrum of physical phenomena, such as the phase diagram of water [52], nucleation of liquid silicon [7], diffusion in lithium batteries [34], and adsorption of N2O5 [15], has been studied using DeePMD.
An effective way to increase the computational efficiency of NNMD would be to use a larger minibatch. Here, the batch size means the number of atomic-configuration "images" used together for gradient calculations with the same model parameters. However, our tests show that it is infeasible for DeePMD to apply a larger minibatch under the Adam training method currently deployed in the DeePMD package. When a larger minibatch is adopted in the Adam-based method, more tactical parameter tuning may be needed to avoid instability and convergence issues. To our knowledge, a universal parameter-tuning strategy has not been found, owing to the differing situations of physical systems. Without fine-tuning, training with a large batch can even deteriorate the convergence rate. Table 1 shows the convergence of Adam-based DeePMD under different training batch sizes. Note that for batch sizes 32 and 64 the training is readjusted by multiplying the learning rate by the square root of the minibatch size. For a simple copper bulk system, when increasing the batch size from 1 to 32 using the default training parameters, the corresponding number of epochs increases from 17 to 327 and the wall-clock time extends from 9 hours to 20 hours. We remark that this default setting (scaling the learning rate by the square root of the minibatch size) converges faster than other heuristics, such as scaling the learning rate linearly with the minibatch size. This indicates that, currently, there is no practical way to increase the batch size in the Adam algorithm used in much NNMD training. In summary, a conflict arises between the requirement for repeated training and the necessity of manually readjusting parameters for every training procedure.

Table 1. The Adam-based DeePMD convergence under different training batch sizes. The first two columns are the physical systems and the converged Energy RMSE (trained by DeePMD using the default setting: Adam optimizer with batch size 1). The column "batch size" records the converged epochs under different batch sizes when reaching the given error (the second column). The column "epoch growth" is the factor by which the number of consumed epochs grows from batch size 1 to 32 and from batch size 32 to 64. A "-" denotes that when the batch size increases, the error cannot decrease to the default Energy RMSE.
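The square-root scaling heuristic described above can be sketched in a few lines. This is an illustrative fragment written by us (the function names and base rate are our own, not from the DeePMD package), assuming a base learning rate tuned for batch size 1:

```python
import math

def scaled_learning_rate(base_lr: float, batch_size: int) -> float:
    """Square-root learning-rate scaling for larger minibatches.

    Aims to keep the variance of the averaged-gradient update roughly
    constant when the batch size grows from 1 to `batch_size`.
    """
    return base_lr * math.sqrt(batch_size)

def linearly_scaled_learning_rate(base_lr: float, batch_size: int) -> float:
    """The alternative linear-scaling heuristic mentioned in the text."""
    return base_lr * batch_size
```

For batch size 64 and a base rate of 0.001, square-root scaling gives 0.008, whereas linear scaling gives 0.064; as the table indicates, neither heuristic recovers the batch-size-1 convergence without further hand tuning.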
The Adam algorithm is a first-order gradient algorithm. Theoretically, second-order optimization methods can be much more efficient than first-order methods and, importantly, exhibit greater simplicity and robustness in hyper-parameter tuning to some degree [36,37]. Second-order optimizers can converge faster than first-order stochastic gradient descent (SGD) methods by leveraging curvature information to accelerate model training. In NNMD, a representative quasi-Newtonian method is the Extended Kalman Filter (EKF) based optimizer. One example is the Reorganized Layer-wised Extended Kalman Filter (RLEKF) based DeePMD [23]. RLEKF converges in far fewer epochs (11.7% of the epochs) compared with the Adam optimizer. However, the training wall-clock time is still 80% of that of single-sample minibatch Adam-based DeePMD, due to the additional calculation required by Kalman filter theory and the instance-by-instance updating in RLEKF [23]. The second-order method has inherent advantages in NNMD model training, and we believe an efficient multi-sample minibatch Extended Kalman Filter optimizer can further reduce the training wall-clock time.
Optimizers, in essence, update each individual entry in the parameter vector. In a first-order optimizer, each entry has its own step size, which is adaptively updated using gradients. In second-order methods, the Hessian matrix is involved; to enhance practicability, many approximation algorithms [9] have emerged to reduce the computation and communication cost. In this work, we propose the Fast Extended Kalman Filter (FEKF).

Here, the symmetry order is introduced, and G is a composite function that involves three transformations: E0, E1, and E2. The transformations are defined as follows:

The network outputs the potential energy of the system of interest, E := Σ_i E_i, and the force on atom i: F_i := −∇_{r_i} E. The optimizer of the current DeePMD model is Adam, a first-order method widely used by the vast majority of applications in the AI-for-Science field, since it is more stable, more readable, and better supported by TensorFlow and PyTorch than second-order methods. However, the biggest disadvantage of first-order methods compared to second-order methods is that they require more epochs for training, which leads to a slow convergence rate. Therefore, there is an urgent need for new, fast training algorithms. Research on second-order methods is still scarce. Although the Newton method converges faster than first-order methods, its computational complexity is too high, and the algorithm's stability is poor due to the need to solve for the inverse of the Hessian matrix. Fortunately, the Kalman filter has a computational complexity and convergence speed that are both between those of Newtonian methods and first-order methods, providing a tool for mitigating the conflict between convergence speed and computational complexity. One of the typical Extended Kalman Filter-based methods is RLEKF [23], which organizes the expensive error covariance matrix into blocks to reduce the computational complexity.
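As a minimal illustration of the F_i := −∇_{r_i} E relation (a toy example of our own, not DeePMD code), consider a harmonic pair energy whose analytic force can be checked against a finite difference of the energy:

```python
# Toy illustration of F = -dE/dr for a harmonic pair potential.
# E(r) = 0.5 * k * (r - r0)^2, so the analytic force is F = -k * (r - r0).

def energy(r, k=2.0, r0=1.0):
    return 0.5 * k * (r - r0) ** 2

def force_analytic(r, k=2.0, r0=1.0):
    return -k * (r - r0)

def force_finite_difference(r, h=1e-6):
    # F = -dE/dr, approximated by a central difference
    return -(energy(r + h) - energy(r - h)) / (2 * h)
```

In the actual network, this derivative is taken through the full model by automatic differentiation, which is why the force enters the loss as a gradient-dependent target.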

RLEKF method
The DeePMD neural network can be represented as a stochastic system in the form of a Kalman filter targeting θ_t, an estimator of the true parameter vector, where L is the number of blocks determined by the RLEKF splitting strategy, θ_0 is the initialization of the trainable parameters, θ_t is the vector of all trainable parameters in the network h(·, ·), {(x_t, y_t)}_{t∈N} are pairs of features and labels, {y_t}_{t∈N} can also be seen as measurements of the system, and {ξ_t}_{t∈N} are noise terms subject to a normal distribution with mean 0 and time-dependent variances. The weights error covariance matrix of RLEKF is P = diag(P_1, ..., P_L), a block-diagonal matrix whose block shapes n_1^2, ..., n_L^2 depend on the RLEKF splitting strategies, where n_l and n := Σ_{l=1}^{L} n_l are the number of weights of the l-th block and of the whole neural network, respectively.

For any given block, we recover the estimator of the block parameters via that of a scaled variable divided by a scaling factor, and then obtain the update rule. We remark that RLEKF is a second-order method and can converge to a local minimum within a couple of epochs. Similar to other second-order methods such as BFGS, the Kalman filter provides P as an estimator of the inverse of the loss function's Hessian matrix, with the second-order derivative term of the loss omitted. Compared with Adam, RLEKF requires a larger memory footprint during training, since the diagonal blocks of the matrix P are stored and updated throughout the training process.
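A single-sample EKF weight update can be sketched as follows. This is a generic textbook-style Kalman step written by us for illustration (the symbols P, H, and the forgetting factor follow the description above; it is not the RLEKF implementation):

```python
import numpy as np

def ekf_step(theta, P, H, error, lam=0.98):
    """One generic EKF weight update for a scalar measurement.

    theta : (n,) current weights of one block
    P     : (n, n) error covariance (one diagonal block in RLEKF)
    H     : (n,) gradient of the network output w.r.t. the weights
    error : scalar, label minus prediction
    lam   : forgetting factor (a hyper-parameter)
    """
    a = 1.0 / (lam + H @ P @ H)         # scalar innovation factor
    K = a * (P @ H)                     # Kalman gain, shape (n,)
    theta = theta + K * error           # weight update
    P = (P - np.outer(K, H @ P)) / lam  # covariance update
    return theta, P
```

For a linear model this step reduces to recursive least squares, which is why the covariance P acts as an online estimate of the inverse curvature.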

Parallel Algorithm
The single-sample minibatch RLEKF [23] has demonstrated robustness while maintaining accuracy, effectively preventing gradient explosion. This results in properties similar to those of the Fast Extended Kalman Filter (FEKF), as both are variants of Kalman filter based algorithms.

Table 2. The general weights-increment formulation of first-order, second-order, and quasi-Newtonian (EKF) based methods. The columns are: First-order, Second-order, and quasi-Newtonian (EKF). The second to fourth rows represent the updating rule under a single sample x_t, the naive multi-sample minibatch (a set S), and the approximated algorithm under a multi-sample minibatch (a set S), respectively.

This section is organized as follows: In Section 3.1, we first introduce two multi-sample minibatch KF parallel algorithms (fusiform-shaped and funnel-shaped dataflow, respectively). Our proposed FEKF adopts the funnel-shaped updating flow. Section 3.2 provides intuitive guidance on hyper-parameter settings. Section 3.3 analyzes the memory footprint and communication overhead of the two KF parallel algorithms. To exploit the computational power provided by the heterogeneous architecture, we apply system optimizations such as derivative refinement and customized kernels, as discussed in Section 3.4.

Algorithmic Innovation
Neural network training is a process of iteratively adjusting the neural network weights. Each iteration can be represented as W_{t+1} = W_t + ΔW*. Optimizers differ in their own unique ways of giving ΔW*. The ΔW* of first-order and second-order methods is summarized in Table 2. A first-order method updates each individual entry in the parameter vector using its own gradient. A second-order method adopts certain matrices (preconditioners, such as the Hessian) to transform the gradient. The expectation "E" over the samples in one batch is taken empirically when we extend the update procedure from a single sample x_t to a multi-sample set S of randomly selected samples, as shown in the second and third rows of Table 2. An approximation algorithm is then empirically proposed to reduce computational time and resources compared to the exact algorithm, transitioning from the third row to the fourth row. Fusiform-shaped dataflow: averaging over the samples in one batch is favored and empirically employed by first-order and second-order optimizers when the batch size increases from 1 to a larger value. Theoretically, the weights increment ΔW* of EKF is approximated by the mean value of the product of the Kalman gain and the absolute error, written as E(K · e); we call this the Naive-EKF, shown in the last column of the third row of the Table. Funnel-shaped dataflow: an "early reduction" is adopted, i.e., approximating E(K · e) by K(E(H)) E(e), where the Kalman gain is computed once from the averaged gradient. This is inspired by the first- and second-order approximating methods: when differentiation of the samples first occurs in the backward pass, the merge operation is taken. A typical example in second-order methods is reducing over the Hessian matrices and gradients, denoted as E(H) and E(g) respectively in Table 2. In our Fast Extended Kalman Filter, we perform the reduction operations at the very initial point. More specifically, the reduction is applied to the absolute errors
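The two dataflows can be contrasted in a few lines of illustrative code. This is our own sketch, reusing a generic Kalman-gain formula K = P·H / (λ + Hᵀ·P·H); it is not the actual FEKF kernels:

```python
import numpy as np

def kalman_gain(P, H, lam=0.98):
    """Generic scalar-measurement Kalman gain for covariance P."""
    return (P @ H) / (lam + H @ P @ H)

def fusiform_update(P, Hs, es, lam=0.98):
    """Naive-EKF: one Kalman gain per sample, then average K_i * e_i."""
    return np.mean([kalman_gain(P, H, lam) * e for H, e in zip(Hs, es)], axis=0)

def funnel_update(P, Hs, es, lam=0.98):
    """FEKF-style early reduction: average H and e first, one gain."""
    H_bar, e_bar = np.mean(Hs, axis=0), np.mean(es)
    return kalman_gain(P, H_bar, lam) * e_bar
```

The funnel form computes the gain once per batch instead of once per sample, which is what removes the per-sample covariance work and, in the distributed setting, the per-sample communication.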

Tuning for Fast Convergence
Finding the quasi-learning-rate: In the FEKF, the square root of the batch size is applied in the weights update. It can be viewed as the expected length of the sum of several gradients, if these gradients are chosen randomly and uniformly over all possible directions with unit length. Hence, the weights update rule of FEKF is set as Eq. 2. We verify the effectiveness of Eq. 2 and can readily conclude from Figure 4 that the square-root-of-batch-size factor leads to faster convergence.
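The random-direction intuition can be checked numerically: for m independent unit vectors with uniformly random directions, E‖Σ v_i‖² = m, so the root-mean-square length of the sum is exactly √m. A small Monte Carlo sketch of our own:

```python
import numpy as np

def rms_sum_length(m, dim=3, trials=20000, seed=0):
    """RMS length of the sum of m random unit vectors in `dim` dimensions."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(trials, m, dim))
    v /= np.linalg.norm(v, axis=2, keepdims=True)  # normalize to unit length
    sums = v.sum(axis=1)                           # sum the m vectors
    return np.sqrt(np.mean(np.linalg.norm(sums, axis=1) ** 2))
```

For m = 32 the estimate lands close to √32 ≈ 5.66, matching the factor used in the FEKF update.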
Hyper-parameters setting: The second-order method requires less hyper-parameter tuning than the first-order method. We provide a task-independent parameter-setting guideline that can be applied across various training systems. In FEKF, the hyper-parameter adjustment is related only to the training batch size; hence the practicability of FEKF. Let us rewrite the schedule as Eq. 3 to better analyze λ. The increment of λ consists of two parts: the factor (1 − ν) can be taken as a constant once ν is determined, and the remaining factor decreases to 0 as t increases, so the increment of λ gets smaller as t goes to infinity. As the batch size increases, a larger step size (1 − ν) is usually adopted; hence ν is supposed to be set to a smaller value. A lower initial λ is required to ensure a wider range of λ as t changes from 0 to infinity. In our tests, when the training batch size exceeds 1024, the recommended λ and ν are 0.90 and 0.996 respectively. We remark that λ and ν are the only adjusted hyper-parameters when using large batch sizes across various systems.
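A sketch of this forgetting-factor schedule, assuming the common Kalman-filter fading-memory recurrence λ_t = ν·λ_{t−1} + (1 − ν). That recurrence form is our reading of Eq. 3, not a quotation of it; the recommended λ_0 = 0.90 and ν = 0.996 are from the text:

```python
def lambda_schedule(lam0=0.90, nu=0.996, steps=2000):
    """Forgetting-factor trajectory lambda_t = nu*lambda_{t-1} + (1 - nu).

    ASSUMPTION: this recurrence is the common Kalman-filter fading-memory
    schedule; the paper's Eq. 3 is read in this form. lambda_t rises
    monotonically from lam0 toward 1, with shrinking increments.
    """
    lam, traj = lam0, [lam0]
    for _ in range(steps):
        lam = nu * lam + (1.0 - nu)
        traj.append(lam)
    return traj
```

Under this form, the step size of λ is proportional to (1 − ν), and the trajectory approaches but never reaches 1, consistent with the analysis above.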

Tuning for Data Movement
Memory footprint reduction: As the training batch size increases, the FEKF does not suffer the substantial memory-footprint overhead of the Naive-EKF, because samples within a batch share a uniform P matrix. In the Naive-EKF, each sample updates independently and has an individual P matrix, which becomes unbearable when a large batch is adopted in training.
Communication avoidance: In the FEKF, averaging over gradients and errors ensures an identical replica of the P matrix among different GPUs; hence, the communication overhead for P is eliminated. To illustrate the communication of FEKF, we first introduce Figure 5. Assume that one minibatch of data is first separated into r chunks; the training is then parallelized among r GPUs. Under a typical ring-Allreduce communication mode, the communication volume of the gradients is (r − 1) · n, where r is the number of GPUs and n is the number of weight parameters. Even with a smart decoupling strategy (decoupling P into blocks with an even-splitting strategy [23]), the order of memory occupation for each block is O(n_b^2) and the number of blocks is of order O(n/n_b), where n_b is the block-size splitting threshold; hence the total communication for P would be of order O((r − 1) · n · n_b). This communication of P in the Naive-EKF still hinders large-scale distributed training.
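The estimates above can be written out explicitly. The helper below follows the paper's (r − 1) · size convention for ring-Allreduce volume; the function names are our own:

```python
def gradient_comm_volume(r, n):
    """Ring-Allreduce volume for an n-parameter gradient over r GPUs,
    following the (r - 1) * size convention used in the text."""
    return (r - 1) * n

def p_matrix_comm_volume(r, n, n_b):
    """Order-of-magnitude volume if the block-diagonal P had to be
    synchronized: O(n / n_b) blocks of size n_b^2 each."""
    num_blocks = n / n_b
    return (r - 1) * num_blocks * n_b ** 2   # = (r - 1) * n * n_b
```

For DeePMD's n = 26651 and n_b = 10240, synchronizing P would cost roughly n_b (about 10^4) times the gradient volume, which is why FEKF's uniform-replica design matters.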

Specific Implementation Optimization
The network optimization: As detailed in Section 2.1, DeePMD has a unique symmetry-preserving descriptor D that is constructed via the output of the embedding network and then fed into the fully connected fitting network. The system energy is fitted via a forward pass, and the atomic forces are derived by backward propagation to maintain energy conservation. Note that the force field F_i := −∇_{r_i} E involves first-order derivative calculations and is realized by the Autograd API of the machine learning framework. We observe many fragmented kernels being launched when using the Autograd API; to this end, we implement the force calculation manually. In this section, we focus on the derivative of the symmetry-preserving operation. We remark that the symmetry-preserving principle is widely applied in physics simulations, and the descriptor D is critical for accurately describing the force field.

Table 3. Dataset description. The physical systems, the temperature at which the data was generated, the time step (frequency of yielding snapshots), the number of snapshots (images), and the atom number in one snapshot are indicated from left to right.

The optimizer optimization: (1) Rewriting the P update: Consider the updating rule of the P matrix, shown in Line 10 of Algorithm 1. Note that it involves the K · K^T operation, where K has shape n × 1. In machine learning frameworks, the backend invokes CUDA GEMM kernels, which are highly optimized; for example, a tiling strategy is often used to efficiently utilize the GPU's memory hierarchy (a simple tiling method is introduced in Supplementary I). The number of multiply-add operations required for K · K^T is 2n^2, but it is increased by a further factor when using the torch.matmul() API. Hence, we substitute the PyTorch implementation of Algorithm 1, Line 10 with a handwritten kernel, reducing both the number of floating-point operations and the memory accesses. (2) Caching intermediate results: The calculation in Line 8 of Algorithm 1 involves a product that is also required in later steps, so we cache the intermediate result for reuse.
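A minimal sketch of the idea behind the handwritten kernel (illustrative NumPy of our own, not the CUDA code): because K is a column vector, P − a · K·K^T is a rank-1 update that needs only about 2n² multiply-adds and one pass over P, rather than a general matrix-matrix product:

```python
import numpy as np

def p_update_rank1(P, K, a):
    """Rank-1 covariance update P <- P - a * K K^T.

    Exploits that K is n-by-1: np.outer touches each P entry once,
    costing ~2*n^2 multiply-adds, with no general GEMM needed.
    """
    return P - a * np.outer(K, K)

def p_update_matmul(P, K, a):
    """Equivalent formulation via a matrix-matrix product on an
    n-by-1 operand, the shape a framework GEMM call would see."""
    Kcol = K.reshape(-1, 1)
    return P - a * (Kcol @ Kcol.T)
```

Both produce identical results; the handwritten kernel avoids the GEMM setup and intermediate materialization that the generic path incurs.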

Experiments Setup
Hardware and software stacks. All numerical tests were conducted on a GPU cluster consisting of 629 computing nodes. Each node is equipped with two 64-core Kunpeng 920s (ARM architecture) and 4 NVIDIA A100 GPUs. Each node has 256GB of host memory (8 channels × 32GB), and each A100 GPU has a memory capacity of 40GB with a bandwidth of 900GB/s. The CPUs and GPUs are connected via PCIe 4.0 with a bandwidth of 64 GB/s. The computing nodes are interconnected in a non-blocking fat-tree topology using a RoCE interconnect, providing a total bandwidth of 25 GB/s. The PyTorch 2.0 framework and Horovod 0.27.0 are utilized in training the DeePMD model. GCC 9.3 and CUDA 11.8 are used as our CPU and GPU compilers, respectively.

Dataset. The systems tested are bulk systems, which pose a much greater challenge for AIMD training than small-molecule simulations. One reason is the relatively large total number of atoms in one sample, usually over 50 atoms per image. Another reason the training task is more complex than for small molecules is the variety of configurations, since the samples are generated at a mixture of different temperatures. The sub-systems and their corresponding temperatures are listed in the first two columns of Table 3, ranging from 200 to 2400 Kelvin. For each dataset, snapshots are yielded by solving ab initio molecular trajectories via PWmat [25], except for HfO2 [47]. During this process, to enlarge the sampling span of the configuration space, we rapidly generate a long sequence of snapshots with a small time step (the third column of Table 3) and select one for every fixed number of steps. The number of samples ranges from 10k to 70k (the fourth column).
Model parameters. The DeePMD network sizes are [25, 25, 25] and [400, 50, 50, 50, 1] for the embedding net and fitting net, respectively. The truncation value used in the symmetry-preserving operation is set to 16. The activation is tanh. The number of parameters is 26651. In the Adam single-sample-batch training procedure, the learning rate is 0.001 with an exponential decay of 0.95 every 5000 steps. In the EKF optimizer training, the block size is set to 10240. The neural network weights are updated once with the total energy and four times with the atomic forces per step.
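The stated parameter count can be reproduced from these layer sizes. The tally below assumes a 1-input embedding net and DeePMD-style "timestep" vectors on the two 50→50 residual fitting layers; that residual detail is our assumption, chosen because it makes the totals match both the 26651 figure and the block sizes 1350/10240/9760/5301 reported in the performance analysis:

```python
def dense_params(n_in, n_out, timestep=False):
    """Weights + biases of one dense layer, plus an optional
    DeePMD-style per-layer 'timestep' vector (ASSUMED here)."""
    return n_in * n_out + n_out + (n_out if timestep else 0)

# Embedding net: 1 -> 25 -> 25 -> 25
embedding = dense_params(1, 25) + dense_params(25, 25) + dense_params(25, 25)

# Fitting net: 400 -> 50 -> 50 -> 50 -> 1, with timestep vectors on the
# two 50 -> 50 residual layers (assumption, see above)
fitting = (dense_params(400, 50)
           + dense_params(50, 50, timestep=True)
           + dense_params(50, 50, timestep=True)
           + dense_params(50, 1))

total = embedding + fitting   # 1350 + 25301 = 26651
```

Note that the embedding subtotal, 1350, equals the first covariance block size quoted later, and the block sizes 1350 + 10240 + 9760 + 5301 also sum to 26651.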

Accuracy:
The Adam single-sample minibatch DeePMD shows SOTA accuracy in many published papers. In the experimental part, we use the RMSE at which Adam single-sample minibatch training converges (the sum of the Energy RMSE and the Force RMSE) as the baseline. RLEKF exhibits accuracy comparable to Adam-based DeePMD, and we claim that FEKF also reaches the baseline accuracy. Table 4 describes the accuracy results of Adam with batch size 1 and FEKF with training batch size 32. For the eight physical cases, the RMSE of FEKF is slightly lower than that of Adam (lower is better).
Generalization gap: The FEKF with training batch size 32 does not suffer from a generalization problem, as shown in the last column of Table 4. The training-set RMSE (before the slash) and the testing-set RMSE (after the slash) differ little, with gaps ranging from 0.0009 to 0.0798.
Convergence ratio: The faster convergence is driven by the square-root-of-batch-size factor applied when updating the weights. Section 3.1 provided the intuition for this strategy, and the experiments support it well. The convergence (epoch) ratio is used to measure convergence. The third column of Table 4 is the ratio of the converged epoch count of 32-sample minibatch FEKF to that of single-sample minibatch Adam. A lower convergence ratio means fewer convergence epochs for FEKF, indicating a faster training procedure.

End-to-end Training Time
We choose 8 representative datasets without bias and test these systems comprehensively.

Performance Analysis
We have conducted extensive optimizations (detailed in Section 3.4) to fully utilize the computing power of the GPU. The testing is performed on an A100 GPU with the typical Cu72102 dataset as the example; the training batch size is 64.
Memory reduction: The update of the P matrix is substituted by a customized handwritten kernel, which reduces the memory footprint compared with the PyTorch implementation of the P update. In the copper system with the training batch size set to 1, the peak memory usage is reduced from 3380MB to 1805MB. To account for the peak memory usage, we introduce the following prerequisites. In the DeePMD network, the number of parameters is 26651, and the block size used in the FEKF method is 10240. The weights error covariance matrix of FEKF is P = diag(P_1, P_2, P_3, P_4), a block-diagonal matrix with block shapes 1350^2, 10240^2, 9760^2, 5301^2 given by the gather-and-split strategies of [23]. P_1, P_2, P_3, and P_4 have memory consumptions of 13.90, 800, 726.76, and 214.39 MB, respectively, so P has a total memory footprint of 1755MB. The peak memory usage of the optimized FEKF is 1805MB, consisting of P, the weights, and intermediate variables. During the real computation, the peak memory usage is larger than this allocation because additional intermediate results are generated. The P update implemented in PyTorch inevitably incurs a write of an n_l × n_l matrix when calculating the outer product K K^T, and the subtraction in the P update introduces a further n_l × n_l matrix read/write overhead. The dominant intermediate memory footprint comes from P_2; hence the intermediate allocation is twice the P_2 footprint, and the whole peak memory usage is 3405MB (2 × 800 + 1805) in theory. In summary, the handwritten kernel reduces the memory consumption by roughly twice the memory footprint of the largest block among {P_l}, l = 1, ..., L.
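The block footprints above can be reproduced assuming double-precision (8-byte) storage and MB measured as 2^20 bytes; that is the convention (our inference) that matches the quoted numbers:

```python
def block_mb(n, bytes_per_elem=8):
    """Memory of one n x n covariance block in MB (MiB), assuming
    float64 storage; this convention reproduces the quoted figures."""
    return n * n * bytes_per_elem / 2**20

blocks = [1350, 10240, 9760, 5301]
sizes = [block_mb(n) for n in blocks]   # ~[13.90, 800.0, 726.76, 214.39]
total = sum(sizes)                      # ~1755 MB
```

With these numbers, 2 x 800 MB of P_2-sized intermediates on top of the 1805MB working set gives the 3405MB theoretical peak cited above.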
Scalability analysis: We remark that the communication overhead of FEKF approaches that of a first-order optimizer when the number of GPUs (#GPUs) is far smaller than the number of neural network parameters. The distributed error covariance matrices P do not need to be communicated because they are already uniform across GPUs. In both the FEKF and the Adam-based method, the reduction over gradients is required for the next iteration. The gradient g = {g_1, g_2, g_3, g_4} of FEKF has shapes {1350 × 1, 10240 × 1, 9760 × 1, 5301 × 1}, with memory footprints of 0.01, 0.08, 0.07, and 0.04 MB, respectively; the overall memory usage of the neural network gradient is 0.2MB (Mem(g)). The communication volume of the gradients is (#GPUs − 1) × Mem(g). The only additional communication overhead of FEKF comes from the absolute error (ABE), whose communication is of order O(#GPUs). That is to say, if #GPUs ≪ n, the predominant communication overhead in FEKF arises from the gradients, and the communication of the ABEs can be ignored.
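The gradient figures follow from the same float64/MiB convention as the P blocks. A sketch with names of our own choosing:

```python
def vector_mb(n, bytes_per_elem=8):
    """Memory of an n x 1 gradient vector in MB (MiB), float64 assumed."""
    return n * bytes_per_elem / 2**20

grads = [1350, 10240, 9760, 5301]
mem_g = sum(vector_mb(n) for n in grads)   # ~0.20 MB in total

def gradient_comm(num_gpus, mem):
    """Per the text's convention: (#GPUs - 1) x Mem(g)."""
    return (num_gpus - 1) * mem
```

At 0.2MB per reduction, gradient traffic stays negligible next to the 1755MB covariance state that FEKF never has to move.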

Related work
Optimizers play a fundamental role in training neural networks. Adam [27], AdamW [32], and SGD [18] are widely used in training deep neural networks; they are regular, standard first-order (stochastic gradient based) methods. With the advent of large-scale datasets, training deep neural networks usually takes days [11,21]. Thanks to recent hardware advances, a feasible way to tackle the training issue is to apply large mini-batches, and plenty of strategies [18,22] have been proposed for effective large-mini-batch training. One straightforward strategy multiplies the learning rate by the square root of the minibatch size [22], based on the variance-keeping principle for large mini-batches. Linear scaling multiplies the learning rate by the minibatch size [18]; using linear scaling with LR warm-up, ResNet-50 was trained with batch size 8K without loss in accuracy [18]. However, these methods need hand-tuned warm-up to avoid instability and can be detrimental beyond a certain batch size. Another strategy for training with larger minibatches uses adaptive learning-rate mechanisms, such as LARS [50] and LAMB [51]. LARS and LAMB are representative methods, but their performance differs across networks: LARS works well on ResNet-50, while LAMB gains performance on attention models. Although first-order large-minibatch training has been deeply investigated, some recently emerged AI-for-Science networks (such as DeePMD [46] and NequIP [3]) still adopt small mini-batches in the training procedure. Large-minibatch training has not been successfully applied to these NNMD networks for the following reason: the network training is strongly coupled with the physical systems, each system trains a customized network tailored to it, and the high cost of the necessary hand-tuned warm-up prelude across all the various training systems is not acceptable. Hence, a universal training method suitable for every system is more favored.
Second-order optimizers use second derivatives and/or second-order statistics of the data to speed up the iteration. Well-known second-order optimizers, such as Shampoo [20,43], K-FAC [35], SP-NGD [39], BFGS [31,54], Ada-Hessian [49], THOR [13], and RLEKF [23], have been proposed for their faster convergence speed. These second-order methods are formulated on distinct theoretical foundations. In this paper, we focus on methods based on Extended Kalman Filter theory. The Kalman filter method is robust and well-founded, but implementations of KF for neural network weight updating in real applications are seldom studied; [23] is one such work, with an instance-by-instance weight-updating implementation. In this paper, we propose a parallelized FEKF for neural network weight updating.
Kernel fusion aims at reducing the number of launched kernels. It enables better sharing of computation and eliminates intermediate allocations [30]. Kernel fusion is an indispensable deep-learning-oriented optimization supported by deep learning compilers such as TVM [8] and TensorRT [53]; however, their main effort is to make inference more efficient. The XLA [28] compiler can be used in model training but only supports the TensorFlow [1] and JAX [14] frameworks. Expanding the capability of deep learning compilers to support model training on more popular frameworks, such as PyTorch [40], is imperative. In summary, the kernel fusion implemented in our paper is focused on model training and differentiation-based refinement.

Conclusion
The slow convergence of the first-order Adam and the instance-by-instance RLEKF method have hindered the time-to-solution of the DeePMD model. In this paper, we propose a parallelized KF algorithm, namely FEKF. FEKF accelerates the training process with no sacrifice in accuracy. As a quasi-Newtonian method, FEKF requires less hyper-parameter tuning compared to first-order methods, which naturally fits the requirements of repetitive training in real scenarios. We give a heuristic parameter-tuning strategy that is independent of the specific training tasks; hence, hand-tuning in each training run is eliminated. FEKF also has benefits for memory allocation and data movement. Besides, system optimizations such as kernel fusion are adopted to further exploit the computing power of the heterogeneous architecture. Comprehensive tests on eight physical systems demonstrate consistent effectiveness in convergence.
There are also two main issues with EKF methods: first, a suitable and theoretically proven loss function is needed for classification tasks.Second, the P decoupling strategy needs to be adjusted for different network architectures like CNN, RNN, GNN, and Transformers.
In the future, we will work on the rigorous theoretical proof of this FEKF and adapt FEKF to support model parallelism.AI-for-Science neural network training is growing more and more important.The proposed FEKF sheds light on other AI-for-Science models and other fields such as computer vision and natural language processing.
The testing is performed on 8 systems(Cu, Al, Si, Mg, CuO, HfO 2 , NaCl, H 2 O).The test is conducted using the Cu system as an example, and the same procedure can be extended to other systems.
The testing is performed on an ARM server (CPU: Kunpeng 920s, GPU: NVIDIA A100-PCIe-40GB). Jobs are submitted via the Duonao Job Scheduler. We provide job scripts (for both the Duonao Job Scheduler and the Slurm Job Scheduler) in our MLFF code.

A.2 Installation
Download the code from Zenodo and then build the feature generation tools and the customized op:

cd RLEKF/src/
./build.sh
cd ../../FEKF_multi/Op/Src
pip install .
cd ../../../

A.3 Experiments
The feature generation has run successfully when the output log shows "Saving npy file done".

cd dataset/Cu72102
chmod +x cu_gen_data.sh
dsub -s cu_gen_data.sh

The default generated directories are: "figure7 rlekf bs1", "figure7 fekf bs32", "figure7 fekf bs32 opt".
Table 5: to get the training wall time (s). The default generated directories are "table5 rlekf bs1 gpu1", "table5 fekf opt bs32 gpu1", "table5 fekf opt bs512 gpu4", "table5 fekf opt bs4096 gpu16".

cd dataset/cu72102
awk '{last=$NF} END {print last}' the/generated/file/epoch_train.dat

Figure 1. The repeated training in NNMD. (a) An example of the copper system under different temperatures. (b) and (c) The H2O and CH4 retraining processes. (d) The retraining loop.

Figure 2. The network of DeePMD. (a) The 3-layer fully connected embedding network. (b) The fitting net fits the potential energy from the descriptor D. The translation from the environment matrix R̃ to the descriptor D is performed by the symmetry operation in (c). (c) The entire workflow of the DeePMD network.

Figure 4. Effect of the quasi-learning-rate factor on the convergence rate of the energy.
2. It can be derived by statistically averaging the per-sample increments Δw_{t+1}^(i), i = 1, ..., n. The brace over the individual Δw_{t+1}^(i) in Figure 5(a) means a reduction is performed. We view this as a "computing-then-aggregation" mode, since the KF computation is performed on each individual sample using its error e_{t+1}^(i) and gradient H_{t+1}^(i), where i ∈ {1, ..., n}, as shown in the last column of the fourth row of Table 2.
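The two aggregation modes (per-sample KF steps averaged afterwards, versus averaging errors and gradients first and performing a single KF step) can be contrasted with a simplified sketch. This is an illustrative toy, not the paper's exact update: a scalar-output EKF step with unit observation noise is assumed, and P, H, and e are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 6                       # n samples in the minibatch, d weights
P = np.eye(d)                     # error covariance (shared across the minibatch)
H = rng.normal(size=(n, d))       # per-sample gradients of the prediction
e = rng.normal(size=n)            # per-sample prediction errors

def kf_step(P, h, err):
    """One scalar-output extended-Kalman-filter weight increment."""
    a = 1.0 / (h @ P @ h + 1.0)   # innovation scale (unit observation noise)
    K = a * (P @ h)               # Kalman gain
    return K * err                # weight increment = K * e

# "computing-then-aggregation": per-sample KF steps, then average
dw_per_sample = np.mean([kf_step(P, H[i], e[i]) for i in range(n)], axis=0)

# early reduction: average gradients and errors first, then one KF step
dw_reduced = kf_step(P, H.mean(axis=0), e.mean())

print(dw_per_sample.shape, dw_reduced.shape)  # (6,) (6,)
```

Note that the two modes are not numerically identical in general; the early reduction is a heuristic approximation, but it needs only one KF computation per minibatch instead of n.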

Figure 5. The data distribution, calculation, and communication of the naive fusiform-shaped multi-minibatch EKF. (a) The samples are first divided into multiple chunks. (b) The data in different chunks are calculated on the corresponding GPUs. (c) The communication overhead under the Ring-AllReduce operation.

Figure 6. The computation of the first-order derivative of the energy E with respect to the environment matrix entries x_ij. (a) The abstraction of the symmetry-preserving operation, briefly denoted as D = G^T R̃ R̃^T G_<, where G_< denotes the first M_< columns of G. (b) The derivative of the descriptor D with respect to x_ij. (c) The derivative of the energy E with respect to x_ij. (b) and (c) show the calculation process of Eq. 4.
Problem statement: The descriptor is written as D = G^T R̃ R̃^T G_<. Our goal is to calculate ∂E/∂x_ij. Problem decomposition: The first step is to calculate ∂D/∂x_ij by the product rule; the second step is to calculate ∂E/∂x_ij by the chain rule. Problem solution: For ease of understanding, we reduce the derivative of the tensor D to the scalar derivative with respect to a single entry x_ij, and iterate over i and j to obtain the complete derivative of D. The two decomposed problems are expressed mathematically in Eq. 4; the computation and the actual tensor shapes are illustrated in Figure 6(b) and (c), respectively.
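The product-rule/chain-rule decomposition can be checked numerically with a small NumPy sketch. For clarity the embedding output G is treated as constant, the shapes are illustrative, and W is a hypothetical stand-in for the fitting-net back-signal dE/dD; this is not the paper's kernel implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
Nm, M, Ml = 5, 4, 2               # neighbor count, embedding width, M_< columns
R = rng.normal(size=(Nm, 4))      # smoothed environment matrix R~
G = rng.normal(size=(Nm, M))      # embedding-net output (held fixed here)
Gl = G[:, :Ml]                    # G_<: the first M_< columns of G
W = rng.normal(size=(M, Ml))      # stand-in for dE/dD

def descriptor(R):
    return G.T @ R @ R.T @ Gl     # D = G^T R~ R~^T G_<

def energy(R):
    return np.sum(W * descriptor(R))  # toy scalar "energy"

# product rule: dD/dR_ab = G^T (E_ab R^T + R E_ab^T) G_<
# chain rule:   dE/dR_ab = sum(W * dD/dR_ab)
grad = np.zeros_like(R)
for a in range(Nm):
    for b in range(4):
        Eab = np.zeros_like(R)
        Eab[a, b] = 1.0
        dD = G.T @ (Eab @ R.T + R @ Eab.T) @ Gl
        grad[a, b] = np.sum(W * dD)

# check one entry against a central finite difference
h = 1e-6
Rp, Rm = R.copy(), R.copy()
Rp[2, 1] += h
Rm[2, 1] -= h
fd = (energy(Rp) - energy(Rm)) / (2 * h)
print(abs(grad[2, 1] - fd) < 1e-5)  # True
```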
Figure 7(a) describes the end-to-end training wall clock time of single-sample minibatch Adam, single-sample minibatch RLEKF, and the 32-sample minibatch FEKF.

Figure 7. (a) End-to-end training time of the Adam, RLEKF, and FEKF optimizers. (b) The number of CUDA kernels launched under step-by-step system optimization. (c) The iteration time under step-by-step system optimization.

Figure 7(b) describes the number of launched CUDA kernels, and Figure 7(c) records the iteration time. The x-axis of Figure 7(b) and Figure 7(c) represents the different optimization configurations. The listed baseline is the original version. Opt1 substitutes Autograd with handwritten kernels. Opt2 adopts the torch.compile(model) API to automatically fuse the launched kernels. Opt3 optimizes the FEKF updating process by using a customized kernel (the Kalman-gain calculation) and computation reuse. For each configuration, the left column is the FEKF update using energy predictions and the right column is the FEKF update using force predictions. CUDA launched kernels: The system optimization greatly reduces the number of launched CUDA kernels, as shown in Figure 7(b). The launched kernels decrease from 397 to 174 and from 846 to 281 for FEKF updating by energy and by force, respectively. One training step consists of one energy-based update and four force-based updates, so the overall number of launched CUDA kernels drops from 3781 (397 + 846×4) to 1298 (174 + 281×4), roughly a 66% reduction compared to the baseline. Iteration time: In Figure 7(c), we separate the whole updating process into three parts, represented by different shades on the bars. From bottom to top, these are three independent stages: the forward pass through the network to obtain the predictions and errors; the gradient computation required by the EKF update; and the FEKF calculation flow. The total iteration time is 3.48× faster after the systematic optimizations are applied step by step. A major reason for the wall-time reduction is the decrease in the number of launched kernels.
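The computation-reuse idea in Opt3 can be illustrated with a small NumPy sketch of a simplified scalar-output EKF step. Which intermediate is actually reused in the paper's kernels is not fully specified here, so the cached product Ph below is an assumption chosen for illustration; for a symmetric covariance P, h @ P equals P @ h, so one matrix-vector product serves both the gain and the covariance update.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
P = np.eye(d)                     # error covariance (symmetric)
h = rng.normal(size=d)            # gradient of the prediction

# naive: P @ h is computed twice (once for the gain, once for the P update)
a = 1.0 / (h @ (P @ h) + 1.0)
K_naive = a * (P @ h)
P_naive = P - np.outer(K_naive, h @ P)

# computation reuse: cache Ph = P @ h and use it in both places
Ph = P @ h
a = 1.0 / (h @ Ph + 1.0)
K = a * Ph
P_reused = P - np.outer(K, Ph)    # h @ P == P @ h for symmetric P

print(np.allclose(K, K_naive), np.allclose(P_reused, P_naive))  # True True
```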

Training one DeePMD Model in Minutes: a Step towards Online Learning. PPoPP '24, March 2-6, 2024, Edinburgh, United Kingdom.

The DeePMD neural network takes a snapshot of the molecular system as input, which includes the 3D Cartesian coordinates of every atom in the system. To represent the input, a neighbor list is constructed for each atom i by including all other atoms within a cutoff distance r_c of that atom. The neighbor list of each atom is then transformed into a smooth version, denoted R̃, a matrix of size N_m × 4, where N_m is the maximum length over all neighbor lists. Each row of R̃ corresponds to a neighbor atom j of atom i and is the 4-dimensional vector s(|r_ij|) (1, r_ij/|r_ij|), where s(r) is a smooth function that decays to zero between two thresholds r_cs and r_c and equals 1/r for r < r_cs. The resulting matrix is used as input to the DeePMD neural network for training and prediction. 2. The embedding nets G^{ik} ∈ R^{N_m × M}, k ∈ {1, 2}, are obtained by applying the fully connected embedding network to s(|r_ij|); ⊗ denotes the outer product and tanh is the activation function. 3. The descriptor D_i is a matrix of size M × M_<.

We propose the Fusiform Extended Kalman Filter (FEKF) algorithm, a quasi-Newtonian method designed for fast convergence. FEKF is an approximation algorithm based on Kalman filter theory. The early reduction strategy is adopted to reduce the memory footprint and communication overhead. In the aforementioned copper training example, with the FEKF optimizer the 32-sample minibatch version reduces the wall time from 26132s to 576s without sacrificing the final model accuracy. The batch size can scale up to 4096 distributed over 16 GPUs, reducing the absolute training time to 281s with little sacrifice in accuracy.
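The smooth weighting s(r) described above can be sketched as follows. The fifth-order polynomial switch used here is one common choice for making s(r) decay smoothly from 1/r at r_cs to zero at r_c; treat the exact polynomial as an assumption rather than the paper's definition.

```python
import numpy as np

def smooth_s(r, rcs, rc):
    """Smooth 1/r weight: 1/r below rcs, polynomial decay to 0 at rc."""
    r = np.asarray(r, dtype=float)
    x = (r - rcs) / (rc - rcs)
    switch = x**3 * (-6 * x**2 + 15 * x - 10) + 1   # 1 at x=0, 0 at x=1
    return np.where(r < rcs, 1.0 / r,
           np.where(r < rc, switch / r, 0.0))

rcs, rc = 2.0, 6.0
print(smooth_s(1.0, rcs, rc))   # 1.0  (plain 1/r regime)
print(smooth_s(6.5, rcs, rc))   # 0.0  (beyond the cutoff)
```

The switch polynomial has value 1 and zero slope at x = 0 and value 0 and zero slope at x = 1, so s(r) and its first derivative are continuous at both thresholds.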
Table 2. In summary, the per-sample gradient g and Hessian H change to the mean gradient and mean Hessian over the n samples, denoted E(g) and E(H), respectively. Their definitions are shown in Eq. 1, where L denotes the loss function. The EKF-based optimizer is a quasi-Newtonian method and is depicted in the last column of Table 2. A typical work is RLEKF, which uses one instance x_i per neural-network weight update. The weight increment for sample x_i at timestep t is the product of the Kalman gain vector (denoted by K) and the prediction error.

Table 4. The convergence ratio of 32-sample minibatch FEKF with respect to single-sample minibatch Adam: the root mean square errors (RMSE) on the training set (before slashes) and the testing set (after slashes), over 8 systems.

(Line 9 of Algorithm 1). Supplementary II illustrates the calculation procedure; the intermediate results are cached for reuse, as Supplementary II shows.

Table 5. Training wall time of the Cu system under different configurations. Each column represents a configuration (the batch size and the number of GPUs in the experiment) and the corresponding results (end-to-end time and speedup) for Adam, single-sample minibatch RLEKF, the 32-sample minibatch FEKF, and the 32-sample minibatch FEKF after system optimization. Without loss of accuracy, the speedup of 32-sample minibatch FEKF over single-sample minibatch RLEKF across the eight systems ranges from 3.66× to 24.69×, with an average of 11.61×. The optimized 32-sample minibatch FEKF attains a further 2.02× to 4.36× speedup compared with the 32-sample minibatch FEKF without optimization; the average speedup from system optimization is 3.25×. The overall average speedup is 32.22×. The training wall clock time in Figure 7(a) is measured at the accuracy given in Table 4. FEKF can further reduce the wall time with a larger batch size: for a typical copper system, as illustrated in Table 5, when reaching 1.5× the baseline accuracy, the training can scale up to 16 GPUs with a batch size of 4096, resulting in a speedup of 93×.

Table 1: Baseline Statement. Table 1 describes the number of epochs consumed by single-instance Adam, 32-instance Adam, and 64-instance Adam to reach a given energy RMSE. Figure 7: Single-GPU Performance. The end-to-end training time of single-instance Adam, RLEKF, 32-instance FEKF, and the optimized 32-instance FEKF under 8 comprehensive datasets. Table 5: Distributed FEKF Performance. Table 5 represents the training time of RLEKF and larger-batch FEKF on multiple GPUs with a large number of training samples on the Cu dataset.