FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters

Protein structure prediction helps to understand gene translation and protein function, which is of growing interest and importance in structural biology. The AlphaFold model, which used the transformer architecture to achieve atomic-level accuracy in protein structure prediction, was a significant breakthrough. However, training and inference of the AlphaFold model are challenging due to its high computation and memory cost. In this work, we present FastFold, an efficient implementation of AlphaFold for both training and inference. We propose Dynamic Axial Parallelism (DAP) as a novel model parallelism method. Additionally, we have implemented a series of low-level optimizations aimed at reducing communication, computation, and memory costs. These optimizations include Duality Async Operations, highly optimized kernels, and AutoChunk (an automated search algorithm that finds the best chunk strategy to reduce memory peaks). Experimental results show that FastFold can efficiently scale to more GPUs using DAP, reduces overall training time from 11 days to 67 hours, and achieves a 7.5-9.5× speedup for long-sequence inference. Furthermore, AutoChunk can reduce memory cost by over 80% during inference by automatically partitioning the intermediate tensors during the computation.


Introduction
Predicting the three-dimensional structure of a protein from its amino acid sequence, a field known as protein structure prediction, has been a major area of research in structural biology for over 50 years [2]. Accurate protein structure prediction has numerous applications, including drug design [29] and protein design [13], and can be achieved through both experimental and computational methods. However, experimental methods can be difficult and costly, making computational approaches an attractive option due to their ability to predict protein structure at high throughput and low cost. Improving the efficiency and accuracy of computational protein structure prediction methods is therefore of great importance.
The success of deep neural networks in various fields, such as Computer Vision (CV) and Natural Language Processing (NLP), has led to the widespread use of Artificial Intelligence in many domains. In protein structure prediction, Convolutional Neural Networks (CNNs) were introduced by AlphaFold [28] and RaptorX-Contact [34] and achieved significant performance improvements. This demonstrates that CNNs can be an effective solution for protein structure prediction using deep learning techniques.
The Transformer model, which uses Multi-Head Attention to focus on different positions and capture long-range dependencies in long sequences [30], has made significant improvements in the fields of NLP and CV, and has become the dominant model architecture, as in BERT [8], GPT [5], and ViT [9]. AlphaFold 2 [17] applied the Transformer to protein structure prediction and achieved atomic resolution. For the remainder of this paper, we will refer to the transformer-based AlphaFold 2 model as simply AlphaFold.
Although the Transformer delivers impressive prediction accuracy, it poses significant computational challenges for training and inference. Firstly, AlphaFold's computational complexity is much higher than that of the vanilla Transformer due to the extra dimension of its intermediate representation. Secondly, AlphaFold is less computationally efficient on the GPU platform due to its unique architecture (see Section 3). Thirdly, the limited global batch size prevents training from scaling to more nodes using data parallelism, as larger batch sizes may result in a decrease in accuracy. Training AlphaFold on 128 Google TPUv3 nodes takes approximately 11 days [16]. Finally, the high memory consumption of AlphaFold exceeds the capacity of current GPUs.
To address these challenges, we introduce FastFold, an efficient implementation of AlphaFold for training and inference. FastFold includes several innovations, such as Dynamic Axial Parallelism, a model parallelism strategy that outperforms existing techniques. We also apply low-level optimizations, including Duality Async Operations, kernel optimization, and AutoChunk, which reduce the communication, computation, and memory costs in training and inference. Duality Async Operations are implemented to control asynchronous communication in the forward and backward passes in PyTorch [23]. AutoChunk automatically determines the optimal chunking strategy, which can significantly reduce activation memory by partitioning intermediate tensors during the computation. To the best of our knowledge, FastFold is the first attempt to optimize the performance of training and inference for protein structure prediction models. FastFold introduces large-model training techniques and significantly reduces the time and economic cost of training and inference for the AlphaFold model.
In summary, we make the following contributions: • We introduced and analyzed the AlphaFold model from a system perspective, focusing on computational performance (Section 3.1) and memory consumption (Section 3.2).

The transformer layer consists of two parts: a multi-head attention (MHA) block and a feed-forward block. The MHA block is the primary component of the transformer, responsible for modeling the sequence. It is composed of three parts: QKV linear layers divided into heads, scaled dot-product attention, and an output linear layer. The input sequence is passed through the three linear layers to obtain Query, Key, and Value, which are then split into multiple heads. In the scaled dot-product attention, the dot products of the queries and keys are first calculated, followed by the softmax function, and finally a matrix multiplication with the values. The attention outputs from each head are concatenated and passed through the output linear layer. The feed-forward block consists of two linear layers and helps to increase the model's capacity.
In the transformer model, two important parameters are the hidden dimension and the number of heads. The hidden dimension refers to the number of features in the input sequences, and the number of heads refers to the number of heads in the MHA.
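To make the block structure concrete, the following is a minimal PyTorch sketch of MHA (an illustration only, not the AlphaFold implementation; masking and dropout are omitted for brevity):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal MHA: QKV linear layers, per-head scaled dot-product, output linear."""

    def __init__(self, hidden_dim: int, n_heads: int):
        super().__init__()
        assert hidden_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = hidden_dim // n_heads
        self.qkv = nn.Linear(hidden_dim, 3 * hidden_dim)  # Q, K, V projections
        self.out = nn.Linear(hidden_dim, hidden_dim)      # output linear layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, seq, head_dim)
        q, k, v = (t.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # scaled dot-product attention over the sequence dimension
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, s, d)   # concatenate heads
        return self.out(y)
```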

Overview of AlphaFold
AlphaFold is a transformer-based model that takes amino acid sequences as input and directly outputs the structure of proteins. It obtains Multiple Sequence Alignment (MSA) and template information for the target sequence through genetic database search and structure database search. MSA information consists of amino acid sequences that are similar to the target sequence and allows for the identification of amino acids that have mutated during evolution. These co-evolving residues are likely to be located near each other in the three-dimensional structure of the protein. Template information provides structural information for known sequences, which helps in the prediction of protein structure.
The architecture of AlphaFold is shown in Figure 1 and consists of three parts: embedding, evoformer, and structure module. The embedding part encodes the MSA and template information of the target sequence into MSA and pair representations, which are then processed by evoformer blocks. The MSA and pair representations, which contain highly processed modeling information, are then fed into the structure module, which ultimately outputs the three-dimensional structure. To reduce training time and memory consumption, AlphaFold uses bfloat16 precision [18]. To improve prediction accuracy, AlphaFold uses a recycling technique that repeatedly performs forward passes on the model by re-embedding its output back into the representation. This allows the model to process multiple versions of the embedding features. During training, the number of recyclings is chosen from a range of 1 to 4, while it is fixed at 4 during inference.
The training process of AlphaFold consists of initial training and fine-tuning, as shown in Table 1. It is conducted on 128 TPUv3 cores with a mini-batch size of 128. To ensure the accuracy of the final model, the batch size does not exceed 128 during training. The limited batch size prevents AlphaFold from scaling to more devices, leading to an overall training time of 11 days.

Evoformer
The main network trunk of AlphaFold consists of 48 evoformer blocks, each of which has three parts: MSA stack, communication, and pair stack, as shown on the right side of Figure 1. The evoformer takes two inputs, the MSA and pair representations, which have two sequence dimensions, unlike the inputs of the vanilla transformer. Attention is calculated along different dimensions, which can be divided into row-wise and column-wise. MSA representations are processed with row-wise attention, column-wise attention, and feed-forward blocks, while pair representations are processed with similar blocks plus an additional triangular update module (shown in Figure 2). The triangular update module uses triangular relationships in the pair information to infer and update representations. Attention bias and the outer product mean are used to enable communication between the two representations. Algorithm details of the evoformer can be found in Appendix A.
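The row-wise/column-wise pattern can be illustrated with a short PyTorch sketch (an illustration only; `attn` stands for any attention module that attends over the second-to-last dimension of its input, such as the MHA sketch above):

```python
import torch

def axial_attention(msa: torch.Tensor, attn, row_wise: bool) -> torch.Tensor:
    """Apply attention along one of the two sequence axes of an MSA
    representation of shape (n_seq, n_res, c_m). Row-wise attention attends
    over residues within each sequence; column-wise attention attends over
    sequences at each residue position."""
    if not row_wise:
        msa = msa.transpose(0, 1)  # (n_res, n_seq, c_m): attend over sequences
    out = attn(msa)                # attention over dim 1, batched over dim 0
    if not row_wise:
        out = out.transpose(0, 1)  # restore (n_seq, n_res, c_m)
    return out
```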

Parallelism for Training
In modern deep learning training, parallel methods are introduced for two main purposes: 1) to significantly reduce the time cost of training; 2) to train large models with limited resources. The most mainstream parallel methods include data parallelism and model parallelism.
Data parallelism is the most basic and widely used parallel method. Each device has a complete set of model parameters and processes a different mini-batch of training data. During the training phase, each device calculates the local gradient using its own mini-batch, then uses all-reduce communication to average the gradients globally. The model parameters are then updated based on the averaged gradients. DeepSpeed [24] introduces the ZeRO Optimizer, which involves partitioning model parameters, gradients, and optimizer states along the data-parallel dimension to diminish redundant storage on each data-parallel worker, thus reducing memory consumption. DeepSpeed ZeRO is mainly used to reduce parameter-related memory consumption in the training of large language models.
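The gradient-averaging step described above can be written in a few lines of PyTorch (a minimal sketch of what DistributedDataParallel performs internally, assuming the default process group has been initialized):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average local gradients across all data-parallel workers."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # sum gradients from all workers, then divide to obtain the mean
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```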
Model parallelism distributes the model parameters across multiple devices and can be divided into pipeline parallelism [10,15,21] and tensor parallelism [22] based on the distribution method. In pipeline parallelism, the model is split vertically (layer-wise) among the devices. However, this method introduces device idleness due to dependencies between the computations on different devices. To improve resource utilization, the mini-batch is often divided into micro-batches, which allows for more overlap between computations on different devices.
Tensor parallelism is typically imposed on the linear layer because it is relatively easy to distribute matrix multiplication across different devices. Megatron-LM [22] proposed column parallelism and row parallelism. In column parallelism, the weight matrix $W$ is divided column-wise across $N$ devices, resulting in $N$ matrices $W_1, W_2, \ldots, W_N$. The matrix multiplications $XW_1, XW_2, \ldots, XW_N$ are conducted in parallel, resulting in $N$ output vectors $Y_1, Y_2, \ldots, Y_N$. In row parallelism, the weight $W$ and input $X$ are divided across $N$ devices, and the matrix multiplications $X_1W_1, X_2W_2, \ldots, X_NW_N$ are conducted in parallel, resulting in $N$ partial output vectors. The final output vector $Y$ is obtained through an all-reduce operation.
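The two schemes can be sketched as follows (an illustration assuming each device already holds its shard `w_shard` and torch.distributed is initialized; this is not Megatron-LM's actual code):

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """Column parallelism: device i holds column shard W_i and computes
    Y_i = X @ W_i locally; the Y_i are concatenated logically downstream."""
    return x @ w_shard  # no communication needed for the forward matmul

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """Row parallelism: device i holds input shard X_i and row shard W_i;
    the partial products are summed with an all-reduce to form Y."""
    y_partial = x_shard @ w_shard
    dist.all_reduce(y_partial, op=dist.ReduceOp.SUM)  # Y = sum_i X_i @ W_i
    return y_partial
```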

In-depth Analysis of Evoformer
For the convenience of later expressions, we denote the number of residues in the input as $N_{res}$, the number of sequences processed in the MSA stack as $N_{seq}$, the hidden dimension for the MSA representation as $c_m = 256$, and the hidden dimension for the pair representation as $c_z = 128$. Specific values of $N_{res}$ and $N_{seq}$ can be found in Table 1.

Performance Analysis
To further analyze the performance characteristics of the evoformer, we can classify its operators into three categories based on the characteristics of their computation and memory access: 1) General Matrix Multiply (GEMM): this category includes matrix multiplication, batch matrix-matrix product, and other dense matrix calculations. 2) Batch Reduction: this category includes LayerNorm, Softmax, and other operations with lower computational intensity. 3) Element-wise Operators: this category includes element-wise addition, dropout, and activations, and is the least compute-intensive category.
GEMM operators are typically computed using highly optimized Basic Linear Algebra Subprograms (BLAS) libraries provided by the vendor, such as cuBLAS on GPU platforms. However, deep learning frameworks like PyTorch may not be as efficient at implementing non-GEMM operators. For example, during AlphaFold model training on an NVIDIA Tesla A100, only 14.7% of the time was spent on GEMM operators, while 55.7% was spent on batch reduction, 19.8% on element-wise operations, and 9.8% on other operations such as data movement. The time spent on batch reduction, in particular, was high because the implementation of LayerNorm and Softmax in PyTorch is inefficient. This performance issue also occurs with other Transformer models, but is more severe in AlphaFold due to its smaller hidden dimension (as shown in Table 2). The above data suggests that further optimization of batch reduction is needed for better training and inference performance.

Memory Consumption
During AlphaFold training, we observed high memory consumption. It is worth noting that the overall model size of AlphaFold is only 93M parameters, according to Table 2. Despite its small model size, AlphaFold requires a large amount of memory due to the large intermediate activations it generates. For example, the activations in the attention module require $N_{res}^3 \times h \times \mathrm{sizeof}(\text{bfloat16})$ bytes of memory, which can exceed 20 GB for 48 layers when $N_{res} = 384$ and $h = 4$. To mitigate this issue, AlphaFold uses gradient checkpointing [6] to reduce memory consumption. However, memory is still a bottleneck for AlphaFold, as each device can only process one data sample during training due to the limited memory capacity.
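As a sanity check on this figure (assuming bfloat16 activations, i.e. 2 bytes per element):
$$
\underbrace{384^3}_{N_{res}^3} \times \underbrace{4}_{h} \times 2\,\mathrm{B} \approx 0.42\,\mathrm{GB}\ \text{per layer}, \qquad 0.42\,\mathrm{GB} \times 48\ \text{layers} \approx 20.3\,\mathrm{GB}.
$$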
The memory consumption of existing large language models mainly comes from model parameters [27]. Tensor Parallelism or DeepSpeed ZeRO can effectively reduce the memory consumption of model parameters, thereby enabling model parallelism and scaling. For AlphaFold, on the other hand, the main memory consumption comes from activations, so we need to design a model parallelism approach with activations at its core.
Introducing model parallelism allows the training to be distributed across more computational resources, reducing the overall training time. For inference, model parallelism can significantly reduce the latency of long-sequence predictions, making it more practical for use in real-world applications.

Existing Parallel Approaches
Referring to the background section, three methodologies exist for the parallel processing of AlphaFold: DeepSpeed ZeRO, Pipeline Parallelism, and Tensor Parallelism. Firstly, the parameter count of AlphaFold is relatively small (shown in Table 2), while the memory pressure from activations is high (refer to Section 3.2). Although DeepSpeed ZeRO effectively reduces parameter-related memory consumption [24], it cannot reduce the activation memory, which matters more for AlphaFold training. For example, assuming that the activation tensors consume 30 GB and parameter-related tensors consume 2 GB, when scaling up to 8 GPUs with DeepSpeed ZeRO the memory consumption is still $30 + 2/8 = 30.25$ GB, which does not provide effective model parallelism. Therefore, a parallelism scheme that partitions the activations is needed.
Pipeline parallelism requires further partitioning of the mini-batch into multiple micro-batches in order to improve the efficiency of hardware resource utilization [15]. However, in the training of AlphaFold the batch size is one [17] and cannot be further divided into micro-batches, so pipeline parallelism is not a suitable parallelism strategy.
Unlike pipeline parallelism, Tensor Parallelism (TP) can be easily adapted to AlphaFold model training. The main structure of the evoformer contains attention blocks and feed-forward blocks, similar to the structure of the vanilla transformer. Therefore, we can use TP in a similar way to Megatron-LM, as described in Section 2.4.
However, TP is not efficient for AlphaFold for the following reasons: 1) frequent synchronized communication in each evoformer layer leads to high overhead; 2) modules other than attention and feed-forward cannot be parallelized; 3) the scaling of TP is limited by the number of attention heads (the pair stack in AlphaFold uses 4 heads, so TP can scale to at most 4 devices).

Dynamic Axial Parallelism
As analyzed in Section 3.2, because of the memory consumption characteristics of AlphaFold, we should focus on the activations when designing the parallel algorithm. Therefore, unlike TP, we propose Dynamic Axial Parallelism (DAP), which keeps the complete model parameters on each device and divides the input and activations among different devices. Both the MSA representation and the pair representation processed by the evoformer module contain two sequence dimensions, but the calculations in the evoformer run along only one dimension at a time. Therefore, we can divide the data along the other dimension and insert all-to-all communication when the two sequence dimensions are transposed, keeping the data dimensions of the computation axial and complete on each device, as shown in Figure 3(b). No other communication is needed in the attention computation. In the outer product mean module, we need to gather the global left projection using all-gather and then perform the outer product mean with the local right projection. The triangular update module uses a similar approach for parallelism.

In Table 3, we compare the communication overhead of Tensor Parallelism (TP) and Dynamic Axial Parallelism (DAP). TP only supports parallelism in the attention and feed-forward modules, while DAP supports all the computational modules of the evoformer. TP introduces 12 all-reduce communications in the attention and feed-forward modules, 6 in the forward pass and 6 in the backward pass. Assuming the use of a ring all-reduce, the amount of communication per step is $24 \times S(N-1)/N$, where $S$ is the size of the intermediate representation and $N$ is the number of devices for model parallelism. In the forward pass, DAP introduces one all-gather communication in the outer product mean module and two in the triangular update modules (incoming and outgoing). The communication volume per all-gather is $(N-1) \times S/N$. The backward pass does not require additional communication. DAP needs to insert all-to-all communication between calculations along different dimensions, a total of 12 times (6 in the forward and 6 in the backward pass) per evoformer block. Each all-to-all requires $(N-1) \times S/N^2$ of communication volume. Overall, DAP has an order of magnitude lower communication volume than TP.

Therefore, DAP has several advantages over TP: 1) DAP supports all computational modules in the evoformer; 2) the communication volume of DAP is much smaller than that of TP; 3) model parallelism distributes activations across devices, and DAP consumes less memory than TP because more of the model is parallelized; 4) DAP has more opportunities for computation-communication overlap.
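The core DAP primitive, re-partitioning the local shard when the computation switches axes, can be sketched in PyTorch as follows (a simplified illustration assuming both sequence dimensions are divisible by the number of devices; this is not the FastFold implementation):

```python
import torch
import torch.distributed as dist

def dap_switch_axis(x: torch.Tensor) -> torch.Tensor:
    """x is the local shard of a representation with two sequence dimensions,
    partitioned along dim 0 across N devices. Before the computation switches
    from one axis to the other, an all-to-all re-partitions the tensor along
    dim 1, so each device again holds one complete axis to compute over."""
    n = dist.get_world_size()
    send = list(torch.chunk(x, n, dim=1))                  # split the other axis
    recv = [torch.empty_like(send[0]) for _ in range(n)]
    dist.all_to_all(recv, send)                            # exchange shards
    return torch.cat(recv, dim=0)                          # dim 0 is now complete
```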

Low-level Optimization
Dynamic Axial Parallelism (DAP) enables the training and inference of AlphaFold to be scaled across more computational resources. To further reduce the time and economic cost, we implement several low-level optimization techniques, covering both communication and computation. We also propose AutoChunk, a method that automatically determines the optimal chunking strategy for efficient long-sequence inference with minimal computational overhead.

Communication Optimization
DAP requires all-to-all and all-gather communication between all devices in the axial-parallel group. Because this communication is synchronized within the layer, it can become a bottleneck. To address this issue, we design and implement optimization strategies to reduce the communication overhead of DAP.
In PyTorch, computation and communication are assigned to different CUDA streams. However, PyTorch blocks the computation stream to wait for the completion of communication. In the vanilla transformer model, the computation is a straight chain and there is no opportunity to overlap the communication with computation. In AlphaFold, however, we have the opportunity to overlap the two because there are two representation features to process. While it is difficult to use asynchronous communication interfaces and implement the corresponding communication in the backward pass in dynamic-graph deep learning frameworks like PyTorch, we designed the Duality Async Operations (DAO) for PyTorch to enable the overlap of communication and computation.
As shown in Figure 4, the DAO consists of a pair of communication operations. During the forward pass, the first operation triggers asynchronous communication, followed by computation on the computation stream that does not depend on the communication. The second operation then blocks on the asynchronous communication until it is completed, after which the subsequent computation is performed. In the backward pass, the second operation triggers the asynchronous communication and the first operation blocks on it. We have observed that using asynchronous communication significantly reduces the communication overhead through the overlap of computation and communication.
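The forward-pass pattern can be sketched as follows (a simplified illustration of the scheduling idea using PyTorch's asynchronous collectives, not the actual DAO implementation, which wires the mirrored behavior into autograd for the backward pass):

```python
import torch
import torch.distributed as dist

def dao_forward(x: torch.Tensor, independent_compute):
    """Op 1 launches the communication asynchronously; computation that does
    not depend on it proceeds on the compute stream; op 2 blocks until the
    communication completes before any dependent computation runs."""
    work = dist.all_reduce(x, async_op=True)  # op 1: trigger async communication
    y = independent_compute()                 # overlapped, independent compute
    work.wait()                               # op 2: wait for communication
    return x, y
```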

Computation Optimization
As mentioned in Section 3.1, the GEMM operators in AlphaFold account for only a small portion of the total runtime, while 55.7% is spent on batch reduction and 19.8% on element-wise operations. To achieve high performance, we therefore implemented several optimization techniques, including highly optimized kernels for batch reduction operations (softmax and layernorm) and kernel fusion.
The softmax function is a normalized exponential function that converts its input elements into values between 0 and 1, with the sum of all elements being 1. In the AlphaFold model, the input to the softmax function has many rows, but each row has a relatively small number of elements. If not implemented and parallelized properly, the native kernel will have poor performance in this case. The input to the softmax function also goes through two additions: one due to the mask and one due to the bias_add operation in the evoformer's attention mechanism. These broadcast additions introduce a significant memory bottleneck.
For small column sizes, we use one warp to calculate one row of data and use the communication primitives between registers to achieve high performance. To calculate the global max of a row, we first find the local max in each thread and then use WarpAllReduce to get the global max. Subtraction and exponential operations are performed, and the local sum is calculated in each thread. The global sum is then obtained using WarpAllReduce, followed by a final division. In addition, we have fused the mask and bias_add into the softmax kernel, thereby avoiding broadcasts and significantly improving performance.
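The fused kernel's semantics can be expressed as a PyTorch reference (this sketch shows what the CUDA kernel computes, not how the warp-level kernel is written):

```python
import torch

def fused_mask_bias_softmax(scores: torch.Tensor,
                            mask: torch.Tensor,
                            bias: torch.Tensor) -> torch.Tensor:
    """Mask and bias additions folded into a numerically stable softmax over
    the last (small) dimension, avoiding two broadcast intermediates."""
    x = scores + mask + bias                        # fused in registers on GPU
    x = x - x.max(dim=-1, keepdim=True).values      # subtract the row max
    e = x.exp()
    return e / e.sum(dim=-1, keepdim=True)          # normalize each row
```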
We also used the same approach to implement a high-performance layernorm kernel. To further improve computational efficiency, we applied other kernel fusion methods such as merged GEMM and JIT fusion, which reduce memory access overhead and lower kernel launch overhead.

AutoChunk
High memory consumption is a major bottleneck in the AlphaFold model, especially for long-sequence inference. To address this issue, AlphaFold uses the chunk technique, which partitions intermediate tensors along dimensions that are independent of the computation to reduce activation memory. This technique has been demonstrated to significantly reduce peak memory consumption in modules such as attention. However, this approach has several drawbacks: 1) it requires significant programming effort from expert technicians; 2) manually analyzing and specifying the range and size of chunks can be labor-intensive; 3) human-designed chunk schemes can be inefficient, resulting in increased inference latency. Our analysis also shows that 95% of the operations in the evoformer have a memory footprint below 20% of the peak, suggesting that module-level chunking may not be necessary. Instead, targeting and optimizing these outlying operations may provide an efficient means of chunking.
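In its manual form, the chunk technique amounts to the following pattern (a generic sketch; `fn` must be independent across slices of the chosen dimension):

```python
import torch

def chunked_apply(fn, x: torch.Tensor, chunk_size: int, dim: int = 0) -> torch.Tensor:
    """Apply fn to slices of x along an independent dimension so that only
    one slice's intermediate activations are alive at a time, then
    concatenate the partial outputs."""
    parts = [fn(part) for part in torch.split(x, chunk_size, dim=dim)]
    return torch.cat(parts, dim=dim)
```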
Based on this observation, we propose AutoChunk, which is able to generate chunk strategies adaptively and efficiently for inference. AutoChunk can identify the optimal chunk range and chunk size, reducing memory usage at minimal cost. The overview of AutoChunk is shown in Algorithm 1. Given the computational graph $G$ and memory budget $B$ as inputs, AutoChunk iteratively finds all chunk strategies $S$. In each iteration, it estimates the memory consumption based on the existing chunk strategies $S$ and graph $G$, and finds the node with the highest memory usage, where a node refers to a basic operation such as add or linear. Then, it determines the maximum chunk range according to the peak memory node and the current memory status, and identifies all possible chunk strategies within that range. Finally, the best strategy is selected and added to the chunk strategies $S$. Once all chunk strategies $S$ have been found, they are inserted into the code through code generation.
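The greedy loop of Algorithm 1 can be summarized in Python-style pseudocode (the helper names `estimate_memory`, `max_chunk_range`, `find_candidates`, `best_by_speed`, and `codegen` are illustrative placeholders, not FastFold APIs):

```python
def autochunk_search(graph, budget):
    """Iteratively add chunk strategies until estimated memory fits the budget."""
    strategies = []
    while True:
        mem, peak_node = estimate_memory(graph, strategies)
        if mem <= budget:
            break                                       # budget satisfied
        region = max_chunk_range(graph, peak_node, strategies)
        candidates = find_candidates(graph, region)     # strategies in range
        strategies.append(best_by_speed(candidates))    # least slowdown wins
    return codegen(graph, strategies)                   # emit chunked code
```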
Specifically, in order to find the max chunk range, we identify the peak memory node and extend the chunk range from this node. The nature of chunking is to partition the intermediate memory of the current node and the activation memory of active nodes into smaller parts, where active nodes refer to all nodes currently generated but not yet deleted in $G$. So the number and size of active nodes are taken into account to determine the max range. Then we need to define the chunk range and identify all possible chunk strategies within this range. Consider a function denoted as
$$Y = f(X_{chunk}, X_{nochunk}).$$
Then the chunked function can be denoted as
$$Y = \operatorname{concat}_{i}\big(f(X^{i}_{d(X_{chunk})}, X_{nochunk})\big),$$
where $X_{chunk}$ are chunked inputs, $X_{nochunk}$ are inputs without chunk, and $X^{i}_{d(X)}$ refers to the $i$-th part of $X$ partitioned by the chunk dimension $d(X)$.
As seen in the equation above, chunking involves identifying a suitable function $f$, partitioning its inputs and outputs along certain dimensions, and then combining the results to obtain the output. From this definition, we can define the criteria for a reasonable chunk range as follows: 1) All outputs $Y = [Y_1, \ldots, Y_n]$ have a chunkable dimension $d(Y_i)$. 2) For all nodes $X = [X_1, \ldots, X_m]$ within the chunk range, if any dimension $d_j$ of a node $X_i$ belongs to $F(d(Y))$, where $F(d(Y))$ refers to the dimensions included in the flow path of the outputs' chunk dimension $d(Y)$, then dimension $d_j$ needs to be chunked, i.e. $d(X_i) = d_j$. 3) The chunked dimension $d(X_i)$ of each node must be a free dimension, not involved in any computation. A simple illustration is shown in Figure 5. The search begins from the chunk dimension of the output and traces upwards, as the arrows indicate. Dimensions that the trace passes through are denoted as chunk dimensions.
As defined above, we would need to trace each output upwards for every chunk range, which is computationally expensive. Therefore, we propose an efficient two-stage search: in the first stage, we only check whether the start and end nodes of the chunk range satisfy rules 2 and 3 above. In the second stage, we search further within the satisfying ranges to check whether all nodes within the chunk range meet the criteria for a possible chunk. Subsequently, we select the optimal chunk strategy and determine the chunk size, aiming for the chunk with the least impact on speed within the memory budget. Some nodes are rearranged to improve efficiency; e.g., the transpose node in Figure 5 can be moved out of the chunk. Once we have obtained all the chunk ranges, we utilize them for code generation.

Evaluation
In this section, we evaluate the end-to-end improvements provided by FastFold in both training and inference, and then analyze the enhancements contributed by FastFold's low-level optimizations. All experiments were conducted on the NVIDIA Tesla A100 platform, using the official implementation of AlphaFold and OpenFold [1] as baselines. The official implementation includes only the inference part, while OpenFold reproduces both training and inference [17].

End-to-End Training Performance
In the evaluation of end-to-end training performance, we use the training parameters from the official AlphaFold paper to better compare different methods and implementations in realistic training scenarios. All training experiments were conducted on a 128-node GPU supercomputer, where each node consists of 4 NVIDIA Tesla A100s with NVLink GPU interconnects. Model parallelism relies heavily on fast interconnects between devices for communication, so model parallelism is generally used within nodes and data parallelism between nodes during training. We evaluate the training performance at both the model parallelism and data parallelism levels and present the results in Figure 6.

1) Model parallelism. We compared the scalability of TP and DAP for model parallelism under two training settings: Initial Training and Fine-tuning. As shown in Figures 6(a) and 6(b), DAP has significantly better scalability than TP for both settings. The scalability for Initial Training is worse because the sequence length is shorter, which makes the communication overhead more pronounced. It is worth noting that when Initial Training is scaled to 4 GPUs, we can turn off activation checkpointing because GPU memory is sufficient, leading to a 16.14% performance improvement. The Duality Async Operations also significantly reduce the communication overhead, resulting in an overall performance improvement of 3% to 8%.
2) Data parallelism. We used data parallelism to scale with fixed model parallelism settings. Following the settings of the AlphaFold paper, we scaled the global batch size to 128 for data parallelism. In fine-tuning training, we used DAP to scale the computation of a sample to a full node (4 GPUs), so data parallelism was scaled from 1 to 128 nodes. In initial training, to improve scaling efficiency, we only scaled DAP to half a node (2 GPUs), so data parallelism was scaled to 64 nodes only. The scaling results are shown in Figure 6(c). It can be seen that data parallelism scales almost linearly, and the scaling efficiency of Fine-tuning training reaches 90.1%.
Table 4 compares the time and economic costs of three implementations of AlphaFold: OpenFold, FastFold, and the original AlphaFold. Using the same compute resources, i.e. 128 A100 GPUs, FastFold can save 40% in time and economic costs. To minimize time cost, we can use DAP to scale to 256 A100 GPUs for Initial Training and then scale to 512 A100 GPUs during the Fine-tuning phase. With this configuration, FastFold reduces the training time to 2.81 days. This represents a 3.91-fold reduction in training time compared to the original AlphaFold and a 2.98-fold reduction compared to OpenFold, as well as a 20% reduction in economic cost. During the Fine-tuning phase, FastFold achieves an aggregate throughput of 6.02 PetaFLOP/s with 512 A100 GPUs.

End-to-End Inference Performance
We evaluate the inference performance of the FastFold, OpenFold, and AlphaFold implementations in three scenarios: short sequences, long sequences, and extremely long sequences. The experiments are conducted on a GPU server with 8 NVIDIA A100s (with NVLink). In practice, it is common to use multiple models and aggregate their results to improve accuracy. However, since the performance characteristics of the models are consistent, our inference experiments evaluate the performance of a single model only. Additionally, some optional modules such as ExtraMSA and templates are disabled.
For short sequences, which typically have amino acid sequences no longer than 1K, inference of a single model takes a few seconds to about one minute. At this sequence range, the memory consumption is relatively small and the benefit of distributed inference is lower. Therefore, we compared the inference latency of the three implementations on a single GPU and present the results in Figure 7. In the scenario of short-sequence inference, FastFold's inference performance is improved by 2.01-4.05× and 1.25-2.11× compared to AlphaFold and OpenFold, respectively. It is worth noting that AlphaFold's performance is lower on the GPU platform. This is because AlphaFold uses the JAX framework, which has better support for Google TPUs and may not have optimal computational performance on GPUs. In addition to the inference time, AlphaFold also requires 50-150 seconds to compile kernels during inference.

For long-sequence inference, with amino acid sequences ranging from 1K to 2.5K in length, direct inference already encounters memory capacity problems and takes several minutes or even tens of minutes. To address this issue, AlphaFold and OpenFold use the chunking technique for inference. In contrast, FastFold can use distributed inference to reduce the memory capacity requirement and significantly shorten the inference time. As shown in Figure 8, FastFold reduces the inference time by 7.5-9.5× compared to OpenFold and by 9.3-11.6× compared to AlphaFold when using distributed inference. The figure also shows that DAP can scale to more GPUs (TP can only scale to 4 GPUs due to the limitations mentioned in Section 4.1), and has significantly better overall scaling efficiency than TP.
For inference with extremely long sequences, over 3K in length, even with the chunking technique the memory capacity of a single GPU is exceeded. As shown in Table 5, both AlphaFold and OpenFold encounter Out of Memory (OOM) errors when the sequence length reaches 3K. However, FastFold can utilize distributed inference with AutoChunk for extremely long sequences. In fact, for sequences up to 4K in length, FastFold's inference latency stays within 10 minutes.

Evoformer Performance
The performance of the evoformer layer, both forward and backward, can be found in Table 6. We conducted benchmarks to measure the time consumed by each layer of the evoformer for different problem sizes. Notably, we used the problem sizes (128, 256) and (512, 384) from the AlphaFold model, and two larger sizes to verify the generalizability of our optimization methods. The results show that the kernel optimizations yield a significant performance improvement, with the speedup being particularly pronounced in the backward pass. Furthermore, the use of fused kernels also significantly accelerates the forward pass. For comprehensive details on kernel performance benchmarks, please refer to Appendix A.2.

AutoChunk
We evaluate the performance of AutoChunk in terms of memory usage and inference latency. As shown in Figure 9, we compare the peak memory usage at various sequence lengths for OpenFold without chunk, OpenFold with chunk=1, and AutoChunk. When the sequence length surpasses 1024, OpenFold without chunk experiences an Out of Memory (OOM) error, while the two chunked variants do not. AutoChunk reduces memory usage by 86.0%-92.6% compared to OpenFold without chunk and by 30.6%-34.4% compared to the expert-designed chunking in OpenFold. This demonstrates the effectiveness of AutoChunk in significantly reducing memory usage while also surpassing expert-designed approaches by a large margin. In Figure 10, we compare the speedup of OpenFold without chunk and AutoChunk versus OpenFold with chunk when inferring sequences of different lengths. We fix the chunk size of OpenFold at chunk=64, as this setting reduces memory usage by approximately 80%, approaching the maximum memory improvement while having a relatively small impact on speed. We also set the memory budget of AutoChunk to the memory cost of OpenFold with chunk=64 so that the memory usage of both is the same. On average, AutoChunk improves inference speed by 12% compared to OpenFold with chunk and only loses 4% compared to OpenFold without chunk. This indicates that AutoChunk efficiently selects the chunk range and size with minimal overhead.

Validation
From the point of view of theoretical analysis, neither the kernel optimizations nor the parallel strategy changes the computation results. However, the custom CUDA kernels use different calculation methods and orderings, which can lead to small numerical differences. We validated the numerical correctness of FastFold by comparing inference results with the official implementation of AlphaFold. We used AlphaFold and FastFold to predict the same amino acid sequence and compared the predictions against the experimental results from both visualization and quantitative measurement perspectives. The visualization results are shown in Figure 11.
Quantitatively, the template modeling score (TM-score) [38] is an important metric for comparing structural similarity. The TM-score is a number between 0 and 1, with 1 indicating a perfect match; structures with a TM-score higher than 0.5 are assumed to have roughly the same fold [35]. As shown in Figure 11, the TM-scores of FastFold and AlphaFold are mostly consistent, which illustrates that FastFold can predict protein structures with the same quality as AlphaFold.

Related Work

Optimization for protein prediction models. The optimization of protein structure prediction models for training and inference has received relatively little attention. ParaFold [39] is an optimized system for AlphaFold inference on heterogeneous platforms, focusing on optimizing the data processing workflow. FastFold, on the other hand, focuses on the training and inference of the AlphaFold model on the GPU platform. The model optimization of FastFold and the data processing workflow optimization of ParaFold are complementary and could be combined in future work.

Efficient Transformer. There have been many efforts to optimize the performance of Transformer models, which can be broadly classified into two categories: efficient design of the transformer and efficient implementation of the transformer. Many works aim to reduce the complexity of attention computation through techniques such as sliding windows or low-rank approximation, such as Performer [7], Reformer [19], and Linformer [31]. Others, like LightSeq [32,33] and TurboTransformer [11], focus on optimizing the inference performance of the transformer on the GPU platform. However, the evoformer has several differences from the vanilla transformer, and the optimizations in FastFold are mostly based on the characteristics of the evoformer. These optimizations can be combined with parallel techniques to significantly reduce the time costs of training and inference.
Large-scale training. Several approaches have been proposed to address the challenges of large-scale training. Large-batch training methods such as LAMB [37] and LARS [36] have been used to speed up training and address optimization issues that arise during scaling with data parallelism. For large model training, several approaches have been proposed to achieve high performance. Megatron-LM [22] uses a hybrid parallel strategy to scale the model to more GPUs. DeepSpeed ZeRO [24,25] provides a memory-efficient optimizer. In FastFold, we proposed Dynamic Axial Parallelism, which has higher scaling efficiency than current mainstream model parallelism methods.

Conclusion
AlphaFold has made significant contributions to the advancement of structural biology. Efficient methods for training and inference of these models pose a considerable challenge. FastFold addresses this challenge by leveraging Dynamic Axial Parallelism, allowing both training and inference to be executed across more GPUs and significantly reducing time consumption. Moreover, FastFold incorporates various low-level optimization techniques, such as DAO, AutoChunk, and kernel optimization, further enhancing efficiency and minimizing the cost of training and inference. These techniques not only enable the design and deployment of more efficient protein structure prediction models but also facilitate the creation of larger models for improved precision.
The versatility of the optimization techniques presented in this work makes them applicable to a wide range of models. As most protein structure prediction models adopt similar structures (evoformer), these optimization techniques can be readily applied to models like RoseTTAFold [4], ESMFold [20], MSA Transformer [26], and others. Similarly, video transformers use axial attention along the spatial and temporal dimensions, so our proposed DAP and communication optimization techniques can also be tailored to these models [3,14].
For LayerNorm, we compare not only the PyTorch native kernel but also the highly optimized LayerNorm kernel from NVIDIA Apex. According to Figure 13, the performance of FastFold is improved by 5.53-8.65× and 1.20-1.62× compared to PyTorch and Apex, respectively.

Figure 1. The architecture of the AlphaFold model. The amino acid sequence is encoded into MSA and pair representations after the embedding layer, then fed into the Evoformer and Structure Module. In the Evoformer, the MSA and pair representations are processed by the MSA stack and pair stack, respectively. The number of residues in the input is denoted as $N_{res}$, while the number of sequences processed in the MSA stack is denoted as $N_{seq}$. The hidden dimensions for the MSA and pair representations are set to $c_m = 256$ and $c_z = 128$, respectively.

Figure 3. Communication for Tensor Parallelism and Dynamic Axial Parallelism. The figure shows parallelism on 2 GPUs, where the yellow and blue matrices represent the MSA and pair representations.

Figure 4. The Duality Async Operations (DAO) enable the overlap of computation and communication in both the forward and backward passes through asynchronous execution.

Figure 5. Illustration of the AutoChunk strategy search. The boxes represent the dimensions of tensors; for example, input 1 has three dimensions. Blue boxes are chunk dimensions, and yellow boxes are compute dimensions. Arrows indicate the flow path of chunk dimensions.


Figure 6. Parallel efficiency of AlphaFold training. The left two figures show the scaling efficiency of model parallelism intra-node, and the right figure shows the scaling efficiency of data parallelism inter-node.

Figure 8. Comparison of inference performance for long sequences. AlphaFold and OpenFold only support single-GPU inference, so they are shown as the orange and blue dashed lines. FastFold-TP and FastFold-DAP refer to parallel inference using TP and DAP on multiple GPUs, respectively.

Figure 11. Comparison of experimental results and the predictions of AlphaFold/FastFold on T1024-D.


Table 3. Communication Volume for Each Evoformer Block. $S$: the size of the intermediate activation. $N$: the number of devices for model parallelism.

Table 4. Comparison of Resource and Time Cost of Different Implementations.

Table 5. Inference Latency for Extremely Long Sequences (s).