Kimad: Adaptive Gradient Compression with Bandwidth Awareness

In distributed training, communication often emerges as the bottleneck. In response, we introduce Kimad, a system for adaptive gradient compression. By continuously monitoring bandwidth, Kimad adjusts compression ratios to match the requirements of individual neural network layers. Our experiments and theoretical analysis confirm Kimad's effectiveness, establishing it as a strong baseline for adaptive compression in distributed deep learning.


INTRODUCTION
Deep learning has steadily emerged as a transformative paradigm, demonstrating profound results across various domains. With its growth, there has been an explosion in the size of models and datasets. This upsurge in complexity often demands expansive computational resources, prompting researchers to adopt distributed training.
The Graphics Processing Unit (GPU) has become a cornerstone of deep learning model training, fundamentally altering the landscape of artificial intelligence research and applications. The latest advancements in GPU technology, exemplified by state-of-the-art architectures such as Ampere and Hopper [9, 10], exhibit unprecedented computational power and speed up training by up to 16×. However, these cutting-edge GPUs come at considerable financial cost: more than $200,000 for a single DGX A100.
In this scenario, researchers increasingly turn to cloud-based computational resources for model training due to their flexible pricing models, variety of hardware, and ease of scaling. However, bandwidth variability in cloud-based deep learning training poses a substantial challenge to the efficient execution of large-scale machine learning tasks [1, 20, 27]. Bandwidth fluctuations, influenced by factors such as network congestion and competing workloads, lead to inconsistent performance during training. Figure 1 shows an example of bandwidth discrepancy measured on AWS EC2 with a TCP server in Frankfurt receiving simultaneously from 4 workers using iPerf3. The existing framework CGX [21] has made strides by integrating widely adopted gradient compression techniques and striking a balance between accuracy and compression ratio, but it fails to address dynamic bandwidth. DC2 [1] achieves adaptive compression by inserting a shim layer between the ML framework and the network stack to monitor bandwidth in real time and adjust the compression ratio. However, this approach is model-agnostic and cannot be combined with other application-level optimizations. Beyond bandwidth adaptivity, numerous researchers are investigating how to exploit the diverse structure across network layers to improve compression ratios [3, 8]. However, these studies address only the static nature of network structures and assume an ideally stable network connection, a scenario seldom encountered in real-world deployments.
In light of these findings, we introduce Kimad: an adaptive gradient compression system aware of both bandwidth changes and model structure. The overall design is depicted in Figure 2. Kimad deploys a runtime bandwidth monitor and a compression module on each worker and on the server. Throughout training, the bandwidth monitor estimates communication delays using historical statistics. The compression module then uses the estimated bandwidth to compute a compression budget for the entire model, and refines the layer-wise compression ratios while adhering to that overall budget.
In essence, we advance the following contributions:

BACKGROUND AND RELATED WORK

Data Parallelism
Data parallelism is a widely used strategy for distributed training, which can be formulated as

    min_{x ∈ R^d} f(x) := Σ_{i=1}^{n} w_i f_i(x),    (1)

where x ∈ R^d corresponds to the parameters of a model, [n] := {1, . . . , n} is the set of workers (e.g., GPUs, IoT devices), and w_1, . . . , w_n are non-negative weights adding up to 1 (for example, the weights can be uniform, i.e., w_i = 1/n for all i). Further, f_i(x) := E_{ξ∼D_i}[ℓ(x, ξ)] is the empirical loss of model x over the training data D_i stored on worker i, where ℓ(x, ξ) is the loss of model x on a single data point ξ.
In data parallelism, each worker keeps a copy of the model and a partition of the dataset. The gradients computed on each worker are then communicated, aggregated, and used to update the model. We provide the general formulation for solving (1) in Appendix A.
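To make the aggregation step concrete, here is a minimal data-parallel sketch. It is illustrative only: the toy quadratic shards and the name `data_parallel_step` are our own example, not the paper's setup.

```python
import numpy as np

# One synchronous data-parallel step for problem (1): each worker i computes
# a gradient on its own shard D_i; the server aggregates with weights w_i
# that sum to 1, then takes a gradient step.
def data_parallel_step(x, shards, grad_fn, weights, lr=0.1):
    """x <- x - lr * sum_i w_i * g_i."""
    grads = [grad_fn(x, D_i) for D_i in shards]       # local gradients
    agg = sum(w * g for w, g in zip(weights, grads))  # weighted aggregate
    return x - lr * agg

# Toy shard loss: f_i(x) = 0.5 * ||x - c_i||^2, so grad = x - c_i
shards = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grad = lambda x, c: x - c
x = np.zeros(2)
for _ in range(100):
    x = data_parallel_step(x, shards, grad, weights=[0.5, 0.5])
# x converges to the weighted mean of the shard centers, [0.5, 0.5]
```

The same structure underlies the Parameter-Server setting discussed next; only the communication pattern changes.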
In this work, we focus predominantly on the Parameter-Server (PS) model. Our choice is driven by its inherent capability to efficiently handle sparse updates [12, 16] and its widespread adoption in environments with shared bandwidth, such as federated learning. While our emphasis is on the PS architecture with data parallelism, we posit that the adaptivity innovations we introduce can integrate seamlessly with, and offer value to, peer-to-peer architectures and model parallelism as well.

Gradient Compression
Relying on the observation that deep learning training can converge despite lossy information, gradient compression is a popular approach to speed up data-parallel training [36]. During back-propagation, gradients are compressed before communication with the server, and the server decompresses the gradients prior to aggregating them; thus the communication cost can be largely reduced. Additionally, the server can distribute the model using compression as well. Gradient compression techniques can be broadly categorized into three classes:
• Sparsification [5, 17, 28, 29, 31]: selectively retaining elements of the gradient while zeroing others. This includes methods like TopK (selecting the k elements of largest absolute value) and RandK (randomly selecting k elements).
• Quantization [4, 13, 22, 26, 35]: reducing data precision to fewer discrete values. Deep learning frameworks often use 32-bit floating point (FP32) for gradients, which can be compressed to formats like FP16, UINT8, or even 1 bit [26].
• Low-rank decomposition [30]: approximating gradients by factoring them into lower-rank matrices, reducing their size as X ≈ P · Q^T, where X is the original matrix and P and Q are lower-rank matrices.

Adaptive compression. Adaptive compression is an emerging area studying how to apply gradient compression efficiently with different compression levels [1, 2]. Gradient compression is traditionally used in a static way: given a compressor C : R^d → R^d, gradients are compressed with the same compression ratio for each layer and across the whole training procedure. However, gradient compression has a different impact at different training stages. For instance, Accordion [2] selects between high and low compression levels by identifying critical learning regimes. Furthermore, gradient compression methodologies should account for the diverse attributes of individual layers. For example, Egeria [34] methodically freezes layers that have converged during training. L-Greco [3] uses dynamic programming to adjust layer-specific compression ratios under an error budget, reducing the overall compressed size. Moreover, the system architecture should also be taken into consideration. Notably, FlexReduce [18] splits the communication protocol into uneven portions based on the communication hierarchy.
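As a concrete illustration of sparsification, a minimal TopK compressor can be sketched as follows (our own example, not the code of any cited system):

```python
import math
import numpy as np

# TopK sparsification: keep the k largest-magnitude entries, zero the rest.
# Only (index, value) pairs need to be communicated.
def topk_compress(g, k):
    flat = g.ravel()
    k = min(k, flat.size)
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of k largest |g|
    return idx, flat[idx]

def topk_decompress(idx, vals, shape):
    out = np.zeros(math.prod(shape), dtype=vals.dtype)
    out[idx] = vals
    return out.reshape(shape)

g = np.array([[0.1, -3.0], [2.0, 0.05]])
idx, vals = topk_compress(g, k=2)
g_hat = topk_decompress(idx, vals, g.shape)
# g_hat keeps only -3.0 and 2.0; the two small entries become 0
```

Quantization and low-rank decomposition fit the same compress/decompress interface, which is what lets a system like Kimad swap compressors per round.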

Error Feedback
Error feedback (EF), also referred to as error compensation, is a widely adopted method for ensuring convergence stability in distributed training of supervised machine learning models. It is particularly effective when combined with biased compressors such as TopK. EF was originally introduced as a heuristic [26]; theoretical guarantees were established later [5, 28]. More recently, EF21 [11, 23, 24] provided a theoretical analysis for distributed settings and achieves a state-of-the-art O(1/T) convergence rate. We integrate EF21 into Kimad to achieve better convergence.
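The EF21 mechanism can be sketched in a few lines. This is a toy single-worker version with our own names; the actual method and its analysis are in [23].

```python
import numpy as np

# EF21 sketch: instead of compressing the raw gradient, compress the
# *difference* between the fresh gradient and a maintained estimator g_est.
# Server and worker add the same message, so the estimator stays in sync
# and tracks the true gradient even under a biased compressor like TopK.
def topk(v, k):
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef21_step(x, g_est, grad_fn, lr, k):
    x_new = x - lr * g_est                  # descend along the estimator
    msg = topk(grad_fn(x_new) - g_est, k)   # compress only the difference
    return x_new, g_est + msg               # both sides apply the message

# Toy objective: f(x) = 0.5 * ||x - 1||^2 in dimension 5, sending k = 2
f = lambda x: 0.5 * np.sum((x - 1.0) ** 2)
grad = lambda x: x - 1.0
x = np.zeros(5)
g_est = grad(x)                             # EF21 init: g_0 = grad(x_0)
for _ in range(800):
    x, g_est = ef21_step(x, g_est, grad, lr=0.1, k=2)
# the iterate approaches the minimizer even though only 2 of 5
# coordinates are communicated per step
```

Note that the compressed message vanishes at the optimum (gradient and estimator agree), which is the intuition behind EF21's strong convergence guarantees.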

Bandwidth Monitoring
Bandwidth monitoring is critical in network management, especially in cloud-based scenarios. It addresses the need to monitor data transfer rates between computational nodes during training, ensuring communication efficiency. Existing works [1, 6, 7] estimate bandwidth changes by utilizing network-level communication properties such as latency. In particular, adaptive strategies [33, 37] such as dynamic synchronization algorithms or buffering mechanisms can alleviate the effects of bandwidth fluctuation.
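A minimal monitor in the spirit described above could smooth recent transfer measurements with an exponentially weighted moving average. This is a hypothetical sketch, not the monitoring method of any cited system:

```python
# EWMA bandwidth monitor sketch: smooths short-lived spikes while still
# tracking sustained bandwidth shifts. `alpha` weights the newest sample.
class BandwidthMonitor:
    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.estimate = None  # Mbps

    def record(self, bytes_sent, seconds):
        sample = bytes_sent * 8 / seconds / 1e6  # bytes -> Mbps
        if self.estimate is None:
            self.estimate = sample
        else:
            self.estimate = self.alpha * sample + (1 - self.alpha) * self.estimate
        return self.estimate

mon = BandwidthMonitor(alpha=0.5)
mon.record(12_500_000, 1.0)  # 100 Mbps sample
mon.record(25_000_000, 1.0)  # 200 Mbps sample
# with alpha = 0.5 the estimate is now 150.0 Mbps
```

A real deployment would feed such estimates from transport-level statistics (e.g., latency probes), as the cited works do.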

METHODOLOGY
We propose Kimad, an adaptive compression framework that accommodates varying bandwidth and model structures. Kimad continuously monitors the bandwidth and dynamically adjusts the communication volume in each round for every machine. For instance, if the bandwidth b_i^t for machine i at step t becomes limited compared to other devices, we instruct machine i to employ a suitable compressor to reduce the size of its update vector, with the goal of ensuring that this machine does not become a straggler. Additionally, we present Kimad+, an extension of Kimad that fine-tunes the compression ratio differently across layers. Kimad+ involves an additional step that introduces some computational overhead and is recommended when there is surplus computational capacity available (i.e., when communication is the most severe bottleneck).
As Figure 2 shows, to train a deep learning task, the end user needs to inform Kimad of T, the time budget for a single communication round (a step). The server and each worker i determine the compression strategy locally, without global information. Kimad requires a bandwidth monitor, deployed on each worker and on the server, which continuously monitors network behavior and estimates the current bandwidth. Based on this bandwidth, Kimad calculates how many bits need to be communicated at each step, which we call the compression budget, denoted B. The blue arrows in Figure 2 further represent Kimad+, which allocates the compression budget to each layer to minimize the compression error and thus improve accuracy. Algorithm 1 formulates the general version of Kimad. The algorithm starts with the server broadcasting the latest compressed update C^t(x^t − x̂^{t−1}); each worker then calculates its update via A_i^update and uploads the compressed update C_i^t(u_i^t − û_i^{t−1}). Afterward, the server updates the model x^t with the aggregated update vector. The core of the algorithm is A^compress, which selects a compressor from Ω in an adaptive manner, based on the model information and the current bandwidth estimate b_i^t. To recover accuracy, we apply bidirectional EF21; therefore, both server and workers maintain two estimators, û_i^t and x̂^t, and only the server stores the global model x^t. A detailed version of the Kimad algorithm appears as Algorithm 3 in Appendix B.

Kimad: Bandwidth Adaptivity
Given a user-specified time budget T, the goal of Kimad is to keep the training time of each step within T time units while communicating as much information as possible.
In our work, we consider asymmetric networks: the uplink and downlink bandwidth can differ, and bandwidth varies among workers. We apply bidirectional compression, i.e., both workers and server send compressed information.
We break down the time cost of worker i at step t into computation time and communication time. We abstract the computation time of a step as T_comp, which is assumed to be constant across a training task. When communication is triggered, Kimad reads the current bandwidth b_i^t from the bandwidth estimator and uses it to calculate (with negligible computational overhead) the compression budget from the bandwidth and the time remaining in the round.
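The displayed budget equation did not survive typesetting here; a natural reading consistent with the surrounding text is B_i^t = b_i^t · (T − T_comp), i.e., the bits that fit through the link in the time left after computation. The following sketch encodes that reconstruction (the function name and formula are our assumption):

```python
# Per-step compression budget sketch (reconstruction of the missing
# displayed equation): bandwidth (bits/s) * seconds left for communication.
def compression_budget_bits(b_mbps, T, T_comp):
    t_comm = max(0.0, T - T_comp)  # seconds remaining for communication
    return b_mbps * 1e6 * t_comm   # bits that fit in t_comm seconds

B = compression_budget_bits(b_mbps=100.0, T=0.5, T_comp=0.1)
# ~4e7 bits, i.e. room for about 1.25 million FP32 gradient entries
```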

Kimad+: Layer Adaptivity
Given a compression budget, Kimad+ dynamically allocates the compression ratios of individual layers in a non-uniform manner. This optimization aims to minimize compression error while ensuring that the cumulative compressed size remains within the allocated budget. We formulate it as an optimization problem: minimize the total compression error across layers caused by compression, with the compressed size constrained by the compression budget. We use the standard Euclidean (ℓ2) norm as the error indicator, i.e., the layer error is the ℓ2 distance between the gradient and its compressed version. However, the relation between compression error and compressed size is not deterministic, and the search space of compression ratios is continuous; as a result, an analytical solution to this optimization problem is not feasible. To tackle this challenge, we employ a discretization approach, narrowing down the compression-ratio search space. Specifically, for each layer, Kimad+ restricts its choice of compression ratio to a discrete set of K candidates. Therefore, (3) can be written as a discrete selection problem, which we formulate, following the idea of L-Greco [3], as a knapsack problem. In contrast to L-Greco, Kimad+ uses the compression budget B as the knapsack size and the compression error as the weight. Kimad+ then solves the knapsack problem with dynamic programming. The time complexity is O(L · K · E), where L is the number of layers, K is the number of candidate compression ratios, and E is the discretization factor for the error. We give the algorithm details in Appendix C.
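The allocation step can be sketched as a standard knapsack-style dynamic program. This is illustrative, not L-Greco's or the paper's exact code: we assume each layer exposes, for every candidate ratio, a (compressed size, compression error) pair, with sizes discretized to integer units.

```python
# Pick one (size, error) option per layer minimizing total error subject
# to total size <= B. options[l] = list of (size, error) pairs for layer l.
def allocate(options, B):
    INF = float("inf")
    n = len(options)
    # dp[l][b] = min total error for the first l layers using size <= b
    dp = [[INF] * (B + 1) for _ in range(n + 1)]
    dp[0] = [0.0] * (B + 1)
    pick = [[None] * (B + 1) for _ in range(n)]
    for l, opts in enumerate(options):
        for b in range(B + 1):
            for j, (size, err) in enumerate(opts):
                if size <= b and dp[l][b - size] + err < dp[l + 1][b]:
                    dp[l + 1][b] = dp[l][b - size] + err
                    pick[l][b] = j
    if dp[n][B] == INF:
        return INF, None
    # Backtrack the chosen option index for every layer
    chosen, b = [], B
    for l in range(n - 1, -1, -1):
        j = pick[l][b]
        chosen.append(j)
        b -= options[l][j][0]
    return dp[n][B], chosen[::-1]

# Two layers, three candidate ratios each: (size_units, error)
options = [[(1, 0.9), (2, 0.3), (4, 0.0)],
           [(1, 0.5), (2, 0.2), (4, 0.0)]]
err, chosen = allocate(options, B=4)
# best split under budget 4 is size 2 + size 2: total error 0.3 + 0.2 = 0.5
```

The per-option errors would come from measuring the ℓ2 error of each candidate compressor on the layer's gradient, as described above.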

Error Feedback
We apply error feedback within Kimad. To the best of our knowledge, EF21 [23] is one of the most effective EF methods. We adapt EF21 and extend it in a layer-wise fashion. While EF21 is analyzed with a constant step size, our theory allows the step size to depend on the layer l and the iteration t. The theoretical result is given below; the proof is in Appendices D, E, and F.

EVALUATION
We begin our evaluation with synthetic experiments that showcase the efficiency of our proposed Kimad method, particularly demonstrating that EF21 works with a compression ratio adaptive to bandwidth. The synthetic experiments use a simple quadratic function f, which is lower bounded by 0 and has layer smoothness (Appendix D.1) and global smoothness (Appendix D.2). This function fits the theoretical assumptions and allows us to fine-tune the learning rates for all compression ratios and time budgets T at an affordable cost. Subsequently, we present results from more practical tasks, demonstrating that Kimad is applicable to distributed deep learning training. We also evaluate Kimad+ to substantiate its superior capability to reduce compression error compared to Kimad, while maintaining the same communication cost. The evaluation is simulation-based, running a Parameter-Server architecture with dynamic asymmetric bandwidth. We use TopK with fixed K as the default compression method. The simulator is tested with Python 3.9.15 and PyTorch 1.13.1.

Synthetic Experiments
For now, we consider only one direction; i.e., the downlink (server-to-worker) communication cost can be neglected, so there is only an uplink bandwidth cost. We simulate the bandwidth oscillation with a sinusoid-like function, as in Figure 3. We start our experiments in a single-worker setup to optimize a quadratic function, so n = 1 and the dimension is d = 30; hence f(x) in problem (1) reduces to a quadratic form. Previous works [11, 23, 24] show that EF21 can improve convergence rates in federated learning setups, particularly for biased compressors such as TopK. We now demonstrate that EF21 can also be used seamlessly with Kimad to improve performance. For a fair comparison, it is crucial to fine-tune all hyperparameters for each method. For EF21 with TopK, we systematically explored various K values and selected the best-performing one for comparison with Kimad. Kimad, however, does not require us to determine the best K, since it adapts to the available bandwidth dynamically; instead, we optimize the time budget parameter T and fine-tune Kimad in conjunction with EF21. We compare performance among Kimad and EF21, with standard gradient descent (GD) as the baseline.
As Figure 3 shows, Kimad can be much faster than the best EF21 configuration. Kimad achieves this by adapting the compression ratio to the bandwidth so as to be as effective as possible. These results are consistent across different bandwidth patterns: with small bandwidth and high relative oscillation, we see large gains because the adaptive strategy has more to exploit (Figures 3 and 4). As the amplitude of the bandwidth oscillations grows, we still see performance improvements (Figure 5). However, when the bandwidth is very high and the amplitude of its oscillations is low, adapting the compression ratio brings no gain: there is no need to adapt because bandwidth is no longer the bottleneck (Figure 6).

Kimad on Deep Model
Setting. We train ResNet18 on CIFAR-10 for 100 epochs, with uniform worker weights, learning rate 0.01, Ω = {TopK | K > 0}, gradients computed with batch size 128, and random seed 21. We conduct 5 epochs of warmup training; thus û_i and x̂ are initialized from the state at epoch 5. Compression occurs on a per-layer basis, in accordance with common practice. We set the broadcasting-congestion coefficient to 1 for the downlink, so that the compression budget can be calculated directly from the estimated bandwidth and the time budget. Our baseline is EF21 with fixed-ratio compression, which has the same overall communication size as Kimad but applies the same compression ratio across layers and steps.
Bandwidth. In our simulation, we model the dynamic bandwidth within the range of 30 Mbps to 330 Mbps using the function bw(t) = A · sin(f · t)^2 + c, where A, f, and c are user-defined coefficients that adjust the oscillation frequency and amplitude. We assume the bandwidth between the server and each worker follows the same pattern with different noise. The dashed curve in Figure 7 shows the bandwidth pattern.
Communication adaptivity.
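The bandwidth model above can be sketched directly; the coefficients below are our own choice to span the stated 30–330 Mbps range (A = 300, c = 30), while f controls the oscillation frequency:

```python
import math

# bw(t) = A * sin(f * t)^2 + c, oscillating between c and A + c.
def bandwidth_mbps(t, A=300.0, f=0.1, c=30.0):
    return A * math.sin(f * t) ** 2 + c

lo = min(bandwidth_mbps(t) for t in range(1000))
hi = max(bandwidth_mbps(t) for t in range(1000))
# lo sits at 30 Mbps and hi approaches 330 Mbps, matching the stated range
```

Per-worker noise would be added on top of this base curve to obtain the per-link bandwidth traces used in the simulation.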

Kimad+
Kimad+ minimizes the compression error while maintaining the same compression ratio as Kimad. We train Kimad and Kimad+ under the same setting as above, with error discretization factor 1000 and compression ratio chosen from {r | r = 0.01 + i · 0.02, i ∈ Z, 0.01 ≤ r ≤ 1}. Figure 9 shows the compression error at one worker over a time frame; the optimal baseline selects K using whole-model information. The compression error is negatively correlated with bandwidth, and Kimad+ generally achieves lower compression error. We also observe that Kimad+ achieves 1% higher accuracy than EF21 after training.
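To illustrate what feeds the Kimad+ allocation, the sketch below builds the candidate ratio grid quoted above and measures the ℓ2 TopK error for each ratio on one synthetic layer gradient (the gradient values are our own synthetic example):

```python
import numpy as np

# Candidate grid {r = 0.01 + i*0.02, 0.01 <= r <= 1}: 50 ratios, 0.01..0.99
ratios = [round(0.01 + i * 0.02, 2) for i in range(50)]

rng = np.random.default_rng(21)
g = rng.standard_normal(1000)  # one layer's gradient (synthetic)

def topk_error(g, ratio):
    """l2 error of TopK keeping a `ratio` fraction of entries."""
    k = max(1, int(ratio * g.size))
    kept = np.sort(np.abs(g))[-k:]               # magnitudes TopK keeps
    return np.sqrt((g ** 2).sum() - (kept ** 2).sum())

errors = [topk_error(g, r) for r in ratios]
# error shrinks monotonically as the fraction of kept entries grows
```

These per-ratio (size, error) pairs are exactly the options the dynamic program chooses among, one per layer.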

LIMITATIONS AND FUTURE WORK
Kimad introduces a user-defined hyperparameter T, which trades off per-step time against accuracy and can also be adjusted dynamically. The learning rate could likewise be adjusted layer-wise. Besides, our work is not yet a fully implemented system: the current experiments are simulation-based, so the implementation of the monitor is trivial. We consider it important to integrate state-of-the-art monitoring methods into a complete system in the future. We can also generalize from splitting models into layers to splitting them into blocks, where one block may contain many small layers. The computational overhead of Kimad+ is non-negligible but can be overlapped with communication. Moreover, LLM-targeted compression such as CocktailSGD [32] can also be considered.

CONCLUSION
We proposed Kimad, a bandwidth-aware gradient compression framework with an extended EF21. Kimad adapts the compression ratio based on bandwidth and model characteristics: each worker determines its local compression ratio from its available bandwidth and time budget, and this ratio can be allocated across layers in a non-uniform manner based on layer-wise sensitivity. We validated that Kimad preserves the same convergence as fixed-ratio compression while saving communication time.

A PROBLEM FORMULATION
Algorithm 2 is a generic scheme for solving problem (1) which can be used as a baseline. In each round, the server broadcasts the model x^t to all machines i ∈ [n]; each machine i ∈ [n] uses an algorithm A_i to compute its update; each machine uploads its update u_i^t to the server; and the server updates the model via x^{t+1} = x^t − γ^t Σ_{i=1}^n w_i u_i^t, where γ^t > 0 is a learning rate. Here are some canonical examples: if A_i^update performs one step of gradient descent with respect to f_i, then Algorithm 2 becomes gradient descent for solving problem (1); if A_i^update applies multiple steps of gradient descent instead, Algorithm 2 becomes local gradient descent [14, 15]; if it applies multiple steps of stochastic gradient descent instead, Algorithm 2 becomes local stochastic gradient descent [15].
In practice, not all workers participate in every epoch of training. Many worker-sampling algorithms have been proposed [19, 25] to speed up training. However, these algorithms can introduce bias and behave differently across tasks. In this work, we consider full participation of workers to avoid the influence of worker sampling.
(1) The server broadcasts the model x^t to all workers i ∈ [n]; (2) each machine i ∈ [n] computes an update û_i^t = A_i(x^t, ℓ, D_i) via some algorithm A_i and uploads the update to the server; (3) the server aggregates the updates and updates the model via x^{t+1} = x^t − γ^t Σ_{i=1}^n w_i û_i^t, where γ^t is a learning rate.

B KIMAD ALGORITHM WITH EXPLANATION
Algorithm 3 illustrates the Kimad algorithm with more details and comments.
Algorithm 3 Kimad: Adaptive Gradient Compression with Bandwidth Awareness (Detailed)
1: Input: algorithms A_i^update for computing the model update on each machine i ∈ [n]; set of compressors Ω; compressor-selection algorithm A^compress used by the server and the machines; model x^0 ∈ R^d known by the server; initial model estimator x̂^{−1} ∈ R^d known by the machines and the server (for example, x̂^{−1} = 0 or x̂^{−1} = x^0 are acceptable choices); initial update estimators û_i^{−1} ∈ R^d for i ∈ [n] known by the machines and the server (for example, û_i^{−1} = 0 for all i ∈ [n] is an acceptable choice); single-round time budget T > 0; learning-rate schedule {γ^t} for iterations t ≥ 0
2: for each communication round t = 0, 1, 2, . . . do

3: The server estimates the broadcast/downlink bandwidth at communication round t; let the estimate be b^t
5: The server updates the model estimator to x̂^t = x̂^{t−1} + C^t(x^t − x̂^{t−1})
6: The server broadcasts the compressed vector C^t(x^t − x̂^{t−1}) to all machines i ∈ [n]
13: end for
14: The server updates all update estimators to û_i^t = û_i^{t−1} + C_i^t(u_i^t − û_i^{t−1})
15: The server updates the model via x^{t+1} = x^t − γ^t Σ_{i=1}^n w_i û_i^t, where γ^t > 0 is a learning rate
16: end for

C KIMAD+ DYNAMIC PROGRAMMING

Algorithm 4 lists the dynamic programming algorithm that optimizes the layer-wise compression-ratio allocation to minimize the compression error.

Our next lemma is specific to Algorithm (5)–(7), where s is any positive number.

Proof. The first inequality holds by the contraction property of the chosen compressor C_i^t, and in the last step we have applied Young's inequality. □

F PROOF OF THEOREM 1
Proof. We proceed in three steps. Step 1: first, we note the bound on E[û^{t+1}] provided by Lemma 3. Adding inequalities (19), we obtain the desired bound, where we used the fact that γ_l^t ≡ γ^t.

Figure 7 depicts a single worker's communication size over time for different T_comm. The left y-axis represents bandwidth, while the right represents communication size. The plateau at the top signifies the maximum uncompressed size. This graph illustrates Kimad's effective adaptation to changing bandwidth conditions, optimizing communication throughout.
Convergence. The loss curve in Figure 8 shows the comparison with EF21. Kimad finishes training faster while achieving the same final convergence.

4: The server chooses a compressor C^t ∈ Ω for compressing the difference x^t − x̂^{t−1} via algorithm A^compress: C^t = A^compress(Ω, x^t, x̂^{t−1}, b^t, T). (The algorithm A^compress aims to choose the compressor from Ω suffering minimal error when compressing the difference x^t − x̂^{t−1}, subject to the constraint that the compressed message should take at most T seconds to broadcast to the machines given the broadcast bandwidth estimate b^t.)

7: for each machine i = 1, 2, . . . , n in parallel do
8: Update the model estimator to x̂^t = x̂^{t−1} + C^t(x^t − x̂^{t−1}) using the previously stored estimator x̂^{t−1} and the received message
9: Use algorithm A_i^update to compute the update u_i^t = A_i^update(x̂^t, ℓ, D_i) ∈ R^d
10: Estimate the uplink bandwidth of machine i at communication round t; let the estimate be b_i^t
11: Choose a compressor C_i^t ∈ Ω for compressing the difference u_i^t − û_i^{t−1} via algorithm A^compress: C_i^t = A^compress(Ω, u_i^t, û_i^{t−1}, b_i^t, T). (The algorithm A^compress aims to choose the compressor from Ω suffering minimal error when compressing the difference u_i^t − û_i^{t−1}, subject to the constraint that the compressed message should take at most T seconds to upload to the server given the uplink bandwidth estimate b_i^t.)
12: Upload the compressed vector C_i^t(u_i^t − û_i^{t−1}) to the server

[Figure 2: Kimad system overview — each client runs a bandwidth monitor feeding current and average bandwidth into a compression-budget computation (Kimad, bandwidth adaptivity), which Kimad+ (layer adaptivity) further allocates into per-layer compression ratios before compression and upload.]
The remaining coefficient is the broadcasting-congestion coefficient, which can simply be set to 1 assuming no congestion. Therefore, for simplicity and without loss of generality, we only vary the per-worker bandwidth b_i^t to simulate various scenarios.

Table 1: Average step time across different T_comm, with n = 4 workers.

Table 2: Top-5 accuracy across varying n, with T = 1.

Speedup. Table 1 lists the average time of one SGD step across different T_comm. In our setting, Kimad generally saves 20% of training time across communication budgets. Scalability. Table 2 presents the Top-5 accuracy on the evaluation set after 100 epochs. Kimad demonstrates scalability comparable to EF21, maintaining good accuracy levels with an increasing number of workers.
• If A_i^update performs one step of stochastic gradient descent with respect to function f_i, then Algorithm 2 becomes a variant of mini-batch stochastic gradient descent for solving problem (1).