Ymir: A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters

The breakthrough of foundation models makes foundation model fine-tuning (FMF) workloads prevalent in modern GPU datacenters. However, existing schedulers tailored for model training do not consider the unique characteristics of FMs, making them inefficient in handling FMF workloads. To bridge this gap, we propose Ymir, a scheduler that improves the efficiency of FMF workloads in GPU datacenters. Ymir leverages the shared FM backbone architecture to expedite FMF workloads from two aspects: (1) Ymir investigates the task transferability among different FMF workloads and automatically merges FMF workloads with the same FM into one, improving cluster-wide efficiency via transfer learning. (2) Ymir reuses the fine-tuning runtime of FMF workloads to reduce the significant context switch overhead. We conduct 32-GPU physical experiments and 240-GPU trace-driven simulations to validate the effectiveness of Ymir. Ymir can reduce the average job completion time by up to 4.3× compared with existing state-of-the-art schedulers. It also promotes scheduling fairness by fully exploiting task transferability. Supplementary materials can be found on our project website https://sites.google.com/view/ymir-project.


INTRODUCTION
Foundation models (FMs) have pushed the state-of-the-art performance envelope across a wide range of artificial intelligence tasks [6,20,21,56,63]. An FM is a machine learning model (commonly large-scale in parameters) trained on massive data and adaptable to various downstream tasks [12]. Fine-tuned FMs have shown impressive performance on many downstream tasks [14,64,65], leading to an increasing number of foundation model fine-tuning (FMF) workloads in public and private GPU datacenters [12,23]. To meet the growing resource demand of FMF workloads, it is crucial to improve their efficiency from the datacenter perspective.
Compared with conventional deep learning training (DLT) workloads, FMF workloads exhibit several distinct characteristics. First, FMs typically have substantial parameter sizes; hence, FMF workloads demand vast GPU memory [14,54,65]. Second, FMF workloads tend to require multiple GPUs for distributed execution to support large-scale models [23,33,54], which increases the time needed to initiate the distributed execution runtime. Therefore, FMF workloads incur much higher context switch overhead than general DLT workloads [5,37,75]. Third, FM users adopt a limited number of common FMs (e.g., RoBERTa [48], Vicuna [19]), as observed in [2]. Figure 1 shows the distribution of FM downloads in the HuggingFace Model Hub [1]: the top 10 downloaded FMs account for 83% and 89% of the downloads of the top 100 vision and language FMs, respectively. Also, existing commercial FM services (e.g., OpenAI [2]) only release a few FMs for public access. Due to the high expense of building an FM from scratch, it is cost-efficient to reuse existing FMs instead of providing diverse FMs for different tasks. Accordingly, it is common to see many FMF workloads share the same backbone architecture in a GPU datacenter.
Previous studies have proposed many efficient schedulers to optimize DLT workloads [16,36,50,59,62,86]. They consider two prominent practices. The first is to co-locate DLT workloads on the same GPUs to reduce the long queuing delay [16,86]; however, job colocation might cause out-of-memory issues for FMF workloads due to their vast GPU memory consumption. The second is to dynamically scale up the allocated GPUs to improve job throughput [36,59,62]; the frequent GPU allocation adjustments aggravate the context switch overhead and could yield significant job progress delays for FMF workloads. Some studies [7,29] aim to reduce the context switch overhead, but only for inference workloads. In summary, little systematic effort has been dedicated to accelerating FMF workloads in GPU datacenters. Given the shared architecture of FMs, this gap can be bridged by (1) reusing weights across tasks to expedite fine-tuning through transfer learning, and (2) reusing the fine-tuning runtime to reduce the context switch overhead, since FMF workloads primarily differ in model weights and task-specific datasets.
We propose Ymir, an elastic scheduling system that capitalizes on the opportunities presented by the shared backbone architecture to accelerate FMF workloads. Ymir consists of three key designs for FMF workload scheduling. First, we devise YmirEstimator to estimate the execution time of FMF workloads with and without task merging. Task merging indicates merging two workloads into one and subsequently fine-tuning it via transfer learning; it involves two decisions: determining which tasks to combine and selecting the appropriate transfer learning mode (illustrated in §2.1). Specifically, YmirEstimator profiles each new workload's statistical information (e.g., loss, gradients). Based on the profiled information, YmirEstimator predicts the execution time to reach model convergence for FMF workloads under various resource allocations and task merging scenarios. Second, we develop YmirSched to automate the task merging and resource allocations for FMF workloads to improve cluster-wide efficiency. Task merging can expedite model convergence; however, randomly combining tasks might not yield any speedup and could even degrade model accuracy. Ymir introduces the speedup gain to quantify the reduction in execution time resulting from various task merging scenarios, thereby mitigating the risk of poor task merging choices. In each scheduling interval, YmirSched leverages the estimation results of YmirEstimator to compute the speedup gain. Then, YmirSched incorporates the speedup gain into the FMF workload scheduling objective, favoring task mergings with higher speedup gains. By optimizing this objective, YmirSched determines how to merge tasks and allocate GPUs for cluster-wide workloads.
Third, we implement YmirTuner to reduce the context switch overhead by reusing the fine-tuning runtime. YmirTuner comprises two modules, the task constructor and the pipeline switch, to facilitate the context switch between FMF workloads. The task constructor provides a universal implementation of different FM fine-tuning algorithms [31] and requires only the modification of task-specific datasets, model weights, and other hyperparameters to perform the context switch. The pipeline switch overlaps the dataset preparation and parameter transfer with model execution to hide the context switch overhead. Moreover, the pipeline switch tailors the pipeline concept to data- and pipeline-parallel FMF workloads respectively, ensuring that the context switch takes no more than one minute.
In summary, we make the following contributions:
• We automate the task merging and resource allocations for FMF workloads.
• We reuse the fine-tuning runtime of FMF workloads to substantially reduce the context switch overhead.
• We implement and evaluate Ymir with representative FMs and datasets to demonstrate its efficiency.

BACKGROUND AND MOTIVATION
We begin with an in-depth exploration of task transferability, followed by a characterization of FMF workloads.

Task Transferability
As a core idea of Ymir, we provide a thorough exploration of task transferability. Task transferability refers to the ability of a model, initially trained on one task, to be used in another related but different task. In the context of FMs, downstream models sharing the same FM can expedite training convergence. Here, we discuss the transfer learning modes and the benefits of task transferability.

Transfer Learning Modes. Recent theoretical [77] and empirical [4,61,69,84,87] analyses of transfer learning show that task transferability can improve the accuracy of FMs on downstream tasks. Unlike their focus on model accuracy, we consider how transfer learning expedites training convergence. By investigating existing transfer learning studies [3,10,22,34,55,76,80], we identify three predominant transfer learning modes to accelerate FMF workloads, as illustrated in Figure 2. (1) Normal transfer: the conventional solution, where the downstream model for each task is fine-tuned on a given dataset from the pre-trained weights of the FM. (2) Temporal transfer: a new task B is fine-tuned from the FM previously fine-tuned on another task A. We denote this mode as A ↦ B. (3) Spatial transfer: both tasks A and B are fine-tuned together using a multi-task learning scheme. We denote this as A∥B. §9 further discusses extensions of these modes.

Benefits of Task Transferability. Compared to normal transfer, temporal and spatial transfer can better leverage the knowledge from other tasks [10,80]. Figure 3 compares the validation accuracy during training in different transfer learning modes. Figure 3(a) shows that temporal transfer reduces the number of epochs to fine-tune the QQP dataset [81] by 2.3× when the FM is previously fine-tuned on the STSB dataset [81]. Similarly, Figure 3(b) shows that spatial transfer reduces the number of epochs to fine-tune the ImageNet75 dataset [68] by 2.0× when the FM is fine-tuned together with the FOOD101 dataset [13]. These speedup benefits stress the need for an automated approach to identify task combinations and transfer learning modes for cluster-wide workloads.
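To make the three modes concrete, below is a minimal PyTorch-style sketch of how a fine-tuning job could pick its initialization and dataloader under each mode. The function, the `mode` argument, and the task objects are illustrative rather than Ymir's actual API, and the checkpoints are assumed to store whole models.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

def build_finetune_job(fm_ckpt, task_b, mode, task_a_ckpt=None, task_a=None):
    """Illustrative setup of the three transfer learning modes."""
    model = torch.load(fm_ckpt)               # shared FM backbone (whole model)
    if mode == "normal":
        # (a) Normal: fine-tune task B from the pre-trained FM weights.
        datasets = [task_b.dataset]
    elif mode == "temporal":                   # A |-> B
        # (b) Temporal: initialize from weights already fine-tuned on task A.
        model.load_state_dict(torch.load(task_a_ckpt))
        datasets = [task_b.dataset]
    elif mode == "spatial":                    # A || B
        # (c) Spatial: multi-task fine-tuning over both datasets.
        datasets = [task_a.dataset, task_b.dataset]
    loader = DataLoader(ConcatDataset(datasets),
                        batch_size=task_b.batch_size, shuffle=True)
    return model, loader
```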

Characterization of FMF Workloads
FMF workloads possess some unique characteristics. We demonstrate them with three representative FMs (ViT-Base, RoBERTa-Base, Vicuna-7B) and the corresponding datasets described in §7.1, on a server with 4 A100-80GB GPUs.
Exorbitant Context Switch Overhead. Figure 4(a) illustrates the measured context switch overhead for RoBERTa, ViT, and Vicuna-7B on STSB [81], CIFAR100 [41], and SAMSUM [25]. The overhead, mainly attributed to weight loading and dataset preparation, surpasses one minute. This high overhead hinders scaling up GPUs to improve job throughput.

Smooth Loss Curve. Prior works [26,50] emphasize that loss curves may not always exhibit smooth decreases, and curve fitting techniques may fail to extrapolate the relationship between loss and iteration. Fortunately, current ML studies [30,39,49] point out that FMs possess well-behaved loss curves. In Figure 4(b), we observe smooth loss curves when fine-tuning FMs for downstream tasks. Following prior studies [59,93], we can therefore adopt curve fitting techniques to predict model convergence.

Pervasive Task Transferability. Task transferability provides new opportunities to optimize FMF workloads in a datacenter: workloads sharing the same FM can be combined to enhance performance and cluster efficiency, even for different tasks with different datasets. Task transferability is pervasive across diverse FMs and tasks. For FMs, previous studies [12,14,65] emphasize their remarkable ability to adapt to various tasks. FM developers strategically optimize their models across a spectrum of tasks, enhancing the generalization and transferability of FMs. Consequently, robust task transferability is a common phenomenon within FMs. For tasks, recent ML studies [3,55,73,80,84] have analyzed transferability between numerous language and vision tasks. Their findings reveal that over 50% of task combinations can benefit from spatial or temporal transfer learning. To demonstrate this, we compute the Time-To-Accuracy (TTA) metric, defined as the time required to achieve the target accuracy on a task. We use the target accuracy of our evaluated FMF tasks and measure the TTA of various task combinations for different FMs. Figure 5 illustrates the box plots of relative TTA speedup for temporal (a) and spatial (b) transfer, in comparison to normal transfer. Both temporal and spatial transfer can speed up FMF workloads by up to 10×. Furthermore, more than half of the task combinations exhibit positive speedup (≥ 1×). This underscores that selecting optimal task combinations and transfer learning modes can significantly expedite FMF workloads.
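As a concrete illustration of the TTA metric, the hypothetical helper below (not part of Ymir) scans a validation-accuracy trace for the first timestamp that reaches the target, and computes the relative speedup of a transfer mode against normal transfer.

```python
def time_to_accuracy(trace, target):
    """trace: list of (elapsed_seconds, val_accuracy); returns TTA or None."""
    for elapsed, acc in trace:
        if acc >= target:
            return elapsed
    return None  # target accuracy never reached

def tta_speedup(normal_trace, transfer_trace, target):
    t_normal = time_to_accuracy(normal_trace, target)
    t_transfer = time_to_accuracy(transfer_trace, target)
    return t_normal / t_transfer  # > 1 means positive transfer

# Example: transfer reaches 0.90 in 1200s vs. 2760s normally -> 2.3x speedup.
print(tta_speedup([(2760, 0.90)], [(1200, 0.90)], 0.90))
```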
Indeed, users are willing to share task-specific model parameters with the ML community: every day, hundreds of new task-specific models built upon representative FMs are released on HuggingFace [1]. ModelKeeper [42] and Sommelier [28] harness the potential of model sharing to expedite model training in GPU datacenters. Naturally, task transferability opens a new avenue to expedite training progress for cluster-wide FMF workloads.

YMIR OVERVIEW
We introduce Ymir, a scheduler for FMF workloads that unleashes the potential of task transferability of FMs and improves cluster-wide efficiency and scheduling fairness. We discuss the system assumptions and workflow below.

System Assumptions. We make several assumptions about our system. (1) All FMF workloads share the same FM backbone in the GPU datacenter, as discussed in §1; we discuss extending Ymir to multiple FMs in §9. (2) A task is denoted as a (dataset, objective function) pair; the same dataset might be employed with different objective functions, which would be considered different tasks. (3) We focus on the widely adopted data-parallel and pipeline-parallel mechanisms in FMF workloads. Other parallelism schemes can be easily integrated into Ymir.

System Workflow. Ymir contains three key components: YmirEstimator predicts the execution time of FMF workloads under different task merging scenarios, including task combinations and transfer learning modes; YmirSched automates efficient task merging and resource allocation for cluster-wide workloads; YmirTuner improves the efficiency of FMF workloads with lightweight context switch mechanisms.
Figure 6 shows the workflow of Ymir. First, a user submits an FMF request to Ymir in a YAML format; the YAML file specifies a list of system parameters, as presented in Table 1 (❶). Then, YmirEstimator requests resources (e.g., 1 GPU) for each new workload from YmirSched for profiling, collecting relevant statistical information (e.g., loss, gradients) (❷). YmirEstimator utilizes the profiling results to predict the execution time of each new workload and sends the prediction results to YmirSched (❸). Second, YmirSched decides how to merge tasks and makes the resource (re-)allocations for cluster-wide workloads (❹). Third, YmirTuner receives the task merging decisions and instantiates the FMF workloads based on the transfer learning modes and other hyperparameters (❺); it also pipelines the context switch to reduce the corresponding overhead. YmirSched places FMF workloads on appropriate GPUs (❻). Lastly, Ymir returns the desired model weights to the user when the FMF workload finishes (❼).
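For illustration, a request of this kind might look as follows; the field names mirror Table 1, but the exact schema is our assumption.

```yaml
model: RoBERTa-Base              # FM backbone name
dataset: s3://my-bucket/stsb/    # path to training/evaluation samples
hyperparam:
  batch_size: 32
  learning_rate: 1.0e-5
  optimizer: adam
target:                          # job completion criteria
  max_iterations: 20000
  accuracy: 0.90
sharing: true                    # share parameters with other tasks
pipeline: false                  # use data parallelism instead
```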

YMIRESTIMATOR
YmirEstimator consists of three components that estimate the execution time of FMF workloads over various task merging scenarios from profiling results, as shown in Figure 7. First, the transferability estimator computes the transferability score and predicts the transfer gain (defined in Eqn. 1) between the new workload and other FMF workloads. Then, the iteration estimator uses the transfer gain to predict the number of iterations (defined in Eqn. 2) needed to reach the target accuracy in different learning modes. Last, the time estimator estimates the execution time by multiplying the number of iterations with the estimated per-iteration time under a given resource allocation (defined in Eqn. 5). The estimation process is performed only once for each new workload, significantly reducing the computational overhead. We emphasize that YmirEstimator's design is highly modularized, and its components can be replaced with other techniques that perform the same functions. Below, we present the technical details of each component.

Transferability Estimator
This component estimates the transfer gain for each joint transfer learning mode. Given two tasks $i$ and $j$, the transfer gain from $i$ to $j$ is calculated as follows:

$G_{i,j} = (P_{j,i} - P_j) / P_j$,    (1)

where $P_{j,i}$ is the performance (e.g., accuracy) of $j$ when jointly fine-tuned with $i$, and $P_j$ is the performance of $j$ when fine-tuned alone. If joint fine-tuning improves the performance of $j$, $G_{i,j}$ is positive; otherwise, it is negative or zero.
A straightforward way to obtain the transfer gain is to fine-tune the tasks in different learning modes, measure the performance, and compute $G_{i,j}$ with Eqn. 1. This is computationally expensive and impractical in workload scheduling. Instead, inspired by previous works [3,10,55,80], we adopt statistical information and ML techniques to predict the transfer gain. Because Ymir requires minimal computation overhead and satisfactory prediction accuracy, we empirically find that Task2Vec [3] is the most suitable technique (discussed in §7.5). Its underlying principle is that tasks with high gradient similarity exhibit high transferability. We make two modifications to Task2Vec to adapt it to our scenario. First, Task2Vec only considers temporal transfer learning and provides the corresponding transferability score $S(i, j)$ from task $i$ to $j$. We extend this metric to spatial transfer learning: we compute the bidirectional transferability scores $S(i, j)$ and $S(j, i)$ and take their average as the final transferability score for spatial transfer learning.
Second, we take the transferability score $S(i, j)$ as input to predict the transfer gain $G_{i,j}$. Table 2 (Transferability) shows the Pearson correlation between $S(i, j)$ and $G_{i,j}$ for different FMs. The high linear correlation between these two metrics suggests the feasibility of using linear regression to predict the transfer gain from the transferability score.

Error Analysis. In Table 2 (Transferability), we choose two metrics to evaluate the transferability estimator over various task combinations across different transfer learning modes: (1) the mean absolute percentage error (MAPE) between the transfer gain and the gain estimated from the transferability score; (2) the classification accuracy (ACC) when categorizing the estimation into positive ($G_{i,j} \ge 0$) and negative ($G_{i,j} < 0$) transfer. The low MAPE and high accuracy across different FMs indicate that the transferability estimator is a general and practical approach for estimating the transfer gain.

Sensitivity Analysis. We further analyze the impact of the transferability estimator's errors on the JCT speedup brought by the task merger (§5.1). Specifically, we add random noise, with the scale following a uniform distribution over [−1, 1], to the prediction results of the transferability estimator. Figure 8(a) presents the JCT speedup compared to the case without the task merger. Even when the added noise scale is up to 40%, the JCT speedup brought by the task merger remains larger than 1. Despite potential deviations in estimation accuracy, the overall performance improvement remains satisfactory.
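The sketch below illustrates the two modifications described above under stated assumptions: `task2vec_score` stands in for the Task2Vec scoring routine (not shown), and the gain predictor is an ordinary least-squares linear fit on profiled (score, gain) pairs.

```python
import numpy as np

def spatial_score(task2vec_score, i, j):
    # Task2Vec is directional; average both directions for spatial transfer.
    return 0.5 * (task2vec_score(i, j) + task2vec_score(j, i))

def fit_gain_predictor(scores, gains):
    """Fit G ~ w*S + b by least squares on profiled (score, gain) pairs."""
    w, b = np.polyfit(np.asarray(scores), np.asarray(gains), deg=1)
    return lambda s: w * s + b

# Usage: predict the transfer gain of an unseen task pair from its score.
predict_gain = fit_gain_predictor([0.1, 0.4, 0.7], [-0.05, 0.10, 0.30])
print(predict_gain(0.5))
```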

Iteration Estimator
This component estimates the number of iterations required for joint fine-tuning to reach (or exceed) the same validation accuracy as normal transfer. It estimates the training loss curve using the predicted transfer gain $G_{i,j}$ for different joint transfer learning modes. Then, following previous works [9,93], it identifies the minimum number of iterations that makes the training converge. Formally, for task $j$, the number of iterations $I_j$ is estimated as:

$I_j = \min \{ t \mid \mathbb{1}[\mathcal{L}_j(t) \le \mathcal{L}_j^{\text{target}}] = 1 \}$,    (2)

where $\mathbb{1}$ is the indicator function and $\mathcal{L}_j(t)$ is the training loss value at the $t$-th training step. It is challenging to obtain the training loss $\mathcal{L}_j(t)$ efficiently. The smooth loss curves of FMF workloads motivate us to adopt the curve function proposed by Optimus [59] to characterize the job progress and training loss of DLT workloads. FMF workloads commonly use the Adam optimizer [40], which has a faster convergence rate than SGD. We therefore introduce an additional second-order term $t^2$ to better characterize the job progress and normalized training loss of FMF workloads:

$\mathcal{L}_j(t) = 1 / (\beta_{j,3} t^2 + \beta_{j,2} t + \beta_{j,1}) + \beta_{j,0}$,    (3)

where $\beta_{j,3}$, $\beta_{j,2}$, $\beta_{j,1}$, and $\beta_{j,0}$ are learnable non-negative coefficients. We empirically observe that this curve-fitting technique performs better than Optimus; users can also provide their own fitting functions based on experience. We use loss traces collected during profiling to fit Eqn. 3 and obtain a set of coefficients for each task $j$. Specifically, we assume the joint transfer learning task follows a similar training loss convergence pattern as normal transfer, as investigated by previous studies [39,49] and empirically validated in Table 2 (Iteration-Transfer). Then, we use the estimated transfer gain $G_{i,j}$ to derive the normalized loss curve as $\mathcal{L}_{i,j}(t) = \mathcal{L}_j(t \cdot (1+G_{i,j}))$ for either spatial or temporal transfer learning from task $i$ to $j$. A higher $G_{i,j}$ thus reduces the number of training iterations. Lastly, we use this loss to estimate $I_j$. For temporal transfer learning from task $i$ to $j$, we calculate $I_{i \mapsto j}$ by applying Eqn. 2 to $\mathcal{L}_{i,j}(t)$. For spatial transfer learning, the estimated number of iterations $I_{i \parallel j}$ (Eqn. 4) aggregates the per-task estimates, where for a task $k \in \{i, j\}$, $I_k$ is obtained from Eqn. 2 with $\mathcal{L}_{i,k}(t)$, $B_k$ is the global batch size, and $D_k$ is the training set size.

Error Analysis. We report the mean/max absolute percentage error (APE) for different FMs with normal transfer in the fifth and sixth columns of Table 2 (Iteration). We use the transfer gain to predict the corresponding training iterations for both temporal and spatial transfer learning; the prediction errors of the iteration estimator for both modes are presented in the seventh and eighth columns of Table 2 (Iteration-Transfer). The estimation error of Iteration-Transfer is typically larger than that of Iteration, owing to the accumulated estimation error from the transferability estimator. The maximal prediction APE remains within an acceptable range (40%). Our iteration estimator thus performs well in estimating the required number of iterations.

Sensitivity Analysis. We use the same technique as for the transferability estimator to analyze the sensitivity of the iteration estimator's error in Figure 8(b). Our findings indicate that the JCT speedup gradually decreases as the noise scale increases. Even when the noise scale is up to 40%, the task merger still decreases the JCT. Moreover, Vicuna benefits from the added noise to a certain degree, which might result from the internal prediction error of our iteration estimator.
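A minimal sketch of this estimation pipeline, assuming profiled (step, loss) traces and using SciPy for the non-negative fit of Eqn. 3; the function names are ours, not Ymir's:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(t, b3, b2, b1, b0):
    # Eqn. 3: second-order extension of the Optimus fitting function.
    return 1.0 / (b3 * t**2 + b2 * t + b1) + b0

def fit_loss_curve(steps, losses):
    # Bounds enforce the non-negative coefficients the estimator requires.
    betas, _ = curve_fit(loss_curve, steps, losses, bounds=(0.0, np.inf))
    return betas

def predict_iterations(betas, target_loss, gain, max_steps=10**6):
    # Transfer gain rescales the curve: L_{i,j}(t) = L_j(t * (1 + G_{i,j})).
    for t in range(1, max_steps):
        if loss_curve(t * (1.0 + gain), *betas) <= target_loss:
            return t  # Eqn. 2: first step whose predicted loss hits the target
    return max_steps
```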
Time Estimator
After obtaining $I_j$ from the iteration estimator, the next step is to attain the job speed under a given resource allocation. Considering the fixed backbone architecture of FMF workloads, our time estimator provides accurate job speed via offline profiling. We utilize a simple yet effective method called the lookup table (LUT). It accepts resource allocations and training configurations as input and returns the job speed of each training iteration. In particular, the LUT constructs a map $S(g, \text{cfgs})$, where $g$ is the number of GPUs assigned to the job and cfgs are the training configurations. Given such information, we obtain the execution time of task $j$ as:

$T_j = I_j \cdot S(g, \text{cfgs})$.    (5)

The execution time of temporal transfer learning from task $i$ to $j$ and of spatial transfer learning is denoted as $T_{i \mapsto j}$ and $T_{i \parallel j}$, respectively; their main difference is reflected in the calculation of $I$ in §4.2. Specifically, cfgs include $\{g, a, b, \ell, m, c, p\}$, where $g$ is the number of GPUs assigned to the workload, $a$ is the number of gradient accumulation steps, $b$ is the local batch size per device, $\ell$ is the number of frozen layers during fine-tuning, $m$ is a boolean value for automatic mixed-precision training, $c$ is a boolean value for gradient checkpointing, and $p$ is a boolean value for parameter-efficient transfer learning. The pipeline parameter (Table 1) further implies the selection of data parallelism or pipeline parallelism, which will be discussed in §6.1. Building the LUT offline poses a great challenge due to the large number of potential configurations. We reduce the number of configurations that need profiling and complete the offline profiling within 5 hours per FM. We continuously update the LUT online to minimize the gap between the LUT and practical scenarios.

Estimation Error Handling of YmirEstimator. From the sensitivity analyses of the transferability estimator and iteration estimator, Ymir achieves a satisfactory speedup even when our estimators' predictions are not sufficiently accurate, which highlights the robustness of our system. However, it is imperative to proactively address potential estimation errors of YmirEstimator, as they could undermine model accuracy and impede training progress. We monitor the accuracy changes of merged tasks to prevent these issues.
For temporal transfer $i \mapsto j$, we assess the validation accuracy of task $j$ during the accuracy evaluation stage. If temporal transfer fails to enhance the accuracy of task $j$ in the first two epochs, we disable it and schedule both tasks independently. For spatial transfer, if the accuracy of either task $i$ or $j$ does not improve in the first two epochs, we decouple the spatial transfer and schedule both tasks separately.
Overhead Analysis of YmirEstimator. The overhead of YmirEstimator consists of the workload profiling and the ML model estimation within a scheduling interval. The workload profiling overhead is discussed in §7.5. The maximal ML model estimation overhead for ViT-B, RoBERTa-B, and Vicuna-7B is 7.8, 8.6, and 11.2 seconds, respectively. Overall, the estimation overhead is acceptable compared to the FMF workload execution time (tens of minutes).
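To tie the pieces together, here is a simplified sketch of the lookup-table step of Eqn. 5; the configuration keys and timing entries are placeholders, not measured values:

```python
# Offline-profiled per-iteration time (seconds), keyed by (gpus, cfgs tuple).
# Entries here are placeholders, not measured numbers.
LUT = {
    (4, ("accum=1", "bsz=16", "frozen=0", "amp", "ckpt", "petl")): 0.42,
    (8, ("accum=1", "bsz=16", "frozen=0", "amp", "ckpt", "petl")): 0.24,
}

def execution_time(iters, gpus, cfgs):
    # Eqn. 5: T_j = I_j * S(g, cfgs)
    return iters * LUT[(gpus, cfgs)]

cfgs = ("accum=1", "bsz=16", "frozen=0", "amp", "ckpt", "petl")
print(execution_time(5000, 8, cfgs))  # predicted seconds on 8 GPUs
```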

YMIRSCHED
In YmirSched, we first introduce the task merger, which determines task combinations and transfer learning modes. Next, we discuss how YmirSched addresses special cases and scalability issues.

Task Merger
Fairness objective. Achieving resource allocation fairness in workload scheduling is critical to incentivizing users to share GPU resources [50,62]. Fairness aims to assign GPU resources evenly across all FMF workloads. Formally, given a set of $J$ tasks $\mathcal{J} = \{\mathcal{J}_1, \ldots, \mathcal{J}_J\}$ and a cluster of $G$ GPUs, YmirSched optimizes a fairness-aware scheduling objective (Eqns. 6-8) subject to constraints (9)-(11), where $Z(G) = \{1, \ldots, G\}$. Here, $x_{j,g}$ is a binary variable denoting whether $\mathcal{J}_j$ is allocated $g$ GPUs; $y_{i,j,g}$ is a binary variable denoting whether to allocate $g$ GPUs and use temporal transfer learning from $\mathcal{J}_i$ to $\mathcal{J}_j$; and $z_{i,j,g}$ is a binary variable denoting whether to allocate $g$ GPUs and use spatial transfer learning between $\mathcal{J}_i$ and $\mathcal{J}_j$. The ratio $T_{j,g}/T_{j,\hat{g}}$ measures the reciprocal of the job speedup brought by elastic training; the definition of this objective is inspired by previous fairness schedulers [50,62]. Note that we use $2T_{i \mapsto j, 2\hat{g}}$ (resp. $2T_{i \parallel j, 2\hat{g}}$) to compute the slowdown of the merged task. Constraint (9) ensures at most one allocation policy for each job. Constraint (10) guarantees no overlap between individual and merged workloads in resource allocations. Constraint (11) ensures the total number of allocated GPUs does not exceed the cluster capacity.
The numerator of each equation is the JCT of executing $i$ and $j$ with the Shortest Remaining Time First (SRTF) scheduling algorithm. The denominator of Eqn. 12 is the JCT of executing $i$ and then $j$ with temporal transfer learning; the denominator of Eqn. 13 is the JCT of executing $i$ and $j$ together with spatial transfer learning.
Using an Integer Linear Programming (ILP) solver, we obtain a solution to Eqn. 7, i.e., the resources allocated to each job and the transfer learning mode. Then, we pack each workload onto as few nodes as possible to minimize the communication overhead.
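A condensed sketch of this program using the PuLP solver, keeping only the no-merge variables x[j][g] together with constraints (9) and (11); the merge variables for temporal/spatial transfer and constraint (10) follow the same pattern and are omitted for brevity, and the waiting penalty stands in for the exact objective of Eqn. 7.

```python
import pulp

def schedule(jobs, G, T, wait_penalty=1e6):
    """jobs: job ids; G: total GPUs; T[j][g]: estimated JCT of job j on g GPUs."""
    prob = pulp.LpProblem("ymir_sched", pulp.LpMinimize)
    gpus = list(range(1, G + 1))
    x = pulp.LpVariable.dicts("x", (jobs, gpus), cat="Binary")
    alloc = {j: pulp.lpSum(x[j][g] for g in gpus) for j in jobs}
    # Minimize estimated completion times; penalize jobs left in the queue.
    prob += (pulp.lpSum(T[j][g] * x[j][g] for j in jobs for g in gpus)
             + pulp.lpSum(wait_penalty * (1 - alloc[j]) for j in jobs))
    for j in jobs:
        prob += alloc[j] <= 1          # (9): at most one allocation policy
    prob += pulp.lpSum(g * x[j][g] for j in jobs for g in gpus) <= G  # (11)
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {j: g for j in jobs for g in gpus if pulp.value(x[j][g]) == 1}
```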

Discussion
Workload Profiling. YmirSched needs to provide profiling resources for new workloads to gather statistical information. YmirSched does not consider joint fine-tuning for profiling workloads. Additionally, the allowable resource allocations for profiling workloads are one GPU for data-parallel workloads and four GPUs for pipeline-parallel workloads.

Pipeline Workload Scheduling. Following the typical resource request practice of pipeline-parallel workloads [35,54], we restrict the resource allocation set to $\mathcal{A} = \{4n \mid n \in \mathbb{N}\}$, reserving entire GPU servers for each pipeline-parallel workload. The job throughput of pipeline-parallel workloads depends on several configurations (e.g., model partition, the number of pipelines). Given the fixed backbone architecture, we profile these configurations offline and use them during model execution.

Scalability. In solving the above optimization problem, the complexity of YmirSched scales quadratically with the number of jobs. In practice, YmirSched can quickly filter out unpromising task combinations (e.g., TranWt < 1) to reduce the number of optimization variables. We provide further investigations in §7.2 to validate its scalability.

Machine Failure Handling. In the event of machine failures, the default epoch-based checkpointing allows us to resume from the latest checkpoint. Moreover, we maintain the transfer learning modes and resume the execution of FMF workloads at the next scheduling interval (at most 120 seconds later). Efficiency might be undermined slightly in this scenario; we leave further optimization as future work.

YMIRTUNER
We introduce the task constructor and the pipeline switch, which reuse the fine-tuning runtime of FMF workloads to improve efficiency and mitigate the context switch overhead.

Task Constructor
The task constructor has two main functions. First, it supports the three transfer learning modes illustrated in Figure 2. The only difference between normal and temporal transfer learning is the path storing the initialization weights. For spatial transfer learning, the task constructor adopts the same hyperparameters (e.g., learning rate, batch size) to fine-tune task-specific inputs. The dataloader adopts annealed sampling [47] to yield the inputs.
Second, the task constructor decides the configurations of data and pipeline parallelism for high throughput. It adopts Parameter-Efficient Transfer Learning (PETL), a common practice in fine-tuning FMs, to enable data parallelism for FMF workloads. With PETL, we fine-tune a small portion of task-specific parameters instead of the entire model, reducing GPU memory consumption. As such, we can execute most fine-tuning tasks in a data-parallel manner and take advantage of its benefits, e.g., elastic training and performance modeling. There are different types of PETL architectures [32,33], and we choose the unified architecture proposed in [31]. In particular, the task constructor decides the number of gradient accumulation steps $a$ to alleviate GPU memory consumption in the case of a large batch size. Additionally, it supports pipeline parallelism when requested by users; it profiles the optimal pipeline stages and model partition offline and adopts them on demand. In our evaluation, only fine-tuning Vicuna-7B on the ROC dataset [53] adopts pipeline parallelism for better throughput, given the large batch size (96) and model parameter size (7 billion). §5.2 has discussed scheduling pipeline-parallel workloads.
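A minimal sketch of PETL-style freezing plus gradient accumulation as the task constructor might apply them; this is illustrative and does not reproduce the unified architecture of [31] (the model is assumed to expose HuggingFace-style adapter parameter names and a .loss output).

```python
import torch

def apply_petl(model, adapter_keyword="adapter"):
    # Freeze the FM backbone; train only the small task-specific modules.
    for name, param in model.named_parameters():
        param.requires_grad = adapter_keyword in name
    return [p for p in model.parameters() if p.requires_grad]

def train_step(model, optimizer, batches, accum_steps):
    # Gradient accumulation (cfg `a`) emulates a large global batch size
    # while keeping per-device memory bounded.
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        loss = model(x, labels=y).loss / accum_steps
        loss.backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```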

Pipeline Switch
The context switch between FMF workloads restricts scheduling flexibility and delays job progress, especially for short-term jobs. Based on the analysis in §2.2, we consider hiding the overhead of parameter loading and dataloader preparation for pipeline- and data-parallel workloads.
First, we hide the latency between loading weights and launching the CUDA stream to execute gradient computation for pipeline-parallel workloads. We propose to pipeline the gradient computation of task $A$ and the parameter transmission of task $B$, as illustrated in Figure 9. Each machine maintains the entire model structure and partial model parameters. Both $A$ and $B$ adopt pipeline parallelism on a 4-GPU machine, and the FM is partitioned into four parts. For naming conventions, we use the subscripts 1-4 to denote the partitions, and the superscripts $f$, $b$, and $t$ to represent forward propagation, backward propagation, and parameter transmission, respectively. When the context switch happens between $A$ and $B$, we overlap the parameter store of $A$ and the parameter load of $B$ across machines. We also pipeline the gradient computation and parameter transmission as much as possible within each machine. To this end, we require $B$ to compute from machine 4 to machine 1. On machine 4, after completing $A_4^b$, we save the partial parameters of $A$. Next, the partial parameters of $B$ are loaded into machine 4, and $B_1^f$ starts execution. Note that our pipeline schemes differ from PipeSwitch [7] in two aspects: (1) we consider pipeline parallelism while PipeSwitch only focuses on single-GPU tasks; (2) the reverse direction of model execution between task $A$ (machines 1 to 4) and task $B$ (machines 4 to 1) facilitates hiding the latency between the parameter store of task $A$ and the parameter load of task $B$, which PipeSwitch cannot achieve.
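The sketch below conveys the overlap idea on a single machine: parameter D2H/H2D copies run on a side CUDA stream so they can proceed concurrently with computation on the default stream. This is a simplification of the full cross-machine pipeline in Figure 9 and assumes pinned CPU buffers for truly asynchronous copies.

```python
import torch

copy_stream = torch.cuda.Stream()

def switch_partition(old_params, new_params_cpu, compute_fn):
    """Overlap task A's parameter store / task B's load with computation."""
    with torch.cuda.stream(copy_stream):
        # D2H: save task A's partition (pinned CPU buffers enable async copy).
        saved = [p.to("cpu", non_blocking=True) for p in old_params]
        # H2D: load task B's partition onto the GPU.
        loaded = [p.to("cuda", non_blocking=True) for p in new_params_cpu]
    out = compute_fn()                      # runs on the default stream
    torch.cuda.current_stream().wait_stream(copy_stream)  # sync before use
    return saved, loaded, out
```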
Second, we hide the latency between dataloader preparation and model execution for data-parallel workloads. Dataloader preparation mainly involves spawning multiple processes for efficient data loading and preprocessing; it does not request GPU resources and introduces little overhead for the main process. Hence, we implement a simple handler for user signals (e.g., SIGUSR1 in UNIX) to accomplish on-demand dataloader preparation ahead of time. In each scheduling interval, YmirSched notifies YmirTuner to prepare the dataloaders for preempted tasks 30 seconds ahead.
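A minimal sketch of the signal-based early dataloader preparation, assuming one UNIX process per workload; the dataset here is a placeholder.

```python
import signal
import torch
from torch.utils.data import DataLoader, TensorDataset

next_task_dataset = TensorDataset(torch.zeros(64, 2))  # placeholder dataset
prepared = {}

def prepare_dataloader(signum, frame):
    loader = DataLoader(next_task_dataset, batch_size=4,
                        num_workers=8, pin_memory=True)
    # Creating the iterator is what actually spawns the worker processes,
    # so the context switch later only has to attach a warm iterator.
    prepared["it"] = iter(loader)

# YmirSched signals the workload process ~30 seconds before resumption.
signal.signal(signal.SIGUSR1, prepare_dataloader)
```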
We emphasize that the benefits of our proposed pipeline switch depend on the PCIe bandwidth. With increased bandwidth, the overhead of context switching diminishes, resulting in shorter execution time; the ratio of context switch overhead to computation time decreases, making computation the new bottleneck. Moreover, the pipeline switch alleviates the context switch overhead, thereby helping enhance hardware utilization.

EVALUATION
We first present the setup of our evaluation experiments in §7.1. Then, we perform physical and simulation experiments for three FMs to validate the effectiveness and scalability of Ymir in §7.2. Next, we analyze the impact of several key system components in §7.3-§7.5.
Cluster testbed. We conduct physical experiments on a cluster of 8 GPU nodes. Each node has 4× Tesla V100 SXM2 32 GB GPUs, 1× 200 Gb/s HDR InfiniBand, 64 CPU cores, and 256 GB memory, connected via PCIe-III. We evaluate Vicuna-7B on GPU servers with A100 SXM4 80 GB GPUs due to its high GPU memory consumption. Our physical implementation is built upon Pollux [62]. We use CephFS 14.2.8 to store checkpoints. Additionally, we set the cluster capacity to 60 4-GPU nodes in our simulations to demonstrate the scalability of Ymir.

FMF tasks. We evaluate Ymir on 9 vision datasets, 9 language understanding datasets, and 9 language generation datasets for ViT-Base, RoBERTa-Base, and Vicuna-7B, respectively. We conducted a hyperparameter sweep to search for each task's optimal learning rate and batch size. As we evaluate Ymir on 27 different FMF tasks, we present the full suite of FMF tasks, including hyperparameters and target validation metrics, in Part A of our project website.

Workloads. Our evaluation workloads are sampled from a trace from Shanghai AI Lab, where users submit extensive jobs related to FMs. For the physical evaluation, we sample 120-240 jobs for different FMs and construct one workload, in consideration of the high experiment cost.
For the large-scale simulation experiments, we sample 1500-3000 jobs for different FMs and construct three workloads for evaluation. The number of sampled jobs is based on the model scale to match the GPU time usage of our adopted trace. We follow Pollux's workload generator to synthesize our evaluation workloads. Specifically, we categorize FMF tasks based on their GPU time and set the probability of generating these jobs according to their scales. The detailed workload synthesis can be found in Part A of our project website.
Baselines. In the physical experiments, we compare Ymir with three schedulers: Tiresias [26], Optimus [59], and Pollux [62], all implemented atop Pollux's official implementation. Tiresias fixes the number of workers for each workload. Similar to Ymir, Optimus and Pollux dynamically change the number of workers to maximize cluster-wide performance. However, due to the sensitivity of FMF workloads to batch size [44,72], we disable GNS [52] from tuning the batch size for Pollux throughout training (GNS leads to NaN issues when fine-tuning Vicuna on COQAQG [67]). Besides, we compare with Themis [50] to show how Ymir balances fairness and efficiency, and add preemptive SRTF to further reinforce the effectiveness of Ymir. Following Pollux's practice, we construct our simulator, detailed in Part A of our project website. We set the lease term interval of Themis to 600 seconds. The scheduling interval of Pollux and Optimus is set to 300 seconds due to the exorbitant context switch overhead, while the scheduling interval of Tiresias, Themis, and SRTF is set to 120 seconds because of their infrequent resource re-allocations. Thanks to the pipeline switch, Ymir adopts a short scheduling interval of 120 seconds. To show the generality and applicability of Ymir, we choose three representative FMs (ViT-Base, RoBERTa-Base, and Vicuna-7B) and evaluate them on 9 vision, 9 language understanding, and 9 language generation datasets; more detailed descriptions of the datasets are available in Part A of our project website.

Simulator fidelity. To validate the fidelity of our simulator, we measure the difference in average JCT and tail JCT between the simulation and physical experiments in Table 3. The average JCT gap is within 10%, and the tail JCT difference is around 10%. This shows that our simulator provides reliable and accurate evaluation results. Unless otherwise specified, we use our simulator in §7.3-§7.5.

End-to-end Performance
Physical evaluation results. We adopt the average JCT and 99% tail JCT to measure the efficiency of Ymir. Figure 10 presents the performance of Ymir and the baselines over different FMs, normalized to Ymir, and also reports the absolute average and tail JCT (seconds) of Ymir. Ymir reduces the average JCT by 1.11-4.34× and the tail JCT by 0.89-3.56×. Unlike what is reported in [62], Pollux and Optimus do not outperform Tiresias considerably for language FMs; the frequent resource re-allocations might delay job progress and degrade the performance benefit of elastic training. Besides, Vicuna attains better performance improvements than smaller FMs, as larger FMs facilitate task transferability and perform well in model generalization; §7.5 provides empirical evidence that Vicuna enjoys the largest JCT speedup brought by the task merger.

We terminate FMF workloads when the accuracy reaches the validation target or the epoch limit. However, an important question is whether transfer learning would harm model performance. Table 4 presents the maximal, minimum, and average relative accuracy improvement of tasks fine-tuned with temporal and spatial transfer compared to normal transfer. Vicuna attains a maximal 68.82% improvement on the BLEU metric of SAMSUM [25] with spatial transfer with DA [45], and the minimum accuracy improvement is no less than zero. In summary, both temporal and spatial transfer improve model accuracy, in line with previous works [4,61,69] showing that transfer learning can improve model performance. Moreover, the fractions of tasks participating in different transfer learning modes are shown in Table 5. About 20-30% of workloads are assigned temporal or spatial transfer learning modes. Different FMs present various preferences toward transfer learning modes, and no single transfer learning mode dominates.

Large-scale simulation. We use our simulator to conduct large-scale simulation experiments. We set the cluster capacity to 60 4-GPU nodes and vary the job load from 1× to 2×, where 1× corresponds to 1500-3000 jobs depending on the model scale. Figure 11 shows that Ymir achieves 1.66-22.3× JCT speedup across different job loads and FMs, and also reports the average JCT (seconds) of Ymir. The speedup gain of Vicuna is more significant than that of smaller FMs, especially compared to Optimus. Pollux cannot perform satisfactorily in large-scale simulation experiments due to the high search cost of its evolutionary algorithm. As the job load increases, Ymir presents a better JCT speedup, since a higher job load potentially brings more beneficial task combinations and thus more chances to reduce the JCT. Besides, the maximal/average ILP solver latency at 2× job load is 0.23/5.43 seconds using one CPU core, which has no significant impact on scheduling performance.

Fairness. Figure 12 compares the cumulative distribution function (CDF) of the finish time fairness (FTF) metric between Ymir and other fairness baselines (Pollux, Themis, and Tiresias) for RoBERTa-Base and ViT-Base. We follow Shockwave's implementation [95] to compute FTF and draw the CDF curve under 1× job load. Ymir outperforms existing fairness baselines considerably and even achieves zero FTF loss in the case of ViT-Base. We conclude that the benefits brought by transferability can enhance both efficiency and fairness.

Evaluation of YmirEstimator
Time estimator. The key component of the time estimator is the LUT. To evaluate its robustness, we manually add uniform random noise to the result of the LUT before reporting it. Figure 13 varies the degree of the added noise (x-axis) and presents the speedup (y-axis) compared to Ymir without the task merger. Increasing the estimation error has no significant impact on scheduling efficiency, primarily because the scheduling objective (Eqn. 6) is not sensitive to the throughput estimation error.

Transfer learning modes. Figure 14(a) investigates the contributions of different transfer learning modes to scheduling performance improvement over different FMs. No single transfer learning mode dominates across all FMs. Nevertheless, when both transfer learning modes are jointly considered, the scheduling performance experiences a further enhancement, except that temporal transfer learning degrades the JCT speedup brought by spatial transfer learning on Vicuna. This could arise from the prediction error of the adopted estimator.

Impact of LUT and Pipeline Switch
Performance contribution of LUT. We compare the LUT with the throughput estimator adopted in Pollux. Table 6 (row w/ LUT) reports the JCT of the throughput estimator normalized to that of our LUT. We observe that the LUT is more beneficial to language FMs than to the vision FM; little performance gain is shown for ViT-Base. The efficiency of the throughput estimator relies on the assumption that job throughput scales linearly with the batch size and the number of allocated GPUs. Its effectiveness is extensively validated on vision tasks [62], but is not satisfactory for language tasks.

Pipeline dataloader and model preparation. In this paper, we use PETL to reduce the number of parameters whose gradients are computed and communicated for most FMF workloads. Hence, most FMF tasks adopt data parallelism, and pipelining the parameter transfer with gradient computation is insignificant in this scenario. Instead, dataloader preparation becomes the performance bottleneck that restricts scheduling flexibility. Ymir proactively invokes this step to hide the data preparation to the greatest extent before fine-tuning the next FM. Table 6 (row w/ data pipe) shows the JCT without the dataloader pipeline normalized to that with the dataloader pipeline. The pipelined dataloader brings 1.1-1.7× JCT speedup.
Pipeline parameter transfer and model execution. We propose to execute the context switch between two pipeline-parallel workloads in a pipelined way. This pipeline context switch can considerably reduce the exorbitant time cost of the context switch, although it is not applicable to all FMF tasks. We mainly examine how this pipeline practice benefits fine-tuning Vicuna on ROC [53]. It does not bring apparent cluster-wide JCT speedup but reduces the JCT of tasks fine-tuning Vicuna on ROC by around 4%.

Impact of Transferability Estimation
Impact of transferability metrics. We categorize existing metrics for task transferability estimation into probability-based, feature-based, and gradient-based methods. (1) LEEP [55] is a representative probability-based method incorporating the entire dataset to estimate the data distribution accurately. The computation overhead of LEEP scales with the dataset size, and the estimation accuracy of probability-based methods closely correlates with the number of classes [11]. Hence, LEEP fails on regression and generation tasks (e.g., Vicuna). (2) Task2Feat [80] is a feature-based method that extracts the penultimate layer's features over the entire dataset and designs various metrics to measure the similarity between tasks; its computation overhead is exorbitant when the number of examples is enormous. (3) Our adopted Task2Vec [3] is a gradient-based method, which uses a subset of the dataset to quantify the transferability between tasks. We compare the speedup brought by the task merger under these three transferability estimation metrics in Table 7 and find that Task2Vec achieves the best JCT speedup over different FMs. Task2Feat falls behind in JCT speedup, and LEEP even has adverse effects on cluster-wide efficiency for ViT-Base. The maximal profiling overhead of the various metrics is also shown in Table 7; Task2Vec considerably reduces the overhead compared to the other baselines on language FMs. Overall, Task2Vec is a suitable metric for transferability estimation.
Sensitivity to the number of datasets. We vary the number of datasets from 3 to 9 and present the JCT speedup of Ymir with and without the task merger in Figure 14(b). The task merger attains at least 1.3× JCT speedup over different numbers of datasets and FMs. We acknowledge that the JCT speedup brought by the task merger correlates with the intrinsic task transferability; our sensitivity analysis demonstrates that the performance improvement does not arise from cherry-picked datasets.

RELATED WORKS
Transfer learning. Initially, this technique aimed to transfer the weights of a pre-trained model to downstream tasks to reduce training time and data requirements [88]. Many works adopt heuristic methods [3,10,55,80,89] to determine the optimal pre-trained model for initialization based on task similarity. Additionally, some works estimate the performance of different transfer learning modes [3,10,22,34,55,76,80,89], as discussed in §2.1. Other works [18,82,83] morph a well-trained model into a new one to warm-start training. These advancements in transfer learning can be leveraged to further improve Ymir.

DLT schedulers. Recent efforts on DLT schedulers primarily focus on effective resource allocation for data-parallel training [9,16,26,36,38,59,62,86,95]. Nevertheless, existing DLT schedulers cannot adapt to FMF workloads because they overlook the optimization opportunities presented by the unique characteristics of FMF workloads. While Titan [24] focuses on scheduling pipeline-parallel FMF workloads in GPU datacenters, it lacks a systematic solution that exploits task transferability to enhance overall cluster-wide efficiency. Ymir automates task merging and optimizes resource allocations; furthermore, it reduces the context switch overhead for both data- and pipeline-parallel workloads.
Fine-tuning FMs. Recent advances in model fine-tuning are primarily limited to individual jobs, from both the algorithm and system perspectives. Many PETL architectures have been proposed to improve model accuracy on language tasks [27,32,33,43,60,91,94] and vision tasks [17,57,74,90]. Apart from [31], Ymir can utilize another unified PETL architecture, UniPELT [51], which learns to activate PETL architectures for downstream tasks. These works attain competitive model accuracy compared to fine-tuning all parameters. Apart from the algorithmic innovations, several system works [8,23,66,70] provide efficient pipeline parallelism for FMF workloads. Different from these single-workload optimizations, Ymir optimizes cluster-wide FMF workloads.

DISCUSSION
Extensions to other transfer learning modes. Ymir considers combining at most two tasks. Intuitively, jointly fine-tuning more tasks could increase the potential benefit of transfer learning, but the lack of ML studies that estimate transfer gains when combining multiple (≥ 3) tasks impedes such combinations. Moreover, our empirical results show that merging two tasks already yields sufficiently good results.

Managing multiple FMs. This paper mainly evaluates the scenario with one FM, whereas a datacenter can host numerous FMs for fine-tuning. In that case, we can adopt a load-balancing policy to determine the GPU quota for each FM; more sophisticated designs are left as future work. Nevertheless, our empirical results have demonstrated the potential of Ymir in improving efficiency.

Catastrophic forgetting. Temporal transfer learning is susceptible to catastrophic forgetting. Fortunately, many works [46,78,92] point out that PETL can effectively avoid catastrophic forgetting. Empirically, our adopted Task2Vec metric can identify positive temporal transfer to mitigate catastrophic forgetting.

Privacy concerns. Ymir merges FMF workloads from different users to achieve high efficiency. Although Ymir shares only parameters, not datasets, there is still a potential privacy threat from malicious users, e.g., membership inference [71]. To handle this, Ymir allows users to disable parameter sharing.

CONCLUSION
This paper presents Ymir, a novel scheduler tailored for FMF workloads in GPU clusters.We propose YmirEstimator and YmirSched to determine the optimal transfer learning modes and resource allocations.We design YmirTuner to improve the efficiency of individual FMF workloads with PETL architectures and pipeline schemes.Our extensive experiments demonstrate that Ymir outperforms existing DLT schedulers in job efficiency and resource allocation fairness.
ACKNOWLEDGMENTS
This research is supported by the Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s).

Figure 2: Illustration of different transfer learning modes. (a) Normal transfer: the downstream model is fine-tuned from the pre-trained weights (blue trapezoid). (b) Temporal transfer: task B is fine-tuned from the FM previously fine-tuned on another task A. (c) Spatial transfer: both tasks A and B are fine-tuned together.

Figure 3: Transfer learning performance: (a) QQP accuracy in temporal transfer learning on RoBERTa-Base; (b) ImageNet75 accuracy in spatial transfer learning on ViT-Base.

Figure 5: The TTA speedup box plots of (a) temporal transfer and (b) spatial transfer across various FMs.

Figure 6: Workflow of Ymir. It comprises three key designs: (1) YmirEstimator estimates the execution time of FMF workloads; (2) YmirSched determines the task merging scenarios and resource allocations; (3) YmirTuner provides efficient context switch for FMF workloads.

Table 1: Description of System Parameters in Ymir.
Model: the model name.
Dataset: a path (e.g., AWS S3) where training and evaluation samples are stored.
Hyperparam: batch size, learning rate, optimizer, etc.
Target: the job completion criteria, including a maximum number of iterations and an accuracy target.
Sharing: whether to share parameters with other tasks.
Pipeline: whether to adopt pipeline parallelism.

Figure 7: The workflow of YmirEstimator. It contains three components: (1) the transferability estimator estimates the transfer gain between new requests and other FMF requests; (2) the iteration estimator estimates the number of iterations needed to reach the target accuracy in different transfer learning modes; (3) the time estimator estimates the execution time of new FMF requests.

Figure 8: Sensitivity analysis of the Transferability Estimator (a) and Iteration Estimator (b) on the JCT speedup with vs. without the task merger.

Figure 9: Pipeline model propagation and parameter transmission. D2H indicates saving parameters from device (GPU) to host (CPU); H2D indicates loading parameters from host (CPU) to device (GPU).

Figure 10: Physical evaluation results over different FMs.

Figure 11: Scheduling efficiency results over FMs and job loads in simulation experiments. The y-axis is the JCT normalized to Ymir, while the x-axis is the job load.

Figure 13: Sensitivity analysis of the Time Estimator.

Figure 14: Impact of key components.

Table 2: Prediction accuracy of YmirEstimator. Columns: Model; Transferability (Pearson's r ↑, MAPE (%) ↓, ACC (%) ↑); Iteration (APE: Max (%) ↓, Mean (%) ↓); Iteration-Transfer (APE: Max (%) ↓, Mean (%) ↓).

Table 3: JCT difference between the simulator and the physical implementation.

Table 4: Accuracy improvement over normal transfer.

Table 5: Fractions of tasks participating in different transfer modes in the physical experiment.

Table 6: Speedup brought by the LUT and the Pipeline Switch.

Table 7: Performance of various transferability metrics.