Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity

Multitask learning is a widely used paradigm for training models on diverse tasks, with applications ranging from graph neural networks to language model fine-tuning. Since tasks may interfere with each other, a key notion for modeling their relationships is task affinity. This includes pairwise task affinity, computed among pairs of tasks, and higher-order affinity, computed among subsets of tasks. Naively computing either of them requires repeatedly training on data from various task combinations, which is computationally intensive. We present a new algorithm, Grad-TAG, that can estimate task affinities without this repeated training. The key idea of Grad-TAG is to train a "base" model for all tasks and then use a linearization technique to estimate the loss of the model for a specific task combination. The linearization works by computing a gradient-based approximation of the loss, using low-dimensional projections of gradients as features in a logistic regression to predict labels for the task combination. We show that the linearized model can provably approximate the loss when the gradient-based approximation is accurate, and we also verify this empirically on several large models. Then, given the estimated task affinity, we design a semi-definite program for clustering similar tasks by maximizing the average density of clusters. We evaluate Grad-TAG's performance across seven datasets, including multi-label classification on graphs and instruction fine-tuning of language models. Our task affinity estimates are within 2.7% distance of the true affinities while needing only 3% of the FLOPs of full training. On our largest graph, with 21M edges and 500 labeling tasks, our algorithm delivers estimates within 5% distance of the true affinities, using only 112 GPU hours. Our results show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches.


INTRODUCTION
Modern applications of neural networks often employ a single neural network for prediction or classification on multiple tasks. This multitask learning setup is widely used across a variety of settings, with examples such as a visual system that aims to simultaneously detect various objects in autonomous driving [50], a Graph Neural Network for community detection on large networks [29], and prompt-tuning of pre-trained LLMs for NLP tasks [34]. This multitask learning setup is not only computationally efficient (a single network can jointly predict many tasks), but it often improves prediction accuracy due to transfer learning.
The often implicit assumption behind multitask modeling is that there is a positive transfer effect among tasks [8]. However, as the number of tasks increases, one frequently observes a negative transfer effect in many applications, such as prompt tuning of large language models, where adding a task to the model degrades performance on one or more tasks [58-60, 54]. This observation has motivated a line of work that aims to group the tasks into subsets such that negative transfer among tasks within a subset is minimized, allowing one to train a separate multitask model per subset, thereby improving performance on all tasks [29].
A key concept underlying many multitask learning algorithms is a notion of task affinity, which can capture the abovementioned positive or negative transfer effects across tasks in a precise way. For instance, one can compare the pairwise task affinity [50, 14], i.e., the loss of a model trained on each pair of tasks, against the loss of a model trained on each task alone. Given a notion of task affinity, a common recipe for designing multitask learning algorithms involves (1) task affinity computation, which builds a task affinity matrix, (2) task grouping, which uses this task affinity matrix to group tasks with positive transfers together, and finally (3) multitask training, which fits a separate model per task group.
The performance improvement achieved through this paradigm depends on the notion of task affinity and the grouping procedure. Moreover, the ability to leverage this paradigm hinges on the computation of task affinity (Step 1 above), which becomes expensive as the number of tasks grows. As a case in point, the computational complexity of pairwise task affinity scales quadratically with the number of tasks: this implies that even for community detection with 100 labelings, using pairwise task affinity requires training nearly 5000 models to compute the affinity matrix. In this paper, we scale up this multitask learning paradigm by dramatically speeding up the first step of task affinity computation for two canonical examples of task affinities: pairwise and higher-order task affinity (see Examples 2.2, 2.3). In our experiments on various real-world datasets representing different applications, our algorithm can reduce the task affinity computation time by nearly 32× compared to full model training while incurring less than 2.7% error. In addition to this dramatic efficiency improvement, we also design a more robust method for task grouping (Step 2). Taken together, these new techniques match or improve the performance of previous multitask models.
The primary challenge for task affinity computation is how to avoid training a large number of multitask models on various task combinations. The key technical insight behind our algorithm is to leverage a linearization property of deep neural networks, including large language models. The linearization property of a neural network means that we can approximate the model loss for a pre-trained meta-initialization and an input/output pair by using a gradient-based Taylor's expansion centered at the meta-initialization. This linearization property has been observed for large language model fine-tuning in recent works, albeit not for the purpose of multitask learning [37, 38, 57]. Here, we leverage linearization to estimate task affinities efficiently by using the first-order Taylor expansion from a pre-trained model, thereby saving the computation of backpropagation during model fine-tuning. This algorithm, Grad-TAE, is illustrated in Figure 1.
In more detail, we first compute the gradients at the initialization and then map the gradients to task labels with logistic regression. The dimension of this regression can be high, especially for heavily parameterized models. Thus, we use a dimension reduction technique and apply the Johnson-Lindenstrauss Lemma to give an error analysis. In experiments on datasets with 100 tasks, we show that this approach estimates pairwise task affinity with 45× fewer FLOPs and 11× fewer GPU hours than fully computing the true scores, with only 5.7% relative error. For higher-order task affinity, our approach uses 32× fewer FLOPs and 5× fewer GPU hours, with only 2.7% relative error. Furthermore, our approach also scales to a large graph with over 21M edges and 500 tasks, estimating the task affinities within 5% relative error in 112.3 GPU hours, while computing the true affinity scores can take over 8000 GPU hours. Our algorithm is also suitable for accelerating task selection methods that are typically computationally expensive. An example is forward or backward subset selection [18], which is a popular heuristic but requires evaluating quadratically many task combinations.
As for the second step, we design a new clustering algorithm that uses these estimated task affinities efficiently through a Semi-Definite Programming (SDP) relaxation formulation. The clustering algorithm takes the estimated task affinity matrix T (of size n × n) and the number k of task groups as input, then solves an SDP that maximizes the average density of the k groups. Since the SDP is a convex program, it can be solved efficiently, and we round the resulting solution to get the final task groups. Our experiments indicate that our clustering algorithm is more robust and performant than commonly used clustering techniques such as spectral clustering [39] and Lloyd's algorithm [33]. Once we have the task groups from the clustering, we can partition the tasks into subsets and train a separate model on the tasks within each subset; this overall algorithm is called Grad-TAG. In experiments, we show that our approach achieves the Pareto optimum in terms of error rate and computation cost. For multi-label prediction on graphs trained with a 3-layer GNN, Grad-TAG achieves comparable performance with over four baselines, while using 32× fewer FLOPs and 5× fewer GPU hours. For instruction fine-tuning of language models using T5-Base, Grad-TAG uses 48× fewer FLOPs and 11× fewer GPU hours with comparable performance to the best baseline. The code repository for reproducing our experiments can be found at: https://github.com/VirtuosoResearch/ScalableMTL.

Summary of Contributions:
We design an efficient algorithm, Grad-TAE, for estimating the task affinity scores of a multitask learning algorithm. The key idea of Grad-TAE is to trade computationally expensive multitask training for a lightweight gradient-based estimation of fine-tuning outcomes. We then design a clustering algorithm on top of the estimation procedure for downstream multitask optimization. Through a detailed experimental study, we demonstrate that our overall algorithm, Grad-TAG, significantly speeds up full model training while delivering comparable performance.
Organization: We briefly touch on related work and then provide the technical preliminaries for the rest of the paper. In Section 3, we outline our task affinity estimation procedure Grad-TAE, along with a theoretical analysis of the estimation error. Then, we present the clustering approach for task grouping and the overall algorithm Grad-TAG in Section 4. Finally, we provide a thorough empirical evaluation of the Grad-TAG algorithm for a variety of multitask learning settings in Section 5.

Related Work
Multitask learning is a fundamental problem with many applications, such as federated learning [49], road safety modeling [41], and language model fine-tuning [34]. This problem has been studied since the early literature of data mining [8]. As the number of tasks increases, modeling task relationships becomes increasingly complex and challenging [36, 67]. These relationships are influenced by data distribution characteristics, including covariate and label shifts [59]. Thus, designing optimization algorithms for multitask learning is challenging [29, 30]. We contribute to this literature by proposing a new approach to significantly speed up the computation of task affinity scores for modeling task relationships. We now proceed to discuss several lines of work most related to ours.
Task Similarity Measures. Previous works [50, 14] estimate task affinities between every pair of tasks. The computational complexity of such methods scales quadratically with the number of tasks. Another approach is to use task embeddings [54], i.e., training one model on each task and measuring the cosine similarity between the model weights. Although this approach scales linearly with the number of tasks, the measures tend to be noisy. Intuitively, if two tasks are similar, their gradients should exhibit higher cosine similarity. This idea can be implemented to balance training by dynamically tuning gradient magnitudes [12], or to project conflicting gradients onto the span of the other tasks' gradients [14]. The same idea can also be used to choose auxiliary tasks that are most beneficial for a primary task [13]. Similarity measures based on feature representations of tasks have also been applied to grouping tasks [48] and used to predict task transferabilities [4]. The main advantage of these approaches is their efficiency, as only a single multitask model needs to be trained. The downside is that the gradients can be noisy during a stochastic training procedure. For example, Azorin et al. [5] empirically observed that representation and gradient similarity measures do not consistently correlate with actual MTL performance. Thus, a more accurate approach is to build measures that approximate multitask outcomes directly; see recent work on designing surrogate models for multitask learning systems [29, 30].
Transferability Estimation. There have also been developments on information-theoretic measures of transferability in the recent literature. One natural idea is to evaluate the conditional entropy between target pseudo-labels (assigned by a pretrained source model) and the real target labels [7]. The Log Expected Empirical Predictor [40] proposes a modified procedure that uses soft predictions from the source model. These methods do not utilize feature embeddings in the measure [55]; TransRate [21] introduces a surrogate measure based on mutual information that also incorporates feature embeddings. An improved estimation method with better robustness can be achieved by shrinkage [22]. In the fine-tuning setting, the distance between the fine-tuned model and the pretrained initialization can indicate the level of generalization capability [31]. This geometry relates to the Hessian of the loss, which has been shown to correlate with the generalization performance of fine-tuned models [26]. Ju et al. [25] extend this Hessian measure to graph neural networks, which can guide the design of optimization algorithms that regularize the Hessian of neural networks [27].
Multitask Learning Optimization Algorithms. Multitask learning can be viewed as a multiobjective optimization problem [42], where the goal is to identify the Pareto frontier among multiple objectives [47]. One common MTL optimization approach is to reweight task losses and optimize a weighted combination of them [32, 46]. Our goal is to maximize the averaged prediction performance of all tasks. Thus, we are interested in partitioning the tasks into similar groups, where tasks are closely related within each group and can differ significantly across groups. Another interesting line of work designs branching neural networks such as tree structures [53, 17], where each layer contains multiple modules to handle different tasks [35]. Compared with branching methods, task grouping may be more suitable for handling a large number of tasks (hundreds to thousands). In this regime, negative interference between tasks is almost unavoidable, and clustering tasks into similar groups can provide a more efficient strategy than designing a single neural network that handles all tasks.
Influence Functions. There is a line of work on estimating the influence of adding or removing one sample on the whole dataset. Influence functions [28], based on efficient approximations of the Hessian inverse, provide one way to approximate this. Random sampling-based approaches to measuring leave-one-out influence have also been studied [23, 43]. The distinction between these works and ours is that we focus on task-level affinity, whereas this literature focuses on estimating the influence of a single data sample.
Clustering Algorithms. Clustering is a fundamental aspect of machine learning. Besides SDP relaxations, linear programming relaxations are known for clustering objectives such as k-center. The integrality gap of linear programming and semidefinite programming relaxations can be analyzed when there is a separation structure in the underlying clusters [3]. These approximation guarantees typically require the underlying similarity scores to satisfy a metric condition. By contrast, the task affinity matrix can easily violate the triangle inequality. Recent work has also studied mixed integer programming for best subset selection [9]. One novel contribution of this work is to make explicit a connection between multi-instruction fine-tuning and clustering. In light of this connection, it would also be interesting to revisit hierarchical clustering and hypergraph clustering for task grouping. For example, recent work by Tsitsulin et al. [52] investigates unsupervised graph clustering with graph neural networks.

PRELIMINARIES
Suppose we are interested in making predictions on n tasks. We are given a set of samples for training and testing on each task. Our goal is to design a prediction algorithm that maximizes the averaged test performance over all n tasks simultaneously. We assume that the samples from all the tasks are supported on the product of a d-dimensional feature space X and a label space Y. In order to precisely discuss task relationships, we formally define what we mean by a multitask learning algorithm.

Definition 2.1 (Multitask learning algorithms). For any subset S ⊆ {1, 2, . . ., n}, a multitask learning algorithm A takes the training data of all the tasks in S and combines them in a joint training procedure. Then, the (jointly trained) model is tested on each task i ∈ S, and a test result is obtained for each i. Let us denote the test result as f(S, i). Thus, the output of the algorithm includes a total of |S| results for any subset S, one for each i ∈ S.
Given a multitask learning algorithm, the transfer between the n tasks can then be viewed through the results of A applied to combinations of tasks as subsets. This notion of transfer underlies many existing multitask learning systems. We give two examples below, which have been used in prior works to tackle task transfer in complex visual systems [64, 50].
Example 2.2 (Pairwise task affinity). Consider two tasks i and j. Given a multitask learning algorithm A, one can mix the training data of tasks i and j, using SGD to train a shared encoder and task-specific prediction heads. If we compute the pairwise task affinity for all pairs of tasks 1 ≤ i ≤ j ≤ n, then we get an n by n task affinity matrix T, where T_{i,j} = f({i, j}, i).
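As a concrete illustration, the following minimal sketch builds the pairwise affinity matrix from an evaluation routine; `mtl_eval` and the toy scorer are hypothetical stand-ins for joint training on a task pair followed by testing on one task, not the paper's actual training code.

```python
import numpy as np

def pairwise_affinity(n, mtl_eval):
    """Build the n-by-n pairwise affinity matrix, where entry (i, j)
    holds the test result of task i after joint training on {i, j};
    mtl_eval stands in for f(S, i)."""
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            T[i, j] = mtl_eval({i, j}, i)
    return T

# Toy evaluator: tasks in the same block of two transfer well.
toy_eval = lambda S, i: 1.0 if len({t // 2 for t in S}) == 1 else 0.5
T = pairwise_affinity(4, toy_eval)
```

The nested loop makes the quadratic cost explicit: each entry requires one joint training run in the exact computation.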
Example 2.3 (Higher-order task affinity). Next, we discuss higher-order task affinity, which is analogous to sampling features in random forests. First, fix an integer m, which is the number of subsets we would like to sample (e.g., analogous to the number of decision trees in a random forest). We independently sample m subsets of {1, 2, . . ., n}, each of size α, chosen uniformly over all such subsets. Let us denote the m subsets as S_1, S_2, . . ., S_m. Then, compute f(S_k, i) for every k = 1, 2, . . ., m and every i ∈ S_k. Lastly, compute T_{i,j} as the average value of f among all subsets including both tasks i and j:

T_{i,j} = (1 / m_{i,j}) Σ_{k : i ∈ S_k, j ∈ S_k} f(S_k, i),

where m_{i,j} is the number of subsets that include both i and j. This leads to another task affinity matrix T, better capturing the higher-order relationships among tasks.
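The subset-sampling and averaging bookkeeping can be sketched as follows; `mtl_eval` is again a hypothetical stand-in for f(S, i), and the constant toy evaluator exists only to exercise the averaging logic.

```python
import random
import numpy as np

def higher_order_affinity(n, m, alpha, mtl_eval, seed=0):
    """Estimate T[i, j] as the average test result of task i over the
    sampled subsets that contain both i and j; mtl_eval(S, i) stands
    in for f(S, i)."""
    rng = random.Random(seed)
    subsets = [tuple(sorted(rng.sample(range(n), alpha))) for _ in range(m)]
    total = np.zeros((n, n))
    count = np.zeros((n, n))
    for S in subsets:
        for i in S:
            score = mtl_eval(set(S), i)
            for j in S:
                total[i, j] += score
                count[i, j] += 1
    # Average over subsets containing both i and j; zero if never sampled.
    T = np.where(count > 0, total / np.maximum(count, 1), 0.0)
    return T, subsets

# Toy evaluator: constant score, just to exercise the bookkeeping.
toy_eval = lambda S, i: 1.0
T, subsets = higher_order_affinity(n=10, m=20, alpha=3, mtl_eval=toy_eval)
```

Each sampled subset costs one joint training run in the exact computation, which is what Grad-TAE later replaces with a cheap estimate.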
In both examples, computing the task affinity matrix requires fitting Ω(n) models, given n tasks. In Example 2.2, one needs to train on the order of n² models, one for every pair of tasks. Then, in Example 2.3, a total of m = Ω(n log n) models are required, one for each sampled subset of tasks. This raises the question of whether one can approximate the results of a multitask learning algorithm with a more efficient computational method.
Specifically, given a multitask learning algorithm A and a collection of subsets S_1, S_2, . . ., S_m ⊆ {1, . . ., n}, can we quickly estimate the task affinity corresponding to f(S_k, i), for any k = 1, 2, . . ., m and any i ∈ S_k, without fully training a model for each subset? Do these task affinity estimates accurately approximate the affinity one would get from fully trained models? Moreover, are the estimates useful in the downstream task grouping setup?

TASK AFFINITY ESTIMATION
We now describe a new method for estimating task affinity scores. To circumvent the cost of full-model training, we start by describing an empirical observation regarding pre-training and fine-tuning. Then, we present our approach to estimating fine-tuned model parameters for task subsets. Additionally, we use random projection to reduce the dimension of the gradients. We provide an error analysis to justify the design of our algorithm.

Linearization of Fine-tuned Models
Our method is motivated by the fact that once we pre-train on all n tasks to obtain a meta-initialization, this initialization provides representations that can be quickly adapted to subsets of the tasks. This is based on the premise that the underlying tasks share structural similarities in multitask learning. As a model fine-tuned on a subset of tasks stays in the vicinity of the initialization, the fine-tuning procedure behaves locally like a linear model.
To illustrate this observation, we consider three distinct scenarios involving graph neural networks (GNNs) and transformers (BERT and T5). We test GNNs on a multi-label prediction dataset on a YouTube graph [61], using a 3-layer SIGN network [15]. This dataset includes n = 100 subtasks, each corresponding to the node labels of a subgraph of the whole graph. For transformers, we take a pretrained BERT model and fine-tune it on a sentence classification dataset [63], which contains n = 26 tasks. We also use a pretrained T5-Base model and fine-tune it on a sentence classification dataset with 100 instructions [6], which has n = 100 tasks.

Table 1: Measuring Taylor's expansion error for models fine-tuned from an initialization pre-trained on all tasks. The results are averaged over 100 random task subsets.

In each experiment, we first obtain a meta-initialization θ* by training on all tasks combined. Then, we fine-tune θ* on a random subset of the tasks. We perform Taylor's expansion with θ* as the anchor point. Let θ denote the fine-tuned weight, and denote the models with weights θ and θ* as f_θ and f_{θ*}, respectively. For an input x with label y, denote the output of the fine-tuned model as f_θ(x, y). If θ is close to θ*, f_θ(x, y) can be approximated by

f_θ(x, y) ≈ f_{θ*}(x, y) + ∇f_{θ*}(x, y)^⊤ (θ − θ*) + ε,   (2)

where ε denotes the error term. We measure ε and report the Residual Sum of Squares (RSS) in Table 1. In particular, we fine-tune the meta-initialization on a subset of tasks to get weight θ, and measure the fine-tuned distance as ‖θ − θ*‖ / ‖θ*‖. Interestingly, our results show that the gradient-based approximation yields within 3.5% RSS, even when the fine-tuned distance is up to 10%. In particular, viewing θ as the decision variables, Eq. (2) is a linear model with the gradient ∇f_{θ*}(x, y) as the feature vector.
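This linearization check can be reproduced in miniature. The sketch below uses a toy tanh model standing in for a neural network (all names and sizes are illustrative): it compares the exact output after a small parameter perturbation against the first-order Taylor expansion around the anchor point and reports the relative residual sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_out(theta, X):
    # Toy scalar-output "network": f_theta(x) = tanh(theta . x).
    return np.tanh(X @ theta)

def model_grad(theta, X):
    # Gradient of tanh(theta . x) w.r.t. theta: (1 - tanh^2) * x.
    return (1.0 - np.tanh(X @ theta) ** 2)[:, None] * X

theta_star = rng.normal(size=10)   # meta-initialization (anchor point)
X = rng.normal(size=(200, 10))     # inputs

# A "fine-tuned" weight vector close to the anchor point.
theta = theta_star + 0.05 * rng.normal(size=10)

exact = model_out(theta, X)
linear = model_out(theta_star, X) + model_grad(theta_star, X) @ (theta - theta_star)
rel_rss = np.sum((exact - linear) ** 2) / np.sum(exact ** 2)
```

Because the perturbation is small relative to the anchor, the relative RSS is tiny here, mirroring (in a toy setting) the small Taylor-expansion errors reported in Table 1.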
Remark 3.1 (Second-order approximation). It is natural to ask if a second-order approximation can further reduce Taylor's expansion error. Notice that there is a tradeoff between approximation quality and computation cost. Based on our preliminary test of the Hessian approximation, it can indeed reduce the estimation error; however, it requires computing Hessian-gradient products. Recall that the premise is that the underlying tasks share structural similarity, as in community detection, where clusters have higher densities. Our experiments found that 94% of models fine-tuned on random task subsets remain within 10% distance of the initialization (on the YouTube and RTE datasets), suggesting that the first-order approximation is generally sufficient.

Gradient-based Estimation
We now describe our algorithm, which builds on the above linearization property by using logistic regression with gradients as features. It also includes a dimension reduction step, as described below.
(1) Estimating fine-tuned model parameters: In the following discussion, we focus on binary classification, so that y_l ∈ {+1, −1}. See Remark 3.2 for extensions to multi-class classification and regression. Recall the gradient-based approximation of f_θ(x_l, y_l), given the input (x_l, y_l). Let us denote the gradient ∇f_{θ*}(x_l, y_l) as g_l, and let W = θ − θ* denote the fine-tuning displacement. Denote the combined dataset of the task subset S as D_S, with n_S the combined number of data samples in D_S. Using the logistic loss, we can write down the loss function, for W ∈ R^d, as

L_S(W) = (1 / n_S) Σ_{(x_l, y_l) ∈ D_S} log(1 + exp(−y_l (f_{θ*}(x_l, y_l) + g_l^⊤ W))).

The main idea is to solve a logistic regression problem with g_l being the feature vector and y_l being the response label. However, keep in mind that the dimension of g_l is the same as the number of parameters in the neural network, which could be tens of millions. Thus, we introduce a dimension reduction procedure that does not lose much precision.
(2) Dimension reduction: We use Johnson-Lindenstrauss random projection [24], which projects the gradients to a much lower dimension before solving the logistic regression. Let P be a d by r Gaussian random matrix, whose entries are independently sampled from a Gaussian N(0, r^{-1}). We project each gradient from dimension d onto dimension r as g̃_l = P^⊤ g_l. Then, we solve the logistic regression above with g̃_l in place of g_l, which is now an r-dimensional problem, to obtain a solution ŵ_S ∈ R^r. Lastly, we set Ŵ_S as P ŵ_S + θ* to map the projected solution back to the d-dimensional space. Ŵ_S is the estimated model parameter for fine-tuning θ* on the task subset S.
(3) Averaging over an ensemble: To reduce the variance of the above estimation, we also add a model averaging step. In particular, we train several meta-initializations, repeat the above estimation procedure for each, and average the estimated scores within the ensemble.
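Steps (1) and (2) can be sketched together on synthetic data. The per-sample "gradients", labels, and dimensions below are all made up for illustration, and plain gradient descent stands in for whatever logistic regression solver one prefers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_samples = 512, 32, 400     # parameter dim, projected dim, samples

# Synthetic per-sample "gradients" G and +/-1 labels y.
w_true = rng.normal(size=d) / np.sqrt(d)
G = rng.normal(size=(n_samples, d))
y = np.sign(G @ w_true)

# Johnson-Lindenstrauss projection: d x r Gaussian, entries N(0, 1/r).
P = rng.normal(scale=1.0 / np.sqrt(r), size=(d, r))
G_proj = G @ P                     # gradients projected down to dimension r

def logistic_loss(w):
    margins = np.clip(y * (G_proj @ w), -30.0, 30.0)
    return np.mean(np.log1p(np.exp(-margins)))

# Plain gradient descent on the r-dimensional logistic regression.
w = np.zeros(r)
loss_init = logistic_loss(w)
for _ in range(300):
    margins = np.clip(y * (G_proj @ w), -30.0, 30.0)
    grad = -(G_proj * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.05 * grad
loss_final = logistic_loss(w)

# Map the projected solution back to the d-dimensional parameter space
# (theta* is taken to be the origin in this toy setup).
w_hat = P @ w
```

The key point is that the optimization runs entirely in the r-dimensional projected space, and only the final solution is mapped back to parameter space.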
We summarize the entire procedure, with all three steps, in Algorithm 1. Let us compare the running time of this estimation with one that uses full training to obtain f(S_k, i) instead: • Our estimation needs one full training per ensemble member, plus O(n) gradient evaluations and solving the logistic regression m times. • Computing T directly requires m full model trainings instead.
Typically, the ensemble size is O(1), while m = Ω(n) or even O(n²) in downstream use cases. Thus, our estimation algorithm reduces Ω(n) full-model trainings to only O(1). The tradeoff is that we require O(n) gradient evaluations (to retrieve the gradients on all tasks) plus solving the logistic regression m times. As we will show below, the random projection reduces the dimension of the logistic regression problem to O(log n), which makes it much cheaper to solve.

Remark 3.2 (Extension to multiple classification or regression).
We note that the above procedure can be extended to multi-class classification. This requires setting up one prediction vector for each class; the rest remains the same. The procedure also applies to regression by using the mean squared error instead of the logistic loss.

Error Bounds
We now show that the error introduced by the approximations in Grad-TAE is bounded. Specifically, we use the Johnson-Lindenstrauss Lemma to argue that as r increases, the random projection yields a minimizer whose quality is not much worse than that of the solution without the projection. We will assume that the averaged Taylor expansion error is at most ε₁ across the entire dataset of every task. Additionally, we assume that the search occurs within a bounded space of radius R. Lastly, at the pretrained initialization, each gradient vector's Euclidean norm is at most G. With these conditions, we state the error bound for Grad-TAE as follows.
Proposition 3.3. Let D be a search space whose radius is at most R. Suppose the gradients of f_{θ*} at the initialization θ* over the training set are at most G in Euclidean norm. For each task i = 1, 2, . . ., n, let D_i denote its training data, and suppose that for every i, the averaged Taylor expansion error over D_i is at most ε₁. Provided that r = Ω(ε^{-2} log n), the training loss of Ŵ_S is bounded away from the minimum training loss, for any S ⊆ {1, 2, . . ., n}, as

L_S(Ŵ_S) ≤ min_{W ∈ D} L_S(W) + O(ε R G + ε₁).   (6)

The proof, given in Appendix A, uses the Johnson-Lindenstrauss Lemma [24]. In particular, using the fact that the logistic loss is 1-Lipschitz continuous, we can relate L_S(Ŵ_S) to min_W L_S(W). The errors introduced by the random projection and by Taylor's expansion can be bounded using the JL Lemma and the bound on the Taylor expansion error, respectively. Further, our experiments in Table 1 suggest that ε₁ is relatively small in practice. Thus, as ε₁ goes to zero, Eq. (6) guarantees that the gap between L_S(Ŵ_S) and min_W L_S(W) will be small.

TASK AFFINITY BASED GROUPING
We now describe a clustering algorithm to partition the n tasks into k disjoint subsets. Given an n by n task affinity matrix T, we will find a clustering that maximizes the average density of all clusters. Concretely, let A_1, . . ., A_k be a disjoint partition of [n], and let v_1, . . ., v_k be the corresponding 0-1 indicator vectors, where the i-th entry of v_j indicates whether task i is in cluster j. The average density of this clustering can be written as:

(1/k) Σ_{j=1}^{k} (v_j^⊤ T v_j) / (v_j^⊤ v_j).

This integer-constrained objective is NP-hard to optimize in general (in particular, geometric clustering is a special case [2]).
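As a minimal sketch of this objective (using a hypothetical 4-task affinity matrix), the average density of a clustering can be computed directly from the indicator vectors:

```python
import numpy as np

def average_density(T, clusters):
    """Average density of a clustering: each cluster with 0-1
    indicator v contributes (v^T T v) / (v^T v), and the objective
    is the mean over the clusters."""
    n = T.shape[0]
    total = 0.0
    for members in clusters:
        v = np.zeros(n)
        v[list(members)] = 1.0
        total += (v @ T @ v) / (v @ v)
    return total / len(clusters)

# Hypothetical affinity for four tasks: two well-separated pairs.
T = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]])
good = average_density(T, [{0, 1}, {2, 3}])  # respects the block structure
bad = average_density(T, [{0, 2}, {1, 3}])   # mixes the pairs
```

The partition respecting the block structure scores strictly higher, which is exactly what the maximization seeks.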
We design a Semi-Definite Programming (SDP) relaxation and then round the SDP solution into a clustering. Let us denote the assignment variables as an n × k matrix V, such that each entry V_{i,j} indicates whether task i belongs to cluster j, for every i = 1, . . ., n and j = 1, . . ., k. Moreover, let the j-th column of V, which is the characteristic vector of the j-th cluster, be denoted as v_j. Under this assignment, the sum of V_{i,j} over j for any task i must be one, since we assign each task to a single group. By contrast, the sum of the entries of v_j is the number of tasks assigned to cluster j, which is at least one.
Let 1 denote the all-ones vector. We state an integer program to maximize the average density of all k clusters as follows:

max (1/k) Σ_{j=1}^{k} (v_j^⊤ T v_j) / (v_j^⊤ v_j), subject to V ∈ {0, 1}^{n×k}, V 1 = 1.

Note that each v_j v_j^⊤ is a rank-one semidefinite matrix. Let us denote the sum of them, each normalized by v_j^⊤ v_j, as a new variable X = Σ_{j=1}^{k} (v_j v_j^⊤) / (v_j^⊤ v_j). X has rank k, since it is the sum of k rank-one matrices and the v_j's are orthogonal to each other. Additionally, its trace is equal to k, because the trace of each normalized rank-one matrix is one. Further relaxing the rank constraint (while keeping the trace constraint) leads to a convex program, which can be solved efficiently. Given a solution of the SDP, denoted as X̂, the last step is to round X̂ into an integer solution. We set a threshold λ such that if X̂_{i,j} ≥ λ, tasks i and j are assigned to the same cluster. In the experiments, we set λ as c/n for a constant c ≥ 1, since X̂_{i,j} should be 1/|A_j| when i and j are in the same cluster A_j with |A_j| < n. Thus, intra-cluster entries of X̂ clear the threshold under this assignment.
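The rounding step can be sketched as follows. Given a relaxed solution X, we merge any pair of tasks whose entry clears the threshold (closed transitively via union-find); the matrix below is a hypothetical ideal solution for two clusters of size two, not the output of an actual SDP solver.

```python
import numpy as np

def round_sdp_solution(X, threshold):
    """Round a relaxed solution X into clusters: tasks i and j end up
    in the same cluster whenever X[i, j] >= threshold."""
    n = X.shape[0]
    parent = list(range(n))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if X[i, j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())

# Hypothetical ideal relaxed solution for clusters {0, 1} and {2, 3}:
# entries equal 1/|cluster| inside a cluster and 0 across clusters.
X = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5]])
clusters = round_sdp_solution(X, threshold=1.0 / 4)  # lambda = c/n with c = 1
```

On this ideal input, thresholding at c/n recovers the two planted clusters exactly, matching the rationale for the choice of λ above.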
We provide the entire procedure in Algorithm 2, which uses Algorithm 1 as a subroutine to estimate the task affinity scores.

Example 4.1 (Discussion about alternative clustering algorithms).
A natural question is whether alternative algorithms such as spectral clustering or Lloyd's clustering can be used. We find that these algorithms are not as robust as the SDP relaxation because the scale of the loss values varies across rows for different tasks. We describe a toy example to illustrate. Suppose T is a 6 by 6 matrix involving three clusters C_1, C_2, C_3 of size 2 each. The affinity in C_1 is 7, while the affinity scores in C_2 and C_3 are 20 and 19, respectively. We find that both spectral clustering and Lloyd's clustering will group C_2 and C_3 together, while the SDP relaxation manages to separate them. See Figure 2 for an illustration. For this reason, we use the SDP relaxation in Grad-TAG.

Remark 4.2 (Approximation guarantees). Although clustering is a well-studied problem in approximation algorithms [1], task affinity violates the metric condition typically required in order to obtain guarantees in this literature. In particular, the triangle inequality T_{i,k} + T_{k,j} ≥ T_{i,j} can be violated. It is possible that by making an assumption regarding intra-cluster separation (see, e.g., Awasthi et al. [3]), one might be able to analyze the SDP theoretically. This is left for future work.

Remark 4.3 (Further variants of Grad-TAG). While we focus on the task grouping problem, the idea can also be used to speed up forward and backward selection. For forward selection, we set the list of subsets in Algorithm 1 as {1}, {2}, . . ., {n}. Suppose we select task 3. Then, in the next round, we set the list of subsets as {3, 1}, {3, 2}, . . ., {3, n}, and so on.
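A minimal sketch of this forward-selection variant is below; `estimate_scores` is a hypothetical stand-in for one round of estimates produced by the affinity estimator, and the toy scorer is made up for illustration.

```python
def forward_selection(n, estimate_scores, rounds):
    """Greedy forward selection on top of the estimator: each round
    scores every one-task extension of the current subset (via cheap
    estimates rather than full training) and keeps the best one."""
    selected = []
    for _ in range(rounds):
        candidates = [selected + [t] for t in range(n) if t not in selected]
        scores = estimate_scores(candidates)
        best = max(range(len(candidates)), key=lambda i: scores[i])
        selected = candidates[best]
    return selected

# Toy scorer: subsets drawn only from tasks {0, 1, 2} score highest.
toy_scores = lambda subsets: [sum(1.0 if t < 3 else -1.0 for t in S) for S in subsets]
chosen = forward_selection(6, toy_scores, rounds=3)
```

Each round evaluates O(n) candidate subsets, which is exactly where replacing full training with estimation pays off.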

EXPERIMENTS
We now validate Grad-TAE and Grad-TAG across various settings. The evaluation focuses on the following key questions. First, does the estimation procedure accurately approximate the target task affinity scores? Second, how does the running time compare to the full computation required to obtain these scores? Third, do the estimated affinity scores, combined with the clustering algorithm, work well in downstream use cases?
Our experiments show that Grad-TAE approximates the true task affinities (based on full model training) within a relative error of less than 2.7%, while using less than 3% of the computational cost of full training. Further, Grad-TAG achieves downstream accuracy comparable to existing methods in two canonical applications, multi-label classification on graphs and language model fine-tuning, while using 32.8× fewer FLOPs. Lastly, we discuss the parameters and steps of our algorithm, including a comparison with alternative clustering methods.

Experimental Setup
5.1.1 Evaluation settings. We note that our algorithm applies to a wide range of multitask learning scenarios. For a representative evaluation, we focus on multi-label prediction on graphs and language model fine-tuning. In the first setting, each labeling task corresponds to a subgraph within a graph. Given a seed set of nodes for each labeling as the training set, the goal is to identify the remaining nodes of the subgraph. This can be cast as multitask learning by viewing each labeling as a binary classification task. The objective is to optimize the average accuracy over all labeling tasks.
The second setting involves fine-tuning language models using human-designed instructions, known as instruction fine-tuning. Each instruction corresponds to a prompt. Typically, a dataset can be paired with many relevant instructions, some of which are more relevant to a subset of tasks than others [34]. Thus, a natural question is to select the instructions that are most relevant to the downstream task, which can be formulated as multitask learning. In particular, we view fine-tuning with each instruction as a single task. While we focus on these two applications, our algorithm is conceivably applicable to other related settings.

5.1.2 Datasets and models.
We use social network datasets with community labels for multi-label prediction on graphs. We select four graphs from SNAP [61] (Amazon, YouTube, DBLP, and LiveJournal); we expect similar results to hold on other graphs. The number of nodes in these four graphs ranges from 3k to 57k; the number of edges ranges from 20k to 1M. For each graph, we pick the 100 largest communities, corresponding to T = 100 tasks. For preprocessing, we randomly sample 10% of nodes from each community subgraph as positive training samples and 10% of nodes outside the subgraph as negative samples. From the remaining data, 20% is randomly sampled for validation. We evaluate performance using the macro F1-score on the test set [62].
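As an illustration of this preprocessing, here is a minimal sketch of the per-community split. The function name `build_task_split` and the synthetic node sets are hypothetical; only the 10%/10%/20% fractions follow the description above:

```python
import random

def build_task_split(nodes, community, seed=0):
    """Sample training positives/negatives and a validation set for one
    community-labeling task, mirroring the 10%/10%/20% splits in the text."""
    rng = random.Random(seed)
    inside = sorted(community)
    outside = sorted(set(nodes) - set(community))
    pos = rng.sample(inside, max(1, len(inside) // 10))    # 10% positives
    neg = rng.sample(outside, max(1, len(outside) // 10))  # 10% negatives
    rest = [v for v in nodes if v not in set(pos) | set(neg)]
    val = rng.sample(rest, max(1, len(rest) // 5))         # 20% validation
    return pos, neg, val

nodes = list(range(100))
community = set(range(30))  # a toy community subgraph
pos, neg, val = build_task_split(nodes, community)
print(len(pos), len(neg), len(val))  # 3 7 18
```

Repeating this for each of the T communities yields T binary classification tasks over a shared node set.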
Next, we examine the running-time scaling of our algorithm on a large graph (the Orkut network), which has 395k nodes, 21M edges, and a total of 500 communities. We use a 3-layer SIGN model [15] with a fixed width of 256 as the encoder in the MTL models, which is more efficient to train than a GCN.
For fine-tuning language models, we use two text classification datasets from SuperGLUE [56], specifically RTE and WiC. Each dataset includes 100 instructions, with 10 sourced from Bach et al. [6] and 90 generated using the automatic instruction generation method in [66]. Thus, each dataset has 100 tasks in total, each corresponding to fine-tuning with one instruction. We use T5-Base [45] as the encoder for the MTL model. The choice of this encoder is without loss of generality; we expect similar results to hold for other encoders.
Put together, our experiments cover seven different datasets in total, spanning medium- and large-scale instances, with the largest dataset containing 500 tasks.

5.1.3 Evaluation metrics.
We assess the accuracy of estimated task affinity by measuring the distance between our estimated task affinities and the task affinities computed from fully trained models.
For task grouping, we evaluate the accuracy averaged over all tasks when training a collection of networks, each on a subset of tasks. The accuracy metric is task-dependent, such as zero-one accuracy or the F1-score, depending on the setting.
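For concreteness, a minimal sketch of the macro F1-score (the unweighted average of per-class F1-scores, with the common convention that a class with no true positives scores zero) on a toy prediction:

```python
def f1(y_true, y_pred, positive):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # convention: no true positives means zero F1
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(y_true, y_pred, classes=(0, 1)):
    # Average the per-class F1-scores with equal weight per class.
    return sum(f1(y_true, y_pred, c) for c in classes) / len(classes)

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]
print(macro_f1(y_true, y_pred))  # 0.8
```

Macro-averaging weights each class equally, which matters here because each community labeling task has far more negatives than positives.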
Lastly, we measure each method's total number of floating-point operations (FLOPs). In addition, we report the number of GPU hours evaluated on a single Nvidia RTX 6000 GPU.

Task Affinity Estimation
We now report the results from running our estimation procedure. We regard the task affinity scores computed from fully trained models as the target, denoted as A★. Then, after running Grad-TAE, we compute the estimated affinity matrix Â and measure the relative distance between Â and A★ as ∥Â − A★∥_F / ∥A★∥_F. We evaluate the relative distance on the YouTube graph, which contains T = 100 labeling tasks.
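One natural instantiation of this relative distance is the normalized Frobenius distance; the snippet below is a sketch under that assumption, with toy matrices standing in for Â and A★:

```python
import math

def rel_distance(T_hat, T_star):
    """Relative distance ||T_hat - T_star||_F / ||T_star||_F between two
    affinity matrices given as lists of rows."""
    num = sum((a - b) ** 2
              for ra, rb in zip(T_hat, T_star) for a, b in zip(ra, rb))
    den = sum(b * b for rb in T_star for b in rb)
    return math.sqrt(num / den)

T_star = [[1.0, 0.5], [0.5, 1.0]]  # toy "true" affinity matrix
T_hat = [[1.1, 0.5], [0.5, 0.9]]   # toy estimate
print(round(rel_distance(T_hat, T_star), 4))  # 0.0894
```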
As for the computation cost, our procedure has three parts: (i) training m meta-initializations, each on the combination of all tasks; (ii) for each meta-initialization, computing the gradients on all training examples and projecting the gradients to a lower dimension; (iii) solving a logistic regression on the projected gradients for each subset of tasks and evaluating the performance on each task in the subset. We report the computation in terms of FLOPs, both for our algorithm to compute Â and for fully training models to compute A★.

Table 2: We report the distance between our estimated task affinity and A★, computed on the YouTube graph. To interpret the computation cost, we report the ratio between the number of FLOPs to compute A★ and the number of FLOPs of our algorithm. Recall from Algorithm 1 that m is the number of meta-initializations and d is the random projection dimension.

5.2.1 Accelerating pairwise task affinity computation. First, we train a separate multitask model on each pair of tasks to compute A★. We report the distance metric and the number of FLOPs for fully trained models (to compute A★) and for our algorithm in Table 2.
To explain our findings, we set the number of meta-initializations to m = 1 and vary the projection dimension d among 50, 100, 200, and 400. All these values yield an estimation of A★ within 11% distance. As expected, increasing d leads to better estimation. Once d increases above 200, the distance metric stabilizes at around 5.7%. Thus, we set d to 200 in the remaining experiments. As a remark, this is approximately 15 log(N), where N = 683,370 in this experiment, aligning with our analysis in Proposition 3.3. Remarkably, under this setting, Grad-TAE uses 3.5 GPU hours and requires 130× less computation than fully trained models! Next, we fix d = 200 while increasing m up to 9. This further reduces the distance metric to 5.4%, with 45.0× less compute cost. We observe diminishing returns from ensembling once m goes beyond 5. Thus, we set m to 5 in the remaining experiments. This uses 17.6 GPU hours and 44.9× less computation than fully trained models.
5.2.2 Accelerating higher-order task affinity computation. We observe qualitatively similar results for approximating the higher-order task affinity matrix; recall its definition from equation (1) and Example 2.3. We sample n = 2000 subsets so that the higher-order task affinity matrix converges, while setting the subset size to α = 10 (a further ablation study is provided in Section 5.3.4).
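Steps (ii) and (iii) of the estimation can be sketched end-to-end on synthetic data: project per-example "gradient" features to d dimensions with a random matrix, then fit a logistic regression on the projected features by gradient descent. Everything here (the features G, the toy dimensions, the learning rate) is a stand-in, not our actual training pipeline:

```python
import math, random

rng = random.Random(0)
p, d, N = 50, 8, 40  # parameter dim, projection dim, number of examples (toy)

# Per-example "gradient" features g_i (stand-ins for model gradients at the
# meta-initialization) and labels y_i in {-1, +1}.
G = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(N)]
y = [1 if sum(g[:5]) > 0 else -1 for g in G]

# Random projection P in R^{d x p} with N(0, 1/d) entries; features z_i = P g_i.
P = [[rng.gauss(0, 1 / math.sqrt(d)) for _ in range(p)] for _ in range(d)]
Z = [[sum(P[r][c] * g[c] for c in range(p)) for r in range(d)] for g in G]

def logistic_loss(w):
    # Average logistic loss of the linearized model on projected features.
    return sum(
        math.log1p(math.exp(-yi * sum(wr * zr for wr, zr in zip(w, zi))))
        for zi, yi in zip(Z, y)
    ) / N

# Fit the d-dimensional logistic regression by plain gradient descent.
w = [0.0] * d
for _ in range(200):
    grad = [0.0] * d
    for zi, yi in zip(Z, y):
        s = 1 / (1 + math.exp(yi * sum(wr * zr for wr, zr in zip(w, zi))))
        for r in range(d):
            grad[r] -= yi * s * zi[r] / N
    w = [wr - 0.1 * gr for wr, gr in zip(w, grad)]

print(logistic_loss([0.0] * d), logistic_loss(w))  # the fitted loss is lower
```

The key efficiency point is that the regression lives in d dimensions (here 8) rather than the model's p parameters, so solving it for thousands of task subsets is cheap once the projected gradients are cached.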
Using m = 1 and d = 200, our algorithm approximates A★ within 3.5% distance while using less than 1% of the cost of computing A★. Further increasing m to 5, the distance drops to 2.7%. Even then, the computation cost is only 3% of computing A★. This takes 11.9 GPU hours and uses 32.8× less computation than fully trained models.

5.2.3 Accelerating task affinity computation on text and image datasets.
We have shown that Grad-TAE significantly reduces the computational cost of task affinity estimation. To verify that these efficiency gains are consistent across data modalities, we apply Grad-TAE to a text classification dataset (RTE) and an image classification dataset (DomainNet) [44]. The RTE dataset contains 100 tasks; we use T5-Base and compute higher-order task affinity with 2000 subsets of size 10. The DomainNet dataset contains 6 tasks; we use ResNet-50 and compute higher-order task affinity with 20 subsets of size 3. On the two datasets, our algorithm reduces computation by 42.6× and 9.5×, respectively, compared to computing the true higher-order task affinities, while incurring less than 3% relative error. The smaller speedup on the image dataset is due to the smaller total number of models trained on task subsets.

5.2.4 Scaling task affinity estimation to very large instances. Lastly, we estimate task affinities on the Orkut graph while varying T from 100 to 500. We measure the distance between the estimated and the true pairwise affinity by downsampling the number of pairs to 2000. Figure 3 shows the comparison. We observe that our algorithm scales to as many as 500 tasks using only 112.3 GPU hours, which is much faster than computing A★. Moreover, the relative distance to the true scores remains within 5%.

Comparison for Task Grouping
5.3.1 Baselines. We set up a wide range of baselines covering heuristic solutions and recent optimization techniques.
Forward Selection (FS) and Backward Selection (BS) [18]: These are standard approaches to subset selection, which we adapt to task selection.
Higher-Order Approximation (HOA) [50]: This algorithm computes pairwise task affinities between every two tasks and averages them to approximate higher-order affinities. It uses a branch-and-bound search algorithm to identify task groupings.
Task Affinity Grouping (TAG) [14]: This approach computes task affinity by projecting one task's gradients onto another task's gradients during training. TAG also uses a branch-and-bound search algorithm to identify groupings.
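As a rough illustration of gradient-based affinity in the spirit of TAG (a lookahead variant, not necessarily its exact definition), one can measure the relative change in task j's loss after taking one gradient step on task i. The quadratic task losses below are hypothetical:

```python
def loss(opt, theta):
    # Quadratic task loss L(theta) = 0.5 * ||theta - opt||^2, minimized at opt.
    return 0.5 * sum((t - m) ** 2 for t, m in zip(theta, opt))

def grad(opt, theta):
    return [t - m for t, m in zip(theta, opt)]

def tag_affinity(opt_i, opt_j, theta, lr=0.1):
    """Lookahead affinity of task i onto task j: the relative decrease in
    task j's loss after one gradient step on task i's loss."""
    step = [t - lr * g for t, g in zip(theta, grad(opt_i, theta))]
    return 1.0 - loss(opt_j, step) / loss(opt_j, theta)

theta = [0.0, 0.0]
aligned, conflicting = [1.0, 1.0], [-1.0, -1.0]
print(tag_affinity(aligned, aligned, theta))      # positive: the step helps
print(tag_affinity(aligned, conflicting, theta))  # negative: the step hurts
```

Tasks whose optima pull the shared parameters in the same direction receive positive affinity; tasks that pull in opposite directions receive negative affinity.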
Auto-λ [32]: This bilevel optimization technique balances the weighting of each task relative to the average objective of all tasks.
BoostMTL [29]: This approach computes the higher-order task affinity between two tasks as the prediction loss of one task jointly trained with the other task and a random subset of the remaining tasks, followed by spectral clustering to identify task groupings.

Figure 4: This figure illustrates the tradeoff between error rate and computation cost, measured by the number of FLOPs and GPU hours. Compared to multitask learning baselines, our approach achieves a Pareto-optimal balance between error rate and computation cost. Recall that m is the number of meta-initializations used in Grad-TAG. The number of FLOPs is reported in Giga FLOPs. In both settings, there are T = 100 tasks. Our approach delivers comparable test accuracy to all baselines, using 32.8× fewer FLOPs and 5.2× fewer GPU hours than all baselines.

5.3.2 Multi-label classification on graphs.
We report the results of applying our algorithm to overlapping community detection. We use our algorithm to estimate higher-order task affinity scores and then cluster the tasks. We illustrate our results in Figure 4a, deferring the full comparison to Appendix C. We use 1 − macro F1-score as the error rate on multi-label classification datasets. First, we confirm that our algorithm outperforms single-task learning, which trains one model per task, by 2.1% (as also evidenced by prior work on multitask learning [65]).
We note that our algorithm reduces the error rate compared to all baselines while using 32.8× fewer FLOPs and 5.2× fewer GPU hours than the closest baseline, with m = 5. We can set m = 1 for a further speedup. This results in 71.4× fewer FLOPs and 26.2× fewer GPU hours than the closest baseline, with a performance decrease of only 0.3%.

5.3.3 Fine-tuning language models. Next, we report the results from fine-tuning language models (T5-Base) on text classification with T = 100 instructions. We again use our algorithm to estimate higher-order task affinity scores and apply SDP clustering to group tasks. We illustrate our results in Figure 4b, deferring the complete comparison to Appendix C. We use 1 − accuracy as the error rate on the text classification datasets. In particular, our algorithm outperforms single-task learning by 1.9%.
With m = 5, our algorithm shows comparable performance to all baselines while using 48.2× fewer FLOPs and 10.6× fewer GPU hours. By reducing m to 1, our algorithm uses 105.4× fewer FLOPs and 53.2× fewer GPU hours, with only a 0.5% performance decrease.

5.3.4 Discussion of clustering algorithms and hyper-parameters.
We discuss the design choices of Algorithm 2. First, we compare the SDP-based clustering with spectral and Lloyd's clustering. Across six datasets, SDP-based clustering outperforms these classical algorithms by an average of 1.2%. Next, we discuss the number of clusters k and the rounding threshold. We vary k among 5, 10, 20, and 40 (recall that T = 100) and note that the performance stabilizes at k = 20. Thus, we set k = 20. For the rounding threshold, we choose between two candidate values an order of magnitude apart and select the one that results in k clusters.
Recall that Algorithm 2 also requires setting the number of subsets n and each subset's size α. Given T = 100, we vary n from 1000 to 3000 and observe that the results stabilize once n reaches 2000. Thus, we set n = 2000. For α, we choose among 5, 10, and 20, and select α = 10, as it yields better results than the other values.
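The subset-sampling step can be sketched as follows: sample n subsets of size α and average each task's per-subset score over the subsets containing a given pair of tasks. The `score` function here is a toy stand-in for evaluating a model trained on the subset, and the exact averaging may differ from our definition in equation (1):

```python
import random

def higher_order_affinity(T, n, alpha, score, seed=0):
    """Estimate a T x T affinity matrix: entry (i, j) averages task i's
    per-subset score over the sampled subsets containing both i and j."""
    rng = random.Random(seed)
    totals = [[0.0] * T for _ in range(T)]
    counts = [[0] * T for _ in range(T)]
    for _ in range(n):
        subset = rng.sample(range(T), alpha)
        for i in subset:
            s = score(i, subset)
            for j in subset:
                totals[i][j] += s
                counts[i][j] += 1
    return [[totals[i][j] / counts[i][j] if counts[i][j] else 0.0
             for j in range(T)] for i in range(T)]

def score(i, subset):
    # Toy score: tasks in the same half of the index range help each other.
    return sum(1.0 for j in subset if (i < 5) == (j < 5))

M = higher_order_affinity(T=10, n=500, alpha=4, score=score)
print(M[0][1] > M[0][7])  # within-group affinity exceeds cross-group affinity
```

With n large enough that every pair appears in many subsets, the averaged scores recover the block structure of related tasks, which is what the clustering step consumes.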

CONCLUSION
This paper designs an efficient estimation algorithm to compute task affinity scores. The main idea is to first pretrain a meta-initialization on all tasks and then use the initialization's gradients to estimate the fine-tuned model parameters for a particular task combination via logistic regression. A random projection is applied to the gradients to reduce the dimension of the regression. We then design a robust clustering algorithm to accompany the task affinity estimation, which together yields an efficient multitask learning algorithm. Experiments show that the algorithm scales to as many as 500 tasks on very large graphs while accurately approximating the true task affinity scores. The overall algorithm gives the best tradeoff between computation and performance compared to existing multitask learning methods.
We discuss several directions for future work. First, it would be interesting to design novel dimension reduction and clustering methods for Grad-TAG, which will likely depend on the downstream application. Second, it would be interesting to see if boosting could be used in branching neural networks, another type of multitask architecture that trains a joint model on all tasks. A natural application of our method to group at the layer level is to start with a joint model and gradually split layers into task groups from input to output. At each layer, the estimation procedure (based on layer-level features) may be used to compute task affinity scores and then group tasks accordingly. This would help reduce the final model to a single neural network.
A PROOF OF PROPOSITION 3.3
For this proof, we focus on binary classification. As discussed in Remark 3.2, the extension to multi-class classification requires additional notation, but the proof is straightforward.
Proof of Proposition 3.3. Recall that we define the minimizer of the logistic regression after random projection as Ŵ. To keep the two spaces distinct, we annotate each vector with its dimension, writing Ŵ_d ∈ R^d. Then Ŵ_d is the minimizer of the following problem:

min_{W ∈ R^d} h₁(W) = (1/N) Σ_{i=1}^{N} ℓ(g_i^⊤ P^⊤ W + b_i, y_i),   (11)

where we recall that P is a d × p random projection matrix, g_i = ∇_W f_{W★}(x_i, y_i), and b_i = f_{W★}(x_i, y_i) − g_i^⊤ W★. Now, we define an intermediate solution Ŵ_p as the minimizer of the analogous unprojected problem:

min_{W ∈ R^p} h₂(W) = (1/N) Σ_{i=1}^{N} ℓ(g_i^⊤ W + b_i, y_i).   (12)

The objective value of Ŵ_d in equation (11) must be at most that of Ŵ_p in equation (12), because the latter problem is a special case of the former. Thus, we first have

h₁(Ŵ_d) ≤ h₂(Ŵ_p).   (13)

Next, we compare h₂(Ŵ_p) with L(W★). Recall that W★ is the minimizer of the following problem:

min_{W ∈ R^p} L(W) = (1/N) Σ_{i=1}^{N} ℓ(f_W(x_i, y_i), y_i).   (14)

There are two sources of error in this comparison. The first is the error between f_W(x_i, y_i) and its first-order Taylor expansion g_i^⊤(W − W★) + f_{W★}(x_i, y_i). The second is the error introduced by the random projection. To compare equation (14) with (12), let us expand the former as

min_{W ∈ R^p} (1/N) Σ_{i=1}^{N} ℓ(g_i^⊤ W + b_i + ε_i, y_i),   (16)

where ε_i denotes the Taylor expansion error of the i-th example. Thus, the difference between W★ and Ŵ_p can be attributed to the error terms ε_i. Now we bound their effect. Our idea is to use the fact that the logistic loss is 1-Lipschitz continuous (to see this, one verifies that the derivative of t ↦ log(1 + exp(−t)) is at most 1 in absolute value). With this, we can show that h₂(Ŵ_p) and L(W★) are close to each other. By definition, h₂(Ŵ_p) ≤ h₂(W★). Additionally, by the Lipschitz continuity of the logistic loss,

|h₂(W★) − L(W★)| ≤ (1/N) Σ_{i=1}^{N} |ε_i|.

Recall from the assumption that the averaged Taylor expansion error is at most ε. Next, by the Johnson–Lindenstrauss transformation [24] (for a modern exposition, see, e.g., the lecture notes by Gregory Valiant: https://theory.stanford.edu/~valiant/teaching/CS265/lectureNotes/l9.pdf), provided that d = O(ε^{−2} log N), the random projection changes the objective value by at most ε. Applying the above two steps, we conclude that

h₂(Ŵ_p) ≤ L(W★) + O(ε).   (21)

Applying equation (21) back into equation (13), we conclude that

h₁(Ŵ_d) ≤ L(W★) + O(ε).   (23)

To finish the proof, we apply the same calculation to compare h₁(Ŵ_d) with L(P^⊤ Ŵ_d + W★), which yields the reverse inequality (24). Combining equations (23) and (24), we finally conclude that |h₁(Ŵ_d) − L(W★)| ≤ O(ε). This completes the proof of Proposition 3.3.

□
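The Johnson–Lindenstrauss step can be checked numerically: a Gaussian random projection to d dimensions approximately preserves vector norms. The dimensions below are illustrative, not those of the proof:

```python
import math, random

rng = random.Random(1)
p, d, N = 400, 100, 30  # ambient dim, projection dim, number of vectors

vecs = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(N)]
# Gaussian projection with N(0, 1/d) entries preserves norms in expectation.
P = [[rng.gauss(0, 1 / math.sqrt(d)) for _ in range(p)] for _ in range(d)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def project(v):
    return [sum(P[r][c] * v[c] for c in range(p)) for r in range(d)]

distortions = [abs(norm(project(v)) / norm(v) - 1.0) for v in vecs]
print(max(distortions))  # small, roughly on the order of sqrt(log N / d)
```

Increasing d shrinks the worst-case distortion, matching the d = O(ε^{−2} log N) requirement: a logarithmic dependence on the number of vectors suffices.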
It would also be interesting to examine the Taylor expansion up to the Hessian in equation (2). This requires additional computation of Hessian-vector products. After that, one needs to solve a quadratic program that depends on the Hessian matrix. This is left for future work.
Lastly, there is a line of work on model-agnostic meta-learning and continual learning (see, e.g., the survey article by Hospedales et al. [19]). It would be interesting to see if our method can be applied to this setting (i.e., estimating fine-tuned model parameters without backpropagation). This is a promising direction for future work.

B DATA MATRIX FOR EXAMPLE 4.1
For completeness, we report the data matrix A used to generate the clusters in Example 4.1.
C.1.1 Models. We use the SIGN model [15] as the encoder in the multitask learning models on the community detection tasks. The encoder involves three layers, each with a fixed width of 256 neurons. Our choice of this encoder is without loss of generality, and our observations also apply to other encoders. We construct the node features from the VERSE embedding [51], which encodes personalized PageRank vectors known to be useful for community detection. We use the same number of model parameters for the Auto-λ and MoE baselines as for the other task grouping baselines.
On text classification tasks, we use T5-Base as the base model with LoRA fine-tuning [20], a parameter-efficient fine-tuning method. For each dataset, we evaluate the average performance over all 100 instructions. In our approach, we view one instruction as one task. We train the model with the AdamW optimizer at a learning rate of 5 × 10⁻⁵ for 5,000 gradient update steps. We vary the rank of LoRA among 4, 8, 16, 32, 64, and 128, and find that a rank of 4 leads to the best performance; thus, we set the rank to 4 in our experiments.
C.1.2 Baselines. We describe the details of the selection baselines. Forward selection: start from all-empty groups; enumerate through all tasks, adding each task to the existing group that results in the best average performance. Backward selection: start from one group containing all tasks, with the other groups empty; enumerate through all tasks, removing each task from the first group and assigning it to the group that results in the best average performance.
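A minimal sketch of the forward-selection baseline, with a toy `perf` function standing in for trained-model accuracy:

```python
def forward_selection(tasks, num_groups, perf):
    """Greedy forward selection: assign each task in turn to the group
    whose augmented version yields the best total performance.
    `perf(group)` is a toy stand-in for trained-model accuracy."""
    groups = [[] for _ in range(num_groups)]
    for t in tasks:
        best_g, best_total = 0, float("-inf")
        for g in range(num_groups):
            trial = [grp + [t] if i == g else grp
                     for i, grp in enumerate(groups)]
            total = sum(perf(grp) for grp in trial if grp)
            if total > best_total:
                best_g, best_total = g, total
        groups[best_g].append(t)
    return groups

def perf(group):
    # Toy performance: tasks with the same parity help each other.
    return sum(1.0 for a in group for b in group if a % 2 == b % 2) / len(group)

groups = forward_selection(range(6), 2, perf)
print(groups)  # [[0, 2, 4], [1, 3, 5]]
```

Note that each greedy step requires scoring every candidate group, which in the real baselines means training a model per candidate; this is exactly the cost that Grad-TAG's regression-based estimates can replace.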
To be representative in terms of relative improvement, we also compare against conventional methods for community detection, including BigClam [62] and Louvain clustering [11], network embedding methods including Node2Vec [16] and VERSE [51], and GNN-based community detection methods including MinCutPool [10] and Deep Modularity Networks [52]. Our approach outperforms these community detection baselines. The comparison results are reported in Table 3.

C.2 Omitted results
C.2.1 Additional task grouping results. We illustrate the tradeoff between the error rate and the computation cost, in terms of FLOPs and GPU hours, on the other four datasets in our experiments in Figure 5. We observe that our approach, Grad-TAG, consistently achieves Pareto optimality on these evaluation metrics. While matching the performance of the best baseline, our approach reduces the computation cost by 32.8× and 5.2× in terms of FLOPs and GPU hours, respectively. Compared to baselines using the same level of computation, our approach improves MTL performance by 4% on average.

Figure 5: This figure illustrates the tradeoff between the error rates and computation cost in terms of FLOPs and GPU hours on four datasets omitted in the main text. Our approach, Grad-TAG, consistently achieves Pareto optimality, delivering comparable test accuracy to other MTL baselines while using 32.8× fewer FLOPs and 5.2× fewer GPU hours. m denotes the number of meta-initializations used in our approach.

C.2.2 Correlation between estimated affinities and true scores. Our results show that task grouping with our estimated task affinities achieves competitive performance with the previous method that uses fully computed higher-order task affinities. To explain these results, we hypothesize that the estimated task affinities are highly correlated with the true task affinities, resulting in similar task groupings and, consequently, comparable performance. We compute the Spearman correlation between the estimated and true task affinities corresponding to one task i, i.e., the correlation between [Â_{1,i}, . . ., Â_{T,i}] and [A★_{1,i}, . . ., A★_{T,i}]. We evaluate on the YouTube network of 100 tasks. Using m = 1 meta-initialization, the estimated task affinities have a 0.91 correlation with the true scores, averaged over all tasks. With m = 5, the estimated scores have a 0.96 correlation with the true scores.
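The Spearman correlation used above is the Pearson correlation of ranks; a self-contained sketch on toy affinity columns (the values below are illustrative, not our measured scores):

```python
def rank(values):
    # Average ranks, handling ties by assigning the mean rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

estimated = [0.11, 0.42, 0.30, 0.95, 0.57]   # toy estimated affinities
true_scores = [0.10, 0.40, 0.35, 0.90, 0.60]  # toy true affinities
print(spearman(estimated, true_scores))  # 1.0: identical ordering
```

Because the clustering step depends only on how tasks rank relative to one another, a high rank correlation suffices for the estimated affinities to produce the same groupings as the true ones.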

C.3 Tables of Full Comparisons
Here, we report the complete results for Section 5.3.
Table 3: We report the macro F1-score, computation cost as FLOPs, and runtime as GPU hours on community detection tasks using four social networks. We compare our approach with MTL optimization methods, feature subset selection methods, and graph embedding methods. For each experiment, we report results averaged over three random seeds, with their standard deviations.

Figure 1: Visualization of the gradient-based model approximation step in our Grad-TAE algorithm, where we replace multitask training with a regression-based estimation of model parameters fine-tuned on a particular subset of tasks.

Figure 2: We compare the SDP relaxation with spectral and Lloyd's clustering in a toy example. There are three clusters, with the second and third clusters having higher densities than the first. The black solid line illustrates the clusters yielded by each algorithm. As shown in Fig. 2b, spectral and Lloyd's clustering group the two high-affinity clusters together. Fig. 2a shows that the SDP relaxation separates them correctly.

Figure 3: The number of GPU hours vs. the number of tasks to compute pairwise affinity, evaluated on the Orkut graph with up to 500 tasks. We estimate the full training cost by training on 2000 randomly sampled subsets of tasks.


Figure 4: (a) Multi-label classification on graphs (the YouTube network); (b) instruction fine-tuning of language models (the RTE dataset).