FedHLT: Efficient Federated Low-Rank Adaptation with Hierarchical Language Tree for Multilingual Modeling

Federated Multilingual Modeling (FMM) has become an essential approach in natural language processing (NLP) due to increasing linguistic diversity and the heightened emphasis on data privacy. However, FMM faces two primary challenges: 1) the high communication costs inherent in network operations, and 2) the complexities arising from parameter interference, as languages exhibit both unique characteristics and shared features. To tackle these issues, we introduce a communication-efficient framework for Multilingual Modeling (MM) that combines low-rank adaptation with a hierarchical language tree structure. Our method keeps the base model's weights frozen and updates only the Low-Rank Adaptation (LoRA) parameters, significantly reducing communication costs. Additionally, we mitigate parameter conflicts by organizing languages based on their familial ties rather than merging all LoRA parameters together. Our experimental findings reveal that this novel model surpasses established baseline models in performance and markedly decreases communication overhead.

The growing emphasis on multilingual modeling within natural language processing (NLP) is driven by the increasing diversity of languages present online [21]. However, the acquisition of multilingual data often faces significant challenges, including high costs associated with its dispersed nature and data privacy concerns [11,33]. To overcome these obstacles, we leverage the potential of Federated Learning (FL) to develop a multilingual model utilizing various institutional and data sources [4,10,39]. Central to FL is the concept of exchanging model parameters instead of direct data sharing, which ensures the preservation of data privacy [36,40].
The application of large pre-trained language models (PLMs) to fine-tuning Federated Multilingual Models (FMM) in a federated environment encounters notable difficulties, especially when dealing with limited data [41]. A primary challenge lies in transmitting the PLMs' extensive parameters across the network, leading to communication bottlenecks [18]. Additionally, FMM is inherently prone to non-IID (Non-Independently and Identically Distributed) issues due to diverse linguistic and cultural characteristics [38]. For closely related languages, parameters can be mutually beneficial despite minor conflicts. For more distantly related languages, parameter interference is greater, yet mutual benefits still exist. For example, English and German, which are close in the language family tree (see Figure 2), show significant similarities, whereas English and Chinese, belonging to different language families, exhibit considerable distributional differences. Yet even though Chinese and English are distant in the language tree, they share mutual benefits: for instance, inversions are common in both English and Chinese but rare in German. In this scenario, parameters for Chinese and English can assist each other, whereas English and German parameters might experience minor conflicts. These disparities can hinder language-specific adaptation, leading to significant Parameter Interference (PI) problems [5,23] that adversely affect transfer performance [35], as depicted in Figure 1.
In response to these challenges, we introduce a novel, communication-efficient federated learning framework for multilingual modeling that utilizes a hierarchical language tree (HLT) learning strategy. Drawing inspiration from parameter-efficient fine-tuning (PEFT) strategies [14,16,29,31], our approach fine-tunes a select subset of parameters via Low-Rank Adaptation (LoRA) while keeping the original pre-trained language model (PLM) parameters intact. To our knowledge, this is among the first applications of LoRA in the context of federated learning (FL). By limiting the number of trainable parameters in the LoRA adapter, our method substantially reduces communication overhead, as illustrated in Figure 2. To alleviate interference among diverse languages, we organize languages into clusters based on their linguistic family affiliations, a concept visually represented in Figure 2. Empirical results demonstrate that our method not only achieves enhanced performance but is also more efficient than a range of baseline models.
In this research, we make the following contributions: i. FedHLT Framework. We introduce FedHLT, a novel and communication-efficient approach to federated learning for multilingual modeling. A pivotal aspect of our contribution is the application of Low-Rank Adaptation (LoRA) within Federated Learning (FL), which reduces communication overhead by a factor of 100. ii. Hierarchical Language Tree Learning Strategy. We employ a hierarchical language tree strategy to alleviate parameter interference in federated multilingual modeling.
iii. Experimental Results. We have rigorously tested the FedHLT framework across a suite of downstream tasks, i.e., language modeling, machine translation, and text classification. The results from these experiments demonstrate the superior performance of our FedHLT framework.

RELATED WORK

2.1 Federated Learning in NLP
Federated learning (FL) [19,25], a distributed machine learning framework, consists of a central server and several client nodes. In this model, clients' raw data is kept local to address privacy concerns. The training process involves exchanging parameters among clients instead of data [22]. However, the non-IID (not Independently and Identically Distributed) nature of data across these clients hampers FL's performance, often leading to lower accuracy than centralized training models [17]. Recent advancements have seen federated multilingual models increasingly deployed in a variety of tasks, such as medical transcript analysis [24], enhancing multilingual natural language understanding through knowledge composition [33], applying pre-trained models in multilingual federated settings [34], multilingual emoji prediction [12], and machine translation [23]. Nonetheless, these models often suffer from inefficiency due to the extensive data exchanged between the server and clients during training. Current solutions, like adapter tuning, unfortunately introduce additional latency during inference. In our work, we integrate LoRA [15], an approach for parameter-efficient fine-tuning, to decrease the number of trainable parameters by a factor of 100 and reduce GPU memory requirements by a factor of 3, thereby enhancing both efficiency and performance in federated multilingual modeling.

2.2 Parameter-efficient Fine-tuning
Parameter-Efficient Fine-Tuning (PEFT) is a technique designed to modify a minimal subset of parameters in pre-trained language models (PLMs) for specific tasks, as opposed to retraining the entire model [3,14,15,20,41]. PEFT methods are generally divided into three categories [8]. The first category, addition-based methods, incorporates additional trainable parameters not originally present in the model; these can introduce challenges such as inference latency in adapters [14,16] and limited input sequence handling in prefix-tuning [20]. The second category, specification-based methods, includes BitFit [3] and diff pruning [13], which selectively make certain original model parameters trainable while freezing the rest. The third category, reparameterization-based methods like LoRA [15], transforms existing parameters into a more efficient form through reparameterization techniques. Despite their advantages, PEFT models can reduce the performance of language models, as shown by Zhang et al. [41] and by our experiments. This decrease in performance is mainly attributed to parameter interference among different languages, a significant challenge that needs to be addressed in multilingual contexts.
METHOD

FedHLT distinguishes itself through two fundamental innovations: (1) federated low-rank fine-tuning, which offers an effective approach to learning in federated multilingual settings, and (2) a hierarchical tree learning strategy, which addresses the parameter interference that often arises in multilingual learning. In the following sections, we delve deeper into the design and implementation of FedHLT. Section 3.1 introduces the overall setting of Federated Multilingual Modeling. Section 3.2 provides a detailed exposition of federated low-rank fine-tuning, explaining its significance and mechanics. Finally, Section 3.3 discusses our HLT learning strategy, showcasing its role in mitigating parameter interference in multilingual scenarios.

Federated Multilingual Modeling
We begin by introducing the formulation of Federated Multilingual Modeling (FMM) [34]. Given $N$ language datasets $\{D_i\}_{i=1}^{N}$, the goal of FMM is to collaboratively train a multilingual FL model that achieves high performance on the downstream tasks. Specifically, in the FMM setting we assume there are $N$ clients $\{C_i\}_{i=1}^{N}$. Each client $C_i$ owns only one language dataset $D_i$, and different clients have different languages. Let $\Theta_i$ be the trainable parameters of the local model on $C_i$. At each training round $t$, the server initially performs a weighted aggregation of the LoRA parameters for client $C_i$ (whose node in the language family tree is $n_i$) and its parent nodes $P_i = \{p_i^1, \ldots, p_i^L\}$ (where $p_i^l$ is the parent node of $n_i$ at level $l$), then sends the result to the corresponding language's client. The client $C_i$ trains the local FL model with parameters $\Theta_i^{(t)}$ on its own dataset $D_i$ and then sends the parameters back to the server $S$. The server $S$ then aggregates these parameters to update $n_i$ and its parent nodes $P_i$, obtaining $\Theta^{(t+1)}$, and sends $\Theta^{(t+1)}$ to client $C_i$ for the subsequent training round. FedAvg is employed for aggregation by default [25] and is computed as follows:

$$\Theta^{(t+1)} = \sum_{i=1}^{N} \frac{|D_i|}{\sum_{j=1}^{N} |D_j|}\, \Theta_i^{(t)}. \quad (1)$$

Federated Efficient Fine-tuning with Low-Rank Adaptation
In FMM, training the entire FL model incurs substantial communication costs, as it involves computing and exchanging a large number of parameters over the network. The success of fine-tuning on PLMs motivates us to explore adjusting only a small portion of the parameters in FMM.

FMM with Low-Rank Adaptation. It has been shown that PLMs exhibit a low "intrinsic dimension" when adapting to specific tasks [1] and can still learn efficiently despite a random projection to a smaller subspace. Inspired by this, in FMM we hypothesize that the local updates to the weights $\Theta$ for each client also have such a low "intrinsic rank" during training. Therefore, we employ the Low-Rank Adapter (LoRA) for efficient FMM fine-tuning: instead of training and exchanging $\Theta$ for each client, we only adjust the adapter parameters $\Delta\Theta$ during propagation. Specifically, the forward process for a linear layer in the FMM model is computed as

$$h = \Theta x + \Delta\Theta x = \Theta x + BAx, \quad (2)$$

where $x$ represents the output of the previous layer and $h$ is the hidden state. Note that $\Theta \in \mathbb{R}^{d \times k}$ contains the parameters of the PLM used in the local model, which is frozen. $\Delta\Theta$ contains the parameters of the adapter, which are updated during training rounds. $\Delta\Theta$ can be factorized into two matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. As the intrinsic rank $r \ll \min(d, k)$ is small, $\Delta\Theta = BA$ has far fewer parameters to communicate.
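To make the low-rank forward pass above concrete, here is a toy sketch of $h = \Theta x + BAx$ on nested-list matrices. The names and dimensions are illustrative, and a real implementation would use a tensor library; the point is the communication saving, since exchanging B and A costs r(d + k) values instead of the d·k values in Θ.

```python
def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix, both as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lora_forward(x, theta, B, A):
    """h = Theta x + B A x: frozen weight theta plus the low-rank update BA.
    Shapes: theta is d x k, B is d x r, A is r x k, x is k x 1."""
    base = matmul(theta, x)          # frozen PLM path (never communicated)
    delta = matmul(B, matmul(A, x))  # trainable low-rank path, r << min(d, k)
    return [[base[i][0] + delta[i][0]] for i in range(len(base))]
```

With d = k = 2 and r = 1, only r(d + k) = 4 adapter values are exchanged per layer instead of the d·k = 4 frozen ones; for realistic d, k in the thousands the gap is what drives the reported 100x communication reduction.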
Federated Parameter-Efficient Fine-tuning. At the end of each training round, clients transfer their updated LoRA parameters to the server. When the server receives the parameters of all clients, it aggregates the LoRA parameters as

$$\Delta\Theta^{(t+1)} = \sum_{i=1}^{N} \frac{|D_i|}{\sum_{j=1}^{N} |D_j|}\, \Delta\Theta_i^{(t)}. \quad (3)$$

Updating LoRA Parameters with Hierarchical Language Tree Learning Strategy
The PI issue is common in FMM. The presence of languages from different sources with diverse distributions introduces a non-i.i.d. (non-independent and identically distributed) nature, which leads to conflicts when aggregating parameters trained on different datasets $D_i$. The update of the parameters $\Theta_i$ from one client may have an adversarial effect on the others, yielding sub-optimal performance.
Hierarchical Language Tree Learning Strategy (HLT). To address the PI issue prevalent in FMM, we introduce the HLT approach. Prior cluster-based methods have demonstrated effectiveness in reducing PI [23,28,32]. Our HLT method not only exploits the benefits of clustering languages within the language family tree, where closely related languages aid each other, but also acknowledges the potential conflicts and occasional assistance between more distantly related languages. This approach resonates with findings in FL research, where clustering subsets of clients with similar distributions can mitigate PI, and it shares similarities with existing clustering strategies. In language modeling, languages are grouped based on linguistic similarities, forming language families. Our method follows the categorizations of the language family tree as outlined in [26]. We aggregate the LoRA parameters according to this family tree, as depicted in Figure 2. For instance, languages in the Germanic family, including English and German, are clustered together, as are languages in the Italic family (Spanish, French, and Portuguese), the Balto-Slavic family (Russian, Polish, Czech, and Lithuanian), the Sino-Tibetan family (including Chinese), the Uralic family (including Finnish), the Afro-Asiatic family (including Arabic), and the Japonic family (including Japanese). This clustering allows us to capitalize on the synergies within language families while minimizing conflicts, thereby enhancing the overall efficiency and effectiveness of our FMM approach.
Let $\{G_m\}_{m=1}^{M}$ ($M \le N$) denote the parent nodes of a client node at each hierarchical level in the language tree. Each $G_m$ contains a set of indices $i$ indicating the $i$-th clients whose datasets $D_i$ belong to the $m$-th language tree path. The aggregation in Equation (3) then changes to

$$\Delta\Theta^{m,(t+1)} = \sum_{i \in G_m} \frac{|D_i|}{\sum_{j \in G_m} |D_j|}\, \Delta\Theta_i^{(t)}. \quad (4)$$

Regarding our implementation, we maintain $M$ LoRA adapters associated with the different language tree paths $G_m$. For downstream tasks in specific languages, we utilize the corresponding $\Delta\Theta^{m,(t+1)}$ for inference. The comprehensive algorithm detailing this process is presented in Algorithm 1.
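The per-path weighted aggregation described above can be sketched as follows. The client ids, path names, and flat parameter vectors are hypothetical simplifications of actual LoRA weight matrices; only the weighting scheme mirrors the paper.

```python
def aggregate_path(delta_thetas, dataset_sizes, members):
    """Dataset-size-weighted FedAvg of LoRA updates for one tree path G_m.
    delta_thetas:  client id -> flat list of LoRA parameters (Delta Theta_i)
    dataset_sizes: client id -> |D_i|
    members:       client ids i belonging to G_m."""
    total = sum(dataset_sizes[i] for i in members)
    dim = len(delta_thetas[members[0]])
    agg = [0.0] * dim
    for i in members:
        weight = dataset_sizes[i] / total  # |D_i| / sum over j in G_m of |D_j|
        for j in range(dim):
            agg[j] += weight * delta_thetas[i][j]
    return agg

def hierarchical_aggregate(delta_thetas, dataset_sizes, paths):
    """One aggregated LoRA adapter per language-tree path G_m."""
    return {m: aggregate_path(delta_thetas, dataset_sizes, members)
            for m, members in paths.items()}
```

For example, a "germanic" path holding an English client (|D| = 1) and a German client (|D| = 3) weights the German update three times as heavily, so closely related languages shape their shared adapter in proportion to their data.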

EXPERIMENT
We conduct a thorough evaluation of our model across three widely recognized NLP tasks, i.e., Language Modeling (LM), Machine Translation (MT), and Text Classification (TC). For these evaluations, we utilize four distinct datasets: Europarl, MTNT, UN Corpus, and News Classification. Detailed statistics and the specific evaluation metrics for each dataset are provided in Table 1.

Algorithm 1: Hierarchical Language Tree Aggregation
Input: the hierarchical language tree node set.

To offer a comprehensive understanding of our evaluation process, we have organized the following sections accordingly. In Section 4.1, we present detailed descriptions of each task, elaborating on the specific challenges and objectives associated with LM, MT, and TC. Section 4.2 discusses the evaluation metrics employed in our study, providing insight into how model performance is quantitatively assessed across the different tasks. Section 4.3 delves into the datasets, highlighting their relevance, composition, and the rationale behind their selection for this research. Section 4.4 provides comprehensive details about the training process, including the framework, GPU type, learning rates, training durations, and the use of pre-trained models. Section 4.5 analyzes the performance of three settings (Centralized Model, FedAvg, and Standalone) and the standard Adapter approach. Lastly, in Section 4.6, we analyze the main results presented in Tables 3, 4, 5, and 6 for the LM, MT, and TC experiments, highlighting our consistently superior performance compared to other federated learning methods and discussing key observations.

Tasks.
Language Modeling (LM). The LM task involves predicting the subsequent word in a given sequence; its evaluation metric, perplexity (PPL), serves as a measure of model performance. It tests the model's ability to understand contextual dependencies, capture semantic relationships, and generate coherent and meaningful sequences of words. For example, given the sentence "The acting in this movie is", the model would predict the next word, such as "excellent". When the language model effectively captures complex language patterns and dependencies, leading to more accurate next-word predictions, the PPL will be low. In our study, we employ the UN Corpus and Europarl datasets for the LM experiments.

Machine Translation (MT). The MT task involves automatically translating text from a source language to a target language, with BLEU serving as the evaluation metric. For example, given a source text in English, such as "Hello", the desired output in French would be "Bonjour". A higher BLEU score indicates better translation quality. In our research, we utilize the UN Corpus and MTNT datasets.

Text Classification (TC). The TC task involves assigning predefined labels to text data; accuracy evaluates the classification model by measuring the percentage of correctly predicted class labels out of the total number of predictions. For example, if the input text is "This movie is fantastic!", the output label should be "movie". In our study, we adopt the News Classification dataset.
We adopt these three tasks to verify that the proposed method provides a comprehensive performance improvement in the FMM setting.
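Since PPL is central to the LM evaluation above, here is a minimal sketch of how perplexity follows from per-token log-probabilities; the function name and inputs are illustrative, not tied to the paper's codebase.

```python
import math

def perplexity(token_log_probs):
    """PPL = exp of the mean negative log-likelihood over predicted tokens.
    token_log_probs: natural-log probabilities the model assigned to each
    ground-truth next token in the evaluation text."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)
```

A model that assigns probability 0.25 to every correct next token has PPL 4, i.e., it is "as confused" as a uniform choice among four words; lower PPL therefore indicates better next-word prediction.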

Evaluation Metric.
In evaluating our model across the three fundamental NLP tasks (LM, MT, and TC) we employ different datasets and metrics. For the language modeling task, we use perplexity (PPL) as the evaluation metric [34]. For the machine translation task, we use BLEU as the evaluation metric, computed with the SacreBLEU package [27]. For the text classification task, we use accuracy as the evaluation metric.
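The reported scores come from SacreBLEU; purely for intuition, the toy sketch below computes a smoothed sentence-level BLEU with uniform 4-gram weights and a single reference. It is an illustration of the metric's shape, not the SacreBLEU implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, reference, max_n=4):
    """Toy smoothed BLEU: clipped n-gram precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        # Floor at a tiny value so a zero-match order does not zero the score.
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec)
```

An exact match scores 1.0 and a fully disjoint hypothesis scores near 0; SacreBLEU additionally standardizes tokenization and corpus-level statistics, which is why it is the right tool for the reported numbers.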

Implementation Details
In our federated learning setup, we employ the FedLab framework [37], adhering to the training methodology outlined in [34].

Baselines
In our experiments, we utilize three distinct settings: Centralized Model, FedAvg, and Standalone, each offering unique insights into the performance of our model. The Centralized Model, following the approach of Weller et al. [34], consolidates all data in a single location for training. FedAvg adopts the Federated Averaging method described by McMahan et al. [25], distributing data across various clients within the federated learning framework. Both of these settings train a conventional multilingual model using all parameters. The Standalone setting diverges by training on data in just one language and assessing performance across multiple languages. This scenario, as explored by Weller et al. [34], simulates a model trained solely with data from a single client. To highlight the effectiveness of our HLT and LoRA methods, we also include scenarios where the parameters of the PLMs are frozen in both the Centralized and FedAvg settings. Additionally, we compare the performance of the LoRA method [15] and the standard Adapter approach [14] without HLT to further demonstrate the superiority of our proposed techniques.

Main Results
In this section, we delve into the results outlined in Tables 3, 4, 5, and 6, which correspond to our experiments in language modeling, neural machine translation, and text classification. A thorough analysis of these tables reveals that our approach consistently outperforms other FL methods across the majority of tasks. The following are several key observations.

FMM Models Outperform Standalone. In our study, the standalone model acts as the lower performance benchmark. Our results indicate that models trained with FedAvg generally outperform the standalone setup. This finding emphasizes the value of FMM in real-world language training, as it efficiently leverages diverse datasets across languages without the constraints of data barriers, highlighting its practical superiority in federated learning environments.

Parameter-Efficient FT vs. Full-Parameter FT. Our results, as shown in Tables 3, 4, 5, and 6, demonstrate that our model utilizing LoRA for PEFT not only matches but, in certain tasks like text classification (Table 6), surpasses the performance of conventional full fine-tuning models. This enhanced performance is likely due to LoRA's focused approach mitigating overfitting, which in turn preserves the pre-trained model's generalization capabilities and leads to an overall improvement in effectiveness.

Comparison with Adapter-based PEFT Methods. We observed a notable trend in our experiments: not all PEFT methods are equally effective in FMM contexts. Specifically, Adapter-based methods often do not reach the performance levels of full fine-tuning (FT). For instance, in text classification (Table 6), while FedAvg with FT achieves an average accuracy of 85.3% across five languages, the combination of FedAvg and Adapter yields slightly lower effectiveness at 84.9%. In contrast, FedAvg paired with LoRA achieves a higher accuracy of 87.3%, outperforming FT. This clearly demonstrates LoRA's distinct advantages in FMM, emphasizing its
efficiency and uniqueness. Additionally, our LoRA-based method significantly enhances efficiency over traditional adapter-based PEFT methods. For text classification, FedAvg with FT requires a hefty 278.1 million trainable parameters (TP), and using an Adapter demands 5.4 million TP. In stark contrast, LoRA manages the same task with just 2.5 million TP, as detailed in Table 6. This dramatic reduction in TP not only speaks to LoRA's heightened efficiency but also underscores its effectiveness compared to other PEFT methods.
Model Efficiency and Communication Costs. The proposed FedHLT demonstrates high efficiency across several key dimensions, as shown in Figure 3. (1) Trainable Parameters: integrating LoRA reduces the trainable parameters to less than 1% of those required by conventional full fine-tuning, and also cuts the parameter count of Adapter-based methods in half, as shown in Tables 3, 4, 5, and 6. Such a drastic reduction is vital for minimizing communication costs in federated learning environments. (2) GPU Memory: as detailed in Table 2, FedHLT surpasses the FedAvg baseline in memory efficiency. For instance, in a text classification task, FedHLT uses only 20.8G of GPU memory, while the baseline consumes 27.2G. This efficiency is attributed to our model's optimized fine-tuning approach. Compared with other PEFT methods like Adapter and LoRA, FedHLT remains competitive in memory usage.

Hierarchical vs. Indiscriminate Aggregation. Findings in [34] indicate that in some instances, a FedAvg model in a FedNLP framework can exceed the performance of a centralized model. We attribute this occurrence to parameter interference. Within the language family tree, closely related languages often share more similarities in parameters, whereas languages with distant relationships, despite some conflicts, can still mutually assist each other. Aggregating parameters indiscriminately from all languages tends to diminish these characteristics, potentially leading to poorer model performance [2]. In contrast, employing the HLT learning strategy capitalizes on these characteristics, resulting in improved model performance. This phenomenon is also evident in the three tasks of our experiments, showcasing the nuanced interplay of language relationships and parameter management in federated learning.

Table 5: Results for FL experiments on the machine translation task. We evaluated our model's performance on the UN Corpus and MTNT datasets and compared our proposed method against the baseline methods. The BLEU score was chosen as the performance metric, with bold numbers indicating the best results. ↑ means higher is better, ↓ means lower is better. Our method consistently achieved higher BLEU scores than all baseline models, indicating that the clustering strategy employed significantly improves performance. Furthermore, our method demonstrated the lowest number of trainable parameters (TP), highlighting the effectiveness of our model.

Method | # TP ↓ | MTNT ↑: En-Fr, En-Ja, Avg | UN ↑: En-Fr, Ar-Es, Ru-Zh, Avg

Table 6: Results for FL experiments on the text classification task. We test the model on the NC dataset and compare our proposed method with the baseline methods. ↑ means higher is better, ↓ means lower is better. We use accuracy as the metric, with bold numbers indicating the best results. Our method consistently outperforms full fine-tuning, Adapter, and LoRA in terms of accuracy across multiple languages, providing strong evidence for the effectiveness of our framework. Furthermore, our approach demonstrates superior efficiency by utilizing the fewest trainable parameters, further highlighting the efficiency of our method.

CONCLUSION
In conclusion, this paper presented an innovative approach in FMM, introducing a highly efficient federated learning framework, FedHLT.Our method leverages the strengths of LoRA and a hierarchical language tree learning strategy, addressing key challenges in the realm of multilingual natural language processing.The significant reduction in trainable parameters, enhanced GPU memory utilization, and reduced training times, as demonstrated in our experiments, underline the efficacy of FedHLT in a federated setting.
Experimental results demonstrate the efficiency and effectiveness of our proposed model, resulting in a remarkable reduction of communication overhead by a factor of 100.

Figure 1 :
Figure 1: Traditional Federated Learning (FL) encounters two primary challenges in the context of Federated Multilingual Modeling (FMM). 1. Huge communication cost: the necessity for large models to learn multilingual knowledge leads to significant communication costs due to the transfer of extensive model parameters across clients. 2. Parameter interference: various languages exhibit unique characteristics while also sharing commonalities. Our FedHLT model addresses these two challenges through the introduction of Low-Rank Adaptation (LoRA) and a multilingual hierarchical language tree learning strategy, respectively.

Figure 2 :
Figure 2: The overall framework of FedHLT. FedHLT is a communication-efficient framework designed for Federated Multilingual Learning, comprising two key designs: federated low-rank fine-tuning and the HLT approach. Specifically, languages are divided into a family tree based on their language families. At the outset, both the server and client sides possess an identical pre-trained language model and maintain a set of LoRA parameters for each language tree node. During the federated learning process, for each language, the server initially performs a weighted aggregation of the LoRA parameters for that language's node in the language tree and its parent nodes, then sends the result to the corresponding language's client. Each client then updates the LoRA parameters and sends the updated parameters back to the server. The server, upon receiving these LoRA parameters, performs a weighted aggregation and uses it to update the nodes of that particular language in the language tree and all its parent nodes. The parameter-efficient fine-tuning design with the divide-and-conquer strategy in FedHLT effectively resolves the parameter interference arising from multilingual learning and reduces the communication costs in federated learning.

Figure 3 :
Figure 3: Comparison of trainable parameters and performance between FedHLT and baselines on the Text Classification task.FedHLT achieves less than 1% parameter count compared to FedAvg and demonstrates significant performance improvements compared to other PEFT solutions, e.g., LoRA and Adapter.

(3) Training Time: the training time for FedHLT is substantially shorter than that of traditional models; PEFT training with FedHLT for a single client takes only 1-3 hours.

Our approach involves freezing a pre-trained model and solely training adapters, which is more parameter-efficient. For each client $C_i$, we add a LoRA module with trainable parameters $\Delta\Theta_i$ in parallel to the PLM parameters $\Theta_i$. In each training round $t$, we freeze the parameters of the PLM, $\Theta_i$.

Input: initial LoRA parameters $\Delta\Theta^0$; client set $\{C_i\}_{i=1}^{N}$; for each path in the language tree, the parent nodes of the client node at each hierarchical level; training rounds $T$.
Output: LoRA parameters $\{\Delta\Theta^m\}_{m=1}^{M}$.
1 for $m$ from 1 to $M$ do
2     Initialize $\Delta\Theta_m^0$ with $\Delta\Theta^0$;
3 for $t$ from 1 to $T$ do
4     for $i$ from 1 to $N$ do // local update of client $C_i$
5         update $\Delta\Theta_i^{t-1}$
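The round structure of Algorithm 1 can be sketched in Python as follows. All names are hypothetical; we assume a uniform average over a node and its ancestors for the server-to-client step, since the exact weighting is not specified here, and a dataset-size-weighted average for the upload step.

```python
def fedhlt_round(server_adapters, client_paths, dataset_sizes, local_train):
    """One FedHLT communication round, sketching Algorithm 1 (hypothetical names).
    server_adapters: tree node -> flat LoRA parameter vector, one per node
    client_paths:    client id -> [its leaf node, parent, ..., root]
    dataset_sizes:   client id -> |D_i|
    local_train:     stand-in for a client's local LoRA update."""
    dim = len(next(iter(server_adapters.values())))
    updates = {}
    for cid, path in client_paths.items():
        # Server -> client: average the client's node with its ancestors
        # (uniform weights assumed for this sketch).
        init = [sum(server_adapters[n][j] for n in path) / len(path)
                for j in range(dim)]
        # Client: local LoRA update on its own language data, then upload.
        updates[cid] = local_train(cid, init)
    # Server: size-weighted aggregation into every node on each client's path.
    for node in {n for p in client_paths.values() for n in p}:
        members = [c for c, p in client_paths.items() if node in p]
        total = sum(dataset_sizes[c] for c in members)
        server_adapters[node] = [
            sum(dataset_sizes[c] / total * updates[c][j] for c in members)
            for j in range(dim)]
    return server_adapters
```

Only LoRA vectors cross the network in either direction, which is where the communication saving over exchanging full model weights comes from; leaf nodes stay language-specific while ancestor nodes blend updates from their whole subtree.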

Table 1 :
Datasets and metric information for the three experiment tasks.

Table 2 :
Memory usage for FL experiments on the four tasks. Bold numbers indicate the best memory usage. The proposed FedHLT demonstrates superior memory efficiency compared to the FedAvg baseline, thanks to parameter-efficient fine-tuning. Compared to PEFT methods, e.g., Adapter and LoRA, FedHLT achieves comparable memory efficiency.

Table 3 :
Results for LM experiments on the UN Corpus. A comparison was made between our proposed method and the baseline methods. PPL was chosen as the performance metric, with bold numbers indicating the best results. ↑ means higher is better, ↓ means lower is better. Our method consistently achieved lower PPL scores than FedAvg + Adapter and FedAvg + LoRA across all languages, indicating its superior performance. Furthermore, our model demonstrated the lowest number of trainable parameters (TP), highlighting its efficiency.

Table 4 :
Results for LM experiments on Europarl. The presentation is the same as in Table 3. Our method consistently achieved lower PPL scores than FedAvg + Adapter and FedAvg + LoRA across all languages, indicating its superior performance. Furthermore, our model demonstrated the lowest number of trainable parameters (TP), highlighting its efficiency.