Only Send What You Need: Learning to Communicate Efficiently in Federated Multilingual Machine Translation

Federated learning (FL) is a promising approach for solving multilingual tasks, potentially enabling clients with their own language-specific data to collaboratively construct a high-quality neural machine translation (NMT) model. However, communication constraints in practical network systems present challenges for exchanging large-scale NMT engines between FL parties. In this paper, we propose a meta-learning-based adaptive parameter selection methodology, MetaSend, that improves the communication efficiency of model transmissions from clients during FL-based multilingual NMT training. Our approach learns a dynamic threshold for filtering parameters prior to transmission without compromising the NMT model quality, based on the tensor deviations of clients between different FL rounds. Through experiments on two NMT datasets with different language distributions, we demonstrate that MetaSend obtains substantial improvements over baselines in translation quality in the presence of a limited communication budget.


Introduction
Federated Learning (FL) has emerged as a popular distributed machine learning paradigm. FL enables collaborative model training among a set of clients via periodic aggregations of local models by a server (McMahan et al., 2017; Konecný et al., 2016). The FL property of keeping data local to clients has important privacy advantages that have made it attractive for many learning applications.
Natural language processing (NLP) is one domain standing to benefit from FL since user-generated text may contain sensitive information. Among the applications of FL in NLP, relatively few works have considered multilingual NLP and the impact of different languages on FL (Liu et al., 2021). In recent years, neural machine translation (NMT) has shown substantial progress in this domain with the advent of large-scale language models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2019), and their extensions. NMT has a further natural alignment with FL given its setting of non-IID local data distributions (Weller et al., 2022): each client (user) typically has a specific language direction they are interested in for translation, which their local dataset will be skewed towards, motivating them to collaborate with each other via FL to construct the general NMT model.
On the other hand, resource utilization is often a concern in deploying large-scale NMT models due to demands imposed on computational and memory resources (Ganesh et al., 2020; Gupta and Agrawal, 2020). While FL distributes the processing load, every client must exchange its model parameters with a central server during the FL communication phases. Communication efficiency is a known bottleneck in traditional FL applications (McMahan et al., 2017) and becomes an even more critical challenge with large-scale NMT models.
In this paper, we are interested in optimizing multilingual NMT performance over an FL system with a limited communication budget. A premise for our work is that exchanging complete NMT engines in FL might not be necessary, similar to the argument in Passban et al. (2022). We can develop some intuition around this through a small FL experiment using the well-known FedAVG algorithm (McMahan et al., 2017). In Figure 1, we perform FL on the UN Corpus dataset (see Section 4 for details) distributed across three clients (each containing one language translation direction), and plot the differences in NMT model tensors between a few consecutive training rounds for one of the clients. These differences are computed and visualized tensor by tensor, indicating the deviation of each tensor. We observe that the majority of deviations in the NMT model tensors cluster within small norms, while a small subset of tensors exhibits significant deviations. This observation is consistent across all clients, datasets, and data distribution combinations considered in this study.
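The tensor-wise deviation computation behind Figure 1 can be sketched in a few lines. This is a minimal illustration under our own assumptions (models as plain dicts mapping hypothetical tensor names to flat lists of floats, with the absolute-value norm of the change as the deviation), not the paper's actual implementation:

```python
def tensor_deviations(prev_model, curr_model):
    """Per-tensor deviation between two model snapshots.

    Each model is a dict mapping tensor names to flat lists of floats;
    the deviation of a tensor is the absolute-value norm (sum of |change|)
    of its elements, mirroring the tensor-by-tensor differences in Figure 1.
    """
    return {
        name: sum(abs(c - p) for c, p in zip(curr_model[name], prev_model[name]))
        for name in curr_model
    }

# Toy snapshots of one client's model at rounds r-1 and r (hypothetical names).
prev = {"encoder.attn": [0.10, 0.20], "decoder.ffn": [1.00, 1.00]}
curr = {"encoder.attn": [0.10, 0.21], "decoder.ffn": [1.60, 0.20]}

devs = tensor_deviations(prev, curr)
# Most tensors barely move between rounds, while a few deviate strongly.
```

Plotting a histogram of `devs.values()` for a real NMT engine would reproduce the clustered pattern described above: most tensors change very little between rounds, while a small subset changes substantially.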
Recently, Passban et al. (2022) proposed a strategy of focusing on either highly fluctuating or less active tensors during FL communication to reduce communication load. Their approach involves sending only a fixed portion of NMT parameters: either the top 50% or the bottom 50% of tensors ranked by their deviation, which is computed using the previous round's engine. The result of this is illustrated by the red thresholds for the cases in Figure 1. However, sending a fixed portion of the parameters does not account for the fact that the deviation distribution will likely vary dynamically across rounds (as also observed in Figure 1). As a result, this approach is not sensitive to NMT quality, potentially resulting in the transmission of either too many parameters, i.e., extra communication burden without any significant change in translation quality, or too few parameters, i.e., an undesirable model that negatively impacts translation quality.
Contributions. Motivated by this, our objective is to explore dynamic thresholding techniques for multilingual federated NMT. The central challenge is how to adaptively determine a threshold that selectively filters parameters out of transmission up to the point where we expect translation quality to start being compromised. To address this, our methodology, MetaSend, incorporates a meta-learning approach that generates a dynamic sending threshold adapting to the varying deviation distribution across training rounds. The result is depicted by the blue thresholds for the cases in Figure 1. In doing so, MetaSend considers translation quality and communication efficiency as joint objectives in multilingual NMT training. In developing MetaSend, we make three major contributions:

• We conduct the first research on the communication efficiency of FL in multilingual NMT, and study the relationship between translation quality and the volume of transmitted parameters in multilingual NMT engines.

2 Related Work

Efficient NLP
Previous research has explored efficiency enhancements for large NLP models from a computational perspective, i.e., achieving comparable results with fewer resources (Treviso et al., 2022). Some studies have focused on the data side, e.g., showing how smart downsampling of available datasets can result in equal or improved performance compared to using the entire dataset (Lee et al., 2021; Zhang et al., 2022). On the other hand, efforts to enhance efficiency through model design include questioning the necessity of full attention heads in large language models and demonstrating that removing certain attention heads does not significantly impact test performance (Kovaleva et al., 2019; Tay et al., 2020; Raganato et al., 2020). Compared to these works, motivated by the recent demand for FL in NLP, we specifically focus on communication efficiency in federated multilingual NMT and design a strategy that selectively transmits only the essential parameters of NMT engines for learning.

FL in NLP
Recent research has begun exploring FL methods for NLP applications requiring privacy preservation (Sui et al., 2021; Qin et al., 2021; Ge et al., 2020; Lin et al., 2022). During the FL communication phase, large NLP models are exchanged, introducing a significant communication cost associated with model updates. To address this, Melas-Kyriazi and Wang (2022) proposed a gradient compression methodology for language modeling, while Ro et al. (2022) studied communication-efficient training of federated language models.

We follow the standard cross-silo FL setup. The FedAVG aggregation is defined as:

$$W^{r}_{s} = \sum_{k=1}^{K} \frac{n_k}{n} W^{r}_{k}, \qquad (1)$$

where K is the total number of clients, n_k is the number of samples in the k-th client's dataset, n is the total number of all training data points, and W^r_s and W^r_k are the model parameters at the r-th communication round for the server S^r and the k-th client C^r_k, respectively.

Algorithm 1 Cross-Silo Federated Learning
1: Server S, clients C_k, total number of clients K
2: for each round r = 1, 2, ..., R do
3:    for each client k = 1, 2, ..., K do
4:       C^r_k performs local training
5:       Send(C^r_k) to server
6:    S^r ← Aggregation(C^r_1, C^r_2, ..., C^r_K)
7: end for

The system has finished one FL communication round once the server has completed the aggregation. For the next round, the clients Receive the server's weights for initialization. The overall process is repeated for r = 1, 2, ..., R FL communication rounds.
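To make the FedAVG aggregation concrete, here is a dependency-free sketch in which each client model is represented as a dict of flat float lists; the tensor name `layer` is a placeholder of our own, not part of any real NMT engine:

```python
def fedavg(client_weights, client_sizes):
    """FedAVG: sample-size-weighted average of client parameters.

    client_weights: list of dicts mapping tensor name -> list of floats.
    client_sizes:   n_k for each client; n is the sum of all n_k.
    """
    n = sum(client_sizes)
    agg = {name: [0.0] * len(vals) for name, vals in client_weights[0].items()}
    for w_k, n_k in zip(client_weights, client_sizes):
        for name, vals in w_k.items():
            for i, v in enumerate(vals):
                agg[name][i] += (n_k / n) * v
    return agg

# Two clients with unequal data: the larger client dominates the average.
w1 = {"layer": [1.0, 1.0]}
w2 = {"layer": [3.0, 5.0]}
server = fedavg([w1, w2], client_sizes=[1, 3])
```

With n_1 = 1 and n_2 = 3, the second client contributes three quarters of the average, so the aggregated `layer` becomes [2.5, 4.0].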

Overview of MetaSend
During FL communication, a large number of NMT model weights must be uploaded to the server during the Send action (in Algorithm 1) for aggregation. This communication can be quite costly for a large NMT model, which is a key bottleneck for FL. To tackle this challenge, we propose MetaSend, which adapts the NMT tensors sent based on a customized sending threshold for each communication round. The key idea of MetaSend is to build Model-Agnostic Meta-Learning (MAML) (Shu et al., 2019) into the FL rounds to balance communication efficiency and translation quality.
Figure 2 and Algorithm 2 summarize the overall procedure. In each round r, after completing the local iterations using its local training data, each client C^r_k retains its learned model weights W^r_k and training loss L^r_k. MetaSend then operates according to the following steps: (i) After every client has finished training its local model, the training losses L^r_1, L^r_2, ..., L^r_K of all K clients are input to our MAML module, which is implemented as a multi-layer perceptron (MLP) network. The MAML module serves as a server-side component that leverages the clients' losses to learn a threshold, which is subsequently shared with all clients. The purpose of this module is to generate a customized threshold θ^r based on the extracted losses (line 5 in Algorithm 2), which should account for the anticipated impact on learning performance. (ii) Based on the threshold θ^r, each client C^r_k selects which model tensors to send based on a deviation comparison with its previous version C^{r-1}_k (line 6 in Algorithm 2). (iii) After receiving the transmissions, the server S^r executes the aggregation of the received tensors (line 8 in Algorithm 2).

MetaSend: Customized Sending and Aggregation
In this subsection, we answer the first question mentioned above. Our intuition is that the extent of a model parameter's deviation relative to its original norm provides an indication of whether the information is worth sending. Our observation in Section 1 shows that the tensors of the NMT model responsible for learning exhibit a clustered pattern in the deviation distribution.
Compared with the clients in the previous round, MetaSend first computes the deviation (dev) for each tensor:

$$\mathrm{dev}(\ell) = \left\| W^{r}_{k}(\ell) - W^{r-1}_{k}(\ell) \right\|, \qquad (2)$$

where ℓ ∈ L denotes a particular tensor of the model, and || · || is the absolute-value norm that measures the difference between clients' weights in different rounds. Based on dev and the learned threshold θ^r (line 6 in Algorithm 2), MetaSend selects each tensor to be sent based on one of two criteria: whether its dev is greater (g) or less (l) than the threshold θ^r. Each of these has potential advantages: deviations above the threshold (g) promote sending tensors that have experienced the largest changes, which could be either an informative or a noisy update, while deviations below the threshold (l) encourage more gradual tensor refinements that are not susceptible to sudden large fluctuations. As a result, MetaSend yields two sending methods, MetaSend_g and MetaSend_l:

$$W'^{r}_{k} = \begin{cases} \{\, W^{r}_{k}(\ell) : \mathrm{dev}(\ell) > \theta^{r} \,\} & (\text{MetaSend}_g) \\ \{\, W^{r}_{k}(\ell) : \mathrm{dev}(\ell) < \theta^{r} \,\} & (\text{MetaSend}_l) \end{cases} \qquad (3)$$

where W'^r_k represents the selected model weights for the k-th client in round r.
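This selection rule is easy to sketch under a toy representation of models as dicts of flat float lists with hypothetical tensor names; the sketch below is our own illustration, not the paper's implementation:

```python
def select_tensors(prev, curr, threshold, mode):
    """MetaSend parameter selection, sketched.

    mode="g" keeps tensors whose deviation exceeds the threshold (the
    largest changes); mode="l" keeps those below it (gradual refinements).
    Deviation is the absolute-value norm of the tensor's change since the
    previous round.
    """
    selected = {}
    for name in curr:
        dev = sum(abs(c - p) for c, p in zip(curr[name], prev[name]))
        if (mode == "g" and dev > threshold) or (mode == "l" and dev < threshold):
            selected[name] = curr[name]
    return selected

prev = {"enc": [0.0, 0.0], "dec": [1.0, 1.0]}
curr = {"enc": [0.9, 0.0], "dec": [1.1, 1.0]}

big = select_tensors(prev, curr, threshold=0.5, mode="g")    # only "enc" moved a lot
small = select_tensors(prev, curr, threshold=0.5, mode="l")  # only "dec" moved a little
```

Note that the threshold is applied to each tensor's deviation independently, so no sorting over the whole model is needed; this is what makes the per-tensor decision immediate.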
Given the resulting weights W'^r_k of every client, the server then executes aggregation via FedAVG (line 8 in Algorithm 2). Formally, Equation 1 becomes:

$$W^{r}_{s} = \sum_{k=1}^{K} \frac{n_k}{n} W'^{r}_{k}. \qquad (4)$$

MetaSend: MAML Module Update
To address the second question, we aim for our MAML module to generate an adaptive sending threshold based on translation quality. Figure 3 shows the learning process and the optimization flow of this module. ϕ^r represents the hyperparameter set of our MAML module in communication round r, and θ^r(ϕ^r) is the generated threshold from the module. After the parameter selection process and aggregation (Equations 3 and 4), the parameters of the k-th resulting client model and the global model can be expressed as W'^r_k(ϕ^r) and W^r_s(ϕ^r), respectively.

Algorithm 2 FL with MetaSend
1: Model parameters: W_s for server S, W_k for each client C_k, ϕ for MAML module (MLP)
2: for each round r = 1, 2, ..., R do
3:    for each client k = 1, 2, ..., K do
4:       W^r_k, L^r_k ← local training on C_k's data
5:    θ^r ← MAML module ϕ^r applied to (L^r_1, ..., L^r_K)
6:    each client selects tensors W'^r_k by comparing dev with θ^r
7:    Send resulting local model W'^r_k to server
8:    S^r ← FedAVG(W'^r_1, ..., W'^r_K)
9:    ϕ^{r+1} ← meta update using L_val (Equation 5)
10: end for

To assess the quality of the global model W^r_s(ϕ^r), we randomly select b batches of samples from the validation dataset and evaluate the global model using these samples. Subsequently, we employ the validation loss L_val(W^r_s(ϕ^r)) as the optimization objective for the MAML module, which encourages the module to adapt in the direction of superior translation quality. Thus, our MAML module update can be written as:

$$\phi^{r+1} = \phi^{r} - \beta \nabla_{\phi} L_{val}\big(W^{r}_{s}(\phi^{r})\big), \qquad (5)$$

where β is the learning rate for the meta update. By optimizing the MAML module with consideration of the translation quality of the global NMT model, our MAML module can generate a customized threshold θ^r for each round that considers both the deviation distribution and the translation quality. We will see in Section 5 how this process of learning what parameters to send results in substantial translation quality and communication efficiency improvements.
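The meta update in Equation 5 can be illustrated with a deliberately simplified, dependency-free sketch. Here the "MAML module" is reduced to a single parameter `phi` mapping the mean client loss to a threshold through a sigmoid, and the gradient in Equation 5 is approximated by finite differences rather than backpropagation through the NMT pipeline; all names and the quadratic validation loss are our own illustrative assumptions:

```python
import math

def meta_update(phi, client_losses, val_loss_fn, beta=0.1, eps=1e-4):
    """One meta step: phi <- phi - beta * dL_val/dphi (Equation 5, sketched)."""

    def theta(p):
        # Stand-in for the MLP: sigmoid of the scaled mean client loss.
        mean_loss = sum(client_losses) / len(client_losses)
        return 1.0 / (1.0 + math.exp(-p * mean_loss))

    # Finite-difference approximation of the gradient through theta.
    grad = (val_loss_fn(theta(phi + eps)) - val_loss_fn(theta(phi - eps))) / (2 * eps)
    return phi - beta * grad

# Pretend the validation loss of the aggregated model is minimized
# when the threshold equals 0.3.
val_loss = lambda t: (t - 0.3) ** 2

phi = 0.0
for _ in range(200):
    phi = meta_update(phi, client_losses=[2.0, 1.0], val_loss_fn=val_loss, beta=0.5)

final_theta = 1.0 / (1.0 + math.exp(-phi * 1.5))  # mean client loss is 1.5
```

Over the 200 steps the generated threshold converges toward the loss-minimizing value of 0.3, mimicking how the real module adapts θ^r in the direction of better translation quality.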
4 Experimental Design

Datasets and Client Partitioning
We utilize two NMT datasets: MTNT (Michel and Neubig, 2018) and the UN Corpus (Ziemski et al., 2016). The MTNT dataset comprises English to French (En → Fr) and English to Japanese (En → Ja) translations, while the UN Corpus (referred to as UNMT below) includes three official UN language directions: English to French (En → Fr), Arabic to Spanish (Ar → Es), and Russian to Chinese (Ru → Zh). Three training settings are considered: (i) centralized training without FL; (ii) FL with IID data, where the data for each client is sampled randomly from all data; and (iii) FL with non-IID data, where each client only sees data for one language direction. See Appendix A for more details on the datasets.

Base Model and Evaluation Metrics
Following the multilingual FL experimental settings in Weller et al. (2022), we use the M2M-100 model (Fan et al., 2021) to conduct machine translation. The M2M-100 model is a sequence-to-sequence model with 418M parameters that can translate between any pair of 100 languages. We measure translation quality using sacreBLEU (Post, 2018) and COMET (Rei et al., 2020).
SacreBLEU is a commonly used metric for evaluating NMT quality, while COMET is a more advanced metric that shows some degree of correlation with human judgement. See Appendix A for more details on these translation metrics.
Besides translation quality, we consider two metrics for FL efficiency: tensor saving and processing time. Tensor saving is defined as the ratio of tensors that are not exchanged between the server and clients during the Send step in Algorithm 1 (or line 7 in Algorithm 2). For efficiency evaluation, we report the average tensor saving and the exact processing time over all training rounds.
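As a small worked example of the tensor-saving metric (the counts below are our own toy numbers, not results from the paper):

```python
def average_tensor_saving(sent_per_round, total_tensors):
    """Average fraction of tensors NOT sent over all FL rounds."""
    savings = [1.0 - sent / total_tensors for sent in sent_per_round]
    return sum(savings) / len(savings)

# A 200-tensor model where a client sends 80, 100, and 120 tensors
# in three successive rounds: per-round savings are 0.6, 0.5, and 0.4.
avg_saving = average_tensor_saving([80, 100, 120], total_tensors=200)
```

This gives an average tensor saving of 0.5, i.e., half of the model's tensors were spared from transmission over the three rounds.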

FL Training and MAML Module
We build our FL experiments using the Flower framework (Beutel et al., 2020) for training and evaluation. For centralized experiments, we train models for 50 epochs and discuss the effect of pre-trained knowledge for NMT. For every FL experiment, we train each method for 25 communication rounds (epochs) and initialize the clients using a pre-trained M2M-100 model from Hugging Face's transformers library (Wolf et al., 2019). As a reference, we also conduct FL experiments that initialize the clients' models with random weights; the corresponding results can be found in Appendix G.
For our MAML module, we use a multi-layer perceptron (MLP) network with one hidden layer containing 100 neurons as the default setting. The ablation study in Section 5.3 presents the results of MetaSend with different numbers of neurons in the MAML module. To optimize the MAML module, we randomly sample a small portion (b batches) of the validation set in each round; the effect of b is also examined in Section 5.3.

Baselines
We use several competitive baseline approaches and parameter selection strategies to evaluate federated NMT. PMFL (Weller et al., 2022) is the basic FL framework that uses a pre-trained model for federated NMT without any decision-making mechanism. DP_g and DP_l are the recent methods from Passban et al. (2022) that select which tensors to send by comparing the norm difference between the previous and current client models.
Their thresholding mechanism sorts tensors by this norm difference and sends either the top 50% (DP_g) or the bottom 50% (DP_l) of tensors during the aggregation process. We also include the results of a random configuration, RandSend, which randomly sends 50% of the tensors during FL aggregation.
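The baseline selection rules are easy to sketch side by side. The code below is our own illustration (hypothetical tensor names and deviation values), not the authors' implementation:

```python
import random

def dp_select(devs, mode):
    """DP baselines: always send a fixed 50% of tensors ranked by deviation."""
    ranked = sorted(devs, key=devs.get, reverse=(mode == "g"))
    return set(ranked[: len(ranked) // 2])

def rand_select(devs, seed=0):
    """RandSend: send a random 50% of tensors, ignoring deviations."""
    names = sorted(devs)
    random.Random(seed).shuffle(names)
    return set(names[: len(names) // 2])

devs = {"a": 0.9, "b": 0.05, "c": 0.4, "d": 0.01}
top = dp_select(devs, "g")     # the two most-changed tensors
bottom = dp_select(devs, "l")  # the two least-changed tensors
```

Both DP variants must rank all deviations before any tensor can be sent, which is the sorting cost that a per-tensor threshold test avoids.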
5 Experimental Results

Translation Performance Evaluation
Table 1 presents the sacreBLEU results of the translation task for both datasets. In the first section, we observe that the centralized method outperforms the PMFL methods, as we would expect; on the other hand, it compromises data privacy by not preserving individual client data confidentiality. Further, the performance decrease of PMFL from IID to non-IID FL training reveals the challenges in the practical NMT scenario of clients having only single language directions. By randomly sending the model parameters, RandSend achieves the lowest performance among all methods for both IID and non-IID FL. Compared with our MetaSend methods, we see that the DP methods face challenges. The significant performance improvements of MetaSend over DP show the advantage of modeling an adaptive sending threshold based on the norm difference distribution. Moreover, our MAML-learned threshold learns what to send during communication to better optimize the NMT task. Note that this threshold is dynamic and can adapt to different norm difference distributions in each round. Specifically, our MetaSend method achieves average sacreBLEU improvements of 3.9 and 3.4 points over DP on IID and non-IID data, respectively.

Among all the parameter selection methods, MetaSend_l achieves the highest scores in both the sacreBLEU and COMET metrics (refer to Appendix C). It demonstrates comparable translation quality to PMFL, indicating its ability to preserve communication resources without compromising translation quality. Translation examples generated by our method and the baseline are provided in Appendix E, where it is evident that our method shows better alignment with the ground truth regarding sentiment and accurate word usage. Additional results presented in Appendix F show the consistent superiority of our method over the baselines, not only in scenarios with ample training data but also in situations where clients have limited data resources.

Communication Efficiency Evaluation
In Figure 4, we present the average sacreBLEU score and tensor savings for each method across 25 communication rounds on the MTNT (a and b) and UNMT (c and d) datasets. The RandSend and DP methods save around 50% of tensors during FL communication by design. By sending tensors based on a specifically learned threshold, the MetaSend methods obtain substantial improvements in translation quality compared to the DP methods, while also obtaining varying degrees of tensor savings. Among our two MetaSend methods, MetaSend_l performs better in translation quality, indicating that sending the majority of tensors for the update ensures significant performance improvement. On the other hand, MetaSend_g demonstrates higher tensor savings, with an average of 10.3% more tensors saved compared to the DP methods.
In addition to evaluating tensor savings, Table 2 provides the specific training time for each method. During local iterations, all methods require a similar amount of time to process the entire local training dataset and update all parameters in the model. Although PMFL does not spend time on the parameter selection process, it consumes the most time in client communication and aggregation due to the necessity of transmitting and aggregating every tensor in the model. Both DP and MetaSend require computation time for calculating deviations between a client's current tensors and its tensors from the previous round. However, DP carries out its operation by selecting either the top or bottom 50% of tensors, which occurs after all deviations have been calculated and sorted. In contrast, MetaSend immediately decides whether to send a tensor based on the learned threshold after calculating a single deviation, and without any sorting computation. As a result, MetaSend requires less time for parameter selection compared to DP. Finally, the time required for sending and aggregation depends on the tensor savings of each scheme. Both MetaSend_l and MetaSend_g involve additional computation for the MAML module, and they spend a similar amount of time on this module as it is independent of the operator. The breakdown of the time spent on the MAML module for the MetaSend methods is provided in Table 3. We observe that the most time-consuming step is the meta-evaluation, which requires forward-passing a few batches through the global model and obtaining the validation loss. Therefore, we conduct an ablation study in Section 5.3 to examine the impact of the number of sampled batches. In sum, our proposed MetaSend significantly enhances translation quality while achieving greater resource savings compared to both the PMFL and DP methods.

Ablation Studies
Effectiveness of learned threshold. To isolate the impact of the sending threshold, we compare MetaSend with different thresholds, including our learned threshold, a fixed threshold (θ^r = 0.5), and a random threshold selected from 0 to 1. The red arrows in Figure 5 show the improvements of MetaSend when using different thresholds within a single operator (l or g). Compared with the other thresholds, our learned threshold significantly increases both translation quality and the amount of tensor savings. By comparing Figures 5 and 4, we see that MetaSend with a fixed threshold sometimes outperforms the DP methods, suggesting that sending based on the deviation distribution should be preferred over sending approximately half of the model's tensors. See Appendix D for additional details and a comprehensive discussion of our learned threshold.
Parameters used in the MAML module. To explore the impact of the number of neurons on performance, we keep the learning parameters consistent while varying the number of neurons in the hidden layer of the MAML module. In Figure 6, we see that using more neurons in the MAML module generally leads to improved results in terms of sacreBLEU score and tensor savings. The performance gain from using more neurons is intuitive, since it provides additional degrees of freedom for learning optimization. However, it is important to note that using more neurons also incurs higher resource requirements during system construction.
Meta-evaluation for the MAML module. To examine the influence of the number of batches used to optimize our MAML module, Figure 7 shows the performance of our method with different numbers of batches used for optimization. We see that increasing the number of samples used for MAML optimization generally results in improved translation quality and efficiency. Naturally, using more batches increases the exact time spent on our MAML module. However, the time taken by clients to send parameters for aggregation may be more critical than this optimization time, as the optimization process is performed only once in each round.
Additional experiments. Other results on limited training data and the absence of initialized knowledge, as well as detailed discussions of our learned threshold and ablation studies with other datasets and data distributions, are available in Appendices D-F.

Conclusion
We conducted the first analysis of the efficiency of federated multilingual NMT. To address the practical challenges that arise in this setup, we proposed MetaSend, which selects for transmission the tensors that are most critical to the NMT model. By adaptively learning the sending threshold in each FL round based on meta-learning, MetaSend not only improves communication efficiency, but also effectively captures which parameters are worth sending for NMT quality. Extensive experiments on two datasets showed that MetaSend outperforms existing baselines in machine translation quality and significantly reduces communication costs in FL, confirming its advantage in practical federated NMT settings.

A Details for Datasets and Evaluation Metrics
This paper considers two widely used NMT datasets: MTNT (Michel and Neubig, 2018) and the UN Corpus (Ziemski et al., 2016). The Machine Translation of Noisy Text (MTNT) dataset was gathered from user comments on Reddit discussion threads. It contains two language directions, English to French (En → Fr) and English to Japanese (En → Ja), with 5,605 instances per direction for training and approximately 1k each for the validation and test sets. The UN Corpus consists of manually translated UN documents from the years 1990 to 2014, and we consider three official UN language directions: English to French (En → Fr), Arabic to Spanish (Ar → Es), and Russian to Chinese (Ru → Zh). The dataset contains 80k instances per direction for training and approximately 10k each for the validation and test sets.
The evaluation metric sacreBLEU is widely used in the machine translation community and is built upon BLEU (Papineni et al., 2002). Our study uses the standard sacreBLEU settings: nrefs:1, mixed case, eff:no, tok:13a, smooth:exp, and version 2.0.0. For Japanese (Ja) and Chinese (Zh), we use their respective tokenizers to ensure accurate evaluation. We utilize the default COMET model as suggested by the authors, which employs a reference-based regression approach and is developed based on XLM-R. This model has been trained on direct assessments from WMT17 to WMT20 and assigns scores ranging from 0 to 1, where 1 indicates a perfect translation. The COMET metric has been found to exhibit segment-level correlation with human evaluations and has demonstrated its potential to distinguish between high-performing systems more effectively.

B Hyperparameters and Compute Settings
For all experiments, the MT engine (the M2M-100 model) is optimized using the Adam optimizer. We search the learning rates over [1e-2, 5e-3, 1e-3, 5e-4, 1e-4] and select 5e-3 as the optimal learning rate for the MT engine. The batch size for the MT engine is set to 2 for both client training and updating the MAML module. The MAML module is optimized via the Adam optimizer with a learning rate of 1e-3, which is also searched over [1e-2, 5e-3, 1e-3, 5e-4, 1e-4].
We run all experiments on a 3-GPU cluster of Tesla V100 GPUs, with each GPU having 32GB of memory. Centralized experiments were conducted on one of our three Tesla V100 GPUs, while FL experiments utilized K GPUs, where K is the total number of clients. Each centralized experiment ran for approximately four hours and two days for the MTNT and UNMT datasets, respectively, when run on a single GPU. For FL experiments on the MTNT dataset, each simulation was completed in about 2 hours by distributing clients' data over 2 GPUs. For FL experiments on the UNMT dataset, each simulation ran for approximately 14 hours by distributing clients' data over 3 GPUs.

C Translation Quality Evaluated by COMET
This section provides supplementary results to Section 5.1. Several research works have noted that traditional overlap-based evaluation metrics do not correlate well with human evaluation (Sinha et al., 2020; Zhang et al., 2020; Sellam et al., 2020; Hsu et al., 2021, 2022). Therefore, we also evaluated each method using a more recent MT evaluation metric, COMET (Rei et al., 2020), whose quality estimates are better aligned with human judgement. Table 4 shows the translation quality evaluated using COMET for each method. Consistent with the sacreBLEU results in Table 1, MetaSend_l demonstrates the highest average COMET score across the two datasets, while MetaSend_g achieves comparable quality.

D Learned Threshold of MetaSend
This section provides supplementary results to Section 5.3, and Figure 8 shows the learned threshold θ^r for our method. The decreasing value of θ^r for MetaSend_l indicates its focus on sending tensors with even smaller deviations during communication. As the model converges and becomes stable, the deviations decrease, resulting in MetaSend_g generating a smaller threshold for a milder deviation distribution.

E Case Study
This section provides some translation examples supplementing Section 5.1. Figure 9 illustrates that our method generates translations that closely align with the ground truth by utilizing similar words. In contrast, the baseline method produces translations with redundant tokens, leading to potential confusion within the sentence. Figure 10 shows that our method employs the same words as the ground truth, conveying a neutral sentiment, whereas the baseline method generates similar words but with a negative sentiment.

Figure 11 shows the challenge of training FL algorithms under a non-IID data distribution. We can see that our method, trained under the IID setting, achieves a good translation compared to the ground truth by round 9. However, when trained under the non-IID setting, our method requires 17 communication rounds to achieve comparable results. The result obtained at round 9 for our method trained under the non-IID setting shows that the model has not yet converged, resulting in chaotic translations with mixed languages.

F Insufficient Training Samples for Clients
To mirror the limited data scenario of each client in practical FL, we performed experiments by randomly sampling a small portion of data from the MTNT and UNMT datasets as training data. Specifically, we randomly selected 20% of the samples from each dataset, resulting in 1k and 10k training samples per language direction for the MTNT and UNMT datasets, respectively, while keeping the validation and test sets at the same size. All other hyperparameters, such as the batch size for the NMT engine and MAML optimization, the number of neurons in the MAML module, and the learning rate, remain the same as in Section 5. Tables 5 and 6 present the translation quality of each method when trained with limited data. It is evident that our MetaSend methods consistently outperform the other baselines and sometimes deliver comparable or even slightly better performance compared to PMFL. MetaSend_l shows overall superiority on the UNMT dataset, whereas the results on the MTNT dataset are more diverse due to the presence of noise in the dataset.

G FL Experiments Without Using Pre-trained Knowledge

Tables 8 and 9 show the results of each method without utilizing pre-trained knowledge for the M2M-100 model in each client; specifically, each client's model is initialized with random weights rather than a pre-trained checkpoint.

[Figures 9 and 10, translation examples: the ground truth and MetaSend_l read "Recently, the United Nations Millennium Declaration identified solidarity as one of the fundamental values essential to international relations." and "Funding for these offices was included in the 2007 budget, but the opening was delayed.", while DP produces the redundant "...listed solidarity as a fundamental value of fundamental importance in international relations." and the compressed "The 2007 budget included funds for these offices but delayed their opening."]
Figure 10: Translation examples (Ru → Zh) of DP_l, MetaSend_l, and ground truth. Our method generates the same sentiment-bearing word ("延迟") as the ground truth, while DP_l generates a similar but different sentiment-bearing word ("延误").

H Supplementary Results for Ablation Study
This section provides additional results to complement Section 5.3. Figure 12 shows an ablation study of our method on the MTNT and UNMT datasets, examining the impact of different numbers of neurons in our MAML module. Figure 13 presents a further ablation investigating the effect of different numbers of batches inputted to the MAML module. These additional results align with the findings discussed in Section 5.3: increasing the resources devoted to our MAML module, such as the number of neurons or batches, generally improves performance. Additionally, Table 7 provides time measurements for meta-evaluation with different numbers of batch samples. As expected, using more samples for meta-learning increases the time required.
Figure 1: Sample histograms of the differences (absolute-value norms) between tensors of NMT engines computed for clients across consecutive communication rounds in FL training. The traditional method (red thresholds) fails to accurately capture the boundary between clusters when selecting tensors to send, while our MetaSend (blue thresholds) provides a dynamic threshold that adapts to the varying distribution across FL rounds.
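The deviation statistic underlying these histograms can be sketched as follows. Plain Python lists stand in for the NMT engine's parameter tensors, and the fixed-threshold selection mimics the static baseline rather than MetaSend's learned, per-round threshold:

```python
def tensor_deviations(prev_state, curr_state):
    """Per-tensor absolute-value norm (sum of |delta|) between consecutive FL rounds.

    States map tensor names to flat lists of floats; in a real NMT engine
    these would be the model's parameter tensors.
    """
    return {name: sum(abs(c - p) for c, p in zip(curr_state[name], prev_state[name]))
            for name in curr_state}

def select_tensors(deviations, threshold):
    """Names of tensors whose deviation exceeds the sending threshold."""
    return {name for name, dev in deviations.items() if dev > threshold}

# Hypothetical two-tensor model: the encoder weight changed a lot, the
# decoder weight barely moved, so only the former is worth transmitting.
prev = {"enc.w": [0.0, 0.0], "dec.w": [1.0, 1.0]}
curr = {"enc.w": [0.5, 0.5], "dec.w": [1.0, 1.01]}
devs = tensor_deviations(prev, curr)
sent = select_tensors(devs, threshold=0.1)
```

A static `threshold` works only if the deviation distribution is stable across rounds; the figure's point is that it is not, which is what motivates learning the threshold instead.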

Figure 2: Overview of MetaSend for federated NMT. MetaSend enables clients to adaptively select important parameters of NMT models based on a learned threshold for each communication round. Each client sends only a subset of model tensors to the server for aggregation, enhancing efficiency within a limited communication budget.
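On the server side, aggregating such partial uploads might look like the following sketch. Averaging only over the clients that sent a given tensor, and falling back to the current global value otherwise, is our illustrative assumption rather than the paper's exact aggregation rule:

```python
def aggregate_partial(global_model, client_updates):
    """Average each tensor over the clients that actually sent it.

    `client_updates` holds one partial state dict per client (tensor name ->
    flat list of floats). Tensors that no client sent keep their current
    global values; this fallback is an illustrative assumption.
    """
    new_model = dict(global_model)
    for name in global_model:
        received = [u[name] for u in client_updates if name in u]
        if received:
            new_model[name] = [sum(vals) / len(received) for vals in zip(*received)]
    return new_model

# Hypothetical round: both clients sent "enc.w", neither sent "dec.w".
g = {"enc.w": [0.0, 0.0], "dec.w": [5.0]}
updates = [{"enc.w": [1.0, 1.0]}, {"enc.w": [3.0, 3.0]}]
merged = aggregate_partial(g, updates)
```

The communication savings come entirely from the clients: tensors omitted from an upload cost nothing on the wire, and the server reconciles the gaps.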

Figure 3: Optimization of our MAML module in an FL setup. The module aims to adapt the sending threshold based on NMT model quality.

Figure 4: The average sacreBLEU score and the amount of tensor savings for each method. We see that MetaSend g and MetaSend l exhibit different tradeoffs between tensor savings and translation quality.

Figure 5: Average sacreBLEU score and tensor savings of MetaSend with different sending thresholds.
Figure 7: Average sacreBLEU scores and tensor savings for MetaSend with different MAML module batch numbers on the UNMT dataset in Non-IID FL.

Figure 8: The learned threshold θ r for our method.

Figure 12: The average sacreBLEU score and the amount of tensor savings for MetaSend with different numbers of neurons in the MAML module.
Figure 13: The average sacreBLEU score and tensor savings for MetaSend with varying numbers of batches inputted to the MAML module.

Table 1: SacreBLEU scores obtained with centralized and FL (IID and Non-IID) methods for various strategies on the MTNT and UNMT datasets. Bold scores indicate that MetaSend outperforms other methods in all cases.

… update, we use 16 batches (i.e., b = 16). In Section 5.3, we also present an ablation study on b to examine the effect of MAML optimization on NMT quality. See Appendix B for detailed hyperparameters and compute settings.

Table 2: The average training time (in seconds) spent over 25 training rounds for each method on our machine.

Table 3: Detailed time spent within our MAML module.

Table 4: COMET scores obtained with centralized and different FL methods on the MTNT and UNMT datasets.

Table 5: SacreBLEU scores obtained with each method with the reduced number of training samples.

Table 6: COMET scores obtained with each method with the reduced number of training samples.

Table 7: Time spent passing different numbers of samples through the NMT meta-evaluation.

Table 8: SacreBLEU scores obtained with each method without using pre-trained weights as initialization. Bold scores indicate the best in the column for the given section.

Table 9: COMET scores obtained with each method without using a pre-trained model as initialization.