Inclusive Data Representation in Federated Learning: A Novel Approach Integrating Textual and Visual Prompts

Federated Learning (FL) is often impeded by communication overhead. Prompt tuning has been introduced as a potential solution, adjusting only a few trainable parameters rather than the whole model. However, current single-modality prompt tuning approaches fail to portray local clients' data comprehensively. To overcome this limitation, we present Twin Prompt Federated learning (TPFL), a pioneering solution that integrates both visual and textual modalities, ensuring a more holistic representation of local clients' data characteristics. Furthermore, to tackle data heterogeneity, we introduce Augmented TPFL (ATPFL), which applies contrastive learning to TPFL, not only enhancing the global knowledge acquisition of client models but also fostering the development of robust, compact models. The effectiveness of TPFL and ATPFL is substantiated by our extensive evaluations, which consistently show superior performance compared to all baselines.


INTRODUCTION
The emergence of distributed learning systems has provided considerable advantages across a wide range of domains. Nonetheless, growing privacy concerns about distributed learning have necessitated the advent of Federated Learning (FL) [2,20], a framework expressly developed to protect participants' private information. In FL, instead of uploading their private data, local clients share their local model weights with a central server during each communication round. The server aggregates these models and circulates them back to the local clients, thereby accomplishing the goal of information consolidation.
Recently, FL has confronted a wealth of challenges, including significant communication overheads [19,26,27] and data heterogeneity [13]. A variety of recent research initiatives have sought to tackle these obstacles. Specifically, some have proposed efficient encoding and model compression algorithms to reduce the communication cost, such as quantization, which maps a continuous range of values into a finite set, and sparsification [24], which clips the full gradient into a sparse one, as well as intelligent scheduling of client participation [21] during the training process. Moreover, some augment the original FL framework with an additional knowledge distillation step [17] to contract larger models into smaller ones, thereby enhancing the robustness of the global model. Despite these strategies, certain inherent limitations persist. Primarily, they require a substantial volume of labeled training samples, which may be unavailable to many clients in the FL environment, hindering effective training and resulting in model overfitting [12]. In addition, notwithstanding the communication cost reduction achieved by these methods, most IoT devices, such as smart home devices or industrial sensors, cannot accommodate large backbone model training due to their limited processing power [11], minuscule memory, and energy constraints. To illustrate, training a ResNet-50 model [9] involves intensive computation and storage: it has approximately 25 million weight parameters and computes 16 million activations in the forward pass. Even after applying communication-efficient algorithms to the weights and activations, the total storage needed for saving ResNet-50's intermediate gradient results exceeds 7.5 GB for a mini-batch of 32 on a high-performance GPU. Given the hardware constraints of typical IoT devices, it is clear that they would struggle to support such intensive computation and memory requirements.
To resolve these problems, current research is leaning towards prompt tuning [14]. Unlike conventional fine-tuning methods in FL that tune and aggregate full model parameters, applying prompt learning in FL adjusts only soft prompts for the corresponding downstream tasks, while keeping the large backbone model frozen to diminish both communication and computation costs. Returning to the ResNet-50 case, prompt tuning shrinks the gradient results to just a handful of MB, drastically decreasing the communication overhead. However, most existing work considers only a single modality, failing to represent the local clients comprehensively. For instance, Guo et al. [7] exclusively employ textual soft prompts to depict the local clients without taking visual knowledge into consideration, while Feng et al. [5] leverage continuous visual prompts to capture image data information, disregarding textual knowledge. In contrast, our work proposes Twin Prompt Federated learning (TPFL), a method resorting to both visual and textual modalities for a more comprehensive representation of the local clients' data characteristics. Moreover, we find that merely combining the two modalities overlooks the potential for a unified approach. As such, we devise Augmented TPFL (ATPFL), which fuses contrastive learning into prompt tuning, facilitating the acquisition of global knowledge by client models. To the best of our knowledge, ATPFL is the first to integrate both textual and visual modalities within the context of FL and to use contrastive learning to connect them. The contributions of this paper are threefold:
• We present an innovative FL framework named ATPFL that merges both visual and textual modalities for an improved representation of local clients' data characteristics, surpassing the performance of existing work that considers only a single modality.
• We incorporate contrastive learning into prompt tuning, enabling clients to acquire more global knowledge and improving on the direct combination of modalities, which may overlook the potential for a unified approach. This is the first work to integrate two modalities within the context of FL and to utilize contrastive learning for their integration.
• Extensive evaluations have been conducted to ascertain the effectiveness of TPFL and ATPFL. The results demonstrate that ATPFL outperforms all the baselines.
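To make the communication contrast concrete, the following back-of-envelope sketch compares a full-model upload against a prompt-only upload. The prompt length and embedding dimension are illustrative assumptions, not values from the paper:

```python
# Illustrative communication arithmetic (assumed prompt shape).
FULL_MODEL_PARAMS = 25_000_000       # ResNet-50 weight parameters (from text)
PROMPT_LEN, EMBED_DIM = 16, 512      # hypothetical soft-prompt shape

prompt_params = PROMPT_LEN * EMBED_DIM        # trainable parameters
bytes_full = FULL_MODEL_PARAMS * 4            # fp32 bytes exchanged per round
bytes_prompt = prompt_params * 4

print(f"full-model upload : {bytes_full / 1e6:.1f} MB")
print(f"prompt-only upload: {bytes_prompt / 1e3:.1f} KB")
```

Even with a much longer prompt, the per-round payload stays in the KB-to-MB range rather than the ~100 MB needed for full fp32 weights.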

RELATED WORKS

Communication Efficiency
Communication efficiency has always been a critical challenge in the FL field, and several lines of research have been investigated to tackle it. Firstly, quantization [6] methods represent the full model parameters with fewer bits, converting high-precision floating-point parameter values into lower-precision ones. For example, stochastic quantization [1] adaptively adjusts the quantization level in a stochastic manner. Secondly, sparsification methods improve communication efficiency by directly reducing the number of model parameters to be sent. More specifically, sparsification selects an important subset of model parameters and sets the other, insignificant parameters to zero before sending them to the global server. Top-k sparsification and rank-k sparsification are common sparsification methods [3]. Han et al. [8] proposed adaptively changing the sparsification level to minimize overall training time. Shi et al. [25] introduced global-k sparsification to compress the downstream communication from the server to the clients. Thirdly, knowledge distillation has also been investigated to alleviate communication overhead [15]. Knowledge distillation methods transfer knowledge from a larger teacher model to a smaller student model; examples of knowledge-distillation-based federated learning include FedMD [15] and FedDF [23]. However, all the aforementioned strategies have high resource requirements and can hardly be implemented on IoT devices due to their hardware restrictions.
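As a concrete illustration of top-k sparsification, a minimal plain-Python sketch (the helper name is ours, not from any cited system; real implementations operate on tensors):

```python
def topk_sparsify(grad, k):
    """Keep the k largest-magnitude gradient entries; zero out the rest."""
    kept = set(sorted(range(len(grad)), key=lambda i: abs(grad[i]),
                      reverse=True)[:k])
    return [g if i in kept else 0.0 for i, g in enumerate(grad)]

print(topk_sparsify([0.5, -0.01, 2.0, 0.1], k=2))  # [0.5, 0.0, 2.0, 0.0]
```

Only the surviving values (and their indices) need to be transmitted, which is the source of the bandwidth savings.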

Prompt Tuning
Houlsby et al. [10] proposed parameter-efficient transfer learning with adapter modules. Liu et al. [18] showed that prompt tuning can match the performance of fine-tuning with only 0.1%-3% of the parameters tuned in the context of Natural Language Understanding. Li and Liang [16] applied prefix-tuning to GPT-2 and BART for downstream tasks and showed that prefix-tuning can outperform fine-tuning in low-data settings. Guo et al. [7] proposed a federated learning framework for prompt tuning called PromptFL, which leverages the power of federated learning to train prompts on decentralized data across multiple devices. In that work, only a single-modality text prompt is used, and the results show that federated prompt tuning achieves better performance than fine-tuning FL in many IID and non-IID settings. Nonetheless, existing research primarily focuses on a single modality, constraining its capability to capture information about local clients. In this paper, we propose employing both textual and visual representations to comprehensively characterize the local client.

METHODOLOGY
This section begins by outlining the basic structure of FL. Subsequently, we introduce TPFL, which considers both visual and textual information. Despite showing improvements, TPFL has certain inherent limitations. Therefore, we propose ATPFL to address these shortcomings and achieve superior performance.

Problem Statement
In the general FL setting, the entire system comprises $N$ clients, of which $K$ actively participate in every round, each possessing a unique local dataset. The local dataset on client $k$ consists of $n_k$ samples, each a pair $(x_i, y_i)$ of a data feature $x_i$ and its corresponding target label $y_i$. The primary objective of FL is to construct a global model parameter vector $w$ that minimizes the mean loss across all local datasets, as expressed in the following optimization problem:
$$\min_{w} F(w) = \sum_{k=1}^{N} \frac{n_k}{n} F_k(w), \qquad F_k(w) = \frac{1}{n_k} \sum_{i=1}^{n_k} \mathcal{L}(w; x_i, y_i),$$
where $w$ denotes the weights of the prediction model, $n = \sum_{k} n_k$, and $\mathcal{L}$ is the loss function.
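The sample-weighted objective above underlies the standard FedAvg aggregation rule. A minimal sketch, assuming client models are flat lists of floats (a simplification of real tensor-valued weights):

```python
def fedavg(client_weights, client_sizes):
    """Weighted average of client models: w = sum_k (n_k / n) * w_k."""
    n = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[j] * n_k for w, n_k in zip(client_weights, client_sizes)) / n
            for j in range(dim)]

# Client with 3x the data pulls the average toward its weights.
print(fedavg([[1.0, 0.0], [3.0, 2.0]], [1, 3]))  # [2.5, 1.5]
```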

Twin Prompt Federated Learning (TPFL)
As aforementioned, CoOp [28] resorts to a series of continuous learnable parameters as the textual prompts, replacing the manually designed constant ones. The textual prompt can be denoted as $t_i = \{v_1, v_2, \ldots, c_i, \ldots, v_M\}$, where $c_i$ signifies the word embedding of the $i$-th image class name, $\{v_m\}_{m=1}^{M}$ is a collection of learnable context vectors, and $M$ symbolizes the length of context words. Importantly, the position of $c_i$ can be anywhere in $(1, M+1)$. In the training process, the textual prompt is fed into a text encoder $g(\cdot)$, yielding the textual feature $f_{text} = g(t_i)$. Similarly, the visual feature $f = h(x)$ is computed by the visual encoder $h(\cdot)$. The final prediction probability is computed from the similarity matching score:
$$p(y = i \mid x) = \frac{\exp(\mathrm{sim}(f, g(t_i)) / \Gamma)}{\sum_{j} \exp(\mathrm{sim}(f, g(t_j)) / \Gamma)},$$
where $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity function and $\Gamma \in \mathbb{R}$ is a temperature factor controlling the overall distribution of the similarity between the visual feature embedding and the textual feature.
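The matching score above amounts to a temperature-scaled softmax over cosine similarities; a plain-Python sketch (the temperature value is an arbitrary assumption):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def class_probs(img_feat, text_feats, temp=0.07):
    """Softmax over cosine similarities between the image feature and
    each class's text-prompt feature, scaled by a temperature."""
    logits = [cosine(img_feat, t) / temp for t in text_feats]
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

probs = class_probs([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(probs)  # nearly all mass on class 0, whose prompt matches the image
```

A lower temperature sharpens the distribution, concentrating probability on the best-matching class prompt.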
Different from previous work, which obtains only a single modality to represent a local client, our study introduces TPFL, which resorts to two different modalities, vision and text, to enhance the generalization capability and resilience of the global model. More specifically, instead of relying on a constant input visual feature $x$, we incorporate an additional trainable visual prompt $v$ as an extended representation of the local data characteristics and compute $x + v$ to obtain the final input feature. As illustrated in Figure 1, three templates of the visual prompt are employed: the padding, random patch, and fixed patch patterns, each contributing to varying model performance. After acquiring both the textual and visual prompts, each local client transmits them to the central server. The server then aggregates the received prompts, weighted by the number of their training samples:
$$p_{global} = \sum_{k=1}^{K} \frac{n_k}{n} p_k,$$
where $p_k$ denotes the textual or visual prompt uploaded by client $k$. However, the naive aggregation of the uploaded prompt weights may invite certain problems. To begin with, in practical scenarios, the data distribution across multiple clients may not be independently and identically distributed (IID); that is, different clients can host data with significantly divergent statistical characteristics. Direct averaging struggles to effectively amalgamate local models originating from these devices owing to this non-IID data distribution, and as a result, the performance of the global model suffers. Moreover, data volume can vary significantly across devices, with certain scenarios providing only a sparse dataset (only a few data points are available). Conventional FL aggregation might lack the robustness required to manage these few-shot learning scenarios, complicating the process of discerning meaningful patterns from such limited data.
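For the padding template, a minimal sketch of adding a trainable border prompt to an image (2-D lists stand in for tensors; the shapes and the helper name are our assumptions):

```python
def apply_padding_prompt(image, prompt, pad=1):
    """Compute x + v where the prompt v acts only on a `pad`-wide border,
    leaving interior pixels untouched (padding template)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(h):
        for j in range(w):
            if i < pad or i >= h - pad or j < pad or j >= w - pad:
                out[i][j] = image[i][j] + prompt[i][j]
    return out

img = [[0.0] * 4 for _ in range(4)]
v = [[1.0] * 4 for _ in range(4)]
out = apply_padding_prompt(img, v)
print(out[0][0], out[1][1])  # border pixel changed, interior untouched
```

The random-patch and fixed-patch templates differ only in the mask: they add the prompt over a patch region (random or fixed position) instead of the border.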

Augmented TPFL (ATPFL)
To address these aforementioned challenges, we propose incorporating a contrastive learning strategy, thus fortifying the robustness of FL. Specifically, we utilize the InfoNCE loss function [22] to encourage the output distributions of both the local visual and textual prompts to align closely with the output distribution of the global prompts received from the server:
$$\ell_{con}(z^{t+1}, z^{t}, z_g) = -\log \frac{\exp(\mathrm{sim}(z^{t+1}, z_g)/\Gamma)}{\exp(\mathrm{sim}(z^{t+1}, z_g)/\Gamma) + \exp(\mathrm{sim}(z^{t+1}, z^{t})/\Gamma)},$$
where $z^{t+1}$ and $z^{t}$ refer to the embeddings of the local textual or visual prompts at steps $t+1$ and $t$, respectively, and $z_g$ represents the global textual or visual prompt. After attaining the contrastive loss, the overall loss of the trainable prompts can be calculated as:
$$\mathcal{L} = \mathcal{L}_{CLIP} + \mu \left( \ell_{con\_text}(z^{t+1}_{text}, z^{t}_{text}, z_{g,text}) + \ell_{con\_vis}(z^{t+1}_{vis}, z^{t}_{vis}, z_{g,vis}) \right),$$
where $\ell_{con}$ denotes the contrastive loss formulated above, the subscripts $text$ and $vis$ denote the embeddings of the local client's (or global) textual and visual prompts, respectively, and $\mu$ is a tuning factor controlling the influence of the textual and visual augmented losses. The overall training process of ATPFL is shown in Algorithm 1.
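The contrastive objective can be sketched as follows, treating the global prompt embedding as the positive and the previous-step local embedding as the negative (plain Python; the temperature value is an assumption):

```python
import math

def _cos(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def info_nce(z_new, z_global, z_prev, tau=0.5):
    """InfoNCE with one positive (global prompt) and one negative
    (previous-step local prompt)."""
    pos = math.exp(_cos(z_new, z_global) / tau)
    neg = math.exp(_cos(z_new, z_prev) / tau)
    return -math.log(pos / (pos + neg))

aligned = info_nce([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
drifted = info_nce([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(aligned < drifted)  # pulling toward the global prompt lowers the loss
```

Minimizing this loss therefore pushes each updated local prompt toward the global prompt and away from its own previous state, which is the mechanism that counteracts client drift under non-IID data.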

EVALUATION
In this section, we perform extensive evaluations to verify the effectiveness of our proposed TPFL and ATPFL.

Evaluation setup.
Few-shot Dataset and Data Partition. Extending PromptFL [7], which evaluates its model on only four datasets, we verify ATPFL on seven datasets: Caltech-101 [4], Oxford-Pets, Stanford Cars, Oxford Flowers-102, EuroSAT, UCF-101, and Describable Textures (DTD). Furthermore, to create the few-shot dataset, each client receives $n$ samples for each class. For the majority of our evaluations, we choose $n = 4$, meaning each client has a four-shot dataset; we also investigate the effect of the shot size in the ablation section. For the non-IID setting in FL, we adopt the label-skewing method to emulate heterogeneous local clients.
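A minimal sketch of a label-skewed few-shot partition (the sampling scheme and function name here are illustrative assumptions, mirroring the setup only in spirit):

```python
import random

def few_shot_partition(labels, classes_per_client, shots, num_clients, seed=0):
    """Label-skewed few-shot split: each client gets `shots` samples for
    each of a random subset of classes."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    clients = []
    for _ in range(num_clients):
        chosen = rng.sample(sorted(by_class), classes_per_client)
        client_idx = []
        for c in chosen:
            client_idx.extend(rng.sample(by_class[c], shots))
        clients.append(client_idx)
    return clients

labels = [0] * 10 + [1] * 10 + [2] * 10
parts = few_shot_partition(labels, classes_per_client=2, shots=4, num_clients=3)
print([len(p) for p in parts])  # each client holds 2 classes x 4 shots
```

Restricting each client to a subset of classes is what makes the local label distributions diverge, emulating the non-IID regime studied here.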
Models. Following existing work, we choose ResNet-50 (RN50) and the Vision Transformer (ViT) as backbones for the visual encoder, and the Transformer model as the textual encoder.
Baselines. In our evaluation, we compare ATPFL with the following baselines: (1) Local training, where all clients train their own models offline and no model transmission occurs; (2) PromptFL, which uses only the textual modality; (3) TPFL, which employs both the textual and visual modalities but no InfoNCE loss.
Implementation details. To prevent the influence of randomness and ensure the fairness of our evaluations, each experimental setting is run with three random seeds, and the results are averaged. We use the Adam optimizer with learning rate $\eta = 10^{-3}$ and a cosine scheduler over 20 epochs. For the implementation environment, we run our code on Python 3.11.0 and PyTorch 1.13.0, using four NVIDIA RTX A6000 GPUs.
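The cosine schedule used here can be sketched as a standard cosine anneal from the base learning rate down to zero (a sketch under assumed conventions, not the exact scheduler implementation):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3):
    """Cosine-annealed learning rate: base_lr at step 0, 0 at the end."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

# Learning rate at the start, midpoint, and end of a 20-epoch schedule.
print(cosine_lr(0, 20), cosine_lr(10, 20), cosine_lr(20, 20))
```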

Main results
In this section, the experimental outcomes are assessed. Table 1 and Table 2 present the average test accuracy for the ViT and RN50 backbones across seven diverse datasets in a non-IID setting. Both PromptFL and ATPFL consistently surpass local training, with margins extending up to 18.5%. This is intuitive, as local training or full-model fine-tuning may lead to catastrophic forgetting, an issue exacerbated by client data heterogeneity. These compounded factors significantly impede fine-tuning performance in the federated learning context, necessitating the exploration of PromptFL and ATPFL. For ViT, TPFL excels over PromptFL in six of the seven datasets, with margins spanning 0.1%-6.2%, except for UCF-101, where TPFL lags by 0.2%. When factoring in the standard error of test accuracy across multiple experiments, TPFL's advancements over previous methods are noticeable. Despite TPFL's success, limitations persist, leading to the proposal of ATPFL to better address these issues. Our ATPFL model outperforms the baseline by 0.4%-1.1% across all datasets, illustrating ATPFL's potential to mitigate data heterogeneity in prompt federated learning scenarios.
In the ResNet-50 tests, TPFL outperforms local training and PromptFL in six of the seven datasets, except for the EuroSAT dataset. Our ATPFL continues to surpass TPFL in four of the seven datasets, except for Oxford-Pets and DTD, where ATPFL trails TPFL by 0.2% and 0.5%, respectively. This could be due to the architectural disparities between ViT and ResNet-50. In conclusion, our proposed ATPFL, leveraging contrastive learning, offers superior performance in handling data heterogeneity. These results corroborate our prior discussions in the methodology section.

Ablation study
In this section, we examine various factors influencing our model's performance, including the application of the InfoNCE loss, the shot size, and the client quantity.
InfoNCE loss. First, we investigate the impact of the InfoNCE loss (i.e., the difference between TPFL and ATPFL). As illustrated in Table 1 and Table 2, ATPFL shows a clear advantage over TPFL: in 11 out of 14 experiments, ATPFL outperforms TPFL by a margin of up to 1.1%.
Shot size. Second, we explore the impact of shot size. Figure 2 demonstrates a monotonic increase in the F1-score as the number of shots rises, with the F1-score in a 16-shot scenario exceeding that of a 1-shot scenario by 2.3%. Moreover, despite the absence of a consistent increase, accuracy still trends upward with an increasing number of shots. Even in a 1-shot scenario, ATPFL exhibits substantial performance (90.2% accuracy and 87.8% F1-score), but greater shot numbers offer additional performance benefits due to the increased feature information provided at each learning round.
Client volume. Lastly, the ablation study examines the effect of the number of clients. Figure 3 reveals a decline in both accuracy and the F1-score as the client number rises, with a tenfold increase in clients (from 10 to 100) decreasing accuracy and the F1-score by 2.1% and 3.2%, respectively. However, even with a larger number of clients, ATPFL maintains reasonable performance, achieving 86.1% accuracy in a 100-client scenario.

CONCLUSION
In this paper, we propose an FL framework, TPFL, which is the first to consider both visual and textual information in prompt tuning to augment the global model in FL. However, the performance improvement offered by TPFL is limited by data heterogeneity.

Figure 1 :
Figure 1: The pipeline of ATPFL with contrastive learning. In local training, the current prompt, previous prompt, and received global prompt are passed to each modality's encoder. After encoding, two types of contrastive learning are performed: the text contrastive loss and the visual contrastive loss use the feature extracted from the global prompt as the positive contrast and the feature extracted from the previous prompt as the negative contrast. The CLIP contrastive loss is computed between the text prompt feature and the visual prompt feature.

Figure 3 :
Figure 3: This figure illustrates how the client number affects model accuracy and F1-score.

Algorithm 1 notation: the $N$ clients are indexed by $k \in \{1, 2, \ldots, N\}$; $E_g$ and $E_l$ are the numbers of global and local epochs, respectively, and $\eta$ is the learning rate.
This methodology fosters a better comprehension of the global model by the local client, consequently mitigating the adverse effects of non-IID data. The key insight fueling this strategy is that contrastive learning facilitates the distinction between similar and dissimilar data points. It mitigates the discrepancies among local models caused by non-IID data through the learning of invariant features, making local models more amenable to aggregation at the global level. The contrastive (InfoNCE) loss functions for both textual and visual prompts are formulated in the methodology section.

Table 1 :
Test Accuracy (%) Results for ViT model on 7 datasets with 5 different seeds.
To address this issue, we developed ATPFL, which facilitates local clients' acquisition of global knowledge through contrastive learning; our evaluations show that ATPFL consistently outperforms all baselines.