Architecture-Based FedAvg for Vertical Federated Learning

Federated Learning (FL) has emerged as a promising solution to address privacy concerns by collaboratively training Deep Learning (DL) models across distributed parties. This work proposes an architecture-based aggregation strategy in Vertical FL, where parties hold data with different attributes but shared instances. Our approach leverages the identical architectural parts, i.e. neural network layers, of different models to selectively aggregate weights, which is particularly relevant when collaborating with institutions holding different types of datasets, i.e., image, text, or tabular datasets. In a scenario where two entities train DL models, such as a Convolutional Neural Network (CNN) and a Multi-Layer Perceptron (MLP), our strategy computes the average only for architecturally identical segments. This preserves data-specific features learned from demographic and clinical data. We tested our approach on two clinical datasets, i.e., the COVID-CXR dataset and the ADNI study. Results show that our method achieves comparable results with the centralized scenario, in which all the data are collected in a single data lake, and benefits from FL generalizability. In particular, compared to the non-federated models, our proposed proof-of-concept model exhibits a slight performance loss on the COVID-CXR dataset (less than 8%), but outperforms ADNI models by up to 12%. Moreover, communication costs between training rounds are minimized by exchanging only the dense layer parameters.


INTRODUCTION
Nowadays, Artificial Intelligence (AI) is increasingly being utilized in various domains. AI techniques typically require training on data available in a single data lake. However, data are often distributed among different institutions, and aggregating them is not always feasible due to privacy and security concerns. Because the data are distributed, models must be trained and deployed locally, which raises specific concerns about the heterogeneity of hardware capacities and connections and, in our case, data privacy. In this sense, FL has emerged as a promising solution for collaborative machine learning across distributed parties while preserving data privacy, and it represents an interesting challenge for the Computing Continuum [18]. In particular, FL leverages the computing continuum to enable collaborative, distributed model training across a range of connected devices, ensuring seamless and continuous intelligent learning. The key innovation brought by FL is to benefit from data that are usually only used locally. FL is typically divided into two categories:
• Horizontal Federated Learning (HFL) [13], where the federation participants (clients) hold data with the same features or attributes but different instances or examples.
• Vertical Federated Learning (VFL) [12], where the different clients hold data with the same instances but different features.
HFL and VFL adopt very different training techniques. In HFL, the best-known and most widely used aggregation algorithm is FederatedAveraging (FedAvg), where each client of the federation trains a local model and exchanges its parameters with a central server, which aggregates them and sends the resulting global model back to each client. Besides FedAvg, other FL algorithms have been proposed, such as FedCurv [20] and SCAFFOLD [7], in order to deal with non-IID data, which has been shown to represent a challenge for FL systems [2]. In VFL, on the other hand, both data and models are kept local while intermediate results are exchanged among clients. These intermediate messages consist of learned representations of local data and their gradients [22,24].
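The FedAvg aggregation step described above can be illustrated with a minimal sketch. It assumes that each client's parameters are stored as flat lists keyed by layer name and that the server weights each client by its number of local samples; the names `fedavg`, `Weights`, and the layer keys are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Assumed representation: layer name -> flattened parameter values.
Weights = Dict[str, List[float]]


def fedavg(client_weights: List[Weights], num_samples: List[int]) -> Weights:
    """Sample-weighted average of the clients' parameters, layer by layer."""
    if not client_weights or len(client_weights) != len(num_samples):
        raise ValueError("need one sample count per client")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    global_weights: Weights = {}
    for name in client_weights[0]:
        size = len(client_weights[0][name])
        averaged = [0.0] * size
        for weights, n in zip(client_weights, num_samples):
            if len(weights[name]) != size:
                raise ValueError(f"shape mismatch for layer {name!r}")
            for i, value in enumerate(weights[name]):
                averaged[i] += value * n / total
        global_weights[name] = averaged
    return global_weights


# Two toy clients; the second holds three times as many samples.
clients = [
    {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]},
    {"fc.weight": [3.0, 6.0], "fc.bias": [4.0]},
]
global_model = fedavg(clients, num_samples=[1, 3])
print(global_model)
```

With equal sample counts the weighted average reduces to a plain mean, which is the case when all clients hold the same instances.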
Our work aims to explore an innovative aggregation strategy for VFL that leverages model architecture as a key criterion for weight aggregation. We propose to average the network parameters only for the parts of the models that share the same architectural structure, thus preserving the specificity of features learned from demographic and clinical data. In essence, this amounts to applying the HFL principles only to the shared, architecturally identical layers.
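One possible way to write this rule down is sketched below. The notation is introduced here purely for illustration and is an assumption: $S$ denotes the set of architecturally identical layers, $w_{k,l}^{(t)}$ the parameters of layer $l$ held by client $k$ after local training in round $t$, $K$ the number of clients, and $\alpha_j$ the aggregation weight of client $j$.

```latex
\[
w_{k,l}^{(t+1)} =
\begin{cases}
\displaystyle\sum_{j=1}^{K} \alpha_j \, w_{j,l}^{(t)}, & l \in S, \\[2ex]
w_{k,l}^{(t)}, & l \notin S,
\end{cases}
\qquad \alpha_j \ge 0, \quad \sum_{j=1}^{K} \alpha_j = 1 .
\]
```

With $K = 2$ and $\alpha_1 = \alpha_2 = 1/2$, the first case is the plain average of the two clients' shared layers, while all other layers remain local.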
This strategy offers several potential advantages. Firstly, it enables collaboration among parties that do not hold the same type of data, e.g., one client holding images and another holding tabular data, increasing the model's generalizability. Secondly, it reduces communication costs among parties by lowering the amount of data to be shared during the aggregation process. Indeed, only a small subset of the layers of the local models is transferred during training, while typical HFL systems exchange all the layers.
VFL is often used in scenarios where different organizations or entities want to collaborate to train a model on data with different attributes without violating the privacy of the data itself. A typical example is the medical domain. In healthcare, patient data can be highly heterogeneous. Different institutions may collect patient data with different attributes, such as age, gender, medical history, and test results. For this reason, in our experiments, we employed two medical datasets containing both images and clinical parameters, namely the COVID-CXR dataset [21] and the ADNI study [23].
The main contributions of our work are:
• We extend the principles of HFL to VFL by applying FedAvg only to the parts of the models that share the same architectural structure.
• We perform extensive experiments to provide a systematic study of FedAvg in a VFL setting, analyzing the performance from a learning and computational point of view.
• As a result of experimentation, we show that our proposed framework aligns with the centralized results achieved by relying only on a single input data type and allows for exploiting FL generalizability even when institutions have different input sources.
• We release the code to replicate our experiments.
The rest of the paper is organized as follows. Section 2 reviews the most recent related works. Section 3 presents our proposed method. Section 4 describes the experimental settings and discusses the results. In Section 5, conclusions are drawn.

RELATED WORKS
VFL has gained attention in the last few years, tackling the problem of decentralization both for examples and for the feature space. The goal is to create an aggregated architecture that is effective on different data without being able to consider all of the examples together at training time.
This line of research has proven very effective in cases where privacy concerns are particularly challenging and data can be very heterogeneous. For example, the healthcare domain can benefit considerably from this setting.
Initially, however, privacy concerns were not taken into account, which enabled the development of centralized multi-modal architectures for data with heterogeneous feature spaces (for example, images and tabular data). In particular, the need for multi-modal architectures in the healthcare domain has brought many advancements in designing methods to accomplish specific tasks. In a recent work [9], the authors used the representations extracted by a CNN for training a second linear model that fits tabular data. This method has been outperformed by single-architecture methods, which connect the latent representations of image and tabular information before the last layer, thereby overcoming the problem of redundant representations [5,8,16,17]. However, they exploit only linear relations between image representations and tabular data. This limitation was overcome using an MLP, which takes into consideration the non-linearities between different data involving the same patient, as done in some recent works [4,11,14].
All these approaches present architectures to accomplish a certain medical task with different types of data without considering possible privacy restrictions. Our approach aims at providing a framework where data are owned by institutions sensitive to data privacy, which makes the task harder to complete. Indeed, few works have addressed this problem in the FL setting. Some FL works in the Internet of Things (IoT) and healthcare sectors are present in the literature. Most of these works design multi-modal architectures for solving specific tasks [1,19,25]. Moreover, the increasing proliferation of IoT applications requires a seamless interconnection of the resources of edge and cloud devices, leading to the ecosystem of the Computing Continuum. FL is a Computing Continuum strategy for Machine Learning applications relying on data coming from IoT and edge devices [15].
Another recent work [3] in the medical domain proposes MERGE, a multi-input NN leveraging multiple input sources, i.e., images and tabular data. The basic assumption of MERGE is that each federation participant has both data types locally available and accessible. However, in a real federation, each client can hold various types of datasets, such as images, tabular features, or text reports. Our proposed method aims to overcome this FL limitation by aggregating only the identical architectural parts of the DL models. In addition, we focus on the interpretation of the aggregation, shedding light on the features extracted by different architectures on different data and on the combination of the two.

METHOD
In this section, we present our approach for VFL with Architecture-Based Aggregation. Our method aims to address the unique challenges of collaborating on machine learning tasks when data sources possess heterogeneous attributes that refer to the same instance (e.g., one institution owning the scans of a patient, while another institution owns the patient's clinical parameters).
We consider a typical VFL scenario where two distinct institutions hold data related to the same set of instances but with different attributes. In particular, in our experiments, the first institution trains a CNN on image data, while the second institution trains an MLP on tabular data. We selectively aggregate the weights of these models based on their architectural similarity. By design, so that they can be aggregated, the dense layers of the CNN adopted by the first institution are identical to the MLP of the second institution. Figure 1 shows how our proposed method works.
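The selective aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: parameters are stored as flat lists keyed by layer name, the layer names (`conv1`, `dense1`, `dense2`) are hypothetical, and an unweighted mean is used because both institutions hold the same instances.

```python
from typing import Dict, List, Tuple

# Assumed representation: layer name -> flattened parameter values.
Weights = Dict[str, List[float]]


def aggregate_shared(cnn: Weights, mlp: Weights,
                     shared: List[str]) -> Tuple[Weights, Weights]:
    """Average only the layers named in `shared`; keep all other layers local."""
    new_cnn = {name: list(values) for name, values in cnn.items()}
    new_mlp = {name: list(values) for name, values in mlp.items()}
    for name in shared:
        if len(cnn[name]) != len(mlp[name]):
            raise ValueError(f"layer {name!r} is not architecturally identical")
        mean = [(a + b) / 2.0 for a, b in zip(cnn[name], mlp[name])]
        new_cnn[name] = mean
        new_mlp[name] = list(mean)
    return new_cnn, new_mlp


# Toy models: the CNN has an extra convolutional block that is never exchanged.
cnn_weights = {"conv1": [0.5, -0.5], "dense1": [1.0, 3.0], "dense2": [2.0]}
mlp_weights = {"dense1": [3.0, 5.0], "dense2": [4.0]}
shared_layers = [name for name in mlp_weights if name in cnn_weights]

cnn_after, mlp_after = aggregate_shared(cnn_weights, mlp_weights, shared_layers)
print(cnn_after)
print(mlp_after)
```

In this sketch the convolutional parameters never leave the first institution, which is what keeps the image-specific feature extractor local.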
We introduce several aggregation strategies to explore the benefits of selective aggregation:
• Full: in this scenario, we perform weight aggregation across all the dense layers of the models.
• Half: in this scenario, we perform weight aggregation across half of the dense layers of the models. Depending on the layers we choose to combine, we can have different cases:
  - Aggregating the first half: we aggregate the weights of the first layers. This strategy focuses on capturing the models' initial feature extraction and processing stages.
  - Aggregating the second half: we aggregate the weights of the last layers. This strategy focuses on capturing the last processing stages and the final output decision-making.
In our experiments, we tested only the second case, i.e., we performed weight aggregation of only the last model layers. Within the Half aggregation approach, it is possible to further distinguish several subcases. Specifically, when averaging the last layers, the question arises of how to set the weights of the first layers. We investigated three different possibilities:
• Random initialization: the first layers of the models are randomly initialized. This approach allows for exploring the relationship between random weights and the features extracted in the initial layers. Moreover, it allows studying the influence of, and the interaction between, random weights and subsequent layers.
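The Half aggregation of the last layers, together with the three possibilities investigated for the first layers (random initialization, alignment with the CNN dense layers, alignment with the MLP dense layers), can be sketched as below. The function name, the mode labels, the layer names, and the uniform range used for random initialization are all assumptions made for illustration; the text does not specify the initialization scheme.

```python
import random
from typing import Dict, List

# Assumed representation: dense layer name -> flattened parameter values.
Weights = Dict[str, List[float]]


def half_aggregate(cnn_dense: Weights, mlp_dense: Weights,
                   first: List[str], last: List[str],
                   mode: str, seed: int = 0) -> Weights:
    """Average the `last` layers and fill the `first` layers according to `mode`."""
    rng = random.Random(seed)
    merged: Weights = {}
    for name in last:
        # The last layers are always averaged across the two models.
        merged[name] = [(a + b) / 2.0
                        for a, b in zip(cnn_dense[name], mlp_dense[name])]
    for name in first:
        if mode == "random":
            # Assumed scheme: small uniform values; the paper does not state one.
            merged[name] = [rng.uniform(-0.1, 0.1) for _ in cnn_dense[name]]
        elif mode == "images":
            merged[name] = list(cnn_dense[name])   # aligned with the CNN
        elif mode == "tabular":
            merged[name] = list(mlp_dense[name])   # aligned with the MLP
        else:
            raise ValueError(f"unknown mode {mode!r}")
    return merged


cnn_dense = {"dense1": [1.0, 2.0], "dense2": [0.5], "dense3": [4.0]}
mlp_dense = {"dense1": [3.0, 6.0], "dense2": [1.5], "dense3": [8.0]}
for mode in ("random", "images", "tabular"):
    result = half_aggregate(cnn_dense, mlp_dense,
                            first=["dense1", "dense2"], last=["dense3"], mode=mode)
    print(mode, result)
```

Splitting three dense layers into the first two and the last one mirrors the "first two layers" mentioned for the random variant in the results; other splits are possible.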

EXPERIMENTS AND RESULTS
Federation setup: the entities of our simulated federation are a server and two clients. For each federation round, each client executes one training, one validation, and one testing stage. Then, aggregation is performed according to one of the previously discussed strategies, and the resulting models are validated on the test data. All experiments are executed on a simulated federation deployed on a dedicated server with an Intel Xeon CPU (8 cores per CPU) and one Tesla T4 GPU. We release the code required to reproduce our experiments at the following link: OMITTED FOR ANONYMOUS SUBMISSION.
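The round structure of the simulated federation can be sketched as a simple loop. This is only an illustration under assumptions: the local training step is replaced by a placeholder function (`shift`), the validation and testing stages are omitted, and all names are hypothetical.

```python
from typing import Callable, Dict, List

# Assumed representation: layer name -> flattened parameter values.
Weights = Dict[str, List[float]]
TrainFn = Callable[[Weights], Weights]


def run_federation(models: List[Weights], train_fns: List[TrainFn],
                   shared: List[str], rounds: int) -> List[Weights]:
    """Alternate local training with server-side averaging of the shared layers."""
    for _ in range(rounds):
        # Client stage: each client updates its own model on its own data.
        models = [train(model) for train, model in zip(train_fns, models)]
        # Server stage: average the shared layers and send them back to all clients.
        for name in shared:
            columns = zip(*(model[name] for model in models))
            mean = [sum(column) / len(models) for column in columns]
            for model in models:
                model[name] = list(mean)
    return models


def shift(delta: float) -> TrainFn:
    """Placeholder for local training: adds `delta` to every parameter."""
    return lambda weights: {name: [v + delta for v in values]
                            for name, values in weights.items()}


cnn = {"conv": [0.0], "dense": [0.0]}
mlp = {"dense": [0.0]}
final_cnn, final_mlp = run_federation([cnn, mlp], [shift(1.0), shift(3.0)],
                                      shared=["dense"], rounds=2)
print(final_cnn, final_mlp)
```

After each round both clients hold the same values for the shared layer, while the convolutional parameters evolve only through local updates.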

Datasets
We tested our aggregation strategy on two tasks:
• Prognosis of COVID-19 disease from chest X-ray (CXR) data, using the COVID-CXR dataset [21].
• Detection of Alzheimer's disease from neuroimaging data, using the ADNI study [23].
More details about the datasets are provided below.
COVID-CXR. As shown in Fig. 2, this dataset exhibits a clear feature distribution skew due to the different data collection procedures of the six hospitals. This leads to the well-known problem of non-IID data in FL [2,10]. Each patient in the dataset is provided with a CXR and some clinical parameters (namely age, sex, positivity at admission, temperature, days of fever, cough, difficulty in breathing, WBC, RBC, CRP, glucose, LDH, INR, PaO2, PaCO2, pH, high blood pressure, diabetes, dementia, BPCO, cancer, CKD and respiratory failure). The statistics of the dataset are summarized in Table 1. As can be noted from Table 1, this dataset also exhibits quantity skew, a form of non-IIDness that, according to a recent work [2], does not represent a difficult challenge for FL algorithms adopting a weighted averaging of the parameters.
ADNI. The Alzheimer's Disease Neuroimaging Initiative (ADNI) is an ongoing multicenter study, representing the main benchmark dataset for Alzheimer's Disease (AD). It comprises a set of clinical and neuroimaging (3D T1-weighted MRI scans) data collected over the years in different cohorts (ADNI1, ADNI2, and ADNI3). Each cohort contains patients from two classes: control subjects (CN) showing no signs of depression, mild cognitive impairment, or dementia, and AD participants. Like the COVID-CXR dataset, the ADNI database includes clinical indicators. For our study, we considered age, gender, and APOE4 (the ε4 allele of Apolipoprotein E, the most important known risk factor for AD). APOE4 can assume three different values, 0, 1, or 2, according to the number of ε4 alleles of the APOE gene. Patients with missing values were removed. The statistics of the dataset are reported in Table 2.
The ADNI preprocessing consisted of the following steps: reorientation, bias-field correction, non-linear registration to the MNI152-2mm standard space with dimensions of 91x109x91, and normalization.

Figure 3: Samples coming from each of the three ADNI datasets (panels: ADNI1, ADNI2, ADNI3).
A sample from each of the three ADNI cohorts is shown in Fig. 3.
Models. We have reproduced the experiments of MERGE [3], a multi-input neural network for FL leveraging both images and tabular data, and tested on the COVID-CXR and ADNI datasets, by adopting the same models and hyperparameters. For the first institution of our federation, the CNN adopted is a modified version of a ResNet-18 [6] (a 2D version for the COVID-CXR dataset and a 3D version for the ADNI study), where the dense layer is, by design, exactly the same as the MLP trained on the data of the second institution. The MLP comprises four layers: an input layer, two hidden layers (containing 64 and 32 neurons, respectively), and an output layer. The activation function used is the ReLU function. Models were trained by minimizing the binary cross-entropy loss function using the Adam optimizer with learning rate 1e-4 and OneCycleLR as the scheduler. The local batch size was set to 8. For the task of prognosis of COVID-19, the models were trained for 100 rounds, while for the detection of AD, they were trained for 200 rounds.

Results and Discussion
As a baseline, we considered the performance of MERGE. Results are shown in Table 3.
Table 3: Accuracy in the centralized setting (all data are gathered in a single data lake). Results (mean ± standard deviation) are obtained with five-fold cross-validation. The best-performing model is highlighted for each experimental setting.

Input          COVID          ADNI
Only images    0.731 ± 0.06   0.777 ± 0.01
Only tabular   0.740 ± 0.03   0.638 ± 0.02
Multi-input    0.733 ± 0.01   0.811 ± 0.03

We tested our aggregation-based method on two tasks: the prognosis of COVID-19 disease from the COVID-CXR dataset and the AD detection from the ADNI database. The simulated federation encompasses two clients: the first trains a CNN on image data, while the second trains an MLP on tabular data. The CNN is a modified version of ResNet-18 where the dense layers are identical to the MLP of the second client. Results are reported in Table 4.
Results show that our method suffers only a small loss in performance with respect to the best MERGE models (for COVID-19, the MLP leveraging only tabular data; for ADNI, the multi-input NN), while allowing the benefits of FL, such as generalizability, to be exploited even when organizations hold different types of input data. Surprisingly, the Half strategies outperform the FULL technique in all three cases considered: Half with random initialization of the first two layers (HALF RANDOM), alignment with CNN dense layers (HALF IMAGES), and alignment with MLP dense layers (HALF TABULAR). Although counterintuitive, this can happen for several reasons:
• Extracted features. If the features extracted from images are completely different from the tabular features, then HALF could better combine and process these features during the decision-making stages. This seems particularly true for the COVID-CXR dataset. Indeed, as shown in Table 3, when combining image and tabular features in a multi-input NN, performance is lower than when using only the clinical parameters. Conversely, in the ADNI study, a multi-input NN benefits from combining both inputs.
• Model complexity. The FULL aggregation strategy could lead to a complex model trying to learn from both image and tabular features. This can lead to the well-known problem of overfitting, especially if the two input features are very different.
The only exception is in the COVID-19 dataset, where the FULL aggregation achieves better results than HALF RANDOM.
When dealing with images, our method outperforms the baseline for the ADNI case but not for the COVID-CXR dataset. However, it can be noted that the accuracy is the same for all the aggregation strategies. This is probably due to the highly skewed datasets. In particular, for the COVID-CXR dataset, our method seems to overfit data coming from hospital F, while for the ADNI study, our method seems to overfit the first two ADNI cohorts. Finally, our method allows for lowering the amount of data to be exchanged among clients without hurting the model's performance, thus decreasing the overall communication time and requiring fewer computational resources. The model's statistics are summarized in Table 5. Despite a small loss in the model's performance, our method allows for exploiting the FL benefits by aggregating only a subset of architectural parts, thus decreasing the quantity of exchanged data (parameters). In particular, in our experiments, we aggregated only the dense layers of a CNN and an MLP without exchanging the convolutional parameters, which account for the majority of the memory usage. However, our method, if generalized to include the aggregation of convolutional parameters, would increasingly offer cost-effective solutions in terms of communication.
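To give a rough sense of the communication savings, the following sketch counts the parameters of a dense head and compares them with the size of a convolutional backbone. Only the hidden sizes (64 and 32) come from the text; the input width of 512, the single output unit, and the backbone size of about 11 million parameters (the order of magnitude of a standard 2D ResNet-18 backbone) are assumptions, and the modified networks used in the experiments may differ. The figures are illustrative and are not the values of Table 5.

```python
from typing import List


def dense_param_count(layer_sizes: List[int]) -> int:
    """Number of weights and biases in a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def exchanged_fraction(dense_params: int, backbone_params: int) -> float:
    """Share of the full model that is exchanged when only dense layers are sent."""
    return dense_params / (dense_params + backbone_params)


# Hidden sizes 64 and 32 follow the text; 512 inputs and 1 output are assumed.
sizes = [512, 64, 32, 1]
dense = dense_param_count(sizes)

# Assumed order of magnitude for a ResNet-18 convolutional backbone.
backbone = 11_000_000

fraction = exchanged_fraction(dense, backbone)
print(f"dense parameters: {dense}")
print(f"fraction of the model exchanged per round: {fraction:.4%}")
```

Under these assumptions, well under one percent of the CNN's parameters would be exchanged in each round.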

CONCLUSIONS
In this work, we proposed an architecture-based method for Vertical FL. We leveraged the architectural similarity of different types of NNs to extend the HFL strategies to the VFL scenario. We carried out a thorough comparison of our proposed aggregation strategies. Results spanning two medical datasets, i.e., COVID-CXR and ADNI, show that our method is effective for integrating FedAvg into VFL when clients hold different types of input sources, with only limited performance loss. Moreover, by sharing only a subset of NN layers, our method allows for reducing the communication costs of typical FL systems. In future work, we plan to pursue the following directions:
• Increasing the federation's participants. In this proof-of-concept VFL approach, we considered only two clients. However, in a real-world scenario, more organizations are expected to contribute to training a federated model. We therefore aim to increase the number of federation participants. This can be achieved by (1) collecting new datasets, either belonging to the healthcare domain or to other scenarios, or (2) sharding the datasets considered in this paper, COVID-CXR and ADNI, into more parts. For example, the COVID-CXR dataset lends itself to a natural split into six shards, as its data come from six different hospitals.
• Testing state-of-the-art models. A possible direction is to evaluate how our method performs when trained with SOTA models.
• Proposing new aggregation techniques. Our method presents the same typical limitation as the HFL setting, i.e., layers must be identical to be aggregated. Further investigation of aggregation strategies is required. For example, a possible way to merge different architectures could be the concatenation of convolutions and dense layers trained on different datasets.
• Testing other types of input data. We tested our strategy on images and tabular data. Extending this framework to consider alternative types of data and models, such as text classification datasets and RNNs, is a possible direction.

Figure 1: A generic representation of our architecture-based FedAvg for VFL.

• Alignment with CNN dense layers: the weights of the first layers are set to the corresponding values in the CNN. This strategy allows for emphasizing the feature extraction and initial processing stages of the CNN.
• Alignment with MLP dense layers: the weights of the first layers are set to the corresponding values in the MLP. This approach allows for emphasizing the feature extraction and initial processing stages of the MLP.

Figure 2: One sample coming from each of the six hospitals.

Table 2: Main demographic and clinical data for the three ADNI study cohorts. Age is reported as mean ± standard deviation values, gender as the number of males/females, while APOE4 refers to the number of ε4 alleles (0, 1, or 2, respectively).

Table 4: Accuracy in the federated setting (one client trains a CNN on images, while the other one trains an MLP on tabular data). Results (mean ± standard deviation) are averaged over five runs. The best-performing model is highlighted for each experimental setting.

Table 5: Statistics of the models.