Detecting Poisoning Attacks on Federated Learning Using Gradient-Weighted Class Activation Mapping

This paper proposes a new defense mechanism, GCAMA, against model poisoning attacks on federated learning (FL). GCAMA integrates Gradient-weighted Class Activation Mapping (GradCAM) and an autoencoder to offer a substantially more powerful detection capability than existing Euclidean distance-based approaches. In particular, GCAMA generates a heat map for each uploaded local model update, transforming the update into a lower-dimensional, visual representation; this accentuates the hidden features of the heat maps and increases the success rate of identifying anomalous heat maps and, hence, malicious local models. We test the ResNet-18 and MobileNetV3-Large deep learning models on the CIFAR-10 and GTSRB datasets under Non-Independent and Identically Distributed (Non-IID) settings. The results demonstrate that GCAMA offers superior test accuracy of the FL global model compared to state-of-the-art methods. Our code is available at: https://github.com/jjzgeeks/GradCAM-AE


INTRODUCTION
Federated learning (FL), as a decentralized machine learning approach, enables multiple user devices to cooperatively train a shared model under the orchestration of a server without sharing their local data. User devices in FL iteratively train local model updates (e.g., weight parameters or gradients) using their proprietary data. Rather than transmitting raw, private data, the devices upload model updates to a server for aggregation. In response, the server aggregates the local model updates to generate a common global model that is then distributed back to the devices to update their respective local models [7,14,15]. Such a communication round repeats until the model achieves a satisfactory accuracy level.
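As a concrete illustration of the aggregation step, the following is a minimal numpy sketch of FedAvg-style parameter averaging; the function name and the toy updates are ours, not from the paper:

```python
import numpy as np

def fedavg(updates):
    """Average a list of local model updates (dicts of weight arrays)."""
    keys = updates[0].keys()
    return {k: np.mean([u[k] for u in updates], axis=0) for k in keys}

# Three devices upload two-parameter updates; the server averages them
# element-wise to produce the new global model.
locals_ = [{"w": np.array([1.0, 2.0]), "b": np.array([0.5])},
           {"w": np.array([3.0, 4.0]), "b": np.array([1.5])},
           {"w": np.array([5.0, 6.0]), "b": np.array([2.5])}]
global_model = fedavg(locals_)
```

In practice the average is usually weighted by each device's local dataset size; uniform weights keep the sketch short.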
The distributed architecture of FL makes it particularly susceptible to poisoning attacks. User devices compromised by adversarial actors can alter model update parameters, subsequently contaminating the global FL model [8,9]. Existing countermeasures, which leverage Euclidean distance-based metrics to discern deviations between malicious and benign models, have demonstrated efficacy against such attacks [1,13]. However, sophisticated adversaries can craft malicious model updates such that their Euclidean distances to benign counterparts remain below a designated threshold, thereby eluding detection by defenses reliant on this metric.
This paper proposes a new defense mechanism, dubbed GCAMA, against model poisoning attacks on FL. GCAMA couples a Gradient-weighted Class Activation Mapping (GradCAM [10])-based approach with an autoencoder (AE) to offer a substantially more powerful detection capability than existing Euclidean distance-based approaches. Specifically, GradCAM is applied at the server to create a GradCAM heat map for every uploaded model update. An autoencoder is applied to reconstruct the heat maps while magnifying their discernible features. The reconstruction errors of the GradCAM heat maps are measured, and a threshold is created based on the statistics of the reconstruction errors. A reconstructed GradCAM heat map with a reconstruction error surpassing the threshold is categorized as atypical, and the corresponding model update as malicious. The key contributions are summarized as follows:
• We propose a novel defense method against model poisoning attacks on FL, where GradCAM and an autoencoder are orchestrated for the successful detection of subtle attacks.
• GradCAM is adopted to produce a heat map for each uploaded local model, hence transforming each local model into a lower-dimensional, visual representation. This provides a conduit for pinpointing malicious model updates by singling out anomalous GradCAM heat maps.
• An autoencoder is utilized to reproject the GradCAM heat maps, accentuating their hidden features and improving both the distinguishability of the heat maps and the success rate of identifying anomalous heat maps and malicious local models.
We conducted a comprehensive assessment of the proposed GCAMA framework using two public datasets, CIFAR-10 and GTSRB, under Non-Independent and Identically Distributed (Non-IID) settings.
Our assessment encompasses two prominent deep learning models, i.e., ResNet-18 [2] and MobileNetV3-Large [3]. Our approach offers superior test accuracy of the FL global model compared to the state-of-the-art methods.

PROPOSED GCAMA AGAINST MODEL POISONING ATTACKS
In this section, we elaborate on GCAMA, where GradCAM and an autoencoder are leveraged on the server side to pinpoint malicious local models.

GCAMA Architecture
On the device side, each benign device utilizes a Deep Neural Network (DNN) tailored specifically for image classification tasks, as shown in Fig. 1. The DNN model extracts relevant features from an input image (e.g., a bird) and subsequently maps them to the corresponding classes. The architecture of a typical DNN comprises multiple layers, each with a specific function. (1) The first layer is a convolutional layer, which applies a set of filters to the input image, thereby extracting intrinsic features, such as edges, corners, and textures. The output of the convolutional layer is a set of feature maps, each representing different aspects of the image. (2) The following layer is a pooling layer, which reduces the dimensionality of the feature maps while retaining essential information, such as the salient features of the bird's beak or feathers.
Various pooling methods can be employed, including max pooling and average pooling, all with the objective of downsampling the feature maps while retaining their salient features. Some DNN architectures differ from traditional pooling operations; for example, models like SqueezeNet [5], ResNet [2], DenseNet [4], and MobileNet [3] employ alternative strategies. (3) Subsequent to several convolutional and pooling layers, the extracted features are fed into one or multiple fully connected layers, where the final classification is performed.
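The layer functions just described can be sketched in a few lines of numpy; the toy image, the edge filter, and the trivial fully connected layer below are illustrative assumptions, not the paper's actual models:

```python
import numpy as np

def conv2d(img, kern):
    """Valid 2-D convolution (cross-correlation): slide kern over img."""
    H, W = img.shape
    kh, kw = kern.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern)
    return out

def maxpool2d(fm, s=2):
    """Non-overlapping s x s max pooling: downsample, keep salient responses."""
    H, W = fm.shape
    return fm[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
edge = np.array([[1.0, -1.0], [1.0, -1.0]])      # simple vertical-edge filter
fm = conv2d(img, edge)                           # 5x5 feature map
pooled = maxpool2d(fm)                           # 2x2 map after pooling
logits = pooled.flatten() @ np.ones(4)           # trivial fully connected layer
```

Real models such as ResNet-18 stack many such layers with learned filters; the sketch only shows the data flow of one convolution, one pooling, and one dense layer.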
Upon receiving the local model updates from the devices, the server aggregates the local models, where benign local models can be mingled with malicious ones. We aim to design a defense countermeasure that can pinpoint and filter out malicious devices. GradCAM [10] visualization is distinguished by its high resolution and high class-discriminative ability compared to other methods, e.g., Class Activation Mapping [16]. GradCAM is leveraged in our design to detect malicious local model updates. Specifically, the server randomly picks an image from the global model testing dataset, which incorporates all categories of the devices' datasets, as input, and passes it through the convolutional layers whose weight and bias parameters are replaced by each model update W_m, m ∈ K ∪ M (m is the index of local model updates, including benign and malicious), obtaining the feature maps A_m with C channels. The feature maps A_m are fed into a fully connected layer for the final classification.
To obtain the class-discriminative localization map L^c_m ∈ ℝ^{u×v} with width u and height v for any class c, GradCAM first computes the gradient of the score y^c for class c (before softmax) with respect to each feature map A^k_m of a convolutional layer, i.e., ∂y^c/∂A^k_m, where k ∈ [1, C] is the index of channels. The neuron importance weights α^c_k can be obtained through global average pooling:

α^c_k = (1/Z) Σ_i Σ_j ∂y^c/∂A^k_m(i, j),

where A^k_m(i, j) is the activation at location (i, j) of the feature map A^k_m, and Z = u × v is the number of activations. We apply a rectified linear unit (ReLU) to the weighted linear combination of the forward activations, and the localization map L^c_m is given by

L^c_m = ReLU(Σ_k α^c_k A^k_m).

The server uses the image and the local model updates to obtain a single-channel GradCAM heat map per update that is input into the autoencoder for identification. For conciseness, we suppress the class indicator c and rewrite L^c_m as L_m in what follows.
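As a sanity-check sketch, GradCAM can be computed in closed form for a network whose head is global average pooling followed by a linear classifier: there ∂y^c/∂A^k(i, j) = W[c, k]/Z, so α^c_k = W[c, k]/Z. The shapes and random weights below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
C, u, v = 4, 3, 3                       # channels and spatial dimensions
Z = u * v                               # number of activations per channel
A = rng.standard_normal((C, u, v))      # feature maps A^k from the last conv layer
W = rng.standard_normal((2, C))         # linear head: y^c = sum_k W[c, k] * GAP(A^k)
c = 1                                   # class of interest

# For this head, dy^c/dA^k(i, j) = W[c, k] / Z everywhere, so
# global-average-pooling the gradient yields alpha^c_k = W[c, k] / Z.
grad = np.stack([np.full((u, v), W[c, k] / Z) for k in range(C)])
alpha = grad.mean(axis=(1, 2))          # neuron importance weights, shape (C,)

# Heat map: ReLU over the alpha-weighted sum of the feature maps.
L_c = np.maximum(np.einsum("k,kij->ij", alpha, A), 0.0)
```

For deeper heads the gradient is obtained by backpropagation rather than in closed form, but the pooling and ReLU-weighted-sum steps are identical.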

Autoencoder for Malicious GradCAM Heat Map Identification
Considering that all GradCAM heat maps are unlabeled, an autoencoder is an efficient unsupervised-learning tool for discovering non-linear features in anomaly detection systems [17]. We use an autoencoder to pinpoint abnormal GradCAM heat maps corresponding to the malicious local model updates uploaded by attackers. A canonical autoencoder consists primarily of three components: an encoder, a code (or latent space), and a decoder. Each GradCAM heat map L_m, with size u × v, is flattened into a vector of size 1 × uv, which is further concatenated with the vectors of the other GradCAM heat maps to form L_GradCAM as the input to the encoder. During autoencoder training, the encoder f_θ(·) with parameters θ compresses the GradCAM heat maps from a high-dimensional space to a low-dimensional space z = f_θ(L_GradCAM), also called the code or latent space. The code learns the underlying features or representation of the GradCAM heat maps and is input to the decoder g_φ(·) with parameters φ. The decoder reconstructs the input GradCAM heat maps from the code, i.e., g_φ(z) = L′_GradCAM = g_φ(f_θ(L_GradCAM)). After the training of (θ, φ), the reconstructed GradCAM heat maps are reshaped into the same size as the original GradCAM heat maps.
Loss Function. To minimize the difference between the original and reconstructed GradCAM heat maps, the autoencoder loss function is defined as the mean squared error (MSE) between the encoder input L_GradCAM and the decoder output L′_GradCAM, i.e.,

L(θ, φ) = (1/|K ∪ M|) Σ_{m ∈ K ∪ M} ||L_m − L′_m||².

Once the autoencoder completes training, the server computes the reconstruction error between each reconstructed GradCAM heat map and its corresponding input GradCAM heat map, obtaining the mean reconstruction error per heat map, i.e.,

e_m = (1/(uv)) ||L_m − L′_m||², ∀m ∈ K ∪ M.

The average reconstruction error over all GradCAM heat maps is

ē = (1/|K ∪ M|) Σ_{m ∈ K ∪ M} e_m.

A threshold τ is defined as

τ = λ ē,

where λ is an empirically configured coefficient. Here, τ is used to distinguish between the benign and malicious GradCAM heat maps. If the mean reconstruction error of a GradCAM heat map is greater than the threshold τ, the corresponding input GradCAM heat map is considered potentially abnormal. Otherwise, it is considered a potentially normal GradCAM heat map, because the AE learns to capture variations in normal GradCAM heat maps during training.
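For intuition, the training loop can be sketched with a minimal linear autoencoder trained by gradient descent on the MSE loss. The sizes, learning rate, and the linear encoder/decoder standing in for f_θ and g_φ are simplifying assumptions; the paper does not prescribe this architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, d = 20, 16, 4                 # heat maps, flattened size u*v, code size
X = rng.standard_normal((N, D))     # rows: flattened GradCAM heat maps

# Linear encoder/decoder: z = X @ W_e, X' = z @ W_d, trained on MSE.
W_e = 0.1 * rng.standard_normal((D, d))
W_d = 0.1 * rng.standard_normal((d, D))
lr = 0.01
for _ in range(2000):
    Z = X @ W_e                     # code (latent space)
    X_rec = Z @ W_d                 # reconstruction L'_GradCAM
    E = X_rec - X                   # residual
    # Gradient-descent steps on the MSE loss (constant factors folded into lr).
    W_d -= lr * Z.T @ E / N
    W_e -= lr * X.T @ (E @ W_d.T) / N

mse = np.mean((X @ W_e @ W_d - X) ** 2)  # final reconstruction error
```

A linear autoencoder converges toward the principal-subspace (PCA-like) reconstruction; real deployments use non-linear layers to capture richer structure in the heat maps.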
Note that the AE is optimized to minimize the reconstruction errors between the input and reconstructed GradCAM heat maps during training; in other words, the AE learns to reconstruct normal GradCAM heat maps and encounters difficulties with anomalies that do not conform to the learned patterns. It is therefore reasonable to expect that GradCAM heat maps reconstructed with low errors are normal, while those with high reconstruction errors are potentially abnormal.
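The thresholding rule described above can be sketched as follows; the coefficient value λ = 1.5 and the toy error values are illustrative assumptions:

```python
import numpy as np

def flag_malicious(errors, lam=1.5):
    """Flag heat maps whose mean reconstruction error exceeds tau = lam * mean."""
    errors = np.asarray(errors, dtype=float)
    tau = lam * errors.mean()   # threshold from the statistics of all errors
    return errors > tau, tau

# 18 benign heat maps reconstruct well; 2 malicious ones reconstruct poorly.
recon_errors = [0.1] * 18 + [2.0, 2.5]
is_malicious, tau = flag_malicious(recon_errors)
```

The corresponding model updates flagged here would be excluded from aggregation; λ trades off false positives (benign updates dropped) against false negatives (malicious updates kept).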

PERFORMANCE EVALUATION
We use two datasets, i.e., CIFAR-10 and GTSRB [12], to evaluate the performance of the proposed GCAMA. We compare against the following two state-of-the-art defense schemes. Multi-Krum [1] computes a score for each local model update, defined as the sum of its Euclidean distances to its nearest neighbors; updates with high scores are regarded as malicious and excluded. FAA-DL [11] is a lightweight, unsupervised anomaly detection method based on a one-class support vector machine (SVM), which utilizes an appropriate kernel function and soft margins to estimate a nonlinear decision boundary separating benign from malicious local model updates.
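To make the Multi-Krum baseline concrete, here is a minimal sketch of its scoring rule, summing squared Euclidean distances to the n − f − 2 closest peers as in [1]; the toy updates and the value f = 1 are our assumptions:

```python
import numpy as np

def multikrum_scores(updates, f):
    """Score each update by the sum of squared distances to its n-f-2 nearest peers."""
    n = len(updates)
    U = np.stack(updates)
    # Pairwise squared Euclidean distances between flattened updates.
    d2 = np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=-1)
    scores = []
    for i in range(n):
        others = np.delete(d2[i], i)              # distances to all peers
        scores.append(np.sort(others)[: n - f - 2].sum())
    return np.array(scores)

# Five benign updates clustered near the origin, one obvious outlier.
ups = [np.zeros(3) + 0.01 * k for k in range(5)] + [np.full(3, 10.0)]
scores = multikrum_scores(ups, f=1)               # outlier gets the largest score
```

High-scoring updates are excluded from aggregation; as the paper argues, an attacker who keeps its update within the benign cluster's distance range defeats this rule.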

CONCLUSION
In this paper, we proposed GCAMA, a novel and robust defense against poisoning attacks on FL. GradCAM was leveraged to process the received local model updates with a selected image from the test dataset, generating the corresponding GradCAM heat maps. An autoencoder was incorporated to accentuate the hidden features of the GradCAM heat maps. It was demonstrated experimentally that GCAMA significantly outperforms cutting-edge defense schemes under the same tested setting.

Figure 1: GradCAM-assisted defense against poisoning attacks on FL. The server arbitrarily selects an image (e.g., an image with the label "bird") from the global model testing dataset to create GradCAM heat maps for every uploaded local model update. These GradCAM heat maps flow into an autoencoder for malicious model detection.

Fig. 2 illustrates the test accuracy of ResNet-18 on Non-IID CIFAR-10 and Non-IID GTSRB. Under the Non-IID CIFAR-10 setting, GCAMA achieves the highest test accuracy (0.8) of the FL global model and converges quickly (around the 30th communication round), as it involves more benign devices in FL training. This indicates that GCAMA can accurately filter malicious model updates. Multi-Krum directly inspects the millions (or more) of parameters of the local model updates based on Euclidean distance. The Euclidean distance between the crafted malicious local model updates and benign ones stays within the threshold designated by the server. This reveals that the malicious local model updates can elude detection by the server and participate in the FL training process through multiple communication rounds, resulting in a corrupted global model. FAA-DL also directly classifies the server-aggregated local model updates as benign or malicious. However, there are two key reasons why FAA-DL fails in our experiments. First, the local model updates occupy a high-dimensional feature space; the resulting curse of dimensionality leads to data sparsity, making it challenging for FAA-DL to find a suitable margin separating benign from malicious local model updates. Second, FAA-DL is sensitive to class imbalance (18 benign local model updates vs. 2 malicious ones), which biases its decision boundary towards the majority class and makes it struggle to detect malicious local model updates effectively.