Abstract
Federated Learning is a distributed machine learning paradigm dealing with decentralized and personal datasets. Since data reside on devices such as smartphones and virtual assistants, labeling is entrusted to the clients, or labels are extracted in an automated way. Specifically, in the case of audio data, acquiring semantic annotations can be prohibitively expensive and time-consuming. As a result, an abundance of audio data remains unlabeled and unexploited on users’ devices. Most existing federated learning approaches focus on supervised learning without harnessing the unlabeled data. In this work, we study the problem of semi-supervised learning of audio models via self-training in conjunction with federated learning. We propose FedSTAR, a federated self-training approach that exploits large-scale on-device unlabeled audio data to improve the performance of audio recognition models.
1 INTRODUCTION
The emergence of smartphones, wearables, and modern Internet of Things (IoT) devices results in a massive amount of highly informative data generated continuously from a multitude of embedded sensors and logs of user interactions with various applications. The ubiquity of these contemporary devices and the exponential growth of the data produced at the edge provide a unique opportunity to tackle critical problems in various domains, such as healthcare, well-being, manufacturing, and infrastructure monitoring. Notably, the advent of deep learning has enabled us to leverage these raw data directly for learning models while leaving ad hoc (hand-designed) approaches largely redundant. The improved schemes for learning deep networks and the availability of massive labeled datasets have brought tremendous advancements in several areas, including language modeling, audio understanding, object recognition, image synthesis, and more. Traditionally, developing machine learning models or performing analytics in a data center context requires the data from IoT devices to be pooled or aggregated in a centralized repository before processing it further for the desired objective. However, the rapidly increasing size of available data, in combination with the high communication costs and possible bandwidth limitations, renders the accumulation of data in a cloud-based server infeasible [24]. Additionally, such centralized data aggregation schemes could also be restricted by privacy issues and regulations (e.g., the General Data Protection Regulation). Due to these factors and the growing computational and storage capabilities of distributed devices, it is appealing to leave the data decentralized and perform operations directly on the device that collects the data, primarily utilizing local resources.
The rapidly evolving Federated Learning (FL) field is concerned with distributed training of machine learning models on the decentralized data residing on remote devices such as smartphones and wearables. The key idea behind FL is to bring the computation (or code) closer to where the data reside to harness data locality extensively. Specifically, in a federated setting, minimal updates to the models (e.g., parameters of a neural network) are performed entirely on-device and communicated to the central server, which aggregates these updates from all participating devices to produce a unified global model. Unlike the standard way of learning models, the salient differentiating factor is that the data never leaves the user’s device, which is an appealing property for privacy-sensitive data. This strategy has been applied on a wide range of tasks in recent years [10, 23, 34, 44]. Nevertheless, a common limitation of existing approaches is that they primarily focus on a supervised learning regime. The implicit assumption that the labeled data is widely available on the device, or it can be easily labeled through user interaction or programmatically, such as for keyword prediction or photo categorization, is in most pragmatic cases unrealistic.
In reality, on-device data is largely unlabeled and constantly expanding in size. It cannot be labeled to the same extent as standard datasets, which are annotated via crowd-sourcing or other means for training deep neural networks. Due to the prohibitive cost of annotation, users have little to no incentive to label their data, and for various important tasks they lack the domain knowledge to perform the annotation appropriately; consequently, most of the data residing on devices remains unlabeled. This is especially true when considering the utilization of audio data to perform various audio recognition tasks, which have recently attracted increasing interest from researchers. As a result, numerous audio recognition systems have been developed, such as for wildlife monitoring [28, 36] and surveillance [5]. In addition to monitoring applications, highly accurate acoustic models are utilized for keyword spotting for virtual assistants [23], anomaly detection for machine sounds [18], and in the development of health risk diagnosis systems, such as cardiac arrest detection [4]. However, in the majority of such applications, there is no straightforward manner for the annotation process. For instance, suppose that we have a sleep tracker application that assesses a person’s risk of obstructive sleep apnea using breathing and snoring sounds during sleep. In this case, the end-users may not be able to evaluate their sleeping sounds sufficiently, and clinicians may need to analyze and annotate the samples. Even in cases where no human expertise is required, like in a music tagging application, the correct labeling of songs requires effort on the user’s end. Additionally, there are cases where distributed devices host models with no human-in-the-loop to annotate the audio data, such as surveillance devices, making the labeling process infeasible. Thus, in many realistic scenarios for FL, local audio data will be primarily unlabeled.
This leads to a novel FL problem, namely, semi-supervised federated learning, where users’ devices collectively hold a massive amount of unlabeled audio samples and only a fraction of labeled audio examples.
Semi-supervised learning techniques have been widely deployed in a centralized learning setting to utilize readily available unlabeled data and could also be applied in federated learning settings. In particular, with semi-supervision of models, available unlabeled data can be exploited during the training phase, improving the overall performance of the resulting model [40]. Pseudo-labeling is a widely applied semi-supervised learning method, which relies on the predictions of a model on unlabeled data, i.e., pseudo-labels, to utilize unlabeled data during the learning phase [22]. With no structural requirements on the input modalities and a tiny computational overhead, pseudo-labeling is an ideal candidate to be applied in federated learning settings, where device heterogeneity and computational resources vary across devices. To this end, we propose a federated self-training approach, named FedSTAR.
Apart from the deficiency of labels, FL introduces the additional challenges of system and statistical heterogeneity [17]. These challenges manifest as diversity in device hardware and data collection, which can significantly affect the number of devices participating in each federated round as well as the on-device data distribution. Several FL techniques provide flexibility in selecting a fraction of clients in each training round and address the non-i.i.d. nature of clients’ data distributions, such as FedAvg [30] and FedProx [25]. The training convergence properties of such distributed optimization methods are discussed in Reference [17], where a clear reduction in the convergence rates is reported. In a centralized setting, self-supervised pre-training can improve the model’s convergence and generalization through leveraging pre-training on massive unlabeled datasets [35]. With self-supervised learning, the model is able to learn useful representations from unlabeled data; thus, when used for the downstream task, a self-supervised model can significantly improve the training efficiency and predictive performance [35]. To address the issue of slow training convergence in federated settings, we propose the utilization of self-supervised pre-trained models as model initialization for the FL procedure, as compared to the naive random initialization of model parameters. Through extensive evaluation, we demonstrate that the convergence rate of our proposed semi-supervised federated algorithm, i.e., FedSTAR, improves significantly when initialized with a self-supervised pre-trained model.
To the best of our knowledge, this is the first work to study semi-supervised federated learning for audio recognition tasks.
The main contributions of this work are as follows:
We study the practical problem of semi-supervised federated learning for audio recognition tasks to address the lack of labeled data, which presents a major challenge for learning on-device models.
We design a simple yet effective approach based on self-training, called FedSTAR. It exploits large-scale unlabeled distributed data in a federated setting with the help of a novel adaptive confidence thresholding mechanism for effectively generating pseudo-labels.
We exploit self-supervised models pre-trained on the FSD-50K corpus [6] to significantly improve training convergence in federated settings.
We demonstrate through extensive evaluation that our technique is able to effectively learn generalizable audio models under a variety of federated settings and label availability on diverse public datasets, namely, Speech Commands [41], Ambient Context [33], and VoxForge [29].
We show that FedSTAR, with as few as 3% labeled data, can improve the recognition rate by 13.28% on average across all datasets compared to fully supervised federated models.
The rest of the article is organized as follows: In Section 2, an overview of the related work is provided. Section 3 presents an overview of related paradigms and methodologies as background information, and Section 4 introduces the proposed federated self-training approach for semi-supervised audio recognition. Section 5 presents an evaluation of FedSTAR.
2 RELATED WORK
Federated Learning. FL has been attracting growing attention, thanks to its unique characteristic of collaboratively training machine learning models without actually sharing local data and compromising users’ privacy [19]. The most popular and simplistic approach to learning models from decentralized data is the Federated Averaging (FedAvg) algorithm [30]. Specifically, FedAvg performs several local stochastic gradient descent (SGD) steps on a sampled subset of devices’ data in parallel and aggregates the locally learned model parameters on a central server to generate a unified global model through weighted averaging. This strategy has proved to work relatively well for a wide range of tasks in i.i.d. settings [23, 44]. At the same time, the performance can decrease substantially when FedAvg is exposed to non-i.i.d. data distribution [17, 45]. Authors in Reference [45] proposed globally sharing a portion of the dataset to improve FL performance under non-i.i.d settings. In addition to the challenge introduced by data distribution, communication efficiency is another critical problem in FL. The communication challenges could be alleviated by increasing the number of local SGD steps between sequential communication stages. However, with the increase of SGD steps, the device’s model may begin to diverge, and the aggregation of such models can affect the generalization of global models [25]. FedProx was proposed to tackle this issue by adding a loss term to restrict the local models’ updates to be closer to the existing global model [25]. Nevertheless, a typical limitation of existing work is the focus on a supervised learning regime with the implicit assumption that the local private data is fully labeled or could be labeled simplistically through labeling functions. However, in the majority of pragmatic scenarios, a straightforward annotation process is non-existent.
Recently, performing on-device federated training of acoustic models has attracted considerable attention [7, 9, 12, 23, 44]. In Reference [23], FL was employed for a keyword spotting task and the development of a wake-word detection system, whereas, References [7, 9] investigated the effect of non-i.i.d. distributions on the same task. In Reference [7], a highly skewed data distribution scenario was considered, where a large set of speakers used their devices to record a set of sentences. To address the challenges introduced due to the non-i.i.d. distribution of data, a word-error-rate model aggregation strategy was developed. In addition, a training scheme with a centralized model, pre-trained on a small portion of the dataset, was also examined. Furthermore, Reference [9] considered a scenario where devices might hold unlabeled audio samples and used a semi-supervised federated scheme based on a teacher-student architecture to exploit unlabeled audio data. However, the teacher model relied on additional high-quality labeled data for training in a centralized setting. Likewise, Reference [12] introduced a framework for privacy-preserving training of user authentication models with FL using labeled audio data. Nonetheless, all prior approaches consider only semantically annotated audio examples or require supplementary labeled data on the server-side to utilize the available unlabeled audio data that reside on devices. To address these problems, we propose a self-training approach to exploit unlabeled audio samples residing on clients’ devices. In addition, as servers often possess the computational resources to efficiently pre-train a model on a massive unlabeled dataset, we employ self-supervision to develop a model that can be used as a highly effective starting point for federated training instead of using randomly initialized weights.
Semi-Supervised Learning. In semi-supervised learning (SSL), we are provided with a dataset containing both labeled and unlabeled examples, where the labeled fraction is generally tiny compared to the unlabeled one and the curation of strong labels for the unlabeled dataset is impractical due to time constraints, cost, and privacy-related issues [46]. While there is a wide range of SSL methods and approaches that have been developed in the area of deep learning, we will mainly focus on the self-training or pseudo-labeling approach [22]. Self-training uses the prediction on unlabeled data to supervise the model’s training in combination with a small percentage of labeled data. Specifically, pseudo-labels are constructed by extracting one-hot labels from highly confident predictions on unlabeled data. These are then used as training targets in a supervised learning regime. This simplistic approach of utilizing unlabeled data has been combined with various methods to further improve the training efficiency. In Reference [1], the authors demonstrated that setting a minimum number of labeled samples per training batch can be effective to reduce over-fitting due to noise accumulation on generated predictions. In addition, the use of a scalar temperature for scaling the softmax output achieves a softer probability distribution over classes for the predictions and urges models to generate the correct pseudo-labels without suffering from over-confidence [11]. This temperature scaling approach can be highly beneficial in modern deep neural network architectures, which have been shown to suffer from over-confident predictions [11]. Furthermore, the authors of Reference [2] proposed MixMatch, which sharpens the prediction distribution to further improve the generated pseudo-labels. The sharpening is performed by averaging the prediction distributions of augmented versions of the same unlabeled sample.
Apart from self-training, alternative SSL approaches introduce a loss term, which is computed on unlabeled data, to encourage the model to generalize better to unseen data. Based on the objective of the loss term, we can classify these approaches in two categories: consistency regularization techniques—which are based on the principle that a classifier should produce the same class distribution for an unlabeled sample even after augmentation [31, 38]; and entropy minimization techniques—which aim to motivate the model to produce low-entropy (high-confident) predictions for all unlabeled data [8]. For a concise review and realistic evaluation of various deep learning based semi-supervised techniques, we refer interested readers to Reference [32].
A recent study [16] has questioned the soundness of the assumption that devices have well-annotated labels in a federated setting. Existing semi-supervised federated learning (SSFL) approaches, such as FedMatch [15] and FedSemi [26], have only recently started to be examined under the vision domain to exploit unlabeled data. FedMatch decomposes the parameters learned from labeled and unlabeled on-device data and uses an inter-client consistency loss to enforce consistency between the pseudo-labeling predictions made across multiple devices. In Reference [26], FedSemi adapts a mean teacher approach to harvest the unlabeled data and proposes an adaptive layer selection to reduce the communication cost during the training process. Apart from these methods, many studies consider different data distribution schemes, including sharing an unlabeled dataset across devices [14]. Last, it is important to note that recent works employ SSFL to address problems in healthcare domain, namely, electronic health records [13] and for problems like human activity recognition [39]. Nevertheless, none of the discussed approaches focuses on learning models for audio recognition tasks by utilizing devices’ unlabeled audio samples.
3 BACKGROUND
In this section, we provide a brief overview of semi-supervised and federated learning paradigms, as they act as fundamental building blocks of our federated self-training approach for utilizing large-scale on-device unlabeled audio data in a federated setting.
3.1 Semi-supervised Learning
Given enough computational power and supervised data, deep neural networks have proven to achieve human-level performance on a wide variety of problems [21]. However, the curation of large-scale datasets is very costly and time-consuming, as it either requires crowd-sourcing or domain expertise, such as in the case of medical imaging. Likewise, for several practical problems, it is simply not possible to create a large enough labeled dataset (e.g., due to privacy issues) to learn a model of reasonable accuracy. In such cases, SSL algorithms offer a compelling alternative to fully supervised methods for jointly learning from the fraction of labeled and a large number of unlabeled instances.
Specifically, SSL aims to solve the problem of learning with partially labeled data where the ratio of unlabeled training examples is usually much larger than that of the labeled ones. Formally, let \( \mathcal {D}_{L} = \left\lbrace \left(x_{l_i},y_{i} \right) \right\rbrace _{i=1}^{N_{l}} \) represent a set of labeled data, where \( N_{l} \) is the number of labeled data, \( x_{l_i} \) is an input instance, \( y_{i} \in \left\lbrace 1, \ldots , \mathcal {C} \right\rbrace \) is the corresponding label, and \( \mathcal {C} \) is the number of label categories for the \( \mathcal {C} \)-way multi-class classification problem. Besides, we have a set of unlabeled samples denoted as \( \mathcal {D}_{U} = \left\lbrace x_{u_i} \right\rbrace _{i=1}^{N_{u}} \), where \( N_{u} \) is the number of unlabeled data. Let \( p_\theta \left(y \mid x \right) \) be a neural network that is parameterized by weights \( \theta \) that predicts softmax outputs \( \widehat{y} \) for a given input x. In the setting of semi-supervised learning, where in general \( N_{l} \ll N_{u} \), we need to simultaneously minimize losses on both labeled and unlabeled data to learn the model’s parameters \( \theta \). Specifically, our objective is to minimize the following loss function: (1) \( \begin{equation} \mathcal {L}_{\theta } = \mathcal {L}_{s_{\theta }}(\mathcal {D}_{L}) + \mathcal {L}_{u_{\theta }}(\mathcal {D}_{U}), \end{equation} \) where \( \mathcal {L}_{s_{\theta }}(\mathcal {D}_{L}) \) and \( \mathcal {L}_{u_{\theta }}(\mathcal {D}_{U}) \) are the loss terms from supervised and unsupervised learning, respectively.
The teacher-student self-training framework is a popular scheme to simultaneously learn from both labeled and unlabeled data. In this approach, we first use the available labeled data to train a good teacher model, which is then utilized to label any available unlabeled data. Consequently, both labeled and unlabeled data are used to jointly train a student model. In this way, the model assumes a dual role as a teacher and a student. In particular, as a student, it learns from the available data, while as a teacher, it generates targets to help the learning process of the student. Since the model itself generates targets, they may very well be incorrect; thus, the learning experience of the student model depends solely on the ability of the teacher model to generate high-quality targets [43].
3.2 Federated Learning
FL is a novel collaborative learning paradigm that aims to learn a single, global model from data stored on remote clients with no need to share their data with a central server. In particular, with the data residing on clients’ devices, a subset of clients is selected to perform a number of local SGD steps on their data in parallel in each communication round. Upon completion, clients exchange their models’ weights updates with the server, aiming to learn a unified global model by aggregating these updates. Formally, the goal of FL is typically to minimize the following objective function: (2) \( \begin{equation} \min _{\theta } \mathcal {L}_{\theta } = \sum _{k=1}^{K} \gamma _{k} {\mathcal {L}}_k(\theta), \end{equation} \) where \( \mathcal {L}_k \) is the minimization function of the kth client and \( \gamma _{k} \) corresponds to the relative impact of the kth client to the construction of the global model. For the FedAvg algorithm, parameter \( \gamma _{k} \) is equal to the ratio of client’s local data \( N_k \) over all training samples \( (\gamma _{k} = \frac{N_k}{N}) \).
Specifically, let \( \mathcal {D} = \left\lbrace \left(x_{l_i},y_{i} \right) \right\rbrace _{i=1}^{N} \) be a dataset of N labeled examples, similarly to the previously discussed dataset \( \mathcal {D}_{L} \) in Section 3.1. Given K clients, \( \mathcal {D} \) is decomposed into K sub-datasets \( \mathcal {D}^{k}=\left\lbrace \left(x_{l_i},y_{i} \right) \right\rbrace _{i=1}^{N_k} \) corresponding to each clients’ privately held data. For an initial global model G, the \( \mathit {r} \)th communication round starts with server randomly selecting a portion q (\( 0\lt q\le K \)) of clients to participate in the current training round. Afterwards, each client’s local model receives the global parameters \( \theta _r^G \) and performs supervised learning on their local dataset \( \mathcal {D}^{k} \) to minimize \( \mathcal {L}_{k}(\theta _r^k) \). Subsequently, G aggregates over locally updated parameters by performing \( \theta _{r+1}^G \leftarrow \sum \nolimits _{i=1}^{q} \frac{N_i}{N} \theta _r^i \). The presented circular training process, comprising model weights’ exchanges between server and clients, repeats until \( \theta ^G \) converges after R rounds.
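To make the aggregation step concrete, the weighted averaging \( \theta _{r+1}^G \leftarrow \sum \nolimits _{i=1}^{q} \frac{N_i}{N} \theta _r^i \) can be sketched in a few lines. Representing each client model as a single flattened parameter vector, and the function name itself, are simplifications for illustration only:

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average of client parameter vectors (FedAvg-style).

    client_weights: list of 1-D NumPy arrays, one flattened parameter
    vector per participating client. client_sizes: the number of local
    samples N_k per client, giving aggregation weights gamma_k = N_k / N.
    """
    total = float(sum(client_sizes))
    agg = np.zeros_like(client_weights[0], dtype=float)
    for theta_k, n_k in zip(client_weights, client_sizes):
        agg += (n_k / total) * theta_k
    return agg

# Two clients: one with 30 local samples, one with 10 (weights 0.75 / 0.25).
theta = fedavg_aggregate([np.array([0.0, 4.0]), np.array([4.0, 0.0])], [30, 10])
```

A client holding more data thus pulls the global parameters proportionally closer to its own local solution.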
4 METHODOLOGY
In this section, we present our federated self-training approach, namely, FedSTAR.
4.1 Problem Formulation
We focus on the problem of SSFL, where labeled data are scarce across users’ devices. At the same time, clients collectively hold a massive amount of unlabeled audio data. In addition, in a typical federated learning setting, the on-device data distribution depends on the profile of the users operating the devices. Thus, it is a common scenario for both labeled and unlabeled data to originate from the same data distribution. Based on the aforementioned assumption, with FedSTAR we aim to exploit both the scarce labeled data and the abundant unlabeled data residing on clients’ devices to learn a global audio recognition model.
Formally, under the setting of SSFL, each of the K clients holds a labeled set, \( \mathcal {D}_{L}^{k} = \lbrace \left(x_{l_i},y_{i} \right) \rbrace _{i=1}^{N_{l,k}} \) and an unlabeled set \( \mathcal {D}_{U}^{k} = \lbrace x_{u_i}\rbrace _{i=1}^{N_{u,k}} \), where \( N_{k} = N_{l,k} + N_{u,k} \) is the total number of data samples stored on the \( \mathit {k} \)th client and \( N_{l,k} \ll N_{u,k} \). We desire to learn a global unified model G without clients sharing any of their local data, \( \mathcal {D}_L^k \) and \( \mathcal {D}_U^k \). To this end, our objective is to simultaneously minimize both supervised and unsupervised learning losses during each client’s local training step on the \( \mathit {r} \)th round of the FL algorithm. Specifically, the minimization function, similar to the one presented in Equation (2), is: (3) \( \begin{equation} \min _{\theta } {\mathcal {L}}_{\theta } = \sum _{k=1}^{K} \gamma _{k} {\mathcal {L}}_k(\theta) \textrm { where } \mathcal {L}_{k}(\theta) =\mathcal {L}_{s_{\theta }}(\mathcal {D}_{L}^{k}) + \beta \mathcal {L}_{u_{\theta }}(\mathcal {D}_{U}^{k}). \end{equation} \) Here, \( \mathcal {L}_{s}(\mathcal {D}_{L}^{k}) \) is the loss terms from supervised learning on the labeled data held by the kth client, and \( \mathcal {L}_{u}(\mathcal {D}_{U}^{k}) \) represents the loss term from unsupervised learning on the unlabeled data of the same client. We add the parameter \( \beta \) to control the effect of unlabeled data on the training procedure, while \( \gamma _{k} \) is the relative impact of the kth client on the construction of the global model G.
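The per-client objective in Equation (3) can be sketched as a toy computation combining the supervised and unsupervised cross-entropy terms with the weighting factor \( \beta \); the softmax probabilities below are hard-coded placeholders rather than outputs of an actual model, and the function names are illustrative:

```python
import numpy as np

def cross_entropy(targets_onehot, probs, eps=1e-12):
    # Mean categorical cross-entropy over a batch.
    return float(-np.mean(np.sum(targets_onehot * np.log(probs + eps), axis=1)))

def client_loss(y_l, p_l, y_pseudo, p_u, beta=1.0):
    """L_k(theta) = L_s(D_L^k) + beta * L_u(D_U^k), as in Equation (3)."""
    return cross_entropy(y_l, p_l) + beta * cross_entropy(y_pseudo, p_u)

# Toy batch: one labeled and one pseudo-labeled example over 2 classes.
y_l = np.array([[1.0, 0.0]]); p_l = np.array([[0.9, 0.1]])
y_u = np.array([[0.0, 1.0]]); p_u = np.array([[0.2, 0.8]])
loss_supervised_only = client_loss(y_l, p_l, y_u, p_u, beta=0.0)  # -ln(0.9)
loss_full = client_loss(y_l, p_l, y_u, p_u, beta=1.0)             # -ln(0.9) - ln(0.8)
```

Setting \( \beta = 0 \) recovers purely supervised federated training, which makes the role of the unlabeled term explicit.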
4.2 Self-training with Pseudo Labeling
Self-training via pseudo-labeling has been widely used in semi-supervised learning [40]. The objective of highly effective teacher-student self-training approaches is to train a teacher model, which supervises the learning process of a student model that learns from labeled and unlabeled data jointly. First, a teacher model is built with the available labeled data and afterwards this is exploited to make predictions for the unlabeled samples. Subsequently, the student model is trained on both labeled and predicted samples. We propose a self-training technique with a dynamic prediction confidence threshold to learn from the unlabeled audio data residing on the client’s device, thus boosting the performance of models trained in federated settings with varying percentages of labeled examples. For audio classification tasks, to learn from the labeled datasets \( \mathcal {D}_{L}^k \) across all participating clients, we apply cross-entropy loss as follows: (4) \( \begin{equation} \begin{aligned}\mathcal {L}_{s}(\mathcal {D}_{L}^{k}) = - \frac{1}{N_{l,k}} \sum \limits _{i=1}^{N_{l,k}}\sum \limits _{j=1}^{C} y_{i}^{j} \log (\mathit {f}_{j}^{\theta ^{k}}(x_{l_i})) = \mathcal {L}_{CE}\left(y,p_{\theta ^k}\left(y\mid x_l \right) \right) \end{aligned} . \end{equation} \)
Next, to learn from unlabeled data, we generate pseudo-labels \( \widehat{y} \) for all available unlabeled data \( x_{u} \) on client \( \mathit {k} \) by performing: (5) \( \begin{equation} \begin{aligned}\widehat{y} = \phi \left(z,T \right) = \mathop{\text{arg max}}\limits_{i \in \left\lbrace 1, \ldots , \mathcal {C} \right\rbrace } \left(\frac{e^{z_i/T}}{\sum _{j=1}^{\mathcal {C}} e^{z_j/T}} \right) \end{aligned} , \end{equation} \) where \( z_{i} \) are the logits produced for the input sample \( x_{u_i} \) by the kth client model \( p_{\theta ^k} \) before the softmax layer. In essence, \( \phi \) produces categorical labels from the “softened” softmax values, in which temperature scaling is applied with a constant scalar temperature T. As the arg max of the softmax function remains unaltered, the predicted pseudo-label \( \widehat{y} \) is identical to the one obtained from the original prediction (without scaling) for an unlabeled sample \( x_u \); however, the prediction confidence is weakened. A dynamic confidence threshold \( \tau \) following a cosine schedule is proposed to discard low-confidence predictions when generating pseudo-labels. For the obtained pseudo-labels, we then perform standard cross-entropy minimization while using \( \widehat{y} \) as targets in the following manner: (6) \( \begin{equation} \begin{aligned}\mathcal {L}_{u}(\mathcal {D}_{U}^{k}) &= - \frac{1}{N_{u,k}} \sum \limits _{i=1}^{N_{u,k}}\sum \limits _{j=1}^{C} \widehat{y_{i}}^{j} \log (\mathit {f}_{j}^{\theta ^{k}}(x_{u_i})) = \mathcal {L}_{CE}\left(\widehat{y},p_{\theta ^k}\left(x_u\right) \right)\!. \end{aligned} \end{equation} \)
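The pseudo-label generation of Equation (5), including temperature scaling and the confidence threshold \( \tau \), can be sketched as follows; the values of T and \( \tau \) here are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np

def pseudo_labels(logits, T=2.0, tau=0.8):
    """Generate pseudo-labels from a (batch, C) logit matrix.

    Softmax is computed on logits scaled by temperature T; samples whose
    scaled confidence falls below the threshold tau are discarded
    (marked with label -1). The arg max itself is unchanged by T.
    """
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)              # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    labels[conf < tau] = -1                           # discard low-confidence
    return labels, conf

# One sharply peaked prediction and one nearly uniform prediction.
labels, conf = pseudo_labels(np.array([[10.0, 0.0, 0.0],
                                       [1.0, 0.9, 0.8]]))
```

The confident first sample keeps its pseudo-label, while the ambiguous second sample is excluded from the unsupervised loss.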
Revising the initial minimization goal of Equation (3), the local objective of the \( \mathit {k} \)th client can thus be written as \( \mathcal {L}_{k}(\theta) = \mathcal {L}_{CE}\left(y, p_{\theta ^k}\left(y \mid x_l \right)\right) + \beta \, \mathcal {L}_{CE}\left(\widehat{y}, p_{\theta ^k}\left(x_u\right)\right) \).
4.3 Federated Self-training
The objective of federated self-training is to create a teacher model on each client to exploit labeled data resident on clients’ devices, which will be used to predict labels for the unlabeled instances available on the device. As both labeled and unlabeled on-device samples originate from the same data distribution, a student model can be constructed on each client device by jointly training on labeled and pseudo-labeled data; its weights are then returned to the server for aggregation. Under federated settings, however, a more careful analysis is required, as clients’ local labeled data can be limited and can have a highly skewed distribution. In such settings, teacher models may produce inaccurate pseudo-label predictions, and student classifiers can further amplify these mistakes during training by using faulty pseudo-labels. To ensure the proper construction of pseudo-labels and guarantee that the student model will learn properly from unlabeled data, the confidence of the predictions is taken into consideration when generating pseudo-labels to discard any low-confidence predictions.
Concisely, in the proposed FedSTAR approach, each client model assumes the dual role of teacher and student: it generates pseudo-labels for its locally stored unlabeled audio data and then trains jointly on labeled and pseudo-labeled samples, before communicating the updated weights to the server for aggregation into the global model, as illustrated in Figures 1 and 2.
Fig. 1. Illustration of FedSTAR for label-efficient learning of audio recognition models in a federated setting.
Fig. 2. On-device self-training based on pseudo-labeling in a federated setting for an audio recognition task, shown for illustration purposes.
Since \( N_{l} \ll N_{u} \) holds for all clients, given a sufficient number of participating rounds, unlabeled instances will be exposed to all the available labeled data. Additionally, we propose an adaptive confidence thresholding method to diminish unsatisfactory performance due to training on faulty pseudo-labels. In particular, in addition to using temperature scaling T to “soften” the softmax output and avoid over-confident predictions, we employ an increasing confidence threshold \( \tau \) to discard low-confidence pseudo-labels during training following a cosine schedule. Cosine learning rate schedulers rely on the observation that we might not want to decrease the learning rate too drastically in the beginning, while we might want to “refine” our solution in the end using a very small learning rate. Along the same lines, with our cosine confidence thresholding, we allow clients to explore the locally stored unlabeled data, \( D_{U}^{k} \), in the first few federated rounds, while considering only highly confident predictions in a later stage of the training procedure. While other methods could be explored for this purpose, such a study is outside the scope of the current work and we mainly focus on the cosine scheduler, which has proven to work well empirically across a variety of tasks [27]. Further details and an overview of our approach for the semi-supervised training procedure can be found in Algorithm 1.
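One plausible realization of such an increasing cosine schedule for \( \tau \) is shown below; the endpoint values tau_min and tau_max are illustrative assumptions, not the settings used in our experiments:

```python
import math

def cosine_threshold(r, R, tau_min=0.5, tau_max=0.9):
    """Confidence threshold tau at federated round r (0 <= r <= R).

    Starts near tau_min (clients explore unlabeled data early) and rises
    along a cosine curve to tau_max (only highly confident pseudo-labels
    are kept late in training). tau_min and tau_max are hypothetical.
    """
    cos_term = (1 + math.cos(math.pi * r / R)) / 2   # goes 1 -> 0 as r -> R
    return tau_max - (tau_max - tau_min) * cos_term

tau_start = cosine_threshold(0, 100)
tau_mid = cosine_threshold(50, 100)
tau_end = cosine_threshold(100, 100)
```

The schedule changes slowly near both endpoints and fastest in the middle of training, mirroring the shape of a cosine learning rate schedule in reverse.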

4.4 Self-supervised Pretraining Strategy
Self-supervised learning aims to learn useful representations from unlabeled data by tasking a model to solve an auxiliary task for which supervision can be acquired from the input itself. Given an unlabeled dataset \( D = \lbrace x\rbrace _{m=1}^M \) and a deep neural network \( f_{\theta }(.) \), the aim is to pre-train a model through solving a surrogate task, where labels y for the standard objective function (e.g., cross-entropy) are extracted automatically from x. The learned model is then utilized as a fixed feature extractor or as initialization for rapidly learning downstream tasks of interest. The fields of computer vision and natural language processing have seen tremendous progress in representation learning with deep networks in a self-learning manner, with no human intervention in the labeling process. Here, the prominent techniques for audio representation learning from unlabeled data include audio-visual synchronization [20], contrastive learning [35], and other auxiliary tasks [37].
In our work, we propose to leverage self-supervised pre-training on the server side to improve training convergence of
Formally, we pre-train our model with contrastive learning [35] using the FSD-50K [6] dataset. On a high level, the objective is to train a model to maximize the similarity between related audio segments while minimizing it for the rest. Similar samples are generated through stochastic sampling from the same audio clip, while other segments in a batch are treated as negatives. In particular, we use a bilinear similarity formulation and pre-train our model with a batch size of 1,024 for 500 epochs. Moreover, we utilize the network architecture described in Section 5.2 as an encoder with the addition of a dense layer containing 256 hidden units on top, which is discarded after the pre-training stage. In this way, by using the same architecture, we are able to draw sound conclusions about the effects of utilizing a pre-trained model as an initial global model and directly compare with the randomly initialized
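A minimal sketch of a bilinear-similarity contrastive objective with in-batch negatives might look as follows. The function and its NumPy formulation are illustrative, not the actual implementation of [35]:

```python
import numpy as np

def bilinear_contrastive_loss(anchors, positives, W):
    """InfoNCE-style loss with bilinear similarity sim(a, p) = a^T W p.

    anchors, positives: (batch, dim) embeddings of two segments drawn
    from the same audio clips; the remaining items in the batch act as
    negatives. W: (dim, dim) learned bilinear matrix.
    """
    logits = anchors @ W @ positives.T           # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs lie on the diagonal of the similarity matrix.
    return -np.mean(np.diag(log_probs))
```

The loss is small when each anchor is most similar to its own positive and large when a negative dominates, which is exactly the behavior the pre-training stage optimizes for.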
5 EXPERIMENTS
In this section, we conduct an extensive evaluation of our approach on publicly available datasets for various audio recognition tasks to determine the efficacy of
5.1 Datasets and Audio Pre-processing
We use publicly available datasets to evaluate our models on a range of audio recognition tasks. For all datasets, we use the suggested train/test split for comparability purposes. For ambient sound classification, we use the Ambient Acoustic Contexts dataset [33], in which sounds from 10 distinct events are present. For the keyword spotting task, we use the second version of the Speech Commands dataset [41], where the objective is to detect when a particular keyword is spoken out of a set of 12 target classes. Likewise, we use VoxForge [29] for the task of spoken language classification, which contains audio recordings in six languages: English, Spanish, French, German, Russian, and Italian. It is one of the largest available datasets for language identification problems, which makes it valuable for benchmarking the performance of the supervised FL model. We resampled the Ambient Acoustic Contexts samples from 48 kHz to 16 kHz to utilize the same sampling frequency across all our datasets. In Table 1, we present a description of each dataset.
5.2 Model Architecture and Optimization
The network architecture of our global model is inspired by Reference [37] with a key distinction that instead of batch normalization, we utilize group normalization [42] after each convolutional layer and employ a spatial dropout layer. We use log-Mel spectrograms as the model’s input, which we compute by applying a short-time Fourier transform on the one-second audio segment with a window size of 25 ms and a hop size equal to 10 ms to extract 64 Mel-spaced frequency bins for each window. To make an accurate prediction on an audio clip, we average over the predictions of non-overlapping segments of the entire audio clip. Our convolutional neural network architecture consists of four blocks. In each block, we perform two separate convolutions, one on the temporal and another on the frequency dimension, outputs of which we concatenate afterward to perform a joint \( 1 \times 1 \) convolution. Using this scheme, the model can capture fine-grained features from each dimension and discover high-level features from their shared output. Furthermore, we apply L2 regularization with a rate of 0.0001 in each convolution layer. Between blocks, we utilize max-pooling to reduce the time-frequency dimensions by a factor of two and use a spatial dropout rate of 0.1 to avoid over-fitting. We apply ReLU as a non-linear activation function and use the Adam optimizer with the default learning rate of 0.001 to minimize categorical cross-entropy.
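The clip-level prediction scheme described above, averaging over non-overlapping one-second segments, can be sketched as follows; `model_fn` is a hypothetical per-segment classifier standing in for our network:

```python
import numpy as np

def clip_prediction(model_fn, clip, segment_len=16000):
    """Score a full audio clip by averaging per-segment predictions
    over its non-overlapping one-second segments (16 kHz audio).

    model_fn: callable mapping a (segment_len,) waveform to a
    (n_classes,) probability vector.
    """
    n_seg = len(clip) // segment_len
    # Drop the trailing partial segment, then split into full segments.
    segments = clip[: n_seg * segment_len].reshape(n_seg, segment_len)
    probs = np.stack([model_fn(s) for s in segments])  # (n_seg, n_classes)
    return probs.mean(axis=0)
```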
To simulate a federated environment, we use the Flower framework [3] and utilize FedAvg [30] as the optimization algorithm to construct the global model from clients’ local updates. Additionally, a number of parameters were selected to fully control the federated settings of our self-training strategy. These parameters are: (1) N, the number of clients; (2) R, the number of rounds; (3) q, the clients’ participation percentage in each round; (4) E, the number of local train steps per round; (5) \( \sigma \), the data distribution variance across clients; (6) L, the percentage of the dataset used as labeled samples; (7) U, the percentage of the dataset used as unlabeled samples (excluding the L% of the data used as labeled); (8) \( \beta \), the influence of unlabeled data over the training process; (9) T, the temperature scaling parameter; and (10) \( \tau \), the prediction confidence threshold.
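FedAvg’s weighted aggregation step can be sketched as below; this is a generic NumPy illustration of the algorithm from [30], not Flower’s actual implementation:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: average the clients' model parameters,
    weighting each client by the size of its local dataset.

    client_weights: list of per-client parameter lists (one ndarray
    per layer); client_sizes: list of local dataset sizes.
    """
    total = sum(client_sizes)
    agg = [np.zeros_like(layer) for layer in client_weights[0]]
    for weights, n in zip(client_weights, client_sizes):
        for i, layer in enumerate(weights):
            agg[i] += (n / total) * layer
    return agg
```

The data-size weighting means clients holding more samples pull the global model further toward their local solution, which matters under the quantity skew we study.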
We employ uniform random sampling for the clients’ selection strategy, as other approaches for adequate client selection are outside the scope of the current work. Last, across all
Table 2. Primary Experiment Parameters
5.3 Baselines and Evaluation Strategy
In fully supervised federated experiments where the complete dataset is available, the labeled instances are randomly distributed across the available clients. Likewise, in experiments where the creation of a labeled subset from the original dataset is required (\( L\lt \)100%), we keep the dataset’s initial class distribution ratio to avoid tampering with dataset characteristics. Afterward, the labeled subset is again randomly distributed across the available clients. With the \( \sigma \) parameter set to 25% and a random partitioning of labeled samples among clients, the labeled data distribution resembles a non-i.i.d. one. Moreover, an increase in the number of available clients results in a highly skewed distribution. It is worth mentioning that even if the meaning of non-i.i.d. is generally straightforward, data can be non-i.i.d. in many ways. In our work, the term non-i.i.d. data distribution describes a distribution with both a label distribution skew and a quantity skew (data sample imbalance across clients). This type of data distribution is common across clients’ data in federated settings: each client frequently corresponds to a particular user (affecting the label distribution), and the application usage across clients can differ substantially (affecting the quantity distribution). For a concise taxonomy of non-i.i.d. data regimes, we refer our readers to Reference [17]. Additionally, in
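One way to simulate a quantity skew controlled by a variance parameter like \( \sigma \) is sketched below. This partitioning helper is an assumption for illustration, not the exact procedure used in our experiments:

```python
import numpy as np

def partition_with_quantity_skew(indices, n_clients, sigma=0.25, seed=0):
    """Split sample indices across clients with a quantity skew:
    each client's share is drawn from a normal distribution around
    the uniform share with relative std sigma, then normalized.
    """
    rng = np.random.default_rng(seed)
    shares = rng.normal(1.0, sigma, n_clients).clip(0.1)
    shares /= shares.sum()
    rng.shuffle(indices)
    # Convert cumulative shares into split points over the index array.
    cuts = (np.cumsum(shares)[:-1] * len(indices)).astype(int)
    return np.split(indices, cuts)
```

Higher `sigma` yields increasingly unbalanced client datasets, and because the shuffle ignores labels, small partitions also tend to miss classes, producing the combined label and quantity skew discussed above.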
To evaluate the
| Method | | SpeechCommands | AmbientContext | VoxForge |
|---|---|---|---|---|
| Centralized | | 96.54 | 73.03 | 79.60 |
| Federated | N = 5 | 96.93 | 71.88 | 79.13 |
| | N = 10 | 96.78 | 68.01 | 78.98 |
| | N = 15 | 96.33 | 66.86 | 76.09 |
| | N = 30 | 94.62 | 65.14 | 65.17 |
Table 3. Evaluation of Audio Recognition Models in Centralized and Fully supervised Federated Settings
5.4 Results
5.4.1 Comparison against Fully Supervised Federated Approach under non-i.i.d. Settings.
We first evaluate
Average accuracy over three distinct trials on test set. Detailed results are given in Table 9 of the Appendix. Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, E = 1, R = 100.
Table 4. Performance Evaluation of FedSTAR
In Table 4, we observe that
While varying L, we note that the percentage gap between
While N increases and the labeled subset of each client shrinks (and, hence, we obtain an even higher non-i.i.d. distribution), we notice that the
5.4.2 Effectiveness of FedSTAR across Diverse Federated Settings.
In this subsection, we assess the efficacy of
Varying participation rate: With the device heterogeneity and computational resources significantly varying across devices in a federated environment, a participation rate of 100% is probably an unrealistic assumption for most pragmatic FL applications [17]. As clients’ participation rate (q) can greatly influence the convergence rate of an FL model, we evaluate
Fig. 3. Evaluation of FedSTAR performance under varying clients’ participation rate. Federated parameters are set to \( \sigma \) = 25%, \( \beta =0.5 \) , R = 100, E = 1, and N = 15. Average accuracy over three distinct trials is reported.
Varying local train steps: Subsequently, we examine the effect of increasing the local train steps on the
Fig. 4. Evaluation of FedSTAR performance against local train steps size. Federated parameters are set to \( \sigma \) = 25%, \( \beta =0.5 \) , R = 50, q = 80%, and N = 15. Average accuracy over three distinct trials is reported.
Varying number of clients: The number of clients is an important factor in the FL procedure, as it can have a significant impact on the data distribution, which has been shown to affect the global model’s generalization [17, 45]. In particular, as additional clients are introduced to FL, the class distributions across clients can become highly skewed, since the data partitioning process is random. With this ablation study, we aim to answer whether
Fig. 5. Evaluation of FedSTAR performance under varying number of clients. Federated parameters are set to \( \sigma \) = 25%, \( \beta =0.5 \) , R = 100, q = 80%, and E = 1. Average accuracy over three distinct trials is reported.
Varying class distribution across clients: Apart from the number of clients, the preferences of each client can substantially affect the nature of clients’ data distribution. For example, in a music tagging scenario, the type and quantity of data residing on a device are directly correlated with both the user’s preference for a specific genre of music and the time the user dedicates to the application. Such challenges introduce a highly non-i.i.d. data distribution, both in terms of label distribution and quantity of data per client. Therefore, in this analysis, we aim to understand the effect of highly non-i.i.d. distributions, both in terms of labels and data quantity distributions, on the effectiveness of
From the results introduced in Table 5, we note that
Class distribution has mean \( \mu \) = 3 and variance \( \sigma _{c} \). Average accuracy over three distinct runs is reported on Speech Commands. Detailed results are given in Table 8 of the Appendix. Federated parameters are set to \( \beta =0.5 \), R = 100, N = 15, q = 80%, and E = 1.
Table 5. Performance Evaluation of Method against Variation of Class Availability across Clients
5.4.3 Assessment of Utilizing Self-supervised Learning for Model Pre-training to Improve Training Convergence of FedSTAR.
Our proposed self-training federated learning approach attains high performance on different audio recognition tasks by utilizing unlabeled data available on the clients’ end. However, in reality, a large volume of unlabeled instances from a different task or distribution might also be available on the centralized server. As servers often possess the computational power to effectively pre-train a model on a massive unlabeled dataset, a natural question arises: whether leveraging self-supervised learning to pre-train a model as initialization for
Fig. 6. Self-supervised learning improves training convergence in the federated setting. Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, E = 1. Average accuracy on the test set over three distinct trials is reported.
From Figures 6(a), (c), and (e), we note that the utilization of a pre-trained model leads to higher accuracy within 10 rounds in almost all cases, suggesting that it was able to produce more accurate pseudo-label predictions and accelerate the model’s convergence. In particular, for the Ambient Context dataset, where the amount of available labeled instances per client is tiny (approximately 13 labeled samples per client for L = 5% and N = 15), we observe a substantial difference between the pre-trained and randomly initialized
5.4.4 Effectiveness of FedSTAR under Varying Amount of Unlabeled Data.
As of now, we have assumed that unlabeled data are largely available across clients. However, it is intriguing to investigate the scenario where both the amount of labeled and unlabeled data varies. In this way, we can simulate two pragmatic scenarios: first, an abundant volume of unlabeled instances generated by clients’ devices (e.g., numerous IoT devices constantly monitoring the surrounding environment); and second, a relatively small amount of unlabeled audio samples being available (e.g., medical audio examples, where both obtaining and labeling data are expensive). In addition, the restriction on available unlabeled on-device data could originate from the often-limited storage capabilities of devices participating in distributed machine learning paradigms. Thus, we aim to understand the effect of unlabeled data availability on the
| | FedSTAR (Randomly Initialized) | | | | FedSTAR (SSL Pretrained) | | | |
|---|---|---|---|---|---|---|---|---|
| Labeled Percentage | U = 20% | U = 50% | U = 80% | U = 100% | U = 20% | U = 50% | U = 80% | U = 100% |
| L = 3% | 84.13 | 85.40 | 86.63 | 86.82 | 84.52 | 85.17 | 85.43 | 86.46 |
| L = 5% | 87.47 | 88.52 | 88.90 | 89.33 | 88.07 | 88.28 | 87.73 | 88.98 |
| L = 20% | 90.06 | 92.24 | 93.07 | 93.15 | 92.44 | 93.67 | 93.98 | 94.13 |
| L = 50% | 87.76 | 92.26 | 94.18 | 93.38 | 90.70 | 93.83 | 94.76 | 95.54 |
Average accuracy over three distinct runs is reported on Speech Commands. Detailed results are given in Table 7 of the Appendix. Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, R = 100, E = 1, N = 15.
Table 6. Performance Evaluation of FedSTAR when Varying Both Labeled and Unlabeled Datasets
As we see in Table 6, the availability of unlabeled data can affect the
6 CONCLUSIONS AND FUTURE WORK
We study the pragmatic problem of semi-supervised federated learning for audio recognition tasks. In the distributed scenario, well-annotated audio examples are scarce on clients’ devices due to the prohibitive cost of annotation, users’ limited incentive to label their data, and, notably for various important tasks, the missing domain knowledge required to perform the annotation appropriately. Conversely, large-scale unlabeled audio data are readily available on clients’ devices. To address the lack of labeled data for learning on-device models, we present a novel self-training strategy based on pseudo-labeling to exploit on-device unlabeled audio data and boost the generalization of models trained in federated settings. Despite its simplicity, we demonstrate that our approach,
Despite the wide applicability, as
In this work, we provided a federated self-training scheme to learn audio recognition models from a few on-device labeled audio examples. In the Internet of Things era, this approach could be employed in a variety of applications, such as home automation, autonomous driving, the healthcare domain, and smart wearable technologies. In particular, we believe that federated self-training is of immense value for learning generalizable audio models in settings where labeled data are challenging to acquire but unlabeled data are available in vast quantities. We hope that the presented perspective of federated self-training inspires the development of additional approaches, specifically those combining semi-supervised learning and federated learning in an asynchronous fashion. Likewise, combining federated self-training with appropriate client selection techniques is another crucial direction that could further improve the performance of deep models in federated learning scenarios. Finally, evaluation in a real-world setting (i.e., federated learning involving real devices) is of major importance to further understand the aspects that require improvement concerning statistical and system heterogeneities, energy, and labeled data requirements in the federated learning setting.
APPENDIX
Table 7. Performance Evaluation of FedSTAR when Varying Both Labeled and Unlabeled Datasets
Table 8. Performance Evaluation of Method against Variation of Class Availability Across Clients
Table 9. Performance Evaluation of FedSTAR
- [1] 2020. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In International Joint Conference on Neural Networks (IJCNN). 1–8.
- [2] 2019. MixMatch: A holistic approach to semi-supervised learning. arXiv:1905.02249 [cs.LG].
- [3] 2020. Flower: A friendly federated learning research framework. arXiv:2007.14390.
- [4] 2019. Contactless cardiac arrest detection using smart devices. NPJ Digit. Med. 2, 1 (2019), 1–8.
- [5] 2016. Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Trans. Intell. Transport. Syst. 17, 1 (2016), 279–288.
- [6] 2020. FSD50K: An open dataset of human-labeled sound events. arXiv:2010.00475.
- [7] 2021. End-to-end speech recognition from federated acoustic models. arXiv:2104.14297 [cs.SD].
- [8] 2004. Semi-supervised learning by entropy minimization. In 17th International Conference on Neural Information Processing Systems (NIPS’04). MIT Press, Cambridge, MA, 529–536.
- [9] 2020. Training keyword spotting models on non-IID data with federated learning. arXiv:2005.10406 [eess.AS].
- [10] 2018. Federated learning for mobile keyboard prediction. CoRR abs/1811.03604 (2018).
- [11] 2015. Distilling the knowledge in a neural network. arXiv:1503.02531 [stat.ML].
- [12] 2020. Federated learning of user authentication models. arXiv:2007.04618 [cs.LG].
- [13] 2019. Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. CoRR abs/1903.09296 (2019).
- [14] 2021. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-IID private data. arXiv:2008.06180 [cs.DC].
- [15] 2020. Federated semi-supervised learning with inter-client consistency. arXiv:2006.12097 [cs.LG].
- [16] 2020. Towards utilizing unlabeled data in federated learning: A survey and prospective. arXiv:2002.11545 [cs.LG].
- [17] 2021. Advances and open problems in federated learning. arXiv:1912.04977 [cs.LG].
- [18] 2019. ToyADMOS: A dataset of miniature-machine operating sounds for anomalous sound detection. arXiv:1908.03299 [eess.AS].
- [19] 2017. Federated learning: Strategies for improving communication efficiency. arXiv:1610.05492 [cs.LG].
- [20] 2018. Cooperative learning of audio and video models from self-supervised synchronization. In 32nd International Conference on Neural Information Processing Systems. 7774–7785.
- [21] 2016. Building machines that learn and think like people. CoRR abs/1604.00289 (2016).
- [22] 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3.
- [23] 2019. Federated learning for keyword spotting. arXiv:1810.05512 [eess.AS].
- [24] 2019. Federated learning: Challenges, methods, and future directions. CoRR abs/1908.07873 (2019).
- [25] 2020. Federated optimization in heterogeneous networks. arXiv:1812.06127 [cs.LG].
- [26] 2020. FedSemi: An adaptive federated semi-supervised learning framework. arXiv:2012.03292 [cs.LG].
- [27] 2016. SGDR: Stochastic gradient descent with restarts. CoRR abs/1608.03983 (2016).
- [28] 2018. Bat detective: Deep learning tools for bat acoustic signal detection. PLOS Comput. Biol. 14, 3 (2018), 1–19.
- [29] Ken MacLean. 2018. VoxForge. Retrieved from http://www.voxforge.org/home.
- [30] 2017. Communication-efficient learning of deep networks from decentralized data. arXiv:1602.05629 [cs.LG].
- [31] 2018. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. arXiv:1704.03976 [stat.ML].
- [32] 2019. Realistic evaluation of deep semi-supervised learning algorithms. arXiv:1804.09170 [cs.LG].
- [33] 2020. Augmenting conversational agents with ambient acoustic contexts. In 22nd International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI’20). ACM, New York, NY.
- [34] 2019. Federated learning for emoji prediction in a mobile keyboard. CoRR abs/1906.04329 (2019).
- [35] 2021. Contrastive learning of general-purpose audio representations. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3875–3879.
- [36] 2018. Automatic acoustic detection of birds through deep learning: The first bird audio detection challenge. CoRR abs/1807.05812 (2018).
- [37] 2019. Self-supervised audio representation learning for mobile devices. arXiv:1905.11796.
- [38] 2018. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv:1703.01780 [cs.NE].
- [39] 2020. Towards federated unsupervised representation learning. In Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking. ACM, 31–36.
- [40] 2019. A survey on semi-supervised learning. Mach. Learn. 109 (2019), 373–440.
- [41] 2018. Speech Commands: A dataset for limited-vocabulary speech recognition. CoRR abs/1804.03209 (2018).
- [42] 2018. Group normalization. arXiv:1803.08494 [cs.CV].
- [43] 2019. Self-training with noisy student improves ImageNet classification. CoRR abs/1911.04252 (2019).
- [44] 2018. Applied federated learning: Improving Google keyboard query suggestions. arXiv:1812.02903 [cs.LG].
- [45] 2018. Federated learning with non-IID data. arXiv:1806.00582 [cs.LG].
- [46] 2009. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 3, 1 (2009), 1–130.