skip to main content
research-article
Open Access

Federated Self-training for Semi-supervised Audio Recognition

Published:18 October 2022Publication History

Skip Abstract Section

Abstract

Federated Learning is a distributed machine learning paradigm dealing with decentralized and personal datasets. Since data reside on devices such as smartphones and virtual assistants, labeling is entrusted to the clients or labels are extracted in an automated way. Specifically, in the case of audio data, acquiring semantic annotations can be prohibitively expensive and time-consuming. As a result, an abundance of audio data remains unlabeled and unexploited on users’ devices. Most existing federated learning approaches focus on supervised learning without harnessing the unlabeled data. In this work, we study the problem of semi-supervised learning of audio models via self-training in conjunction with federated learning. We propose FedSTAR to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models. We further demonstrate that self-supervised pre-trained models can accelerate the training of on-device models, significantly improving convergence within fewer training rounds. We conduct experiments on diverse public audio classification datasets and investigate the performance of our models under varying percentages of labeled and unlabeled data. Notably, we show that with as little as 3% labeled data available, FedSTAR on average can improve the recognition rate by 13.28% compared to the fully supervised federated model.

Skip 1INTRODUCTION Section

1 INTRODUCTION

The emergence of smartphones, wearables, and modern Internet of Things (IoT) devices results in a massive amount of highly informative data generated continuously from a multitude of embedded sensors and logs of user interactions with various applications. The ubiquity of these contemporary devices and the exponential growth of the data produced on edge provides a unique opportunity to tackle critical problems in various domains, such as healthcare, well-being, manufacturing, and infrastructure monitoring. Notably, the advent of deep learning has enabled us to leverage these raw data directly for learning models while leaving ad hoc (hand-designed) approaches largely redundant. The improved schemes for learning deep networks and the availability of massive labeled datasets have brought tremendous advancements in several areas, including language modeling, audio understanding, object recognition, image synthesis, and more. Traditionally, developing machine learning models or performing analytics in a data center context requires the data from IoT devices to be pooled or aggregated in a centralized repository before processing it further for the desired objective. However, the rapidly increasing size of available data, in combination with the high communication costs and possible bandwidth limitations, render the accumulation of data in a cloud-based server unfeasible [24]. Additionally, such centralized data aggregation schemes could also be restricted by privacy issues and regulations (e.g., General Data Protection Regulation). Due to these factors and the growing computational and storage capabilities of distributed devices, it is appealing to leave the data decentralized and perform operations directly on the device that collects that data through primarily utilizing local resources. The rapidly evolving Federated Learning (FL) field is concerned with distributed training of machine learning models on the decentralized data residing on remote devices such as smartphones and wearables. The key idea behind FL is to bring the computation (or code) closer to where the data reside to harness data locality extensively. Specifically, in a federated setting, minimal updates to the models (e.g., parameters of a neural network) are performed entirely on-device and communicated to the central server, which aggregates these updates from all participating devices to produce a unified global model. Unlike the standard way of learning models, the salient differentiating factor is that the data never leaves the user’s device, which is an appealing property for privacy-sensitive data. This strategy has been applied on a wide range of tasks in recent years [10, 23, 34, 44]. Nevertheless, a common limitation of existing approaches is that they primarily focus on a supervised learning regime. The implicit assumption that the labeled data is widely available on the device, or it can be easily labeled through user interaction or programmatically, such as for keyword prediction or photo categorization, is in most pragmatic cases unrealistic.

In reality, on-device data is largely unlabeled and constantly expanding in size. It cannot be labeled to the same extent as standard datasets, which are annotated via crowd-sourcing or other means for training deep neural networks. Due to the prohibitive cost of annotation, users have little to no incentives, and notably for various important tasks, the domain knowledge missing to perform the annotation process appropriately leaves most of the data residing on devices to remain unlabeled. This is especially true when considering the utilization of audio data to perform various audio recognition tasks, which have recently attracted increasing interest from researchers. As a result, numerous audio recognition systems have been developed, such as for wildlife monitoring [28, 36] and surveillance [5]. In addition to monitoring applications, highly accurate acoustic models are utilized for keyword spotting for virtual assistants [23], anomaly detection for machine sounds [18], and in the development of health risk diagnosis systems, such as cardiac arrest detection [4]. However, in the majority of such applications, there is no straightforward manner for the annotation process. For instance, suppose that we have a sleep tracker application that assesses a person’s risk of obstructive sleep apnea using breathing and snoring sounds during sleep. In this case, the end-users may not be able to evaluate their sleeping sounds sufficiently, and clinicians may need to analyze and annotate the samples. Even in cases where no human expertise is required, like in a music tagging application, the correct labeling of songs requires effort on the user’s end. Additionally, there are cases where distributed devices host models with no human-in-the-loop to annotate the audio data, such as surveillance devices, making the labeling process infeasible. Thus, in many realistic scenarios for FL, local audio data will be primarily unlabeled. This leads to a novel FL problem, namely, semi-supervised federated learning, where users’ devices collectively hold a massive amount of unlabeled audio samples and only a fraction of labeled audio examples.

Semi-supervised learning techniques have been widely deployed in a centralized learning setting to utilize readily available unlabeled data and could also be applied in federated learning settings. In particular, with semi-supervision of models, available unlabeled data can be exploited during the training phase, improving the overall performance of the resulting model [40]. Pseudo-labeling is a widely applied semi-supervised learning method, which relies on the predictions of a model on unlabeled data, i.e., pseudo labels, to utilize unlabeled data during the learning phase [22]. With no structural requirements from the input modalities and tiny computational overhead, pseudo-labeling is an ideal candidate to be applied in federated learning settings, where device heterogeneity and computational resources vary across devices. To this end, we propose a federated self-training approach, named FedSTAR (Federated Self-Training for Audio Recognition), to unify semi-supervision with federated learning to leverage large-scale unlabeled audio data. With the exploitation of unannotated audio samples that reside on clients’ devices, we aim to improve the generalization of federated models on a wide range of audio recognition tasks under a pragmatic scenario, where scarcity of labels poses a significant challenge for learning useful models.

Apart from the labels’ deficiency, FL introduces other challenges of the system and statistical heterogeneity [17]. These challenges lead to device hardware and data collection diverseness that can significantly affect the number of devices participating in each federated round as well as the on-device data distribution. Several FL techniques provide flexibility in selecting a fraction of clients in each training round and address the non-i.i.d. nature of client’s data distributions, such as FedAvg [30] and FedProx [25]. The training convergence properties of such distributed optimization methods are discussed in Reference [17], where a clear reduction in the convergence rates is reported. In a centralized setting, self-supervised pre-training can improve the model’s convergence and generalization through leveraging pre-training on massive unlabeled datasets [35]. With self-supervised learning, the model is able to learn useful representations from unlabeled data; thus, when used for the downstream task, self-supervised model can significantly improve the training efficiency and predictive performance [35]. To address the issue of slow training convergence in federated settings, we propose the utilization of self-supervised pre-trained models as model initialization for the FL procedure as compared to the naive random initialization of model parameters. Through extensive evaluation, we demonstrate that the convergence rate of our proposed semi-supervised federated algorithm, i.e., FedSTAR, can be greatly improved by using a pre-trained model learned in a self-supervised manner.

To the best of our knowledge FedSTAR is the first FL approach that learns models for audio recognition tasks by utilizing not only labeled but also unlabeled samples on user devices while not being dependent on any data (labeled or unlabeled) on the server side. Just like the labeled samples, the on-device unlabeled samples are utilized locally by self-training based on our proposed pseudo-labeling with dynamic prediction confidence thresholding. As FedSTAR is not altering either the utilized model’s architecture or the global model averaging process, the underlined hardware requirements are similar to the chosen federated learning algorithm (e.g., FedAvg), while the on-device storage demand is unaffected, since FedSTAR essentially uses already stored unlabeled data that are left unexploited. In addition, with the utilization of unlabeled data, FedSTAR models are less sensitive to the non-i.i.d. nature of the labeled data across clients (label distribution skew and data sample imbalance across clients). As a result, it performs much better in typical non-i.i.d. data federated settings. Furthermore, solutions in the literature focus on randomly initialized models at the server side. We for the first time employ self-supervised pre-training on the server side using publicly available audio data to further improve the efficiency of training, which means fewer training rounds are needed for convergence.

The main contributions of this work are as follows:

  • We study on the practical problem of semi-supervised federated learning for audio recognition tasks to address the lack of labeled data that presents a major challenge for learning on-device models.

  • We design a simple yet effective approach based on self-training, called FedSTAR. It exploits large-scale unlabeled distributed data in a federated setting with the help of a novel adaptive confidence thresholding mechanism for effectively generating pseudo-labels.

  • We exploit self-supervised models pre-trained on FSD-50K corpus [6] for significantly improving training convergence in federated settings.

  • We demonstrate through extensive evaluation that our technique is able to effectively learn generalizable audio models under a variety of federated settings and label availability on diverse public datasets, namely, Speech Commands [41], Ambient Context [33], and VoxForge [29].

  • We show that FedSTAR, with as few as 3% labeled data, on average can improve recognition rate by 13.28% across all datasets compared to the fully supervised federated models.

The rest of the article is organized as follows: In Section 2, an overview of the related work is provided. Section 3 presents an overview of related paradigms and methodologies as background information, Section 4 introduces the proposed federated self-training approach for semi-supervised audio recognition. Section 5 presents an evaluation of FedSTAR on publicly available datasets. Section 6 concludes the article and lists future directions for research.

Skip 2RELATED WORK Section

2 RELATED WORK

Federated Learning. FL has been attracting growing attention, thanks to its unique characteristic of collaboratively training machine learning models without actually sharing local data and compromising users’ privacy [19]. The most popular and simplistic approach to learning models from decentralized data is the Federated Averaging (FedAvg) algorithm [30]. Specifically, FedAvg performs several local stochastic gradient descent (SGD) steps on a sampled subset of devices’ data in parallel and aggregates the locally learned model parameters on a central server to generate a unified global model through weighted averaging. This strategy has proved to work relatively well for a wide range of tasks in i.i.d. settings [23, 44]. At the same time, the performance can decrease substantially when FedAvg is exposed to non-i.i.d. data distribution [17, 45]. Authors in Reference [45] proposed globally sharing a portion of the dataset to improve FL performance under non-i.i.d settings. In addition to the challenge introduced by data distribution, communication efficiency is another critical problem in FL. The communication challenges could be alleviated by increasing the number of local SGD steps between sequential communication stages. However, with the increase of SGD steps, the device’s model may begin to diverge, and the aggregation of such models can affect the generalization of global models [25]. FedProx was proposed to tackle this issue by adding a loss term to restrict the local models’ updates to be closer to the existing global model [25]. Nevertheless, a typical limitation of existing work is the focus on a supervised learning regime with the implicit assumption that the local private data is fully labeled or could be labeled simplistically through labeling functions. However, in the majority of pragmatic scenarios, a straightforward annotation process is non-existent.

Recently, performing on-device federated training of acoustic models has attracted considerable attention [7, 9, 12, 23, 44]. In Reference [23], FL was employed for a keyword spotting task and the development of a wake-word detection system, whereas, References [7, 9] investigated the effect of non-i.i.d. distributions on the same task. In Reference [7], a highly skewed data distribution scenario was considered, where a large set of speakers used their devices to record a set of sentences. To address the challenges introduced due to the non-i.i.d. distribution of data, a word-error-rate model aggregation strategy was developed. In addition, a training scheme with a centralized model, pre-trained on a small portion of the dataset, was also examined. Furthermore, Reference [9] considered a scenario where devices might hold unlabeled audio samples and used a semi-supervised federated scheme based on a teacher-student architecture to exploit unlabeled audio data. However, the teacher model relied on additional high-quality labeled data for training in a centralized setting. Likewise, Reference [12] introduced a framework for privacy-preserving training of user authentication models with FL using labeled audio data. Nonetheless, all prior approaches consider only semantically annotated audio examples or require supplementary labeled data on the server-side to utilize the available unlabeled audio data that reside on devices. To address these problems, we propose a self-training approach to exploit unlabeled audio samples residing on clients’ devices. In addition, as servers often possess the computational resources to efficiently pre-train a model on a massive unlabeled dataset, we employ self-supervision to develop a model that can be used as a highly effective starting point for federated training instead of using randomly initialized weights.

Semi-Supervised Learning. In semi-supervised learning (SSL), we are provided with a dataset containing both labeled and unlabeled examples, where the labeled fraction is generally tiny compared to the unlabeled one and the curation of strong labels for the unlabeled dataset is impractical due to time constraints, cost, and privacy-related issues [46]. While there is a wide range of SSL methods and approaches that have been developed in the area of deep learning, we will mainly focus on the self-training or pseudo-labeling approach [22]. Self-training uses the prediction on unlabeled data to supervise the model’s training in combination with a small percentage of labeled data. Specifically, pseudo-labels are constructed by extracting one-hot labels from highly confident predictions on unlabeled data. These are then used as training targets in a supervised learning regime. This simplistic approach of utilizing unlabeled data has been combined with various methods to further improve the training efficiency. In Reference [1], authors demonstrated that setting a minimum number of labeled samples per training batch can be effective to reduce over-fitting due to noise accumulation on generated predictions. In addition, the use of a scalar temperature for scaling softmax output achieves a softer probability distribution over classes for the predictions and urges models to generate the correct pseudo-labels without suffering from over-confidence [11]. This temperature scaling approach can be highly beneficial in modern deep neural networks architectures, which have shown to suffer from over-confident predictions [11]. Supplementary, authors in [2] proposed MixMatch, which sharpens the prediction’s distribution to further improve the generated pseudo-labels predictions. The sharpening process is performed by averaging the predictions’ distribution of augmentation versions of the same unlabeled sample. Apart from self-training, alternative SSL approaches introduce a loss term, which is computed on unlabeled data, to encourage the model to generalize better to unseen data. Based on the objective of the loss term, we can classify these approaches in two categories: consistency regularization techniques—which are based on the principle that a classifier should produce the same class distribution for an unlabeled sample even after augmentation [31, 38]; and entropy minimization techniques—which aim to motivate the model to produce low-entropy (high-confident) predictions for all unlabeled data [8]. For a concise review and realistic evaluation of various deep learning based semi-supervised techniques, we refer interested readers to Reference [32].

A recent study [16] has questioned the soundness of the assumption that devices have well-annotated labels in a federated setting. Existing semi-supervised federated learning (SSFL) approaches, such as FedMatch [15] and FedSemi [26], have only recently started to be examined under the vision domain to exploit unlabeled data. FedMatch decomposes the parameters learned from labeled and unlabeled on-device data and uses an inter-client consistency loss to enforce consistency between the pseudo-labeling predictions made across multiple devices. In Reference [26], FedSemi adapts a mean teacher approach to harvest the unlabeled data and proposes an adaptive layer selection to reduce the communication cost during the training process. Apart from these methods, many studies consider different data distribution schemes, including sharing an unlabeled dataset across devices [14]. Last, it is important to note that recent works employ SSFL to address problems in healthcare domain, namely, electronic health records [13] and for problems like human activity recognition [39]. Nevertheless, none of the discussed approaches focuses on learning models for audio recognition tasks by utilizing devices’ unlabeled audio samples.

Skip 3BACKGROUND Section

3 BACKGROUND

In this section, we provide a brief overview of semi-supervised and federated learning paradigms, as they act as fundamental building blocks of our federated self-training approach for utilizing large-scale on-device unlabeled audio data in a federated setting.

3.1 Semi-supervised Learning

Given enough computational power and supervised data, deep neural networks have proven to achieve human-level performance on a wide variety of problems [21]. However, the curation of large-scale datasets is very costly and time-consuming, as it either requires crowd-sourcing or domain expertise, such as in the case of medical imaging. Likewise, for several practical problems, it is simply not possible to create a large enough labeled dataset (e.g., due to privacy issues) to learn a model of reasonable accuracy. In such cases, SSL algorithms offer a compelling alternative to fully supervised methods for jointly learning from the fraction of labeled and a large number of unlabeled instances.

Specifically, SSL aims to solve the problem of learning with partially labeled data where the ratio of unlabeled training examples is usually much larger than that of the labeled ones. Formally, let \( \mathcal {D}_{L} = \left\lbrace \left(x_{l_i},y_{i} \right) \right\rbrace _{i=1}^{N_{l}} \) represent a set of labeled data, where \( N_{l} \) is the number of labeled data, \( x_{l_i} \) is an input instance, \( y_{i}~\epsilon \left\lbrace 1, \ldots , \mathcal {C} \right\rbrace \) is the corresponding label, and \( \mathcal {C} \) is the number of label categories for the \( \mathcal {C} \)-way multi-class classification problem. Besides, we have a set of unlabeled samples denoted as \( \mathcal {D}_{U} = \left\lbrace x_{u_i} \right\rbrace _{i=1}^{N_{u}} \), where \( N_{u} \) is the number of unlabeled data. Let \( p_\theta \left(y \mid x \right) \) be a neural network that is parameterized by weights \( \theta \) that predicts softmax outputs \( \widehat{y} \) for a given input x. In the setting of semi-supervised learning, where in general \( N_{l} \ll N_{u} \), we need to simultaneously minimize losses on both labeled and unlabeled data to learn the model’s parameters \( \theta \). Specifically, our objective is to minimize the following loss function: (1) \( \begin{equation} \mathcal {L}_{\theta } = \mathcal {L}_{s_{\theta }}(\mathcal {D}_{L}) + \mathcal {L}_{u_{\theta }}(\mathcal {D}_{U}), \end{equation} \) where \( \mathcal {L}_{s_{\theta }}(\mathcal {D}_{L}) \) and \( \mathcal {L}_{u_{\theta }}(\mathcal {D}_{U}) \) are the loss terms from supervised and unsupervised learning, respectively.

The teacher-student self-training framework is a popular scheme to simultaneously learn from both labeled and unlabeled data. In this approach, we first use the available labeled data to train a good teacher model, which is then utilized to label any available unlabeled data. Consequently, both labeled and unlabeled data are used to jointly train a student model. In this way, the model assumes a dual role as a teacher and a student. In particular, as a student, it learns from the available data, while as a teacher, it generates targets to help the learning process of student. Since the model itself generates targets, they may very well be incorrect, thus, the learning experience of the student model depends solely on the ability of teacher model to generate high-quality targets [43].

3.2 Federated Learning

FL is a novel collaborative learning paradigm that aims to learn a single, global model from data stored on remote clients with no need to share their data with a central server. In particular, with the data residing on clients’ devices, a subset of clients is selected to perform a number of local SGD steps on their data in parallel in each communication round. Upon completion, clients exchange their models’ weights updates with the server, aiming to learn a unified global model by aggregating these updates. Formally, the goal of FL is typically to minimize the following objective function: (2) \( \begin{equation} \min _{\theta } \mathcal {L}_{\theta } = \sum _{k=1}^{K} \gamma _{k} {\mathcal {L}}_k(\theta), \end{equation} \) where \( \mathcal {L}_k \) is the minimization function of the kth client and \( \gamma _{k} \) corresponds to the relative impact of the kth client to the construction of the global model. For the FedAvg algorithm, parameter \( \gamma _{k} \) is equal to the ratio of client’s local data \( N_k \) over all training samples \( (\gamma _{k} = \frac{N_k}{N}) \).

Specifically, let \( \mathcal {D} = \left\lbrace \left(x_{l_i},y_{i} \right) \right\rbrace _{i=1}^{N} \) be a dataset of N labeled examples, similarly to the previously discussed dataset \( \mathcal {D}_{L} \) in Section 3.1. Given K clients, \( \mathcal {D} \) is decomposed into K sub-datasets \( \mathcal {D}^{k}=\left\lbrace \left(x_{l_i},y_{i} \right) \right\rbrace _{i=1}^{N_k} \) corresponding to each clients’ privately held data. For an initial global model G, the \( \mathit {r} \)th communication round starts with server randomly selecting a portion q (\( 0\lt q\le K \)) of clients to participate in the current training round. Afterwards, each client’s local model receives the global parameters \( \theta _r^G \) and performs supervised learning on their local dataset \( \mathcal {D}^{k} \) to minimize \( \mathcal {L}_{k}(\theta _r^k) \). Subsequently, G aggregates over locally updated parameters by performing \( \theta _{r+1}^G \leftarrow \sum \nolimits _{i=1}^{q} \frac{N_i}{N} \theta _r^i \). The presented circular training process, comprising model weights’ exchanges between server and clients, repeats until \( \theta ^G \) converges after R rounds.

Skip 4METHODOLOGY Section

4 METHODOLOGY

In this section, we present our federated self-training learning approach, namely, FedSTAR, for audio recognition tasks. First, we provide a formal overview of the problem, which FedSTAR aims to solve. Next, we discuss the proposed self-training technique (i.e., pseudo-labeling with dynamic prediction confidence thresholding) in detail, followed by the presentation of our FedSTAR algorithm. Finally, we provide a thorough description of the self-supervised pre-training technique used to train a model as an initialization point for the FedSTAR approach.

4.1 Problem Formulation

We focus on the problem of SSFL, where labeled data are scarce across users’ devices. At the same time, clients collectively hold a massive amount of unlabeled audio data. In addition, in a typical federated learning setting, the on-device data distribution depends on the profile of the users operating the devices. Thus, it is a common scenario for both labeled and unlabeled data to originate from the same data distribution. Based on the aforementioned assumption, with FedSTAR, we aim to utilize the available unlabeled data on clients and further improve the generalization of FL models, alleviating the need for clients to hold well-annotated data. In this way, we substantially decouple the amount of available labeled from the predictive power of acoustic models trained under federated settings.

Formally, under the setting of SSFL, each of the K clients holds a labeled set, \( \mathcal {D}_{L}^{k} = \lbrace \left(x_{l_i},y_{i} \right) \rbrace _{i=1}^{N_{l,k}} \) and an unlabeled set \( \mathcal {D}_{U}^{k} = \lbrace x_{u_i}\rbrace _{i=1}^{N_{u,k}} \), where \( N_{k} = N_{l,k} + N_{u,k} \) is the total number of data samples stored on the \( \mathit {k} \)th client and \( N_{l,k} \ll N_{u,k} \). We desire to learn a global unified model G without clients sharing any of their local data, \( \mathcal {D}_L^k \) and \( \mathcal {D}_U^k \). To this end, our objective is to simultaneously minimize both supervised and unsupervised learning losses during each client’s local training step on the \( \mathit {r} \)th round of the FL algorithm. Specifically, the minimization function, similar to the one presented in Equation (2), is: (3) \( \begin{equation} \min _{\theta } {\mathcal {L}}_{\theta } = \sum _{k=1}^{K} \gamma _{k} {\mathcal {L}}_k(\theta) \textrm { where } \mathcal {L}_{k}(\theta) =\mathcal {L}_{s_{\theta }}(\mathcal {D}_{L}^{k}) + \beta \mathcal {L}_{u_{\theta }}(\mathcal {D}_{U}^{k}). \end{equation} \) Here, \( \mathcal {L}_{s}(\mathcal {D}_{L}^{k}) \) is the loss terms from supervised learning on the labeled data held by the kth client, and \( \mathcal {L}_{u}(\mathcal {D}_{U}^{k}) \) represents the loss term from unsupervised learning on the unlabeled data of the same client. We add the parameter \( \beta \) to control the effect of unlabeled data on the training procedure, while \( \gamma _{k} \) is the relative impact of the kth client on the construction of the global model G.

4.2 Self-training with Pseudo Labeling

Self-training via pseudo-labeling has been widely used in semi-supervised learning [40]. The objective of highly effective teacher-student self-training approaches is to train a teacher model, which supervises the learning process of a student model that learns from labeled and unlabeled data jointly. First, a teacher model is built with the available labeled data and afterwards this is exploited to make predictions for the unlabeled samples. Subsequently, the student model is trained on both labeled and predicted samples. We propose a self-training technique with a dynamic prediction confidence threshold to learn from the unlabeled audio data residing on the client’s device, thus boosting the performance of models trained in federated settings with varying percentages of labeled examples. For audio classification tasks, to learn from the labeled datasets \( \mathcal {D}_{L}^k \) across all participating clients, we apply cross-entropy loss as follows: (4) \( \begin{equation} \begin{aligned}\mathcal {L}_{s}(\mathcal {D}_{L}^{k}) = - \frac{1}{N_{l,k}} \sum \limits _{i=1}^{N_{l,k}}\sum \limits _{j=1}^{C} y_{i}^{j} \log (\mathit {f}_{i}^{\theta ^{k}}(x_{l_j})) = \mathcal {L}_{CE}\left(y,p_{\theta ^k}\left(y\mid x_l \right) \right) \end{aligned} . \end{equation} \)

Next, to learn from unlabeled data, we generate pseudo-labels \( \widehat{y} \) for all available unlabeled data \( x_{u} \) on client \( \mathit {k} \) by performing: (5) \( \begin{equation} \begin{aligned}\widehat{y} = \phi \left(z,T \right) = \mathop{\text{arg max}}\limits_{i \in \left\lbrace 1, \ldots , \mathcal {C} \right\rbrace } \left(\frac{e^{z_i/T}}{\sum _{j=1}^{\mathcal {C}} e^{z_j/T}} \right) \end{aligned} , \end{equation} \) where \( z_{i} \) are the logits produced for the input sample \( x_{u_i} \) by the kth client model \( p_{\theta ^k} \) before the softmax layer. In essence, \( \phi \) produces categorical labels for the given “soften” softmax values, in which temperature scaling is applied with a constant scalar temperature T. As the maximum of the softmax function remains unaltered, the predicted pseudo-label \( \widehat{y} \) is identical as if the original prediction (without scaling) for an unlabeled sample \( x_u \) was used; however, the prediction confidence is weakened. A dynamic threshold \( \tau \) of confidence is proposed following a cosine schedule to discard low-confidence predictions when generating pseudo-labels. For the obtained pseudo-labels, we then perform standard cross-entropy minimization while using \( \widehat{y} \) as targets in the following manner: (6) \( \begin{equation} \begin{aligned}\mathcal {L}_{u}(\mathcal {D}_{U}^{k}) &= - \frac{1}{N_{u,k}} \sum \limits _{i=1}^{N_{u,k}}\sum \limits _{j=1}^{C} \widehat{y_{i}}^{j} \log (\mathit {f}_{i}^{\theta ^{k}}(x_{u_j})) = \mathcal {L}_{CE}\left(\widehat{y},p_{\theta ^k}\left(x_u\right) \right)\!. \end{aligned} \end{equation} \)

Revising the initial minimization goal of FedSTAR expressed in Equation (3), we can now represent local models’ loss function on the \( \mathit {r} \)th round of the FL algorithm for the \( \mathit {k} \)th client as: (7) \( \begin{equation} \begin{aligned}\mathcal {L}_{k}(\theta ^k) = \mathcal {L}_{CE} \left(y,p_{\theta ^k}\left(y\mid x_l\right) \right) + \beta \mathcal {L}_{CE} \left(\widehat{y},p_{\theta ^k}\left(x_u\right) \right)\!. \end{aligned} \end{equation} \)

4.3 Federated Self-training

The objective of federated self-training is to create a teacher model on each client to exploit labeled data resident on clients’ devices, which will be used to predict labels for the unlabeled instances available in the device. As both labeled and unlabeled on-device samples originate from the same data distribution, a student model can be constructed on each client device by collectively training on labeled and pseudo-labeled data, the weights of whom will be returned to the server for aggregation. Under federated settings, however, a more complicated analysis is required, as clients’ local labeled data can be limited and can have a highly skewed distribution. In such settings, teacher models may produce inaccurate pseudo-label predictions, and student classifiers potentially amplify the mistakes further during training through using faulty pseudo-labels. To ensure the proper construction of pseudo-labels and guarantee that the student model will learn properly from unlabeled data, the confidence of the predictions is taken into consideration when generating pseudo-labels to discard any low-confidence predictions.

Concisely, in the proposed FedSTAR algorithm, the clients’ local update step is altered to learn from unlabeled datasets \( \mathcal {D}_{U}^{k} \). As can be seen in Figure 2, a representative round \( \mathit {r} \) of FedSTAR starts with the distribution of global models’ weights \( \theta ^G \) to a randomly selected subset of q clients. On each client, equally sized batches l and u from the labeled and unlabeled sets are created, respectively. The model’s weights update is performed, as in Equation (7). First, the classical supervised categorical cross-entropy loss is minimized for batch l, as in Equation (4), to construct a teacher model, and afterwards, with this model pseudo-labels are produced using \( \Phi (\cdot) \). With the creation of pseudo-labels \( \widehat{y} \), the unlabeled batch u is then treated as a labeled batch \( u^{^{\prime }} =\left\lbrace \left(u,\widehat{y} \right) \right\rbrace \), in which the client’s model is further trained with standard cross-entropy minimization. It is important to note that we simultaneously optimize the cross-entropy loss on both l and u subsets by computing both losses before performing backpropagation to update the local models’ parameters. Last, the locally updated weights from all participating clients in the \( \mathit {r} \)th round are sent back to the server, where the global model weights are calculated as a weighted average over all the local weights updates.

Fig. 1.

Fig. 1. Illustration of FedSTAR for label-efficient learning of audio recognition models in a federated setting.

Fig. 2.

Fig. 2. On-device self-training based on pseudo-labeling in a federated setting for an audio recognition task shown for illustration proposes.

Since \( N_{l} \ll N_{u} \) holds for all clients, given a sufficient number of participating rounds, unlabeled instances will be exposed to all the available labeled data. Additionally, we propose an adaptive confidence thresholding method to diminish unsatisfactory performance due to training on faulty pseudo-labels. In particular, in addition to using temperature scaling T to “soften” softmax output and generated confident predictions, we employ an increasing confidence threshold \( \tau \) to discard low-confidence pseudo-labels during training following a cosine schedule. Cosine learning rate schedulers rely on the observation that we might not want to decrease the learning rate too drastically in the beginning, while we might want to “refine” our solution in the end using a very small learning rate. Along the same lines, with our cosine confidence thresholding, we allow clients to explore the locally stored unlabeled data, \( D_{U}^{k} \), in the first few federated rounds, while considering only highly confident predictions in a later stage of the training procedure. While other methods could be explored for this purpose, such a study is outside the scope of the current work and we mainly focus on cosine scheduler, which has proven to work well empirically across a variety of tasks [27]. Further details and an overview of our approach for the semi-supervised training procedure can be found in Algorithm 1.

4.4 Self-supervised Pretraining Strategy

Self-supervised learning aims to learn useful representations from unlabeled data by tasking a model to solve an auxiliary task for which supervision can be acquired from the input itself. Given an unlabeled data \( D = \lbrace x\rbrace _{m=1}^M \) and deep neural network \( f_{\theta }(.) \), the aim is to pre-train a model through solving a surrogate task, where labels y for the standard objective function (e.g., cross-entropy) are extracted automatically from x. The learned model is then utilized as a fixed feature extractor or as initialization for rapidly learning downstream tasks of interest. The fields of computer vision and natural language processing have seen tremendous progress in representation learning with deep networks in a self-learning manner, with no human intervention in the labeling process. Here, the prominent techniques for audio representation learning from unlabeled data include and audio-visual synchronization [20], contrastive learning [35], and other auxiliary tasks [37].

In our work, we propose to leverage self-supervised pre-training on the server side to improve training convergence of FedSTAR on client devices. Motivated by the fact that the server can often hold a large amount of unlabeled data and has enormous computational resources available, we employ contrastive learning for audio to develop a model that can be used as an effective starting point for federated learning instead of using randomly initialized weights. Specifically, pre-training is performed in a centralized setting with a separate publicly available dataset on the server side; thus, it can be done once and be used repeatedly for different downstream tasks. To the best of our knowledge, this is the first time self-supervised learning has been used to address the convergence of federated models with fewer training rounds efficiently.

Formally, we pre-train our model with contrastive learning [35] using FSD-50K [6] dataset. On a high level, the objective is to train a model to maximize the similarity between related audio segments while minimizing it for the rest. Similar samples are generated through stochastic sampling from the same audio clip, while other segments in a batch are treated as negatives. In particular, we use bilinear similarity formulation and pre-train our model with a batch size of 1,024 for 500 epochs. Moreover, we utilize a network architecture, as described in Section 5.2 as an encoder with the addition of a dense layer containing 256 hidden units on top, which is discarded after the pre-training stage. In this way by using a same architecture, we are able to draw proper conclusions for the effects of utilizing a pre-trained model as an initial global model and directly compare with the randomly initialized FedSTAR approach.

Skip 5EXPERIMENTS Section

5 EXPERIMENTS

In this section, we conduct an extensive evaluation of our approach on publicly available datasets for various audio recognition tasks to determine the efficacy of FedSTAR in learning generalizable models under a variety of federated settings and label availability. First, the federated learning framework and datasets utilized for validation are presented, followed by a detailed description of the network architecture. Next, we introduce our experiments in centralized and fully supervised federated settings, which serve as a baseline for evaluating our approach. Finally, we provide a thorough evaluation of FedSTAR , which is structured in the form of several research questions.

5.1 Datasets and Audio Pre-processing

We use publicly available datasets to evaluate our models on a range of audio recognition tasks. For all datasets, we use the suggested train/test split for comparability purposes. For ambient sound classification, we use the Ambient Acoustic Contexts dataset [33], in which sounds from 10 distinct events are present. For the keyword spotting task, we use the second version of the Speech Commands dataset [41], where the objective is to detect when a particular keyword is spoken out of a set of 12 target classes. Likewise, we use VoxForge [29] for the task of spoken language classification, which contains audio recordings in six languages—English, Spanish, French, German, Russian, and Italian. It is one of the largest available datasets for language identification problems; it is valuable for benchmarking the performance of the supervised FL model. We resampled the Ambient Acoustic Contexts samples from 48 kHz to 16 kHz to utilize the same sampling frequency across all our datasets samples. In Table 1, we present a description of each dataset.

Table 1.
DatasetTaskClasses
Ambient Context [33]Event classification10
Speech Commands [41]Keyword spotting12
VoxForge [29]Language identification6

Table 1. Key Details of the Datasets Used in Evaluation

5.2 Model Architecture and Optimization

The network architecture of our global model is inspired by Reference [37] with a key distinction that instead of batch normalization, we utilize group normalization [42] after each convolutional layer and employ a spatial dropout layer. We use log-Mel spectrograms as the model’s input, which we compute by applying a short-time Fourier transform on the one-second audio segment with a window size of 25 ms and a hop size equal to 10 ms to extract 64 Mel-spaced frequency bins for each window. To make an accurate prediction on an audio clip, we average over the predictions of non-overlapping segments of an entire audio clip. Our convolutional neural network architecture consists of four blocks. In each block, we perform two separate convolutions, one on the temporal and another on the frequency dimension, outputs of which we concatenate afterward to perform a joint \( 1 \times 1 \) convolution. Using this scheme, the model can capture fine-grained features from each dimension and discover high-level features from their shared output. Furthermore, we apply L2 regularization with a rate of 0.0001 in each convolution layer and group normalization [42] after each layer. Between blocks, we utilize max-pooling to reduce the time-frequency dimensions by a factor of two and use a spatial dropout rate of 0.1 to avoid over-fitting. We apply ReLU as a non-linear activation function and use Adam optimizer with the default learning rate of 0.001 to minimize categorical cross-entropy.

To simulate a federated environment, we use the Flower framework [3] and utilize FedAvg [30] as an optimization algorithm to construct the global model from clients’ local updates. Additionally, a number of parameters were selected to control the federated settings of our self-training strategy fully. Those parameters are: (1) N - number of clients, (2) R - number of rounds, (3) q - clients’ participation percentage in each round, (4) E - number of local train steps per round, (5) \( \sigma \) - data distribution variance across clients, (6) L - dataset’s percentage to be used as labeled samples, (7) U - dataset’s percentage to be used as unlabeled samples (excluding L% of the data used as labeled), (8) \( \beta \) - influence of unlabeled data over training process, (9) T - temperature scaling parameter, and (10) \( \tau \) - predictions confidence threshold.

We employ uniform random sampling for the clients’ selection strategy, as other approaches for adequate clients election are outside the current work scope. Last, across all FedSTAR experiments, we fixed the temperature scaling parameter \( T=4 \), while we set the confidence threshold \( \tau \) to initialize from 0.5 and gradually increase to a maximum of 0.9 during training, following a cosine schedule. A description of the parameters used is presented in Table 2.

Table 2.
Parameter NameVariableRange
Number of ClientsN5–30
Number of Federated RoundsR1–100
Number of Local Train StepsE1–4
Clients’ Participation Percentageq20% to 80%
Data Distribution Variance across Clients\( \sigma \)0% to 50%
Dataset’s Labeled PercentageL3% to 100%
Dataset’s Utilized Unlabeled PercentageU20% to 100%
Unlabeled Data Influence on Train Step\( \beta \)50%
Temperature ScalingT4
Confidence Threshold Percentage\( \tau \)50% to 90%

Table 2. Primary Experiment Parameters

5.3 Baselines and Evaluation Strategy

In fully supervised federated experiments where the complete dataset is available, the labeled instances are randomly distributed across the available clients. Likewise, in experiments where the creation of a labeled subset from the original dataset is required (\( L\lt \)100%), we keep the dataset’s initial class distribution ratio to avoid tempering with dataset characteristics. Afterward, the labeled subset is again randomly distributed across the available clients. With the \( \sigma \) parameter set to 25% and a random partitioning of labeled samples among clients, the labeled data distribution resembles a non-i.i.d. one. In contrast, an increase of available clients results in a highly skewed distribution. It is worth mentioning that even if the meaning of non-i.i.d. is generally straightforward, data can be non-i.i.d in many ways. In our work, the term non-i.i.d data distribution describes a distribution with both a label distribution skew and a quantity skew (data samples imbalance across clients). This type of data distribution is common across clients’ data in federated settings. Each client frequently corresponds to a particular user (affecting the label distribution), and the application usage across clients can differ substantially (affecting the label distribution). For a concise taxonomy of non-i.i.d. data regimes, we refer our readers to Reference [17]. Additionally, in FedSTAR , the unlabeled subset consists of the dataset’s remaining samples after extracting the provided labels. In such experiments, both the labeled and unlabeled subsets are dispensed at random over the available clients. Furthermore, for an accurate comparison between our experiments, we manage any randomness during the data partitioning and training procedures by passing a seed alongside the parameters presented in Table 2. In this way, we control the amount of data and the data instances that reside on each simulated client. Last, for a more rigorous evaluation, we perform three distinct trials (or runs, i.e., training a model from scratch) in each setting, and the average accuracy over all three runs is reported across the results of Sections 5.3 and 5.4.

To evaluate the FedSTAR, we first need to construct a high-quality supervised baseline both in centralized and federated environments. Therefore, we perform preliminary experiments in both centralized as well as fully supervised federated settings. We conduct initial experiments on all datasets in centralized settings where the models are trained until convergence to obtain the resulting accuracy on a test set, which is presented in the centralized row of Table 3. Following, we examine our model’s performance in federated settings by adjusting the FL parameters to R = 100, q = 80%, E = 1, and \( \sigma \) = 25%. We vary the number of clients (N) while keeping the remaining federated parameters to the same as the earlier mentioned values, as N frequently fluctuates in real-life FL scenarios. Thus, we can explore how the federated model behaves as clients progressively increase and the available local data become yet more distributed, affecting the performance of FL [45]. The results for supervised FL are presented in the Federated row of Table 3. We note from results presented in Table 3 that the supervised federated models achieve comparable results in various cases to the models trained in a centralized setting across all three datasets. Moreover, the number of clients (N) has a clear effect on the model’s performance. With an increase in N, we notice that the training process requires more training rounds (R) to converge as the quantity of local data for each client decreases. The obtained accuracy for a constant number of rounds deteriorates.

Table 3.
MethodSpeechCommandsAmbientContextVoxForge
Centralized96.5473.0379.60
FederatedN = 596.9371.8879.13
N = 1096.7868.0178.98
N = 1596.3366.8676.09
N = 3094.6265.1465.17
  • Average accuracy on test set over three distinct trials. Federated parameters are set to q = 80%, \( \sigma \) = 25%, L = 100%, E = 1, R = 100.

Table 3. Evaluation of Audio Recognition Models in Centralized and Fully supervised Federated Settings

  • Average accuracy on test set over three distinct trials. Federated parameters are set to q = 80%, \( \sigma \) = 25%, L = 100%, E = 1, R = 100.

5.4 Results

5.4.1 Comparison against Fully Supervised Federated Approach under non-i.i.d. Settings.

We first evaluate FedSTAR to determine the obtained improvements versus a fully supervised federated approach when a non-i.i.d. distribution is considered. This analysis helps in understanding whether utilizing unlabeled instances that reside on clients’ devices with FedSTAR can be beneficial for a model trained in federated settings and, if so, to which degree it improves the recognition rate. To this end, we perform experiments on all three datasets for a diverse number of clients (N) where the percentage of available labeled instances is varied from \( 3\% \) up to \( 50\% \). To clearly illustrate the performance gain of FedSTAR in comparison to the supervised FL regime, experiments with identically labeled subsets are conducted under fully supervised FL, where the unlabeled instances remained unexploited. Table 4 provides the accuracy scores on test sets averaged across three independent runs for the considered datasets to be robust against differences in weight initialization and optimization. For ease of comparison, we add the results column on fully supervised FL using entire labeled dataset (L = 100%) in Table 4, as discussed earlier in Section 5.3.

Table 4.
DatasetClientsSupervised (Federated)FedSTAR
\( L=3\% \)\( L=5\% \)\( L=20\% \)\( L=50\% \)\( L=100\% \)\( L=3\% \)\( L=5\% \)\( L=20\% \)\( L=50\% \)
Ambient Context546.3447.8961.4065.8571.8848.6854.9564.3767.04
Speech Commands81.1287.9792.3594.6696.9387.4190.0194.1794.85
VoxForge54.5556.4161.6570.3779.1363.9267.8069.0967.08
Ambient Context1035.2941.3151.7162.6968.0148.8752.3762.9464.42
Speech Commands67.7583.8092.1294.0296.7886.8290.3394.0994.18
VoxForge56.1454.7360.4862.4178.9859.8764.3569.3863.27
Ambient Context1533.0342.7553.3759.9766.8649.5454.7163.4662.41
Speech Commands62.9872.8492.1493.1496.3386.8289.3393.1693.39
VoxForge54.2654.3757.1160.2976.0955.8257.9667.6661.66
Ambient Context3032.3140.1747.0555.8565.1440.8446.5860.2156.19
Speech Commands33.7844.2184.9492.2194.6283.8888.1992.9292.62
VoxForge50.3254.3355.1957.5665.1754.8156.1863.8356.66
  • Average accuracy over three distinct trials on test set. Detailed results are given in Table 9 of the Appendix. Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, E = 1, R = 100.

Table 4. Performance Evaluation of FedSTAR

  • Average accuracy over three distinct trials on test set. Detailed results are given in Table 9 of the Appendix. Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, E = 1, R = 100.

In Table 4, we observe that FedSTAR can utilize unlabeled audio data to improve the model’s performance across all datasets significantly. Consequently, we can conclude that FedSTAR can be applied in a federated environment with scarce labeled audio instances to boost the performance by learning from unlabeled data, independent of the audio recognition task. In particular, comparing the two rows for L=3%, we note an increase of 13.28% in accuracy on average when using FedSTAR across the considered tasks.

While varying L, we note that the percentage gab between FedSTAR and the supervised federated counterparts shrinks as more labeled data are available across devices. In addition, with only 5% of labels available, we note that FedSTAR model’s accuracy is within a reasonable range from the ones trained under fully supervised federated settings, where the complete dataset is available (L = 100%). These two observations suggest that FedSTAR can be especially useful under extreme label scarcity scenarios, where a highly accurate model can be obtained through the exploitation of unlabeled data. Alternatively, FedSTAR could also be used in cases where sufficient labels are provided (L = 50%) to slightly improve the resulting models’ performance. An exception to the aforementioned behavior is the case of FedSTAR experiments on VoxForge with L = 50%, where we can see that the accuracy obtained after R = 100 rounds is inferior to the one achieved with identical federated settings and L = 20%. This might be because the unlabeled subset was not yet exposed to all available labeled examples; thus, the model had not reached the learning plateau and may require more training rounds to converge. Despite this behavior, the performance of FedSTAR models is superior to the supervised federated models with the same amount of labeled data and reasonably close to supervised federated models trained on the entire labeled data.

While N increases and the labeled subset of each client shrinks (and, hence, we obtain an even higher non-i.i.d. distribution), we notice that the FedSTAR models’ accuracy remains relatively unaffected, especially if we recollect that in FL experiments of Table 3 we noticed accuracy decays for a constant R as N rises. In particular, when N = 30 and the data distribution becomes highly skewed (or non-i.i.d), we note a performance gap between FedSTAR and supervised FL that can reach up to 50% for the Speech Commands dataset. It means that the FedSTAR model can effectively utilize unlabeled audio data, even in highly distributed scenarios. This essentially means that the exploitation of large-scale unlabeled on-device instances can help create a more uniform distribution across devices and tackle the challenges introduced in federated settings from the “non-i.i.d.-ness” of data.

5.4.2 Effectiveness of FedSTAR across Diverse Federated Settings.

In this subsection, we assess the efficacy of FedSTAR across a variety of federated settings. As presented in Section 5.4.1, the performance improvement by utilizing FedSTAR in comparison to the supervised FL scheme can vary across different federated settings. As federated settings’ variability is a primary characteristic of a distributed environment, it is essential for our approach to be effective in distinct scenarios. To this end, we conduct further experiments on the Speech Command dataset, where the participation rate (q), the number of clients (N), the local train step (E), and the data distribution across clients (\( \sigma \)) are varied. We choose to investigate these four parameters, as they primarily vary in a real-life FL setting, and they can have a significant effect on model performance [17, 45].

Varying participation rate: With the device heterogeneity and computational resources significantly varying across devices in a federated environment, a participation rate of 100% is probably an unrealistic assumption for most pragmatic FL applications [17]. As clients’ participation rate (q) can greatly influence the convergence rate of an FL model, we evaluate FedSTAR performance while varying the participation rate in each federated round. Therefore, this assessment helps us in understanding whether FedSTAR can retain the same level of effectiveness under low levels of clients participation. To this end, we conduct experiments with N = 15 for various clients’ participation (q) rates, starting from 20% up to 80%, under different percentages of labels availability on the Speech Commands dataset. Figure 3 provides obtained accuracy score on the test set; we observe that FedSTAR is able to effectively learn from the unlabeled instances residing on clients’ devices under low levels of clients engagement, even if the available labeled samples are scarce. While there is a decrease in FedSTAR model’s accuracy when the participation rate reduces, the reduction is no more than 2% for a given L. In particular, the reduction is eliminated when additional labeled instances are available.

Fig. 3.

Fig. 3. Evaluation of FedSTAR performance under varying clients’ participation rate. Federated parameters are set to \( \sigma \) = 25%, \( \beta =0.5 \) , R = 100, E = 1, and N = 15. Average accuracy over three distinct trials is reported.

Varying local train steps: Subsequently, we examine the effect of increasing the local train steps on the FedSTAR performance. As shown in Reference [25], a reduction in the communication costs can be achieved by increasing E at the expense of local models convergence, which can substantially affect the aggregation process. Thus, with this analysis, we aim to understand whether FedSTAR models can retain their convergence rate when multiple local train steps are performed across clients’ data to reduce the communication costs. To this end, we perform experiments with various labeled percentages for 50 federated rounds (R = 50) and N = 15, while varying E from 1 to 4. From the results shown in Figure 4, we note that FedSTAR can effectively utilize the unlabeled instances, when the available labeled subset exceeds 3%, to avoid possible local models’ divergence, resulting in a highly accurate aggregated (or global) model. However, for L = 3%, we notice a declining trend as E increases, which could be originated from two reasons. Since the labeled data are scant for L = 3%, the downwards trend on FedSTAR performance could be caused due to over-fitting, as the local models are extensively trained on a tiny labeled subset, when E increases. In addition, the absence of such a trend in higher label availability rates suggests that a sufficient amount of labeled data might be required for FedSTAR local models to converge. As FedSTAR uses unlabeled instances predictions to retrain the local models further, any faulty pseudo-labeled samples participating in the retraining step will increasingly intensify local models’ divergence as E rises. Besides providing additional labels to avoid local models divergence, this behavior could be regulated by adjusting the confidence threshold of the predictions, \( \tau \), to a higher value so any initial faulty pseudo-labels would participate in the local SGD steps are discarded.

Fig. 4.

Fig. 4. Evaluation of FedSTAR performance against local train steps size. Federated parameters are set to \( \sigma \) = 25%, \( \beta =0.5 \) , R = 50, q = 80%, and N = 15. Average accuracy over three distinct trials is reported.

Varying number of clients: The number of clients is an important factor in the FL procedure, as it can have a significant impact on the data distribution, which has shown to affect the global model’s generalization [17, 45]. In particular, introducing additional clients to FL, the class distributions across clients can become highly skewed, as the data partitioning process is random. With this ablation study, we aim to answer whether FedSTAR can retain the same level of effectiveness when the number of clients (and thus the non-i.i.d-ness of the class distribution) grows. To this end, we present the performance of FedSTAR on Speech Commands dataset when we vary the number of clients from 5 up to 30, while setting q = 80%, \( \sigma \) = 25%, R = 100%, and E = 1. The findings are shown in Figure 5, where we note that the number of clients has a relatively low impact on FedSTAR ability to utilize the available unlabeled audio data, as FedSTAR models’ performance follows a constant upward trend while we provide additional labels for any given N. In particular, comparing the results for N = 5 and N = 30, we observe that the FedSTAR models’ accuracy is notably close, especially for L>3%. This is in constant with fully supervised FL performance, as presented in Table 3, where models’ performance can drop more than 2% when varying N for the same dataset. Finally, it is important to note that FedSTAR model’s performance is close to the centralized baseline for both L>20 and L = 50; thus, no noticeable improvement appears with the introduction of additional labels samples.

Fig. 5.

Fig. 5. Evaluation of FedSTAR performance under varying number of clients. Federated parameters are set to \( \sigma \) = 25%, \( \beta =0.5 \) , R = 100, q = 80%, and E = 1. Average accuracy over three distinct trials is reported.

Varying class distribution across clients: Apart from the number of clients, the preferences of each client can substantially affect the nature of clients’ data distribution. For example, in a music tagging scenario, the type and quantity of data residing on a device are directly correlated to both user’s preference of a specific genre of music and the time users dedicated to the application. Such challenges introduce a highly non-i.i.d. data distribution, both in terms of labels distribution and quantity of data per client. Therefore, in this analysis, we aim to understand the effect of highly non-i.i.d. distributions, both in terms of labels and data quantity distributions, on the effectiveness of FedSTAR to utilize on-device unlabeled data. To this end, we execute experiments on the Speech Commands dataset with N = 15, q = 80%, and E = 1 for R = 100, in which the partitioning of labeled data on clients followed a defined class availability distribution. We utilize a uniform distribution with a mean value of \( \mu \) = 3 and fluctuating variance \( \sigma _{c} \) from 0% to 50% as our class availability distribution across clients. Since the total number of classes in the Speech Commands dataset is 12, we choose \( \mu \) = 3 for clients to access only a few labeled samples per class (on average, three classes). Thus, the on-device labeled data distribution resembles a realistic non-i.i.d. distribution. The client’s preferences can affect both the type and the number of labeled samples described earlier for a music tagging application. It is important to note that the splitting of the unlabeled subset on clients followed a random distribution, with no assumption being made to distribute the label. Consequently, clients might have labeled samples from a specific subset of classes, yet unlabeled instances from all classes could be available. Such data distributions are frequent in pragmatic applications, where the domain knowledge is missing to perform the annotation process appropriately for all classes. For a rigorous evaluation, we perform identical experiments in terms of on-device labeled samples availability under fully supervised federated settings, where the unlabeled dataset remained unexploited.

From the results introduced in Table 5, we note that FedSTAR can effectively exploit the available on-device unlabeled instances to learn an accurate audio model under highly non-i.i.d. distributions. Comparing the FedSTAR performance with that of a fully supervised FL counterpart, we notice a substantial improvement in accuracy in most cases. In particular, for the case of \( L\le 3\% \), FedSTAR utilized on-device unlabeled examples to effectively train an audio model, whereas FL was unable to learn under such highly non-i.i.d. settings adequately. Additionally, we observe that the obtained accuracy gap across three distinct FedSTAR models (with \( \sigma _{c} \) of 0, 25, and 50 percent) for a given L is no larger than 4.8%. This behavior suggests that FedSTAR can maintain nearly the same level of effectiveness in exploiting on-device unlabeled data, irrespective of the skewness of data distribution on clients’ end. Consequently, FedSTAR could be an effective solution to train an audio model under different federated settings, where the labeled data across clients experience a class distribution skewness and large-scale unlabeled audio samples from all classes are readily available on clients’ devices.

Table 5.
Class DistributionCharacteristicsSupervised (Federated)FedSTAR
\( L=3\% \)\( L=5\% \)\( L=20\% \)\( L=50\% \)\( L=3\% \)\( L=5\% \)\( L=20\% \)\( L=50\% \)
\( \mu \) = 3\( \sigma _{c} \) = 0 %9.8332.6380.2282.4079.0879.6287.0183.14
\( \sigma _{c} \) = 25%10.5423.9775.4183.6179.0584.1586.5285.05
\( \sigma _{c} \) = 50%8.4424.2573.9384.4178.1481.8884.5684.55
  • Class distribution has mean \( \mu \) = 3 and variance \( \sigma _{c} \). Average accuracy over three distinct runs is reported on Speech Commands. Detailed results are given in Table 8 of the Appendix. Federated parameters are set to \( \beta =0.5 \), R = 100, N = 15, q = 80%, and E = 1.

Table 5. Performance Evaluation of Method against Variation of Class Availability across Clients

  • Class distribution has mean \( \mu \) = 3 and variance \( \sigma _{c} \). Average accuracy over three distinct runs is reported on Speech Commands. Detailed results are given in Table 8 of the Appendix. Federated parameters are set to \( \beta =0.5 \), R = 100, N = 15, q = 80%, and E = 1.

5.4.3 Assessment of Utilizing Self-supervised Learning for Model Pre-training to Improve Training Convergence of FedSTAR.

Our proposed self-training federated learning approach attains high performance on different audio recognition tasks by utilizing unlabeled data available on clients’ end. However, in reality, a large volume of unlabeled instances from a different task or distribution might also be available on the centralized server. As servers often possess the computational power to effectively pre-train a model on a massive unlabeled dataset, a natural question arises, whether leveraging self-supervised learning to pre-train a model as initialization for FedSTAR could improve the training convergence in federated settings with fewer rounds. To this end, we perform experiments on all three datasets with N = 15 while using a model trained with a self-supervised pre-training strategy introduced in Section 4.4. We compare the obtained accuracy after 10 rounds of training (R = 10) when utilizing a self-supervised pre-trained model as an initial starting point for FedSTAR in contrast with a randomly initialized FedSTAR model trained for the same number of federated rounds. For a more rigorous evaluation, we vary labels availability from L=3% up to L=100% across all our datasets. The findings are presented in Figures 6(a), (c), and (e), where the average accuracy over three distinct trials is reported. Furthermore, the average train loss for the case of L = 50% in the first 10 federated rounds (R = 10) is also reported in Figures 6(b), (d), and (f). We choose to report the average train loss for the case of L = 50%, since we previously observed from Table 4 that FedSTAR models might require additional rounds to utilize unlabeled data effectively. Thus, we can demonstrate that utilizing a pre-trained model as initialization for FedSTAR can significantly boost training convergence.

Fig. 6.

Fig. 6. Self-supervised learning improves training convergence in federated setting. Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, E = 1. Average accuracy on testset over three distinct trials is reported.

From Figures 6(a), (c), and (e), we note that the utilization of a pre-trained model leads to higher accuracy within 10 rounds in almost all cases, suggesting that it was able to perform finer pseudo-labels predictions and accelerate the model’s convergence. In particular, for the Ambient Context dataset, where the amount of available labeled instances per client is tiny (approximately 13 labeled samples per client for L = 5% and N = 15), we observe a substantial difference between the pre-trained and randomly initialized FedSTAR approaches. This behavior suggests that in cases where the amount of labels is exceedingly sparse, utilizing self-supervised learning via model pre-training can significantly shorten the federated rounds needed for convergence. The beneficial role of the model pre-training on FedSTAR can be observed in the train loss gap between the pre-trained and the randomly initialized versions of FedSTAR after 10 rounds, presented in Figures 6(b), (d), and (f).

5.4.4 Effectiveness of FedSTAR under Varying Amount of Unlabeled Data.

As of now, we have assumed that unlabeled data is largely available across clients. However, it is intriguing to investigate the scenario where both the amount of labeled and unlabeled data varies. In this way, we could simulate two pragmatic scenarios: First, an abundant volume of unlabeled instances generated by clients devices (e.g., numerous IoT devices constantly monitoring the surrounding environment); and second, relatively small amount of unlabeled audio samples available (e.g., medical audio examples, where both obtaining and labeling data is expensive). In addition, the restriction of available unlabeled on-device data could be originated from the often-limited storage capabilities of devices participating in the distributed machine learning paradigms. Thus, we aim to understand the effect of unlabeled data availability on the FedSTAR efficiency to improve FL models’ performance as well as the impact of utilizing pre-trained model with self-supervised. Consequently, we perform experiments on the Speech Commands dataset with N = 15, while varying the labeled subset from 3% up to 50% and the unlabeled dataset from 20% up to 100%. The obtained accuracy scores for both pre-trained and randomly initialized FedSTAR models are presented in Table 6.

Table 6.
LabeledPercentageFedSTAR (Randomly Initialized)FedSTAR (SSL Pretrained)
U = 20%U = 50%U = 80%U = 100%U = 20%U = 50%U = 80%U = 100%
L = 3%84.1385.4086.6386.8284.5285.1785.4386.46
L = 5%87.4788.5288.9089.3388.0788.2887.7388.98
L = 20%90.0692.2493.0793.1592.4493.6793.9894.13
L = 50%87.7692.2694.1893.3890.7093.8394.7695.54
  • Average accuracy over three distinct runs is reported on Speech Commands. Detailed results are given in Table 7 of the Appendix. Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, R = 100, E = 1, N = 15.

Table 6. Performance Evaluation of FedSTAR when Varying Both Labeled and Unlabeled Datasets

  • Average accuracy over three distinct runs is reported on Speech Commands. Detailed results are given in Table 7 of the Appendix. Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, R = 100, E = 1, N = 15.

As we see in Table 6, the availability of unlabeled data can affect the FedSTAR models performance. In particular, when the amount of both labeled and unlabeled instances is limited (\( L\le 5 \) and \( U\le 50 \)), the obtained accuracy for both the randomly initialized and the pre-trained FedSTAR models is similar. However, as the amount of labeled data increases (L > 5), we notice a performance improvement of the pre-trained over the standard FedSTAR approach, in case of L = 50 and U = 20 results an accuracy difference of 3%. This behavior suggests that the utilization of on-device unlabeled data for the pre-trained FedSTAR is superior to that of the randomly initialized FedSTAR when sufficient labeled samples are provided. In addition, comparing results of both FedSTAR approaches in Table 6 with those for supervised federated with N = 15 in Table 4, we observe an accuracy improvement for both pre-trained and randomly initialized FedSTAR models over their supervised counterparts for L < 20. For higher labels availability (L > 20), the performance gap between the pre-trained FedSTAR models and the fully supervised federated alternatives is still prevalent, even for small volumes of unlabeled data (U = 20). From this, we can deduce that FedSTAR can effectively utilize unlabeled instances to improve the performance of audio models, even when the availability of on-device unlabeled samples is insufficient.

Skip 6CONCLUSIONS AND FUTURE WORK Section

6 CONCLUSIONS AND FUTURE WORK

We study the pragmatic problem of semi-supervised federated learning for audio recognition tasks. In the distributed scenario, clients’ well-annotated audio examples are deficient due to the prohibitive cost of annotation, users with little to no incentives to label their data, and notably for various important tasks, the domain knowledge that is missing to perform the annotation process appropriately. Conversely, large-scale unlabeled audio data are readily available on clients’ devices. To address the lack of labeled data for learning on-device models, we present a novel self-training strategy based on pseudo-labeling to exploit on-device unlabeled audio data and boost the generalization of models trained in federated settings. Despite its simplicity, we demonstrate that our approach, FedSTAR, is highly feasible for semi-supervised learning on various audio recognition tasks within different federated settings and labels availability. We exhaustively evaluate FedSTAR on several publicly available datasets while comparing its performance with fully supervised federated and traditional centralized counterparts. The models’ accuracy we achieve is consistently superior to fully supervised federated settings under the same labels availability. In many cases, FedSTAR results are comparable to fully supervised federated settings, where the complete dataset with labels was utilized. Furthermore, FedSTAR can retain the same level of effectiveness on utilizing unlabeled instances, irrespective of the amount of labels available on clients. In addition, FedSTAR can significantly improve the model’s performance in settings where on-device labeled samples from only a subset of classes are present, while the unlabeled instances contain examples from all classes. By utilizing on-device unlabeled samples from all classes, the data distribution across devices becomes more uniform; thus, the local models’ learning objectives converge. This non-i.i.d data distribution setting is frequent in pragmatic scenarios, where the expertise is missing to annotate samples from all available classes, e.g., physiological signals in the medical domain. Finally, we demonstrate that self-supervised pre-trained models can significantly improve training convergence in federated settings with fewer rounds when used as model initialization for federated training instead of randomly initialized weights.

Despite the wide applicability, as FedSTAR is based on self-training, it is still relying on a few well-annotated samples across all devices to properly exploit any additional unlabeled data. Without such labeled samples, the utilization of unlabeled samples through FedSTAR might bring undesirable results. In reality, however, such a limitation can be lifted by requesting from users to annotate two to three samples, which are inexpensive to acquire. With the number of devices in an FL network usually ranging from hundreds to even thousands of devices, this process will provide a sufficient labeled subset. This can be used to train model in conjunction with the massively available unlabeled data using FedSTAR to acquire a highly accurate model. Furthermore, inherent noise, which originates from the audio signal, is an additional challenge that can limit the applicability of FedSTAR in real-life applications. Depending on the type and the amount of noise, this could affect the performance of FedSTAR, making the exploitation of unlabeled samples counter-productive. In such cases, there is a range of methods available that can be introduced as a prepossessing step in the learning procedure to mitigate or denoise the signal with minimal effort.

In this work, we provided a federated self-training scheme to learn audio recognition models through a few on-device labeled audio data. In the Internet of Things era, this approach could be employed in a variety of applications, such as home automation, autonomous driving, the healthcare domain, and smart wearable technologies. In particular, we believe that federated self-training is of immense value for learning generalizable audio models in settings, where labeled data are challenging to acquire. However, unlabeled data are available in vast quantities. We hope that the presented perspective of federated self-training inspires the development of additional approaches, specifically those combining semi-supervised learning and federated learning in an asynchronous fashion. Likewise, combining federated self-training with appropriate client selection techniques is another crucial area of improvement that will further improve the performance of deep models in federated learning scenarios. Finally, evaluation in a real-world setting (i.e., federate learning involving real devices) is of major importance to further understand the aspects that require improvements concerning statistical and system heterogeneities, energy, and labeled data requirements in the federated learning setting.

APPENDIX

Table 7.
LabeledPercentageFedSTAR (Randomly Initialized)FedSTAR (Pre-Trained)
U = 20%U = 50%U = 80%U = 100%U = 20%U = 50%U = 80%U = 100%
L = 3%84.13 \( \pm \) 0.00485.40 \( \pm \) 0.00886.63 \( \pm \) 0.00286.82 \( \pm \) 0.02084.52 \( \pm \) 0.00185.17 \( \pm \) 0.00185.43 \( \pm \) 0.00186.46 \( \pm \) 0.006
L = 5%87.47 \( \pm \) 0.00188.52 \( \pm \) 0.00588.90 \( \pm \) 0.00189.33 \( \pm \) 0.00788.07 \( \pm \) 0.00488.28 \( \pm \) 0.00287.73 \( \pm \) 0.00189.98 \( \pm \) 0.002
L = 20%90.06 \( \pm \) 0.00392.24 \( \pm \) 0.01293.07 \( \pm \) 0.00193.15 \( \pm \) 0.01192.44 \( \pm \) 0.00193.67 \( \pm \) 0.00593.98 \( \pm \) 0.00394.13 \( \pm \) 0.001
L = 50%87.76 \( \pm \) 0.00392.26 \( \pm \) 0.00594.18 \( \pm \) 0.00193.38 \( \pm \) 0.00190.70 \( \pm \) 0.00193.83 \( \pm \) 0.00394.76 \( \pm \) 0.00195.54 \( \pm \) 0.007
  • Average accuracy over 3 distinct runs is reported on Speech Commands, including variance across experiments. Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, R = 100, E = 1, N = 15.

Table 7. Performance Evaluation of FedSTAR when Varying Both Labeled and Unlabeled Datasets

  • Average accuracy over 3 distinct runs is reported on Speech Commands, including variance across experiments. Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, R = 100, E = 1, N = 15.

Table 8.
Class DistributionCharacteristicsSupervised (Federated)FedSTAR
\( L=3\% \)\( L=5\% \)\( L=20\% \)\( L=50\% \)\( L=3\% \)\( L=5\% \)\( L=20\% \)\( L=50\% \)
\( \mu \) = 3\( \sigma _{c} \) = 0 %9.83 \( \pm \) 0.01732.63 \( \pm \) 0.09780.22 \( \pm \) 0.05682.40 0.04879.08 \( \pm \) 0.02679.62 \( \pm \) 0.03487.01 \( \pm \) 0.02883.14 \( \pm \) 0.069
\( \sigma _{c} \) = 25%10.54 \( \pm \) 0.01623.97 \( \pm \) 0.13975.41 \( \pm \) 0.05583.61 \( \pm \) 0.04679.05 \( \pm \) 0.05284.15 \( \pm \) 0.01386.52 \( \pm \) 0.03285.05 \( \pm \) 0.051
\( \sigma _{c} \) = 50%8.44 \( \pm \) 0.00124.25 \( \pm \) 0.14073.93 \( \pm \) 0.04484.41 \( \pm \) 0.04378.14 \( \pm \) 0.02181.88 \( \pm \) 0.03184.56 \( \pm \) 0.04184.55 \( \pm \) 0.055
  • Class distribution has mean \( \mu \) = 3 and variance \( \sigma _{c} \). Average accuracy over 3 distinct runs is reported on Speech Commands, including variance across experiments. Federated parameters are set to \( \beta =0.5 \), R = 100, N = 15, q = 80% and E = 1.

Table 8. Performance Evaluation of Method against Variation of Class Availability Across Clients

  • Class distribution has mean \( \mu \) = 3 and variance \( \sigma _{c} \). Average accuracy over 3 distinct runs is reported on Speech Commands, including variance across experiments. Federated parameters are set to \( \beta =0.5 \), R = 100, N = 15, q = 80% and E = 1.

Table 9.
DatasetClientsSupervised (Federated)FedSTAR
\( L=3\% \)\( L=5\% \)\( L=20\% \)\( L=50\% \)\( L=3\% \)\( L=5\% \)\( L=20\% \)\( L=50\% \)
Ambient Context546.34 \( \pm \) 0.00947.89 \( \pm \) 0.05661.40 \( \pm \) 0.00165.85 \( \pm \) 0.02148.68 \( \pm \) 0.00454.95 \( \pm \) 0.02664.37 \( \pm \) 0.01267.04 \( \pm \) 0.010
Speech Commands81.12 \( \pm \) 0.03787.97 \( \pm \) 0.04792.35 \( \pm \) 0.03094.66 \( \pm \) 0.01287.41 \( \pm \) 0.00790.01 \( \pm \) 0.00194.17 \( \pm \) 0.00394.85 \( \pm \) 0.001
VoxForge54.55 \( \pm \) 0.00956.41 \( \pm \) 0.02161.65 \( \pm \) 0.00570.37 \( \pm \) 0.02163.92 \( \pm \) 0.01667.80 \( \pm \) 0.01869.09 \( \pm \) 0.01367.08 \( \pm \) 0.016
Ambient Context1035.29 \( \pm \) 0.00641.31 \( \pm \) 0.01251.71 \( \pm \) 0.00962.69 \( \pm \) 0.01848.87 \( \pm \) 0.00452.37 \( \pm \) 0.01862.94 \( \pm \) 0.02464.42 \( \pm \) 0.006
Speech Commands67.75 \( \pm \) 0.00183.80 \( \pm \) 0.02992.12 \( \pm \) 0.08794.02 \( \pm \) 0.03686.82 \( \pm \) 0.00690.33 \( \pm \) 0.00794.09 \( \pm \) 0.00294.18 \( \pm \) 0.006
VoxForge56.14 \( \pm \) 0.02054.73 \( \pm \) 0.00160.48 \( \pm \) 0.03362.41 \( \pm \) 0.01459.87 \( \pm \) 0.02464.35 \( \pm \) 0.00369.38 \( \pm \) 0.01663.27 \( \pm \) 0.032
Ambient Context1533.03 \( \pm \) 0.00242.75 \( \pm \) 0.00753.37 \( \pm \) 0.00459.97 \( \pm \) 0.00449.54 \( \pm \) 0.00554.71 \( \pm \) 0.02263.46 \( \pm \) 0.00462.41 \( \pm \) 0.006
Speech Commands62.98 \( \pm \) 0.00372.84 \( \pm \) 0.00192.14 \( \pm \) 0.00393.14 \( \pm \) 0.00486.82 \( \pm \) 0.00689.33 \( \pm \) 0.00293.16 \( \pm \) 0.00193.39 \( \pm \) 0.007
VoxForge54.26 \( \pm \) 0.00254.37 \( \pm \) 0.00957.11 \( \pm \) 0.03160.29 \( \pm \) 0.00155.82 \( \pm \) 0.01157.96 \( \pm \) 0.02567.66 \( \pm \) 0.00461.66 \( \pm \) 0.007
Ambient Context3032.31 \( \pm \) 0.00440.17 \( \pm \) 0.00147.05 \( \pm \) 0.00155.85 \( \pm \) 0.00240.84 \( \pm \) 0.04146.58 \( \pm \) 0.01360.21 \( \pm \) 0.01356.19 \( \pm \) 0.009
Speech Commands33.78 \( \pm \) 0.01244.21 \( \pm \) 0.01684.94 \( \pm \) 0.01292.21 \( \pm \) 0.00883.88 \( \pm \) 0.00188.19 \( \pm \) 0.00592.92 \( \pm \) 0.00592.62 \( \pm \) 0.007
VoxForge50.32 \( \pm \) 0.00954.33 \( \pm \) 0.01555.19 \( \pm \) 0.01157.56 \( \pm \) 0.00254.81 \( \pm \) 0.00156.18 \( \pm \) 0.00563.83 \( \pm \) 0.00956.66 \( \pm \) 0.009
  • Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, E = 1, R = 100.

  • Average accuracy over 3 distinct trials on test set is reported, including variance across experiments.

Table 9. Performance Evaluation of FedSTAR

  • Federated parameters are set to q = 80%, \( \sigma \) = 25%, \( \beta \) = 0.5, E = 1, R = 100.

  • Average accuracy over 3 distinct trials on test set is reported, including variance across experiments.

REFERENCES

  1. [1] Arazo Eric, Ortego Diego, Albert Paul, O’Connor Noel E., and McGuinness Kevin. 2020. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In International Joint Conference on Neural Networks (IJCNN). 18. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  2. [2] Berthelot David, Carlini Nicholas, Goodfellow Ian, Papernot Nicolas, Oliver Avital, and Raffel Colin. 2019. MixMatch: A Holistic Approach to Semi-Supervised Learning. arXiv: 1905.02249 [cs.LG].Google ScholarGoogle Scholar
  3. [3] Beutel Daniel J., Topal Taner, Mathur Akhil, Qiu Xinchi, Parcollet Titouan, and Lane Nicholas D.. 2020. Flower: A friendly federated learning research framework. arXiv preprint arXiv:2007.14390 (2020).Google ScholarGoogle Scholar
  4. [4] Chan Justin, Rea Thomas, Gollakota Shyamnath, and Sunshine Jacob E.. 2019. Contactless cardiac arrest detection using smart devices. NPJ Dig. Med. 2, 1 (2019), 18.Google ScholarGoogle Scholar
  5. [5] Foggia Pasquale, Petkov Nicolai, Saggese Alessia, Strisciuglio Nicola, and Vento Mario. 2016. Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Trans. Intell. Transport. Syst. 17, 1 (2016), 279288. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. [6] Fonseca Eduardo, Favory Xavier, Pons Jordi, Font Frederic, and Serra Xavier. 2020. FSD50k: An open dataset of human-labeled sound events. arXiv preprint arXiv:2010.00475 (2020).Google ScholarGoogle Scholar
  7. [7] Gao Yan, Parcollet Titouan, Fernandez-Marques Javier, Gusmao Pedro P. B. de, Beutel Daniel J., and Lane Nicholas D.. 2021. End-to-end Speech Recognition from Federated Acoustic Models. arXiv: 2104.14297 [cs.SD].Google ScholarGoogle Scholar
  8. [8] Grandvalet Yves and Bengio Yoshua. 2004. Semi-supervised learning by entropy minimization. In 17th International Conference on Neural Information Processing Systems (NIPS’04). The MIT Press, Cambridge, MA, 529536.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. [9] Hard Andrew, Partridge Kurt, Nguyen Cameron, Subrahmanya Niranjan, Shah Aishanee, Zhu Pai, Moreno Ignacio Lopez, and Mathews Rajiv. 2020. Training Keyword Spotting Models on Non-IID Data with Federated Learning. arXiv: 2005.10406 [eess.AS].Google ScholarGoogle Scholar
  10. [10] Hard Andrew, Rao Kanishka, Mathews Rajiv, Beaufays Françoise, Augenstein Sean, Eichner Hubert, Kiddon Chloé, and Ramage Daniel. 2018. Federated learning for mobile keyboard prediction. CoRR abs/1811.03604 (2018).Google ScholarGoogle Scholar
  11. [11] Hinton Geoffrey, Vinyals Oriol, and Dean Jeff. 2015. Distilling the Knowledge in a Neural Network. arXiv: 1503.02531 [stat.ML].Google ScholarGoogle Scholar
  12. [12] Hosseini Hossein, Yun Sungrack, Park Hyunsin, Louizos Christos, Soriaga Joseph, and Welling Max. 2020. Federated Learning of User Authentication Models. arXiv: 2007.04618 [cs.LG].Google ScholarGoogle Scholar
  13. [13] Huang Li and Liu Dianbo. 2019. Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. CoRR abs/1903.09296 (2019).Google ScholarGoogle Scholar
  14. [14] Itahara Sohei, Nishio Takayuki, Koda Yusuke, Morikura Masahiro, and Yamamoto Koji. 2021. Distillation-based Semi-supervised Federated Learning for Communication-efficient Collaborative Training with Non-IID Private Data. arXiv: 2008.06180 [cs.DC].Google ScholarGoogle Scholar
  15. [15] Jeong Wonyong, Yoon Jaehong, Yang Eunho, and Hwang Sung Ju. 2020. Federated Semi-supervised Learning with Inter-client Consistency. arXiv: 2006.12097 [cs.LG].Google ScholarGoogle Scholar
  16. [16] Jin Yilun, Wei Xiguang, Liu Yang, and Yang Qiang. 2020. Towards Utilizing Unlabeled Data in Federated Learning: A Survey and Prospective. arXiv: 2002.11545 [cs.LG].Google ScholarGoogle Scholar
  17. [17] Kairouz Peter, McMahan H. Brendan, Avent Brendan, Bellet Aurélien, Bennis Mehdi, Bhagoji Arjun Nitin, Bonawitz Kallista, Charles Zachary, Cormode Graham, Cummings Rachel, D’Oliveira Rafael G. L., Eichner Hubert, Rouayheb Salim El, Evans David, Gardner Josh, Garrett Zachary, Gascón Adrià, Ghazi Badih, Gibbons Phillip B., Gruteser Marco, Harchaoui Zaid, He Chaoyang, He Lie, Huo Zhouyuan, Hutchinson Ben, Hsu Justin, Jaggi Martin, Javidi Tara, Joshi Gauri, Khodak Mikhail, Konečný Jakub, Korolova Aleksandra, Koushanfar Farinaz, Koyejo Sanmi, Lepoint Tancrède, Liu Yang, Mittal Prateek, Mohri Mehryar, Nock Richard, Özgür Ayfer, Pagh Rasmus, Raykova Mariana, Qi Hang, Ramage Daniel, Raskar Ramesh, Song Dawn, Song Weikang, Stich Sebastian U., Sun Ziteng, Suresh Ananda Theertha, Tramèr Florian, Vepakomma Praneeth, Wang Jianyu, Xiong Li, Xu Zheng, Yang Qiang, Yu Felix X., Yu Han, and Zhao Sen. 2021. Advances and Open Problems in Federated Learning. arXiv: 1912.04977 [cs.LG].Google ScholarGoogle Scholar
  18. [18] Koizumi Yuma, Saito Shoichiro, Uematsu Hisashi, Harada Noboru, and Imoto Keisuke. 2019. ToyADMOS: A Dataset of Miniature-machine Operating Sounds for Anomalous Sound Detection. arXiv: 1908.03299 [eess.AS].Google ScholarGoogle Scholar
  19. [19] Konečný Jakub, McMahan H. Brendan, Yu Felix X., Richtárik Peter, Suresh Ananda Theertha, and Bacon Dave. 2017. Federated Learning: Strategies for Improving Communication Efficiency. arXiv: 1610.05492 [cs.LG].Google ScholarGoogle Scholar
  20. [20] Korbar Bruno, Tran Du, and Torresani Lorenzo. 2018. Cooperative learning of audio and video models from self-supervised synchronization. In 32nd International Conference on Neural Information Processing Systems. 77747785.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. [21] Lake Brenden M., Ullman Tomer D., and Gershman Joshua B. Tenenbaum andSamuel J.. 2016. Building machines that learn and think like people. CoRR abs/1604.00289 (2016).Google ScholarGoogle Scholar
  22. [22] Lee Dong-Hyun et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3.Google ScholarGoogle Scholar
  23. [23] Leroy David, Coucke Alice, Lavril Thibaut, Gisselbrecht Thibault, and Dureau Joseph. 2019. Federated Learning for Keyword Spotting. arXiv: 1810.05512 [eess.AS].Google ScholarGoogle Scholar
  24. [24] Li Tian, Sahu Anit Kumar, Talwalkar Ameet, and Smith Virginia. 2019. Federated learning: Challenges, methods, and future directions. CoRR abs/1908.07873 (2019).Google ScholarGoogle Scholar
  25. [25] Li Tian, Sahu Anit Kumar, Zaheer Manzil, Sanjabi Maziar, Talwalkar Ameet, and Smith Virginia. 2020. Federated Optimization in Heterogeneous Networks. arXiv: 1812.06127 [cs.LG].Google ScholarGoogle Scholar
  26. [26] Long Zewei, Che Liwei, Wang Yaqing, Ye Muchao, Luo Junyu, Wu Jinze, Xiao Houping, and Ma Fenglong. 2020. FedSemi: An Adaptive Federated Semi-supervised Learning Framework. arXiv: 2012.03292 [cs.LG].Google ScholarGoogle Scholar
  27. [27] Loshchilov Ilya and Hutter Frank. 2016. SGDR: Stochastic gradient descent with restarts. CoRR abs/1608.03983 (2016).Google ScholarGoogle Scholar
  28. [28] Aodha Oisin Mac, Gibb Rory, Barlow Kate E., Browning Ella, Firman Michael, Freeman Robin, Harder Briana, Kinsey Libby, Mead Gary R., Newson Stuart E., Pandourski Ivan, Parsons Stuart, Russ Jon, Szodoray-Paradi Abigel, Szodoray-Paradi Farkas, Tilova Elena, Girolami Mark, Brostow Gabriel, and Jones Kate E.. 2018. Bat detective-deep learning tools for bat acoustic signal detection. PLOS Computat. Biol. 14, 3 (03 2018), 119. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  29. [29] MacLean Ken. 2018. Voxforge. Ken MacLean. Retrieved from http://www.voxforge.org/home. [Acedido em 2012] (2018).Google ScholarGoogle Scholar
  30. [30] McMahan H. Brendan, Moore Eider, Ramage Daniel, Hampson Seth, and Arcas Blaise Agüera y. 2017. Communication-efficient Learning of Deep Networks from Decentralized Data. arXiv: 1602.05629 [cs.LG].Google ScholarGoogle Scholar
  31. [31] Miyato Takeru, Maeda Shin ichi, Koyama Masanori, and Ishii Shin. 2018. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-supervised Learning. arXiv: 1704.03976 [stat.ML].Google ScholarGoogle Scholar
  32. [32] Oliver Avital, Odena Augustus, Raffel Colin, Cubuk Ekin D., and Goodfellow Ian J.. 2019. Realistic Evaluation of Deep Semi-supervised Learning Algorithms. arXiv: 1804.09170 [cs.LG].Google ScholarGoogle Scholar
  33. [33] Park Chunjong, Min Chulhong, Bhattacharya Sourav, and Kawsar Fahim. 2020. Augmenting conversational agents with ambient acoustic contexts. In 22nd International Conference on Human-computer Interaction with Mobile Devices and Services (MobileHCI’20). Association for Computing Machinery, New York, NY. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. [34] Ramaswamy Swaroop, Mathews Rajiv, Rao Kanishka, and Beaufays Françoise. 2019. Federated learning for emoji prediction in a mobile keyboard. CoRR abs/1906.04329 (2019).Google ScholarGoogle Scholar
  35. [35] Saeed Aaqib, Grangier David, and Zeghidour Neil. 2021. Contrastive learning of general-purpose audio representations. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 38753879.Google ScholarGoogle ScholarCross RefCross Ref
  36. [36] Stowell Dan, Stylianou Yannis, Wood Mike, Pamula Hanna, and Glotin Hervé. 2018. Automatic acoustic detection of birds through deep learning: The first bird audio detection challenge. CoRR abs/1807.05812 (2018).Google ScholarGoogle Scholar
  37. [37] Tagliasacchi Marco, Gfeller Beat, Quitry Félix de Chaumont, and Roblek Dominik. 2019. Self-supervised audio representation learning for mobile devices. arXiv preprint arXiv:1905.11796 (2019).Google ScholarGoogle Scholar
  38. [38] Tarvainen Antti and Valpola Harri. 2018. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv: 1703.01780 [cs.NE].Google ScholarGoogle Scholar
  39. [39] Berlo Bram van, Saeed Aaqib, and Ozcelebi Tanir. 2020. Towards federated unsupervised representation learning. In Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking. Association for Computing Machinery, 31–36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. [40] Engelen Jesper E. van and Hoos H.. 2019. A survey on semi-supervised learning. Mach. Learn. 109 (2019), 373440.Google ScholarGoogle ScholarCross RefCross Ref
  41. [41] Warden Pete. 2018. Speech commands: A dataset for limited-vocabulary speech recognition. CoRR abs/1804.03209 (2018).Google ScholarGoogle Scholar
  42. [42] Wu Yuxin and He Kaiming. 2018. Group Normalization. arXiv: 1803.08494 [cs.CV].Google ScholarGoogle Scholar
  43. [43] Xie Qizhe, Hovy Eduard H., Luong Minh-Thang, and Le Quoc V.. 2019. Self-training with noisy student improves imagenet classification. CoRR abs/1911.04252 (2019).Google ScholarGoogle Scholar
  44. [44] Yang Timothy, Andrew Galen, Eichner Hubert, Sun Haicheng, Li Wei, Kong Nicholas, Ramage Daniel, and Beaufays Françoise. 2018. Applied Federated Learning: Improving Google Keyboard Query Suggestions. arXiv: 1812.02903 [cs.LG].Google ScholarGoogle Scholar
  45. [45] Zhao Yue, Li Meng, Lai Liangzhen, Suda Naveen, Civin Damon, and Chandra Vikas. 2018. Federated Learning with Non-IID Data. arXiv: 1806.00582 [cs.LG].Google ScholarGoogle Scholar
  46. [46] Zhu Xiaojin and Goldberg Andrew. 2009. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 3, 1 (2009), 1–130. DOI:Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Federated Self-training for Semi-supervised Audio Recognition

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format .

        View HTML Format
        About Cookies On This Site

        We use cookies to ensure that we give you the best experience on our website.

        Learn more

        Got it!