Black-box Adversarial Attacks on Commercial Speech Platforms with Minimal Information

Adversarial attacks against commercial black-box speech platforms, including cloud speech APIs and voice control devices, have received little attention until recent years. The current "black-box" attacks all heavily rely on the knowledge of prediction/confidence scores to craft effective adversarial examples (AEs), which can be intuitively defended by service providers simply withholding these scores. In this paper, we propose two novel adversarial attacks in more practical and rigorous scenarios. For commercial cloud speech APIs, we propose Occam, a decision-only black-box adversarial attack, where only final decisions are available to the adversary. In Occam, we formulate the decision-only AE generation as a discontinuous large-scale global optimization problem, and solve it by adaptively decomposing this complicated problem into a set of sub-problems and cooperatively optimizing each one. Occam is a one-size-fits-all approach that achieves 100% success rates of attacks with an average SNR of 14.23dB on a wide range of popular speech and speaker recognition APIs, including Google, Alibaba, Microsoft, Tencent, iFlytek, and Jingdong, outperforming the state-of-the-art black-box attacks. For commercial voice control devices, we propose NI-Occam, the first non-interactive physical adversarial attack, where the adversary neither queries the oracle nor has access to its internal information and training data. We combine adversarial attacks with model inversion attacks, and thus generate physically effective audio AEs with high transferability without any interaction with the target devices. Our experimental results show that NI-Occam can successfully fool Apple Siri, Microsoft Cortana, Google Assistant, iFlytek and Amazon Echo with an average SRoA of 52% and an SNR of 9.65dB, shedding light on non-interactive physical attacks against voice control devices.


INTRODUCTION
Nowadays, speech and speaker recognition technologies are reshaping the way we interact with ubiquitous smart devices. More specifically, automatic speech recognition (ASR) technologies [41] allow machines to understand human voices, while speaker recognition (SR) technologies [37,71] enable machines to identify a person from the characteristics of his/her voice. As a result, ASR and SR have become universal in our daily lives, ranging from personal voice assistants (PVAs) [14,46] to biometric authentication [20,79] on various smart devices. The popularity of such speech services allows people to greatly enjoy the convenience of using speech as a new input for smart devices to perform daily and even complicated tasks. For example, Amazon has released Alexa [1] and the Auto SDK [4], which allow users to make credit card payments and control vehicles with voice interaction, respectively.
Despite their wide applications, the extensive use of voice commands in safety-critical systems, like autonomous driving [38,59] and biometric identification, also poses potential safety hazards. A line of recent research [12,23,92] has extensively demonstrated the vulnerability of acoustic systems to numerous types of abnormal audio signals, such as noises and inaudible ultrasounds. These attacks, however, can be easily detected and/or defended against by differentiating and analyzing the nature (i.e., legitimate or malicious) of the received audio signals. Inspired by the resounding success of adversarial attacks against image recognition systems [24,40,62,78], more recent studies have begun to investigate the feasibility of adversarial examples (AEs) in the audio domain [47], as shown in Table 1.
The very first attempts, made by Carlini et al. [25] and Yuan et al. [90], have shown that ASR systems are also inherently vulnerable to audio AEs in white-box scenarios, where the attackers have full knowledge of the structure and parameters inside the system. When it comes to black-box settings, however, the success of adversarial attacks in the image domain is hard to port to the audio domain, mainly owing to multiple nontrivial challenges presented by the unique characteristics of time-domain speech signals and the much more complex architecture of acoustic systems.
Taori et al. [80] combined genetic algorithms [84] and gradient estimation [28], which have proved effective in the image domain [13,19], to carry out a black-box attack against the open-source DeepSpeech [42], but with a success rate of only 35%. Besides, since the attacker requires query access to the last layer (i.e., logits) of the DNNs inside DeepSpeech, such attacks become unrealistic when applied to closed-source ASR systems. More recently, Chen et al. [32] showed that when confidence scores are publicly known, several commercial ASR systems are vulnerable to adversarial inputs, but with only a very limited number of target commands. Almost simultaneously, Du et al. [35] and Chen et al. [26] presented the first black-box adversarial attacks against SR systems, both of which heavily relied on prediction scores. However, such scores, whether defined by the attackers or provided by speech service APIs, may be useless to legitimate users, so service providers can easily hide them to defend against the above black-box attacks. Besides, a significant number of speech service APIs (e.g., the Alibaba Short Speech Recognition API [11] and the iFlytek Speech-to-Text API [3]) do not return any intermediate information (e.g., confidence/prediction scores or other probabilities), except for the final decision results, e.g., final transcriptions in ASR systems and user-ids in SR systems.
From a practical perspective, the prior efforts are important but not satisfactory with respect to the minimum information required by the adversary to launch a successful black-box attack. We may ask: "is it possible to launch effective and practical audio adversarial attacks against commercial black-box speech platforms with the minimum information?" Answering this question is a big challenge, mainly due to the extreme lack of information about the target model. Specifically, the acoustic system, which involves non-linear feature extraction steps to cope with intricate frequency feature changes in the time dimension, is much more complicated than an image processing system. Furthermore, a speech vector usually contains nearly one hundred thousand variables, far exceeding the hundreds or thousands of pixels in images (e.g., MNIST and CIFAR-10 images are 28×28 and 32×32 pixels, respectively). As reported in [88], the explicit interdependencies among this massive number of variables significantly hinder the successful construction of audio AEs.

Our Works
Generally speaking, there are mainly two types of black-box speech platforms. One is commercial Cloud Speech APIs that provide audio services to users, and the other is commercial Voice Control Devices, such as Apple Siri, which perform the speech-to-text task in the physical world. In this paper, we present two attack schemes: Occam, a decision-only attack on cloud speech APIs, and NI-Occam, a non-interactive physical attack on voice control devices.

Occam. In our first design, we take one more step forward and focus on real-world threat scenarios where the adversary has access to an oracle (target model) that returns only its final decision. We propose Occam, a decision-based black-box adversarial attack against cloud speech APIs. We demonstrate that various commercial speech API services, such as Google Cloud Speech-to-Text, Alibaba Cloud Speech-to-Text, and Microsoft Azure Speech Service, are inherently vulnerable to audio AEs generated by Occam, even if no internal information is exposed to the adversary.
Our key idea in Occam is to formulate the decision-based black-box attack against smart acoustic systems as a discontinuous, large-scale global optimization problem, on the basis of the final discrete decision (the only feedback the attacker can obtain) and the large number of optimization variables incurred by the speech. Inspired by this observation, we develop a novel technique called CC-CMA-ES, which applies a cooperative co-evolution (CC) framework to the powerful covariance matrix adaptation evolution strategy (CMA-ES), to solve this large and complex problem in the strictly black-box setting. More specifically, CC-CMA-ES first decomposes the complicated problem into a set of smaller and simpler sub-problems, and then uses CMA-ES to cooperatively optimize each one by modeling their local geometries. To improve the attack efficiency, we further propose an adaptive counterpart, which allows the sub-problem size and the decomposition strategy to self-adapt to the environmentally changeable evolution process.
We conduct extensive experiments to evaluate our attack capabilities on both speech and speaker recognition tasks, and compare Occam with five decision-based black-box methods to demonstrate its superiority. We first craft audio adversarial examples against the local DeepSpeech model in the strictly black-box setting, achieving perfect success rates in both targeted and untargeted attacks. Then, we launch black-box adversarial attacks on a wide range of commercial speech-to-text API services, including Google, Microsoft, Alibaba, Tencent, and iFlytek, with success rates of 100% and an average SNR of 14.37dB. Furthermore, we verify the attack effectiveness against commercial SR systems, including Microsoft and Jingdong, again achieving success rates of 100% with an average SNR of 14.07dB.

NI-Occam. In our second design, we further probe the possibility of launching more rigorous and practical attacks on voice control devices, where the adversary still has no access to internal information and training data of the oracle, and does not even need to make queries to probe it. We, for the first time, propose a non-interactive physical attack, named NI-Occam, which successfully attacks many commercial voice control devices without any interaction. We show that NI-Occam works well in real-world attack scenarios, where audio AEs are played over-the-air.
Our key idea of NI-Occam is to combine adversarial attacks with model inversion attacks [29,36,94]. More specifically, we attempt to recover, via model inversion, the key parts of natural command audio that are critical for speech recognition, and embed them into the original example.

We emphasize that our attacks have the following highlights: 1) Practicality. They are able to attack commercial black-box platforms in real-world scenarios without any prior knowledge; 2) Generality. They are able to attack a wide range of commercial cloud speech APIs and voice control devices; 3) Effectiveness. They are able to automatically and easily generate audio AEs with high success rates of attack.

Contribution. Our major contributions are summarized as follows.
• Generic black-box attacks with the minimum required information. We present a novel decision-only audio adversarial attack, named Occam, under the strictly black-box scenario where the attacker relies solely on the final decisions available in any application case, which is quite different from the state-of-the-art black-box adversarial attacks against commercial Cloud Speech APIs. To the best of our knowledge, our attack strategy is the first that can fool both commercial ASR and SR services.

• Effective attacks with a perfect success rate. We thoroughly evaluate our attack on a wide range of popular open-source and commercial (A)SR systems, including Google, Alibaba, Microsoft, Tencent, iFlytek, Jingdong, and DeepSpeech. Extensive experiments demonstrate that our attack is highly effective, with a success rate of 100% and an average SNR of 14.23dB on commercial speech services, outperforming the state-of-the-art black-box attacks on commercial cloud speech APIs.

• Practical physical attacks without any interaction. We explore the possibility of generating audio AEs and playing them against commercial voice control devices over-the-air. We thus for the first time propose a new non-interactive physical attack, named NI-Occam, which can successfully fool various voice control devices, including Apple Siri, Microsoft Cortana, Google Assistant, iFlytek and Amazon Echo, without any feedback information from the target devices. The experimental results show that our over-the-air attack can achieve an average success rate of 52% with an SNR of 9.65dB, shedding light on non-interactive physical attacks against voice control devices.

BACKGROUND

Speech Recognition
Automatic speech recognition (ASR) systems allow machines to automatically convert speech into text, and have found tremendous applications. Typically, an ASR system consists of four main components: pre-processing, feature extraction, the acoustic model, and the language model, as shown in Figure 1. Pre-processing plays an important role in filtering out frequencies beyond the range of human hearing and segments below a specific energy threshold in the raw audio. Then, features are extracted via a feature extraction algorithm, such as Mel-frequency Cepstral Coefficients (MFCC) [63], Linear Predictive Coefficients (LPC) [49], Perceptual Linear Prediction (PLP) [44], etc. Different from image classification models [56], which take pixels as input, the acoustic model takes the extracted features as input and outputs the probabilities of phonemes. In early ASR systems, the Hidden Markov Model was one of the preferred techniques [16]. With the stupendous advance of deep learning, deep neural networks (DNNs) [45] and especially recurrent neural networks (RNNs) [41,70] have become the dominant choices today. The ASR system finally produces transcriptions conforming to grammatical rules via the language model.
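The pre-processing stage described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation; the frame/hop lengths (25 ms / 10 ms at 16 kHz), the pre-emphasis coefficient, and the energy threshold are assumed values:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[t] = x[t] - alpha * x[t-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])

def drop_silent_frames(frames, energy_thresh=1e-4):
    """Discard frames whose mean energy falls below the threshold."""
    energy = np.mean(frames ** 2, axis=1)
    return frames[energy >= energy_thresh]
```

The surviving frames would then be fed to a feature extractor such as MFCC before reaching the acoustic model.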

Speaker Recognition
Due to the tremendous advance of DNNs, speaker recognition (SR) systems are becoming increasingly popular in biometric authentication [30,58,91]. These systems, which allow machines to correctly identify a person from his/her unique voice characteristics, have various applications such as bank services and forensic tests. The key step is to extract users' voice features from a series of utterances, which essentially consist of the underlying text information and the features of the speaker. Figure 2 provides an overview of a typical SR system [54], which consists of two phases: enrollment and evaluation. Generally, in the enrollment phase, a set of background speakers are required to train the background model offline, so that a speaker can provide a few utterances online to create a new speaker-specific model. The technologies for generating such models can be generally summarized into three types: i-vector-PLDA [65,66], GMM-UBM [72,73], and DNN [77,91]. During the evaluation phase, the unknown speaker's voice is taken as input and scored by the speaker models in the library. Based on the resulting scores, the decision model generates the final recognition result.

Figure 2: The architecture of a typical SR system.
There are two important sub-tasks in speaker recognition: speaker verification (SV) [72] and speaker identification (SI) [52]. The former validates whether the current user is legitimate, i.e., outputs either accept or reject, while the latter aims to figure out the identity of the speaker among a set of enrolled ones. According to their text dependence, SR systems can also be divided into text-dependent and text-independent ones [71]; the difference is that the text-dependent approach requires all speakers to utter pre-defined sentences. Despite its higher accuracy, it is only used for the SV sub-task.
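The two sub-tasks reduce to two different decision rules over the speaker-model scores. The following is a minimal illustrative sketch; the score dictionary, threshold value, and function names are hypothetical, not part of any real SR system:

```python
def identify(scores):
    """Speaker identification (SI): return the enrolled speaker with the highest score."""
    return max(scores, key=scores.get)

def verify(scores, claimed_id, threshold=0.5):
    """Speaker verification (SV): accept iff the claimed speaker's score clears the threshold."""
    return scores.get(claimed_id, float("-inf")) >= threshold
```

In a real system the scores would come from the speaker models in the library (e.g., i-vector-PLDA or DNN embeddings).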

Adversarial Examples
Despite their great success, the vulnerabilities of neural networks to adversarial examples have recently been extensively studied [24,40,62,78]. An adversary can slightly revise the legitimate inputs to generate adversarial examples for fooling neural networks [78]. Generally speaking, there are two types of adversarial examples: untargeted and targeted ones, also known as dodging attacks and impersonation attacks, respectively. Let f(·): x ∈ X → y ∈ Y denote the recognition model that maps the input x into the corresponding output prediction y. Given an original input x with prediction y such that f(x) = y, an untargeted adversarial example x* in the dodging attacks can be represented as:

f(x*) ≠ y, s.t. D_p(x, x*) ≤ ε,

where D_p(x, x*) is the distance between the original input x and the untargeted adversarial example x*, ε is a parameter used to limit this distance, and p is typically 0, 2, or ∞. Similarly, in the impersonation attacks, for an original input x and a specific target y* such that f(x) ≠ y*, a targeted adversarial example x* can be represented as:

f(x*) = y*, s.t. D_p(x, x*) ≤ ε.
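The two definitions can be expressed as simple predicate checks. This is an illustrative sketch where `f`, `dist`, and `eps` stand in for the model f(·), the distance D_p, and the bound ε; the function names are my own:

```python
def is_untargeted_ae(f, x, x_star, y, dist, eps):
    """Dodging: x* is adversarial if the prediction changes and x* stays within eps of x."""
    return f(x_star) != y and dist(x, x_star) <= eps

def is_targeted_ae(f, x, x_star, y_star, dist, eps):
    """Impersonation: x* must be classified as the chosen target y* while staying close to x."""
    return f(x_star) == y_star and dist(x, x_star) <= eps
```

For example, with a toy 1-D classifier `f = lambda v: int(v > 0)`, pushing a positive input below zero is a dodging attack, while pulling a negative input above zero targets class 1.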
OCCAM: A DECISION-ONLY DIGITAL ATTACK AGAINST CLOUD SPEECH APIS

Threat Model

Nowadays, many commercial cloud speech platforms offer both ASR and SR API services, e.g., the Microsoft Azure Speech and Speaker Recognition Service APIs, and thus third-party developers can access commercial APIs once they have enrolled or paid for the services. The service providers, on the other hand, will not expose any parameters or the architecture of the target model because such internal information is commercially sensitive. In fact, a number of API services, e.g., iFlytek, Alibaba, Tencent, and Jingdong, provide only the final decision results without exposing any other information. Therefore, it is important to explore a generic attack against both ASR and SR commercial APIs in this decision-based scenario. Note that although ASR tasks differ from SR tasks, we can treat them as the same problem in this design, because the construction of audio AEs for ASR and SR APIs can be formulated as the same optimization problem.
In this section, our target is commercial cloud speech services that open their APIs to the public, and we assume that the adversary intends to generate AEs against both ASR and SR services without any internal knowledge of the target model. More specifically, the adversary can only query the target model and obtain its final decision, which is a strict but more practical assumption in real-world applications. Regarding the adversary's knowledge of the original audio, we make two different assumptions for the ASR and SR tasks. For ASR systems, we assume that the adversary knows nothing about the dataset, and thus we utilize a Text-to-Speech API service to generate the audio of the target text. For SR systems, the adversary only needs to collect one voice sample of the victim, which is readily available owing to the pervasive leakage of personal information, e.g., public videos on social media.

Problem Formulation
For an acoustic system f(·), given a voice x ∈ R^n and a specific target y* such that f(x) ≠ y*, a targeted AE x* can be described by:

min D(x*, x), s.t. f(x*) = y*.

In the white-box setting, the problem can be formulated as

min_{x*} L(x*) = D(x*, x) + c · J(x*, y*),

where L(·) is the objective function, J(·, ·) is the loss function measuring how well x* meets the adversarial requirement, and c is the adjustment parameter [25]. In the black-box setting without internal knowledge, we can reformulate it as the optimization problem

min_{x*} L(x*) = D(x*, x) if f(x*) = y*, and +∞ otherwise.

Note that L(x*) is equal to +∞ when x* is not adversarial. Thus, we try to find the adversarial region and minimize its distance from the original audio. Similarly, the construction of untargeted audio AEs in our decision-based black-box attack can be reformulated as

min_{x*} L(x*) = D(x*, x) if f(x*) ≠ y, and +∞ otherwise.
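A minimal sketch of the decision-only objective (distance to the original audio if the candidate is adversarial, +∞ otherwise) might look as follows; the L2 distance and the toy decision function are assumptions for illustration:

```python
import numpy as np

def objective(f, x, x_star, y_star, targeted=True):
    """Decision-only loss: L2 distance to the original audio x if x* is
    adversarial (decision matches/differs from y_star), else +inf."""
    decision = f(x_star)
    adversarial = (decision == y_star) if targeted else (decision != y_star)
    return float(np.linalg.norm(x_star - x)) if adversarial else np.inf
```

Any optimizer minimizing this function stays inside the adversarial region by construction, since non-adversarial candidates score +∞ and are never accepted.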

Technical Challenges
As mentioned above, compared to the white-box and score-based settings, it is quite challenging to craft decision-based black-box adversarial examples against acoustic systems, since the attacker has no internal knowledge about the target model, except for the final decision corresponding to each query. Thus, we face several design challenges. First, depending on whether the output decision matches the adversarial target, the optimization space can be divided into a large non-adversarial region and a very small adversarial region. Hence, different from almost all previous adversarial voice attacks, which feature a continuous optimization problem, the construction of audio AEs in the strictly black-box scenario poses a difficult discontinuous optimization problem due to the extreme lack of information, as shown in Figure 3. Second, acoustic systems are extremely complicated, and we have to deal with intricate feature changes of the audio in the time dimension. Finally, because the audio sampling rate is very high (e.g., 16kHz), the large number of optimization variables in the speech vector presents another challenge, i.e., the curse of dimensionality, especially when there is a clear interdependence among the variables in audio AEs [88]. The reason is that as the number of optimization variables increases, the complexity of the problem grows exponentially, and the nature of the problem may also change.

[Algorithm 1: Occam. Key steps: find a new adversarial point x* close to the decision boundary via binary search; choose a grouping strategy and a group number according to the adaptive scheme (Section 3.4.4); decompose the audio vector into disjoint parts; in each optimization cycle, run a subspace CMA-ES in every part, assembling complete solutions with collaborative information from the other subspaces; accept a candidate solution whenever it lowers the objective L(·).]

Our Method
As noted in our threat model, the lack of internal knowledge (e.g., structures, parameters, gradients, and scores) about the target model further exacerbates the difficulty of crafting AEs. All the adversary can do is send a limited number of queries to probe the system and obtain the corresponding final decisions. More specifically, launching our decision-based black-box adversarial attack requires only the final decision, e.g., the final transcription, and this kind of audio AE generation can be formulated as a discontinuous, large-scale global optimization problem. To solve it, we resort to large-scale black-box optimization approaches.
Design Overview. Note that we aim to fool both commercial ASR and SR services with the minimum required information from the target model. To address the challenges, we propose a new class of cooperative co-evolution methods to generate effective audio AEs. Our method is mainly developed from CC-CMA-ES [60]. However, we cannot directly apply CC-CMA-ES to constructing audio AEs: in the audio domain, the correlations between variables change during the dynamic evolution process, which is not considered in [60]. To solve this problem, we devise an adaptive scheme that lets our strategy self-adapt to the environmentally changeable evolution process, which forms the core of our cooperative co-evolution framework.

In the literature, CMA-ES [43], known as an efficient evolutionary algorithm, has demonstrated good performance on many problems. But it loses its effectiveness when applied to large-scale global optimization problems due to "the curse of dimensionality". A dimensionality reduction strategy, which uses the bilinear interpolation method to project the original space (112 × 112 × 3) to a lower-dimensional search space (e.g., 45 × 45 × 3), has been carefully devised to effectively create adversarial images against face recognition models [34]. However, this method is not applicable to the audio domain, because bilinear interpolation works in two directions on images, while the inputs to the commercial Cloud Speech APIs are one-dimensional speech vectors. Thus, we for the first time introduce the general cooperative co-evolution (CC) framework to construct audio AEs in the strictly black-box setting, as shown in Figure 4. Our CC framework can scale up CMA-ES to the large-scale optimization problems we are facing, by decomposing these challenging problems into a set of simpler and smaller sub-problems and cooperatively optimizing each of them. Considering that the group size and the decomposition strategy play a crucial role in CC, we further propose an adaptive scheme to improve the attack efficiency by letting the size of sub-problems and the decomposition strategy self-adapt to the environmentally changeable evolution process.
Our Occam is presented in Alg. 1. In each optimization cycle, the original problem is decomposed into a set of smaller and simpler subproblems according to the selected grouping strategy. Then, a new offspring is generated from the current solution in each subproblem by a subspace CMA-ES whose parameters are extracted from a global CMA-ES, and the complete candidate solution can be further obtained by using the collaborative information from other subproblems and the generated offspring. The objective function is further used to evaluate the two solutions and choose the better one, based on which we update the covariance matrix accordingly. Next, we will describe each step of the algorithm in detail.
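The optimization cycle can be illustrated with a toy cooperative co-evolution loop. This is a deliberately simplified sketch: a plain (1+1) random perturbation stands in for the subspace CMA-ES of Alg. 1, random grouping stands in for the adaptive decomposition, and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def cc_minimize(loss, x0, n_groups=4, cycles=30, trials_per_group=10, sigma=0.1):
    """Cooperative co-evolution sketch: split coordinates into groups and
    improve each group in turn with simple (1+1) random perturbations."""
    x, best = x0.copy(), loss(x0)
    d = len(x)
    for _ in range(cycles):
        idx = rng.permutation(d)                 # random grouping each cycle
        for g in np.array_split(idx, n_groups):
            for _ in range(trials_per_group):
                cand = x.copy()
                cand[g] += sigma * rng.standard_normal(len(g))  # perturb one subspace
                val = loss(cand)                 # evaluated on the full vector
                if val < best:                   # greedy collaboration update
                    x, best = cand, val
    return x, best
```

Note how each subspace candidate is always evaluated as a full vector: the unperturbed coordinates act as the collaborative information from the other sub-problems.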
3.4.1 Initialization. As shown in our problem formulation, the optimization routine should start from an adversarial point, because the value of the objective function is equal to +∞ when the input is not adversarial. We first initialize x* with a natural adversarial sample distant from the original audio. More specifically, we utilize a text-to-speech API service to synthesize the desired speech as the initial adversarial audio sample against ASR systems. For SR systems, we initialize x* with an audio of the target speaker, which can be obtained from the speaker's self-recorded songs and videos posted on public social networks. Since the initial input audio is adversarial and the original one is not, we first utilize the binary search algorithm to effectively approach the decision boundary.
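The binary-search step toward the decision boundary can be sketched as follows; the linear interpolation between the two audios and the tolerance value are assumptions for illustration:

```python
import numpy as np

def binary_search_to_boundary(is_adversarial, x_orig, x_adv, tol=1e-3):
    """Move an adversarial starting point toward the original audio along the
    line between them, stopping just on the adversarial side of the boundary."""
    lo, hi = 0.0, 1.0                   # fraction of the way back toward x_orig
    while hi - lo > tol:
        mid = (lo + hi) / 2
        cand = (1 - mid) * x_adv + mid * x_orig
        if is_adversarial(cand):
            lo = mid                    # still adversarial: push further toward x_orig
        else:
            hi = mid
    return (1 - lo) * x_adv + lo * x_orig
```

Each probe of `is_adversarial` costs one query to the target API, so the logarithmic number of iterations keeps the initialization cheap.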

Covariance Matrix Adaptation Evolution Strategy.

CMA-ES, known as an efficient derivative-free method, generates offspring by sampling a multivariate normal distribution with covariance C, i.e., N(0, C). To facilitate understanding, we provide a brief summary of CMA-ES, as shown in Figure 5. Since the covariance is the measure of the relationship between two random variables, CMA-ES can use the selected sample distribution to estimate the covariance for learning dependencies between variables [43]. Due to the unreliability of this estimation for small samples, covariance matrix adaptation using the history information and the evolution path p_c, a sequence of successive and normalized steps, has thus been introduced. By adaptively updating the estimated covariance, CMA-ES is able to find better search directions, and achieves a powerful local search by modeling local geometries.
Our framework uses a simple yet effective variant of CMA-ES, i.e., (1+1)-CMA-ES [48]. This version generates one candidate solution from the current solution by sampling random noise, and selects the better of the two according to the objective function. Since the direction of optimization in our attack is to find a new adversarial audio closer to the original audio according to the objective function L(·), we adopt a modified (1+1)-CMA-ES [34] that improves efficiency by adding a bias term κ(x − x*) to the current solution x*:

x*_{t+1} ∼ N(x*_t + κ(x − x*_t), σ² · C),

where σ is the global step size, C is the covariance matrix that determines the shape of the distribution, and κ is a parameter that controls the degree of proximity towards the original audio x. Furthermore, we can directly discard candidate solutions that are farther from the original audio, regardless of whether they are adversarial. Note that the covariance matrix plays a very important role in CMA-ES since it models the local geometries to improve the local search efficiency. However, adapting the covariance matrix with complexity O(n³) may be infeasible when the input dimension n is huge. To speed up the computation, the covariance matrix can be simplified as a diagonal matrix [34] and updated adaptively by the evolution path p_c as

p_c ← (1 − c_c) · p_c + √(c_c(2 − c_c)) · z/σ,
c_ii ← (1 − c_cov) · c_ii + c_cov · (p_c)_i²,

where z is the accepted search step, and c_c and c_cov are the parameters controlling the adaptation of p_c and C, respectively. The update enlarges the variance along the past successful directions for future search.
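One iteration of this biased, diagonal (1+1)-CMA-ES can be sketched as follows. The parameter defaults are illustrative, not the paper's settings, and the loss is assumed to be the decision-only objective from Section 3:

```python
import numpy as np

rng = np.random.default_rng(1)

def one_plus_one_step(loss, x_star, x_orig, c_diag, p_c, sigma,
                      kappa=0.01, c_c=0.01, c_cov=0.08):
    """One biased (1+1)-CMA-ES iteration with a diagonal covariance matrix.

    Sample around x* nudged toward the original audio x_orig, keep the
    candidate only if it lowers the loss, and on success adapt the evolution
    path and the covariance diagonal along the successful direction."""
    z = sigma * np.sqrt(c_diag) * rng.standard_normal(len(x_star))
    cand = x_star + kappa * (x_orig - x_star) + z
    if loss(cand) < loss(x_star):
        # evolution path accumulates normalized successful steps
        p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c)) * z / sigma
        c_diag = (1 - c_cov) * c_diag + c_cov * p_c ** 2
        return cand, c_diag, p_c, True
    return x_star, c_diag, p_c, False
```

Iterating this step drives x* toward the original audio while staying adversarial, since non-improving (in particular, non-adversarial) candidates are rejected.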

Cooperative Co-evolution.
It has been proven that the performance of evolutionary algorithms may drop significantly as the dimensionality of the problem increases [60,89], because the complexity of the problem grows exponentially and the properties of the problem may also change. To scale up CMA-ES to the high-dimensional optimization problem, we use cooperative co-evolution (CC) to conduct the large-scale black-box optimization in a divide-and-conquer manner, by decomposing the large-scale problem into several smaller sub-problems and optimizing each sub-problem alternately and iteratively. Considering that each sub-problem is only a part of the original problem, collaborative information from the other sub-problems is required to evaluate individuals in the current sub-problem. Generally speaking, the best solutions of each sub-problem in the current cycle are used as the collaborative information. However, considering the query limitation in our attack, we propose a greedy strategy to produce and update the collaborative information. More specifically, we do not optimize the sub-problems concurrently but alternately and iteratively: when optimizing a sub-problem, we evolve the values of its variables, substitute them for the corresponding variables in the current best solution, and thereby generate a complete solution. The complete solution is then evaluated by the objective function L(·) to locate a better solution. After the optimization of each subspace, the collaborative information is updated correspondingly. Since CC is a general divide-and-conquer framework for solving large-scale black-box optimization problems, it is generalizable to other black-box methods that are trapped in such problems and can also improve their effectiveness.
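The greedy collaboration step, where a subspace candidate is embedded into the current best full solution before evaluation, can be sketched as a tiny helper (the function name is my own):

```python
import numpy as np

def complete_solution(best, group_idx, group_values):
    """Embed a subspace candidate into the current best full solution so the
    objective can score it: variables outside the group act as collaborators."""
    full = best.copy()
    full[group_idx] = group_values
    return full
```

The copy matters: the stored best solution is only replaced if the completed candidate scores better under L(·).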

Adaptive Decomposition.
The grouping strategy, which determines how to assign variables to different groups, plays a crucial role in CC. However, there is insufficient knowledge about the correlations between variables, making it extremely hard to manually devise or choose the most suitable grouping strategy when applying CC. Therefore, we propose an adaptive approach, which puts several popular grouping strategies into a candidate pool and adaptively selects a proper decomposition strategy from it. The candidate pool includes four grouping strategies: Static grouping (SG), Random grouping (RG), Min-variance grouping (MiVG) and Max-variance grouping (MaVG). We adopt SG to preserve the information in the time domain, while RG [89] randomly allocates variables to subspaces to improve the probability of placing two interacting variables in the same subspace. Considering that the covariance matrix is used to model the local geometries of the search directions, we also adopt MiVG and MaVG, which are devised for CC-CMA-ES and respectively minimize or maximize the diversity of the covariance diagonal values of the variables within the same subspace. At the beginning of each optimization cycle, we randomly select a decomposition strategy. Based on its performance, we calculate the selection probability of each grouping strategy for the next cycle.
We observe that the levels of interdependency among variables in the audio vector will change significantly from the natural audio to the audio AE during the optimization process [88]. Therefore, we propose to adaptively adjust the group size to capture different interdependency levels in the dynamic evolution process. Then, we can make a good trade-off between the effectiveness and the efficiency of the optimization. The details of the adaptive decomposition algorithm can be seen in Appendix A.
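The four grouping strategies in the candidate pool can be sketched as follows. This is an illustrative realization under my own assumptions: in particular, the interleaving used for MaVG is one plausible way to maximize within-group variance diversity, not necessarily the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(2)

def static_grouping(d, m):
    """SG: contiguous blocks, preserving time-domain locality."""
    return np.array_split(np.arange(d), m)

def random_grouping(d, m):
    """RG: shuffle indices so interacting variables may land together."""
    return np.array_split(rng.permutation(d), m)

def variance_grouping(c_diag, m, maximize=False):
    """MiVG / MaVG: sort coordinates by the covariance diagonal so each group
    has similar (MiVG) or diverse (MaVG) variances."""
    order = np.argsort(c_diag)
    if maximize:
        # interleave small and large variances across groups
        return [order[i::m] for i in range(m)]
    return np.array_split(order, m)
```

The adaptive scheme would pick one of these per cycle and update each strategy's selection probability based on how much it improved the objective.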

Parameter Adjustment.
Our algorithm contains many hyperparameters. Following [34], we set four of them to 0.001 · d(x*, x), 0.01, 0.001, and 15, respectively; we further set the remaining two to 30 and 0.08. Since the step size σ has an important impact on the search process, it needs careful tuning; we therefore adopt the 1/5th success rule [15] to update σ whenever a better solution is obtained.
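The 1/5th success rule can be sketched as follows. The adaptation factor 0.82 and the initial step size are illustrative choices of ours, not values from the paper.

```python
def fifth_rule(sigma, successes, trials, factor=0.82):
    """1/5th success rule: if more than 1/5 of recent mutations succeeded,
    enlarge the step size; if fewer, shrink it; otherwise keep it."""
    rate = successes / trials
    if rate > 0.2:
        return sigma / factor   # search is too timid: widen the step
    if rate < 0.2:
        return sigma * factor   # too many failures: narrow the step
    return sigma

sigma = 0.08                                       # illustrative initial step size
sigma = fifth_rule(sigma, successes=1, trials=10)  # 10% success rate -> shrink
print(round(sigma, 4))
```

The rule keeps the empirical success rate of mutations near 1/5, a classic heuristic balance between exploration and progress.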

NI-OCCAM: A NON-INTERACTIVE PHYSICAL ATTACK AGAINST VOICE CONTROL DEVICES

Threat Model
In this section, our target is commercial voice control devices. We consider the most rigorous and practical assumption, called the non-interactive physical setting, where the adversary makes no query to the oracle. Compared with prior physical attacks, the key advantage of non-interactive physical attacks is that we do not need to query the target devices to generate effective audio AEs, thus saving a potentially large query cost. In this sense, this attack is the most practical one in the real world.

Technical Challenges
Compared to our decision-only adversarial attacks, the non-interactive black-box setting is much more challenging, since it further removes the dependence on the final decision available in the decision-only scenario. That is to say, the target model is completely unknown to the adversary. Moreover, voice control devices present an additional challenge: the constructed audio AEs must remain robust when played in the physical world. By physical attacks against voice control devices, we mean that audio AEs are played by a speaker and recorded by the device. Since the effectiveness of audio AEs is greatly affected by the reverberation of the environment and by perturbations from the speaker and the microphone [75,86], physical attacks are difficult to launch. These two obstacles pose severe challenges to crafting effective audio AEs.

Our Method
As described in our threat model, the adversary issues no queries to probe the target model and obtains no feedback in the non-interactive black-box setting. Thus, our Occam cannot be applied in this case. Intuitively, it is almost impossible to directly construct an audio adversarial example against the target model by solving the optimization problem without any interaction. In fact, we face the problem of generating AEs with no information during the whole attack process. In the image domain, prior works have demonstrated that AEs crafted for one model are able to attack other unknown models, a property called the transferability of AEs. In the audio domain, however, the poor transferability of audio AEs [35,90] among different ASR systems indicates that we cannot directly leverage this property, especially when there is no interaction with the target model.
Observing that the ultimate goal of speech recognition systems is to convert natural speech into text, we believe that including the characteristics of natural command audios in the constructed audio AEs may help improve their transferability. Inspired by model inversion attacks [29,36,94], which aim to recover input data or its sensitive attributes from the model output, we propose NI-Occam to craft audio AEs in which the command voice is recreated and implicitly embedded in the original music via gradient updates. The main reason behind this is that it is hard for people to perform speech separation on our constructed audio AEs and thus recognize the malicious speech commands.

Finally, the audio AEs we construct, just like natural command audios, remain robust in over-the-air attacks and are naturally effective in the physical world. Besides, compared to cloud speech APIs, voice control devices are more vulnerable to audio AEs containing command audios, since they are more sensitive to voice commands than speech APIs, as demonstrated in Devil's Whisper [32]. Thus, we propose NI-Occam, which realizes non-interactive physical attacks against voice control devices. Notably, this is the first physical attack that can effectively create audio AEs without any feedback information. Previous physical attacks, e.g., Devil's Whisper and FakeBob, rely heavily on returned scores to generate effective audio AEs, while our attack is non-interactive, i.e., it requires no access to the target devices.
Our proposed NI-Occam is presented in Alg. 2. Specifically, we choose the open-source Kaldi model (ASpIRE Chain Model) as both the substitute model and the inversion model, due to its simple neural network structure and excellent performance. As shown in Figure 6, the Kaldi model obtains MFCC features through feature extraction and feeds them into a DNN, whose output is the probability density function (pdf). Following the idea of "pdf-id sequence matching" proposed in CommanderSong [90], we can recover the command audios from their pdf-id sequences via gradient inversion. Different from model inversion attacks, which start from Gaussian noise n ∼ N(0, σ²), adversarial attacks start from the original example x. Therefore, we add Gaussian noise to the original example to bridge this gap. The audio AEs can be described as

x* = arg min J(x* + n, y) s.t. ‖x* − x‖∞ ≤ ε, n ∼ N(0, σ²),   (11)

where y is the target pdf-id sequence and J(·,·) is the loss function [90]. Note that the Gaussian noise added to the original example is very large, so as to alleviate the impact of the original example on the model inversion process. To facilitate convergence, we gradually attenuate the magnitude of the Gaussian noise during the iterations. Besides, very recent works [83,87] have shown that the AdaBelief method [96] improves the transferability of adversarial examples, so we use the AdaBelief optimizer to solve Eq. (11). The implementation details are: the learning rate is set to 0.003, the standard deviation σ to 0.25, and the perturbation size ε to 0.3.

Algorithm 2 NI-Occam
Input: the original example x, the command audio y′, the loss function J(·,·), the Kaldi model M, the learning rate lr, the standard deviation σ, and the perturbation size ε.
Output: the produced audio adversarial example x*.
1: Obtain the target pdf-id sequence y through the Kaldi model M and the command audio y′;
2: Initialize x* with x, the optimizer state with ∅, and the learning rate of the AdaBelief optimizer with lr;
3: while not converged yet do
4:   Sample n ∼ N(0, σ²);
5:   Use the AdaBelief optimizer to minimize J(x* + n, y) and update x*;
6:   Clip x* into the ε-vicinity of x;
7: end while
8: return x*
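The loop of Alg. 2 can be sketched end-to-end on a toy loss. This is a self-contained illustration under our own assumptions: a quadratic loss with an analytic gradient stands in for the Kaldi pdf-id loss J, the AdaBelief update follows its standard formulation, and all variable names are ours, not the paper's code.

```python
import numpy as np

def adabelief_step(x, g, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaBelief update: like Adam, but the second moment tracks the
    deviation of the gradient from its running mean (the 'belief')."""
    m, s, t = state
    t += 1
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * (g - m) ** 2
    mh, sh = m / (1 - b1 ** t), s / (1 - b2 ** t)   # bias correction
    return x - lr * mh / (np.sqrt(sh) + eps), (m, s, t)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)        # original example (e.g. a music clip)
y = rng.standard_normal(64)        # toy target; real attack targets pdf-ids
x_adv, eps_ball = x.copy(), 0.3    # perturbation budget from the paper
state = (np.zeros_like(x), np.zeros_like(x), 0)
sigma = 0.25                       # initial noise std, attenuated over time
for i in range(500):
    n = sigma * (1 - i / 500) * rng.standard_normal(64)  # shrinking noise
    g = (x_adv + n) - y            # gradient of the toy loss ||v - y||^2 / 2
    x_adv, state = adabelief_step(x_adv, g, state, lr=0.003)
    x_adv = np.clip(x_adv, x - eps_ball, x + eps_ball)   # stay near x
print(round(float(np.mean(np.abs(x_adv - x))), 3))
```

The added noise, attenuated across iterations, plays the role of the large Gaussian noise in Eq. (11), while the clip implements the ε-vicinity constraint of step 6.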

IMPLEMENTATION AND EVALUATIONS

Experiment Settings
Experiment Design. To evaluate the performance of Occam, i.e., our decision-only digital attacks on cloud speech APIs, we design four sets of experiments: attacks on open-source ASR systems, attacks on ASR services, attacks on SR services, and human perception of the audio AEs. For the attacks on ASR services, we choose ten frequently-used voice commands that are expected to be recognized by the target systems. The original audios are selected from three datasets: Common Voice [8], Song [90], and LibriSpeech [67]. We select ten test samples in each experiment to evaluate the performance. To generate audio AEs, we query the commercial speech-to-text cloud services in Table 2 and obtain the final transcription in return, as required by our proposed algorithms. To access the commercial cloud APIs, we first register on the platforms and then query the speech-to-text services with audio AEs using the API keys provided by the platforms. For the attacks against SR services, we conduct our targeted attack on three systems (also in Table 2). For Microsoft SI and Jingdong SV, we choose ten people from the Voxceleb dataset [64] and enroll four utterances for each person. Since Microsoft SV is text-dependent, we collect five volunteers' voice data, and each volunteer enrolls 10 fixed sentences. To generate AEs, we query the APIs 10,000 times for each target person.
To evaluate the performance of NI-Occam, i.e., our non-interactive physical attacks on voice control devices, we design two sets of experiments: attacks on popular commercial voice assistants and human perception of the audio AEs. We evaluate NI-Occam on Google Assistant, Apple Siri, Microsoft Cortana, iFlytek, and Amazon Echo, as shown in Table 3. We generate 10 sets of audio AEs locally, whose target phrases are the same as those in the evaluation of Occam. We then play them using a JBL Clip 3 portable speaker near the target devices in a quiet laboratory. The distance between the speaker and the target devices is around 15cm. Methods for Comparison. We compare Occam with two state-of-the-art black-box adversarial attacks against ASR and SR services, i.e., Devil's Whisper [32] and FakeBob [26]. Since Devil's Whisper originally utilized confidence scores to filter the synthetic audio data, we omit this step in the follow-up evaluations to adapt Devil's Whisper to the decision-based setting. To better evaluate the effectiveness of Occam, we also select five decision-based attacks for comparison: the boundary attack [21], the HopSkipJump attack (HSJA) [27], the opt-attack [33], the evolutionary attack [34], and the differential evolution attack (DEA) [68].
We compare NI-Occam with a straightforward non-interactive attack, i.e., the superposition attack, in which we directly superimpose the original example and the command audio. Since this procedure requires no knowledge of the target systems, the superposition attack is non-interactive. Evaluation Metrics. We use the success rate of attack (SRoA) to evaluate the effectiveness of AEs; SRoA is the proportion of AEs that successfully attack the ASR or SR services. Besides, we use SNR to describe the perturbation in the audio AEs, and the number of queries to the target model to indicate the efficiency of the attacks. It is worth noting that neither SRoA nor SNR alone determines whether an AE can successfully attack the target system: an effective AE should fool both the model and the human. Hence, we consider SRoA and SNR jointly when evaluating the effectiveness of AEs. To illustrate this point, we conduct a user study to analyze how SNRs affect human perception; the results show that a lower SNR makes it easier for users to notice or even recognize the AEs. Due to the space limit, we present the detailed results in Appendix E.
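For reference, the SNR metric used throughout the evaluation is the ratio of the power of the original signal to the power of the added perturbation, in decibels. The test tone and noise level below are synthetic.

```python
import numpy as np

def snr_db(signal, perturbed):
    """SNR = 10 * log10(P_signal / P_noise), with the perturbation taken
    as the difference between the perturbed and the original audio."""
    noise = perturbed - signal
    return 10 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz test tone
ae = x + 0.05 * rng.standard_normal(16000)              # small added perturbation
print(round(float(snr_db(x, ae)), 2))
```

A higher SNR means a smaller perturbation relative to the carrier audio, i.e., a less perceptible AE, all else being equal.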

Evaluation on Cloud Speech APIs
Effectiveness of Occam on ASR services. Table 4 shows the performance of the targeted attacks on different commercial speech services after 30,000 queries (10,000 queries on Google). With regard to SNR, Occam performs the best among the seven attacks, reaching a best SNR of 17.84dB. Notably, DEA only obtains an average SNR of 6.37dB, which means that the perturbations of its audio AEs are very large. The results demonstrate that, as the gradient-free optimization method in Occam's cooperative co-evolution framework, CMA-ES is a better choice than differential evolution. The reason may be that CMA-ES is more suitable for non-separable optimization, since it can learn the dependencies between variables. However, the average SNR of AEs generated by the evolutionary attack (which is based solely on CMA-ES) only reaches 7.11dB, while Occam achieves an average SNR of 14.37dB. This indicates that although CMA-ES can handle non-separable optimization, it becomes ineffective at solving the discontinuous large-scale global optimization problem on complex speech data; CMA-ES alone cannot deal with such data well. The above results demonstrate that Occam can effectively manage high-dimensional audio data.
Compared to Devil's Whisper, Occam achieves 100% SRoAs on all API services, while Devil's Whisper only achieves an average SRoA of 54%. This indicates that Devil's Whisper is less effective than Occam at fooling the target speech recognition models. Figure 10 in Appendix D.2 shows the waveforms and spectrograms of the original audio and the adversarial audios generated by Occam and Devil's Whisper. The waveforms show that the audio AE generated by Occam is almost the same as the original audio, while the differences introduced by Devil's Whisper are more noticeable and thus more likely to be perceived by humans. Besides, although we only choose ten commands in the experiments, Occam can generate AEs for arbitrary phrases, while Devil's Whisper can only generate a limited number of target phrases with a trained model. These observations suggest that our decision-based Occam is more effective, powerful, and practical. We also evaluate untargeted attacks; due to the space limit, the related results are given in Appendix D.2. Effectiveness of Occam on SR services. We evaluate the SNRs and SRoAs of the AEs against SR services, as shown in Table 5. Among the attacks against SR services, Occam still achieves 100% SRoAs, indicating that its audio AEs can be successfully recognized as the target person (or mislead the SR system). FakeBob can only achieve a 1% SRoA. Note that we tested FakeBob with 200 instead of 10 AEs: since its success rate is so low, we enlarged the AE set in order to obtain an effective AE. We find that the results reported by FakeBob are higher than those in Table 5. Note that (i) we only evaluate untargeted attacks against Microsoft Azure SI; for speaker verification systems, the goal of untargeted attacks is the same as that of targeted attacks; and (ii) N/A denotes "not available": since FakeBob produces no effective AE against Microsoft SI and SV, the SNR is not available.

Evaluation of NI-Occam against Voice Control Devices
Effectiveness. We test the effectiveness of NI-Occam on 5 voice control devices, and the results are given in Table 6; detailed results on individual commands can be found in Appendix F (see Table 12). If the audio AE can be correctly recognized by the device as the target command within 3 attempts (i.e., playing the AE at most three times), we consider the AE successful. Overall, NI-Occam achieves an average SRoA of 52% and SNR of 9.65dB. Note that NI-Occam is a non-interactive attack that requires no access to the target devices, which is very practical since some devices, like Apple Siri, do not provide a programmable API. For example, Devil's Whisper [32] fails to generate effective AEs when confronted with devices that do not return confidence scores. We also find that NI-Occam performs well on Apple Siri, while Devil's Whisper failed to attack updated versions of Siri. This indicates that NI-Occam is more effective at leveraging the transferability of the AEs and can generate successful AEs with minimal information about the target devices. We also evaluate a simple non-interactive attack as the baseline, i.e., the superposition attack. Since the AEs of the superposition attack are constructed by superimposing two audios, the SNR is adjustable; we set the SNR to 7.00dB, smaller than those of NI-Occam. A smaller SNR means the target command audio is more prominent in the audio AE, making it easier for the devices to recognize. However, the superposition attack only achieves an average SRoA of 20% at 7.00dB SNR. More importantly, the superposition attack is easily perceived by humans, which we will discuss shortly. We also evaluate the impact of the number of attempts on SRoA with a larger group of target phrases: increasing the attempts raises the SRoA to 70%, and NI-Occam also performs well on a large set of 60 commands with an SRoA of 71.7%. Detailed results are given in Appendixes F and G. Human Perception. Although SNR describes the proportion of noise, it cannot fully reflect the imperceptibility of the audio. For example, if the noise blends well with the background, users cannot perceive it even though the signal has a low SNR. Conversely, if all the noise is concentrated in a small piece of the audio, users may readily perceive the command even though the overall SNR is high.
To evaluate the performance of NI-Occam on human perception, we surveyed 37 volunteers aged from 19 to 24 (who are sensitive to sound), including 21 males and 16 females. In the experiment, we first show some examples of "noise", "normal", "talking", and "recognized", and then ask the volunteers to listen to the audio AEs and report their views. Specifically, the audio AEs are generated by NI-Occam and the superposition attack, and all of them can successfully attack the devices. Each volunteer ranks 6, 6, 6, 4, and 4 successful AEs from NI-Occam on Apple Siri, iFlytek, Microsoft Cortana, Google Assistant, and Amazon Echo, respectively (the AEs are the ones that can fool the devices, as shown in Table 6). Table 7 presents the results of the human perception experiments. Note that (i) "Normal" means that the volunteer regards the audio as a normal audio; (ii) "Noise" means that the volunteer can hear some noise; (iii) "Talking" means that the volunteer can hear talking in the audio, and in that case he/she is asked to recognize its content; and (iv) the audio is labeled "once-recognize" or "twice-recognize" if the volunteer recognizes over half of the content after listening to the audio once or twice, respectively. Overall, NI-Occam performs much better than the superposition attack. More than 67% of the volunteers consider the audio generated by NI-Occam normal or merely noisy, while 100% of the volunteers can recognize the AEs of the superposition attack. This is because audios crafted by superimposition are more easily noticed due to the human ear's excellent ability at speech separation. We also assume that a moderate noise level is acceptable, because the background environment may be noisy and the equipment may emit some noise due to a temporary fault. The results show that NI-Occam is stealthy and cannot be easily perceived.

RELATED WORK

Audio Adversarial Examples
Adversarial Examples Against ASR Systems. Despite the great success of adversarial examples in the image domain, the transcription of spontaneous speech poses a more significant challenge for crafting audio AEs. Early results clearly indicated that ASR systems are inherently vulnerable to AEs in white-box settings. Among others, Carlini et al. [25] were the first to implement an iterative optimization-based attack, with a success rate of 100% on the end-to-end Mozilla DeepSpeech model. However, their attack takes approximately one hour to produce one adversarial example on a single NVIDIA 1080Ti GPU, and the crafted adversarial samples fail when played over the air. Concurrently, CommanderSong [90], which embedded malicious commands into popular songs, was reported to successfully attack Kaldi [2]. It further launched a very limited over-the-air attack that heavily depends on the recording devices, speakers, and room settings. Efforts were also made in [86] to build robust over-the-air AEs by utilizing impulse responses to simulate reverberation, with a success rate of around 60%. Moreover, Chen et al. [31] achieved a 90% success rate in over-the-air attacks at distances of up to 6m by further removing device- and environment-specific features. Besides, imperceptible audio AEs were produced in [74] via the psychoacoustic model, and AEs that are both imperceptible and robust were constructed in [69] against the Lingvo ASR system, with a success rate of 50%.
Compared to the above attacks, ours have the following superiorities: 1) our attacks are more practical, since those attacks require access to the internal information of the target model; 2) instead of targeting a single open-source system in the white-box setting, our attacks evaluate the robustness of many representative audio processing systems in real-world scenarios.
While white-box attacks based on adversarial examples have obtained excellent results against open-source ASR systems, their black-box counterparts had not made much progress until recently. By leveraging the last layer (i.e., logits) of the DNNs inside DeepSpeech, Taori et al. [80] obtained the fitness score of adversarial inputs, and combined genetic algorithms with gradient estimation to attack DeepSpeech. Apart from the rather low success rates even after 300,000 queries, this type of black-box attack is not applicable to commercial systems. Following this work, a selective gradient estimation attack [82] achieved a success rate of 98% in this setting. Moreover, multi-objective genetic algorithms were introduced to launch a black-box adversarial attack against ASR systems [53]; however, due to the relatively large word error rate (WER) after many evolutions, this attack is ineffective for generating audio AEs. A very recent work, Devil's Whisper [32], utilized confidence scores exposed by commercial ASR systems to launch a black-box attack. It is worth noting, however, that many commercial ASR systems do not return any score information, and service providers are apt to hide these scores to reduce the risk of adversarial attacks, considering that such information has almost no effect on the user experience and may mainly be exploited by malicious attackers.
There are two key differences: 1) our Occam, as a generic attack, requires only the final decisions to generate audio AEs, and is effective against both commercial ASR and SR services; 2) our NI-Occam requires no access to the target devices, yet can still generate effective audio AEs. Our attacks are thus more practical in real-world scenarios. Adversarial Examples Against SR Systems. When it comes to adversarial example generation against SR systems, relatively little work has been done in either the white-box or the black-box case. Kreuk et al. [55] presented a white-box adversarial attack against a DNN-based speaker verification system. Gong et al. [39] demonstrated the vulnerability of a speaker identification system to audio AEs in the white-box scenario. Obviously, these attacks require access to the internal structures and parameters of the target systems, and they are impractical against commercial SR systems.
In a recent work, SirenAttack [35] launched a black-box attack against a number of classification-oriented acoustic systems, including SR systems, via the predicted probabilities/scores. More recently, FakeBob [26] also utilized the predicted probabilities/scores to conduct a black-box adversarial attack against the Talentedsoft API [7] with a success rate of 100%. The main limitation of SirenAttack and FakeBob is that they lose effectiveness when applied to commercial SR systems, which usually hide the predicted scores. For example, the Microsoft Azure SR API service only provides the decision (i.e., the predicted speaker) along with three confidence levels (i.e., low, normal, or high). Our attack shows its superiority by achieving an attack success rate of 100% against commercial SR systems even when the service provider conceals the prediction scores.

Other Types of Attacks
In addition to audio AEs, researchers have discovered that intelligent voice systems are vulnerable to other types of attacks, including misinterpretation attacks and hidden voice attacks. Misinterpretation Attacks. Kumar et al. [57] conducted an empirical analysis of misinterpretations and investigated the security implications on Amazon Alexa, based on which they introduced a new attack called skill squatting to surreptitiously route users to malicious third-party public services. Along this direction, Zhang et al. [93] reported similar attacks against ASR systems, where a malicious skill with a similarly pronounced name or a paraphrased name is exploited to impersonate a benign skill. Also targeting ASR systems, Zhang et al. [95] designed a linguistic-model-guided fuzzing tool called LipFuzzer to systematically discover misinterpretations leading to malicious attacks. Hidden Voice Attacks. By exploiting knowledge of the feature extraction algorithm or hardware vulnerabilities in microphone circuits, an adversary can embed hidden commands into an audio carrier, in the form of noise, thereby compromising intelligent voice systems. Toward this goal, hidden voice commands [23] utilized inverse MFCC to craft obfuscated commands against ASR systems in a "black-box" manner, with considerable human effort for obtaining feedback. Four different perturbations were introduced in [12] to generate noise-like adversarial audios against ASR and SR systems. Moreover, DolphinAttack [92] modulates voice commands onto inaudible ultrasound, which can be interpreted by the ASR system by exploiting the non-linearity of microphone circuits. Compared to adversarial example based attacks, however, these attacks can be more easily defended against or perceived.
Our initial idea stems from the image-based adversarial attacks. We have overcome particular challenges for generating audio AEs. Compared to the state-of-the-art black-box attacks on acoustic systems [26,32,35,80], ours is more generic and practical.

DISCUSSIONS
We discuss four possible countermeasures to defend against Occam and NI-Occam below, with detailed performance results in Appendix H (Tables 14 and 15). Local Smoothing. Because audio AEs are carefully constructed by adding small perturbations, they can be mitigated by local smoothing. We can apply a sliding-window median or mean filter to the adversarial audio signal; for example, given a data point x_i, we replace it with the average of the window [x_{i−h}, ..., x_i, ..., x_{i+h}]. Since the adversarial perturbation in our attacks is carefully constructed, the audio AEs may become less effective after the local smoothing transformation. For example, when h = 1, the SRoA of Occam drops from 100% to 20%, while the SRoA of NI-Occam drops from 60% to 40%. When h = 3, all AEs of Occam fail, while NI-Occam retains an SRoA of 40%. NI-Occam is thus more robust to local smoothing; the reason is that the key recognizable parts of NI-Occam's AEs are recovered natural command audios, which are inherently robust. Downsampling. By sampling theory, high-frequency information in the audio is lost during downsampling, which may disrupt the perturbations in the audio AEs. Thus, audio AEs crafted by Occam fail after a downsampling and upsampling process, e.g., when the audio AEs are first downsampled to 12kHz and then upsampled to 16kHz, as indicated in our experiments. Nonetheless, NI-Occam retains a higher SRoA, i.e., 30% when the downsampling rate (DSR) is 12kHz, indicating that the AEs of NI-Occam have a better chance of resisting downsampling. While downsampling can help mitigate our attacks, this countermeasure becomes invalid if the downsampling/upsampling rates are known to the attacker, because the adversary can directly compensate the generated example for the information lost in the downsampling and upsampling process. Temporal Dependency Based Approach. 
The inherent temporal dependency in audio data was recently utilized in [88] to detect audio AEs, whose temporal information is disrupted. Specifically, the first k (0 < k < 1) portion of the audio is selected and transcribed as S_k, while the first k portion of the transcription of the whole audio, denoted {S_whole, k}, is also obtained and compared with S_k. By checking the consistency between {S_whole, k} and S_k, audio AEs crafted by Occam can be easily detected. This approach has a strong discriminative ability in identifying AEs targeting ASR systems, and in theory it can identify almost all audio AEs against ASR systems. Even so, Occam is still effective in attacking SR API services, since AEs constructed for SR systems preserve temporal information like natural audios. For instance, Occam still achieves a success rate of 80% against the Microsoft Azure speaker identification API when our audio AEs are randomly split into two parts, i.e., the first k and the remaining 1 − k portions of the audio.
The defense in [88] is better suited to ASR tasks, where the temporal dependency is stronger. To evaluate the effectiveness of the temporal dependency based approach against (NI-)Occam for ASR, we build a dataset of 40 audio AEs generated by (NI-)Occam together with 40 natural command audios, and calculate the detection rates. For each audio sample, we randomly select k in the range [0.2, 0.8] and split the audio into two pieces. We then measure the consistency between the transcriptions of the split audios and that of the whole audio by the word error rate (WER).
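The WER-based consistency check can be sketched as follows; `wer` is plain word-level edit distance, and the transcriptions are invented for illustration.

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance normalized by the
    reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# Benign audio: the prefix transcription matches the whole-audio prefix.
print(wer("open the front door", "open the front door"))  # 0.0
# An AE whose perturbation is disrupted by splitting: the texts diverge.
print(wer("open the front door", "often the fraud or"))   # 0.75
```

A low WER between the prefix transcription and the prefix of the whole-audio transcription indicates a consistent (likely benign) audio; a high WER flags a potential AE.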
With a varying WER threshold, we can obtain the ROC curve and calculate the area under the curve (AUC), where a higher AUC indicates better detection performance. The AUC is 100% for Occam, showing that the defense can successfully detect all AEs generated by Occam. However, the AUC drops to 68% when classifying AEs from NI-Occam, which means that NI-Occam is robust to the temporal dependency based approach. This is because the adversarial perturbation generated by NI-Occam usually occurs only in a small piece of the audio, and splitting the audio does not disrupt the temporal dependency. Therefore, it is hard for the detector to identify the AEs generated by NI-Occam from the temporal information. Adversarial Training. Adversarial training [40,61] is one of the most effective defenses against adversarial attacks in the image domain [17]. Its basic idea is to train models on AEs to make them robust to AEs. It can be formulated as a min-max optimization problem:

min_θ E_{(x,y)} [ max_{‖δ‖_p ≤ ε} J(f_θ(x + δ), y) ],   (12)

where f_θ denotes the deep learning model, (x, y) denotes the original data point, δ is the adversarial perturbation, and ‖·‖_p is the p-norm.
Here, the worst-case samples for the given model are found via the inner maximization problem, and a more robust model is trained via the outer minimization. So far, adversarial training has been extensively studied on image classification tasks [18,22,76,81,85]. However, there is relatively little research on this defense for speech recognition tasks, which are more complex and challenging. Besides, generating strong audio AEs incurs high computation costs; for example, [25] reports that it takes approximately one hour to construct an audio AE on a single NVIDIA 1080Ti GPU, far exceeding the time for generating an adversarial image. To study the effect of adversarial training on ASR systems, we evaluate its performance on the open-source Kaldi. For solving the inner maximization problem of Eq. (12), our targeted NI-Occam is not suitable, as it performs a minimization; we thus adopt the untargeted projected gradient descent (PGD) attack [61], which is commonly used in adversarial training:

δ^{t+1} = P_{x,ε}( δ^t + α · sign(∇_δ J(f_θ(x + δ^t), y)) ),   (13)

where P_{x,ε} is a projection operator onto the ε-ball around x and α is the step size. According to our experimental results on Kaldi (Mini LibriSpeech model; see Table 15 in Appendix H), adversarial training is indeed an effective defense against our targeted NI-Occam attack, e.g., the SRoA of the AEs drops to 30% when ε = 0.002. But meanwhile, the WER of the model increases from 10.69 to 19.82, meaning that the accuracy of the speech recognition drops by about 10%. Moreover, as ε increases, both the SRoA and the model accuracy are significantly reduced. In the extreme case where almost all audio AEs fail, the model accuracy drops by about 20%. If ε were further increased to eliminate our attack, e.g., approaching ε = 0.006, adversarial training does not even converge. The reason may be that the audio vector contains many values very close to zero, which are changed drastically by a large ε.
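The PGD inner maximization can be sketched on a toy loss with an analytic gradient. The quadratic loss and all constants here are illustrative assumptions, not the Kaldi training setup; only the ε = 0.002 budget echoes the experiment above.

```python
import numpy as np

def pgd(x, grad_fn, eps=0.002, alpha=0.0005, steps=10):
    """Untargeted PGD: gradient ascent on the loss with sign steps,
    projecting the perturbation back onto the L-infinity ball of radius eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x + delta)
        delta = delta + alpha * np.sign(g)   # ascent step on the loss
        delta = np.clip(delta, -eps, eps)    # projection P_{x,eps}
    return x + delta

c = np.zeros(8)                              # toy loss J(v) = ||v - c||^2 / 2
x = 0.01 * np.ones(8)                        # clean input
grad = lambda v: v - c                       # analytic gradient of the toy loss
x_adv = pgd(x, grad)
print(np.max(np.abs(x_adv - x)) <= 0.002 + 1e-12)  # True: stays in the ball
```

In adversarial training, each minibatch would be replaced by such worst-case samples before the usual gradient step on the model parameters, realizing the outer minimization of Eq. (12).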
Overall, adversarial training on Kaldi is effective against our attacks. However, while it reduces the SRoA by 70%∼90%, it also inevitably brings a significant performance degradation, i.e., a 10%∼20% accuracy loss, which is unacceptable for commercial ASR systems. Moreover, adversarial training significantly increases training time and costs [51,76], particularly for ASR systems. For example, for the Mini LibriSpeech dataset, which contains only 5 hours of audio data, it took about 10 days to adversarially train the Kaldi model on six NVIDIA 2080Ti GPUs, while commonly-used voice datasets, such as LibriSpeech and Common Voice, have around 1,000 hours of voice data, making adversarial training almost impractical on large-scale models and datasets. Thus, even with adversarial training, our attack cannot be prevented without an unacceptable cost, and service providers may not be willing to adopt this defense for "black-box" commercial models due to the issues above.
Remarks. We point out several potential research directions toward more practical attacks. Since the human perception experiment has shown that some audio AEs can be recognized and regarded as abnormal audios, it is necessary to further improve the stealthiness of the constructed audio AEs. Moreover, the physical attack against voice control devices in this work fails when the distance is long or the environment is very noisy. For example, when we play the AEs at a distance of 50cm from the devices in a noisy cafe, the SRoAs of NI-Occam against Amazon Echo and Apple Siri decrease to 20%. Thus, it remains challenging to launch physical and black-box attacks on ASR systems at long distances or in noisy environments, which requires more effort in the future. Besides, robust physical adversarial attacks against SR systems in the decision-only and non-interactive settings should also be put on the agenda.
Finally, it is also interesting and meaningful to achieve black-box adversarial attacks against both ASR and SR systems with a single audio AE, since some smart speakers, such as Apple HomePod, first perform speaker identification on the input audio.

CONCLUSION
In this paper, we proposed two novel black-box adversarial attacks against commercial speech platforms, Occam and NI-Occam. Occam constructs audio AEs against cloud speech APIs in the decision-based black-box setting. It is effective under the strictly black-box scenario where the attacker relies solely on the final decisions. Extensive experiments on targeted and untargeted attacks against a wide range of popular open-source and commercial ASR and SR systems demonstrated the effectiveness of Occam, which achieves an average SNR of 14.37dB and a 100% SRoA on commercial ASR systems, outperforming state-of-the-art black-box audio adversarial attacks. NI-Occam is the first non-interactive physical attack, simple but effective, which can successfully fool commercial devices without any feedback from the target devices. Extensive experiments showed the effectiveness of the AEs from NI-Occam in attacking Apple Siri, Microsoft Cortana, Google Assistant, iFlytek, and Amazon Echo with an average SRoA of 52% and SNR of 9.65dB.

A ADAPTIVE DECOMPOSITION ALGORITHM
Here we present the detailed adaptive decomposition algorithm of Occam. The cooperative co-evolution in Occam is illustrated in Figure 7. Intuitively, one may want to probe all possible group sizes and choose the one with the best performance. However, this is impractical due to prohibitively high computational overheads. Instead, we propose a dynamic, self-adapting grouping strategy: we let the group size vary across stages according to its performance, which is evaluated in a "pilot" manner. More specifically, we first divide the variables into subgroups of the candidate group sizes and optimize one of the subgroups. The candidate group sizes are selected gradually, i.e., the number of subgroups is twice or half as large as that in the previous stage. We then determine the group size of the next stage according to the performance of the pilot test: a candidate size is adopted in the next stage if its pilot test performs better. In this way, the grouping strategy greatly reduces computation costs since the pilot test only involves optimizing a small group of variables.
The details of the grouping algorithm are as follows. Since the degree of interdependence between the variables gradually increases as the optimization moves from the natural audio toward the audio AE, we proceed in stages:
1) Initialize the number of groups as m = 1.
2) Compute the probabilities {p_i}_{i=1}^4 of selecting each grouping strategy in the candidate pool.
3) Select a grouping strategy according to the probabilities {p_i}_{i=1}^4.
4) Group the variables into m subgroups G = {G_1, . . . , G_m} according to the selected grouping strategy.
5) Optimize the subspaces on G until the current stage ends, and update the selection probabilities based on f, the best fitness of the last cycle, and f′, the best fitness of the current cycle.
6) Run the pilot test.
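The stage transition driven by the pilot test can be sketched as follows. Here `fitness_of_pilot` is a hypothetical callback standing in for optimizing a single subgroup under a candidate decomposition, and the bounds `m_min`/`m_max` are illustrative; only the doubling/halving rule and the "adopt the better pilot" decision mirror the text.

```python
import random

def pilot_select_group_count(fitness_of_pilot, m, m_min=1, m_max=64):
    """Choose the next stage's number of subgroups via a cheap pilot test.

    fitness_of_pilot(k) optimizes one subgroup under a decomposition into
    k groups and returns the achieved fitness (lower is better). Candidates
    are the current size and, when in range, its double and its half.
    """
    candidates = {m}
    if m // 2 >= m_min:
        candidates.add(m // 2)   # coarser decomposition (fewer, larger groups)
    if m * 2 <= m_max:
        candidates.add(m * 2)    # finer decomposition (more, smaller groups)
    return min(candidates, key=fitness_of_pilot)

def partition_indices(dim, m):
    """Randomly split `dim` variable indices into m disjoint subgroups."""
    idx = list(range(dim))
    random.shuffle(idx)
    return [idx[i::m] for i in range(m)]
```

Because only one small subgroup is optimized per candidate, the cost of probing all three candidate sizes stays far below that of a full optimization stage.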

Algorithm 3 Differential Evolution Attack
Input: the original audio x, the initial adversarial audio x′, the input space dimension, the population size, and the total number of iterations. Output: the adversarial audio sample x*.
1: Initialize the following parameters:

B DIFFERENTIAL EVOLUTION
Differential evolution is a gradient-free optimization method like CMA-ES. To sufficiently justify the design choice of CMA-ES, we also include evaluations of the differential evolution attack (DEA). Our experimental results (see Tables 4 and 8) show that DEA does not perform well on audio data. Here we present the algorithm of differential evolution, as shown in Alg. 3. To more easily generate offspring closer to the original audio, a bias term is added to the mutation step, and the crossover step is removed.
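A minimal sketch of this modified mutation-and-selection step is given below. The scale factor `F`, the bias weight, and the greedy selection rule are illustrative assumptions rather than the exact parameters of Alg. 3; the sketch only shows the structure described above, i.e., a bias term pulling mutants toward the original audio and no crossover.

```python
import numpy as np

def dea_step(pop, fitness, x_orig, F=0.5, bias=0.05, rng=None):
    """One generation of DE with a bias toward the original audio and no
    crossover: v = x_r1 + F*(x_r2 - x_r3) + bias*(x_orig - x_r1). A trial
    vector replaces its parent only if it has lower (better) fitness."""
    rng = rng or np.random.default_rng()
    n = len(pop)
    out = pop.copy()
    for i in range(n):
        r1, r2, r3 = rng.choice(
            [j for j in range(n) if j != i], size=3, replace=False)
        # Mutation with bias toward the original audio; crossover omitted.
        v = pop[r1] + F * (pop[r2] - pop[r3]) + bias * (x_orig - pop[r1])
        if fitness(v) < fitness(pop[i]):   # greedy selection
            out[i] = v
    return out
```

Because selection is greedy per individual, the best fitness in the population is non-increasing across generations, yet, as Tables 4 and 8 show, this scheme still converges poorly on high-dimensional audio.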
Common Voice. Common Voice is the largest open-source human voice dataset, launched by Mozilla, including 28 different languages and nearly 1,800 validated hours of voice data. The Common Voice dataset is used to train ASR systems and test the effectiveness of audio AEs.
Song. Song is released by CommanderSong [90] and also used in Devil's Whisper [32]. It contains 20 songs evenly divided into four categories: soft, popular, rock, and rap. Similar to Devil's Whisper [32], we select a total of 10 songs from the soft and popular categories, which are less noisy and easier to disguise, to evaluate the performance of Occam against speech recognition APIs.
LibriSpeech. LibriSpeech [67] is an English speech corpus consisting of 1,000 hours of read audio sampled at 16kHz.
VoxCeleb. VoxCeleb [64] is an open-source large-scale audio dataset containing more than 100,000 utterances from more than 7,000 celebrities. It consists of short human voice clips extracted from interview videos uploaded to YouTube. In our experiments on SR systems, we use part of VoxCeleb as the enrollment data.

C.2 Target Systems
DeepSpeech. DeepSpeech [42] is a state-of-the-art ASR system that performs speech-to-text tasks. It is commonly used as the target model in adversarial attacks [53,80]. Note that although DeepSpeech is open-source, Occam does not require any internal knowledge of the target model.
Commercial Cloud Speech APIs. Speech-to-text API services are widely used in many applications. To evaluate Occam on commercial services, we choose six popular API services, including Google Cloud Speech-to-Text [5], Microsoft Azure Speech Service [6], Alibaba Short Speech Recognition [11], Tencent Short Speech Recognition [10], and iFlytek voice dictation [3]. For Google's Speech-to-Text API, we select the "command_and_search" model as the target model. The characteristics of these APIs are listed in Table 2. Note that although some APIs provide confidence scores, our attack does not require such information.
Commercial Speaker Recognition Systems. We test Occam on the Microsoft Azure speaker recognition system and the Jingdong speaker recognition system [9]. Microsoft Azure's API [6] can perform speaker identification (who is the speaker) and speaker verification (whether the speaker is legitimate). It only returns the decision (i.e., the predicted speaker) along with three confidence levels (i.e., low, normal, or high). Jingdong's API performs speaker verification and only returns the final result, i.e., accept or reject.
Commercial Voice Control Devices. We test NI-Occam on five commercial voice control devices: Apple Siri, iFlytek, Microsoft Cortana, Google Assistant, and Amazon Echo. In our experiments, Apple Siri, iFlytek, Microsoft Cortana, and Google Assistant are applications installed on off-the-shelf smartphones, and Amazon Echo is a voice-controlled smart speaker.

C.3 Hardware
In our experiments, we conduct the attacks against DeepSpeech on a server equipped with four NVIDIA 2080Ti GPUs, a six-core Intel Xeon W-2133 CPU @ 3.60GHz, 62GB of RAM, and a 1.37TB hard drive. For the experiments on speech API services, we use several laptops: a Lenovo ThinkPad X1 Carbon 4th with an Intel Core i5-6200U CPU @ 2.30GHz and 8GB of RAM, a Microsoft Surface Pro 6 with an Intel Core i7-8650U CPU @ 1.90GHz and 8GB of RAM, and a Lenovo ThinkPad X1 Carbon 5th with an Intel Core i7-7500U CPU @ 2.70GHz and 8GB of RAM. Besides, we attack the voice assistants Apple Siri (version 13.6.1) and iFlytek (version 10.0.8) on an iPhone 11, Cortana (version 3.3.3) on a Samsung C9000, and Google Assistant (version 2.5.1) on a Nokia 7 plus. We play the audio AEs using a JBL Clip 3 portable speaker.

D SUPPLEMENTARY EVALUATION OF OCCAM D.1 Evaluation on Open-source ASR Systems
We evaluate the effectiveness of the audio AEs generated by the attacks in Table 8. These AEs are generated after 200,000 queries (100,000 on Song 13 ) on DeepSpeech. Since the Genetic Algorithm-based Attack (GAA) [80] and the Selective Gradient Estimation Attack (SGEA) [82] leverage the prediction scores of the model, their AEs achieve higher SNRs than those of the decision-based attacks. However, neither of them can attack DeepSpeech with a 100% SRoA. In the other six decision-based attacks, the initial example is adversarial from the start and is then optimized to approach the target example; therefore, the SRoA of these attacks is always 100%. However, it is worth noting that SRoA is not the only metric that determines whether an AE can successfully attack the target system. An effective AE needs to fool both the model and the human, so SNR must be used as another important measure of effectiveness. Although the six decision-based attacks all achieve a 100% SRoA, only the AEs generated by our attack obtain high SNRs (with the best SNR of 13.80dB).
To evaluate the efficiency of Occam, we test the SNRs of the AEs generated after different numbers of queries on DeepSpeech. As shown in Figure 8, Occam achieves a high SNR within a small number of queries. Note that the dataset Song requires fewer queries because its audio is less noisy and more powerful, making it easier to generate an audio AE with a high SNR. We also find that the growth rate of the SNR decreases significantly as the SNR increases. For example, the SNR of our AEs quickly converges to 12.86dB, while the evolutionary attack may require hundreds of thousands of queries to raise the SNR from 1.20dB to 12.86dB. Hence, the results show that the co-evolution algorithm attains a higher SNR bound and a faster convergence rate than the evolution algorithm on audio data, validating the effectiveness of the CC framework in constructing audio AEs.

D.2 Evaluation on Cloud Speech APIs
For completeness, we give some additional experimental results as a complement to Section 5.2. We present Figure 9 to show the SNRs with confidence intervals of 68.2% (−std, +std). The results show that Occam achieves the highest SNRs among all the decision-based attacks. Figure 10 shows the waveforms and spectrograms of the original audio and the adversarial audios generated by the seven attacks. The waveforms of the original audio and the audio AEs generated by Occam are almost identical, while the differences introduced by the other attacks are more noticeable and thus more likely to be perceived by humans.
As for the untargeted attacks, the results in Table 9 show that the SNRs of AEs generated by untargeted attacks are much higher than those of targeted attacks. In this experiment, the performance of the evolutionary attack is comparable to that of Occam. Since untargeted attacks are much simpler than targeted ones (e.g., 200 queries suffice), the optimization problem of untargeted attacks can be solved without being decomposed into a set of simpler sub-problems. Thus, approaches without a cooperative co-evolution framework can also work well.
However, for the large-scale and complex problem of generating targeted audio AEs, these approaches become ineffective. Besides, Devil's Whisper does not explicitly support untargeted attacks. To launch an untargeted attack using the methodology of Devil's Whisper, the attacker needs to intentionally set a wrong phrase as the target phrase, which incurs a large number of queries. In contrast, our methodology can generate effective audio AEs within 200 queries.

D.3 Evaluation on Human Perception
Similar to the human perception experiments on NI-Occam, we also evaluate the performance of Occam in terms of human perception. The experiment settings are the same as those in Section 5.3.
For each commercial service, the volunteers listen to 10 audio AEs generated by each of the boundary attack, the opt-attack, the evolutionary attack, HSJA, DEA, Devil's Whisper, and Occam. The audio AEs are generated by querying the target systems 200,000 times with original audios from the dataset Song. In all, we collect 37×10 samples per attack on each commercial service (a total of 37×10×6 samples per attack). Note that, since Devil's Whisper does not achieve a 100% SRoA, the volunteers only rate its successful AEs: each volunteer ranks 3, 5, 7, 8, and 4 successful AEs from Devil's Whisper on Alibaba, Google, iFlytek, Microsoft, and Tencent, respectively. Table 10 presents the results of the human perception experiments. Overall, Occam has the highest rate of "normal". More than 70% of the volunteers consider the audio generated by Occam to be "normal" or "noise", which is comparable to Devil's Whisper; however, about 50% of the AEs from Devil's Whisper fail to fool the ASR systems in the first place. The results show that Occam is more effective than the other attacks at fooling both the model and the human.
Recall that we use two important metrics to evaluate the effectiveness of audio AEs, i.e., SRoA (Success Rate of Attack) and SNR. Both metrics are crucial, since an effective AE needs to fool both the model and the human simultaneously. To demonstrate the necessity of SNR as an evaluation criterion, we conduct a small human study on how the SNR affects the effectiveness and quality of an audio AE.

E A HUMAN STUDY ON AUDIO AES WITH DIFFERENT SNRS
In this experiment, we survey 30 volunteers who are sensitive to sound, including 17 males and 13 females. Similar to the human perception experiment (see Section 5.3), we first show some examples of "noise", "normal", "talking", and "recognized", and then ask the volunteers to listen to the audio AEs and give their views on them. The audio AEs are generated by Occam against Alibaba and Tencent, with SNRs ranging from 6dB to 16dB; we generate 10 AEs for each SNR group. As shown in Table 11, as the SNR increases (i.e., the noise becomes smaller), more volunteers consider an AE normal. Specifically, when the SNR is lower than 8dB, over 84% of the volunteers can hear talking in the AE audios, and more than 31% can recognize the commands in them. The results validate that if the SNR is low, it is easy for users to notice or even recognize the AEs. Since audio AEs need to be stealthy enough to fool humans, AEs with high SNRs are preferable in practice.

F SUPPLEMENTARY EVALUATION OF NI-OCCAM
As a supplement to Section 4.3, here we provide the detailed results of our NI-Occam on 10 target phrases in Table 12. Besides, we also conduct an experiment on the impact of the number of attempts.
In previous experiments on NI-Occam, the default number of query attempts is 3, while Devil's Whisper [32] adopted 30 attempts. Theoretically, more attempts lead to higher SRoAs. To illustrate this, we conduct an experiment with 1 to 30 query attempts using NI-Occam; the results are given in Figure 11. At the very start, increasing the number of query attempts helps raise the SRoA. However, once the number of attempts exceeds 17, the SRoA stays at 70% and no longer increases, indicating a performance limit. In practice, more attempts also require longer attack periods, making the attack less stealthy. Hence, we suggest that the attacker conduct as few attempts as possible.

G EVALUATION OF OCCAM AND NI-OCCAM ON LARGE COMMAND SETS
As illustrated before, the target phrases in Devil's Whisper [32]

H RESULTS OF POSSIBLE COUNTERMEASURES
Due to space limitations, we provide the detailed results of four countermeasures against our attacks in this section. Table 14 shows the performance of our attacks against local smoothing, downsampling, and the temporal dependency based approach, and Table 15 presents the performance of NI-Occam against adversarial training. In the adversarial training experiment, we adversarially train Kaldi (the aforementioned Mini LibriSpeech model) on the Mini LibriSpeech dataset 14 . Following [50,51], we set p to ∞, the number of PGD iterations to 10, and the step size α to ε/5. Since prior work [51] demonstrates that adversarial training with ε > 0.01 fails to converge, we set ε to 0.002, 0.004, and 0.006. For evaluation, we also generate 10 audio AEs against the Mini LibriSpeech model.