Abstract
Voice user interfaces and digital assistants are rapidly entering our lives and becoming unified touch points across our devices. These always-on services capture our audio and transmit it to powerful cloud services for further processing and subsequent actions. The voices and raw audio signals collected through these devices contain a host of sensitive paralinguistic information, which reaches service providers whether a device was triggered deliberately or by accident. Because our emotional patterns and sensitive attributes such as identity, gender, and well-being are easily inferred with deep acoustic models, using these services exposes us to a new generation of privacy risks. One approach to mitigating the risk of paralinguistic privacy breaches is to combine cloud-based processing with privacy-preserving, on-device paralinguistic information learning and filtering, applied before voice data is transmitted.
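As a rough illustration of this edge-filtering idea, the sketch below splits each acoustic frame into a linguistic code and a paralinguistic code and offloads only the former. All names, dimensions, and the linear "encoder" are hypothetical stand-ins for illustration, not the actual on-device architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frame, w_ling, w_para):
    # Hypothetical disentangling encoder: two linear projections
    # standing in for the linguistic and paralinguistic branches.
    return frame @ w_ling, frame @ w_para

def filter_for_cloud(frame, w_ling, w_para):
    # Keep only the linguistic code; the paralinguistic code
    # (speaker identity, emotion, well-being) never leaves the device.
    z_ling, _ = encode(frame, w_ling, w_para)
    return z_ling

# Toy dimensions: a 40-dim acoustic frame mapped to 16-dim codes.
frame = rng.standard_normal(40)
w_ling = rng.standard_normal((40, 16))
w_para = rng.standard_normal((40, 16))

payload = filter_for_cloud(frame, w_ling, w_para)
assert payload.shape == (16,)  # only the linguistic code is offloaded
```

The key design point is that the split happens on-device: the cloud service only ever receives the filtered representation, never the raw waveform.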
In this article we introduce EDGY, a configurable, lightweight, disentangled representation learning framework that transforms and filters high-dimensional voice data to identify and contain sensitive attributes at the edge, before offloading to the cloud. We evaluate EDGY’s on-device performance and explore optimization techniques, including model quantization and knowledge distillation, to enable private, accurate, and efficient representation learning on resource-constrained devices. Our results show that EDGY runs in tens of milliseconds with a 0.2% relative improvement in “zero-shot” ABX score, or a minimal performance penalty of approximately 5.95% word error rate (WER), when learning linguistic representations from raw voice signals, using a CPU and a single-core ARM processor without specialized hardware.
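The model quantization mentioned above can be illustrated with a minimal symmetric int8 post-training scheme: float32 weights are mapped to 8-bit integers with a single per-tensor scale, shrinking the model roughly 4x for integer-arithmetic inference. This is a generic sketch of the technique, not EDGY's exact quantization pipeline:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: one scale maps the float
    # range [-max|w|, max|w|] onto the int8 range [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate float weights for accuracy checks.
    return q.astype(np.float32) * scale

w = np.array([-0.5, 0.0, 0.3, 0.5], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

assert q.dtype == np.int8
# Round-trip error is bounded by one quantization step.
assert np.max(np.abs(w - w_hat)) < s
```

In practice the accuracy cost of such quantization is measured after the fact (e.g. as the WER penalty reported above), and per-channel scales or quantization-aware training can recover much of any loss.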
Paralinguistic Privacy Protection at the Edge