
Paralinguistic Privacy Protection at the Edge

Published: 13 April 2023

Abstract

Voice user interfaces and digital assistants are rapidly entering our lives and becoming a single touch point spanning our devices. These always-on services capture our audio and transmit it to powerful cloud services for further processing and subsequent actions. The voices and raw audio signals collected through these devices contain a host of sensitive paralinguistic information that is transmitted to service providers whether the device is triggered deliberately or by accident. Because our emotional patterns and sensitive attributes such as identity, gender, and well-being are easily inferred using deep acoustic models, using these services exposes us to a new generation of privacy risks. One approach to mitigating the risk of paralinguistic privacy breaches is to combine cloud-based processing with privacy-preserving, on-device paralinguistic information learning and filtering, applied before voice data are transmitted.
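The hybrid flow described above (filter on-device, then offload) can be sketched in a few lines. This is a rough illustration, not the article's implementation: all function names here are hypothetical placeholders, and the "filter" is a trivial stand-in transform.

```python
# Minimal sketch of the edge/cloud flow: filter locally, then offload.
# local_filter and send_to_cloud are hypothetical placeholders, NOT
# EDGY's actual API; the filter body is a trivial stand-in transform.
import numpy as np

def local_filter(audio: np.ndarray) -> np.ndarray:
    """Stand-in for an on-device model that re-synthesizes audio with
    sensitive paralinguistic attributes (identity, emotion) suppressed.
    Here we merely peak-normalize the signal as a placeholder."""
    peak = float(np.max(np.abs(audio)))
    return audio / peak if peak > 0 else audio

def send_to_cloud(audio: np.ndarray) -> str:
    """Stand-in for the cloud speech-recognition call."""
    return f"<transcript of {audio.shape[0]} samples>"

raw = np.random.randn(16000)          # 1 s of audio at 16 kHz
safe = local_filter(raw)              # filter on-device first...
transcript = send_to_cloud(safe)      # ...then offload to the cloud
```

The key property is ordering: the raw signal never leaves the device; only the filtered representation does.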

In this article, we introduce EDGY, a configurable, lightweight, disentangled representation learning framework that transforms and filters high-dimensional voice data to identify and contain sensitive attributes at the edge before offloading to the cloud. We evaluate EDGY’s on-device performance and explore optimization techniques, including model quantization and knowledge distillation, to enable private, accurate, and efficient representation learning on resource-constrained devices. Our results show that EDGY runs in tens of milliseconds, with a 0.2% relative improvement in “zero-shot” ABX score or a minimal performance penalty of approximately 5.95% word error rate (WER), when learning linguistic representations from raw voice signals using a CPU and a single-core ARM processor without specialized hardware.
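The two optimization techniques named above can be illustrated with a short, self-contained sketch. This is not EDGY's code: the toy models, layer sizes, and temperature are invented for the example. It uses PyTorch's post-training dynamic quantization API and the standard soft-target distillation loss of Hinton et al.

```python
# Illustrative sketch only: dynamic quantization of a toy "student"
# encoder plus a standard knowledge-distillation loss. Architectures
# and hyperparameters here are invented, not EDGY's.
import torch
import torch.nn as nn
import torch.nn.functional as F

# A large "teacher" and a small on-device "student" (toy sizes).
teacher = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 64))
student = nn.Sequential(nn.Linear(80, 32), nn.ReLU(), nn.Linear(32, 64))

# Post-training dynamic quantization: Linear weights stored as int8,
# activations quantized on the fly, shrinking the on-device model.
student_q = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soft-target KL divergence between temperature-scaled
    # distributions (Hinton et al., "Distilling the Knowledge
    # in a Neural Network").
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

x = torch.randn(4, 80)  # a batch of 4 frames of 80-dim acoustic features
with torch.no_grad():
    loss = distillation_loss(student_q(x), teacher(x))
```

In training, this loss would be minimized so the compact quantized student mimics the teacher's output distribution while staying small enough for resource-constrained edge devices.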



• Published in

  ACM Transactions on Privacy and Security, Volume 26, Issue 2, May 2023, 335 pages
  ISSN: 2471-2566
  EISSN: 2471-2574
  DOI: 10.1145/3572849


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 13 April 2023
• Online AM: 3 November 2022
• Accepted: 26 September 2022
• Revised: 10 March 2022
• Received: 29 May 2021
