
Transfer learning from speaker verification to multispeaker text-to-speech synthesis

Published: 03 December 2018

Abstract

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech without transcripts from thousands of speakers, to generate a fixed-dimensional embedding vector from only seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder network that converts the mel spectrogram into time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the multispeaker TTS task, and is able to synthesize natural speech from speakers unseen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
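
The abstract fully specifies the data flow, so a minimal runnable sketch of the three-stage pipeline may help make it concrete. Everything below is an illustrative assumption rather than the authors' implementation: layer sizes and module names are invented, and the Tacotron 2 synthesizer and WaveNet vocoder are replaced by toy stand-ins. Only the conditioning pattern follows the paper's description: a separately trained speaker encoder produces a fixed-dimensional embedding from reference audio, and that embedding is broadcast to the synthesizer at every decoder step.

```python
# Minimal sketch (PyTorch) of the three-stage pipeline described in the
# abstract. All shapes, names, and hyperparameters are illustrative
# assumptions, not the authors' actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Maps a reference utterance (mel frames) to a fixed-dimensional
    speaker embedding; trained separately on speaker verification and
    then frozen for TTS."""
    def __init__(self, n_mels=40, hidden=256, embed_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, mel_frames):             # (batch, frames, n_mels)
        _, (h, _) = self.lstm(mel_frames)
        emb = self.proj(h[-1])                  # final state of last layer
        return F.normalize(emb, dim=-1)         # L2-normalized embedding

class Synthesizer(nn.Module):
    """Toy stand-in for the Tacotron 2 style sequence-to-sequence network:
    the speaker embedding is concatenated with the text encoding at each
    decoder timestep."""
    def __init__(self, vocab=100, text_dim=128, embed_dim=256, n_mels=80):
        super().__init__()
        self.embed_text = nn.Embedding(vocab, text_dim)
        self.decoder = nn.GRU(text_dim + embed_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, text_ids, speaker_emb):   # (B, T), (B, E)
        t = self.embed_text(text_ids)
        e = speaker_emb.unsqueeze(1).expand(-1, t.size(1), -1)
        out, _ = self.decoder(torch.cat([t, e], dim=-1))
        return self.to_mel(out)                 # predicted mel spectrogram

# Inference: clone an unseen voice from seconds of reference audio.
encoder, synth = SpeakerEncoder(), Synthesizer()
ref_mels = torch.randn(1, 120, 40)              # ~1-2 s of reference speech
text = torch.randint(0, 100, (1, 30))           # tokenized input text
with torch.no_grad():
    mel = synth(text, encoder(ref_mels))        # a WaveNet-style vocoder
print(mel.shape)                                # would then invert mel -> audio
```

Because the three components are trained independently, the speaker encoder can be trained on large, untranscribed, noisy verification data, while the synthesizer only ever consumes the resulting embedding; this decoupling is what lets the system generalize to speakers unseen during TTS training.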



Published In

NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, December 2018, 11021 pages.

Publisher: Curran Associates Inc., Red Hook, NY, United States
