research-article

Modelling of Speech Parameters of Punjabi by Pre-trained Deep Neural Network Using Stacked Denoising Autoencoders

Published: 23 March 2023

Abstract

Statistical parametric speech synthesis techniques such as the deep neural network (DNN) and the hidden Markov model (HMM) have grown in popularity over the last decade relative to concatenative speech synthesis approaches: they model the excitation and spectral parameters of speech to synthesize waveforms from written text. Due to inappropriate acoustic modelling, speech synthesized by HMM-based systems sounds muffled. DNNs improve the acoustic model by replacing the decision trees of the HMM with a powerful regression model, and the performance of a deep neural network is further enhanced by pre-training with either restricted Boltzmann machines (RBMs) or autoencoders. RBMs can capture the multi-modal properties of speech, but they cause spectral distortion in the synthesized waveforms because the reconstruction error is not considered. This article proposes a deep neural network model, pre-trained using stacked denoising autoencoders, to map the speech parameters of the Punjabi language. Denoising autoencoders work by adding noise to the training data and then reconstructing the original measurements so as to reduce the reconstruction error. The voice synthesized using the proposed model achieved a VARN of 0.82, an F0 RMSE of 9.03 Hz, and a V/UV error rate of 4.04%.
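The corruption-and-reconstruction scheme described above can be sketched as a single-layer denoising autoencoder in NumPy. This is a minimal illustration only: the synthetic data stands in for speech parameter frames, and the layer sizes, noise level, and learning rate are arbitrary choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for speech parameter frames: 500 frames of 20 "features"
# generated from a 5-dim latent, so a small code layer can reconstruct them.
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 20)) * 0.5

n_in, n_hid = X.shape[1], 8
W = rng.normal(scale=0.1, size=(n_in, n_hid))  # tied weights: decoder uses W.T
b_h = np.zeros(n_hid)
b_o = np.zeros(n_in)
lr, noise_std = 0.05, 0.3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for epoch in range(300):
    X_noisy = X + rng.normal(scale=noise_std, size=X.shape)  # corrupt the input
    H = sigmoid(X_noisy @ W + b_h)   # encode the corrupted frames
    X_rec = H @ W.T + b_o            # linear decoder
    err = X_rec - X                  # reconstruction error vs. the CLEAN frames
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation with tied weights (W appears in encoder and decoder).
    dH = (err @ W) * H * (1.0 - H)
    gW = (X_noisy.T @ dH + err.T @ H) / len(X)
    W -= lr * gW
    b_h -= lr * dH.mean(axis=0)
    b_o -= lr * err.mean(axis=0)

print(f"reconstruction MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In the stacked setting, the hidden codes of one trained denoising autoencoder become the input to the next layer's autoencoder, and the learned weights then initialize the corresponding DNN layers before supervised fine-tuning on the text-to-acoustics mapping.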



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3
  March 2023, 570 pages
  ISSN: 2375-4699 | EISSN: 2375-4702
  DOI: 10.1145/3579816


Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 23 March 2023
      • Online AM: 10 February 2023
      • Accepted: 11 October 2022
      • Revised: 8 August 2022
      • Received: 6 May 2022
