Abstract
Statistical parametric speech synthesis techniques such as deep neural networks (DNNs) and hidden Markov models (HMMs) have grown in popularity over the last decade relative to concatenative approaches, as they model the excitation and spectral parameters of speech to synthesize waveforms from written text. Owing to inadequate acoustic modelling, speech synthesized by HMM-based systems sounds muffled. DNN-based synthesis improves the acoustic model by replacing the decision trees of the HMM with a powerful regression model, and the performance of a deep neural network is further enhanced by pre-training with either restricted Boltzmann machines (RBMs) or autoencoders. RBMs can capture the multi-modal nature of speech, but they introduce spectral distortion into the synthesized waveforms because they do not take reconstruction error into account. This article proposes a deep neural network model, pre-trained using stacked denoising autoencoders, to model the speech parameters of the Punjabi language. Denoising autoencoders add noise to the training data and then reconstruct the original measurements, thereby reducing the reconstruction error. The voice synthesized by the proposed model achieved a VARN of 0.82, an F0 RMSE of 9.03 Hz, and a V/UV error rate of 4.04%.
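The denoising-autoencoder principle described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration in NumPy, not the authors' implementation: a single-layer autoencoder with tied weights corrupts each input with Gaussian noise and is trained to reconstruct the clean input, so the objective directly minimises reconstruction error. The dimensions and the toy low-rank data standing in for speech parameter vectors are assumptions for the example.

```python
# Minimal denoising-autoencoder sketch (hypothetical illustration):
# corrupt the input, reconstruct the CLEAN target, minimise MSE.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    def __init__(self, n_visible, n_hidden, noise_std=0.1, lr=0.05):
        self.W = rng.normal(0.0, 0.1, (n_visible, n_hidden))  # tied weights
        self.b_h = np.zeros(n_hidden)   # encoder bias
        self.b_v = np.zeros(n_visible)  # decoder bias
        self.noise_std = noise_std
        self.lr = lr

    def train_step(self, x_clean):
        # Corrupt the input with Gaussian noise ...
        x_noisy = x_clean + rng.normal(0.0, self.noise_std, x_clean.shape)
        # ... encode the noisy input and decode with the transposed weights.
        h = sigmoid(x_noisy @ self.W + self.b_h)
        x_rec = h @ self.W.T + self.b_v
        # The loss compares the reconstruction to the CLEAN input.
        err = x_rec - x_clean
        delta_h = (err @ self.W) * h * (1.0 - h)  # backprop through sigmoid
        n = x_clean.shape[0]
        # Tied weights receive gradients from both the decoder and encoder.
        self.W -= self.lr * (err.T @ h + x_noisy.T @ delta_h) / n
        self.b_v -= self.lr * err.mean(axis=0)
        self.b_h -= self.lr * delta_h.mean(axis=0)
        return float(np.mean(err ** 2))

# Toy usage: low-rank 50-dim vectors stand in for speech parameter frames.
Z = rng.normal(0.0, 1.0, (256, 10))
X = Z @ rng.normal(0.0, 0.3, (10, 50))
dae = DenoisingAutoencoder(n_visible=50, n_hidden=20)
losses = [dae.train_step(X) for _ in range(200)]
assert losses[-1] < losses[0]  # reconstruction error decreases with training
```

In the stacked setting the abstract refers to, several such layers would be trained greedily, each on the hidden codes of the previous one, before fine-tuning the whole network on the speech-parameter mapping task.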
Modelling of Speech Parameters of Punjabi by Pre-trained Deep Neural Network Using Stacked Denoising Autoencoders