Abstract
The voice-based Internet of Multimedia Things (IoMT) combines IoT interfaces and protocols with voice-related information, enabling advanced applications based on human-to-device interaction. One example is Automatic Speech Recognition (ASR) for live captioning and voice translation. ASR for IoMT faces three major issues: IoT development cost, speech recognition accuracy, and execution time complexity. First, most non-voice IoT applications are upgraded with ASR features through hard coding, which is error prone. Second, ASR recognition accuracy must be improved. Third, many multimedia IoT services are real-time applications, so the ASR delay must be short.
This article addresses these issues using an IoT platform called VoiceTalk. We built the largest Taiwanese spoken corpus to train VoiceTalk ASR (VT-ASR) and show how the VT-ASR mechanism can be transparently integrated with existing IoT applications. We consider two performance measures for VoiceTalk: speech recognition accuracy and VT-ASR delay. In the PAL-Labs acoustic tests, VT-ASR's accuracy is 96.47%, while Google's accuracy is 94.28%. We are the first to develop an analytic model of the probability that VT-ASR processing of the first speaker's speech completes before the second speaker starts talking. From the measurements and analytic modeling, we show that the VT-ASR delay is short enough to deliver a very good user experience. Our solution has won several important government and commercial TV contracts in Taiwan. VT-ASR demonstrated better Taiwanese Mandarin speech recognition accuracy than well-known commercial products (including Google and iFLYTEK) in the Formosa Speech Recognition Challenge 2018 (FSR-2018) and achieved the best Taiwanese recognition accuracy among all participating ASR systems in FSR-2020.
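The delay measure above can be illustrated with a minimal sketch. It is not the article's analytic model: it assumes, purely for illustration, that the VT-ASR delay D and the silent gap G before the second speaker starts are independent exponential random variables, in which case P(D < G) = E[G] / (E[D] + E[G]). The function names and the mean values used here are hypothetical and are not measurements from the article.

```python
import random


def prob_delay_before_gap(mean_delay, mean_gap):
    """Closed-form P(D < G) for independent exponential D and G.

    Assumption (not from the article): D ~ Exp(1/mean_delay) is the ASR
    delay and G ~ Exp(1/mean_gap) is the gap before the next speaker.
    """
    mu = 1.0 / mean_delay   # rate of the delay distribution
    lam = 1.0 / mean_gap    # rate of the gap distribution
    return mu / (mu + lam)


def simulate(mean_delay, mean_gap, trials=200_000, seed=0):
    """Monte Carlo estimate of P(D < G) under the same assumptions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        delay = rng.expovariate(1.0 / mean_delay)
        gap = rng.expovariate(1.0 / mean_gap)
        if delay < gap:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Illustrative values in seconds; hypothetical, not measured data.
    print(prob_delay_before_gap(0.2, 0.6))
    print(simulate(0.2, 0.6))
```

Comparing the closed form with the simulation is a quick sanity check; other delay or gap distributions would need a different closed form, although the simulation carries over by swapping the sampling calls.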
VoiceTalk: Multimedia-IoT Applications for Mixing Mandarin, Taiwanese, and English