Research Article

VoiceTalk: Multimedia-IoT Applications for Mixing Mandarin, Taiwanese, and English

Published: 18 May 2023

Abstract

The voice-based Internet of Multimedia Things (IoMT) combines IoT interfaces and protocols with associated voice-related information, enabling advanced applications based on human-to-device interaction. An example is Automatic Speech Recognition (ASR) for live captioning and voice translation. Three major issues of ASR for IoMT are IoT development cost, speech recognition accuracy, and execution time complexity. Regarding the first issue, most non-voice IoT applications are upgraded with an ASR feature through hard coding, which is error prone. Regarding the second issue, recognition accuracy must be improved for ASR. Regarding the third issue, many multimedia IoT services are real-time applications, so the ASR delay must be short.

This article elaborates on the above issues based on an IoT platform called VoiceTalk. We built the largest Taiwanese spoken corpus to train VoiceTalk ASR (VT-ASR) and show how the VT-ASR mechanism can be transparently integrated with existing IoT applications. We consider two performance measures for VoiceTalk: speech recognition accuracy and VT-ASR delay. In the acoustic tests of PAL-Labs, VT-ASR's accuracy is 96.47%, while Google's accuracy is 94.28%. We are the first to develop an analytic model to investigate the probability that VT-ASR for the first speaker completes before the second speaker starts talking. From the measurements and analytic modeling, we show that the VT-ASR delay is short enough to yield a very good user experience. Our solution has won several important government and commercial TV contracts in Taiwan. VT-ASR demonstrated better Taiwanese Mandarin speech recognition accuracy than well-known commercial products (including Google and iFlytek) in the Formosa Speech Recognition Challenge 2018 (FSR-2018) and achieved the best Taiwanese recognition accuracy among all participating ASR systems in FSR-2020.
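The analytic model itself appears in the full article; as a loose illustration of the quantity it studies, the probability that recognition of the first speaker finishes before the second speaker begins can be estimated by Monte Carlo simulation. The exponential delay and gap distributions below are an assumption made for this sketch only, and the function name and parameter values are invented, not taken from the article.

```python
import random

def p_first_speaker_recognized_in_gap(mean_delay=0.4, mean_gap=0.7,
                                      trials=100_000, seed=42):
    """Estimate P(ASR delay < inter-speaker gap) by Monte Carlo.

    Illustrative only: both the ASR delay and the turn-taking gap are
    assumed to be exponentially distributed (in seconds), which is a
    hypothetical choice, not the article's analytic model.
    """
    rng = random.Random(seed)
    # Count trials in which the sampled ASR delay ends before the
    # sampled silence gap between the two speakers runs out.
    hits = sum(
        rng.expovariate(1 / mean_delay) < rng.expovariate(1 / mean_gap)
        for _ in range(trials)
    )
    return hits / trials

if __name__ == "__main__":
    print(p_first_speaker_recognized_in_gap())
```

Under these exponential assumptions the estimate should approach the closed form mean_gap / (mean_delay + mean_gap); shortening the mean ASR delay relative to typical conversational gaps drives the probability toward 1, which is the intuition behind requiring a short VT-ASR delay.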



• Published in

  ACM Transactions on Internet Technology, Volume 23, Issue 2
  May 2023, 276 pages
  ISSN: 1533-5399
  EISSN: 1557-6051
  DOI: 10.1145/3597634
  Editor: Ling Liu

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 18 May 2023
      • Online AM: 14 June 2022
      • Accepted: 3 June 2022
      • Received: 15 April 2021
