research-article

Research on Chinese Audio and Text Alignment Algorithm Based on AIC-FCM and Doc2Vec

Published: 02 April 2023

Abstract

The “audiobook” is a multimedia-based reading technology that has emerged in recent years. Aligning e-book text with the corresponding book audio is the most important step in producing one. This article describes an audio and text alignment algorithm that uses deep learning and neural network technology to improve the efficiency and quality of audiobook production. The algorithm first uses dual-threshold endpoint detection to segment long audio into sentence-level short audio clips, each of which is recognized as a short text. The threshold is calculated by AIC-FCM optimized with a simulated annealing genetic algorithm. The algorithm then calculates text similarity using Doc2vec, optimized by a threshold prediction method based on the average length of the short texts. Finally, it proofreads and outputs the text sequence and audio segments aligned in the time dimension to meet the needs of audiobook production. Experiments show that, compared to traditional audio and text alignment algorithms, the proposed algorithm comes closer to the ideal result in long audio segmentation, achieves essentially the same alignment quality as Doc2vec, and reduces time complexity by about 35%.
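The segmentation step can be illustrated with a minimal sketch of generic dual-threshold endpoint detection on short-time energy. This is not the article's implementation: the function names, the non-overlapping framing, and the energy-only criterion are assumptions made for illustration, and the `high` and `low` thresholds are passed in by hand here, whereas the article derives its threshold with AIC-FCM.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Short-time energy of consecutive, non-overlapping frames (assumed framing)."""
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def dual_threshold_segments(
    energies: List[float], high: float, low: float
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame_exclusive) pairs for detected segments.

    A segment is seeded by a frame whose energy exceeds `high`, then
    extended in both directions while the energy stays above `low`.
    Both thresholds are hypothetical inputs, not values from the article.
    """
    segments: List[Tuple[int, int]] = []
    n = len(energies)
    i = 0
    while i < n:
        if energies[i] > high:
            # Extend left, but never into the previous segment.
            floor = segments[-1][1] if segments else 0
            start = i
            while start > floor and energies[start - 1] > low:
                start -= 1
            # Extend right until the energy drops to `low` or below.
            end = i + 1
            while end < n and energies[end] > low:
                end += 1
            segments.append((start, end))
            i = end
        else:
            i += 1
    return segments
```

Multiplying the returned frame indices by the frame duration gives the cut points in seconds. Classic dual-threshold detectors often also use the zero-crossing rate to catch low-energy consonants; that is omitted here to keep the sketch short.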


          • Published in

            ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 3
            March 2023
            570 pages
            ISSN:2375-4699
            EISSN:2375-4702
            DOI:10.1145/3579816

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 2 April 2023
            • Online AM: 18 July 2022
            • Accepted: 6 December 2021
            • Revised: 22 November 2021
            • Received: 4 August 2021