Abstract
The audiobook is a multimedia-based reading technology that has emerged in recent years. Aligning the e-book text with the book audio is the most important step in producing one. This article describes an audio and text alignment algorithm that uses deep learning and neural network technology to improve the efficiency and quality of audiobook production. The algorithm first applies dual-threshold endpoint detection to segment long audio into sentence-level short audio clips, each of which is recognized as a short text. The detection threshold is calculated by AIC-FCM, optimized with a simulated annealing genetic algorithm. The algorithm then calculates text similarity using Doc2vec, optimized by a threshold prediction method based on the average length of the short texts. Finally, the text sequence and audio segments, aligned in the time dimension, are proofread and output to meet the needs of audiobook production. Experiments show that, compared with traditional audio and text alignment algorithms, the proposed algorithm comes closer to the ideal segmentation result on long audio, achieves an alignment effect basically the same as that of Doc2vec, and reduces time complexity by about 35%.
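The segmentation step can be illustrated with a minimal sketch of dual-threshold endpoint detection on short-time energy. This is not the article's implementation: the function names `frame_energies` and `dual_threshold_segments` are hypothetical, the two thresholds are hand-picked constants rather than values derived by AIC-FCM, and the zero-crossing-rate check often paired with this technique is omitted.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Short-time energy of consecutive, non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def dual_threshold_segments(
    energies: List[float], low: float, high: float
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive.

    A run of frames whose energy stays above `low` is kept as a speech
    segment only if at least one frame in the run also exceeds `high`.
    Both thresholds are assumed inputs here; the article derives its
    threshold with AIC-FCM instead.
    """
    segments: List[Tuple[int, int]] = []
    start = None        # first frame of the current candidate run
    confirmed = False   # True once the run has crossed the high threshold
    for i, energy in enumerate(energies):
        if energy > low:
            if start is None:
                start, confirmed = i, False
            if energy > high:
                confirmed = True
        else:
            if start is not None and confirmed:
                segments.append((start, i))
            start = None
    # Close a run that reaches the end of the signal.
    if start is not None and confirmed:
        segments.append((start, len(energies)))
    return segments


if __name__ == "__main__":
    # Toy signal: silence, loud speech, silence, weak noise, silence, speech.
    samples = [0.0] * 8 + [1.0] * 12 + [0.0] * 8 + [0.2] * 8 + [0.0] * 4 + [1.0] * 4
    energies = frame_energies(samples, frame_len=4)
    print(dual_threshold_segments(energies, low=0.1, high=1.0))
```

In this toy signal the weak-noise run rises above the low threshold but never reaches the high one, so it is discarded, while the two loud runs are returned as segments. In a full pipeline each returned frame range would be cut from the audio and passed to speech recognition to obtain the short texts.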
Index Terms
Research on Chinese Audio and Text Alignment Algorithm Based on AIC-FCM and Doc2Vec