Abstract
We propose a signal-based approach, in place of the commonly used model-based approach, for automatically aligning vocal music with text lyrics at the word level. A text-to-speech system synthesizes a singing voice from the lyrics, so that aligning the music signal with its lyrics becomes a problem of aligning two audio signals. The study builds on results from music information modeling and singing voice synthesis. In music information modeling, we study different music representation strategies for music segmentation, music region indexing, and region content description; in singing voice synthesis, we use music knowledge to generate a singing voice that approximates the target vocal line in terms of tempo. Experiments on a 20-song database show word-level alignment error rates of 26.3% and 36.1% at eighth-note and sixteenth-note alignment tolerances, respectively. The proposed approach offers an effective alternative for music-lyrics alignment that may require less training data.
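To make the idea of aligning two audio signals concrete, the sketch below uses classic dynamic time warping (DTW) on two one-dimensional feature sequences and then scores word onsets against a note-length tolerance. It is an illustration under stated assumptions, not the method of the paper: the abstract does not name the alignment algorithm, the features, or the exact error definition. The function names `dtw_path` and `word_error_rate`, the absolute-difference distance, and the conversion of an eighth note to seconds (assuming the beat is a quarter note) are all hypothetical choices made for this example.

```python
from math import inf


def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Align sequences a and b with classic DTW (assumed here, not confirmed
    by the abstract). Returns (total_cost, path), where path is a list of
    (index_in_a, index_in_b) pairs."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match both frames
                                 cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1])      # advance in b only
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


def word_error_rate(predicted_onsets, reference_onsets, tolerance):
    """Fraction of words whose predicted onset (seconds) deviates from the
    reference onset by more than `tolerance` seconds. This error definition
    is an assumption made for illustration."""
    errors = sum(abs(p - r) > tolerance
                 for p, r in zip(predicted_onsets, reference_onsets))
    return errors / len(reference_onsets)


if __name__ == "__main__":
    # Toy "synthesized" and "recorded" feature tracks (hypothetical values).
    synthesized = [0, 1, 2, 3]
    recorded = [0, 0, 1, 2, 2, 3]
    total, path = dtw_path(synthesized, recorded)
    print(total, path)

    # Eighth-note tolerance at 120 quarter-note beats per minute.
    bpm = 120
    eighth_note = 60.0 / bpm / 2
    print(word_error_rate([0.0, 0.5, 1.2], [0.0, 0.55, 1.0], eighth_note))
```

A tighter tolerance can only keep or raise the error rate for the same predictions, which is consistent with the higher figure reported at the sixteenth-note tolerance.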
Index Terms
Word level automatic alignment of music and lyrics using vocal synthesis