Word level automatic alignment of music and lyrics using vocal synthesis

Published: 27 August 2010

Abstract

We propose a signal-based approach, in place of the commonly used model-based approach, to automatically align vocal music with text lyrics at the word level. We use a text-to-speech system to synthesize a singing voice from the lyrics, so that aligning the music signal with its lyrics becomes a problem of aligning two audio signals. The study builds on results from music information modeling and singing voice synthesis. In music information modeling, we study different music representation strategies for music segmentation, music region indexing, and region content description. In singing voice synthesis, we use music knowledge to generate a singing voice that approximates the target vocal line in terms of tempo. Experiments on a 20-song database show word-level alignment error rates of 26.3% and 36.1% at eighth-note and sixteenth-note alignment tolerances, respectively. The proposed approach offers an effective alternative for music-lyrics alignment that may require less training data.
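To make the idea of "aligning two audio signals" concrete, the sketch below uses dynamic time warping (DTW), a common technique for aligning two feature sequences. The abstract does not state which alignment algorithm or which features the authors use, so DTW, the Euclidean frame distance, the function names `dtw_path` and `map_frames`, and the toy one-dimensional "features" are all assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[float]


def dtw_path(rec: Sequence[Frame], syn: Sequence[Frame]) -> List[Tuple[int, int]]:
    """Return a minimum-cost warping path as (recording_idx, synth_idx) pairs."""
    n, m = len(rec), len(syn)
    inf = float("inf")

    def dist(x: Frame, y: Frame) -> float:
        # Euclidean distance between two feature frames (an assumed choice).
        return sum((p - q) ** 2 for p, q in zip(x, y)) ** 0.5

    # cost[i][j] = best accumulated cost aligning rec[:i] with syn[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(rec[i - 1], syn[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def map_frames(path: List[Tuple[int, int]], synth_frames: List[int]) -> List[int]:
    """Map synthesized-voice frame indices to the first matching recording frame."""
    first_match: Dict[int, int] = {}
    for rec_idx, syn_idx in path:
        first_match.setdefault(syn_idx, rec_idx)
    return [first_match[f] for f in synth_frames]


# Toy data: the "recording" holds each feature value twice as long as the
# synthesized voice, as if it were sung at half the tempo.
recording = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
synthesized = [[0.0], [1.0], [2.0]]

path = dtw_path(recording, synthesized)
# Word onsets known in the synthesized signal (frames 0, 1, 2) map to
# recording frames:
print(map_frames(path, [0, 1, 2]))  # [0, 2, 4]
```

Because the word boundaries of the synthesized voice are known from the text-to-speech step, projecting them through the warping path yields candidate word onsets in the recording. A real system would operate on spectral feature frames rather than the scalar placeholders used here.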

