research-article

Beat space segmentation and octave scale cepstral feature for sung language recognition in pop music

Published: 02 December 2011

Abstract

Sung language recognition relies on both effective feature extraction and acoustic modeling. In this paper, we study rhythm-based music segmentation, in which the frame size equals the duration of the smallest note in the music, as opposed to the fixed-length segmentation used in spoken language recognition. We find that acoustic features extracted with the rhythm-based segmentation scheme outperform those from fixed-length segmentation. We also study the effectiveness of a musically motivated acoustic feature, octave scale cepstral coefficients (OSCCs), by comparing it with other acoustic features: log frequency cepstral coefficients, linear prediction coefficients (LPCs), and LPC-derived cepstral coefficients. Finally, we examine the modeling capabilities of Gaussian mixture models and support vector machines in sung language recognition experiments. Experiments conducted on a corpus of 400 popular songs sung in English, Chinese, German, and Indonesian show that the OSCC feature outperforms the other features. A sung language recognition accuracy of 64.9% was achieved when Gaussian mixture models were trained on shifted-delta-OSCC acoustic features extracted via rhythm-based music segmentation.
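The two ingredients the abstract names can be sketched briefly: a beat-space frame length derived from the tempo, and shifted-delta stacking of cepstral vectors. The sketch below uses NumPy; the delta distance `d`, shift `p`, block count `k`, and the assumption that the smallest note is a fixed subdivision of a beat are illustrative defaults, not necessarily the paper's exact settings.

```python
import numpy as np

def beat_space_frame_length(tempo_bpm, sr, subdivision=4):
    # Frame length in samples equal to the duration of the smallest
    # note, taken here as 1/subdivision of a beat (an assumption; in
    # beat-space segmentation this is estimated from the music itself).
    beat_sec = 60.0 / tempo_bpm
    return int(round(beat_sec / subdivision * sr))

def shifted_delta(cepstra, d=1, p=3, k=3):
    # Shifted-delta stacking: for each frame t, concatenate k delta
    # vectors c(t + i*p + d) - c(t + i*p - d), i = 0..k-1.
    # Edge padding keeps the output the same length as the input.
    T = cepstra.shape[0]
    padded = np.pad(cepstra, ((d, d + p * (k - 1)), (0, 0)), mode="edge")
    blocks = [
        padded[i * p + 2 * d : i * p + 2 * d + T] - padded[i * p : i * p + T]
        for i in range(k)
    ]
    return np.hstack(blocks)
```

For example, at 120 BPM and a 22.05 kHz sampling rate, sixteenth-note frames come out to 2756 samples, and stacking k = 3 delta blocks triples the cepstral feature dimension per frame.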



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 7, Issue 4 (November 2011)
108 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/2043612

Copyright © 2011 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 2 December 2011
• Received: 1 October 2010
• Accepted: 1 March 2010
• Revised: 1 February 2010
