Abstract
Sung language recognition relies on both effective feature extraction and acoustic modeling. In this paper, we study rhythm-based music segmentation, in which the frame size is the duration of the smallest note in the music, as opposed to the fixed-length segmentation used in spoken language recognition. We find that acoustic features extracted with the rhythm-based segmentation scheme outperform those extracted with fixed-length segmentation. We also study the effectiveness of a musically motivated acoustic feature, octave scale cepstral coefficients (OSCCs), by comparing it with other acoustic features: log frequency cepstral coefficients, linear prediction coefficients (LPCs), and LPC-derived cepstral coefficients. Finally, we examine the modeling capabilities of Gaussian mixture models and support vector machines in sung language recognition experiments. Experiments conducted on a corpus of 400 popular songs sung in English, Chinese, German, and Indonesian show that the OSCC feature outperforms the other features. A sung language recognition accuracy of 64.9% was achieved when Gaussian mixture models were trained on shifted-delta OSCC acoustic features extracted via rhythm-based music segmentation.
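Shifted-delta cepstra stack several time-shifted delta vectors onto each frame, so that a frame-level classifier such as a GMM sees longer-range temporal context than a single delta would provide. The sketch below shows one common way such features could be computed from a matrix of per-segment cepstral coefficients; the `d`, `p`, and `k` defaults are illustrative and are not taken from the paper, whose exact configuration is not stated here.

```python
import numpy as np

def shifted_delta_cepstra(cepstra, d=1, p=3, k=7):
    """Stack k delta-cepstral vectors onto every frame.

    Each delta is computed with spread d (c[t+d] - c[t-d]) and the k
    deltas are taken p frames apart, then concatenated.

    cepstra: (num_frames, num_coeffs) array of per-segment cepstra
    returns: (num_frames, num_coeffs * k) shifted-delta feature matrix
    """
    t_max, n = cepstra.shape
    # Pad by repeating the edge frames so every delta index is valid.
    pad = d + p * (k - 1)
    padded = np.pad(cepstra, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros((t_max, n * k))
    for t in range(t_max):
        base = t + pad
        blocks = [padded[base + i * p + d] - padded[base + i * p - d]
                  for i in range(k)]
        out[t] = np.concatenate(blocks)
    return out
```

In spoken language identification, shifted-delta cepstra are often described by an N-d-P-k parameter tuple (see the Torres-Carrasquillo et al. 2002 entry in the references below); the same construction applies unchanged when the base cepstra are OSCCs computed over beat-length segments.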
- Adda-Decker, M., Antoine, F., Boula de Mareuil, P., Vasilescu, I., Lamel, L., Vaissiere, J., Geoffrois, E., and Lienard, J. S. 2003. Phonetic knowledge, phonotactics and perceptual validation for automatic language identification. In Proceedings of the International Congress of Phonetic Sciences (ICPhS).
- Adami, A. G. and Hermansky, H. 2003. Segmentation of speech for speaker and language recognition. In Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech).
- Brown, J. C. and Puckette, M. S. 1991. An efficient algorithm for the calculation of a constant Q transform. J. Acoust. Soc. Amer. 92, 5, 1933--1941.
- Campbell, W. M., Campbell, J. P., Reynolds, D. A., Singer, E., and Torres-Carrasquillo, P. A. 2006. Support vector machines for speaker and language recognition. Comput. Speech Lang. 20, 210--229.
- Dai, P., Iurgel, U., and Rigoll, G. 2003. A novel feature combination approach for spoken document classification with support vector machines. In Proceedings of the Multimedia Information Retrieval Workshop.
- Deutsch, D. 1988. The perceived height of octave-related complexes. J. Acoust. Soc. Amer. 80, 5, 1346--1353.
- Duxbury, C., Sandler, M., and Davies, M. 2002. A hybrid approach to musical note onset detection. In Proceedings of the International Conference on Digital Audio Effects (DAFx).
- Ellis, D. P. W. and Poliner, G. E. 2006. Identifying 'cover songs' with chroma features and dynamic programming beat tracking. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
- Fletcher, H. 1931. Some physical characteristics of speech and music. J. Acoust. Soc. Amer. 3, 2, 1--26.
- Friberg, A. and Sundberg, J. 1995. Time discrimination in a monotonic, isochronous sequence. J. Acoust. Soc. Amer. 98, 5, 2524--2531.
- Fry, D. B. and Manen, L. 1957. Basis for acoustic study of singing. J. Acoust. Soc. Amer. 29, 6, 690--692.
- Gauvain, J. L. and Lamel, L. 2000. Large-vocabulary continuous speech recognition: Advances and applications. Proc. IEEE 88, 8, 1181--1200.
- Goto, M. 2001. An audio-based real-time beat tracking system for music with or without drum-sounds. J. New Music Res. 30, 2, 159--171.
- Hazen, T. J. 1993. Automatic language identification using a segment-based approach. M.Sc. Thesis, MIT.
- House, A. S. and Neuburg, E. P. 1977. Toward automatic identification of the language of an utterance. I. Preliminary methodological considerations. J. Acoust. Soc. Amer. 62, 3, 708--713.
- Iskandar, D., Wang, Y., Kan, M. Y., and Li, H. 2006. Syllabic level automatic synchronization of music signals and text lyrics. In Proceedings of the ACM Multimedia Conference (MM). 659--662.
- Jiang, D. N., Lu, L., Zhang, H. J., Tao, J. H., and Cai, L. H. 2002. Music type classification by spectral contrast feature. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME'02). 113--116.
- Deller, J. R., Hansen, J. H. L., and Proakis, J. G. 1999. Discrete-Time Processing of Speech Signals. IEEE Press.
- Jourdain, R. 1997. Music, The Brain, and Ecstasy: How Music Captures Our Imagination. Harper Collins.
- Kim, Y. E. 2003. Singing voice analysis/synthesis. PhD. Thesis, MIT.
- Kirchhoff, K., Parandekar, S., and Bilmes, J. 2002. Mixed memory Markov models for automatic language identification. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
- Li, H. and Ma, B. 2005. A phonotactic language model for spoken language identification. In Proceedings of the 43rd Annual Meeting of the Association of Computational Linguistics (ACL).
- Li, H., Ma, B., and Lee, C. H. 2007. A vector space modeling approach to spoken language identification. IEEE Trans. Audio Speech Lang. Process. 15, 1, 271--284.
- Lu, L., Liu, D., and Zhang, H. J. 2006. Automatic mood detection and tracking of music audio signals. IEEE Trans. Speech Audio Process. 14, 1, 5--18.
- Ma, B., Guan, C., Li, H., and Lee, C. H. 2002. Multilingual speech recognition with language identification. In Proceedings of the International Conference on Spoken Language Processing (ICSLP).
- Maddage, N. C., Li, H., and Kankanhalli, M. S. 2006. Music structure based vector space retrieval. In Proceedings of the ACM Special Interest Group on Information Retrieval (SIGIR) Conference. 67--74.
- Maddage, N. C., Xu, C. S., Kankanhalli, M. S., and Shao, X. 2004. Content-based music structure analysis with the applications to music semantic understanding. In Proceedings of the ACM Multimedia Conference. 112--119.
- Matrouf, D., Adda-Decker, M., Lamel, L. F., and Gauvain, J.-L. 1998. Language identification incorporating lexical information. In Proceedings of the International Conference on Spoken Language Processing (ICSLP).
- Nwe, T. L. and Wang, Y. 2004. Automatic detection of vocal segments in popular songs. In Proceedings of the 5th International Symposium/Conference of Music Information Retrieval (ISMIR).
- Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2, 257--286.
- Rabiner, L. R. and Juang, B. H. 1993. Fundamentals of Speech Recognition. Prentice-Hall.
- Reynolds, D. A. and Rose, R. C. 1995. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3, 1, 72--83.
- Rossing, T. D., Moore, F. R., and Wheeler, P. A. 2001. The Science of Sound, 3rd Ed. Addison Wesley.
- Royal Schools of Music. 1949. Rudiments and Theory of Music. The Associated Board of the Royal Schools of Music, London.
- Scaringella, N., Zoia, G., and Mlynek, D. 2006. Automatic genre classification of music content. IEEE Sign. Process. Mag. 23, 2.
- Scheirer, E. D. 1998. Tempo and beat analysis of acoustic musical signals. J. Acoust. Soc. Amer. 103, 1, 588--601.
- Schwenninger, J., Brueckner, R., Willett, D., and Hennecke, M. 2006. Language identification in vocal music. In Proceedings of the International Conference on Music Information Retrieval (ISMIR).
- Singer, E., Torres-Carrasquillo, P. A., Gleason, T. P., Campbell, W. M., and Reynolds, D. A. 2003. Acoustic, phonetic and discriminative approaches to automatic language recognition. In Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech).
- Stevens, S. S., Volkmann, J., and Newman, E. B. 1937. A scale for the measurement of the psychological magnitude pitch. J. Acoust. Soc. Amer. 8, 3, 185--190.
- Sugiyama, M. 1991. Automatic language recognition using acoustic features. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
- Sundberg, J. E. F. and Lindqvist, J. 1972. Musical octave and pitch. J. Acoust. Soc. Amer. 54, 4, 922--929.
- Thyme-Gobbel, A. E. and Hutchins, S. E. 1996. On using prosodic cues in automatic language identification. In Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP).
- Torres-Carrasquillo, P. A., Singer, E., Kohler, M. A., Greene, R. J., Reynolds, D. A., and Deller Jr, J. R. 2002. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. In Proceedings of the International Conference on Spoken Language Processing (ICSLP).
- Tsai, W.-H. and Wang, H. M. 2004. Towards automatic identification of singing language in popular music recordings. In Proceedings of the International Conference on Music Information Retrieval (ISMIR).
- Typke, R., Wiering, F., and Veltkamp, R. 2005. A survey of music information retrieval systems. In Proceedings of the International Conference on Music Information Retrieval (ISMIR).
- Waibel, A., Geutner, P., Tomokiyo, L. M., Schultz, T., and Woszczyna, M. 2000. Multilinguality in speech and spoken language systems. Proc. IEEE 88, 8, 1297--1313.
- Ward, W. D. 1954. Subjective musical pitch. J. Acoust. Soc. Amer. 26, 369--380.
- Xiao, Z., Dellandrea, E., Dou, W., and Chen, L. 2008. What is the best segment duration for music mood analysis? In Proceedings of the 6th International Workshop on Content-Based Multimedia Indexing (CBMI). 17--24.
- Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. 2006. The HTK Book Version 3.4. Department of Engineering, University of Cambridge.
- Zhu, Y., Kankanhalli, M. S., and Gao, S. 2005. Music key detection for musical audio. In Proceedings of the 11th International Multimedia Modelling Conference (MMM).
- Zissman, M. A. 1996. Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech Audio Process. 4, 1, 31--44.
Beat space segmentation and octave scale cepstral feature for sung language recognition in pop music