Combining Acoustic and Multilevel Visual Features for Music Genre Classification

Published: 24 August 2015

Abstract

Most music genre classification approaches extract acoustic features from short frames to capture timbre information, leading to the common bag-of-frames analysis framework. However, time-frequency analysis is also vital for modeling music genres. This article proposes multilevel visual features for extracting spectrogram textures and their temporal variations, together with a confidence-based late fusion scheme for combining the acoustic and visual features. The experimental results indicate that the proposed method achieved accuracy improvements of approximately 14% and 2% on the world's largest benchmark dataset (MASD) and on the Unique dataset, respectively. In particular, the proposed approach won the Music Information Retrieval Evaluation eXchange (MIREX) music genre classification contests from 2011 to 2013, demonstrating the feasibility and benefit of combining acoustic and visual features for classifying music genres.
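The confidence-based late fusion described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes each classifier (acoustic and visual) outputs a class-probability vector, and it uses each classifier's maximum posterior as its confidence weight, which is one plausible choice of confidence measure.

```python
import numpy as np

def late_fusion(p_acoustic, p_visual):
    """Confidence-weighted late fusion of two classifiers' probability vectors.

    Each classifier's confidence is taken as its maximum posterior probability
    (an illustrative choice); the fused distribution is the confidence-weighted
    average of the two posteriors, renormalized to sum to one.
    """
    p_a = np.asarray(p_acoustic, dtype=float)
    p_v = np.asarray(p_visual, dtype=float)
    w_a, w_v = p_a.max(), p_v.max()            # per-classifier confidence
    fused = (w_a * p_a + w_v * p_v) / (w_a + w_v)
    return fused / fused.sum()                 # renormalize to a distribution

# Example: the acoustic model is fairly confident in class 0, while the
# visual model leans weakly toward class 1; fusion favors the confident model.
p = late_fusion([0.7, 0.2, 0.1], [0.3, 0.5, 0.2])
print(p.argmax())  # → 0
```

Because the weights scale with each model's certainty, a hesitant classifier contributes less to the final decision than a confident one, which is the intuition behind confidence-based fusion.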

References

  1. Jeremy F. Alm and James S. Walker. 2002. Time-frequency analysis of musical instruments. SIAM Review 44, 3, 457--476. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. James Bergstra, Michael I. Mandel, and Douglas Eck. 2010. Scalable genre and tag prediction with spectral covariance. In Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR). J. Stephen Downie and Remco C. Veltkamp (Eds.), International Society for Music Information Retrieval, 507--512. http://dblp.uni-trier.de/db/conf/ismir/ismir2010.html#BergstraME10.Google ScholarGoogle Scholar
  3. Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. 2011. The million song dataset. In Proceedings of the International Conference on Music Information Retrieval. 591--596.Google ScholarGoogle Scholar
  4. William M. Campbell, Douglas E. Sturim, and Douglas A. Reynolds. 2006. Support vector machines using GMM supervectors for speaker verification. IEEE Sig. Process. Lett. 13, 5, 308--311.Google ScholarGoogle ScholarCross RefCross Ref
  5. Chuan Cao and Ming Li. 2009. Thinkits submission for MIREX 2009 audio music classification and similarity tasks. http://www.music-ir.org/mirex/results/2009/abs/CL.pdf.Google ScholarGoogle Scholar
  6. Chih-Chung Chang and Chih-Jen Lin. 2010. LIBSVM: A library for support vector machine. (2010). http://www.csie.ntu.edu.tw/∼cjlin/libsvm.Google ScholarGoogle Scholar
  7. Zhi-Sheng Chen, Jyh-Shing Roger Jang, and Chin-Hui Lee. 2011. A kernel framework for content-based artist recommendation system in music. IEEE Trans. Multimedi. 13, 6, 1371--1380. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, F. Gouyon, and J. G. Martins. 2012. Music genre classification using LBP textural features. Sig. Process. 92, 11, 2723--2737. DOI:http://dx.doi.org/10.1016/j.sigpro. 2012.04.023 Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1--30. http://dl.acm.org/citation.cfm?id=1248547.1248548. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Hrishikesh Deshpande, Rohit Singh, and Unjung Nam. 2001. Classification of music signals in the visual domain. In Proceedings of the COST-G6 Conference on Digital Audio Effects. 1--4.Google ScholarGoogle Scholar
  11. J. Stephen Downie, Andreas F. Ehmann, and Xiao Hu. 2005. Music-to-knowledge (M2K): A prototyping and evaluation environment for music digital library research. In Proceedings of the 5th ACM/IEEE Joint Conference on Digital Libraries. IEEE, 376--376. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Daniel P. W. Ellis. 2007. Beat tracking by dynamic programming. J. New Music Res. 36, 1, 51--60.Google ScholarGoogle ScholarCross RefCross Ref
  13. Daniel P. W. Ellis and Graham E. Poliner. 2007. Identifying cover songs' with chroma features and dynamic programming beat tracking. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Vol. 4, IEEE, 1429--1432.Google ScholarGoogle Scholar
  14. Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang. 2011. A survey of audio-based music classification and annotation. IEEE Trans. Multimed. 13, 2, 303--319. DOI:http://dx.doi.org/10.1109/TMM. 2010.2098858 Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Jean-Luc Gauvain and Chin-Hui Lee. 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2, 2, 291--298. DOI:http://dx.doi.org/10.1109/89.279278Google ScholarGoogle ScholarCross RefCross Ref
  16. Masataka Goto. 2003. SmartMusicKiosk: Music listening station with chorus-search function. In Proceedings of the 16th ACM Conference on User Interface Software and Technology. ACM, 31--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Peter Grosche, Joan Serrà, Meinard Müller, and Josep Ll. Arcos. 2012. Structure-based audio fingerprinting for music retrieval. In Proceedings of the International Conference on Music Information Retrieval. 55--60.Google ScholarGoogle Scholar
  18. Dan-Ning Jiang, Lie Lu, Hong-Jiang Zhang, Jian-Hua Tao, and Lian-Hong Cai. 2002. Music type classification by spectral contrast feature. In Proceedings of the IEEE International Conference on Multimedia and Expo. Vol. 1, 113--116. DOI:http://dx.doi.org/10.1109/ICME.2002.1035731Google ScholarGoogle ScholarCross RefCross Ref
  19. Josef Kittler, Mohamad Hatef, Robert P. W. Duin, and Jiri Matas. 1998. On combining classifiers. IEEE Trans. Patt. Anal. Mach. Intell. 20, 3 (1998), 226--239. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Chang-Hsing Lee, Jau-Ling Shih, Kun-Ming Yu, and Hwai-San Lin. 2009. Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features. IEEE Trans. Multimed. 11, 4, 670--682. DOI:http://dx.doi.org/10.1109/TMM.2009.2017635 Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Thomas Lidy and Andreas Rauber. 2005. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In Proceedings of the International Conference on Music Information Retrieval. 34--41.Google ScholarGoogle Scholar
  22. Cory McKay. 2010. Automatic music classification with jMIR. Ph.D. dissertation, McGill University, Canada. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Anders Meng, Peter Ahrendt, Jan Larsen, and Lars Kai Hansen. 2007. Temporal feature integration for music genre classification. IEEE Trans. Audio, Speech, Lang. Process. 15, 5 (July 2007), 1654--1664. DOI:http://dx.doi.org/10.1109/TASL.2007.899293 Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Anders Meng and John Shawe-Taylor. 2005. An investigation of feature models for music genre classification using the support vector classifier. In Proceedings of the International Conference on Music Information Retrieval. 604--609.Google ScholarGoogle Scholar
  25. Meinard Muller, Daniel P. W. Ellis, Anssi Klapuri, and Gaël Richard. 2011. Signal processing for music analysis. IEEE J. Select. Topics Sig. Process. 5, 6, 1088--1110.Google ScholarGoogle ScholarCross RefCross Ref
  26. Timo Ojala, Matti Pietikainen, and Topi Maenpaa. 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Patt. Anal. Machine Intell. 24, 7, 971--987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. François Pachet and Daniel Cazaly. 2000. A taxonomy of musical genres. In Proceedings of the RIAO Conference. 1238--1245.Google ScholarGoogle Scholar
  28. Y. Panagakis, C. L. Kotropoulos, and G. R. Arce. 2014. Music genre classification via joint sparse low-rank representation of audio features. IEEE/ACM Trans. Audio, Speech, Lang. Process. 22, 12, 1905--1917. DOI:http://dx.doi.org/10.1109/TASLP.2014.2355774 Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Yannis Panagakis, Constantine Kotropoulos, and Gonzalo R. Arce. 2010. Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification. IEEE Trans. Audio, Speech, and Lang. Process. 18, 3, 576--588. DOI:http://dx.doi.org/10.1109/TASL.2009.2036813 Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Jouni Paulus, Meinard Müller, and Anssi Klapuri. 2010. State of the art report: Audio-based music structure analysis. In Proceedings of the International Conference on Music Information Retrieval. 625--636.Google ScholarGoogle Scholar
  31. Soo-Chang Pei and Nien-Teh Hsu. 2009. Instrumentation analysis and identification of polyphonic music using beat-synchronous feature integration and fuzzy clustering. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 169--172. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Lawrence Rabiner and Biing-Hwang Juang. 1993. Fundamentals of Speech Recognition. Vol. 14, Prentice Hall PTR. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Jia-Min Ren and J. R. Jang. 2012. Discovering time-constrained sequential patterns for music genre classification. IEEE/ACM Trans. Audio, Speech, Lang. Process. 20, 4, 1134--1144. DOI:http://dx.doi.org/10.1109/TASL.2011.2172426 Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn. 2000. Speaker verification using adapted gaussian mixture models. Digital Signal Process. 10, 13, 19--41. DOI:http://dx.doi.org/10.1006/dspr. 1999.0361 Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Alexander Schindler, Rudolf Mayer, and Andreas Rauber. 2012. Facilitating comprehensive benchmarking experiments on the million song dataset. In Proceedings of the International Conference on Music Information Retrieval. 469--474.Google ScholarGoogle Scholar
  36. Klaus Seyerlehner. 2010. Content-based music recommender systems: Beyond simple frame-level audio similarity. Ph.D. dissertation, Johannes Kepler University, Linz, Austria.Google ScholarGoogle Scholar
  37. Klaus Seyerlehner, Markus Schedl, Peter Knees, and Reinhard Sonnleitner. 2011. Draft: A refined block-level feature set for classification, similarity and tag prediction. http://www.music-ir.org/mirex/abstracts/2011/SSKS1.pdf.Google ScholarGoogle Scholar
  38. Klaus Seyerlehner, Markus Schedl, Tim Pohle, and Peter Knees. 2010. Using block-level features for genre classification, tag classification and music similarity estimation. http://www.music-ir.org/mirex/abstracts/2010/SSPK1.pdf.Google ScholarGoogle Scholar
  39. E. Tsunoo, G. Tzanetakis, N. Ono, and S. Sagayama. 2011. Beyond timbral statistics: Improving music classification using percussive patterns and bass lines. IEEE/ACM Trans. Audio, Speech, Lang. Process. 19, 4 (May 2011), 1003--1014. DOI:http://dx.doi.org/10.1109/TASL.2010.2073706 Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. George Tzanetakis. 2007. MARSYAS submissions to MIREX 2007. http://www.music-ir.org/mirex/abstracts/2007/AI_CC_GC_MC_AS_tzanetakis.pdf.Google ScholarGoogle Scholar
  41. George Tzanetakis and Perry Cook. 2002. Musical genre classification of audio signals. IEEE Trans. Speech Audio Process. 10, 5.Google ScholarGoogle ScholarCross RefCross Ref
  42. Ming-Ju Wu, Zhi-Sheng Chen, Jyh-Shing Jang, Jia-Min Ren, Yi-Hsung Li, and Chun-Hung Lu. 2011. Combining visual and acoustic features for music genre classification. In Proceedings of the International Conference on Machine Learning and Applications (ICMLA). Vol. 2, IEEE, 124--129. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Ting-Fan Wu, Chih-Jen Lin, and Ruby C. Weng. 2004. Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res. 5, 975--1005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. C.-C. M. Yeh, Li Su, and Yi-Hsuan Yang. 2013. Dual-layer bag-of-frames model for music genre classification. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 246--250. DOI:http://dx.doi.org/10.1109/ICASSP.2013.6637646Google ScholarGoogle ScholarCross RefCross Ref


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 12, Issue 1
  August 2015
  220 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/2816987

        Copyright © 2015 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 24 August 2015
        • Accepted: 1 April 2015
        • Revised: 1 January 2015
        • Received: 1 September 2014

        Qualifiers

        • research-article
        • Research
        • Refereed