
Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music

Published: 01 January 2017

Abstract

Identifying musical instruments in polyphonic music recordings is a challenging but important problem in the field of music information retrieval. It enables music search by instrument, helps recognize musical genres, and can make music transcription easier and more accurate. In this paper, we present a convolutional neural network framework for predominant instrument recognition in real-world polyphonic music. We train our network on fixed-length music excerpts, each labeled with a single predominant instrument, and estimate an arbitrary number of predominant instruments from an audio signal of variable length. To obtain the excerpt-wise result, we aggregate multiple outputs from sliding windows over the test audio. In doing so, we investigated two aggregation methods: one takes the class-wise average followed by normalization, and the other performs temporally local class-wise max-pooling on the output probabilities before the averaging and normalization steps, so that the averaging process does not suppress the activation of sporadically appearing instruments. In addition, we conducted extensive experiments on several important factors that affect performance, including the analysis window size, the identification threshold, and the activation functions of the neural network, to find the optimal set of parameters. Our analysis of instrument-wise performance found that the onset type is a critical factor for the recall and precision of each instrument. Using a dataset of 10k audio excerpts from 11 instruments for evaluation, we found that convolutional neural networks are more robust than conventional methods that exploit spectral features and source separation with support vector machines. Experimental results showed that the proposed convolutional network architecture obtained micro and macro F1 measures of 0.619 and 0.513, respectively, improvements of 23.1% and 18.8% over the state-of-the-art algorithm.
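
As a concrete illustration of the two aggregation strategies described above, the sketch below shows how per-window class probabilities from the network could be combined into excerpt-level instrument labels. This is a minimal NumPy sketch, not the authors' implementation; the function name, the local pooling size, and the normalization-by-maximum step are assumptions made only for the purpose of the example.

```python
import numpy as np

def aggregate_windows(window_probs, threshold=0.5, pool_size=None):
    """Combine sliding-window class probabilities into excerpt-level labels.

    window_probs : array of shape (num_windows, num_classes) holding the
        sigmoid outputs of the CNN for each analysis window of the test audio.
    pool_size    : if None, use plain class-wise averaging (first strategy);
        otherwise take a class-wise max over blocks of `pool_size` windows
        before averaging (second strategy), so that instruments appearing
        only sporadically are not washed out by the mean.
    """
    probs = np.asarray(window_probs, dtype=float)

    if pool_size is not None:
        # Temporally local class-wise max-pooling, then average.
        pooled = np.stack([probs[i:i + pool_size].max(axis=0)
                           for i in range(0, len(probs), pool_size)])
    else:
        # Use the raw window probabilities directly.
        pooled = probs

    # Class-wise average over time, normalized by the largest activation;
    # classes whose normalized score exceeds the identification threshold
    # are reported as predominant instruments.
    avg = pooled.mean(axis=0)
    scores = avg / (avg.max() + 1e-12)
    return np.flatnonzero(scores >= threshold)
```

For example, aggregate_windows(p, threshold=0.5, pool_size=6) would report every instrument whose normalized score reaches at least half of the strongest class activation; the threshold value here is only a placeholder, since the identification threshold is one of the parameters the paper tunes experimentally.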

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 25, Issue 1
January 2017, 217 pages
ISSN: 2329-9290
EISSN: 2329-9304

Publisher

IEEE Press

Publication History

Published: 01 January 2017
Published in TASLP Volume 25, Issue 1

Qualifiers

  • Research-article

Cited By

  • (2024) DoodleTunes: Interactive Visual Analysis of Music-Inspired Children Doodles with Automated Feature Annotation. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-19. https://doi.org/10.1145/3613904.3642346. Online publication date: 11-May-2024.
  • (2024) Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music Using Discrete Wavelet Transform. Circuits, Systems, and Signal Processing, vol. 43, no. 7, pp. 4239-4271. https://doi.org/10.1007/s00034-024-02641-1. Online publication date: 1-Jul-2024.
  • (2024) Weighted Initialisation of Evolutionary Instrument and Pitch Detection in Polyphonic Music. Artificial Intelligence in Music, Sound, Art and Design, pp. 114-129. https://doi.org/10.1007/978-3-031-56992-0_8. Online publication date: 3-Apr-2024.
  • (2023) How reliable are posterior class probabilities in automatic music classification? Proceedings of the 18th International Audio Mostly Conference, pp. 45-50. https://doi.org/10.1145/3616195.3616228. Online publication date: 30-Aug-2023.
  • (2023) A Comparative Study of Speaker Role Identification in Air Traffic Communication Using Deep Learning Approaches. ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 22, no. 4, pp. 1-17. https://doi.org/10.1145/3572792. Online publication date: 24-Mar-2023.
  • (2023) Is it Violin or Viola? Classifying the Instruments' Music Pieces using Descriptive Statistics. ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 19, no. 2s, pp. 1-22. https://doi.org/10.1145/3563218. Online publication date: 16-Mar-2023.
  • (2023) Examining Emotion Perception Agreement in Live Music Performance. IEEE Transactions on Affective Computing, vol. 14, no. 2, pp. 1442-1460. https://doi.org/10.1109/TAFFC.2021.3093787. Online publication date: 1-Apr-2023.
  • (2023) Multiple Predominant Instruments Recognition in Polyphonic Music Using Spectro/Modgd-gram Fusion. Circuits, Systems, and Signal Processing, vol. 42, no. 6, pp. 3464-3484. https://doi.org/10.1007/s00034-022-02278-y. Online publication date: 18-Jan-2023.
  • (2023) Leveraging Computer Vision Networks for Guitar Tablature Transcription. Advances in Computer Graphics, pp. 3-15. https://doi.org/10.1007/978-3-031-50069-5_2. Online publication date: 28-Aug-2023.
  • (2023) Application of Neural Architecture Search to Instrument Recognition in Polyphonic Audio. Artificial Intelligence in Music, Sound, Art and Design, pp. 117-131. https://doi.org/10.1007/978-3-031-29956-8_8. Online publication date: 12-Apr-2023.
