Abstract
Audio fingerprinting techniques were developed to index and retrieve audio samples by comparing a compact, content-based signature of the audio instead of the entire audio sample, thereby reducing memory and computational expense. Many techniques have been applied to create audio fingerprints; with the introduction of deep learning, new data-driven unsupervised approaches have become available. This article presents the Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting (SAMAF), which improves hash generation through a novel loss function composed of three terms: Mean Square Error, minimizing the reconstruction error; Hash Loss, minimizing the distance between similar hashes and encouraging clustering; and Bitwise Entropy Loss, minimizing the variation inside the clusters. The performance of the model was assessed on a subset of the VoxCeleb1 dataset, a “speech in-the-wild” dataset. Furthermore, the model was compared against three baselines: Dejavu, a Shazam-like algorithm; Robust Audio Fingerprinting System (RAFS), a Bit Error Rate (BER) methodology robust to time-frequency distortions and coding/decoding transformations; and Panako, a constellation-based algorithm adding time-frequency distortion resilience. Extensive empirical evidence showed that our approach outperformed all the baselines in the audio identification task and in other classification tasks related to the attributes of the audio signal, with an economical hash size of either 128 or 256 bits for one second of audio.
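The three-term objective described above can be sketched in code. This is an illustrative sketch only: the exact formulation, binarization scheme, and term weights (`alpha`, `beta`, `gamma` below) are defined in the paper, and the interpretation of the Bitwise Entropy Loss as pushing each real-valued hash component toward a binary value is an assumption made here for illustration.

```python
import numpy as np

def samaf_loss(x, x_hat, h, h_sim, alpha=1.0, beta=1.0, gamma=1.0):
    """Illustrative sketch of a three-term SAMAF-style objective.

    x, x_hat : original and reconstructed feature sequences (arrays)
    h, h_sim : real-valued hash vectors in (0, 1) for an anchor clip
               and a similar (e.g., distorted) version of the same clip
    alpha, beta, gamma : assumed weighting factors for the three terms
    """
    # Mean Square Error: penalizes the autoencoder's reconstruction error.
    mse = np.mean((x - x_hat) ** 2)

    # Hash Loss: pulls hashes of similar audio together, encouraging
    # clips of the same content to cluster in hash space.
    hash_loss = np.mean((h - h_sim) ** 2)

    # Bitwise Entropy Loss (assumed form): treats each hash component as a
    # Bernoulli probability and penalizes values far from 0 or 1, reducing
    # variation inside a cluster when the hash is later binarized.
    p = np.clip(h, 1e-7, 1 - 1e-7)
    entropy = -np.mean(p * np.log(p) + (1 - p) * np.log(1 - p))

    return alpha * mse + beta * hash_loss + gamma * entropy
```

With a perfect reconstruction and identical hashes, only the entropy term remains, and it is largest when every hash component sits at 0.5 (maximally uncertain bits), which is exactly the configuration such a term discourages.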
Supplemental Material
Available for Download
Supplemental movie, appendix, image, and software files for SAMAF: Sequence-to-sequence Autoencoder Model for Audio Fingerprinting
- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Software. Retrieved from https://www.tensorflow.org/. Version 1.13.0.
- Shahin Amiriparian, Michael Freitag, Nicholas Cummins, and Björn Schuller. 2017. Sequence to sequence autoencoders for unsupervised representation learning from audio. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE'17).
- Xavier Anguera, Antonio Garzon, and Tomasz Adamek. 2012. MASK: Robust local features for audio fingerprinting. In Proceedings of the International Conference on Multimedia and Expo (ICME'12). 455--460.
- Andreas Arzt, Sebastian Böck, and Gerhard Widmer. 2012. Fast identification of piece and score position via symbolic fingerprinting. In Proceedings of the 13th International Symposium on Music Information Retrieval (ISMIR'12).
- Chris Bagwell. 2015. SoX—Sound eXchange. Software. Retrieved from http://sox.sourceforge.net/. Version 14.4.2.
- Shumeet Baluja and Michele Covell. 2007. Audio fingerprinting: Combining computer vision and data stream processing. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'07).
- Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 2 (Mar. 1994), 157--166. DOI:https://doi.org/10.1109/72.279181
- Judith C. Brown and Miller S. Puckette. 1992. An efficient algorithm for the calculation of a constant Q transform. J. Acoust. Soc. Amer. 92, 5 (June 1992), 2698--2701.
- Christopher J. C. Burges, John C. Platt, and Soumya Jana. 2003. Distortion discriminant analysis for audio fingerprinting. IEEE Trans. Speech Aud. Proc. 11, 3 (May 2003), 165--174.
- Pedro Cano, Eloi Batlle, Ton Kalker, and Jaap Haitsma. 2005. A review of audio fingerprinting. J. VLSI Sig. Proc. Syst. Sig. Image Vid. Technol. 41, 3 (Nov. 2005), 271--284.
- Yue Cao, Mingsheng Long, Jianmin Wang, Qiang Yang, and Philip S. Yu. 2016. Deep visual-semantic hashing for cross-modal retrieval. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'16). 1445--1454.
- Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'14). 1724--1734.
- Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee. 2016. Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder. Research Note. College of Electrical Engineering and Computer Science, National Taiwan University, Taipei City, Taiwan.
- George E. Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Aud. Speech Lang. Proc. 20, 1 (Jan. 2012), 30--42.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'09). 248--255. DOI:https://doi.org/10.1109/CVPR.2009.5206848
- Will Drevo. 2013. Audio Fingerprinting with Python and Numpy. Website. Retrieved from http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/.
- Yong Fan and Shuang Feng. 2016. A music identification system based on audio fingerprint. In Proceedings of the International Conference on Applied Computing and Information Technology (ACIT'16). 363--367.
- Jinyang Gao, H. V. Jagadish, Wei Lu, and Beng Chin Ooi. 2014. DSH: Data sensitive hashing for high-dimensional k-NN search. In Proceedings of the International Conference on Management of Data (SIGMOD'14).
- Yun Gu, Chao Ma, and Jie Yang. 2016. Supervised recurrent hashing for large scale video retrieval. In Proceedings of the ACM Multimedia Conference (MM'16). 272--276.
- Vishwa Gupta, Gilles Boulianne, and Patrick Cardinal. 2010. Content-based audio copy detection using nearest-neighbor mapping. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'10). 261--264.
- Jaap Haitsma and Ton Kalker. 2002. A highly robust audio fingerprinting system. In Proceedings of the International Conference on Music Information Retrieval (ISMIR'02).
- Mikael Henaff, Kevin Jarrett, Koray Kavukcuoglu, and Yann LeCun. 2011. Unsupervised learning of sparse features for scalable audio classification. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR'11). 681--686.
- Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition. IEEE Sig. Proc. Mag. 29, 6 (Nov. 2012), 82--97.
- Geoffrey E. Hinton and Ruslan Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (July 28, 2006), 504--507.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput. 9, 8 (Nov. 1997), 1735--1780.
- Che-Jen Hsieh, Jung-Shian Li, and Cheng-Fu Hung. 2007. A robust audio fingerprinting scheme for MP3 copyright. In Proceedings of the International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP'07).
- Corey Kereliuk, Bob L. Sturm, and Jan Larsen. 2015. Deep learning and music adversaries. IEEE Trans. Multimedia 17, 11 (Nov. 2015), 2059--2071.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12), Vol. 1. Curran Associates Inc., Lake Tahoe, NV, 1097--1105. Retrieved from http://dl.acm.org/citation.cfm?id=2999134.2999257.
- Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. 2015. Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'15). 3270--3278.
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521 (May 28, 2015), 436--444.
- Hanchao Li, Xiang Fei, Kuo-Ming Chao, Ming Yang, and Chaobo He. 2016. Towards a hybrid deep-learning method for music classification and similarity measurement. In Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE'16).
- Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. 2015. Deep hashing for compact binary codes learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'15). 2475--2483.
- James Lyons. 2017. python_speech_features. Software. Retrieved from https://github.com/jameslyons/python_speech_features. Version 0.6.
- Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2017. VoxCeleb: A large-scale speaker identification dataset. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH'17).
- Viet-Anh Nguyen and Minh N. Do. 2016. Deep learning based supervised hashing for efficient image retrieval. In Proceedings of the International Conference on Multimedia and Expo (ICME'16). 1--6.
- Chahid Ouali, Pierre Dumouchel, and Vishwa Gupta. 2015. Content-based multimedia copy detection. In Proceedings of the IEEE International Symposium on Multimedia (ISM'15).
- Hamza Özer, Bulent Sankur, and Nasir Memon. 2004. Robust audio hashing for audio identification. In Proceedings of the European Signal Processing Conference (EUSIPCO'04).
- Yongjoo Park, Michael Cafarella, and Barzan Mozafari. 2015. Neighbor-sensitive hashing. Proc. VLDB Endow. 9, 3 (Nov. 2015), 144--155.
- Yohan Petetin, Cyrille Laroche, and Aurélien Mayoue. 2015. Deep neural networks for audio scene recognition. In Proceedings of the European Signal Processing Conference (EUSIPCO'15).
- R. Roopalakshmi and G. Ram Mohana Reddy. 2015. A framework for estimating geometric distortions in video copies based on visual-audio fingerprints. J. VLSI Sig. Proc. Syst. Sig. Image Vid. Technol. 9, 1 (Jan. 2015), 201--210.
- Ruslan Salakhutdinov and Geoffrey E. Hinton. 2009. Semantic hashing. Int. J. Approx. Reason. 50, 7 (July 2009), 969--978.
- Joren Six, Olmo Cornelis, and Marc Leman. 2014. TarsosDSP, a real-time audio processing framework in Java. In Proceedings of the 53rd Audio Engineering Society Conference (AES'14).
- Joren Six and Marc Leman. 2014. Panako: A scalable acoustic fingerprinting system handling time-scale and pitch modification. In Proceedings of the Conference of the International Society for Music Information Retrieval (ISMIR'14).
- Reinhard Sonnleitner and Gerhard Widmer. 2016. Robust quad-based audio fingerprinting. IEEE/ACM Trans. Aud. Speech Lang. Proc. 24, 3 (Mar. 2016), 409--421.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the International Conference on Advances in Neural Information Processing Systems (NIPS'14), Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3104--3112.
- Christian Szegedy, Alexander Toshev, and Dumitru Erhan. 2013. Deep neural networks for object detection. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates Inc., 2553--2561. Retrieved from http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf.
- Avery Li-Chun Wang. 2003. An industrial-strength audio search algorithm. In Proceedings of the International Conference on Music Information Retrieval (ISMIR'03).