SAMAF: Sequence-to-sequence Autoencoder Model for Audio Fingerprinting

Published: 22 May 2020

Abstract

Audio fingerprinting techniques were developed to index and retrieve audio samples by comparing a content-based compact signature of the audio instead of the entire audio sample, thereby reducing memory and computational expense. Different techniques have been applied to create audio fingerprints; however, with the introduction of deep learning, new data-driven unsupervised approaches are available. This article presents the Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting (SAMAF), which improved hash generation through a novel loss function composed of three terms: Mean Square Error, minimizing the reconstruction error; Hash Loss, minimizing the distance between similar hashes and encouraging clustering; and Bitwise Entropy Loss, minimizing the variation inside the clusters. The performance of the model was assessed with a subset of the VoxCeleb1 dataset, a "speech in-the-wild" dataset. Furthermore, the model was compared against three baselines: Dejavu, a Shazam-like algorithm; the Robust Audio Fingerprinting System (RAFS), a Bit Error Rate (BER) methodology robust to time-frequency distortions and coding/decoding transformations; and Panako, a constellation-based algorithm adding time-frequency distortion resilience. Extensive empirical evidence showed that our approach outperformed all the baselines in the audio identification task and other classification tasks related to the attributes of the audio signal, with an economical hash size of either 128 or 256 bits for one second of audio.
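As a rough illustration of how the three loss terms described in the abstract could be combined, the sketch below implements one plausible reading of them in NumPy. This is not the authors' implementation: the function name, the pairwise form of the Hash Loss, the binary-entropy form of the Bitwise Entropy Loss, and the weighting parameters are all assumptions for illustration only.

```python
import numpy as np

def samaf_loss(x, x_recon, hashes, labels, alpha=1.0, beta=1.0):
    """Hedged sketch of a three-term fingerprinting loss (not the paper's code).

    x, x_recon : (batch, time, feat) original and reconstructed audio features
    hashes     : (batch, bits) real-valued hash activations in [0, 1]
    labels     : (batch,) ids; equal ids mark segments that should hash alike
    alpha/beta : assumed weighting coefficients for the last two terms
    """
    # Term 1 -- Mean Square Error: reconstruction fidelity of the autoencoder.
    mse = np.mean((x - x_recon) ** 2)

    # Term 2 -- Hash Loss: pull hashes of similar segments (same label)
    # toward each other, encouraging clustering.
    hash_loss, pairs = 0.0, 0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                hash_loss += np.mean((hashes[i] - hashes[j]) ** 2)
                pairs += 1
    hash_loss = hash_loss / pairs if pairs else 0.0

    # Term 3 -- Bitwise Entropy Loss: push each activation toward 0 or 1,
    # reducing variation inside clusters (binary entropy per bit).
    eps = 1e-12
    entropy = -np.mean(hashes * np.log(hashes + eps)
                       + (1.0 - hashes) * np.log(1.0 - hashes + eps))

    return mse + alpha * hash_loss + beta * entropy
```

Under this reading, nearly-binary hashes for same-label segments drive all three terms toward zero, while ambiguous activations near 0.5 are penalized by the entropy term.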


References

  1. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. Software. Retrieved from https://www.tensorflow.org/. Version 1.13.0.
  2. Shahin Amiriparian, Michael Freitag, Nicholas Cummins, and Björn Schuller. 2017. Sequence to sequence autoencoders for unsupervised representation learning from audio. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE'17).
  3. Xavier Anguera, Antonio Garzon, and Tomasz Adamek. 2012. MASK: Robust local features for audio fingerprinting. In Proceedings of the International Conference on Multimedia and Expo (ICME'12). 455--460.
  4. Andreas Arzt, Sebastian Böck, and Gerhard Widmer. 2012. Fast identification of piece and score position via symbolic fingerprinting. In Proceedings of the 13th International Symposium on Music Information Retrieval (ISMIR'12).
  5. Chris Bagwell. 2015. SoX—Sound eXchange. Software. Retrieved from http://gts.sourceforge.net/. Version 14.4.2.
  6. Shumeet Baluja and Michele Covell. 2007. Audio fingerprinting: Combining computer vision and data stream processing. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'07).
  7. Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 2 (Mar. 1994), 157--166. DOI: https://doi.org/10.1109/72.279181
  8. Judith C. Brown and Miller S. Puckette. 1992. An efficient algorithm for the calculation of a constant Q transform. J. Acoust. Soc. Amer. 92, 5 (June 1992), 2698--2701.
  9. Christopher J. C. Burges, John C. Platt, and Soumya Jana. 2003. Distortion discriminant analysis for audio fingerprinting. IEEE Trans. Speech Aud. Proc. 11, 3 (May 2003), 165--174.
  10. Pedro Cano, Eloi Batlle, Ton Kalker, and Jaap Haitsma. 2005. A review of audio fingerprinting. J. VLSI Sig. Proc. Syst. Sig. Image Vid. Technol. 41, 3 (Nov. 2005), 271--284.
  11. Yue Cao, Mingsheng Long, Jianmin Wang, Qiang Yang, and Philip S. Yu. 2016. Deep visual-semantic hashing for cross-modal retrieval. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'16). 1445--1454.
  12. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'14). 1724--1734.
  13. Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee. 2016. Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. Research Note. College of Electrical Engineering and Computer Science, National Taiwan University, Taipei City, Taiwan.
  14. George E. Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Aud. Speech Lang. Proc. 20, 1 (Jan. 2012), 30--42.
  15. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'09). 248--255. DOI: https://doi.org/10.1109/CVPR.2009.5206848
  16. Will Drevo. 2013. Audio fingerprinting with Python and Numpy. Website. Retrieved from http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/.
  17. Yong Fan and Shuang Feng. 2016. A music identification system based on audio fingerprint. In Proceedings of the International Conference on Applied Computing and Information Technology (ACIT'16). 363--367.
  18. Jinyang Gao, H. V. Jagadish, Wei Lu, and Beng Chin Ooi. 2014. DSH: Data sensitive hashing for high-dimensional k-NN search. In Proceedings of the International Conference on Management of Data (SIGMOD'14).
  19. Yun Gu, Chao Ma, and Jie Yang. 2016. Supervised recurrent hashing for large scale video retrieval. In Proceedings of the ACM on Multimedia Conference (MM'16). 272--276.
  20. Vishwa Gupta, Gilles Boulianne, and Patrick Cardinal. 2010. Content-based audio copy detection using nearest-neighbor mapping. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'10). 261--264.
  21. Jaap Haitsma and Ton Kalker. 2002. A highly robust audio fingerprinting system. In Proceedings of the International Conference on Music Information Retrieval (ISMIR'02).
  22. Mikael Henaff, Kevin Jarrett, Koray Kavukcuoglu, and Yann LeCun. 2011. Unsupervised learning of sparse features for scalable audio classification. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR'11). 681--686.
  23. Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel Rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition. IEEE Sig. Proc. Mag. 29, 6 (Nov. 2012), 82--97.
  24. Geoffrey E. Hinton and Ruslan Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (July 28, 2006), 504--507.
  25. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput. 9, 8 (Nov. 1997), 1735--1780.
  26. Che-Jen Hsieh, Jung-Shian Li, and Cheng-Fu Hung. 2007. A robust audio fingerprinting scheme for MP3 copyright. In Proceedings of the International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP'07).
  27. Corey Kereliuk, Bob L. Sturm, and Jan Larsen. 2015. Deep learning and music adversaries. IEEE Trans. Multimedia 17, 11 (Nov. 2015), 2059--2071.
  28. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12), Vol. 1. Curran Associates Inc., Lake Tahoe, NV, 1097--1105. Retrieved from http://dl.acm.org/citation.cfm?id=2999134.2999257.
  29. Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. 2015. Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'15). 3270--3278.
  30. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521 (May 28, 2015), 436--444.
  31. Hanchao Li, Xiang Fei, Kuo-Ming Chao, Ming Yang, and Chaobo He. 2016. Towards a hybrid deep-learning method for music classification and similarity measurement. In Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE'16).
  32. Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. 2015. Deep hashing for compact binary codes learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'15). 2475--2483.
  33. James Lyons. 2017. python_speech_features. Software. Retrieved from https://github.com/jameslyons/python_speech_features. Version 0.6.
  34. Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2017. VoxCeleb: A large-scale speaker identification dataset. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH'17).
  35. Viet-Anh Nguyen and Minh N. Do. 2016. Deep learning based supervised hashing for efficient image retrieval. In Proceedings of the International Conference on Multimedia and Expo (ICME'16). 1--6.
  36. Chahid Ouali, Pierre Dumouchel, and Vishwa Gupta. 2015. Content-based multimedia copy detection. In Proceedings of the IEEE International Symposium on Multimedia (ISM'15).
  37. Hamza Özer, Bulent Sankur, and Nasir Memon. 2004. Robust audio hashing for audio identification. In Proceedings of the European Signal Processing Conference (EUSIPCO'04).
  38. Yongjoo Park, Michael Cafarella, and Barzan Mozafari. 2015. Neighbor-sensitive hashing. J. Proc. VLDB Endow. 9, 3 (Nov. 2015), 144--155.
  39. Yohan Petetin, Cyrille Laroche, and Aurélien Mayoue. 2015. Deep neural networks for audio scene recognition. In Proceedings of the European Signal Processing Conference (EUSIPCO'15).
  40. R. Roopalakshmi and G. Ram Mohana Reddy. 2015. A framework for estimating geometric distortions in video copies based on visual-audio fingerprints. J. VLSI Sig. Proc. Syst. Sig. Image Vid. Technol. 9, 1 (Jan. 2015), 201--210.
  41. Ruslan Salakhutdinov and Geoffrey E. Hinton. 2009. Semantic hashing. Int. J. Approx. Reas. 50, 7 (July 2009), 969--978.
  42. Joren Six, Olmo Cornelis, and Marc Leman. 2014. TarsosDSP, a real-time audio processing framework in Java. In Proceedings of the 53rd Audio Engineering Society Conference (AES'14).
  43. Joren Six and Marc Leman. 2014. Panako: A scalable acoustic fingerprinting system handling time-scale and pitch modification. In Proceedings of the Conference of the International Society for Music Information Retrieval (ISMIR'14).
  44. Reinhard Sonnleitner and Gerhard Widmer. 2016. Robust quad-based audio fingerprinting. IEEE/ACM Trans. Aud. Speech Lang. Proc. 24, 3 (Mar. 2016), 409--421.
  45. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the International Conference on Advances in Neural Information Processing Systems (NIPS'14), Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3104--3112.
  46. Christian Szegedy, Alexander Toshev, and Dumitru Erhan. 2013. Deep neural networks for object detection. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates Inc., 2553--2561. Retrieved from http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf.
  47. Avery Li-Chun Wang. 2003. An industrial-strength audio search algorithm. In Proceedings of the International Conference on Music Information Retrieval (ISMIR'03).
