Abstract
Audio annotation is key to developing machine-listening systems, yet effective ways to accurately and rapidly obtain crowdsourced audio annotations are understudied. In this work, we seek to quantify the reliability/redundancy trade-off in crowdsourced soundscape annotation, investigate how visualizations affect accuracy and efficiency, and characterize how performance varies as a function of audio characteristics. In a controlled experiment, we varied the sound visualizations and the complexity of the soundscapes presented to human annotators. Results show that more complex audio scenes yield lower annotator agreement, and that spectrogram visualizations produce higher-quality annotations at a lower cost in time and human labor. We also found that recall is more affected than precision by soundscape complexity, and that mistakes can often be attributed to certain sound event characteristics. These findings have implications not only for how we should design annotation tasks and interfaces for audio data, but also for how we train and evaluate machine-listening systems.
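The precision and recall figures discussed above require matching each crowdsourced annotation against a reference ("ground truth") set of sound events. The paper does not specify its matching procedure here; the following is a minimal sketch of one common approach, matching events by time-interval overlap. All function names and the 50%-overlap criterion are illustrative assumptions, not taken from the paper.

```python
# Sketch: event-level precision/recall for crowdsourced sound-event
# annotations, matched by time-interval overlap (assumed criterion).

def overlaps(a, b, min_overlap=0.5):
    """True if interval a covers at least `min_overlap` of interval b's duration."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    dur = b[1] - b[0]
    return dur > 0 and inter / dur >= min_overlap

def precision_recall(annotated, reference):
    """annotated, reference: lists of (start_sec, end_sec) event intervals."""
    # True positives: annotations that match some reference event.
    tp = sum(any(overlaps(a, r) for r in reference) for a in annotated)
    # Misses: reference events covered by no annotation.
    fn = sum(not any(overlaps(a, r) for a in annotated) for r in reference)
    precision = tp / len(annotated) if annotated else 0.0
    recall = (len(reference) - fn) / len(reference) if reference else 0.0
    return precision, recall

# Example: the annotator marks two events but misses a third reference event,
# so precision stays high while recall drops.
ann = [(0.0, 1.0), (2.0, 3.0)]
ref = [(0.1, 0.9), (2.1, 2.9), (5.0, 6.0)]
p, r = precision_recall(ann, ref)
print(round(p, 2), round(r, 2))  # -> 1.0 0.67
```

This asymmetry is exactly the pattern reported in the abstract: in denser scenes, annotators tend to miss events (lower recall) more than they invent spurious ones (precision is less affected).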
Seeing Sound: Investigating the Effects of Visualizations and Complexity on Crowdsourced Audio Annotations