Abstract
Robust sound source localization in noisy and reverberant environments increasingly relies on deep neural networks fed with various acoustic features. Yet state-of-the-art research mainly focuses on optimizing algorithmic accuracy, resulting in huge models that prevent edge-device deployment. The edge, however, calls for real-time, low-footprint acoustic reasoning for applications such as hearing aids and robot interactions. Hence, we start from a robust CNN-based model using SRP-PHAT features, Cross3D [16], and pursue an efficient yet compact model architecture for the extreme edge. For the SRP feature representation and the neural network, we propose our scalable LC-SRP-Edge and Cross3D-Edge algorithms, respectively, both optimized towards lower hardware overhead. LC-SRP-Edge halves the complexity and on-chip memory overhead of the sinc interpolation compared to the original LC-SRP [19]. Over multiple SRP resolution cases, Cross3D-Edge saves 10.32% to 73.71% of the computational complexity and 59.77% to 94.66% of the neural network weights relative to the Cross3D baseline. In terms of the accuracy-efficiency tradeoff, the most balanced version (EM) requires only 127.1 MFLOPS of computation, 3.71 MByte/s of bandwidth, and 0.821 MByte of on-chip memory in total, while remaining competitive with state-of-the-art accuracy. It achieves 8.59 ms/frame end-to-end latency on a Raspberry Pi 4B, which is 7.26× faster than the corresponding baseline.
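For readers unfamiliar with the SRP-PHAT features mentioned above, the following is a minimal sketch of GCC-PHAT [36], the pairwise building block that SRP-PHAT accumulates over microphone pairs. It is not the LC-SRP-Edge algorithm proposed in this work. The function names (`dft`, `gcc_phat`, `estimate_delay`), the naive O(N²) DFT, and the `eps` regularizer are illustrative assumptions chosen to keep the example dependency-free.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (illustrative, not optimized)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        # The inverse transform carries the 1/N normalization.
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(x1, x2, eps=1e-12):
    """Circular GCC-PHAT between two equal-length frames.

    The cross-spectrum is divided by its magnitude (the phase transform),
    so only phase information contributes to the correlation.
    eps is an assumed small constant that avoids division by zero.
    """
    X1, X2 = dft(x1), dft(x2)
    cross = [a * b.conjugate() for a, b in zip(X1, X2)]
    whitened = [c / (abs(c) + eps) for c in cross]
    return [v.real for v in dft(whitened, inverse=True)]


def estimate_delay(x1, x2):
    """Delay of x1 relative to x2 in samples (positive: x1 lags x2)."""
    r = gcc_phat(x1, x2)
    n = len(r)
    peak = max(range(n), key=lambda i: r[i])
    # Indices past the midpoint of a circular correlation are negative lags.
    return peak if peak <= n // 2 else peak - n
```

If `x1` is a copy of `x2` circularly shifted by three samples, `estimate_delay(x1, x2)` returns 3. A full SRP-PHAT map would evaluate such correlations at the lags implied by each candidate source position and sum them over all microphone pairs.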
- [1] . 2018. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing 13, 1 (2018), 34–48.
- [2] . 2018. Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. In Proceedings of the 2018 26th European Signal Processing Conference. IEEE, 1462–1466.
- [3] . 2019. Localization, detection and tracking of multiple moving sound sources with a convolutional recurrent neural network. In Workshop on Detection and Classification of Acoustic Scenes and Events.
- [4] . 2019. A multi-room reverberant dataset for sound event localization and detection. In Workshop on Detection and Classification of Acoustic Scenes and Events.
- [5] . 1979. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65, 4 (1979), 943–950.
- [6] . 2021. An improved event-independent network for polyphonic sound event localization and detection. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 885–889.
- [7] . 2017. Broadband DOA estimation using convolutional neural networks trained with noise signals. In Proceedings of the 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. IEEE, 136–140.
- [8] . 2017. Multi-speaker localization using convolutional neural network trained with noise. arXiv:1712.04276 [cs.SD].
- [9] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In USENIX Symposium on Operating Systems Design and Implementation.
- [10] . 2019. Acoustic beamforming for noise source localization–Reviews, methodology and applications. Mechanical Systems and Signal Processing 120 (2019), 422–448. https://www.sciencedirect.com/science/article/abs/pii/S088832701830637X.
- [11] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing.
- [12] . 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1251–1258.
- [13] . 2020. Source localization using distributed microphones in reverberant environments based on deep learning and ray space transform. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), 2238–2251. https://ieeexplore.ieee.org/abstract/document/9146703.
- [14] . 2018. Enhanced robot speech recognition using biomimetic binaural sound source localization. IEEE Transactions on Neural Networks and Learning Systems 30, 1 (2018), 138–150.
- [15] . 2020. Cross3D Codebase. Retrieved from https://github.com/DavidDiazGuerra/Cross3D.
- [16] . 2020. Robust sound source tracking using SRP-PHAT and 3D convolutional neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2020), 300–311. https://ieeexplore.ieee.org/abstract/document/9268154.
- [17] . 2021. gpuRIR: A python library for room impulse response simulation with GPU acceleration. Multimedia Tools and Applications 80, 4 (2021), 5653–5671.
- [18] . 2001. Robust localization in reverberant rooms. In Proceedings of the Microphone Arrays. Springer, 157–180.
- [19] . 2020. Low-complexity steered response power mapping based on Nyquist-Shannon sampling. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA’21), 206–210.
- [20] . 2007. A generalized steered response power method for computationally viable source localization. IEEE Transactions on Audio, Speech, and Language Processing 15, 8 (2007), 2510–2526.
- [21] . 2007. A real-time SRP-PHAT source location implementation using stochastic region contraction (SRC) on a large-aperture microphone array. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07. Vol. 1. IEEE, I–121.
- [22] . 2020. The LOCATA challenge: Acoustic source localization and tracking. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), 1620–1643. https://ieeexplore.ieee.org/abstract/document/9079214.
- [23] . 2021. Improved feature extraction for CRNN-based multiple sound source localization. In 29th European Signal Processing Conference (EUSIPCO’21). 231–235.
- [24] . 2021. A survey of sound source localization with deep learning methods. The Journal of the Acoustical Society of America 152, 1 (2021), 107.
- [25] . 2021. SELD-TCN: Sound event localization & detection via temporal convolutional networks. In Proceedings of the 2020 28th European Signal Processing Conference. IEEE, 16–20.
- [26] . 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision. 1026–1034.
- [27] . 2015. Classification of spatial audio location and content using convolutional neural networks. In Proceedings of the Audio Engineering Society Convention 138. Audio Engineering Society.
- [28] . 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
- [29] . 2017. Design of UAV-embedded microphone array system for sound source localization in outdoor environments. Sensors 17, 11 (2017), 2535.
- [30] . 2020. A time-domain unsupervised learning based sound source localization method. In Proceedings of the 2020 IEEE 3rd International Conference on Information Communication and Signal Processing. IEEE, 26–32.
- [31] . 2017. Theory and Applications of Spherical Microphone Array Processing. Vol. 9. Springer.
- [32] . 2019. Sound Event Localization and Detection using Convolutional Recurrent Neural Network. Technical Report. DCASE2019 Challenge, Tech. Rep.
- [33] . 2019. Sound source detection, localization and classification using consecutive ensemble of CRNN models. ArXiv abs/1908.00766 (2019).
- [34] . 2011. Direction of arrival estimation of humans with a small sensor array using an artificial neural network. Progress In Electromagnetics Research B 27 (2011), 127–149.
- [35] . 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014).
- [36] . 1976. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing 24, 4 (1976), 320–327.
- [37] . 2019. Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems. arXiv:1904.03476 [cs.SD].
- [38] . 2014. Acoustic source localization. Ultrasonics 54, 1 (2014), 25–38.
- [39] . 2012. Acoustic source localization in anisotropic plates. Ultrasonics 52, 6 (2012), 740–746.
- [40] . 2019. Learning multiple sound source 2d localization. In Proceedings of the 2019 IEEE 21st International Workshop on Multimedia Signal Processing. IEEE, 1–6.
- [41] . 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
- [42] . 2018. Online direction of arrival estimation based on deep learning. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2616–2620.
- [43] Markus V. S. Lima, Wallace A. Martins, Leonardo O. Nunes, Luiz W. P. Biscainho, Tadeu N. Ferreira, Mauricio V. M. Costa, and Bowon Lee. 2015. A volumetric SRP with refinement step for sound source localization. IEEE Signal Processing Letters 22, 8 (2015), 1098–1102.
- [44] . 2019. Decoupled Weight Decay Regularization. arXiv:1711.05101 [cs.LG].
- [45] . 2012. Introduction to Shannon Sampling and Interpolation Theory. Springer Science & Business Media.
- [46] . 2015. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference. Vol. 8. Citeseer, 18–25.
- [47] . 2013. GPU-based approaches for real-time sound source localization using the SRP-PHAT algorithm. The International Journal of High Performance Computing Applications 27, 3 (2013), 291–306.
- [48] Javier Naranjo-Alcazar, Sergi Perez-Castanos, Jose Ferrandis, Pedro Zuccarello, and Maximo Cobos. 2021. Sound Event Localization and Detection using Squeeze-Excitation Residual CNNs. arXiv:2006.14436 [cs.SD].
- [49] . 2017. Source localization in an ocean waveguide using supervised machine learning. The Journal of the Acoustical Society of America 142, 3 (2017), 1176–1188.
- [50] . 2019. Three-stage approach for sound event localization and detection. Tech. Report of Detection and Classification of Acoustic Scenes and Events 2019 (DCASE) Challenge (2019). https://www.semanticscholar.org/paper/THREE-STAGE-APPROACH-FOR-SOUND-EVENT-LOCALIZATION-Noh-Choi/2e0962d0fc80a5b069a09716b35e4fa1ecdb97b1.
- [51] . 2015. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 5206–5210.
- [52] . 2018. CRNN-based joint azimuth and elevation localization with the Ambisonics intensity vector. In Proceedings of the 2018 16th International Workshop on Acoustic Signal Enhancement. IEEE, 241–245.
- [53] . 2017. Robust direction estimation with convolutional neural networks based steered response power. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 6125–6129.
- [54] . 2020. A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. arXiv:2006.01919 [eess.AS].
- [55] . 2021. Direction of arrival estimation of noisy speech using convolutional recurrent neural networks with higher-order ambisonics signals. In 29th European Signal Processing Conference (EUSIPCO’21), 211–215.
- [56] . 2019. Source localization in reverberant rooms using Deep Learning and microphone arrays. In Proceedings of the 23rd International Congress on Acoustics.
- [57] . 2021. BeamLearning: An end-to-end Deep Learning approach for the angular localization of sound sources using raw multichannel acoustic pressure data. The Journal of the Acoustical Society of America 149, 6 (2021), 4248–4263.
- [58] . 2017. Localization of sound sources in robotics: A review. Robotics and Autonomous Systems 96 (2017), 184–210. https://www.sciencedirect.com/science/article/pii/S0921889016304742.
- [59] . 2002. On the approximate W-disjoint orthogonality of speech. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 1. IEEE, I–529.
- [60] . 2015. On Sound Source Localization of Speech Signals using Deep Neural Networks. https://www.semanticscholar.org/paper/On-sound-source-localization-of-speech-signals-deep-Roden-Moritz/cbbcd9214f1d25aaf4cae3cddbf0d9712056e837.
- [61] . 1989. ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Transactions on Acoustics, Speech, and Signal Processing 37, 7 (1989), 984–995.
- [62] . 2018. Exploiting CNNs for improving acoustic source localization in noisy and reverberant conditions. IEEE Transactions on Emerging Topics in Computational Intelligence 2, 2 (2018), 103–116.
- [63] . 2003. Direction of arrival estimation for multiple source signals using independent component analysis. In Proceedings of the 7th International Symposium on Signal Processing and Its Applications. Vol. 2. IEEE, 411–414.
- [64] . 1986. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation 34, 3 (1986), 276–280.
- [65] Christopher Schymura, Benedikt T. Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, and Dorothea Kolossa. 2021. PILOT: Introducing transformers for probabilistic sound event localization. In Interspeech.
- [66] . 2021. Accdoa: Activity-coupled cartesian direction of arrival representation for sound event localization and detection. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 915–919.
- [67] . 2020. Sound event localization and detection using activity-coupled Cartesian DOA vector and RD3Net. arXiv:2006.12014 [eess.AS].
- [68] . 2018. Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment. In Proceedings of the Interspeech 2018-19th Annual Conference of the International Speech Communication Association.
- [69] . 2021. Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition. Comput. Speech Lang. 75 (2021), 101360.
- [70] . 2018. Deep residual network for sound source localization in the time domain. arXiv:1808.06429 [cs.SD].
- [71] . 2008. Interpolation methods for the SRP-PHAT algorithm. In Proceedings of the 11th International Workshop on Acoustic Echo and Noise Control. 14–17.
- [72] . 2018. Spatial audio feature discovery with convolutional neural networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 6797–6801.
- [73] . 2007. Real-time acoustic source localization in noisy environments for human-robot multimodal interaction. In Proceedings of the RO-MAN 2007-The 16th IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 393–398.
- [74] . 2013. An approach for sound source localization by complex-valued neural network. IEICE Transactions on Information and Systems 96, 10 (2013), 2257–2265.
- [75] . 2011. Sound source localization using hearing aids with microphones placed behind-the-ear, in-the-canal, and in-the-pinna. International Journal of Audiology 50, 3 (2011), 164–176.
- [76] . 2020. A deep learning framework for robust DOA estimation using spherical harmonic decomposition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), 1248–1259. https://ieeexplore.ieee.org/abstract/document/9056464.
- [77] . 2020. Exploiting periodicity features for joint detection and DOA estimation of speech sources using convolutional neural networks. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 566–570.
- [78] . 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. 5998–6008.
- [79] . 2018. Deep neural networks for joint voice activity detection and speaker localization. In Proceedings of the 2018 26th European Signal Processing Conference. IEEE, 1567–1571.
- [80] . 2018. Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates. Sensors 18, 10 (2018), 3418.
- [81] . 2018. Audio Source Separation and Speech Enhancement. John Wiley & Sons.
- [82] Qing Wang, Jun Du, Hua-Xin Wu, Jia Pan, Feng Ma, and Chin-Hui Lee. 2023. A four-stage data augmentation approach to ResNet-Conformer based acoustic modeling for sound event localization and detection. arXiv:2101.02919 [cs.SD].
- [83] . 2020. The USTC-IFLYTEK system for sound event localization and detection of DCASE2020 challenge. Tech. Rep., DCASE2020 Challenge (2020). https://www.semanticscholar.org/paper/THE-USTC-IFLYTEK-SYSTEM-FOR-SOUND-EVENT-AND-OF-Wang-Wu/735990cac7c3791725ac4c846ac61a603409d66b.
- [84] . 2018. Robust speaker localization guided by deep learning-based time-frequency masking. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 1 (2018), 178–188.
- [85] . 2009. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM 52, 4 (2009), 65–76.
- [86] . 2021. SSLIDE: Sound source localization for indoors based on deep learning. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 4680–4684.
- [87] . 2018. Sound source localization and speech enhancement with sparse Bayesian learning beamforming. The Journal of the Acoustical Society of America 143, 6 (2018), 3912–3921.
- [88] . 2015. A learning-based approach to direction of arrival estimation in noisy and reverberant environments. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2814–2818.
- [89] . 2012. High-accuracy TDOA-based localization without time synchronization. IEEE Transactions on Parallel and Distributed Systems 24, 8 (2012), 1567–1576.
- [90] . 2017. Sound source localization using deep learning models. Journal of Robotics and Mechatronics 29, 1 (2017), 37–48.
- [91] . 2020. Sound event localization based on sound intensity vector refined by DNN-based denoising and source separation. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 651–655.
- [92] . 2013. A learning-based approach to robust binaural sound localization. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2927–2932.
- [93] . 2019. Robust DOA estimation based on convolutional neural network and time-frequency masking. In Proceedings of the INTERSPEECH. 2703–2707.
Index Terms
CNN-based Robust Sound Source Localization with SRP-PHAT for the Extreme Edge