Abstract
Although many psychological studies have shown that audio plays an important role in the human visual perception system, audio information has rarely been considered in visual attention models. Because existing visual attention models rely on visual information alone, their performance is limited, and they require high computational complexity to compensate for the limited information available. To overcome these problems, we propose a lightweight audio-visual saliency (LAVS) model for video sequences. To the best of our knowledge, this article is the first attempt to exploit audio cues in an efficient deep-learning model for video saliency estimation. First, spatial-temporal visual features are extracted by a lightweight receptive field block (RFB) with bidirectional ConvLSTM units. Then, audio features are extracted by an improved lightweight environmental sound classification model. Subsequently, deep canonical correlation analysis (DCCA) captures the correspondence between the audio and spatial-temporal visual features, yielding a spatial-temporal auditory saliency map. Lastly, the spatial-temporal visual and auditory saliency maps are fused to obtain the final audio-visual saliency map. Extensive comparative experiments and ablation studies validate the LAVS model in terms of both effectiveness and complexity.
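To illustrate the correlation objective underlying the DCCA step, the sketch below computes canonical correlations between two toy feature sets with plain linear CCA in NumPy. All names, shapes, and data here are illustrative assumptions, not the paper's implementation: in DCCA the two inputs would be the outputs of the audio and visual sub-networks, and the correlation is maximized by backpropagation rather than a closed-form solve.

```python
# Minimal linear-CCA sketch of the correlation objective behind DCCA.
# Hypothetical toy setup: two views generated from a shared latent signal.
import numpy as np

def cca_correlations(X, Y, reg=1e-4):
    """Return canonical correlations between feature sets X (n, dx) and Y (n, dy)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Regularized covariance blocks (reg keeps the Cholesky factor well conditioned).
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten each view, then take singular values of the whitened
    # cross-covariance: these are the canonical correlations.
    Lx_inv = np.linalg.inv(np.linalg.cholesky(Sxx))
    Ly_inv = np.linalg.inv(np.linalg.cholesky(Syy))
    T = Lx_inv @ Sxy @ Ly_inv.T
    return np.linalg.svd(T, compute_uv=False)

rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 3))            # latent signal common to both views
audio = shared @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(500, 8))
visual = shared @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))
corrs = cca_correlations(audio, visual)
# Leading canonical correlations approach 1 when the two views share structure;
# a DCCA training loss is typically the negative sum of the top correlations.
```

In the deep variant, the closed-form solve is replaced by gradient ascent on this correlation through the two feature extractors, which is what allows the audio and visual representations to be aligned end to end.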
A Novel Lightweight Audio-visual Saliency Model for Videos