Abstract
Object-level audiovisual saliency detection in 360° panoramic real-life dynamic scenes is important both for exploring and modeling human perception in immersive environments and for aiding the development of virtual, augmented, and mixed reality applications in fields such as education, social networking, entertainment, and training. To this end, we propose a new task, panoramic audiovisual salient object detection (PAV-SOD), which aims to segment the objects that attract the most human attention in 360° panoramic videos of real-life daily scenes. To support the task, we collect PAVS10K, the first panoramic video dataset for audiovisual salient object detection. It consists of 67 4K-resolution equirectangular videos with per-video labels, including hierarchical scene categories and associated attributes depicting specific challenges for conducting PAV-SOD, and 10,465 uniformly sampled video frames with manually annotated object-level and instance-level pixel-wise masks. These coarse-to-fine annotations enable multi-perspective analysis of PAV-SOD modeling. We further systematically benchmark 13 state-of-the-art salient object detection (SOD)/video object segmentation (VOS) methods on PAVS10K. In addition, we propose a new baseline network that takes advantage of both the visual and audio cues of 360° video frames by using a new conditional variational auto-encoder (CVAE). Our CVAE-based audiovisual network, CAV-Net, consists of a spatial-temporal visual segmentation network, a convolutional audio-encoding network, and audiovisual distribution estimation modules. CAV-Net outperforms all competing models and is able to estimate the aleatoric uncertainties within PAVS10K. From extensive experimental results, we report several findings about PAV-SOD challenges and insights into PAV-SOD model interpretability. We hope that our work can serve as a starting point for advancing SOD towards immersive media.
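The abstract names a CVAE but does not spell out its mechanics. As a rough illustration only, the sketch below shows two pieces of generic CVAE machinery: the KL divergence between a posterior and a prior diagonal Gaussian, and reparameterized sampling of a latent code. It is not the CAV-Net implementation; the function names (`kl_diag_gaussians`, `reparameterize`) and the toy latent values are assumptions made for this example.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians.

    Each argument is a list of per-dimension means or log-variances.
    In a standard CVAE, q plays the role of the posterior and p the
    input-conditioned prior (hypothetical naming for this sketch).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 2-D latent statistics; the values are made up for illustration.
    prior_mu, prior_logvar = [0.0, 0.0], [0.0, 0.0]
    post_mu, post_logvar = [1.0, 0.0], [0.0, 0.0]
    print("KL term:", kl_diag_gaussians(post_mu, post_logvar, prior_mu, prior_logvar))
    # Several latent draws from the prior; in a generic CVAE each draw
    # would be decoded into one candidate prediction.
    for _ in range(3):
        print("z sample:", reparameterize(prior_mu, prior_logvar, rng))
```

In a generic CVAE, decoding several latent samples for the same input yields a set of differing predictions, and their spread is one common way to read off an uncertainty estimate. Whether CAV-Net computes its aleatoric uncertainty in exactly this way is not stated in the abstract.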
Index Terms
PAV-SOD: A New Task towards Panoramic Audiovisual Saliency Detection