Abstract
The scene graph is a symbolic data structure that comprehensively describes the objects and visual relations in a visual scene, yet it ignores the inherent perceptual saliency of each visual relation (i.e., relation saliency). Humans, in contrast, quickly allocate attention to the important/salient visual relations in a scene. To align with such human perception, we explicitly model the perceptual saliency of each visual relation in the scene graph by upgrading each graph edge (i.e., visual relation) with a relation-saliency attribute. We present a new design, named Saliency-guided Message Passing (SMP), that boosts the generation of such scene graph structures under the guidance of visual relation saliency. Technically, an object interaction encoder first strengthens object relation representations by jointly exploiting the appearance, semantic, and spatial relations in between. A dedicated branch then estimates the relation saliency of each visual relation via ordinal regression. Next, conditioned on the object and relation features (coupled with the estimated relation saliency), our SMP enhances scene graph generation by performing message passing over the objects and the most salient relations. Extensive experiments on the VG-KR and VG150 datasets demonstrate the superiority of SMP for scene graph generation. Moreover, we empirically validate the compelling generalizability of the scene graphs learned via SMP on downstream tasks such as cross-modal retrieval and image captioning.
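The saliency-guided message passing step described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual formulation: the function name `smp_step`, the flat-list feature layout, the top-k relation selection, and the simple additive saliency-weighted message are all assumptions made for clarity.

```python
# Illustrative sketch of saliency-guided message passing: object features
# are updated only along the top-k most salient relations, with each
# message weighted by the estimated relation saliency. All names and the
# message function are hypothetical simplifications.
from typing import List, Tuple

def smp_step(obj_feats: List[List[float]],
             relations: List[Tuple[int, int, List[float], float]],
             top_k: int = 2) -> List[List[float]]:
    """Each relation is (subject_idx, object_idx, rel_feat, saliency)."""
    # Keep only the top-k most salient relations for message passing.
    salient = sorted(relations, key=lambda r: r[3], reverse=True)[:top_k]
    dim = len(obj_feats[0])
    updated = [list(f) for f in obj_feats]
    for subj, obj, rel_feat, sal in salient:
        # Pass a saliency-weighted message along the edge in both directions.
        for src, dst in ((subj, obj), (obj, subj)):
            msg = [sal * (obj_feats[src][d] + rel_feat[d]) for d in range(dim)]
            updated[dst] = [u + m for u, m in zip(updated[dst], msg)]
    return updated
```

With `top_k=1`, only the single most salient relation contributes messages, so objects not incident to that edge keep their original features; this mimics how SMP suppresses unimportant edges during context aggregation.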
Boosting Scene Graph Generation with Visual Relation Saliency