research-article

Boosting Scene Graph Generation with Visual Relation Saliency

Published: 05 January 2023

Abstract

The scene graph is a symbolic data structure that comprehensively describes the objects and visual relations in a visual scene, yet it ignores the inherent perceptual saliency of each visual relation (i.e., relation saliency). Humans, however, quickly allocate attention to the important/salient visual relations in a scene. To align with such human perception, we explicitly model the perceptual saliency of each visual relation in the scene graph by upgrading each graph edge (i.e., visual relation) with a relation-saliency attribute. We present a new design, named Saliency-guided Message Passing (SMP), that boosts the generation of such a scene graph structure under the guidance of visual relation saliency. Technically, an object interaction encoder first strengthens object relation representations by jointly exploiting the appearance, semantic, and spatial relations in between. A separate branch then estimates the relation saliency of each visual relation via ordinal regression. Next, conditioned on the object and relation features (coupled with the estimated relation saliency), our SMP enhances scene graph generation by performing message passing over the objects and the most salient relations. Extensive experiments on the VG-KR and VG150 datasets demonstrate the superiority of SMP for scene graph generation. Moreover, we empirically validate the compelling generalizability of the scene graphs learned via SMP on downstream tasks such as cross-modal retrieval and image captioning.
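The paper's own implementation is not shown here, but the two core ideas in the abstract — estimating a saliency rank by ordinal regression (commonly cast as K−1 binary "rank > k" classifiers) and restricting message passing to the most salient graph edges — can be sketched as follows. This is a minimal illustrative sketch: the function names, the threshold-counting decoder, and the simple mean aggregation are assumptions for illustration, not the authors' design.

```python
import numpy as np

def ordinal_rank(threshold_logits):
    """Decode an ordinal-regression saliency rank.

    Assumes the saliency branch emits K-1 logits, one per binary
    "rank > k" classifier; the predicted rank is the number of
    thresholds whose sigmoid probability exceeds 0.5.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(threshold_logits, dtype=float)))
    return int((probs > 0.5).sum())

def saliency_guided_step(node_feats, edges, saliency, top_k):
    """One hop of saliency-guided message passing (illustrative).

    Keeps only the top_k most salient directed edges (src, dst),
    then mean-aggregates each kept source feature into its
    destination node together with the node's own feature.
    """
    node_feats = np.asarray(node_feats, dtype=float)
    keep = np.argsort(saliency)[::-1][:top_k]   # most salient edges first
    out = node_feats.copy()                     # self message
    counts = np.ones(len(node_feats))
    for idx in keep:
        src, dst = edges[idx]
        out[dst] += node_feats[src]
        counts[dst] += 1
    return out / counts[:, None]
```

For example, with three threshold logits `[2.0, 1.0, -3.0]` the first two classifiers fire, giving saliency rank 2; and with `top_k=2` only the two most salient edges contribute messages, so a low-saliency relation is excluded from the update, mirroring the abstract's "message passing over the objects and the most salient relations".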



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1
January 2023, 505 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3572858
Editor: Abdulmotaleb El Saddik

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 January 2023
        • Online AM: 17 March 2022
        • Accepted: 24 January 2022
        • Revised: 19 December 2021
        • Received: 28 June 2021


        Qualifiers

        • research-article
        • Refereed
