Research Article

Boosting Relationship Detection in Images with Multi-Granular Self-Supervised Learning

Published: 17 February 2023

Abstract

Visual and spatial relationship detection in images is a fast-developing research topic in the multimedia field. It aims to recognize the semantic and spatial interactions between objects in an image so as to compose a structured semantic understanding of the scene. Most existing techniques directly encapsulate the holistic image feature together with the semantic and spatial features of the two given objects to predict their relationship, leaving the supervision inherent in such structured image understanding under-exploited. Specifically, this inherent supervision among objects and relations spans a hierarchy of granularities, from simple to comprehensive: (1) object-based supervision, which captures the interaction between the semantic and spatial features of each individual object; (2) inter-object supervision, which characterizes the dependency within a relationship triplet (<subject-predicate-object>); and (3) inter-relation supervision, which exploits the contextual information shared among all relationship triplets in an image. These multi-granular supervisions offer fertile ground for building self-supervised proxy tasks. In this article, we explore them as a trilogy, proceeding from the object-based to the inter-object and finally the inter-relation perspective. We integrate the standard relationship detection objective with a series of proposed self-supervised proxy tasks, an approach we name Multi-Granular Self-Supervised learning (MGS). MGS is appealing in that it can be plugged into any neural relationship detection model by simply including the proxy tasks during training, without increasing the computational cost at inference. Through extensive experiments on the SpatialSense and VRD datasets, we demonstrate the superiority of MGS for both spatial and visual relationship detection.
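Concretely, since MGS only augments the training objective, the overall loss can be read as a weighted sum of the supervised relationship detection loss and the three self-supervised proxy losses. Below is a minimal PyTorch sketch of such a multi-task objective; the class name, argument names, and unit default weights are illustrative assumptions on our part, not the paper's actual interface or tuned values.

```python
import torch.nn as nn

class MGSObjective(nn.Module):
    """Training-time objective: supervised relationship detection loss
    plus three weighted self-supervised proxy losses (object-based,
    inter-object, inter-relation). Names and weights are illustrative,
    not the paper's actual interface."""

    def __init__(self, w_obj: float = 1.0, w_pair: float = 1.0, w_ctx: float = 1.0):
        super().__init__()
        self.w_obj = w_obj    # object-based proxy weight
        self.w_pair = w_pair  # inter-object (triplet) proxy weight
        self.w_ctx = w_ctx    # inter-relation (contextual) proxy weight

    def forward(self, rel_loss, obj_loss, pair_loss, ctx_loss):
        # The proxy terms act purely as training-time regularizers; the
        # relationship detector itself is unchanged at inference.
        return (rel_loss
                + self.w_obj * obj_loss
                + self.w_pair * pair_loss
                + self.w_ctx * ctx_loss)
```

Because the proxy heads and their losses are dropped after training, the deployed detector runs exactly as before, which is what makes the scheme pluggable into existing models at zero inference cost.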



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2s, April 2023, 545 pages
ISSN: 1551-6857 | EISSN: 1551-6865
DOI: 10.1145/3572861
Editor: Abdulmotaleb El Saddik
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 17 February 2023
        • Online AM: 18 August 2022
        • Accepted: 3 August 2022
        • Revised: 6 May 2022
        • Received: 9 October 2021
