Abstract
Visual and spatial relationship detection in images is a fast-developing research topic in the multimedia field. It learns to recognize the semantic/spatial interactions between objects in an image, aiming to compose a structured semantic understanding of the scene. Most existing techniques directly encapsulate the holistic image feature plus the semantic and spatial features of the two given objects to predict the relationship, but leave the inherent supervision derived from such a structured and thorough image understanding under-exploited. Specifically, the inherent supervision among objects or relations within an image spans different granularities, from simple to comprehensive: (1) object-based supervision, which captures the interaction between the semantic and spatial features of each individual object; (2) inter-object supervision, which characterizes the dependency within the relationship triplet (<subject-predicate-object>); and (3) inter-relation supervision, which exploits contextual information among all relationship triplets in an image. These inherent multi-granular supervisions offer fertile ground for building self-supervised proxy tasks. In this article, we compose a trilogy that explores the multi-granular supervision in sequence, from the object-based to the inter-object to the inter-relation perspective. We integrate the standard relationship detection objective with a series of proposed self-supervised proxy tasks, a scheme we name Multi-Granular Self-Supervised learning (MGS). MGS is appealing in that it can be plugged into any neural relationship detection model by simply including the proxy tasks during training, without increasing the computational cost at inference. Through extensive experiments on the SpatialSense and VRD datasets, we demonstrate the superiority of MGS for both spatial and visual relationship detection tasks.
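The pluggable design described above amounts to a multi-task training objective: the standard relationship-detection loss is summed with one self-supervised proxy loss per granularity, and the proxy terms are simply dropped at inference. The following is a minimal illustrative sketch of that idea; the function name, loss weights, and weighting scheme are assumptions for illustration, not the paper's exact formulation.

```python
def mgs_training_loss(rel_loss, object_loss, inter_object_loss,
                      inter_relation_loss, weights=(1.0, 1.0, 1.0)):
    """Combine the detection loss with weighted proxy losses (hypothetical sketch).

    The three proxy terms correspond to the three granularities in the
    abstract. They exist only at training time; inference runs the
    relationship predictor alone, so no extra cost is added.
    """
    w_obj, w_pair, w_ctx = weights
    return (rel_loss
            + w_obj * object_loss           # (1) object-based supervision
            + w_pair * inter_object_loss    # (2) <subject-predicate-object> triplet
            + w_ctx * inter_relation_loss)  # (3) context among all triplets

# Example with equal weighting of the proxy tasks:
total = mgs_training_loss(0.8, 0.2, 0.3, 0.1)
```

Because the proxy tasks only contribute additional loss terms, any neural relationship detector can adopt this objective without architectural changes, which is what makes the approach pluggable.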
Boosting Relationship Detection in Images with Multi-Granular Self-Supervised Learning