Abstract
The prevailing framework for matching multimodal inputs is a two-stage process: (1) detecting proposals with an object detector and (2) matching text queries against those proposals. Existing two-stage solutions mostly focus on the matching step. In this article, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: proposals are generated solely from detection confidence (i.e., query-agnostic), yet they are expected to contain all instances mentioned in the text query (i.e., query-aware). Because of this mismatch, proposals relevant to the text query may be suppressed during filtering, which in turn caps the matching performance. To this end, we propose VL-NMS, the first method to yield query-aware proposals in the first stage. VL-NMS regards all mentioned instances as critical objects and introduces a lightweight module that predicts a score measuring how well each proposal aligns with a critical object. These scores guide the NMS operation to filter out proposals irrelevant to the text query, increasing the recall of critical objects and significantly improving matching performance. Since VL-NMS is agnostic to the matching step, it can be integrated into any state-of-the-art two-stage matching method. We validate the effectiveness of VL-NMS on three multimodal matching tasks: referring expression grounding, phrase grounding, and image-text matching. Extensive ablation studies on several baselines and benchmarks consistently demonstrate the superiority of VL-NMS.
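The core mechanism described above — letting a query-relevance score, rather than detection confidence alone, decide which proposals survive NMS — can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`query_aware_nms`, `iou`) and the choice to combine the two scores by simple multiplication are our assumptions for illustration; the article's lightweight scoring module would supply `rel_scores` in practice.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and a set of boxes, all in [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def query_aware_nms(boxes, det_scores, rel_scores, iou_thresh=0.5):
    """Greedy NMS ranked by det_score * rel_score instead of det_score alone.

    rel_scores stands in for the predicted alignment between each proposal
    and the instances mentioned in the text query (a modeling assumption
    here; the combination rule is illustrative, not the paper's).
    """
    combined = det_scores * rel_scores
    order = np.argsort(-combined)  # highest combined score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Suppress remaining proposals that overlap the kept one too much.
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```

With two heavily overlapping proposals, vanilla (query-agnostic) NMS keeps whichever has the higher detector confidence; the query-aware variant can instead keep the one the text query actually refers to, which is exactly the recall gain on critical objects that the abstract describes.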
VL-NMS: Breaking Proposal Bottlenecks in Two-stage Visual-language Matching