
VL-NMS: Breaking Proposal Bottlenecks in Two-stage Visual-language Matching

Published: 07 June 2023

Abstract

The prevailing framework for matching multimodal inputs is a two-stage process: (1) detect proposals with an object detector and (2) match text queries against those proposals. Existing two-stage solutions mostly focus on the matching step. In this article, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: proposals are generated solely based on detection confidence (i.e., query-agnostic), yet they are expected to contain all instances mentioned in the text query (i.e., query-aware). Due to this mismatch, proposals relevant to the text query may be suppressed during filtering, which in turn limits the matching performance. To this end, we propose VL-NMS, the first method to yield query-aware proposals in the first stage. VL-NMS regards all mentioned instances as critical objects and introduces a lightweight module to predict a score measuring how well each proposal aligns with a critical object. These scores guide the NMS operation to filter out proposals irrelevant to the text query, increasing the recall of critical objects and significantly improving matching performance. Since VL-NMS is agnostic to the matching step, it can easily be integrated into any state-of-the-art two-stage matching method. We validate the effectiveness of VL-NMS on three multimodal matching tasks: referring expression grounding, phrase grounding, and image-text matching. Extensive ablation studies on several baselines and benchmarks consistently demonstrate the superiority of VL-NMS.
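The abstract describes the VL-NMS mechanism only at a high level, so the sketch below illustrates the core idea in plain Python/NumPy: rank proposals for greedy NMS by a fusion of detector confidence and a predicted query-relatedness score, rather than by detector confidence alone, so that proposals matching the text query survive suppression. This is a minimal illustration, not the authors' implementation; the names (vl_nms, rel_scores, alpha) and the linear score fusion are assumptions, and in the paper the relatedness scores come from a learned lightweight module conditioned on the query.

    import numpy as np

    def iou(box, boxes):
        # IoU between one (x1, y1, x2, y2) box and an array of boxes.
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_a = (box[2] - box[0]) * (box[3] - box[1])
        area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return inter / (area_a + area_b - inter + 1e-9)

    def vl_nms(boxes, det_scores, rel_scores, iou_thresh=0.5, alpha=0.5):
        # Greedy NMS ranked by a blend of detection confidence and
        # query-relatedness (alpha is a hypothetical mixing weight), so
        # query-relevant proposals are kept even at modest detector scores.
        scores = (1.0 - alpha) * det_scores + alpha * rel_scores
        order = np.argsort(-scores)
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            overlaps = iou(boxes[i], boxes[order[1:]])
            order = order[1:][overlaps <= iou_thresh]
        return np.asarray(keep)

    # Toy usage: with query-relatedness scores, the second (query-relevant)
    # box outranks and suppresses its higher-confidence duplicate.
    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
    print(vl_nms(boxes, np.array([0.9, 0.6, 0.3]), np.array([0.1, 0.8, 0.9])))

Because the fused score only reorders and filters proposals before they reach any downstream model, a sketch like this slots in ahead of the matching stage unchanged, which is consistent with the paper's claim that VL-NMS is agnostic to the matching step.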


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 5s (October 2023), 280 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3599694
Editor: Abdulmotaleb El Saddik


Publisher

Association for Computing Machinery, New York, NY, United States

        Publication History

        • Published: 7 June 2023
        • Online AM: 4 January 2023
        • Accepted: 19 December 2022
        • Revised: 31 October 2022
        • Received: 2 July 2022
Published in TOMM, Volume 19, Issue 5s
