
Scene Graph Semantic Inference for Image and Text Matching

Abstract

With the rapid development of information technology, image and text data have grown dramatically. Image and text matching techniques enable computers to understand information from both the visual and textual modalities and to match them by semantic content. Existing methods focus on co-occurrence statistics of visual and textual objects and learn only coarse-level associations; without intramodal semantic inference, they fail to establish fine-level associations between modalities. Scene graphs capture the interactions among visual and textual objects and model intramodal semantic associations, which are crucial for understanding the scenes depicted in images and text. In this article, we propose a novel scene graph semantic inference network (SGSIN) for image and text matching that effectively learns fine-level semantic information in vision and text to help bridge the cross-modal discrepancy. Specifically, we design two matching modules and construct a scene graph within each module; aggregating neighborhood information over the graph refines the semantic representation of each object and achieves fine-level alignment between the visual and textual modalities. Extensive experiments on Flickr30K and MSCOCO achieve state-of-the-art results, validating the advantages of the proposed approach.
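The neighborhood aggregation described above is in the spirit of graph convolutional networks. The following is a minimal, hypothetical PyTorch sketch of one round of message passing over scene graph nodes; it is not the authors' implementation, and the feature shapes, the single linear transform, and the symmetric normalization are all illustrative assumptions.

# A minimal sketch of GCN-style neighborhood aggregation over scene graph
# nodes, assuming (N, dim) node features and an (N, N) binary adjacency
# matrix produced by a scene graph generator. This illustrates the general
# technique the abstract describes, not the authors' method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneGraphAggregation(nn.Module):
    """One round of message passing that refines each object's features
    with information from its scene graph neighbors."""

    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Add self-loops so each node also retains its own information.
        adj = adj + torch.eye(adj.size(0), device=adj.device)
        # Symmetric degree normalization, D^{-1/2} A D^{-1/2}.
        deg_inv_sqrt = adj.sum(dim=1).clamp(min=1.0).pow(-0.5)
        norm_adj = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]
        # Aggregate neighbor features, then transform and apply a nonlinearity.
        return F.relu(self.transform(norm_adj @ nodes))

# Example: refine features for four detected objects connected in a chain.
feats = torch.randn(4, 256)
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
refined = SceneGraphAggregation(256)(feats, adj)  # shape: (4, 256)

Stacking several such rounds would let information flow along longer relation paths in the graph, which is one common way to realize the intramodal semantic inference the abstract motivates.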



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 5 (May 2023), 653 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3596451


Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 9 May 2023
      • Online AM: 14 September 2022
      • Accepted: 8 September 2022
      • Revised: 15 August 2022
      • Received: 6 April 2022
