
Interactive Re-ranking via Object Entropy-Guided Question Answering for Cross-Modal Image Retrieval

Published: 04 March 2022

Abstract

Cross-modal image-retrieval methods retrieve desired images from a query text by learning relationships between texts and images. Because users only need to type a short text, this retrieval style makes query preparation easy. Recent cross-modal image-retrieval methods are convenient and accurate when the query text uniquely identifies the desired image. In practice, however, users frequently input ambiguous query texts, and such queries make it difficult to retrieve the desired images. To overcome this difficulty, we propose a novel interactive cross-modal image-retrieval method based on question answering. The proposed method analyzes the candidate images and asks users questions whose answers narrow down the retrieval candidates. By simply answering the generated questions, users can reach their desired images even from an ambiguous query text. Experimental results show the effectiveness of the proposed method.
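The object entropy-guided question selection suggested by the title can be sketched as follows. This is an illustrative reconstruction under assumed inputs (per-image object-label lists from an off-the-shelf detector), not the authors' implementation; the helper names and example labels are hypothetical.

```python
# Illustrative sketch: pick which object to ask about by the binary entropy
# of its presence/absence across the current candidate images.
from collections import Counter
from math import log2


def object_entropy(candidates):
    """Entropy of each object's presence over the candidate images.

    `candidates` is a list of per-image object-label lists (labels assumed
    unique within an image). An object appearing in about half of the
    candidates carries the most information: asking about it roughly halves
    the candidate pool whichever way the user answers.
    """
    n = len(candidates)
    counts = Counter(obj for objs in candidates for obj in objs)
    entropy = {}
    for obj, k in counts.items():
        p = k / n
        entropy[obj] = 0.0 if p == 1.0 else -(p * log2(p) + (1 - p) * log2(1 - p))
    return entropy


def next_question(candidates):
    """Return a yes/no question about the most informative object."""
    ent = object_entropy(candidates)
    obj = max(ent, key=ent.get)
    return f"Does the desired image contain a {obj}?", obj


def filter_candidates(candidates, obj, answer):
    """Keep only the candidates consistent with the user's yes/no answer."""
    return [objs for objs in candidates if (obj in objs) == answer]
```

Iterating question, answer, and filtering steps would then re-rank or shrink the candidate set until the desired image surfaces, which matches the interaction loop the abstract describes.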



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3 (August 2022), 478 pages.
  ISSN: 1551-6857, EISSN: 1551-6865, Issue DOI: 10.1145/3505208


  Publisher

  Association for Computing Machinery, New York, NY, United States

  Publication History

  • Received: 1 April 2021
  • Revised: 1 July 2021
  • Accepted: 1 September 2021
  • Published: 4 March 2022


          Qualifiers

          • research-article
          • Refereed
