Abstract
Cross-modal image-retrieval methods retrieve desired images from a query text by learning relationships between texts and images. This approach is among the most effective at making queries easy to prepare. Recent cross-modal image-retrieval methods are convenient and accurate when the query text uniquely identifies the desired image. In practice, however, users frequently input ambiguous query texts, which make it difficult to obtain the desired images. To overcome this difficulty, we propose a novel interactive cross-modal image-retrieval method based on question answering. The proposed method analyzes the candidate images and asks the user questions whose answers narrow down the retrieval candidates. Simply by answering these questions, users can reach their desired images even when they start from an ambiguous query text. Experimental results show the effectiveness of the proposed method.
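The abstract does not say how questions are chosen, so the following is only a rough sketch of the general idea of narrowing candidates by asking questions, not the authors' implementation. It assumes that each candidate image comes with a set of detected object labels and that the user answers yes/no questions about whether an object appears in the desired image. The function names (`binary_entropy`, `pick_question_object`, `narrow`) and the toy data are hypothetical. The sketch asks about the object whose presence splits the candidates most evenly, that is, the one with the highest binary entropy.

```python
import math
from typing import Dict, Optional, Set


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no outcome that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def pick_question_object(candidates: Dict[str, Set[str]]) -> Optional[str]:
    """Pick the object label whose presence is most uncertain across candidates.

    `candidates` maps an image id to its (assumed) set of detected object labels.
    Returns None when there is nothing to ask about.
    """
    n = len(candidates)
    # Sort labels so that ties are broken deterministically (alphabetically).
    labels = sorted(set().union(*candidates.values()))

    def score(label: str) -> float:
        present = sum(label in objs for objs in candidates.values())
        return binary_entropy(present / n)

    return max(labels, key=score, default=None)


def narrow(candidates: Dict[str, Set[str]], label: str, answer_yes: bool) -> Dict[str, Set[str]]:
    """Keep only the candidates consistent with the user's yes/no answer."""
    return {img: objs for img, objs in candidates.items() if (label in objs) == answer_yes}


# Toy candidates returned for an ambiguous query such as "a dog".
candidates = {
    "img_a": {"dog", "ball"},
    "img_b": {"dog"},
    "img_c": {"dog", "car"},
    "img_d": {"dog", "ball", "car"},
}

question_object = pick_question_object(candidates)  # "dog" is in every image, so it is uninformative
remaining = narrow(candidates, question_object, answer_yes=True)
```

In this toy case every image contains a dog, so asking about it would not eliminate anything. An object present in half of the candidates removes half of them whichever way the user answers.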
Interactive Re-ranking via Object Entropy-Guided Question Answering for Cross-Modal Image Retrieval