Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Image search stands as a pivotal task in multimedia and computer vision, finding applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems operate by accepting textual or visual queries and retrieving the top-relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in terms of recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.


INTRODUCTION
Image search is a fundamental task in multimedia and computer vision, with applications spanning internet search [55], e-commerce product search [41], and medical research [18,93]. In a standard image-search system, a textual or visual query is provided as input, and the system computes similarity with images in the database to retrieve the top candidates. This task, also referred to as text-to-image and image-to-image retrieval [13,54,76], plays a crucial role in various domains by facilitating efficient and effective access to visual information.
Over the years, a number of methods have been proposed to address the retrieval task [30,47,49,75]. Existing approaches typically involve a single turn of the retrieval procedure: providing a text-based query to the retrieval model and directly obtaining the final retrieval results. However, this approach can be limited by challenges like vocabulary mismatch (e.g., different wordings for the same concepts) and the semantic gap (e.g., difficulty bridging the gap between text and the information it represents) [74].
Traditional information retrieval tackles these challenges using, among other techniques, Pseudo-Relevance Feedback (PRF) [3,64,65,91]. PRF operates under the assumption that some initially retrieved documents, despite keyword mismatches, are relevant to the user's intent (referred to as "pseudo-relevant"). It then extracts terms from these documents to enrich the original query, enhancing its representativeness and potentially improving recall (the ability to find all relevant documents). This method has demonstrated success, even in multimedia retrieval tasks [36,66]. In contrast to text document retrieval, image search encounters unique challenges. The disparity in modalities between the query and images precludes direct term extraction from retrieved images. To address this, a common strategy in applying feedback methods to image search is the vector space method [33,46,65,67,89]. This method iteratively adjusts the query vector based on positive and negative feedback, aiming to move it toward relevant images and away from irrelevant ones within the multidimensional vector space. However, this approach is highly sensitive to retrieval results. Substantial modifications to the original query vector may result in the loss of crucial semantic content, potentially compromising search effectiveness.
In response to the previously identified issues, we propose an interactive image retrieval system capable of presenting candidate images for a user query while continuously refining the query based on user relevance feedback in a multi-turn setting. Figure 1 shows the flowchart of the proposed system. This method incorporates an image captioning model to augment the quality of the input text-based query in the natural language space. With each iteration of our retrieval approach, the query undergoes updates or expansions to generate a more informative text query.
However, in our proposed method, we have observed that certain image descriptions generated by an image captioning model might be inaccurate or lack specificity. When multiple image captions convey similar semantic meanings containing redundant information, it could mislead the retrieval model. In such scenarios, these descriptions introduce noise into the text-based query for the subsequent turn, potentially compromising the quality of subsequent queries. Consequently, the performance of the retrieval model might not improve or even decline at this stage. To address this challenge, we propose the integration of an LLM-based denoiser. This denoiser is designed to refine the text-based query expansion before it is forwarded to the retrieval model for the next iteration.
Current single-turn image retrieval methods are primarily assessed on datasets like MSCOCO [45] and Flickr30k [59], designed for tasks such as object detection, segmentation, key-point detection, and image captioning. These datasets offer only one relevant image for each textual query, diverging from typical image retrieval datasets and posing limitations for our evaluation needs. To effectively evaluate our proposed interactive image retrieval system, we curate a new dataset by adapting the video retrieval dataset MSR-VTT [86] to fit the image retrieval task. This modified dataset provides multiple relevant ground truth images for each query and is meticulously labeled by humans, featuring numerous ambiguous textual queries reflective of real-world application scenarios.
In our experiments, we validate the effectiveness of the proposed interactive image retrieval system alongside baseline methods, including single-turn image retrieval models and vector space based relevance feedback models, using the newly constructed dataset. The experimental findings demonstrate that our proposed system outperforms the baselines by 10% in terms of recall and achieves state-of-the-art performance.
The main contributions of this work are summarized as follows:
• Innovative Interactive Image Retrieval System: We introduce an interactive image retrieval system that overcomes the limitations of existing single-turn methods. This system enables multi-turn interactions and continuous query refinement based on user relevance feedback. The incorporation of an image captioning model enhances the quality of text-based queries in natural language space, providing progressively informative queries with each iteration of the retrieval approach.
• LLM-based Denoiser: We integrate an LLM-based denoiser that refines the text-based query expansion before it is forwarded to the retrieval model, mitigating inaccurate or redundant image descriptions produced by the captioning model.
• New Evaluation Dataset: We curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, providing multiple relevant ground truth images for each query.
• Comprehensive Experiments: We conduct comprehensive experiments to validate the effectiveness of the proposed interactive image retrieval system. The system is evaluated against baseline methods, which include single-turn image retrieval models and vector space-based relevance feedback models, utilizing the newly constructed dataset. The experimental results show a 10% improvement in recall after 6 interaction turns over the baselines, achieving state-of-the-art performance.

RELATED WORK
In this section, we present related works on image retrieval, interactive information retrieval methods, and query updating.

Image Retrieval
Image retrieval has emerged as a focal point of interest within the computer vision and information retrieval communities, finding widespread applications in various domains including e-commerce product search [41], face recognition [58,71], image geolocalization [17], and medical image research [18,93]. One particularly intriguing aspect is cross-modal image retrieval, where queries and retrieval objects belong to different modalities. Examples include text-to-image retrieval [37,81], cross-view image retrieval [44], and event detection [31,79].
Traditional methods for text-to-image retrieval have typically relied on convolutional neural networks (CNNs) as encoders to independently represent images and textual content [11,55,60]. However, recent years have witnessed a surge in the adoption of transformer-based models and large-scale language-image pre-training, resulting in significant advancements [47,48,62]. These models have achieved state-of-the-art performance across various text-to-image benchmark tasks. Nevertheless, existing methods are primarily tailored for single-image retrieval and are evaluated on datasets like MSCOCO [45] and Flickr30k [59], which provide only one relevant ground truth image per textual query. The potential applications of these large-scale pre-trained VLMs in interactive image search tasks remain largely unexplored.
In this study, we focus on text-to-image retrieval, albeit within a human-machine interactive setting, aiming to explore the effectiveness of such models in facilitating interactive image search.

Figure 1: Flowchart of the proposed multi-turn interactive image retrieval approach based on relevance feedback. This diagram illustrates an example of the initial interaction followed by query expansion and query refinement. Stages shown: Image Retrieval (VLM, CLIP/BLIP2), Relevance Feedback, Captioning, and Query Editing (LLM). Example shown: initial query "Some kids are getting to ride in a fire truck."; caption "A child wearing a firefighter hat is standing next to a fire truck."; new query "Some kids wearing firefighter hats stand next to a fire truck and try to get in."

Interactive Information Retrieval
Interactive search has been instrumental in facilitating efficient access to document collections [2,27,32,39], evolving into an indispensable tool for multimedia researchers, especially in the early stages of content-based image and video retrieval [26,68]. User-computer interaction can manifest in various formats, including relative attributes [35,57], direct attributes [1,16,94], attribute-like modification text [78], and natural language [14,15]. Among these, feedback techniques such as relevance feedback [27,65], pseudo-relevance feedback [4,85], and implicit feedback [73] are commonly used. Interactive image search aims to integrate user feedback as an interactive signal for navigating visual search, with feedback techniques widely studied and generally proven effective in enhancing retrieval accuracy [4,64,65].
Numerous approaches have been proposed to enhance interactive image retrieval by integrating user feedback into the search query. Early efforts in interactive information retrieval focused on query expansion techniques based on PRF [3,64,65,91]. In PRF-based methods, a set of retrieved documents from the initial query is treated as pseudo-relevant. The initial query is then expanded using the top-$k$ pseudo-relevant documents, and this expanded query is used for subsequent retrieval turns. PRF-based approaches offer practical advantages as they do not rely on constructing domain-specific knowledge bases and exhibit versatility across diverse corpora [28]. While query expansion through PRF has shown promise in improving the recall of the retrieval system, its effectiveness is constrained by the quality of the top-$k$ pseudo-relevant documents. Non-relevant results within the feedback set introduce noise that may negatively impact retrieval quality. Additionally, image-based documents cannot directly expand textual queries for image retrieval tasks.
Another approach in interactive image retrieval methods is based on the vector space model [80]. In this model, the query vector is directly adjusted based on user feedback. The Rocchio method [65] is a commonly employed feedback technique within vector space models. This method involves updating the query vector using both relevant and non-relevant documents. While vector space models can be applied to interactive image retrieval, directly modifying the query vector can significantly alter the hidden semantic information contained in the high-dimensional vector space. Alternatively, other relevance feedback-based vector space models [33,35,53,89] train classifiers, such as linear Support Vector Machines, on images labeled through user relevance feedback. These approaches leave the query vector itself unmodified. However, the linear classifier trained on a small amount of data in each interaction turn may lack generalization.
In our work, we primarily focus on relevance feedback-based image retrieval, where users are presented with a ranked list of candidate images in each interaction turn and are asked to assess their relevance. Specifically, we leverage a robust pre-trained VLM to extract features from relevant images, convert them into textual descriptions, and refine the query in natural language.

LLM-based Query Editing
Existing query expansion models rely on pseudo-relevance feedback to enhance the effectiveness of initial retrieval. However, these models encounter challenges when the initial results lack relevance. In contrast, the authors of [50] introduce Generative Relevance Feedback (GRF). This approach constructs probabilistic feedback models using long-form text generated from LLMs. The authors demonstrate in [50] that GRF methods surpass the performance of traditional PRF methods.
In [29], the authors introduce a method for query expansion that capitalizes on the generative capabilities of LLMs. Diverging from traditional methods like PRF, which depend on retrieving a set of pseudo-relevant documents to expand queries, the authors of [29] exploit the creative and generative potential of an LLM while tapping into its intrinsic knowledge. The study encompasses a range of prompts, including zero-shot, few-shot, and Chain-of-Thought (CoT). Notably, the authors observe in [29] that CoT prompts prove particularly effective for query expansion. These prompts guide the model to systematically break down queries, yielding an extensive array of terms closely linked to the original query.
In [52], the authors introduce EdiRCS, a novel text editing-based conversational query rewriting model designed specifically for conversational search scenarios. EdiRCS adopts a non-autoregressive approach where most rewrite tokens are drawn directly from the ongoing dialogue, minimizing the need for additional token generation. This design choice enhances the efficiency of EdiRCS significantly. Notably, the learning process of EdiRCS is enriched with two search-oriented objectives: contrastive ranking augmentation and contextualization knowledge transfer. These objectives are instrumental in improving EdiRCS's ability to select and generate tokens that are highly relevant from a retrieval perspective.
Understanding users' contextual search intent accurately is a significant challenge in conversational search, given the diverse and long-tailed nature of conversational search sessions. In [51], the authors introduce LLMs4CS, a straightforward yet effective prompting framework designed to harness the power of LLMs as text-based search intent interpreters for conversational search. Within this framework, they explore three prompting methods to generate multiple query rewrites and hypothetical responses, and propose aggregating these outputs into an integrated representation capable of robustly capturing the user's true contextual search intent.
Query rewriting is vital for enhancing conversational search by converting context-dependent user queries into standalone forms. In [87], the authors advocate for employing LLMs as query rewriters to generate informative query rewrites with carefully crafted instructions. They establish four essential properties for well-formed rewrites and emphasize their integration into the instructions. In addition, they introduce the concept of rewrite editors for LLMs in a "rewrite-then-edit" process, particularly when initial query rewrites are available. Finally, they propose condensing the query rewriting capabilities of LLMs into smaller models to reduce latency in the rewriting process. Different from previous studies, in this paper, we jointly utilize the capability of both the VLM and the LLM to extract information from the multimodal interactive search context and gradually improve the search query.
In this study, we present a relevance feedback-based interactive image retrieval system that integrates an image captioning model to enhance the quality of text-based queries in natural language, thereby generating increasingly informative queries with each iteration of the retrieval process.

METHODOLOGY
In this section, we present our innovative interactive image retrieval system and discuss the accompanying LLM-based query reformulation method. The proposed image retrieval approach unfolds as an iterative process with three key steps. First, the Image Retrieval step utilizes a pre-trained cross-modal dense retrieval model. Second, an Artificial Actor, following Zahálka et al. [88], emulates a real user by providing relevance feedback and evaluating the retrieval results. Third, the Query Expansion step employs a VLM to expand the query based on the relevance feedback from the Artificial Actor [88]. The workflow of the proposed approach operates in a multi-turn setting, encompassing image retrieval, relevance feedback, and query expansion. Subsequently, we introduce our approach that employs an LLM-based method to denoise expanded queries, thereby enhancing the quality of the expanded queries.

Image Retrieval
In this study, we concentrate on the task of interactive text-to-image retrieval. Given a textual query $q$ and an image collection $\mathcal{I}$, the image retrieval system returns a ranked list of images $\mathcal{L}$. Here, $l_i$ represents the $i$-th ranked image in the list. The system's objective is to retrieve as many relevant images as possible from the top-$k$ ranked images $\mathcal{L}_k$. We focus solely on the top-$k$ retrieval results, considering that users typically review only the top search results [38,63]. We employ a pre-trained VLM like CLIP [62] to extract the visual embedding $v_i$ of each image in the dataset. The textual query is embedded as $t_q$. During retrieval, we extract $t_q$ for the given query text and compute the cosine similarity with the visual vector $v_i$ of all images in the candidate pool to identify the top-$k$ similar images.
Our research aims to explore how to iteratively use the relevance feedback and the top-$k$ retrieved images $\mathcal{L}_k$ to rerank the next $k$ unseen images in the original ranked list: $\mathcal{U} = \{l_{tk+1}, \ldots, l_{tk+k}\}$.
Here, $t$ denotes the number of interactive turns, and $\mathcal{U}$ represents the unseen images to be ranked.
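The retrieval step described above can be illustrated with a minimal sketch. It assumes that query and image embeddings have already been computed by some encoder and are available as NumPy arrays; the function name `top_k_unseen` and the `seen` argument are illustrative choices and not part of the released code.

```python
import numpy as np

def top_k_unseen(query_emb, image_embs, k, seen=frozenset()):
    """Return indices of the k unseen images most similar to the query.

    query_emb:  (d,) query embedding (assumed precomputed).
    image_embs: (n, d) image embeddings (assumed precomputed).
    seen:       indices already shown in earlier turns (hypothetical bookkeeping).
    """
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q                              # cosine similarity per image
    order = np.argsort(-sims, kind="stable")     # most similar first
    ranked = [int(i) for i in order if int(i) not in seen]
    return ranked[:k]
```

Passing the indices shown in earlier turns through `seen` mirrors the setting in which each turn ranks only images the user has not yet reviewed.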

Relevance Feedback
In this study, we employ the concept of an Artificial Actor [88] to assess the relevance of the feedback images. The artificial actor is a computational agent capable of interacting with the evaluation method and simulating user behavior. The construction of artificial actors is based on three fundamental principles: analytic categories, evolving notions of relevance, and limited time. Analytic categorization refers to the task of classifying individual items into categories defined by the analyst (user). The actors adapt their categories of relevance over time, thereby modeling the dynamic nature of insight. The artificial actor aims to emulate the user's behavior during the analysis of a collection and the development of insight over time. In other words, the user's needs, intent, and the notion of relevance evolve over the course of the analysis.
In our task, we use only a simplified version of the artificial actor. Given a textual query $q$ and the retrieved top-$k$ images $\mathcal{L}_k$ from the feedback set, we use the ground truth labels from the dataset to make a binary relevance judgment for each image in $\mathcal{L}_k$ and randomly select $m$ relevant images to form the final relevant image set $\mathcal{L}_k^{r}$. The reason for this design is that the artificial actor in our experiment is not required to assign images to a category set; its only task is to determine whether an image is relevant or irrelevant to a given textual query. The random selection simulates an artificial actor whose notion of relevance changes over time. Moreover, the time efficiency of the retrieval system is not our research topic, so this characteristic of the artificial actor is not considered.
We operate under the assumption that a real user can always accurately determine the relevance of a candidate image to the given query. Therefore, the use of ground truth labels from the dataset annotation, combined with a random sampling strategy, can effectively simulate the user's real behavior. Importantly, the ground truth of the dataset label remains unseen to the pre-trained dense retriever, thereby eliminating any potential issues related to information leakage.
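A minimal sketch of this simplified artificial actor is given below. It assumes that images are identified by hashable ids and that the ground truth for a query is available as a set of ids; the function name and the optional `rng` argument are hypothetical and serve only to make the sampling reproducible.

```python
import random

def simulate_feedback(shown_ids, ground_truth_ids, m, rng=None):
    """Pick at most m relevant images among those shown in the current turn.

    Relevance is a binary lookup in the ground truth set; the random choice
    among relevant images stands in for the user's changing notion of relevance.
    """
    rng = rng or random.Random()
    relevant = [i for i in shown_ids if i in ground_truth_ids]
    if len(relevant) <= m:
        return relevant
    return rng.sample(relevant, m)
```

If no shown image is relevant, the function returns an empty list, in which case no captions are available to expand the query for the next turn.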

Query Expansion
We define the query expansion problem as follows: Given a textual query $q$, our goal is to generate an expanded query $q'$ that incorporates additional information absent from the original query. Specifically, we investigate the application of state-of-the-art VLMs to generate captions for the images in the relevant image set $\mathcal{L}_k^{r}$. Formally, this process can be represented as:

$$q' = \mathrm{Concat}\big(q, \mathrm{VLM}(\mathrm{Prompt}, \mathcal{L}_k^{r})\big)$$

where $\mathrm{Concat}$ means the string concatenation operation, $q$ is the original query, the $\mathrm{VLM}$ is a pre-trained VLM-based image captioner, and $\mathrm{Prompt}$ is the prompt given to the captioner. Image captioning is a fundamental task in multimedia modeling. Recent models predominantly rely on large-scale VLMs [8,22,43,70]. Current VLMs excel at integrating information from both modalities and have demonstrated impressive zero-shot performance for numerous downstream vision applications. In this study, we employ the state-of-the-art VLM, InstructBLIP [7], to generate captions for relevant images in a zero-shot setting. The instruction prompt used for image captioning is "<Image> A short image caption:". This prompt guides InstructBLIP to generate a concise image description sentence with fewer than 100 tokens. The rationale for generating short image descriptions is to circumvent the issue of hallucination [19,42,90] in the VLM-based caption model. A brief image caption introduces less redundant information to the query, thereby enhancing the efficiency of the retrieval process.
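The expansion step can be sketched as follows. The captioner is passed in as a plain callable so that the example stays self-contained; how InstructBLIP is actually invoked is not shown here, and the single-space separator used for concatenation is an assumption.

```python
from typing import Callable, Iterable

# Captioning prompt quoted in the text above.
CAPTION_PROMPT = "<Image> A short image caption:"

def expand_query(query: str,
                 relevant_images: Iterable[object],
                 captioner: Callable[[object, str], str]) -> str:
    """Concatenate the query with one caption per relevant image.

    `captioner(image, prompt)` is a hypothetical stand-in for the VLM call.
    """
    captions = [captioner(img, CAPTION_PROMPT).strip() for img in relevant_images]
    return " ".join([query] + [c for c in captions if c])
```

With the example of Figure 1, the query "Some kids are getting to ride in a fire truck." and a single caption "A child wearing a firefighter hat is standing next to a fire truck." yield the two sentences joined by a space.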

LLM-based Query Editing.
Following each interactive image retrieval turn, a given textual query is concatenated with the captions of the images from the relevant image set $\mathcal{L}_k^{r}$, provided there is at least one relevant image among the top-$k$ retrieved images $\mathcal{L}_k$. Ideally, the length of the query increases with each interaction. However, an over-expanded query can negatively impact retrieval performance, as multiple captions with similar or repetitive content can mislead the retrieval model. Additionally, the text encoder of the retrieval model has a limit on input length, leading to the truncation of overlong queries. To address these issues, we explore the use of LLMs to denoise the redundant information in the expanded query and further enhance query quality. We assume that the original query $q$ is expanded in every interaction turn, resulting in the expanded query $C_t = q, c_1, \ldots, c_t$ for the current turn $t$.

Prompting Method.
The prompt in our study comprises three components: instruction, demonstration, and input. The instruction defines the specific generation subtask for the query. The demonstration provides in-context examples. The input consists of the original query and the expanded image captions. Specifically, the input is $C_t$, i.e., the concatenation of the original query and all the captions within turn $t$. We investigate how LLMs can generate modifications of the expanded query $C_t$. To this end, we design and explore three prompting methods to guide the LLMs in conducting three query-specific generation subtasks: Query Summary, Keywords Summary, and Chain of Thought (CoT) generation. Examples of prompts for these three subtasks and the corresponding results generated by the LLMs can be found in Table 1.
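The three-part prompt structure can be sketched as a simple string template. The "Input:" and "Output:" markers and the function name are illustrative assumptions; the actual prompt wording is the one listed in Table 1.

```python
def build_prompt(instruction, demonstrations, expanded_query):
    """Assemble instruction, in-context demonstrations, and the current input.

    demonstrations: list of (example_input, example_output) pairs.
    The Input/Output layout is a hypothetical format, not the paper's template.
    """
    parts = [instruction.strip()]
    for demo_input, demo_output in demonstrations:
        parts.append(f"Input: {demo_input}\nOutput: {demo_output}")
    # The final block is left open so that the LLM completes the output.
    parts.append(f"Input: {expanded_query}\nOutput:")
    return "\n\n".join(parts)
```

Switching between the three subtasks then amounts to changing the instruction and the demonstrations while the input stays the same expanded query.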
Query Summary. In this work, we leverage the robust automatic summarization capabilities of LLMs [34,82,92] to reformulate the expanded query in an in-context learning setting. With this approach, we treat LLMs as proficient query rewriters and prompt them to generate more concise and clear queries. Detailed templates for this prompting method and generation results can be found in the "summary" section of Table 1. Despite its simplicity, as demonstrated in Section 4.4, this straightforward prompting method has proven to deliver strong search performance, outperforming existing baselines.
Keywords. Keywords form another crucial element within the query. The task of keyword extraction closely aligns with Query Facet Extraction. Implementing query facet extraction in a web search system can aid in refining and specifying the original query, exploring various subtopics, and diversifying the search results [9,56,69]. To leverage the keywords or facets contained in the query, we direct the LLMs to generate a list of significant words or phrases present in the expanded query. The "keywords" section of Table 1 provides the application details. In the subsequent retrieval turn, we concatenate the generated keywords with the prefix "Video or image of" and output the new query. For instance, "video or image of kids, firefighter, hat,... ". This format closely resembles the prompt of the classification task of the CLIP model [61].
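The keyword-based query format can be illustrated with a short helper. The prefix is the one quoted above; the comma separator follows the example in the text, while the case-insensitive removal of duplicate keywords is an added assumption.

```python
def keywords_to_query(keywords, prefix="Video or image of"):
    """Join LLM-generated keywords into a CLIP-style query string."""
    seen, unique = set(), []
    for kw in keywords:
        cleaned = kw.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:   # drop empty and repeated keywords (assumption)
            seen.add(key)
            unique.append(cleaned)
    return f"{prefix} {', '.join(unique)}"
```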
CoT Summary. CoT [84] or basic question answering [20-25] empowers LLMs to break down a reasoning task into multiple intermediate steps, thereby enhancing the reasoning abilities of LLMs. Recent research [82] has explored the use of CoT to guide LLMs in generating summaries in a step-by-step manner. This approach guides LLMs to extract the four most crucial elements, namely Entity, Date, Event, and Result, from a standardized news text and subsequently generate a summary. In this study, we also examine whether using CoT reasoning can further enhance the quality of the query. However, the text we process is open-domain image captions, which lack the standard article structure found in news articles. Thus, we do not predefine the category of keywords that need to be extracted from the expanded query.
More specifically, as illustrated in the "CoT Summary" section of Table 1, we manually design a prompt that first guides LLMs to extract the keywords, topics, and tags from the expanded query. Then, by using the extracted keywords, a new query is generated. The extraction of keywords from multiple relevant image captions can filter out duplicated content without losing key information. Regenerating the query based on the keywords can restore the semantic structure of the query sentence. As demonstrated in Section 4.4, our proposed CoT method significantly improves the quality of the query.

EXPERIMENTS AND ANALYSIS
In this section, we provide a detailed overview of the experimental settings. The evaluation of our proposed interactive image retrieval approach is conducted on our proposed modified version of the MSR-VTT video search dataset [86]. Subsequently, we analyze the effectiveness of query expansion using relevant image captions and LLM-based query editing. The code and models are released at https://github.com/s04240051/Multimodal-Conversational-Retrieval-System.git

Dataset and Evaluation Metric
Dataset. The MSR-VTT (Microsoft Research Video to Text) dataset [86] is a comprehensive, large-scale collection curated for video description tasks. It comprises 7,180 videos, collected using a commercial video search engine by retrieving 118 videos per query for 257 popular queries. From this dataset, 10,000 video clips were selected, amounting to 41.2 hours, along with 200,000 clip-sentence pairs. These pairs, covering a wide range of visual content, serve as a valuable resource for advancing video-to-text methodologies and enhancing our understanding of video content through natural language. Each video clip is annotated with approximately 20 natural sentences by a team of 1,327 Amazon MTurk workers.
We have transformed the MSR-VTT dataset into a text-to-image or text-to-frame retrieval dataset, which provides multiple ground truth relevant images for a given query. For each clip, we extract one keyframe per second. Given that the length of each clip ranges between 10 and 20 seconds, we randomly select 10 frames as the ground truth relevant images. As each clip has 20 textual captions, they are treated separately as distinct data samples, each mapping to the same 10 ground truth relevant images in that clip. We utilize the entire clip set for the experiment, resulting in 200,000 data samples (queries) and 100,000 images in the entire collection.
In retrieval, the proposed system aims to retrieve as many relevant images as possible for a given query. Therefore, the use of video datasets enables the retrieval of multiple relevant images for a single query. Moreover, the frames distributed in a short clip generally contain the same scene and topic but from different camera perspectives. Each clip also has 20 unique captions that describe the clip in different ways or about different time slices. Each data sample consists of one textual query and 10 images. This setting allows the search query to closely resemble real human language, which is typically incomplete and ambiguous.
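The conversion of one clip into data samples can be sketched as below. Keyframe extraction itself is omitted: the sketch assumes the one-per-second keyframes are already available and only indexes them. The identifier format and function name are hypothetical.

```python
import random

def build_samples(clip_id, num_keyframes, captions, frames_per_clip=10, rng=None):
    """Turn one clip into one data sample per caption.

    num_keyframes: number of 1-fps keyframes of the clip (assumed precomputed,
                   at least frames_per_clip).
    Every caption of the clip maps to the same randomly chosen frames.
    """
    rng = rng or random.Random()
    frame_idx = sorted(rng.sample(range(num_keyframes), frames_per_clip))
    images = [f"{clip_id}_frame{idx:02d}" for idx in frame_idx]  # hypothetical ids
    return [{"query": caption, "relevant_images": images} for caption in captions]
```

Applied to 10,000 clips with 20 captions and 10 frames each, this scheme gives 10,000 × 20 = 200,000 queries and 10,000 × 10 = 100,000 images, matching the collection size stated above.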
Evaluation Metric. In line with existing work [5], we employ accumulated recall metrics to evaluate the performance of our proposed multi-turn interactive image retrieval system. In each interaction turn, we display and rank 20 images, conducting a total of 6 turns. Images that have been "seen" in previous turns are not retrieved again. Given a query, the accumulated recall of the multi-turn search system is calculated as the number of relevant images found in all previous turns divided by the total number of ground truth relevant images for that query. For the entire dataset, the average accumulated recall $R_T$ of the system by turn $T$ is calculated as follows:

$$R_T = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{t=0}^{T} |\mathcal{L}^{r}_{i,t}|}{G_{q_i}}$$

where $N$ is the number of data samples (queries), $|\mathcal{L}^{r}_{i,t}|$ means the number of relevant images in turn $t$ for the data sample $i$, and $G_{q_i}$ means the number of ground truth relevant images of a given query $q_i$. For the single-turn image retrieval, we use Recall@20, Recall@40, Recall@60, Recall@80, Recall@100, and Recall@120, which are comparable to the multi-turn image retrieval evaluation.
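As a worked illustration of the accumulated recall, the following sketch computes the metric from per-turn counts of newly found relevant images. The input layout (one list of counts per query) is an assumption made for the example.

```python
def accumulated_recall(found_per_turn, num_ground_truth):
    """Average accumulated recall after each turn.

    found_per_turn[i][t]: relevant images newly found for query i in turn t.
    num_ground_truth[i]:  number of ground truth relevant images of query i.
    Returns one value per turn, averaged over all queries.
    """
    n = len(found_per_turn)
    num_turns = len(found_per_turn[0])
    recalls = []
    for turn in range(num_turns):
        total = sum(sum(found[: turn + 1]) / gt
                    for found, gt in zip(found_per_turn, num_ground_truth))
        recalls.append(total / n)
    return recalls
```

For two queries with 10 ground truth images each and counts [2, 3, 0] and [5, 0, 5], the accumulated recall is 0.35 after the first turn, 0.5 after the second, and 0.75 after the third.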

Experimental Settings
In our experiment, we employ the pre-trained CLIP model ("ViT-L/14@336px") [61] to encode both textual queries and images. Additionally, we use the BLIP-2 [40] retriever model, which utilizes a pre-trained "ViT-L/14@336px" model to extract embeddings. For image captioning, we use the InstructBLIP [7] model, which leverages a pre-trained language model, vicuna-7b [95], to generate captions and a ViT-g/14 [12] model to extract features from the image. The prompt used for image captioning is "<Image> A short image caption:", which guides the model to generate a sentence of fewer than 100 tokens, excluding special symbols. The primary LLM used for query editing is the pre-trained Llama2-7b model [77], which is instruction-tuned on a dialogue dataset. In each interaction turn, we only display the top $k = 20$ ranked images to the user (artificial actor) for relevance judgment. This approach is based on research [38] indicating that users typically focus only on the top search results during web searches. For image searches, users may have the patience to review more results, but 20 is generally the limit. Given a search session, we conduct only five turns of interactive searches, regardless of whether the system finds relevant images. This is because users typically lack the patience to provide more turns of relevance feedback, as suggested by Wang and Ai [83].
We expand the query with captions from at most two relevant images after the user's relevance judgment to improve the system's efficiency and prevent the query from containing too much redundant information.Regarding the procedure of interaction turns, turn 0 involves retrieval with the original query, turn 1 only conducts query expansion with image captions, and LLM-based query editing is applied only after turn 1.
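The turn-by-turn procedure described above can be sketched as a small control loop. This is only an illustrative reading of the text, not the authors' implementation: the retriever, relevance judge, captioner, and LLM editor are passed in as callables, and every name here (`interactive_search`, `toy_retrieve`, `toy_edit`, the toy corpus) is hypothetical. The sketch also assumes that captions are appended to the current query and that the editor receives the expanded query as plain text, which the text does not specify.

```python
from typing import Callable, Dict, List, Set


def interactive_search(
    query: str,
    retrieve: Callable[[str, Set[str], int], List[str]],
    is_relevant: Callable[[str], bool],
    caption: Callable[[str], str],
    edit_query: Callable[[str], str],
    num_turns: int = 6,
    top_k: int = 20,
    max_captions: int = 2,
) -> List[List[str]]:
    """Return the image ids shown at each turn of one search session."""
    seen: Set[str] = set()
    shown_per_turn: List[List[str]] = []
    current = query
    for turn in range(num_turns):
        # Previously shown images are excluded from later turns.
        shown = retrieve(current, seen, top_k)
        seen.update(shown)
        shown_per_turn.append(shown)
        # Use captions from at most `max_captions` relevant images.
        picked = [img for img in shown if is_relevant(img)][:max_captions]
        if not picked:
            continue
        expanded = " ".join([current] + [caption(img) for img in picked])
        # Feedback from turn 0 yields an expansion-only query for turn 1;
        # later queries are additionally passed through the editor.
        current = expanded if turn == 0 else edit_query(expanded)
    return shown_per_turn


# Toy stand-ins (assumptions for illustration only): a word-overlap retriever
# and an "editor" that merely removes repeated words.
TOY_CORPUS: Dict[str, str] = {
    "img1": "kids board a bus",
    "img2": "firefighter helps kids",
    "img3": "cat sleeping on sofa",
    "img4": "fire truck and firefighter",
}
TOY_RELEVANT: Set[str] = {"img1", "img2", "img4"}


def toy_retrieve(query: str, seen: Set[str], k: int) -> List[str]:
    words = set(query.lower().split())
    candidates = [i for i in TOY_CORPUS if i not in seen]
    candidates.sort(key=lambda i: (-len(words & set(TOY_CORPUS[i].split())), i))
    return candidates[:k]


def toy_edit(text: str) -> str:
    return " ".join(dict.fromkeys(text.split()))


if __name__ == "__main__":
    turns = interactive_search(
        "kids bus",
        toy_retrieve,
        lambda img: img in TOY_RELEVANT,
        lambda img: TOY_CORPUS[img],
        toy_edit,
        num_turns=3,
        top_k=1,
    )
    print(turns)
```

In a real system, `retrieve` would rank images by embedding similarity, `caption` would call the captioning model, and `edit_query` would prompt the LLM; keeping them as parameters makes it easy to swap retrievers or editors without touching the loop.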

Comparison Methods
We compare the proposed interactive image retrieval model to two types of models: a single-turn VLM and a multi-turn vector space model.
CLIP [61]: This is a widely used Vision-Language Model (VLM)-based single-turn dense retriever. We employ the pre-trained CLIP model, which incorporates a "ViT-L/14@336px" image encoder and a BERT-based text encoder [10].
Rocchio [65]: This is a commonly employed feedback method that modifies the query in the vector space. The concept involves updating a query vector with both relevant and non-relevant documents. The Rocchio method leverages the textual query embedding vectors and image embedding vectors encoded by the corresponding retriever model. We set α = 1, β = 0.75, and γ = 0.15.
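For reference, the standard Rocchio update moves the query vector toward the centroid of the relevant items and away from the centroid of the non-relevant ones. The sketch below uses the three weights reported above (1, 0.75, and 0.15) as defaults for the query, relevant, and non-relevant terms; the function names are hypothetical, plain Python lists stand in for embedding vectors, and whether the baseline re-normalizes the updated vector is not stated, so no normalization is applied here.

```python
from typing import List, Sequence

Vector = List[float]


def _centroid(vectors: Sequence[Vector], dim: int) -> Vector:
    """Mean of the given vectors, or the zero vector if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def rocchio_update(
    query: Vector,
    relevant: Sequence[Vector],
    non_relevant: Sequence[Vector],
    alpha: float = 1.0,
    beta: float = 0.75,
    gamma: float = 0.15,
) -> Vector:
    """Classic Rocchio: alpha * q + beta * mean(relevant) - gamma * mean(non_relevant)."""
    dim = len(query)
    pos = _centroid(relevant, dim)
    neg = _centroid(non_relevant, dim)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]


if __name__ == "__main__":
    # Hypothetical 2-d embeddings: the query and two judged images.
    q = [1.0, 0.0]
    updated = rocchio_update(q, relevant=[[0.0, 1.0]], non_relevant=[[1.0, 0.0]])
    print(updated)
```

Because the update operates purely on embeddings, it cannot add new descriptive words to the query, which is one plausible reading of why the text-space expansion methods behave differently in later turns.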

Main Results
Our system incorporates two VLM-based dense retrieval models, namely CLIP and BLIP-2. As the retrievers form a crucial component of our system, it is essential to compare the performance of these two commonly used cross-modal retrievers, which differ in their structure and pre-training datasets. The impact of different LLMs is discussed in Section 4.5. The overall performance comparisons of these models are presented in Table 2.
Effect of Interactive Search. The interactive search mechanism improves the recall of the single-turn image retrieval system by over 10%. This significant improvement underscores the advantage of our interactive search system over conventional methods.
Effect of Query Expansion. The column labeled "Captions only" in Table 2 demonstrates that a simple query expansion using the captions of relevant images yields a robust search performance, surpassing other methods. This approach even exceeds the results of the in-context learning query summary and closely matches the performance of the keywords summary method, although it is slightly inferior to the CoT summary. When these two methods are compared over additional interaction turns, as depicted in Figure 2, the rate of performance improvement for query expansion with image captions decelerates significantly after the fifth turn. In contrast, the CoT query summary continues to improve. This observation suggests that over-expansion of the original query with excessive image captions can degrade the quality of the query. The LLM-based query editing emerges as an effective denoising tool, capable of eliminating redundant information and refining the expanded query. Moreover, Figure 2 reveals that the Rocchio method achieves peak performance at turn 1 (the initial query expansion turn) but ceases to improve beyond turn 2. This outcome indicates that query expansion based on the vector space model is less effective than methods that perform query expansion within the natural language space.
Effect of Different Retrievers. As indicated in Table 2, the BLIP-2-based retriever exhibits a 1% increase in recall compared to the CLIP-based retriever in the single-turn retrieval task. However, BLIP-2 does not offer a substantial benefit in the multi-turn retrieval task. Given that BLIP-2 comprises more parameters and has lower inference efficiency, we use the CLIP-based retriever in subsequent experiments.

Recall Performance Comparison
Effect of Different Prompting Methods. As evidenced in Table 2, the interactive image retrieval system that utilizes the CoT summary surpasses all other prompting methods after the second turn. The integration of CoT into the prompting methods typically enhances the search performance. This underscores the effectiveness of our CoT in steering the LLM toward an accurate comprehension of both the scene details in the individual keyframe and the plot specifics in the video clip.
Case Study. In Figure 4, we showcase some queries and search results from our most effective model. We present both successful and unsuccessful cases. These examples demonstrate that queries with a well-structured format and notable keywords, such as 'firetruck', tend to yield better retrieval results. However, the database may contain numerous negative cases that closely resemble the truly relevant image. As illustrated in the last case of Figure 4, it becomes extremely difficult to differentiate between negative and positive images when they portray an identical scene from a television program.
Table 2: Overall performance comparison. The column labeled "Single Turn" shows the results of one-turn retrieval using CLIP and BLIP-2. For multi-turn retrieval, we only display the top-k = 20 ranked images to the user (artificial actor) for relevance judgment in each turn. The best scores in each turn, except for turn 0 (where no query expansion or editing methods are applied), are underlined for clarity.

The Impact of Different LLMs
To assess the practical capabilities and constraints of the LLM-based query editing technique, we compare two sets of models of varying sizes, as depicted in Figure 3. We have chosen two sizes of the FLAN-T5 model [6] (Flan-T5-XL and Flan-T5-XXL) and two sizes of the Llama-2 model [77] (Llama-2-7b-chat-hf and Llama-2-13b-chat-hf), with respective model sizes of 3B, 11B, 7B, and 13B. These models are applied to the CoT query summary in our interactive image retrieval system, which employs a CLIP retriever. We refrain from using larger models because of the constraints of our computational resources. Furthermore, excessively large LLMs would considerably diminish the efficiency of the interactive system, rendering it impractical for real-world applications. From Figure 3, we observe that all the tested LLMs exhibit similar performance. The largest LLM, Llama-2-13b-chat-hf, yields a 4% recall improvement over 5 interaction turns compared to the smallest model, Flan-T5-XL. Interestingly, the smallest LLM with a CoT prompt still outperforms the largest LLM when other prompt methods are used. This further attests to the effectiveness of our CoT query summary method.
Good Query: a group of kids as they board a bus, with the help of a friendly firefighter. ready for an exciting adventure
Medium Query: Lady Gaga's powerful music video features a stunning black and white aesthetic, showcasing a woman with her arms outstretched and wearing both a veil and a hijab.
Bad Query: A young boy takes center stage to showcase his talents, while a stern-looking judge wearing sunglasses critiques his performance. Meanwhile, a second man sitting in a chair wearing sunglasses and a black shirt looks on, adding to the pressure of the moment

CONCLUSION AND FUTURE WORK
In summary, this paper presents a novel interactive image retrieval system aimed at overcoming the limitations of traditional single-turn methods. Through the incorporation of multi-turn interactions and continuous refinement of queries based on user relevance feedback, our system significantly improves the quality of text-based queries in natural language space. We introduce an LLM-based denoiser to refine text-based query expansions, effectively addressing inaccuracies and enhancing specificity in image descriptions generated by captioning models. We curated a meticulously designed evaluation dataset, adapted from the MSR-VTT video retrieval dataset, to offer multiple relevant ground truth images for each query, thus mitigating limitations in existing datasets. Through extensive experiments, our proposed system demonstrates a substantial 10% improvement in recall over baseline methods after 6 turns, achieving state-of-the-art performance. These contributions advance the field of interactive image retrieval and provide valuable insights for future research in this domain. Future research could explore integrating user preferences and domain-specific knowledge to further enhance the accuracy of interactive image retrieval systems. Leveraging multi-turn interaction, VLMs, and LLMs presents an innovative approach that holds promise for developing more user-centric and adaptable image retrieval systems.

Figure 2: The trend in accumulated recall of the methods shows a steady increase with the number of interaction turns. However, the line representing the Rocchio method levels off after the second turn, indicating that no further relevant images are retrieved beyond this point. In contrast, the other two methods demonstrate the ability to consistently retrieve new relevant images across successive turns.

Figure 3: The performance evaluation of systems utilizing two variants of LLMs and four different sizes for CoT query summaries.The CLIP model is employed as the image retriever in this setup.

Figure 4: Example queries and search results from the proposed system with CoT query summaries are showcased.Irrelevant images are denoted with a red cross for clarity.

Table 1: A general depiction of the query editing process using LLMs. In the context of a video search dataset, a typical prompt follows this format: "Given the video descriptions, generate {subtask}, {demonstration}, {query}", where demonstration refers to the few-shot examples, which are not included in this table.
Video description: A kid wearing a hat going inside the bus and 2 persons are standing behind them.
Prompt: Given the video descriptions, generate the topic of this video.
Model output: Children wearing hat boarding a fire truck, with the assistance of a firefighter
Keywords Prompt: Given the video descriptions, generate a list of video taggings
Model output: Video of kids, bus, Firefighter, Fire truck, hat.
Prompt: Given the video descriptions, generate a list of keywords and a new description based on generated keywords.
Model output: keywords: kids, bus, Firefighter, Fire truck, hat. A group of kids as they board a firetruck, with the help of a friendly firefighter. ready for an exciting adventure!
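The prompt format quoted in the Table 1 caption can be assembled programmatically. The sketch below is illustrative only: the dictionary keys (`topic`, `keywords`, `keywords_then_description`), the function name, and the newline-based layout are assumptions, while the subtask strings are taken from the prompts listed in the table. The exact layout and few-shot demonstrations used by the authors are not given.

```python
from typing import Dict, Sequence

# Subtask wordings copied from Table 1; the keys are hypothetical labels.
SUBTASKS: Dict[str, str] = {
    "topic": "the topic of this video",
    "keywords": "a list of video taggings",
    "keywords_then_description": (
        "a list of keywords and a new description based on generated keywords"
    ),
}


def build_prompt(
    subtask_key: str,
    descriptions: Sequence[str],
    demonstration: str = "",
) -> str:
    """Fill the 'Given the video descriptions, generate {subtask}' template."""
    instruction = f"Given the video descriptions, generate {SUBTASKS[subtask_key]}."
    parts = [instruction]
    if demonstration:
        # Optional few-shot examples go between the instruction and the query.
        parts.append(demonstration)
    parts.append("\n".join(descriptions))
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        build_prompt(
            "keywords_then_description",
            ["A kid wearing a hat going inside the bus and 2 persons are standing behind them."],
        )
    )
```

Asking for keywords first and a new description second corresponds to the last prompt in the table, where the model output lists the keywords before the rewritten description.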