PhotoScout: Synthesis-Powered Multi-Modal Image Search

Due to the availability of increasingly large amounts of visual data, there is a growing need for tools that can help users find relevant images. While existing tools can perform image retrieval based on similarity or metadata, they fall short in scenarios that necessitate semantic reasoning about the content of the image. This paper explores a new multi-modal image search approach that allows users to conveniently specify and perform semantic image search tasks. With our tool, PhotoScout, the user interactively provides natural language descriptions, positive and negative examples, and object tags to specify their search tasks. Under the hood, PhotoScout is powered by a program synthesis engine that generates visual queries in a domain-specific language and executes the synthesized program to retrieve the desired images. In a study with 25 participants, we observed that PhotoScout allows users to perform image retrieval tasks more accurately and with less manual effort.


INTRODUCTION
With the advancement of camera technologies and the prevalence of social media, photography is more accessible than ever. Nowadays, people increasingly have access to large volumes of photographs, taken by themselves or shared by others, that capture unique moments of their lives. As this volume grows, the task of retrieving relevant images from one's personal library becomes more important yet also more challenging. Modern photo management tools like Google Photos allow the user to search for relevant images based on metadata constraints (e.g., presence of a specific person; the date the photo was taken) or visual similarity to another image or natural language query. While existing interfaces work reasonably well for simple search tasks, they fall short in structured image retrieval tasks: that is, tasks that require semantic reasoning about the structure of objects in an image. Such structured image retrieval tasks are important both for professional photographers and for regular users who increasingly have access to large amounts of visual data on their smartphones and the cloud. For example, event photographers often have a shot list describing certain images that they must deliver to a client, such as those where the bride and groom are walking down the aisle or images containing only the bride and her mother [1,49]. Such structured image search tasks also come up in everyday life for regular users. For example, someone who is writing a travel blog might want to retrieve those images in which they are standing in front of the Eiffel Tower, or someone mourning the loss of a pet might want to find all images in which their cat is sitting on their lap.
As illustrated by these examples, such structured image search tasks require reasoning about contents of the image as well as relationships between them. However, such tasks are not easy to specify using existing image search interfaces. For example, while they provide support for finding images that contain a specific person, they do not facilitate searching for images where that person is performing a certain action or has a certain property. In fact, a key characteristic of
Authors' addresses: Celeste Barnaby, University of Texas at Austin, USA, celestebarnaby@utexas.edu; Qiaochu Chen, University of Texas at Austin, USA, qchen@cs.utexas.edu; Chenglong Wang, Microsoft Research, USA, chenwang@microsoft.com; Işıl Dillig, University of Texas at Austin, USA, isil@cs.utexas.edu.
In this paper, we propose a new user interaction model that facilitates structured image search. In general, these structured image retrieval tasks pose two challenges: First, how can a user effectively communicate their intent to the image search tool? Second, how can the search tool plan and execute the search logic underlying the user's intent?
• User specification challenge: For some image search tasks, it is difficult for users to convey their intent with a single modality. In particular, an example image alone is often too ambiguous to convey complex search logic. On the other hand, natural language (NL) alone also has shortcomings. For instance, even ostensibly simple relational attributes like "next to" or "on top of" can have multiple possible interpretations that are difficult to disambiguate without a visual example.
• System development challenge: Existing text or object-based image search tools are powered by vision-language models [5,36]. Despite their object understanding capabilities, they have limited capability in reasoning about complex search logic involving multiple constraints and semantic relationships between different objects. Hence, even if the user is able to perfectly convey their intent, there are no existing techniques that can be used to execute complex image retrieval queries.
We propose to address the above challenges of structured image retrieval through a novel program synthesis-powered multi-modal image search tool, PhotoScout. With PhotoScout, users can communicate their intent using a combination of natural language, positive and negative example images, and interactive object tagging. Through PhotoScout's multi-modal specification interface, the user can start with an efficient NL description of the task, then iteratively refine the search results by responding to queries posed through this interface. Under the hood, PhotoScout's backend synthesizer generates a programmatic query expressed in a domain-specific language (DSL) for image retrieval. If the user's NL description is ambiguous, the generated program will be incomplete, allowing PhotoScout to ask clarifying questions to the user in a goal-directed way. Once all ambiguities are resolved through user interaction and PhotoScout generates a complete program, the resulting query is executed on all uploaded images and search results are displayed to the user. At that point, the user can inspect the search results and further refine them if needed.
To assess the efficacy of PhotoScout compared to alternatives, we have conducted a user study involving 25 participants. We find that, compared with a baseline image search tool (leveraging a state-of-the-art vision model), users see a 34% increase in the F1 score of their search results when using PhotoScout. Further, in post-study interviews, users report that they are better able to convey their intent via PhotoScout's multi-modal specification interface and have more trust that PhotoScout's results are correct.
To summarize, this paper makes the following contributions:
(1) We present a new multi-modal image search interface targeted towards structured image retrieval tasks that allows users to effectively communicate their intent in an interactive fashion.
(2) We describe a neuro-symbolic image query language that allows expressing the types of logical queries that underlie structured image retrieval tasks.
(3) We present a program synthesis technique that leverages all the different modalities of input that users can provide through our proposed interface.

Image Retrieval
PhotoScout performs content-based image retrieval (CBIR), a technology pivotal in organizing digital image archives by visual content [41]. Datta et al. [17] characterize CBIR tools from two perspectives: the user's and the system's.
The user perspective depends on input query modalities, while the system perspective hinges on query processing methods and presentation of search results [17]. From the user perspective, PhotoScout is an interactive, multi-modal CBIR system that allows users to find relevant images from a large personal collection. In particular, PhotoScout is multi-modal in that the user provides a combination of natural language and example images, and it is interactive in that the user can refine the query results by providing feedback through the PhotoScout interface. From the system perspective, many prior CBIR tools search for target images using metadata (e.g., where or when an image was taken) [2,3,4] or based on features extracted through machine learning techniques (e.g., lighting conditions and position of an object) [47]. In contrast, the backend underlying PhotoScout is based on neuro-symbolic program synthesis: that is, it leverages the user's examples and natural language query to synthesize a logical search query utilizing pre-trained neural networks for object detection and classification. In the remainder of this section, we focus on prior work that is closely related to PhotoScout and refer the interested reader to existing surveys [17,18,29,41] for a more comprehensive overview of CBIR.

Expressing User Intent in CBIR.
A key challenge in image retrieval is the intention gap: the difficulty users face in articulating their task through queries [41]. Prior work aims to address this concern through different modalities of input [12,20,27,46,52] and multiple rounds of user interaction [14,28,30,54]. One line of work similar to PhotoScout is composed image retrieval [8,31,50], which utilizes visual and textual modalities to jointly specify the user's intent. In this line of work, an example image illustrates the concepts that the user is looking for, while the text query specifies what should be different (e.g., "same dress but blue instead of red"). In contrast to such interfaces, PhotoScout uses natural language to directly convey the user's intent rather than specifying what should be different from a given image.
In particular, users of PhotoScout utilize positive and negative images to clarify ambiguities in the natural language query rather than providing them as a starting point for visual similarity search.
Relevance Feedback-Based Search Paradigms. Relevance feedback (RF) is a paradigm for interactively refining search results based on user feedback [56]. In many systems, users provide relevance feedback in the form of positive and negative images where positive examples correspond to those that are relevant to the user's query while negative examples are not [16,26,32,34,38,44,45]. PhotoScout is similar to these approaches in that the user can refine the initial query results by providing positive and negative examples. However, in contrast to many RF systems where examples are used to re-rank the search results (e.g., [26,34,38]), PhotoScout uses positive and negative examples to extract hard semantic constraints that the query results should or should not satisfy.
Semantic Concepts for Images. Another significant hurdle in image retrieval is the semantic gap, which refers to the challenge of describing high-level semantic concepts using low-level visual features [41]. Past research has explored deep learning techniques based on Convolutional Neural Networks (CNNs), including architectures like SqueezeNet [24], VGG [39], and ResNet [22], to address this problem. PhotoScout builds on recent advances in this field and leverages pre-trained neural networks for object detection and classification. Prior work targets a variety of applications, including geolocation [7,51], medical diagnosis [6,15,35,55], and interior decorating [10]. However, in contrast to most neural CBIR systems, PhotoScout learns new semantic concepts by composing existing neural networks via symbolic operators.
Manuscript submitted to ACM

Neuro-symbolic Programming for Images
As mentioned earlier, PhotoScout performs image retrieval by synthesizing neuro-symbolic queries that combine pretrained neural networks with symbolic operators. Specifically, PhotoScout first synthesizes a query that is consistent with the user-provided input and then retrieves the desired images by executing the query on the user's dataset. Hence, PhotoScout is related to a line of recent work on neuro-symbolic programming for images [9,19,23,25,33,37,43,53].
General Visual-Reasoning Tasks. Several recent works such as [21,42] have proposed using neuro-symbolic programming to automate image-related reasoning tasks, such as visual question answering (VQA), image editing, and object tagging. In particular, VisProg [21] proposes a neuro-symbolic DSL targeting images and uses in-context learning to synthesize programs in this DSL based on natural language. ViperGPT [42] proposes a custom Python API for visual reasoning tasks and synthesizes Python programs using this API, also based on natural language queries. The back-end of PhotoScout also synthesizes neuro-symbolic programs; however, it uses a combination of natural language queries and positive and negative examples. In particular, PhotoScout generates a so-called program sketch by leveraging the natural language description and refines this sketch into a full query by utilizing the user-provided examples.
Specific Applications. While VisProg [21] and ViperGPT [42] propose general neuro-symbolic programming frameworks that can be adapted to several visual reasoning tasks, prior research has also developed more robust application-specific methods that use neuro-symbolic programming [9,19,23,25,33,37,43,53]. Similar to our work, these efforts typically combine symbolic operators for higher-level reasoning with neural modules for perception, with the goal of learning new concepts in a few-shot manner. For example, Huang et al. [23] generate programmatic referring expressions that identify specific objects in an image in terms of their attributes and relationships to other objects. This work focuses on locating a single object, whereas our DSL expresses image search tasks that involve multiple objects. In addition, their focus is on a synthetic dataset with geometric shapes, while our focus is on more realistic images with faces, text, and arbitrary objects.
In the domain of image manipulation, ImageEye [9] allows users to automate batch image editing tasks using neuro-symbolic programming. In particular, ImageEye captures demonstrations of a user editing an image and then synthesizes neuro-symbolic programs that are consistent with the demonstration. In contrast to PhotoScout, ImageEye does not utilize natural language; instead, it requires the user to demonstrate the task by applying actions to selected parts of an image.
Another related work in this space is RAPID [48], which is a system for automated image labeling. The idea behind RAPID is to express new visual concepts (e.g., chef) as logical combinations of existing concepts and then learn these concept definitions from positive and negative examples. For instance, RAPID may learn that an image should be labeled "chef" if there is food or a bowl in the image. In contrast to PhotoScout, RAPID does not utilize natural language descriptions, and thus lacks the inductive bias for efficient image search. Additionally, RAPID uses a different learning approach based on first-order inductive logic learning.

USAGE SCENARIO
This section illustrates the interface and features of PhotoScout through a use case inspired by real-world scenarios described in online blogs [1]. In this example, a photographer, John, is preparing a wedding photo album and needs to locate specific images among hundreds of photos he took during the wedding. As part of this process, John needs to find photos in which the bride, Alice, and the groom, Bob, are next to each other and where Alice is holding flowers.
For example, the first three images in Figure 1 meet John's requirements but the last one does not. John finds this task challenging to perform using existing similarity-based search tools, as there are many other images containing Alice, Bob, and flowers that do not match his logical constraints; for example, there is another person between Alice and Bob or Alice is not holding flowers. We now illustrate how John can use PhotoScout to perform this task and avoid significant manual labor.
Figure 2 shows the general interface of PhotoScout, which contains three main components: (1) a task specification panel (Figure 2-1 to 4) that allows the user to communicate their intent using a combination of natural language queries and image labels, (2) a search result panel (Figure 2-5) that shows the results from the current search query, and (3) a saved images panel (Figure 2-6) for saving and exporting the desired images. Using this interface, John can complete his task by performing the following steps:
(1) Load images. John first loads all the images into PhotoScout and then sees the interface shown in Figure 2.
(2) Write natural language query. The user interface exposes a search box where the user can type a natural language query (Figure 2-1) and a panel displaying thumbnails for all uploaded photos (Figure 2-3). In a typical use case, the user starts by entering a natural language query, such as "Alice next to Bob holding flowers", and clicks the "Search" button.
(3) Tag objects. In this example, PhotoScout does not yet know who Alice and Bob are, so, in the search results panel (Figure 2-5), PhotoScout displays a message communicating this missing information. John resolves this ambiguity by selecting an image and tagging Alice's and Bob's faces in the labeling panel (Figure 2-4). Figure 3 provides a more detailed view of the labeling panel. When John selects a photo, PhotoScout shows the full-size photo in the center of the labeling panel. The photo is annotated with object detection and classification results to help the user understand what the underlying computer vision tools "see" in that image. For example, when the user hovers over a part of the photo, PhotoScout displays detected objects as a square box, as shown in Figure 3. Additionally, the interface displays a natural language description of the classification results for that object. For example, Alice's face in Figure 3 is further categorized as smiling and between 31 and 41 years old.
In this scenario, John clicks on the face of the bride and labels the face as Alice (see Figure 3d). At this point, PhotoScout learns to associate this face with Alice, ensuring that she can be referenced in future search queries without additional user interaction. John uses the same panel to similarly detect the groom's face and label it as Bob.
(5) Select negative examples. This time, instead of asking for clarification, PhotoScout shows all relevant images in the result panel (Figure 2-5), along with a natural language explanation of how it generated these results.
After looking at the explanation and inspecting the results, John notices that the results contain all relevant photos but also some extra ones, specifically those where there are flowers, but Alice is not holding them (e.g., the last photo in Figure 1, where Bob's boutonnière is visible). To further refine the search results, John labels this photo as a negative example and does another round of search. This time, PhotoScout returns all photos of Alice next to Bob with Alice holding flowers. As a final step, John clicks on the "+" sign located at the top of the search results section (Figure 2-5) and all the added photos are displayed in Figure 2-6.
(6) Manually add/remove images. Upon inspection, John finds that there is one photo in the results in which Alice's flowers are sitting in front of her on a table, but she is not holding them. To exclude that photo from the search results, John selects the photo from [...] this image from the export results. Once John is happy with the results, he clicks the "»" button in Figure 2-6 to export the results to a user-defined directory.
[Figure 3 caption, partial: (d) is a tagging interface so the user can give semantic meanings to the detected face or object. In this particular example, the user is tagging the bride with the name "Alice" so that they can refer to the bride in the query.]
In summary, John is able to find all the photos he wants to retrieve by first providing a natural language query and then iteratively refining this query by tagging objects and labeling photos as positive or negative examples. In this process, he benefits from the following design decisions behind PhotoScout:
• Multimodal Inputs. PhotoScout grants John the versatility to articulate his search criteria using both natural language prompts and positive and negative examples. On one hand, solely relying on natural language introduces several potential ambiguities: For example, who are Alice and Bob, and who should be holding the flowers? On the other hand, solely relying on examples would be quite cumbersome, as John would need to provide several more examples to convey his intent. In contrast, the combination of natural language and image annotations allows John to succinctly and efficiently convey his intent.
• Semantic search. In our example, John's search query is quite specific: First, Alice and Bob must be next to each other, and, second, Alice should be holding flowers. Such search queries are out of scope for existing image retrieval systems, as they cannot reason about relationships between objects within an image. In contrast, our approach performs search by first synthesizing a neuro-symbolic program and then executing that program on all images. This synthesis-based approach allows PhotoScout to perform structured image search tasks where the goal is to find images that conform to non-trivial logical constraints.
• Feedback-guided refinement. Rather than expecting John to deliver flawless instructions from the outset, PhotoScout employs an interactive feedback mechanism. As illustrated in our example, PhotoScout recognizes ambiguous elements in John's description and proactively seeks clarification via natural language prompts.
• Fast synthesis procedure. To ensure that John does not have to wait a long time when interacting with the tool, PhotoScout adopts an efficient synthesis approach to find useful programs. Each synthesis run takes between 0.36

SYSTEM ARCHITECTURE AND IMPLEMENTATION
In this section, we discuss the design and implementation of PhotoScout. As mentioned earlier, PhotoScout performs image search by first synthesizing a program in a neuro-symbolic domain-specific language (DSL) and then applying that program to all images in a collection. In this section, we first provide an overview of the image search DSL and then explain the internal workflow of PhotoScout in more detail.

Image Search DSL
PhotoScout's DSL, shown in Figure 5, is designed to express a wide array of image search tasks. At a high level, a program in this DSL is similar to a first-order logic formula, and evaluates to either true or false given an input image.
The primitives in this DSL are predicates of the form $p(t_1, \ldots, t_n)$ where $p$ is an n-ary relation and each $t_i$ is a term (constant or variable). The PhotoScout DSL contains many built-in predicates such as the binary relations HasEmotion($x$, $e$) and HasType($x$, $t$), as well as ternary relations such as HasRelation($x_1$, $x_2$, $r$). Note that the semantics of the predicates are determined using neural models; hence, we refer to this DSL as neuro-symbolic. For example, HasType($x$, Car) is determined by using an object classification model to check whether $x$ is classified as a car. Similarly, the truth value of HasRelation($x_1$, $x_2$, Above) can be determined by using an object classification model that identifies bounding boxes around objects $x_1$, $x_2$ and then using the resulting coordinates to check whether $x_1$ is above $x_2$. As standard in first-order logic, predicates can be combined using boolean connectives. Additionally, our DSL allows quantification over variables to test whether an image contains some object with a given property (requiring existential quantification) or whether all objects have a certain property (requiring universal quantification).
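To make the predicate semantics concrete, the following is a minimal sketch of how such predicates and quantifiers could be evaluated over detector output. The `DetectedObject` record, the pixel-coordinate convention, and the geometric test for Above are all assumptions made for illustration; they are not taken from the PhotoScout implementation or from the Rekognition response schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DetectedObject:
    """Hypothetical record for one detector output (not a real API schema)."""
    obj_type: str                  # e.g. "Face", "Flower", "Car"
    top: float                     # box edges in pixels; y grows downward
    bottom: float
    emotion: Optional[str] = None  # only set for faces


def has_type(x, t):
    return x.obj_type == t


def has_emotion(x, e):
    return x.emotion == e


def has_relation(x1, x2, r):
    if r == "Above":
        # Assumed geometric test: x1's box ends before x2's box begins.
        return x1.bottom <= x2.top
    raise ValueError(f"unsupported relation: {r}")


def exists(objects, pred):
    """Existential quantification over the objects detected in one image."""
    return any(pred(o) for o in objects)


def forall(objects, pred):
    """Universal quantification over the objects detected in one image."""
    return all(pred(o) for o in objects)


def face_above_flower(image):
    """Exists x, y. HasType(x, Face) and HasType(y, Flower) and HasRelation(x, y, Above)."""
    return exists(image, lambda x: has_type(x, "Face") and exists(
        image, lambda y: has_type(y, "Flower") and has_relation(x, y, "Above")))


image = [
    DetectedObject("Face", top=40, bottom=100, emotion="Smiling"),
    DetectedObject("Flower", top=130, bottom=190),
]
matches = face_above_flower(image)
all_faces_smiling = forall(
    image, lambda x: (not has_type(x, "Face")) or has_emotion(x, "Smiling"))
```

A program is then simply a boolean function of an image, so retrieval amounts to keeping the images for which it returns true.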
In our implementation, the truth value of atomic predicates is determined using the Amazon Rekognition library. The pre-trained neural nets supported by this library can detect and locate a wide array of objects in an image. In particular, this library can be used to identify bounding boxes for different objects in the image, and to determine their types (e.g., cat, car, person, etc.). Rekognition can also detect properties of human faces (e.g., whether a face is smiling or has open eyes) and identify the same face across multiple images. Overall, it is this combination of logical operators and neural models that allows PhotoScout to express a rich class of structured image search tasks.
As an example, consider a program whose outermost operator is the universal quantifier ∀, which indicates that every object $x$ identified in the image must obey the subsequent condition. In particular, if $x$ is identified to be a human face, then $x$ must be smiling, and there must exist an object $y$ in the image such that $y$ is identified to be a flower, where $x$ is above $y$. Put simply, this program can be used to find images where every person is smiling and holding flowers, as in Figure 4. Note that the concept of "$x$ holds $y$" is approximated by checking a spatial relationship between $x$ and $y$.
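The program discussed in this paragraph is not reproduced in this text. One formula consistent with the prose description is sketched below; the variable names $x$ and $y$ and the constant names Face, Smiling, and Flower are assumptions chosen for illustration, and the concrete syntax of the DSL in Figure 5 may differ.

```latex
\forall x.\; \mathrm{HasType}(x, \mathrm{Face}) \rightarrow
  \Big( \mathrm{HasEmotion}(x, \mathrm{Smiling}) \;\wedge\;
        \exists y.\; \big( \mathrm{HasType}(y, \mathrm{Flower}) \wedge
                           \mathrm{HasRelation}(x, y, \mathrm{Above}) \big) \Big)
```

Under this reading, an image with no detected faces satisfies the formula vacuously, which is the usual behavior of a universally quantified implication.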

PhotoScout Synthesizer
In this section, we describe PhotoScout's underlying synthesis engine, which is depicted schematically in Figure 6.
Given the initial natural language query, PhotoScout starts by generating a program sketch containing holes (i.e., unknowns denoted as □). Intuitively, PhotoScout cannot directly generate a program from the natural language query because some of the concepts used in the NL description have to be grounded. For example, given a query like "Alice is holding flowers," the synthesizer has no idea what Alice corresponds to or how the concept of "holding" can be implemented in our DSL. To instantiate the program sketch into a complete program, PhotoScout interacts with the user by asking them to tag objects or provide examples (Step 2 in Figure 6). In the third step, the synthesizer fills the holes in the sketch by performing enumerative search over the space of sketch completions and discarding those programs that do not satisfy the examples. In the final step, the synthesized program is applied to all uploaded images and the results are displayed to the user. If the search results are unsatisfactory, the user can refine the query by providing more positive and negative examples. We now explain each of the steps in this process in more detail.
Step 1: Generate program sketches. Motivated by the success of few-shot prompting in similar domains [11,13,57], PhotoScout obtains program sketches by prompting GPT-3.5 Turbo. The key idea is to provide GPT with examples of representative natural language and program pairs and then ask it to generate a program for the user's NL query.
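As a rough illustration of this few-shot setup, the sketch below assembles a prompt from natural language and program pairs followed by the user's query. The example pairs, the textual program syntax, and the instruction wording are hypothetical; the actual prompt used by PhotoScout is not given in this text, and no model is called here.

```python
# Hypothetical few-shot pairs; the program syntax is illustrative only.
FEW_SHOT_PAIRS = [
    ("images with a car",
     "Exists x. HasType(x, Car)"),
    ("images where everyone is smiling",
     "Forall x. HasType(x, Face) -> HasEmotion(x, Smiling)"),
]


def build_prompt(user_query, pairs=FEW_SHOT_PAIRS):
    """Assemble a few-shot prompt that ends where the model should continue."""
    lines = ["Translate each image search query into a program.", ""]
    for nl_query, program in pairs:
        lines.append(f"Query: {nl_query}")
        lines.append(f"Program: {program}")
        lines.append("")
    lines.append(f"Query: {user_query}")
    lines.append("Program:")
    return "\n".join(lines)


prompt = build_prompt("Alice is holding flowers")
```

The resulting string would be sent to the language model, whose completion is then parsed as a candidate program or sketch.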
However, since Alice is not an object category recognized by the object detector and Holding is not a predicate in the DSL's grammar, these constructs will be replaced with holes, so the resulting program sketch contains two holes, □1 and □2.
Step 2. Given a sketch where □1 and □2 were derived from Alice and Holding, respectively, PhotoScout will display the following message to the user: "I don't know the terms 'Alice' and 'Holding'. Can you provide a few positive and negative examples and/or tags to show me what you mean?" The user can easily ground the name "Alice" by using the PhotoScout interface to add a tag. However, the concept of "holding" is harder to explain through tagging, as it corresponds to a binary relationship between two objects. In this case, the user can help PhotoScout learn this concept by providing a few positive examples where Alice is holding the flowers and a few negative examples where there are flowers in the picture but Alice is not holding them.
Step 3: Sketch completion.
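The replacement of unknown terms with holes, and the clarification message that follows from it, can be sketched as below. The vocabularies `KNOWN_TYPES` and `KNOWN_RELATIONS` and the function names are assumptions made for illustration; only the wording of the message is taken from the text.

```python
# Assumed vocabularies: categories the detector knows and relations in the DSL.
KNOWN_TYPES = {"Face", "Flower", "Car", "Cat"}
KNOWN_RELATIONS = {"Above", "NextTo"}


def make_sketch(entities, relations):
    """Replace every unknown entity or relation with a numbered hole."""
    holes = {}  # maps the original term to its hole label

    def ground(term, known):
        if term in known:
            return term
        if term not in holes:
            holes[term] = f"□{len(holes) + 1}"
        return holes[term]

    sketch_entities = [ground(e, KNOWN_TYPES) for e in entities]
    sketch_relations = [ground(r, KNOWN_RELATIONS) for r in relations]
    return sketch_entities, sketch_relations, holes


def clarification_message(holes):
    """Build the message shown to the user, or None if nothing is unknown."""
    if not holes:
        return None
    terms = " and ".join(f"'{t}'" for t in holes)
    return (f"I don't know the terms {terms}. Can you provide a few positive "
            "and negative examples and/or tags to show me what you mean?")


# "Alice is holding flowers": entities Alice and Flower, relation Holding.
sketch_entities, sketch_relations, holes = make_sketch(
    ["Alice", "Flower"], ["Holding"])
message = clarification_message(holes)
```

Keeping a map from each hole back to the term it came from is what lets the tool ask about specific words rather than reporting a generic failure.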
Suppose that the user has added the first three images in Figure 1 as positive examples and the last one as negative.
Consider the completion $P'$ of this partial program where □ has been filled with NextTo. For each positive example image $I^+$, $P'(I^+)$ is true, as Alice is adjacent to flowers in each of these images. However, for the negative example image $I^-$, $P'(I^-)$ is also true. Thus, $P'$ is not a valid completion of the program. However, the completion $P''$ where □ has been filled with Above is a valid completion, as Alice's face is above the flowers in every positive example, but not in the negative example.
Note that this step is useful even if none of the program sketches contain holes, as example images will filter out complete programs that do not match the user's intent.
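The filtering step described above can be sketched as a small enumerative search. The box coordinates, the candidate relation set, and the geometric definitions of NextTo, Above, and Below are all assumptions made for illustration; they stand in for the detector output and DSL semantics, which are not specified at this level of detail in the text.

```python
from collections import namedtuple

# Assumed box format in pixels; y grows downward.
Box = namedtuple("Box", "left top right bottom")


def gap(lo1, hi1, lo2, hi2):
    """Distance between two intervals on one axis (0 if they overlap)."""
    return max(0, max(lo1, lo2) - min(hi1, hi2))


def above(a, b):
    return a.bottom <= b.top and gap(a.left, a.right, b.left, b.right) == 0


def below(a, b):
    return a.top >= b.bottom and gap(a.left, a.right, b.left, b.right) == 0


def next_to(a, b, max_gap=50):
    return (gap(a.left, a.right, b.left, b.right) <= max_gap
            and gap(a.top, a.bottom, b.top, b.bottom) <= max_gap)


CANDIDATES = {"NextTo": next_to, "Above": above, "Below": below}


def complete_sketch(positives, negatives):
    """Return the hole fillings consistent with all examples."""
    valid = []
    for name, rel in CANDIDATES.items():
        def program(img, rel=rel):
            # Exists a flower y such that rel(alice, y) holds.
            return any(rel(img["alice"], f) for f in img["flowers"])
        if all(program(i) for i in positives) and not any(program(i) for i in negatives):
            valid.append(name)
    return valid


positives = [
    {"alice": Box(100, 50, 160, 110), "flowers": [Box(105, 130, 155, 180)]},
    {"alice": Box(300, 40, 350, 90), "flowers": [Box(290, 100, 360, 170)]},
]
negatives = [
    # Flower beside the face (e.g. on another person's lapel), not below it.
    {"alice": Box(100, 50, 160, 110), "flowers": [Box(190, 90, 210, 115)]},
]
valid_completions = complete_sketch(positives, negatives)
```

In this toy data NextTo holds on the negative example as well and is therefore discarded, which mirrors the reasoning in the text; with a larger space of completions, additional examples prune further candidates.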
Then GPT may generate the following natural language explanation: "I have found all images that contain Alice and flowers and where Alice's face is directly above the flowers."
Even after PhotoScout generates a correct program, it may not produce exactly the desired output for two main reasons: First, some concepts such as "holding" may not be perfectly expressible in our DSL. For instance, in our running example, we approximate the concept of holding through a coarse spatial relationship between objects (e.g., if face $x$ is directly above object $y$, then $x$ is holding $y$). Second, even when all concepts are perfectly expressible, the program may not produce the desired output due to inaccuracies in the underlying neural model. For instance, if the face recognition model does not correctly classify Alice's face, then a photo containing Alice may not appear in the search results even though it should. PhotoScout deals with this problem by allowing users to manually add or remove images through the Saved Images panel of the user interface.

Design Considerations
We conclude this section by summarizing and justifying some of the design considerations underlying PhotoScout.
User Interface. The design principles of PhotoScout's user interface reflect the requirements of structured image search tasks. As seen in the usage scenario, a structured search task may be simple and intuitive to describe in natural language, but contain ambiguities that are easier to resolve through visual examples. Hence, our interface allows the user to interactively refine the search results. In a typical workflow, the user begins their search by writing a natural language query, which may contain unknown terms and concepts that need to be grounded. To help the user understand which terms need to be grounded via user interaction, PhotoScout generates natural language explanations of what it does not understand. The user can then interact with PhotoScout to teach it new concepts. In particular, constants such as people's names are natural to teach via object tagging, whereas new predicates (e.g., "holding") can be demonstrated using positive and negative examples. Furthermore, the user can provide these examples in a piecemeal fashion by providing one example at a time, re-running the synthesizer, and inspecting the search results.
System Implementation. Recall that PhotoScout's system represents search tasks as programs in a DSL. Utilizing DSLs for visual tasks is an approach established in prior work [9,21,40,42]. In the context of structured image retrieval, we believe that such a DSL-based approach is a particularly good fit, as the user wishes to find images that satisfy certain logical constraints.
We note that any DSL imposes a tradeoff between expressiveness and reliability: the more expressive the DSL, the larger the space of tasks it can represent, leading to a harder synthesis problem. On the other hand, if the DSL is too restrictive, it may not be able to express image search queries that arise in practice. Our DSL maintains a balance between these two properties, capturing a wide array of structured image search tasks while keeping a compact structure similar to first-order logic that facilitates synthesis.
PhotoScout generates image search queries by using an LLM to "translate" the user's NL description to a program sketch in the DSL. This approach allows the synthesizer to extract as much information as possible from a coarse search query expressed in natural language. However, because the user's description may be ambiguous or contain new concepts that are not captured via pre-trained neural models, PhotoScout grounds new concepts via user interaction, which takes two forms: Object tagging allows the user to conveniently ground names, whereas positive and negative examples allow grounding unknown relations and resolving ambiguities.

USER STUDY
To understand how people interact with PhotoScout and gain deeper insight into the strengths and limitations of the proposed interface, we conducted a within-subjects evaluation centered on the following questions:
(1) Does PhotoScout improve user efficiency and accuracy compared to a baseline image search tool?
(2) Does the proposed multi-modal interface help users express their intent?
(3) Are users more confident about the accuracy of the search results compared to the baseline tool?
(4) What strategies do people adopt when interacting with PhotoScout?
In the remainder of this section, we first describe the baseline tool and our user study procedure. Afterwards, we present both quantitative and qualitative analyses of the user study results.

Baseline Tool: ClipWrapper
As a basis of comparison, we implemented a graphical user interface around OpenAI's CLIP model, which is a state-of-the-art neural network for learning visual concepts from natural language supervision. Given a dataset D of images and a query (in the form of text or image), the CLIP model assigns a score to each image in D that reflects its similarity to the given query.
Our baseline tool, henceforth called ClipWrapper, is a wrapper around this CLIP model. Specifically, ClipWrapper implements a graphical user interface that allows users to input a query on a set of uploaded images. The query can either be a text description of the search task or a photograph that exemplifies the target search results. ClipWrapper simply queries the CLIP model and returns all images in the dataset whose score exceeds some threshold. The ClipWrapper interface allows users to further refine the search results by manually adding images to or removing images from the result returned by the CLIP model.
ClipWrapper allows users to search for images that are similar to either a query image or an open-ended text query.
ClipWrapper does not utilize any hard constraints and may return images that do not precisely match a user's query.
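The retrieval behavior described above can be illustrated with a minimal Python sketch. This is not ClipWrapper's actual implementation: it assumes that query and image embeddings have already been computed by some model, it uses cosine similarity as the score, and the function names (`cosine_similarity`, `threshold_search`, `refine`) and the threshold of 0.5 are hypothetical choices made only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def threshold_search(query_embedding, image_embeddings, threshold=0.5):
    """Return names of images whose score exceeds the threshold, best first.

    `image_embeddings` maps an image name to its (assumed precomputed) embedding.
    """
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in image_embeddings.items()
    ]
    hits = [(name, score) for name, score in scored if score > threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in hits]


def refine(results, added=(), removed=()):
    """Manual post-processing: drop removed images, then append added ones."""
    kept = [name for name in results if name not in set(removed)]
    for name in added:
        if name not in kept:
            kept.append(name)
    return kept


# Toy two-dimensional embeddings, purely illustrative.
embeddings = {"a.jpg": [1.0, 0.0], "b.jpg": [0.0, 1.0], "c.jpg": [1.0, 1.0]}
results = threshold_search([1.0, 0.0], embeddings)
saved = refine(results, added=["b.jpg"], removed=["c.jpg"])
```

Because the score is a soft similarity rather than a logical constraint, an image such as `c.jpg` can pass the threshold while only partially matching the query, which is the kind of imprecision the manual refinement step has to absorb.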
A central question of this user study is: does ClipWrapper suffice for performing structured image retrieval tasks, or is a tool specially designed for such tasks necessary? Further, we explore the specific features of ClipWrapper that make structured image search difficult, compared with PhotoScout. ClipWrapper's interface mirrors PhotoScout's as closely as possible so as to reduce the number of confounding factors in our comparison.

User Study Procedure
We recruited a total of 25 participants for our user study. Among these participants, 23 are in the 18-24 age bracket, and the remaining two are between 25 and 35 years old. In terms of gender, 14 (resp. 9) of the participants self-identify as female (resp. male) and 2 self-identify as "other". Our only criteria for selecting participants were that they have prior experience using a computer and that they do not have impaired vision. The entire user study took place over the course of three weeks.
During the user study, each participant was asked to first complete a training session and then perform four image search tasks, two using PhotoScout and two using ClipWrapper. The order of tasks, as well as which tool to use for a given task, was randomly selected. The training session involved completing a tutorial about both tools and performing two practice tasks, one with PhotoScout and one with ClipWrapper. The users had access to the tutorial throughout the user study and were explicitly told that they could reference it whenever they wished to do so. The participants were given a total of 5 minutes to complete the practice tasks and each of the four image search tasks. Participants were told that they could end a task whenever they were satisfied with the results; however, participants opted to use all the time available to them in most cases.
In the course of the study, participants were asked to talk about their search strategies while completing each task.
To aid subsequent analysis, we collected both audio and screen recordings throughout the process. Upon completion of the four tasks, the participants were asked to reflect on their experience and answer some interview questions. The total session, including the tutorial and interview, took less than 90 minutes for each participant.

Tasks
Given a dataset of images, the goal of each task in the user study was to identify a subset of the images matching certain criteria. Specifically, the tasks involved the following three sets of images:
• Transportation: A set of 70 images of bicycles, cars, and people, mostly taken on public roads.
• Festival: A set of 420 images from a music festival, comprised of images of performances, venues, and attendees.
• Wedding: A set of 352 images from a wedding, including staged photos of the wedding party and candid photos of the ceremony and reception.
Each task targeted one of these datasets. Since PhotoScout is intended for use on personal images, we collected these datasets from image galleries shared on Flickr. As such, the datasets vary in size. Participants were provided with task descriptions, along with a description of the corresponding dataset. The task descriptions were as follows:
(0) Find all images that contain a car and a bicycle.
(1) Find all images that contain a guitar and a microphone.
(2) Find all images that contain no people.An image contains a person if you can see any discernible part of someone's body.
(3) Find all images where the bride is to the left of the groom.An image contains a person if their face is visible.
(4) Find all images that contain the bride but not the groom.An image contains a person if their face is visible.
Task 0 was the practice task and involved the transportation dataset. Tasks 1 and 2 used the festival dataset, and the last two tasks involved the wedding dataset. Note that tasks 3 and 4 involve searching for particular faces in an image. For these tasks, the participant was given example images with the bride and groom's faces.

USER STUDY RESULTS

Quantitative Results
Search Result Accuracy.One of the key metrics for evaluating the efficacy of each tool is accuracy of search results.
That is, within the 5-minute time limit, how close were the saved results to the ground truth? To answer this question, Table 1 reports the F1 score of the search results when participants use PhotoScout and ClipWrapper. We report two different accuracy results, namely before and after post-processing.
To understand what we mean by this, recall that people first interact with the underlying tool (ML model or synthesizer) to get an initial set of results, and then manually add/remove images to refine the search results before finally saving them. The columns labeled before post-processing show the F1 score for the search result automatically generated by the tool before manual intervention. As we can see, the initial search results for PhotoScout are significantly better (an average F1 score of 0.76 vs. 0.45 for ClipWrapper). Furthermore, using the Wilcoxon rank sum test, we find that these results are statistically significant, with a p-value of less than 0.02, for all tasks.
The columns labeled after post-processing show the results after the users have manually refined the search results within the 5-minute time limit. Overall, the F1 scores for PhotoScout are higher than those of ClipWrapper, and the overall difference in F1 score is statistically significant (p-value < 3.1e−7 for the Wilcoxon rank sum test).
However, if we run the same test for each individual benchmark, we find that the result is statistically significant for only the Guitar and Microphone task and the No People task. For the Bride and Not Groom task, there was one participant who mistook a wedding guest for the bride and completed the task by searching for images containing that guest. When this outlier is removed from the dataset, the result for the Bride and Not Groom task is significant as well.
A discussion of why the Bride Left of Groom task does not have a significant result is included in Section 6.3.
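For reference, the F1 score used in this comparison is the harmonic mean of precision and recall of the saved image set with respect to the ground truth. The following minimal Python sketch shows the standard set-based computation; the function name `f1_score` and the convention of returning 0.0 when there are no true positives are assumptions made for illustration, not details taken from our evaluation scripts.

```python
def f1_score(retrieved, relevant):
    """Set-based F1: harmonic mean of precision and recall.

    `retrieved` is the set of images a participant saved (or a tool returned);
    `relevant` is the ground-truth set for the task.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        # Assumed convention: no overlap (or empty inputs) scores 0.0.
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Three saved images, two of which are among the four ground-truth images.
example_score = f1_score({"a", "b", "c"}, {"b", "c", "d", "e"})
```

In this toy example, precision is 2/3 and recall is 1/2, so the F1 score is 4/7.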
Search Efficiency. The average time per search query (i.e., the time the system takes to perform a search for a given query) for PhotoScout and ClipWrapper is presented in Table 1. For ClipWrapper, search time is consistent across all tasks and queries. For PhotoScout, search time varies depending on what inputs the user provides. For instance, if the user provides an example image with a lot of different objects, then sketch completion will take longer, as there are more ways that the sketch could be filled in. While PhotoScout, on average, takes longer than ClipWrapper, both tools are efficient enough for interactive online use.
Manual effort. Another important metric for evaluating the efficacy of a tool is the amount of manual effort. That is, how many objects did the user tag, and how many images did the user have to manually add or remove before they were satisfied with the results or reached the 5-minute time limit? The use of tagging was extremely consistent.
Participants used tagging only to assign names to the bride and groom in the two tasks using the wedding dataset. For instance, P14 tagged the bride and groom as "Emily" and "John," respectively, and wrote the query "Emily is to the left of John." All but one participant who completed at least one of these tasks with PhotoScout used this strategy.
The results for other metrics of manual effort are presented in Table 1. In particular, for PhotoScout, we report two different numbers: (a) the total number of examples provided when using the synthesizer, and (b) the number of added/removed images to refine search results. We can take the sum of (a) and (b) to be the proxy for manual effort.
The difference in manual effort between PhotoScout and ClipWrapper is statistically significant for all tasks, with a p-value of less than 0.02. Note that, for all tasks, and the Bride Left of Groom task in particular, ClipWrapper users manually added and removed a significant number of images in proportion to the size of the ground truth dataset. Participants using ClipWrapper often resorted to extra manual effort to add and remove images from their initial search results to achieve a higher accuracy; however, this imposed a greater cognitive load: e.g., P14 mentioned "it felt like I had to basically look through every image."
Task Questionnaire. For the last part of our quantitative study, we analyze the results of the questionnaire that each participant was asked to complete after finishing a task. Specifically, participants were asked the following questions upon completion of each task:
(1) On a scale of 1-5, with 5 being very easy and 1 being not easy at all, how easy was it to complete the task using this tool?
(2) On a scale of 1-5, with 5 being very confident and 1 being not at all confident, how confident are you that your results are correct? "Correct" means that all of your saved images match the task, and none of the unsaved images match the task.
Figure 9 summarizes the results of this questionnaire. Across all tasks, participants gave an average score of 4.0 for PhotoScout and 2.7 for ClipWrapper on question 1, and an average score of 3.8 for PhotoScout and 2.6 for ClipWrapper on question 2. For question 1, the difference in scores between PhotoScout and ClipWrapper was statistically significant for all tasks, with p-values less than 0.03. For question 2, the difference in scores was significant for the Guitar and Microphone task and the No People task.
We also asked participants for qualitative input on each question. When answering question 1 (ease of use), some participants noted that PhotoScout "has a steeper learning curve" than ClipWrapper due to its more sophisticated search features, but that "once you have done the setup, the results it gives are pretty accurate" (P1). Further, when answering question 2 (confidence in results), some participants noted that they had confidence in their results with ClipWrapper because of the manual effort they had expended going through the images themselves. Despite these aspects working in ClipWrapper's favor, the scores for each question are consistently higher for PhotoScout than for ClipWrapper.

Qualitative Results
We conducted a semi-structured interview about the participants' experiences using PhotoScout and ClipWrapper.
We asked participants about their search strategies and results using both tools. In addition, we instructed participants to think aloud while completing each task, and kept notes on comments that participants made. One of the authors coded participants' responses to each question and comments on each task, and two authors reviewed and discussed the results collaboratively. We report the following key findings:
KF1: PhotoScout's synthesis-based search procedure makes structured image search easier and more efficient. 16 of the 25 participants reported that they thought their results using PhotoScout were more accurate than their results using ClipWrapper. Out of the other 9 participants, only one said that their results with ClipWrapper were more accurate; the other 8 were unsure. Participants expressed confidence in their results with PhotoScout: "[PhotoScout] was actually really good at getting what I asked. I think [PhotoScout] was pretty trustworthy overall" (P21). Similarly, they expressed a lack of trust in ClipWrapper: "I don't know, I just didn't have that much faith in [ClipWrapper]" (P5).
In particular, participants observed that PhotoScout was better than ClipWrapper at finding images that were consistent with logical and positional elements of their queries. When completing the No People task with ClipWrapper, P9 noted, "I put no people in the search bar, and it gave me a bunch of images with people." Many participants who used ClipWrapper for this task developed a strategy of finding one image without people, and searching for similar images. This strategy allowed them to find certain types of images without people (e.g., closeup images of signage at the festival), but caused them to miss other types of images that were not visually similar (e.g., photos of venues before performances had taken place). By contrast, participants who used PhotoScout could efficiently write a text query, add a few example images, and see a set of accurate search results matching the logical intention of their query.
Similarly, when completing the Bride Left of Groom task, P18 said "[ClipWrapper] doesn't seem to know its lefts and rights that well." Participants using ClipWrapper were able to find images containing the bride and groom without much difficulty, but finding images where the bride and groom were oriented correctly could only be accomplished through manual filtering. Meanwhile, participants using PhotoScout could use a text query and example images to specify that they only wanted photos where the bride is to the left of the groom, and saw results that reflected this intent.
P12 stated, "I noticed that [PhotoScout] is more functional when it comes to relational statements".
Interestingly, most participants did not make use of the natural language explanations of the search results in PhotoScout. Only 2 out of 25 participants reported that they found the explanations useful, and many participants did not notice the explanations, even though the tutorial pointed out this feature. While we cannot determine exactly why participants did not make use of NL explanations, we can conclude that this feature had little to do with participants' confidence in PhotoScout's search results. Future work could explore alternative methods of explaining search results to the user. One such method allows users to visualize why a particular image appeared, or did not appear, in the search results. This visualization could include annotations and/or text that highlight the parts of an image that match or do not match the query. Several participants noted the potential utility of this feature when reviewing their search results.
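KF2: Example images convey information that text alone cannot. 22 out of 25 participants indicated that example images provided additional information that they could not convey in text. P7 said "[Examples] can describe what you're looking for better than text.... A picture is worth a thousand words." P15 noted, "I like that I was able to provide example images, because it helped me clarify [my intent]." Participants noticed that example images and text queries worked synergistically: "I gave more specific text queries in [PhotoScout], because I could back them up with examples" (P25). By contrast, some participants noted that ClipWrapper required more general text queries: "I tried to give [ClipWrapper] as little ambiguity as possible" (P4).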
During the study, example images clarified ambiguous or erroneous text queries.For instance, when completing the Guitar and Microphone task with PhotoScout, many participants made the text query "guitar and microphone." Based on this text alone, it is unclear whether the user wants all images containing a guitar and a microphone, or all images containing a guitar and all images containing a microphone.A negative example image containing just one of these objects quickly resolves this ambiguity.In another instance, when completing the No People task with PhotoScout, P14 made the text query "music festival containing no people." The term "music festival" in this query was unnecessary (as all images in the dataset were from a music festival) and could have added noise to the search results.However, because the participant had already added a set of example images for the task, PhotoScout figured out that this part of the query was extraneous, and output images containing no people.
KF3: Selecting example images is an intuitive process. Every participant utilized example images when completing tasks with PhotoScout, and selecting positive and negative example images was an easy process for most tasks.
Usually, the participant quickly found positive and negative example images by scanning through the full image dataset.
In some instances, participants made an initial text query, and then selected example images from the smaller set of preliminary search results.
During the tutorial, we explained what positive and negative examples were, but did not offer any insight into what makes a good or bad example image. Even so, during the study 11 out of the 25 participants noted that they intentionally selected diverse example images. For instance, P10 noted, "for negative examples, I chose things that could be confusing." Similarly, P7 said, "I tried to find edge cases with one image that was totally different," and that especially for negative examples "I tried to find sort of tricky cases where an image was almost correct." This strategy likely helps to produce correct results in PhotoScout, as the example images will filter out synthesized programs that are almost correct but are missing one key component. Even though participants were not given any information about the underlying search procedure, they independently found an effective strategy for selecting example images. This behavior suggests that example images are an intuitive addition to text-based image search.

Description of Failure Cases
Limitations of object detector. Because PhotoScout performs image retrieval by executing neuro-symbolic queries, its performance is dependent upon the accuracy of the underlying neural models for object detection. If an image contains a particular object present in the user's query, but PhotoScout does not detect that object (e.g., because it is partially obscured), then that image will not appear in the search results.
This issue is more apparent if the object detector does not work well on the images that the user selects as positive and negative examples. In particular, because PhotoScout synthesizes a program that matches all positive examples and rejects all negative ones, PhotoScout may fail to synthesize any programs if the object detector misclassifies relevant objects in the example images. Indeed, this limitation of PhotoScout proved problematic in the Bride Left of Groom task. Several participants selected an example image of the bride and groom dancing, where only the back of the bride's head is visible. While participants could easily infer that this person was the bride, PhotoScout did not classify her correctly. Hence, PhotoScout could not generate a program that matched both the user's text query and this example image, and the user was prompted to adjust their query. A common response was for participants to then add more example images in an attempt to correct this error. However, they would continue to get poor results as long as they had any example images where relevant objects were misclassified by the object detector.
As a result, participants sometimes felt more frustrated when using PhotoScout than when using ClipWrapper: 8 out of 25 participants noted instances where PhotoScout should have detected an object but did not. In some cases, this caused participants to lose trust in PhotoScout. During the Bride Left of Groom task, P5 noted, "it decreases my confidence to know that [PhotoScout] misclassified the face I was looking for." When completing the Guitar and Microphone task, P8 said, "it's annoying when [PhotoScout] doesn't recognize a microphone in an image."
Limitations of LLM. It is also the case that PhotoScout's framework may fail to output results when GPT is unable to produce a program sketch from the user's text query. If the user provides a query that is very dissimilar from any of the example text queries in the prompt provided to GPT, then the output programs may fail to parse. This was the case when P22 made the text query "solo images of anna" during the Bride and Not Groom task (where they had already tagged the bride as "anna"). If this happens, the user will see no search results and will be prompted to adjust their query.
Inspiration for future work. An interesting direction for future work is to explore interaction models that balance the structure of PhotoScout with the flexibility of ClipWrapper. ClipWrapper will also fail to detect objects, and often misinterprets text queries. However, ClipWrapper is designed for similarity-based search queries, and does not extract any hard constraints from queries. As such, users will almost always get some results from any query they provide to ClipWrapper. Even if those results are inaccurate, users may feel more encouraged to continue trying other queries or to edit their results manually. One user suggested that PhotoScout could allow users to edit image labels in cases where the object detector is incorrect. Several other users reported that they would like a fusion of the two tools, wherein they could explore the dataset with open-ended text or image queries in a separate panel, without having to adjust the text query and example images that would determine the hard constraints of their task.
Manuscript submitted to ACM. arXiv:2401.10464v1 [cs.HC] 19 Jan 2024
structured image search tasks is that they require the contents of the image to satisfy certain logical constraints and combinations thereof.

Fig. 1. Left: Three images that match John's intent: the bride and groom are next to each other, with the bride holding flowers. Right: An image that is incorrect because the bride is not holding flowers.

Fig. 2. The PhotoScout interface has six main panels: (1) The user enters a natural language query describing the images to be searched. (2) The example images panel highlights all the positive and negative images that the user has already labeled. Positive examples are wrapped in a green box and negative examples are wrapped in a red box. (3) The album preview panel displays all the photos in the album to be searched from. (4) Once the user selects a photo to label, the example labeling panel displays the image and the example labeling buttons. (5) The search results panel shows all the images that PhotoScout finds that match both the natural language description and the labeled examples, along with a natural language explanation. (6) The photo export panel shows all the images selected by the user as the final search results.

Fig. 3. The example labeling panel consists of 4 elements. (a) A view of the photo to be labeled. When a user hovers over it, each object identified by the detector is highlighted with a square box, with the detailed description of the detected object shown in (c). (b) asks the user to label the photo either as a positive or negative example. (d) is a tagging interface so the user can give semantic meanings to the detected face or object. In this particular example, the user is tagging the bride with the name "Alice" so that they can refer to the bride in the query.
and 4.8 seconds, making it feasible to use PhotoScout in an interactive fashion.

Fig. 7. GPT prompt for generating program sketches from a user's text query.

Figure 7 shows an example of such a prompt where we provide the LLM with a manually curated set of representative (query, program) pairs as well as the natural language query of interest. PhotoScout asks GPT to generate 20 answers to this prompt in order to increase the likelihood that one of the results matches the user's intention. For each result returned by GPT, PhotoScout attempts to parse the string into a program in our DSL. If, during parsing, PhotoScout encounters a predicate or constant that it does not recognize, it replaces that construct with a hole $\square$. If parsing fails for a different reason, that program sketch is discarded.
Example 4.2. Given the text query "Alice is holding flowers", GPT may generate the program
$\exists x.\, \exists y.\, (\mathrm{HasType}(x, \mathrm{Alice}) \land \mathrm{HasType}(y, \mathrm{Flowers}) \land \mathrm{HasRelation}(x, y, \mathrm{Holding}))$ (1)
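The hole-insertion step described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not PhotoScout's parser: it assumes the LLM output has already been parsed into a list of atoms encoded as tuples, and the names `Hole`, `to_sketch`, `known_types`, and `known_relations` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hole:
    """Placeholder for a constant or predicate the system does not recognize."""
    index: int


def to_sketch(atoms, known_types, known_relations):
    """Replace unrecognized type constants and relation names with numbered holes.

    Atoms are assumed to be tuples of the form ("HasType", var, type_name)
    or ("HasRelation", left_var, right_var, relation_name).
    """
    holes = {}  # unknown term -> Hole, so a repeated term reuses its hole

    def ground(term, vocabulary):
        if term in vocabulary:
            return term
        if term not in holes:
            holes[term] = Hole(len(holes) + 1)
        return holes[term]

    sketch = []
    for atom in atoms:
        if atom[0] == "HasType":
            _, var, type_name = atom
            sketch.append(("HasType", var, ground(type_name, known_types)))
        elif atom[0] == "HasRelation":
            _, left, right, relation = atom
            sketch.append(("HasRelation", left, right, ground(relation, known_relations)))
        else:
            # Any other construct is treated as a parse failure (sketch discarded).
            raise ValueError(f"unsupported atom: {atom!r}")
    return sketch


# "Alice is holding flowers", assuming "Alice" and "Holding" are not yet known.
atoms = [
    ("HasType", "x", "Alice"),
    ("HasType", "y", "Flowers"),
    ("HasRelation", "x", "y", "Holding"),
]
sketch = to_sketch(atoms, known_types={"Flowers", "Person"}, known_relations={"Above", "LeftOf"})
```

Under these assumed vocabularies, the name "Alice" and the relation "Holding" each become a hole, while the recognized type "Flowers" is kept as-is.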

Step 2: Query the user. If the program generated in Step 1 contains holes, PhotoScout prompts the user to provide additional information by (1) tagging objects that are not recognized by the object detector and (2) adding example images that clarify the meaning of unknown predicates. Tags and example images both allow the user to clarify the meaning of their natural language query, but in complementary ways. Tags allow grounding unknown terms like people's names in the user's NL query, whereas positive and negative examples allow the synthesizer to learn logical constraints and concepts (e.g., the concept of "holding") in a data-efficient way.
Example 4.3. Given the program sketch
$\exists x.\, \exists y.\, (\mathrm{HasType}(x, \square_1) \land \mathrm{HasType}(y, \mathrm{Flowers}) \land \mathrm{HasRelation}(x, y, \square_2))$

Step 4. PhotoScout also generates a natural language explanation of what the program does. Such explanations are intended to help users quickly uncover unintended behaviors without having to look through a large set of images and inspect each one. PhotoScout generates these NL descriptions through few-shot prompting of an LLM: in particular, given a few examples of programs and their corresponding NL descriptions, PhotoScout prompts GPT to produce a natural language description of the programmatic query.
Example 4.5. Suppose that the synthesized program is
$\exists x.\, \exists y.\, (\mathrm{HasType}(x, \mathrm{Alice}) \land \mathrm{HasType}(y, \mathrm{Flowers}) \land \mathrm{HasRelation}(x, y, \mathrm{Above}))$. (5)

Fig. 8. The ClipWrapper interface has five main panels: (1) The user enters a natural language query describing the images to be searched. (2) The album preview panel displays all photos in the target album. (3) The photo view panel displays an image and allows users to search for similar images. (4) The search results panel shows all images that ClipWrapper finds that match the user's query. (5) The photo export panel shows all images selected by the user as the final search result.

Fig. 9. (Legend: PhotoScout, Baseline.)

While object tagging helps resolve many sources of ambiguity in the natural language, PhotoScout needs to perform enumerative search over possible sketch completions to find a program that is consistent with all positive and negative examples. Given a program sketch and a set of positive and negative examples $E^+ \cup E^-$, PhotoScout enumerates possible completions of the sketch by instantiating each hole with a constant and then evaluating the resulting query on each example in $E^+$ and $E^-$. If the query evaluates to true (resp. false) for all examples in $E^+$ (resp. $E^-$), then it is retained as a viable completion of the sketch. Among all programs that are consistent with the examples, PhotoScout chooses the simplest program, where simplicity is defined in terms of the number of nodes in the program's abstract syntax tree. Note that enumerative search is tractable in this context because each sketch contains no more than a few holes.
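The enumerative search described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than PhotoScout's synthesizer: programs are conjunctions of atoms under existential quantifiers, each image is assumed to be pre-processed into per-object label sets and a set of pairwise relations, and all names (`Hole`, `holds`, `fill`, `synthesize`, the image encoding) are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hole:
    index: int


def atom_holds(atom, env, image):
    """Check one atom under a variable-to-object assignment `env`."""
    if atom[0] == "HasType":
        _, var, type_name = atom
        return type_name in image["types"][env[var]]
    _, left, right, relation = atom
    return (env[left], relation, env[right]) in image["relations"]


def holds(program, image):
    """Existential semantics: some assignment of objects satisfies every atom."""
    variables = sorted({v for atom in program for v in atom[1:-1]})
    objects = range(len(image["types"]))
    for assignment in product(objects, repeat=len(variables)):
        env = dict(zip(variables, assignment))
        if all(atom_holds(atom, env, image) for atom in program):
            return True
    return False


def fill(sketch, values):
    """Instantiate every hole in the sketch with its chosen constant."""
    return [tuple(values[t] if isinstance(t, Hole) else t for t in atom) for atom in sketch]


def synthesize(sketch, positives, negatives, type_constants, relation_constants):
    """Enumerate hole completions and keep those consistent with all examples."""
    candidates = {}
    for atom in sketch:
        if isinstance(atom[-1], Hole):
            pool = type_constants if atom[0] == "HasType" else relation_constants
            candidates.setdefault(atom[-1], pool)
    consistent = []
    for choice in product(*candidates.values()):
        program = fill(sketch, dict(zip(candidates, choice)))
        if all(holds(program, img) for img in positives) and not any(
            holds(program, img) for img in negatives
        ):
            consistent.append(program)
    return consistent


# Toy examples: object 0 is tagged "Alice", object 1 is detected as "Flowers".
positive = {"types": [{"Person", "Alice"}, {"Flowers"}], "relations": {(0, "Holding", 1)}}
negative = {"types": [{"Person", "Alice"}, {"Flowers"}], "relations": {(1, "Above", 0)}}
sketch = [("HasType", "x", Hole(1)), ("HasType", "y", "Flowers"), ("HasRelation", "x", "y", Hole(2))]
result = synthesize(sketch, [positive], [negative], ["Alice", "Bob"], ["Holding", "Above"])
```

In this simplified setting every completion has the same number of AST nodes, so no tie-breaking by simplicity is shown; a fuller version would rank the consistent programs by size. The sketch also makes the failure mode discussed earlier visible: if the detector mislabels an object in a positive example, no completion may satisfy it, and `synthesize` returns an empty list.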

Table 1. A quantitative comparison of PhotoScout (abbr. P) and ClipWrapper (abbr. C). # Ground Truth lists the number of images in the ground truth dataset of each task. # Solved lists the number of participants who were assigned each tool for each task. Avg. F1 Score Before and Avg. F1 Score After list, respectively, the average F1 score of the search results before and after performing post-processing (i.e., manually adding and removing images from the search results) with each task and tool. Manual Effort lists the average number of images post-processed (i.e., added and removed from search results) for each task, and, in the case of PhotoScout, the number of images selected as examples.
22 out of 25 participants indicated that example images provided additional information that they could not convey in text.P7 said "[Examples] can describe what you're looking for better than text....A picture is worth a thousand words." P15 noted, "I like that I was able to provide example images, because it helped me clarify [my intent]."Participants noticed that example images and text queries worked synergistically: "I gave more specific text queries in [PhotoScout], because I could back them up with examples" (P25).By contrast, some participants noted that ClipWrapper required more general text queries: "I tried to give[ClipWrapper]