Critical-Reflective Human-AI Collaboration: Exploring Computational Tools for Art Historical Image Retrieval

Just as other disciplines, the humanities explore how computational research approaches and tools can meaningfully contribute to scholarly knowledge production. Building on related work from the areas of CSCW and HCI, we approach the design of computational tools through the analytical lens of 'human-AI collaboration.' Such work investigates how human competencies and computational capabilities can be effectively and meaningfully combined. However, there is no generalizable concept of what constitutes 'meaningful' human-AI collaboration. In terms of genuinely human competencies, we consider criticality and reflection as guiding principles of scholarly knowledge production and as deeply embedded in the methodologies and practices of the humanities. Although (designing for) reflection is a recurring topic in CSCW and HCI discourses, it has not been centered in work on human-AI collaboration. We posit that integrating both concepts is a viable approach to supporting 'meaningful' human-AI collaboration in the humanities and other qualitative, interpretivist, and hermeneutic research areas. Our research, thus, is guided by the question of how critical reflection can be enabled in human-AI collaboration. We address this question with a use case that centers on computer vision (CV) tools for art historical image retrieval. Specifically, we conducted a qualitative interview study with art historians to explore a) what potentials and affordances art historians ascribe to human-AI collaboration and CV in particular, and b) in what ways art historians conceptualize critical reflection in the context of human-AI collaboration. We extended the interviews with a think-aloud software exploration. We observed and recorded participants' interaction with a ready-to-use CV tool in a possible research scenario. We found that critical reflection, indeed, constitutes a core prerequisite for 'meaningful' human-AI collaboration in humanities research contexts. 
However, we observed that critical reflection was not fully realized during interaction with the CV tool. We interpret this divergence as supporting our hypothesis that computational tools need to be intentionally designed in such a way that they actively scaffold and support critical reflection during interaction. Based on our findings, we suggest four empirically grounded design implications for 'critical-reflective human-AI collaboration': supporting reflection on the basis of transparency, foregrounding epistemic presumptions, emphasizing the situatedness of data, and strengthening interpretability through contextualized explanations.


INTRODUCTION
In the humanities, retrieving objects and primary sources for the purpose of studying them constitutes a central ingredient of scholarly work. Art historians, for example, access objects or images of objects alongside their metadata and accompanying information. The decision-making in terms of which objects are considered 'relevant' is highly dependent on the art historian's research question, field of study, and other contextualized factors. The process of scholarly image retrieval and corpus building has changed significantly with the digitization of textual and visual sources and the creation of digital repositories (see, e.g., [28,47]). Art historians, just like other humanities scholars, can now query online databases for objects, sources, and literature without traveling to the collecting institution or contacting archivists or other researchers. Recently, research in digital humanities (DH) and the sub-field of digital art history (DAH) has been exploring computer vision (CV) as a means for image retrieval.

Through a reflexive thematic analysis (TA), we confirm that art historians are eager to interact with computational tools in a critical-reflective manner. However, the participants could not fully realize this eagerness during interaction with a CV tool. We interpret these divergences as support for our hypothesis that computational tools need to be intentionally designed in such a way that they actively support critical reflection. We conclude that computational tools should not be developed based on the assumption that humanities scholars are able to effectuate their competencies in criticality and reflection without any scaffolding, even more so when this would require extensive computational knowledge. Based on our findings, we derive implications for the design of 'critical-reflective human-AI collaboration.'
We make the following contributions with our research: We deepen the understanding of what constitutes 'critical-reflective human-AI collaboration' by building on and extending existing research from CSCW and HCI; we present findings from an empirical study that integrates expert interviews with art historians with observations of their interactions with a CV tool for image retrieval; and we propose four empirically grounded high-level design implications for 'critical-reflective human-AI collaboration,' namely: 1) supporting reflection on the basis of transparency, 2) foregrounding epistemic presumptions, 3) emphasizing the situatedness of data, and 4) strengthening interpretability through contextualized explanations.
Our paper is divided into four main sections. We lay the foundation for our research by reviewing related work on human-AI collaboration, principles of reflection in HCI, and critical reflection in the humanities and DH (Section 2). We then explain our research method in Section 3, where we first introduce our use case in the context of art history, specifically CV-based image retrieval. This is followed by a description of our study design. In Section 4, we present the findings that relate to human-AI collaboration and critical reflection. Based on a discussion of our findings, we present four design implications for 'critical-reflective human-AI collaboration' (Section 5).

RELATED WORK
In the following, we provide an overview of existing research on human-AI collaboration in qualitative research. We then foreground reflection as one 'meaningful' realization of human-AI collaboration. This research from CSCW and HCI provides the foundation for situating our research in the context of critical reflection in the humanities.

Conceptualizing Human-AI Collaboration in Qualitative Research
Human-AI collaboration is not a clearly defined term in HCI and CSCW but represents more of a design concept. Recent technological advancements in 'artificial intelligence' (AI) have inspired CSCW and HCI scholars to investigate how to support 'collaboration' between humans and AI-based systems (see, e.g., [74,80]). Terveen [74] provided one of the earliest conceptualizations of such collaboration; he states, "collaboration is a process in which two or more agents work together to achieve shared goals." Terveen emphasizes the disciplinary grounding of what he calls "human-computer collaboration." This perspective conceptually draws from both AI and HCI; thus, it combines different understandings of designing such collaboration. The design concept of collaboration has been taken up over the years and appears in various other concepts such as mixed-initiative interaction [46], interactive intelligent systems [50], or human-computer integration [35]. Thus, a growing body of research deals with how humans and AI-based systems can effectively collaborate in particular application areas. While there is a lack of research on areas of application that directly relate to our field of study, i.e., the humanities and art history, we can build on related work on human-AI collaboration for qualitative research.
Jiang et al. [51] conducted a study on qualitative researchers' work practices and the potential benefits and challenges of AI-based tools in qualitative research. Their study confirms that the distribution of agency between humans and AI systems must be carefully balanced. They point out that the automation of qualitative analysis in its entirety is infeasible. Instead, they suggest shifting the focus "towards ways that AI could be a collaborator that works alongside humans rather than a delegate that performs specific tasks" [51]. Their study particularly stresses the importance of emphasizing human agency in data analysis. In response, Feuston and Brubaker [37] suggest a more nuanced approach that considers the different stages of a qualitative analysis process as one of the defining parameters that influence whether and to what extent scholars believe that AI could support them in their research. Feuston and Brubaker also describe how the use of AI in different stages of a qualitative analysis process impacts knowledge production, namely, a shift in scalability, abstraction, and delegation [37]. Considering the different stages of qualitative analysis and carefully and purposefully conceiving human-AI collaboration tailored to a specific phase or sub-task within a research process is also echoed in a study by Chen et al. [18]. Instead of building an AI-based system that predicts codes for data instances, which turned out to be unfeasible, Chen et al. suggest that an AI system could help identify points of potential inconsistency and ambiguity in the assigned codes. The AI system is not expected to present the researchers with a solution or resolve the ambiguity. Instead, it will direct them to instances that need clarification or prompt them to reconsider diverging interpretations. Going a step further, Baumer et al. explore the application of computational approaches to the actual analysis of texts. They similarly emphasize that the role of an AI-based system should be carefully balanced, as suggested by their approach of "designing for interpretation" [8]. Following this principle, Baumer et al. explore an alternative approach to computational text analysis by providing a visualization that supports the interpretation and reinterpretation of the data. They emphasize that algorithmic systems need to strike a balance between explicitly stating what their results mean, for example, explaining how the results come about, while intentionally leaving room for the interpretation of the results [8]. In a similar approach, Hong et al. [45] relate their suggestion of using topic models for qualitative research to related work on topic modeling in DH. They conceptualize human-AI collaboration as a tool for assisting interpretive research [45].
In summary, researchers in the field of human-AI collaboration investigate how to effectively combine human intelligence and computational capabilities, i.e., how such collaboration can augment human intellect. Furthermore, they carefully investigate, within their application contexts, to what extent AI and humans can complement each other and profit from such collaboration. However, we still lack a more nuanced understanding of what 'collaboration' between humans and AI means in specific contexts and what type of collaboration we want to support. We need to define what quality the collaboration should have in order to be considered 'meaningful.' As one approach to achieving 'meaningful' collaboration, we suggest foregrounding reflection, a distinctly human trait.

Using Reflection as Means of Human-AI Collaboration
We posit that a 'meaningful' realization of human-AI collaboration can be accomplished by explicitly designing AI technologies that support and encourage reflection on the users' side as a guiding principle of interaction [5,7]. Reflection is a somewhat elusive concept: the challenge lies in conceptualizing reflection epistemologically and operationalizing this abstract concept [5]. Reflection can generally be defined as "reviewing a series of previous experiences, events, stories, etc., and putting them together in such a way as to come to a better understanding or to gain some sort of insight" [7]. This perspective builds upon theories of learning and work practice, where experience forms the basis for thinking about and planning for the future. In this process, we adapt and change our knowledge, behavior, and values [25,68]. In HCI, reflection has many manifestations that vary in scope; they range from "revisiting" to "transformative" events. The effects of such reflection include, for example, a change in behavior or understanding, or gaining insights [71]. Over the last few years, reflective technologies have been realized in various contexts, such as slow technology [43], persuasive design (e.g., [40]), and personal informatics (e.g., [58]). Hallnäs and Redström introduced slow technology as "a design agenda for technology aimed at reflection and moments of mental rest rather than efficiency in performance" [43]. They emphasize that technology can be slow in several ways, such as learning how it works, understanding why it works the way it does, using it and experiencing its behavior, or figuring out the consequences of its application [43]. Building on that, Cox et al. [22] propose "mindful interaction" to realize a deliberate and intentional interaction. They suggest that an interactive system should contain a microboundary, i.e., a "small obstacle [...] that prevents us rushing from one context to another." In line with this research, Baumer et al. [5] suggest three essential characteristics that technology should provide to evoke reflection: "breakdown," "inquiry," and "transformation." Breakdowns refer to the intentional violation of the users' expectations, which creates perplexity and, thus, leads to reflection. At the same time, incomprehension might also act as a prompt to engage in interpretation, i.e., making sense of or deriving meaning from a particular output. The second dimension is the "process of conscious, intentional inquiry" [5]. Such inquiry can be supported by reviewing previous experiences or actions. It can be achieved by providing a designated space for inquiry, for example, in the form of a visualization that is separate from the main activities to which that inquiry relates and that enables an exchange with others to promote reflective inquiry. Baumer introduces the third dimension, transformation, as "envisioning alternatives" [5]. Reflection involves change: it is not just about examining the current state of the world or the self but also about imagining alternatives. Thus, such transformation leads to a change in how a human understands or conceptualizes the world. Consequently, when supporting reflection in human-AI collaboration, users should be made aware of conflicting pieces of information (= breakdown) that they can interact with (= inquiry) to resolve possible conflicts or to promote insight (= transformation).

Challenging Critical Reflection in (Digital) Humanities
Reflection is firmly embedded in the methodologies and practices of the humanities, often in conjunction with the notion of criticality. This includes, for example, the understanding of source criticism as a structured method that was "spelled out in systematic guidelines for historical research and became a cornerstone of academic training" [62] as early as the late 18th century. Similarly, reflecting on and through theory and reflecting on epistemology and methodology are central concepts in the humanities. In general terms, as Flusser put it: "methodic reflection is a critique of science" [39]. In the following, we foreground related work that helps us illustrate how understandings of criticality and reflection as part of epistemology and knowledge production are being challenged and revisited in the context of engaging with AI, i.e., computational approaches and tools, in the humanities.
The development of computational tools for the retrieval and remediation of objects constituted one of the earlier goals of DH [2]. The results of these efforts now benefit the humanities as a whole, primarily in terms of increased access to sources and digitized material. In the context of art history, tool development has mostly been situated in the realm of what Drucker has called "digitized art history," where digital tools "are just new ways of doing old work a little faster, easier, and with greater access to more materials of all varieties" [28]. In this sense, tool development is based on identifying time-consuming tasks that computers can perform more efficiently [13].
In this understanding, 'digitized art history' relates to accessing objects through databases containing images, information, and data that reproduce symbolic encodings of "expert" knowledge [48]. Such textual and ontological knowledge representations are usually grounded in long-established epistemologies and knowledge structures, i.e., "doing old work" [28]. In a critical reading of subject indexes and classification schemes, Rawson and Muñoz point out that taxonomies and guidelines for their application assume an underlying "correct" order that suppresses diversity of knowledge [65]. Even in the digital realm, art-historical objects remain firmly situated within established epistemological frameworks (see, e.g., [17,24,27]). Classification schemes for art-historically relevant objects tend to emphasize the interpretation placed on the objects [33], which is, for example, the case with the widely used 'Iconclass' system [21] that allows the classification of iconography in artworks. Consequently, keyword search in image collections largely aligns with established and canonized interpretations. As Underwood has argued with regard to search on literary texts, typing in a search term already encodes "tacit hypotheses about the literary significance of a symbol" [76], which also applies to text-based image retrieval. Such critical and reflective engagements with the hegemonic effects of information systems resonate with work from science and technology studies that points out how databases and classifications mirror and re-perform the knowledge economies of which they are a part [81]. Correspondingly, in the context of critical design in HCI, Feinberg et al. [36] have pointed out how classification systems flatten and distort complex realities.
Text-based retrieval has been, and still is, the predominant mode of accessing images, visual sources, and other non-textual material that is relevant to art historical research. While full-text search gives access to a textual document in itself, text-based image retrieval fully relies on knowledge representation, i.e., textual or numerical representations of an artwork or object. This is not only the case for computer-based systems: card catalogs have long been the primary retrieval system for artworks and art-historically relevant objects in museums or archives. Such card catalogs also acted as a reference point for the design of computer-based information retrieval, for example, in terms of human-processable knowledge representation and classification (see, e.g., [19,72]). Accordingly, art historians are well-trained in accessing artworks and other visual, non-textual sources and objects of study through such human-processable knowledge representations, both in the form of manual card catalogs and digital databases. Their competency to reflect on the underlying effects of knowledge representation and epistemic orders of knowledge in their field of study is paired with source criticism and is also applied when using digital databases for text-based image retrieval.
By enabling access to objects in their pictorial form as digital images, CV opens up a mode of image retrieval that is not solely reliant on established and canonized text-based classifications and interpretations. While this broadens access to image collections and might circumvent canonized classifications and orders of knowledge, it is important to acknowledge that CV introduces new and potentially unnoticed distortions that need to be understood, accounted for, and made explicit. Despite this caveat, the potential of CV for art historical research is evident: instead of being confined to a keyword search based on the symbolic encodings of 'expert' knowledge about the objects, art historians can search image collections using an image as input. This way, CV might also reduce barriers associated with cross-language search, which is particularly promising in light of a growing interest in non-western art history [82]. CV also holds the potential to enable exploratory approaches to image retrieval. This is helpful in art historical scholarship since researchers do not always know what they are looking for as well as they think [76].
In acknowledging that CV introduces new layers of symbolic encodings and knowledge economies, it also becomes obvious that it is crucial to scrutinize its distorting and hegemonic effects in order to secure 'meaningful' integration in art historical research and practice. Hence, researchers need to understand and reflect on how this technology affects epistemic assumptions, research processes, and knowledge production, and how the technology shapes their "ways of thinking" [66]. This requirement also extends to the ways in which the images and object data are mediated through interfaces and visual arrangements. As Drucker put it: "an interface is information, not merely a means of access to it" [29]. In this context, the need to critically reflect on the implications of interacting with technology is evident: using computational tools includes "[reflecting] on the methods and premises that shape our approach to knowledge and our understanding of how interpretation is framed. Digital humanities projects are [...] occasions for critical self-consciousness" [30].
The works reviewed above highlight how engagement with AI and computational tools in the (digital) humanities challenges critical reflection. However, what exactly constitutes instances or practices of criticality and reflection, and how they can be operationalized in practices of use, is rarely made explicit [38]. Concrete approaches to applying critical reflection in DH are emerging in the context of 'tool criticism.' Here, critical reflection is framed as a means to understand "that research questions, methods, tools, and data are interdependent and choices regarding them are shaped in an interactive and reflective research process" [55]. Tool criticism, in this sense, refers to a technical reflection on the tool and the critique of the research output while also considering the tool's influence on the research process [77]. This stance echoes work that emphasizes the importance of critically thinking "about how machine learning is being designed and deployed in the specific problem domains represented by the informating, augmenting and automating of digital humanities" [4]. However, how such acknowledgments of the importance of criticality and reflection could be translated into actually designing AI technologies that "scaffold the reflection process" [71] is still under-investigated.

RESEARCH METHOD
Our work contributes to a deepened understanding of this challenge, i.e., we investigate how critical reflection can be enabled in human-AI collaboration. We explore this topic with a use case on CV-based image retrieval in art history and conduct a qualitative interview study. However, we also recognize that knowledge is created through use practice and might never become formalized [41]. Consequently, we consider it insufficient to address human-AI collaboration and associated concepts of critical reflection only through interviews. Therefore, we extend the interviews with a think-aloud software exploration during which the participants use a CV-based tool to conduct a real-world image retrieval task that is grounded in their current research.
This method setup allows us to investigate our overarching research question along four subquestions: (1) What potentials and affordances do art historians ascribe to human-AI collaboration (specifically regarding CV)? (2) How do these potentials and affordances surface when art historians interact with a CV-based image retrieval tool? (3) In what ways do art historians conceptualize critical reflection in the context of human-AI collaboration?
(4) How do art historians realize critical reflection when interacting with a CV-based image retrieval tool?

We conducted semi-structured expert interviews and a software exploration with 12 art historians. Our work does not aim at creating representative or generalizable assumptions about human-AI collaboration. This is why we deem this sample size well-suited for our research question. By assuring relative heterogeneity within our sample (see Section 3.2) while at the same time limiting the sample to a moderate size, we set the foundation for capturing the complexity and nuance contained within the data [11].
In the following, we describe the use case of CV-based image retrieval. Following this, we detail how we recruited our participants, conducted the interviews and software exploration, and performed a reflexive thematic analysis (TA) of the data collected.

Use Case: Computer-Vision-Based Image Retrieval
Our work rests on a clear distinction between ready-to-use tools that support the retrieval of objects and computational analysis that promises to yield actual results that are a relevant contribution to a field of study. In short, the tools that we foreground are intended to be used by art historians to find objects of study, not to inquire into their objects of study. Our decision to focus on image retrieval is informed by the observation that, in the humanities, digital tools are often used during the exploration phase of a research process, for example, while searching in digitized collections [55]. This decision is also informed by work on human-AI collaboration in the context of qualitative research (see Section 2.1), which has pointed out researchers' skepticism toward the applicability of computational approaches during analysis and interpretation. These findings are congruent with the skepticism regarding computational analysis in the humanities (see, e.g., [23,26,54]). The actual art-historical analysis and interpretation of the resulting corpus that includes the computationally retrieved objects would still be up to the scholars themselves (see, e.g., [54]).
CV is already broadly available in web-based, commercial applications like Google image search or as a feature in the iOS Photos app. However, such commercial applications do not serve the needs of art historians in the sense of a "research tool." The lack of context-specific tools is why current research in DAH explores CV as a means for image retrieval that also gives room to more complex understandings of visual similarity. For example, images can be considered in their entirety or by indicating a more specific search interest such as a specific motif or pose. This can be done, for example, by drawing bounding boxes around image elements (see, e.g., [48,69,75]). Such applications of CV help researchers to find artworks that, in their entirety, do not look similar to the input image but that do include similar motifs [75]. Despite the breadth of research on CV for art historical study purposes, the solutions are, as of yet, rarely made available to non-technical users in ready-to-use tools [53].
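To make the idea of bounding-box-based motif search concrete, the following is a minimal, hypothetical sketch in Python. It is not drawn from any of the cited systems: the `embed()` function here simply flattens and L2-normalizes a pixel patch, standing in for the learned feature extractor a real CV system would use, and the collection images are compared via a same-size central crop for simplicity.

```python
import numpy as np

def embed(patch: np.ndarray) -> np.ndarray:
    """Stand-in for a CNN feature extractor: flatten and L2-normalize.
    A real system would use learned embeddings instead."""
    v = patch.astype(float).ravel()
    n = np.linalg.norm(v)
    return v / n if n else v

def search_by_region(query_img, box, collection, k=3):
    """Embed only the user-selected bounding box (x, y, w, h) of the
    query image and rank collection images by cosine similarity of a
    same-size central crop: a toy version of motif-based search."""
    x, y, w, h = box
    q = embed(query_img[y:y + h, x:x + w])
    scores = []
    for idx, img in enumerate(collection):
        cy = (img.shape[0] - h) // 2
        cx = (img.shape[1] - w) // 2
        scores.append((float(q @ embed(img[cy:cy + h, cx:cx + w])), idx))
    scores.sort(reverse=True)  # highest cosine similarity first
    return [idx for _, idx in scores[:k]]
```

The point of the sketch is that the comparison happens between image regions, not whole images, which is why a painting that looks entirely different overall can still rank highly if it contains a matching motif.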
One of the few examples of a ready-to-use tool that integrates CV for visual search on art historical image datasets is imgs.ai [63]. The web-based tool allows users to upload one or more images and perform a visual search on a selection of museum collection datasets. During their interaction with the tool, users can continuously refine their explorations or searches within a collection by using the results of one search as visual input for the next search ("re-search"). This is done primarily by selecting images from the tool's output of 'similar' images and marking them as either "positive," i.e., matching the user's search interest, or "negative," which prompts another search. The tool also enables cross-collection searches. The tool's functionality is based on feature extraction. It enables users to interactively select different embeddings, i.e., compressed semantic descriptors for images, and a distance metric as a similarity criterion when performing their search [63]. The operation of the ready-to-use web-based search tool does not require advanced technical knowledge.

Fig. 1. Screenshot of the imgs.ai interface after performing a search with three "positive" and three "negative" images as input, using the embedding "poses".
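The interaction loop of marking results as "positive" or "negative" and searching again over precomputed embeddings with a chosen distance metric can be approximated in a few lines. The sketch below is our own illustrative simplification, not imgs.ai's actual implementation; in particular, the Rocchio-style combination of positive and negative examples and the 0.5 weighting are assumptions for illustration.

```python
import numpy as np

def retrieve(index: np.ndarray, positives: list, negatives: list,
             metric: str = "cosine", k: int = 10) -> list:
    """Rank images by proximity to a query vector built from 'positive'
    examples and pushed away from 'negative' ones (a Rocchio-style
    relevance-feedback heuristic). `index` holds one embedding per row."""
    query = index[positives].mean(axis=0)
    if negatives:
        # Assumed heuristic: subtract half the negative centroid.
        query = query - 0.5 * index[negatives].mean(axis=0)
    if metric == "cosine":
        a = index / np.linalg.norm(index, axis=1, keepdims=True)
        scores = a @ (query / np.linalg.norm(query))
        order = np.argsort(-scores)        # higher score = more similar
    else:                                  # euclidean distance
        order = np.argsort(np.linalg.norm(index - query, axis=1))
    exclude = set(positives) | set(negatives)
    return [int(i) for i in order if i not in exclude][:k]
```

Each "re-search" simply calls `retrieve` again with the updated positive and negative sets, which is why the user can iteratively steer the results without ever touching the underlying model.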
We are not involved in the development of this tool but see imgs.ai as a typical example of a computational tool for humanities research. Therefore, we decided to use imgs.ai as part of our study during the software exploration (see Section 3.3).

Participants
Our research follows a qualitative approach and builds on expert interviews and a software exploration with art historians. We compiled a list of potential participants in preparation for the study. The prerequisites for inclusion on this list were that the candidates have at least one academic degree in art history and work in an art history-related profession. The initial selection was based on our professional networks. It was extended by researching art historians listed on institutional websites of universities, research institutes, museums, or other collecting institutions. We snowballed additional candidates during the first round of contacting participants. The final list included 35 potential study participants. All candidates were assigned a level of familiarity with computational approaches ranging from 'A - high' to 'C - low.' We contacted 17 potential participants from the initial list, of whom 12 accepted our invitation. We ensured that our final sample included all levels of familiarity with computational approaches. Thus, we contacted candidates successively, i.e., contacting 'replacement' candidates depending on their level of familiarity. This measure was taken to assure relative heterogeneity in our sample and to reduce a probable 'skepticism' or 'enthusiasm' bias towards computational approaches that might correlate with the level of familiarity. Three of the 12 participants were qualitatively grouped as 'A,' i.e., high familiarity with computational approaches, five as 'B,' i.e., medium familiarity, and four as 'C,' i.e., low familiarity. We adjusted the assigned levels of familiarity in two instances based on the participants' self-reported experiences with computational approaches and tools during the interviews. Seven of the 12 participants identified as women and five as men. All participants hold at least a university degree (M.A.) in art history, often combined with another humanities degree. All participants work in an art history-related profession. Four participants work as researchers or curators at a museum or institutional collection, and eight participants work at a university or research institute, ranging from doctoral researchers to professors (see Table 1). Eight of the participants were from Germany, the others from Switzerland.
Our university does not have an institutional or ethical review board. To compensate for this, we have developed internal guidelines that set ethical and legal standards for our research group. These guidelines include sending all potential participants a thorough information sheet upon first contacting them, which allows them to determine whether or not they are willing to participate in the study. They were also informed that they would not be compensated for participating in our study. Upon acceptance of participation and prior to the interviews, participants received another detailed information sheet that informed them about the background of the study, detailed the process and how the interview would be conducted, and explained how the data collected would be used and stored. The document, which is approved by our university's Chief Data Protection Officer, also informs them of their rights under the General Data Protection Regulation. The interviews were conducted remotely through the video conferencing software 'Cisco Webex,' which was also used to record the interviews' audio and video. Participants were informed of the circumstances of the recording and consented to it. Cisco Webex is approved by our university's Chief Data Protection Officer. The audio and video recordings and the transcriptions are stored on an internal password-protected server and will be deleted after the completion of the project.

Expert Interviews and Software Exploration
The semi-structured expert interviews and software exploration were conducted in March and April 2022. The interviews were each planned to last 35 minutes, and the software exploration another 25 minutes. The average duration of both parts was about 70 minutes. All interviews were conducted in German.
The first part, the semi-structured interview, focused on understanding our participants' research processes and work contexts. We were similarly interested in learning more about what roles the use of computational tools and collaboration play in their research. The interview script also included questions that encouraged the participants to share what potentials and affordances they associate with non-specified AI systems, especially systems that integrate CV. As laid out in Section 1, we propose critical reflection as a 'meaningful' realization of human-AI collaboration and hypothesize that tools need to be intentionally designed in such a way that they support critical reflection. To be able to test our proposal and hypothesis, we considered it essential not to prime our participants by directly addressing critical reflection. In an iterative process, we carefully crafted the interview script accordingly. We only encouraged participants to expand on topics relating to criticality and reflection when they brought them up themselves. The interview questions are provided in Appendix A.
When we contacted the participants, we informed them of the study procedure and that they would be invited to use the freely available CV-based search interface imgs.ai during the software exploration. For this purpose, we asked them to bring a digital image with which they could perform an image retrieval task that is relevant to their research or current work. We asked the participants to share their screen, which allowed us to record and follow their interaction with the tool. It also enabled us to insert questions or provide help when needed.
At the beginning of the software exploration, we only briefly introduced the tool and explained that the participants could perform image-based searches with it. The decision not to explain the tool's functionalities in greater detail before having the participants use it mimics how they would use such a tool as part of their everyday work. The use of search interfaces to web-based repositories, image catalogs, collection databases, or Google image search is not "introduced" and explained either, at least not beyond what the interface offers as guidance or through links to additional external information or documentation. Nonetheless, we ensured that all participants reached a sufficient understanding of the tool's functionalities and supported them when we noticed fundamental misconceptions about how the tool works and how they could operate it. Such additional support was only needed in two instances. Following the familiarization, we asked the participants to use the image they had brought to perform an image retrieval task relevant to their research or current work practice. We instructed the participants to verbally share their thoughts, observations, and questions that arose during the interaction, i.e., to think aloud [57] while using the tool. We also encouraged them to verbalize their search interest and their interpretation of the tool's output.

Thematic Analysis
We analyzed more than thirteen hours of audiovisual footage. Our analysis was guided by Braun and Clarke's 'reflexive thematic analysis' (TA) approach, which conceives of analysis as an active creation of themes by the researchers at the intersection of data, analytic process, and subjectivity [10,12]. Our choice of reflexive TA was motivated by the acknowledgement of subjectivity as a central feature and requirement within this analytic approach. Hence, it allows us to embrace our interpretative lenses, which are informed by our disciplinary backgrounds. Additionally, we deem the reflexive TA approach flexible enough to be applicable not only to our interview data but also to the data gathered from the software exploration. The emphasis that is put on the researchers' interpretation of the data when identifying patterns [10,12] encouraged us to consider the participants' interaction and behavior during the software exploration as another layer of information.
We transcribed all interviews, including the recordings of the software exploration. In addition to transcribing what the participants said during the think-aloud software exploration, we annotated the transcriptions with what the participants were doing, i.e., how they interacted with the tool, to contextualize the spoken word. After familiarizing themselves with the data, the first author initially coded all 12 transcripts, which served as a basis for several rounds of iterations on the coding. This phase included exchanges between both authors, during which we revised and reassigned our codes and shared our observations. Our inductive coding was grounded in the data, but we acknowledge that our theoretical assumptions and knowledge shape the analysis. Instead of seeking consensus on meaning, our exchanges and discussions gave room to subjective readings and interpretations of the data, as outlined by Braun and Clarke as one of the characteristics of a coding approach that is in line with reflexive TA [12].
We coded and analyzed the transcripts in two batches. We analyzed the interview data (first part) separately from the data gathered during the software exploration (second part). This allowed us to compare the two parts and relate them to the dimensions of our research question, i.e., our four sub-questions (see Section 3). Following the coding phase, we generated a set of candidate themes that we organized into a 'thematic map.' We used this map in the subsequent steps of our analysis and discussions to organize and restructure the themes we had developed based on the data. From our candidate themes, we selected those that we considered most significant for our research question. We carefully translated selected quotes from German to English for integration into this paper and intensively discussed gradual shifts in meaning that emerge from such translations.

FINDINGS
Through our analysis, we developed eight themes. In the following, we present the themes structured into four main sections corresponding to our sub-questions in Section 3 (see Table 2). The first section contains themes that relate to the potentials and affordances participants ascribe to human-AI collaboration. The second section lays out themes that relate to how these potentials and affordances surfaced during the participants' interaction with the CV-based image search tool imgs.ai. The themes in the third section summarize how participants conceptualize critical reflection in the context of human-AI collaboration. Lastly, the themes presented in the fourth section illustrate how critical reflection is realized by the participants during their interaction with a CV-based image retrieval tool. Due to the complexity and richness of our data, our analysis led to the creation of a larger number of themes. However, we only foreground the themes that directly relate to our sub-questions.

Potentials and Affordances Ascribed to Human-AI Collaboration
The participants were made aware of our research into human-AI collaboration through the information sheet we sent them in preparation for our study (see Section 3). We repeated this information at the beginning of the interview (see Appendix A). However, in line with our understanding of human-AI collaboration as an analytical lens (see Section 1), we did not prompt the participants to detail their understanding of what, in their opinion, constitutes 'human-AI collaboration.' Instead, we asked participants about their experiences with computational tools and approaches and what role such tools and approaches play in their research and art historical work. We also encouraged the participants to share their ideas and assumptions regarding non-specified AI, particularly CV, even when they did not report any experience in using such technology. In this regard, participants also speculated about how AI might affect art-historical research and work practice. In the following, we highlight two themes that relate to the potentials and affordances that the participants ascribed to human-AI collaboration.
4.1.1 Computational tools are not able to produce reliable (research) results automatically and should not be considered for this purpose. Participants were generally open to using computational tools and exploring computational research approaches. In the context of art history, they saw a particular potential in CV and were interested to see how this technology might influence their research or work with collections. By and large, they conceived of computational tools as something that should only support their epistemic process but not replace it. However, several participants feared that computational tools could be misinterpreted as being able to produce research results by themselves.
"It is important to realize that a tool cannot replace the research process, that it is just a tool -like a kind of notebook or database -that is used and cannot deliver the results.I don't think that's entirely clear, as trivial as it sounds, because there is a great hope that computational tools will not only support a research process but will themselves produce a research result." (P7) Many participants saw potential in delegating certain clearly defined tasks to a computational tool.However, they considered it a requirement that humans are included in the process.They mentioned, for example, that it is important that humans "check" the output of a CV tool because they do not expect the tool to perform reliably enough or be accurate enough in terms of what concepts and levels of 'similarity' are needed in art history and when working with original objects."I think you still have to do a cross-check since computer vision can't solve everything 100 %.With prints, there are sometimes minimal differences in the editions, so two images might only look like two identical prints at first glance.Then you have to take a closer look [for example, at the material]." (P11) Participants mostly conceived computational tools as an addition to their research and work practice.They considered it essential to combine computational approaches with "conventional, " i.e., non-computational methods and research practices.
"But maybe it would be enough to say, okay, you get to a certain level and then continue conventionally.For me, it is not about the question that we are replaced and fully automatically have all the answers.I don't want that at all.But maybe I just want to be stimulated to think further in another direction and conduct a preliminary investigation [using a computational tool] for that -or I am at a certain point and then want to use it.This combination is actually rather what I would associate with it." (P1) 4.1.2Computational tools and approaches have the potential to influence epistemology.Participants frequently pointed out that computational research approaches and tools, especially those that integrate 'AI, ' have an influence on epistemology, i.e., they influence what questions can be asked, what methodology and knowledge are considered, and how objects of studies are approached.This influence on epistemology was generally considered productive and welcomed.However, for it to be seen as productive and welcomed, it had to remain in line with the previous theme: computational research approaches and tools should not be considered as being able to produce research results by themselves.A welcomed influence of computational approaches and tools is, for example, that they enable a researcher to come up with new research questions.This, in turn, might lead to a reconsideration or change in the epistemic process.
"When we work with computational tools, I at least try to make sure that I'm aware that it brings up different questions.And that's also okay; that's also what makes it interesting." (P12) "And I believe that this is exactly the potential -whether it is computer vision or other AI-supported research: that one comes up with questions one had not thought of before." (P1) When talking specifically about CV, some participants saw the potential that the technology could contribute to counteracting canonical structures 10 .Canonical structures, for example, influence what material will be considered 'relevant' in certain research areas.
"From my point of view, [CV tools] could help in particular in terms of a similarity search.For example, they can provide access to and insights into other genres of images that one would not find in existing canonical catalogs -so, in a sense, they could support a 'visual science' [Bildwissenschaften] approach.In this way, they can also certainly bring forward new comparabilities and new research questions." (P6) "When I'm trying to question my own view -so, when I, for example, search for representations of factories and immediately think of Menzel, Meyerheim, Karl Blechen, and so forth; the canon -then I just try a search with some CV tool, even if I don't really know how it's trained, and then I just find something else." (P8) Participants generally valued the potential of CV to extend established text-based approaches to image retrieval.Text-based access to objects was, for one, problematized in terms of reproducing and manifesting established and canonical interpretations (see Section 2.3).For the other, participants valued the potential of CV to enable retrieval that is in line with thinking visually.
"[Computer vision holds the potential] that one develops a desire to discover arthistorical data and images differently and perhaps not always through text.We are so used to always entering search terms, or we have these thesauri and that helps us to generate structured queries.But people also think in images." (P4) Correspondingly, some participants conceived CV as potentially influencing their own evaluation of artworks.This was mentioned, for example, in terms of what objects they consider as being 'similar' (and, thus, as possibly relevant for later analysis).Text-based access to art-historical objects is confined to established orders of knowledge and canonical interpretations of artworks (see Section 2.3), which do not always correspond with an apparent visual similarity.Shifting to an image-based retrieval mode was valued as enabling searches based on the object's visual similarities.Participants appreciated that this enables them to include objects in their research that they would not have considered 'similar' based on a text-based search.One participant mentioned that CV, in this sense, would present them with "visual evidence" that they had not considered themselves: "You suddenly get visual evidence somehow.That's really nice sometimes, that you see at first glance: I'm only now becoming aware of this similarity.And that also includes surprising results!So that's where I see the great advantage of an image-based search, that you actually gain this surprising visual evidence." (P3)

Surfacing of Ascribed Potentials and Affordances during Interaction with CV Tool
We seek to understand how the ascribed potentials and affordances that we detailed in Section 4.1 surface during the interaction with a CV-based tool. For this purpose, we analyzed the data gathered from the software exploration and related it to our analysis of the interview data. In the following, we detail two themes that indicate misalignments between the potentials and affordances that participants ascribed to CV during the interviews and how they interacted with or verbally reacted to the tool. This includes, for example, instances of misaligned expectations. We interpret those misalignments as indicators that the information provided in the tool was insufficient or unsuited to manage expectations regarding the tool's potentials and affordances. Hence, we address divergences as indicators for needed improvements in future tool design (see Section 5.2).

4.2.1 CV tool is expected to produce reliable 'results' without further interaction. During the interviews, participants generally conceived of computational tools as unable to produce reliable (research) results by themselves. This conception coincided with the implication or explicit acknowledgment of the need to check, interpret, and revise tool output, i.e., to iteratively and critically interact with a tool and its output. Participants were also aware of the limitations of CV with regard to capturing their domain-specific concepts of similarity and levels of detail (see Section 4.1.1). However, during the software exploration, we observed that some participants expected the tool to present them with reliable 'results' for complex search interests without further interaction. In such cases, participants showed some reluctance to interact with the tool, for example, with regard to refining the parameters or input, and to interpreting how the tool output related to their input image.
We interpret such instances as overconfidence in the tool's abilities and expected level of precision. We acknowledge that these effects might have been exacerbated by the framing of the tool, which has to be factored in as a layer of information contained in the interface (see [29]). One participant, for example, had selected a 17th-century engraving of Hercules (a depiction of a very muscular man) and wanted to retrieve other depictions of muscular men, i.e., other depictions of Hercules or related mythical figures. The tool output, however, also included engravings depicting muscular women and mythical figures with androgynous bodies. To an untrained human eye, those images of muscular bodies could be considered 'similar' to the input image. In this case, the level of similarity in the tool's output was not aligned with the expected precision and specificity of the search interest.
"[. ..]Although I said that women are not my search interest, quite a few women appear here.That would annoy me now, of course." (P11) Another participant had uploaded a 16th-century print depicting cherubs (putti) drinking together.When the system returned images of drinking adults, they felt that the tool had misunderstood their input.
"Well, that means that it has not grasped that I'm actually looking for five small children dousing their noses.Instead, I am now presented with many gentlemen drinking alone and an elderly lady.So this is not very helpful for my search." (P5) One feature of imgs.ai is that it allows users to select images from the tool's output as "positive, " i.e., in line with their search interest, or as "negative," which triggers a new search.We ensured that all participants had familiarized themselves with this basic functionality during the initial exploration phase (see Section 3.3).Some participants valued the back-and-forth interaction with the tool's output, i.e., that they could refine the results based on what they considered 'similar' to their input image and also adapt parameters such as the embedding selected.However, in some instances, interacting with the tool in a 'collaborative' sense was perceived as a burden.
"There is a lot of noise among these results.[Interviewer suggests that P7 refines their search by entering those images as 'negative.'] Okay, well, then let's do that.Well, one is more or less spoiled by Google image search, because there, one doesn't need to tinker with the search query at all [. ..]. " (P7) 4.2.2Tool output is not considered purposeful when it does not align with established epistemologies.As summarized in Section 4.1.2,participants ascribed and valued the potential of computational tools and approaches to bring forth new questions and epistemic perspectives.However, we often observed divergences from this conception during the software exploration.Most participants approached the tool with a 'conventional' search interest that aligned with established epistemologies.Commonly, search interests corresponded to a clearly defined era and culture of origin that the participants were interested in.These served as clear-cut exclusion criteria when participants judged the tool's output.When the tool presented them with suggestions of visually similar images, many participants evaluated the suggestions solely based on whether or not they were in line with their exclusion criteria.

"Art-historical research is very specific [. . .].
There is either the cultural context or the historical context, or other relations.So a search has to relate to that in some way.The idea that an image is just somehow visually similar to another one that comes from a completely different context, that doesn't fit well into art-historical research".(P7) However, we also observed instances where participants started with a 'conventional' search interest that they then developed further during their interaction with the tool.Participants who had emphasized during their interviews that they valued the potential influence that computational approaches might have on epistemology tended to be more open toward this aspect during the software exploration.This was, for example, the case with P1.They had initially dismissed the tool's output as not valuable to their search interest, which, in this case, was to find images that depict the birth of the Virgin Mary.While interacting with the tool, however, their openness toward epistemological changes that they had expressed during the interview eventually surfaced during their interaction with the tool.
"So, the most interesting to me is the last embedding, because it brings me to new questions.The others are all conventional -iconographic or generic -results.[...] But here [after selecting the embedding 'raw'], I would immediately come up with new questions that I would pursue." (P1)

Conceptualizations of Critical Reflection in the Context of Human-AI Collaboration
As laid out in Section 3, we carefully crafted the interview script in such a way that it would not prime the participants on the aspect of critical reflection (see Appendix A). Nonetheless, in line with our proposal for 'meaningful' human-AI collaboration, participants regularly emphasized the importance of critical reflection. In the following, we highlight two themes relating to participants' conceptualizations of critical reflection.

4.3.1 Critical reflection is considered a foundation of research and is expected to be applied in human-AI collaboration. Participants generally considered critical reflection a foundation of research and related it to methodological scrutiny. In some instances, the importance of critical reflection in art-historical research contexts was motivated with reference to the "methodological problem" of machine learning:

"[The] machine, if I see it correctly, eventually spits out a result. You can then write about the programming of the whole thing or how this 'ingenious brain' was conceived, but understanding how it actually reaches a result would of course be the interesting thing about it." (P9)

4.3.2 Participants' expectation to apply critical reflection is directed at themselves. As summarized in the previous theme, participants emphasized the importance of critical reflection and expected it also to be realized in human-AI collaboration. This expectation is directed at themselves and fellow art historians or humanities scholars. They consider it their own responsibility to acquire the skills needed to realize a critically-reflective usage of computational tools or data.
"We always need a critical mind, not only as scholars.[...] Just because a tool spits out a result, you still have to think along and ask yourself: does this make sense?I need to have a critical awareness -despite the very high accuracy that these tools offer -I have to remain vigilant.I would consider that to be an important competency.[...] Of course, you must first be trained a bit in order to master this." (P11) Participants consider it important to teach students how to work with data critically, which could build on existing competencies in the humanities regarding the handling of sources and information.They emphasize that it should be part of art-historical curricula to teach students how to handle data, understand their origin, and what research questions could (and could not) be answered based on them.However, in regards to reflecting on methods, i.e., reflecting on how a CV tool produces a result, P7 states that this is not as easy to achieve: "This is, of course, more difficult, because these computer vision systems are a black box, or they somehow present themselves as such." (P7) Only one participant (P6) explicitly pointed out that the realization of critical reflection in human-AI collaboration should not be delegated to art historians without prerequisites: "On the one hand, this is a competency that we [art historians] have to acquire.On the other hand, I expect that all software manufacturers meet the requirements of transparency, so to speak.What are the data sources?How does it work?"(P6)

Realization of Critical Reflection during Interaction with CV Tool
During the interviews, participants consistently emphasized the importance of critical reflection when working with computational tools. This importance was also mirrored during the software exploration, where participants showed eagerness to realize such a critical-reflective approach. In many instances, participants were able to critically reflect on the mediality of the digital reproductions, for example, with regard to how the digital reproduction of objects influenced the visibility of material details. Participants also applied their competence regarding source criticism. This competence was realized by accessing detailed information about an image by clicking on the link in the tool's context menu, which would open the object's record entry in the institutional collection database in a new browser tab.
However, as we detail in the following, we regularly observed divergences from the aspiration to interact with a tool in a critical-reflective manner. These divergences did not occur due to a lack of eagerness or capability on the side of our participants. Instead, we interpret this as support for our hypothesis that computational tools need to be intentionally designed in such a way that they actively scaffold and support critical reflection.

4.4.1 Source criticism and reflection on methods were not fully realized during the interaction.
Participants commonly applied their competencies in "source criticism" by scrutinizing the provided data and the source (see above, Section 4.4). Although some were able to extend their competencies to the intentional selection of a collection on which to perform a search, most participants started their search with the pre-selected collection. In such cases, the participants did not consider whether or not the default collection was likely to hold objects that would correspond to their specific search interest, which sometimes led to initial negative evaluations of the usefulness of the tool. The lack of consideration of the selected collection could be easily overcome when participants were made aware of their unintentional adherence to the pre-selection and encouraged to consider searching in another collection that better matched their search interest.
Participants were likewise eager to reflect on the underlying methods of the CV tool. However, they frequently and consistently mentioned that the lack of transparency and explanations made it hard for them to scrutinize the tool's output, which also limited their ability to interact with the tool in a 'meaningful' manner.
"I do not have the feeling that I [. ..] understand the effects that the parameters have on the result.That is not intuitively recognizable to me, and I would like to know that before I operate the tool.So to me, this is just like operating a bread-cutting machine without knowing where I can adjust it to cut wider or narrower.This is not being explained and that's why I say, okay this is some kind of image recognition on some level.But [. ..] this is not really understandable or verifiable and I would need an explanation of how [the embeddings] work." (P6) In light of the lack of explanations, some participants fell back on their competency as art historians and applied a hermeneutic approach to interpretation.That means that they approached the tool's output in an interpretative manner and tried to understand the underlying principles of similarity encoded in the different embeddings by way of visual analysis and comparison of the outputs.
"[After changing the embedding] it shows me less drawings.The shape is again more strongly represented, the round shape.[. ..]There are many prints among the results.Images that I would spontaneously classify in the direction of woodcut, copperplate engraving, which are generally techniques that require a very controlled way of working, which of course influences the character of a work." (P12) However, in several cases, the tool's lack of transparency negatively influenced the participants' willingness to engage with the output.In such cases, participants dismissed the tool's output as illogical or as 'random' without trying to interpret what kind of 'similarity' the images in the search result had in common.This was the case with P2: they used a painting depicting fish as input.The tool's output also included images of objects relating to the ocean or other marine species.P2 was not able to make sense of why the search results were not limited to images of fish: "So, I don't really understand why it displays these images as results, because they are very different.So we now have some images with fish or fish-like objects.But there is no logic behind it." (P2) 4.4.2Participants expect tool to enable critical reflection during interaction.During the interviews, participants directed their expectation regarding the realization of critical reflection in the context of human-AI collaboration at themselves; they formulated it as an expectation that art historians and other scholars should strive after, for example, by acquiring computational knowledge or data literacy.By contrast, during the software exploration, they formulated this expectation predominantly as something that should be enabled by the tool during interaction.
The tool's interface includes a short descriptive text that links to GitHub repositories and documentation on some of the libraries or models implemented (see Figure 1). Three of the participants (P6, P8, P10) clicked on these links and accessed the documentation and secondary sources of information. However, even those participants did so only briefly, i.e., they opened the links in new tabs and returned to interacting with the tool after a short glance at the documentation. Even though only three participants accessed these secondary sources of information, most participants were looking for more details on the technical parameters of the tool and assumed that a more refined version of the tool would include such information.
"This 'distance' is unclear to me.And what this embedding with the 'vgg' is about is also unclear to me.But it will be explained in the final version, I suppose.[...] I imagine that there will be an 'i' button for info and I will be able to access explanations about what the settings mean and what influence they have." (P5) As pointed out in Section 1, we are skeptical about the effectiveness of linking to technical documentation and the appropriateness of such information for nontechnical users, which we see confirmed by our observations during the software exploration.In several instances, not being able to understand the tool's parameters resulted in the participants assuming that they were not supposed to alter them, let alone critically reflect on what effect they might have.This highlights the need to improve the design of tools so that they enable critical reflection during usage and in a user-appropriate manner.

5 DISCUSSION
Our research is guided by the question of how we can enable 'meaningful' human-AI collaboration. We explore this question by foregrounding a genuinely human capability, critical reflection, which we situate in the context of humanities research practice, specifically art historical image retrieval. We conducted a qualitative interview study and software exploration with 12 art historians. Through our study, we developed a better understanding of the participants' ideas and assumptions regarding human-AI collaboration and associated concepts of critical reflection. By extending the expert interviews with a "think-aloud" software exploration, we were able to relate the findings from the interviews to how the participants interacted with a CV tool. Based on a reflexive thematic analysis (TA) of the data, we derived themes (see Table 2) that inform our proposal of design implications for critical-reflective human-AI collaboration. In the following, we highlight our key findings before we provide the design implications.

5.1 Key Findings
All interviewees conceived of computational tools as a valuable addition to their research practice. Participants clearly displayed conceptions of computational tools that evoke instances of collaboration rather than automation, which echoes existing research into human-AI collaboration in qualitative research contexts (see [8,18,51]). Interviewees ascribed to AI the potential to influence epistemology in art history and valued this potential. This included the premise that CV, in particular, might help to overcome canonical structures. Participants also indicated that CV could broaden scholarly access to objects by not solely relying on textual encodings of 'expert' knowledge during image retrieval. Such stances parallel critical-reflective engagements with the hegemonic effects of text-based information systems (see [36,65,81]).
Participants brought up and elaborated on the importance of critical reflection throughout the interviews and considered it a foundation of (humanities) scholarship. Source criticism was, for example, referred to as a central research competency. Participants emphasized that this competency also needs to be applied when dealing with digitized sources, for example, when working with digital images and other art historical data. This aspect is already acknowledged within the CSCW and HCI community through calls for a more reflexive data practice (see [67]) or, more generally, a reflexive documentation practice (see [59]). Additionally, participants stressed that they consider it crucial that the output of computational tools, especially when integrating AI, can be scrutinized, which evokes instances of reflection on methods. Such scrutinization is especially acknowledged in research on human-centered explanations that help people understand the results of AI methods (see, e.g., [20,34]). In the context of working with data or using computational tools, participants generally expected that they themselves, and their discipline as a whole, should realize a critical-reflective practice. However, participants shared their concern that computational approaches and tools might be, or are, perceived as producing reliable, 'objective,' and quantifiable truths, which makes it harder to question, scrutinize, or critically reflect on them. This concern is mirrored by research showing that people often over-rely on results provided by AI (e.g., [15]).
Through our reflexive thematic analysis, we identified divergences between the participants' aforementioned conceptions and how they interacted with and reacted to a ready-to-use CV tool during the think-aloud software exploration. Contrary to what they posited during the interviews, participants often expected the tool to present reliable 'results' for complex search interests more or less automatically. During the interviews, participants valued the potential of computational tools to allow them to explore new epistemologies. During the software exploration, we observed that participants approached the CV tool with search interests that were very much in line with established art-historical epistemologies. In some cases, participants were able to reflect upon this during their interaction with the tool and subsequently realigned their expectations and evaluation of the tool's output. These findings indicate that tools need to guide users in translating theoretically formed conceptions into practices of use.
Participants were generally eager to reflect critically on how the tool's output came about. However, the tool's lack of transparency and interpretability was frequently criticized as hindering them from realizing their own expectations in this regard. That this requirement was not met was interpreted as a sign that the tool is a 'prototype.' At the same time, incomprehension sometimes prompted participants to engage in interpretation (see [5]). In the absence of explanations regarding the underlying algorithms and the different concepts of 'similarity' encoded in the embeddings, some participants fell back on their competency in hermeneutic interpretation when trying to make sense of or derive meaning from the tool's output. The option to test different embeddings and observe their effect on the tool's output encouraged some participants to reflect not only on the encoded concepts of similarity but also on the variety of notions of similarity ingrained in different art-historical approaches to analysis and interpretation. Such attempts at making sense of the algorithms and relating them to their own mental models and understandings of different concepts of similarity evoke the notion of the "algorithmic imaginary," as put forward by Bucher, i.e., "the way in which people imagine, perceive and experience algorithms and what these imaginations make possible" [16].
Our findings confirm that art historians are eager to apply their competencies of criticality and reflection to the use of computational tools. However, we observed that they were not able to fully realize this eagerness and aspiration while interacting with a CV tool. We interpret such divergences as supporting our hypothesis that computational tools need to be intentionally designed in such a way that they actively enable and support critical reflection. Only then can humanities researchers such as art historians adhere to their own expectations of scrutiny and interact with computational tools in a 'meaningful' manner.

5.2 Design Implications for Critical-Reflective Human-AI Collaboration
Our hypothesis resonates with Slovak et al.'s understanding of reflection as a process that needs to be scaffolded: we cannot assume that "the ability to reflect is a trait that can be readily triggered by providing the relevant information" [71]. Hence, Slovak et al. suggest "scaffolding reflection within experience" [71]. Our findings echo this suggestion: participants did not formulate their expectation that a tool should support critical reflection during interaction with it until they engaged in a real-world image retrieval task with a ready-to-use CV tool. Taking this into account, we suggest four design implications for critical-reflective human-AI collaboration.

5.2.1 Supporting reflection on the basis of transparency. It is a recurring premise that art historians do not collaborate with colleagues and other experts [14] and consider art historical research a "solitary endeavour" [82]. Our findings could not confirm this assessment. Interviewees highly valued collaboration with colleagues from disciplines other than art history and the integration of multiple perspectives. Hence, the approach of taking inspiration from human-human collaboration when designing human-AI collaboration (see [80]) seems applicable in this case.
Our interviews showed that art historians are able to reflect upon what expertise is needed for a given task at hand and to initiate collaboration with colleagues and other disciplines accordingly. In cases where opinions conflicted, they could factor in their collaborators' or colleagues' arguments and positions and either dismiss a suggestion or include it in their decision-making. The basis for such critical-reflective engagement with suggestions that ran contrary to their own assessments or findings was that they could understand the reasoning behind the suggestion and from which area of expertise it stemmed. This insight into human-human collaboration suggests that it is by no means a requirement that a computational tool only return output that is aligned with the users' perspective, expectations, or established epistemologies. For such misalignments to be productive, however, the user needs to be enabled to engage with the tool's output purposefully and intentionally. We observed that participants tended to expect high levels of accuracy from the CV-based tool, which sometimes materialized in over-reliance on the tool's output. We suggest addressing this known effect in human-AI collaboration (see, e.g., [15]) in a way that encourages users to scrutinize the output of a tool as a 'suggestion' that they can contradict and potentially refuse.
Being able to understand how computational tools function is considered a prerequisite for their 'meaningful' integration into humanities research [55]. However, we should not shift the responsibility of acquiring the needed skills to the users of computational tools. Our findings confirm that in order to enable 'meaningful' human-AI collaboration, we need to enable critical reflection during the users' interaction with a tool (see Section 4.4.2). Based on these findings, we suggest approaching the design of human-AI collaboration with a strong emphasis on 'supporting reflection on the basis of transparency.' Transparency in this context refers not only to the clear presentation of the functional scope of a tool (see, e.g., [3]) but also extends to making it obvious to the users that their active scrutinization and critical reflection constitute a central and significant element of interaction. As our findings show, puzzlement about the tool's results caused some participants to engage in hermeneutic interpretation when attempting to derive meaning from the tool's results, which is in line with the suggested effects of microboundaries, i.e., small obstacles [22], and breakdowns as intentional violations of the users' expectations [5]. However, we also observed that perplexity could negatively influence the participants' willingness to engage in critical reflection and interpretation. Hence, we suggest supporting reflection by transparently framing the interaction with a computational tool as something that will include instances of perplexity while also encouraging users to tap into the productivity of their own algorithmic imaginary.

5.2.2 Foregrounding epistemic presumptions.
Assumptions about a computational tool are formed prior to interacting with it and are also influenced by public perceptions and assertions about technology, for example, about the capabilities of AI-based systems. As mentioned above, our study emphasizes the importance of clearly communicating a tool's functional scope and limitations, which resonates with Amershi et al.'s general guidelines for designing human-AI interaction [3]. However, this would not be sufficient in research application contexts since it does not address the scope and limitations of a tool in regard to underlying epistemic presumptions. Art history, just like other disciplines, incorporates a variety of research approaches, methods, epistemic presumptions, and associated positional differences. However, our study participants tended to foreground the thematic scope of their research and practice without expanding on the epistemic presumptions and methodological approaches this entails and refers to. Even though participants did not explicitly communicate an epistemic self-positioning, such a positioning still shaped and influenced their interactions with the CV tool. This implies that critical-reflective human-AI collaboration also requires a component of self-awareness or self-reflection (see [61]). Self-reflection, in this case, would be aimed at reducing a misalignment of epistemologies, which could have a negative effect on the perceived usefulness of a computational tool and inhibit 'meaningful' human-AI collaboration.
Hence, we derive the implication that computational tools should not only adhere to the principles of transparency laid out above but also encourage users to make their own epistemic presumptions explicit (see [1,31]). As our findings confirm, humanities researchers and art historians value the potential of computational tools to influence epistemology and knowledge structures (see also Section 2.3). However, this effect creates friction during usage when it coincides with a lack of awareness about the conflicting epistemologies. Consequently, critical-reflective human-AI collaboration hinges on the users' awareness regarding the purpose of their tool usage and the epistemic presumptions with which they approach a tool. We envision that this design implication could, for example, be operationalized by extending the tool with an "onboarding" interaction sequence that would guide the user through self-reflection regarding their epistemic presumptions and the intended purpose of their interaction with the tool. This self-positioning would be mirrored with relevant information about the tool that also makes transparent that the tool's output might not match the users' initial expectations, i.e., explicitly encouraging critical reflection and scrutinization (see also Section 5.2.1).
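One minimal way to prototype such an onboarding sequence is to pair each self-reflection prompt with a matching disclosure about the tool. The sketch below is a hypothetical illustration under our own assumptions; the prompts, keys, and disclosure texts are invented for demonstration and do not describe any existing tool's interface:

```python
# Hypothetical onboarding sketch: before the first search, the tool asks
# users to position themselves, then mirrors each answer back with
# matching information about the tool's assumptions and limits.

ONBOARDING_PROMPTS = {
    "purpose": "What do you want to use this search for "
               "(e.g., finding comparanda, exploring a motif)?",
    "epistemic_stance": "Which notion of 'similarity' matters to you here "
                        "(e.g., motif, composition, technique)?",
    "expectation": "What would count as a useful result for you?",
}

TOOL_DISCLOSURES = {
    "epistemic_stance": "The tool encodes similarity via learned image "
                        "embeddings; results may not match any single "
                        "art-historical notion of similarity.",
    "expectation": "Treat results as suggestions to scrutinize, "
                   "not as authoritative answers.",
}

def onboarding(answers):
    """Pair each self-reflection answer with the matching disclosure
    (None where the tool has nothing specific to mirror back)."""
    return [(key, answers.get(key, ""), TOOL_DISCLOSURES.get(key))
            for key in ONBOARDING_PROMPTS]

for key, answer, disclosure in onboarding({"purpose": "finding comparanda"}):
    print(key, "->", disclosure)
```

The design choice here is that self-positioning and tool disclosure are stored as parallel structures, so each reflective prompt is answered by the system as well as by the user.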

5.2.3 Emphasizing the situatedness of data. When applying computational tools to highly selected and canonized data, as is the case in art history, it is essential to emphasize the historicity and situatedness of the data and information and to enable a critical-reflective perspective on both. D'Ignazio and Klein [31] highlight that the mechanisms of data production, which include social, cultural, historical, and material conditions, need to be disclosed (see Section 2.3). Our study echoes this claim. As one participant put it: "How were the data collected, and what were the exclusion criteria? Was it even reflected that possibly only one-sided data could be fed in? So, what was available, and what was in the public domain? These are all aspects for which I would like to see appropriate information as a disclaimer preceding the use." (P5)

Several of our interviewees associated 'objectiveness,' 'completeness,' and 'reliability' with data-based and computational approaches. This association also influences the perception of a tool. In order to counteract this, we suggest placing a much bigger focus on making explicit what kind of data, i.e., which collections, can be accessed through a tool. Purposefully selecting a collection that matches a given search interest presupposes that the users have contextual knowledge about a collection's implicit and explicit biases, i.e., the focus, historical development, and canon ingrained in it. However, this contextual knowledge might not exist when the user is confronted with an unknown collection or an aggregated dataset. Emphasizing the situatedness of data could build on approaches such as Gebru et al.'s suggestion to include "Datasheets for Datasets" [42] or take inspiration from Holland et al.'s "Dataset Nutrition Label" [44]. Instead of having the users start with a pre-selected collection as default, the selection could be designed as a required interaction, enabled by the included contextual information about the underlying data.
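To make concrete what such contextual information could look like in a tool, the following sketch models a minimal collection datasheet as a data structure. The field names and example values are our own illustrative assumptions, loosely inspired by Gebru et al.'s "Datasheets for Datasets" [42]; they are not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class CollectionDatasheet:
    """Minimal, hypothetical datasheet record a tool could show before
    the user selects a collection. Field names are illustrative only."""
    name: str
    provenance: str          # who digitized/collected the images, and when
    selection_criteria: str  # what was included, and what was excluded
    known_gaps: list = field(default_factory=list)  # canon/bias notes
    license_note: str = "unspecified"

sheet = CollectionDatasheet(
    name="Example print collection",
    provenance="Digitized from a single museum's public-domain holdings",
    selection_criteria="Only works already catalogued and in the public domain",
    known_gaps=["Overrepresents Western European prints",
                "Few works by women artists catalogued"],
)

# A tool could require the user to acknowledge this record before searching,
# turning the situatedness of the data into an explicit interaction step.
print(sheet.name, "-", len(sheet.known_gaps), "documented gaps")
```

Making collection selection a required, datasheet-mediated step operationalizes the design implication above: the user cannot search without first encountering the data's provenance and gaps.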

5.2.4 Strengthening interpretability through contextualized explanations. The tool's lack of interpretability prevented participants from fully realizing their aspiration to reflect on methods: they could not reflect on how the tool's output came about and how the output related to their search. The question that arises from this is, first, how we can enable the interpretability of a CV tool, considering that we envision stakeholders with little to no technical understanding of CV. Second, we need to factor in that domain experts interpret AI technologies within and against a diverse backdrop of existing themes and practices [9]. Current research in the area of explainable AI or human-centered explainable AI (HCXAI) (see [32]) addresses the question of how AI-based applications can be made interpretable, which is particularly challenging when considering 'nontechnical' users. As pointed out by Ehsan et al. [32], HCXAI needs to consider the explanation needs of the user, the purpose of explanations, and where and when the explanations are provided. As Benjamin et al. have shown, stakeholders with no expertise in AI approach explanations with their own situated sensemaking [9]. Benjamin et al. suggest, amongst other things, offering "contextual cues" that enable stakeholders to combine or contrast a given explanation with their own explanation strategies [9].
Based on this related work and our findings, we conclude that critical-reflective human-AI collaboration requires 'contextualized explanations.' Such explanations would consider domain-specific practices of interpretation and sensemaking instead of focusing purely on a technological understanding of algorithmic explanation methods. As discussed above, some participants applied a hermeneutic approach to interpreting the tool's output: they tested and observed the effect of different embeddings, i.e., different 'concepts' of similarity encoded in the algorithms, and tried to deduce which visual features or visual information of the input image were picked up in the tool's output. Our suggestion of 'contextualized explanations' aims to support and encourage such an interpretative and 'hermeneutic' approach. We suggest addressing interpretability by designing contextualized explanations that relate the effects of algorithms, machine learning models, and data on a tool's output to the user's situated sensemaking. Contextualized explanations will require a thoughtful mapping of technical concepts into the stakeholders' vocabulary.

6 LIMITATIONS
Due to the specific focus of our research and our sample size, we do not claim that our results can be generalized to other, completely different contexts. However, since critical reflection constitutes a core aspect of all scientific and scholarly work, we do assume that our findings can inform the design of human-AI collaboration in other contexts and disciplines as well. Our findings' core aspects align with existing research into human-AI collaboration in qualitative contexts (e.g., [3,18,37,51]), showing us that there is a reasonable overlap despite the differences in contexts.
We are aware of the limitations that relate to potential biases arising from our study participant selection. Our study relied on voluntary participation, which might induce bias insofar as potential participants would only agree to be interviewed if they had at least some degree of interest in and openness toward the topic of human-AI collaboration. However, of the initial 17 potential participants we contacted, only five declined participation, each with reference to their current workload. At the same time, compensating participants could likewise engender positive feelings that color feedback [64]. Additionally, in technology-based studies, "novelty bias" could influence the study outcome when participants are excited to try something new [60]. Hence, we made sure that our sample included participants with different levels of familiarity with computational approaches (see Section 3 and Table 1).
As pointed out in Section 3.1, there exist only a few implementations of ready-to-use CV tools specifically intended for nontechnical users that give access to art historically relevant image collections. Accordingly, development is still in an exploratory and prototypical phase, which also means that existing tools and interfaces might not be as well refined as other long-term infrastructural information retrieval systems, for example, in terms of usability. This is also the case for the tool we used in the software exploration, which we acknowledge as a limitation of our study.
Despite these limitations, our work opens promising directions for future research.

7 CONCLUSIONS AND FUTURE WORK
With the work presented in this paper, we suggest the realization of 'meaningful' human-AI collaboration by integrating it with the concept of 'designing for reflection.' We conceive criticality and reflection as genuinely human competencies that are central to scholarly knowledge production. We explore the question of how critical reflection can be enabled in human-AI collaboration with a use case situated in humanities research. Specifically, our use case focuses on computer vision (CV)-based tools for art historical image retrieval. We conducted a qualitative interview study and think-aloud software exploration with 12 art historians. Our findings confirm our hypothesis that critical reflection needs to be actively supported and scaffolded during interaction with a computational tool. Based on the insights derived from our study, we suggest four design implications for enabling critical-reflective human-AI collaboration: supporting reflection on the basis of transparency, foregrounding epistemic presumptions, emphasizing the situatedness of data, and strengthening interpretability through contextualized explanations.
Technological advancements in AI are increasingly becoming available to 'nontechnical' users and domain experts through ready-to-use tools, which also affect and mediate their understanding of the world. Verbeek emphasizes that "[u]sers, designers, and policymakers should be enabled to read, design, and implement technological mediations, in order to be able [to] deal in a critical, creative, and productive way with powers that remain hidden otherwise" [78, p. 31]. We contribute to such efforts by highlighting that computational tools, i.e., AI, challenge critical reflection. Acknowledging this challenge is not only important in the humanities but can be extended to other contexts. Awareness of how the use of computational tools affects, shapes, and changes epistemic assumptions and research processes is necessary across all forms of knowledge production. Hence, our work has relevance for current discourses concerned with the question of how we can integrate AI in research and education in a 'meaningful' way. Such discourses are informed by work on ethical and responsible realizations of AI. In this regard, transparency and interpretability are being discussed as core principles that need to be accounted for. Our empirically grounded design implications mirror the importance of these aspects. However, just like tool development, the challenge of enabling transparency and interpretability tends to be addressed from an engineering perspective that prioritizes metrics like usability and efficiency. In contrast, our research suggests approaching tool design in a highly contextualized manner that builds on a nuanced understanding of what constitutes 'meaningful' human-AI collaboration in the designated area of application. The successful implementation of such an approach depends on a close interweaving of theory, empirics, and technology development. Hence, we do not claim that our results are generalizable, but we do believe that our research is a valuable contribution to the question of how we can approach the realization of 'meaningful', i.e., critical-reflective, human-AI collaboration, especially in scholarly and educational contexts.
Our future work aims at operationalizing our design implications. We are currently conducting a follow-up study to translate our findings into concrete "explanation needs." This future work relates to one of our design implications: we will probe how we can bridge the users' explanation needs with the technical capabilities of explanation methods for computer vision in order to provide "contextualized explanations." We will further explore this topic through participatory design workshops with domain experts from the humanities and HCI.
relevant to your research or in the context of your current work? It would be great if you have it readily available on your computer, for example, on your desktop. You will use it later during the software exploration.

Consent
You have already given your written consent to participate in this study and also agreed that I may record our interview. I will use the "record" function of Webex for this purpose. The recordings will be stored locally on my university computer and transferred to a password-protected university-owned server. Only I and my colleagues will have access to the recording for the analysis of the interviews. Would you be so kind as to confirm your consent?

Interview Script
• Can you be more specific about the concrete research tasks or content-oriented work activities for which you collaborate with others?
• (If answer remains on a meta-level or only refers to administrative/managerial tasks) I would like to encourage you to think more in the direction of research-related or scholarly tasks and content-oriented work activities: How is the collaboration structured when you engage in work that is "art historical" in nature? Could you, for example, describe a concrete workflow?
(4) What role do computational tools or computational [research] approaches (for example, approaches that build on "AI") play for your research and content-oriented work?
• (If unspecific/unclear) Which computational tools/approaches do you use regularly in your research or for content-oriented work? For which specific tasks do you use them?
• (If none) Can you tell me your reasons for not using computational tools?
(5) Image-based search engines or other applications that integrate so-called "computer vision" or "machine vision" are currently being explored in the context of art history. What is your experience with this technology so far?
• (If only "commercial" search engines like Google (reverse) image search are mentioned) Do you also know image-based search engines that are specifically deployed for art historical research contexts?
(6) Apart from your own experience with this technology: how do you assess the use of "computer vision" or other "AI"-based applications in art historical contexts in general?
• (If too unspecific) How do you assess, for example, the potentials and/or challenges associated with computer vision and/or "AI"? Could you expand on this and deepen what you have just said in this respect?
(7) You have already (implicitly) talked about the (general) potentials and challenges that you associate with computational tools and approaches: what aspects are particularly important to you from your perspective as an art historian?
• (If no reference to specific competencies is made) Which competencies are most important for you in connection with the use of computational tools? Put differently: from your point of view, which competencies should each and every art historian bring to the table when using computational tools in their research and art historical work?
(8) When you think again about the computational tools or approaches that you have talked about so far: What do these tools/approaches not yet accomplish, but which would be important for your research or art historical work?
(9) Do you wish to add anything before we proceed with the software exploration?

Software Exploration
Familiarization and exploration of the tool. The goal of the software exploration is to better understand how you go about using a computational tool. In particular, we are interested to hear your thoughts, opinions and impressions while using the tool imgs.ai, a web-based search tool that integrates computer vision. It was developed by a group of researchers at FAU Nürnberg [participants are again reminded that this tool is not being developed by the authors]. The tool allows you to conduct image-based searches in collections. I'll send you the link to the tool in the chat now. I would ask you to share your screen with me so that I can follow you during your interaction with the tool. Please also "think aloud" while using the tool and simply describe your spontaneous impressions and associations or what questions or thoughts are going through your mind. It's best to imagine that I'm not seeing what you're doing at all and have to rely on your verbalization of what you're doing or what's going through your mind. Please keep in mind that I do not evaluate you or how "well" you interact with the tool: I am interested to hear more about your impressions, thoughts and opinions that come up while you use the tool.
[Let them explore the functionalities of the tool for about 3-4 minutes; advise/help in case the participant is not able to operate the tool without further instruction.]

Image retrieval task.
• (If participants do not automatically proceed with their image retrieval task, encourage them to use the "upload" function of the tool)
• How would you go about it?
(2) Thank you very much. We have now reached the end of the interview. Is there anything else we haven't touched on so far that you would like to mention?

Table 2. Overview of themes organized in four sections.
(1) To begin, I would like to ask you to briefly describe the field in which you work as an art historian.
(2) What is your current thematic focus? For example, what is your current research topic or work interest?
(3) When you think about a typical research project or another kind of art historical [content-oriented] work activity of yours: what role does collaborating with other people play in this?
• Would you care to explain to me what your search interest is?
• Why have you chosen this image?
• What kind of images do you wish/expect to retrieve?
• What would be a "relevant" result?
• To what research question or content-oriented work does this image/your search relate?
[Let them refine search results for about 3-4 min]

End of the exploration

Outro questions
(1) Can you imagine, in principle, integrating such a tool into your research / using such a tool in your art historical work?
• If yes: for which tasks?