GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality

Voice assistants (VAs) like Siri and Alexa are transforming human-computer interaction; however, they lack awareness of users' spatiotemporal context, resulting in limited performance and unnatural dialogue. We introduce GazePointAR, a fully-functional context-aware VA for wearable augmented reality that leverages eye gaze, pointing gestures, and conversation history to disambiguate speech queries. With GazePointAR, users can ask "what's over there?" or "how do I solve this math problem?" simply by looking and/or pointing. We evaluated GazePointAR in a three-part lab study (N=12): (1) comparing GazePointAR to two commercial systems; (2) examining GazePointAR's pronoun disambiguation across three tasks; and (3) an open-ended phase where participants could suggest and try their own context-sensitive queries. Participants appreciated the naturalness and human-like nature of pronoun-driven queries, although sometimes pronoun use was counter-intuitive. We then iterated on GazePointAR and conducted a first-person diary study examining how GazePointAR performs in-the-wild. We conclude by enumerating limitations and design considerations for future context-aware VAs.


INTRODUCTION
Voice assistants (VAs) are transforming human-computer interaction. In a recent study of 2,000+ people [65], 72% of respondents indicated that they use VAs for tasks such as playing music, setting timers, controlling IoT devices, and managing shopping lists [6,8,78]. While widespread and useful, state-of-the-art VAs like Amazon Alexa, Google Assistant, and Apple Siri do not yet consider a user's spatiotemporal context, which can result in unnatural dialogue or unanswerable queries [6]. For example, the query "What is that?" requires the VA to understand what "that" refers to, a problem known as pronoun disambiguation [18]. Despite their prominence in human speech [23], pronouns are not well supported by current VAs.
To resolve pronoun ambiguity, humans employ a variety of contextual clues, including eye gaze, pointing, and conversation history [23]. For example, a person may physically gesture at an item in a store and ask "How much is this?" While straightforward for a human to resolve, current VAs are unable to answer this query precisely because they lack spatiotemporal context. Pronoun disambiguation and multimodal input have a rich history of research in HCI [71,84], perhaps best marked by Bolt's visionary "Put That There" system in 1980 [9] and beyond [17,44,92]. With recent advances in machine learning, speech recognition, and large language models (LLMs), new approaches are now possible. For example, emerging context-aware VA prototypes such as WorldGaze [60], Nimble [83], and TouchVA [47] examine how to use head gaze, pointing, and touch to resolve ambiguous queries. While promising and informative to our own work, these prototypes share similar limitations: they use Wizard-of-Oz (WoZ) setups [19], are accompanied by tightly-controlled lab studies vs. open-ended queries, employ only one additional modality alongside speech, and are designed for smartphones rather than always-available head-worn displays.
In this paper, we introduce GazePointAR, a context-aware VA for wearable augmented reality (AR), which uses eye gaze, pointing gestures, and conversation history to support pronoun disambiguation. If a user's spoken query contains a pronoun, we process the user's field-of-view using real-time computer vision (CV), automatically extract objects and written text in the scene, and generate a new coherent query phrase that is sent to OpenAI's GPT-3 [70] for processing. The response is then read aloud using speech synthesis. Pronouns are replaced using an empirically-tuned heuristic model that incorporates CV results based on gaze and pointing. For example, when asking "How much is this?" while looking at a bottle of mango juice (Figure 2), GazePointAR extracts information such as object type, brand name, and flavor name to generate "How much is a bottle with text that says Naked Mighty Mango 290 Calories?".
To evaluate GazePointAR and explore the potential of context-aware VAs in wearable AR, we conducted two studies. First, we performed a three-part qualitative laboratory study with 12 participants to compare GazePointAR to two state-of-the-art query systems (i.e., Google Voice Assistant and Google Lens) (Part 1) and examine GazePointAR's usability and performance across various scenarios (Parts 2 & 3). For example, participants searched for the price difference between two salt boxes (e.g., "Can you compare the price between these two?"). In Part 3, we invited participants to brainstorm and try their own queries to further assess how context-aware VAs may be used in the future and how well GazePointAR currently supports such uses. Participants primarily used gaze to ask a diverse range of queries, from retrieving object information to foreign language translation, and were impressed by GazePointAR's ability to include their gaze to resolve queries. Participants also noted limitations, such as only capturing gaze data once after a query is spoken, the inability to handle queries with multiple pronouns, lack of AI explainability, and object recognition errors.
Informed by these findings, we created a second GazePointAR prototype with improved object recognition and phrase generation techniques using prompt engineering, and conducted a follow-up first-person diary study [22]. Here, the first author used GazePointAR in their daily life for five days and recorded a written diary of usage, reflections, and observations of both successes and failures. In 20 hours of usage (4 hrs/day), the first author used GazePointAR across various contexts from cafes and restaurants to shopping malls and cinemas, and posed 48 queries, including recommendations for allergy-friendly menu items, ratings of movies, and cheaper alternatives to expensive clothing. Although the first author found GazePointAR to be more natural, instinctual, and robust against complex-to-describe objects in the real world than a traditional VA like Siri, they also encountered similar limitations as the study participants, such as static gaze data and limited object recognition capabilities, as well as privacy concerns with using a speech- and camera-based system in public.
In summary, our contributions include: (1) a fully-functional, context-aware VA for wearable AR that uses real-time computer vision and LLMs for pronoun disambiguation and more natural query dialogue; (2) findings from two user studies, including how users instinctively generate context-sensitive queries, how GazePointAR performs on queries from different scenarios, and limitations such as continuously tracking gaze information and AI explainability; and (3) a discussion on how to design future context-aware VAs that support any natural query a user poses spontaneously.

RELATED WORK
We provide background on pronoun usage in speech before enumerating relevant literature in multimodal interaction with a focus on voice assistants and augmented reality.

Pronoun Usage in Speech
Pronouns are frequently used in human speech, both in conversations between humans and in task-oriented dialogue systems, i.e., computational systems that complete tasks described in natural language. Leech et al. ranked the frequency of 100 million spoken English words, showing that pronouns, including demonstrative pronouns (e.g., "this," "that," "these," "those," "here," and "there") and third-person pronouns (e.g., "it," "he," "him," "she," "her," "they," and "them"), all ranked in the top 200 [48]. As further evidence, Byron and Allen annotated a corpus of task-oriented dialogues and found that over one-third of 1,068 dialogue turns contained referential occurrences of the pronouns "it" and "that" [13]. Similarly, HCI studies have highlighted the importance of pronouns in human speech, as they enhance its naturalness and expressivity [9,42,47], and have found that users desire to communicate with VAs using pronouns [34]. To resolve pronoun ambiguity, humans rely on multimodal cues, such as looking or pointing at referents while speaking, as well as conversational context [23]. In our work, we investigate real-time gestures, eye gaze, and conversation history to enable pronoun disambiguation in human-VA interaction.

Multimodal Interaction
The HCI community has long been interested in multimodal interaction, highlighting various benefits such as improved naturalness, robustness, and expressiveness compared with unimodal interaction techniques [71,84]. For instance, researchers explored gaze as a multimodal input technique in mobile devices to address shortcomings of touch, such as slow interaction speed, limited reach on large screens, and imprecision on small screens [26,28,43,58,76]. Additionally, gestures and speech have often been combined with gaze to improve the accuracy of gaze-alone systems [15,62]. In our work, we rely both on gaze and, if identified in the visual frame, pointing gestures to resolve speech ambiguities. Many consumer products now support multiple modes of input, which allow users to interact using both touch and speech. Although the field of multimodal input is vast [71,84], for the purposes of this paper, we focus primarily on its use in voice assistants and augmented reality.

Multimodal Interaction with Voice Assistants. The integration of speech with additional input modalities has long been a topic of interest in HCI. For example, Bolt's foundational "Put That There" explored the use of speech and gestures as input [9]. Further research has expanded on this idea by examining other input modalities, such as gaze pointing [92], pen and voice interaction [17,46], and merging speech, gestures, and eye gaze [44]. More recently, researchers have examined multimodal speech and gaze interactions in the context of hands-free communication between humans and vehicles [4,64,82], as well as speech and gestures to support natural interactions with virtual objects in AR [36,50,77]. Others have explored AR-based WoZ VA prototypes that support more natural dialogue between users and VAs by employing gaze [60], touch [47], or gestures [83] alongside speech. The importance of multimodality in the design of voice user interfaces is widely acknowledged [1,23] because it enables flexible, expressive, natural, and contextual human-VA communication [9,32,42]. Our work aims to contribute to this literature by implementing and evaluating a fully-functional multimodal VA with ambiguous speech support.

Multimodal Interaction in Augmented Reality. In AR specifically, multimodal interaction is frequently employed to improve object selection and manipulation, typically using hand gestures, gaze, and/or voice [35,89]. For instance, both Olwal et al. and Piumsomboon et al. used speech as a supplement to gesture for improved object selection in AR [67,77]. Additionally, Kytö et al. used both head motion and eye gaze to increase the efficiency and accuracy of target selection in AR [45]. Furthermore, Lystbaek et al. used eye gaze to assist mid-air gestures with distant object selection in AR [57]. Lastly, Liao et al. used gestures and speech to generate and interact with AR presentation augmentations [50]. Similarly, GazePointAR employs hand gestures to support gaze with a goal of enhancing real-world object selection.
Most relevant to our work, recent research has explored multimodal interaction in AR for pronoun disambiguation. More specifically, when a multimodal VA receives an ambiguous query, such as "When does this store open?", AR is used to analyze various visual contexts, including objects, texts, gaze, and gestures. For instance, Mayer et al. presented WorldGaze, a WoZ smartphone-based multimodal VA that leverages head gaze information to clarify ambiguous queries [60]. Others have explored touch [47] and pointing gestures [83] to resolve ambiguity. Each modality has tradeoffs: head gaze is quick and hands-free but can be inaccurate [60], touch is accurate but slower and not hands-free [47], and gestures fall in between the two modalities [83]. In this work, we employ a combination of gaze supported by pointing gestures to create an efficient, mostly hands-free, and accurate input modality for speech disambiguation. We evaluate this in a fully-functional VA for wearable AR in various contexts.

Other Uses of Gaze, Pointing, and Speech in Wearable AR. We conclude by highlighting recent studies that, while not employing gaze, pointing gestures, and speech as multimodal interaction techniques, present novel applications for each of these input sources in wearable AR. For instance, researchers have used eye gaze to design AR interfaces that adaptively control the display of information based on context, including its timing, placement, and volume [52,56,75,80]. Additionally, hand gestures are often classified using machine learning to enable more natural object and UI manipulation [72,86,91]. Furthermore, wearable AR glasses have been used to caption, translate, and augment speech in a nonintrusive way [29, 37-39, 53, 61, 66, 73, 80, 87]. GazePointAR, while multimodal, is influenced by this prior work in wearable AR for enhanced interaction and context.

GAZEPOINTAR PROTOTYPE 1
To advance the naturalness and economy of expression in how humans interact with VAs, we designed and built GazePointAR, a fully-functional context-aware VA for AR glasses that uses eye gaze, pointing gestures, and conversation history to support pronoun disambiguation. Below, we describe GazePointAR's design and implementation, starting with a taxonomy of pronoun usage drawn from linguistics literature.

Taxonomy of Pronoun Use and Resolution
To design GazePointAR, we first examined commonly-spoken pronouns in human speech and referent resolution strategies. We analyzed Leech et al.'s ranked frequency list of 100 million spoken English words [48] and filtered to pronouns spoken at least 500 times per one million words. From this process, we extracted thirteen pronouns across three distinct groups, all of which GazePointAR supports: nominal demonstrative pronouns ("this," "that," "these," and "those"); adverbial demonstrative pronouns ("here" and "there"); and third-person pronouns ("it," "he," "him," "she," "her," "they," and "them").
Demonstrative pronouns are used to point to specific people or things and can be further broken down into nominal and adverbial [25]. In human conversations, gaze and/or pointing gestures are often used for referent disambiguation [23]. While demonstrative pronouns such as "this" and "that," "these" and "those," and "here" and "there" seem similar, humans naturally employ one based on relative distance from the speaker to the referent [23]. For example, a person may ask "How much is this?" when referring to a nearby object and "How much is that?" if the object is further away.
For third-person pronouns, "it" may function as anaphoric, which refers to a word used previously, as in "I have a bicycle. It is red."; as pleonastic, which is the use of more words than needed to express meaning either unintentionally or for emphasis, as in "kick it with your feet"; or as an event reference, as in "He lost his job. It came as a total surprise." [54]. When resolving the anaphoric or pleonastic "it," humans need prior conversation history, while for event reference, "it" can be used interchangeably with "this" or "that" [33,54]. For other third-person pronouns, humans often refer to entities such as other people or animals with "he" or "her", for example, but these pronouns must be used cautiously, as they can introduce gender bias [14].
Grounded in this analysis, we designed a taxonomy of frequently-spoken pronouns and how ambiguity from each pronoun can be resolved. When implementing GazePointAR, we adhered closely to this taxonomy, enabling our system to handle all thirteen pronouns and determine their referents based on gaze, pointing gesture, and conversation history.

System Implementation
We designed and implemented GazePointAR for the Microsoft HoloLens 2 with Unity 2021.3.16f1 and Mixed Reality Toolkit (MRTK) 2.8.2. While our overarching vision is to develop an always-available context-aware VA for lightweight AR displays, the HoloLens 2, despite its bulk, allowed us to rapidly prototype an implementation.
We designed GazePointAR to resemble the user experience of a commercial VA such as Apple Siri or Amazon Alexa. GazePointAR waits for a user to say "Hey Glass" and make a verbal query. If the user's query contains one of the thirteen pronouns in our taxonomy, it analyzes the user's field-of-view using various machine learning (ML) solutions, constructs a coherent phrase to describe the user's referent, replaces the pronoun with its referent, and sends the modified query to a large language model (OpenAI's GPT-3 [70]). The query response is vocalized to the user using a text-to-speech engine within 10 seconds. See the system diagram in Figure 2. We expand on key components below. As a rough examination of system response time, we asked the query "How much is this?" while gazing at a bottle of mango juice (a tutorial task) ten times. GazePointAR responded in 7.51 ± 0.45 seconds. We include subcomponent performance times from this same procedure below.
Activating GazePointAR. To activate GazePointAR, the user states "Hey Glass." For this, we implemented a continuously-running background process checking for the trigger phrase. Upon recognition, GazePointAR replies, "Hi, I'm listening." and waits for a spoken query. After the query, GazePointAR performs a substring search to check for pronouns from our taxonomy.
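The pronoun check can be illustrated with a minimal Python sketch. It is not the authors' implementation (which runs in Unity); the function name `find_pronouns` is hypothetical, and matching on whole words rather than raw substrings is an assumption made here so that, for example, "it" does not fire inside "with".

```python
import re

# The thirteen pronouns from the taxonomy described above.
PRONOUNS = (
    "this", "that", "these", "those",                 # nominal demonstratives
    "here", "there",                                  # adverbial demonstratives
    "it", "he", "him", "she", "her", "they", "them",  # third person
)

# Assumption: match whole words only (\b = word boundary), case-insensitively.
_PRONOUN_RE = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def find_pronouns(query: str) -> list[str]:
    """Return the taxonomy pronouns found in the query, in spoken order."""
    return [match.group(1).lower() for match in _PRONOUN_RE.finditer(query)]
```

An empty result would mean the query can be forwarded without any scene analysis; a non-empty one would trigger the image capture described next.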
Capturing and analyzing the user's field-of-view. If the query contains a pronoun, GazePointAR prompts the HoloLens to take a 1080p photo of the user's field-of-view. For user and bystander privacy, the captured image is stored temporarily and deleted once a query response is received. This process takes 2.27 ± 0.16 seconds to complete.
Once the user's field-of-view is captured, we begin analyzing the image for objects, texts, and faces. We send the captured image to three ML models through asynchronous POST requests to minimize runtime: Google Cloud Vision's (1) Object Localization and (2) Optical Character Recognition (OCR) models [16], as well as (3) Amazon Rekognition's Celebrity Recognition model [7]. This process takes 3.37 ± 0.23 seconds to complete.
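The benefit of issuing the three requests asynchronously is that total latency approaches that of the slowest service rather than the sum of all three. The sketch below shows the pattern in Python with `asyncio`; `call_service` is a hypothetical stand-in that performs no network traffic, and the service names are labels chosen for illustration, not real endpoints.

```python
import asyncio


async def call_service(name: str, image: bytes) -> dict:
    """Hypothetical stand-in for one HTTP POST to an ML service."""
    await asyncio.sleep(0)  # placeholder for network latency
    return {"service": name, "bytes_sent": len(image)}


async def analyze_image(image: bytes) -> dict:
    """Send the same image to all three services concurrently."""
    names = ("objects", "ocr", "celebrities")
    # gather() schedules the coroutines together and preserves their order.
    results = await asyncio.gather(*(call_service(n, image) for n in names))
    return dict(zip(names, results))
```

Calling `asyncio.run(analyze_image(photo_bytes))` returns one result per service once all three have responded.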
After receiving JSON responses from the ML services, GazePointAR identifies hierarchical relationships between the detected objects, faces, and texts. We treat the object detection and celebrity recognition results as the parent layer. The child layer, comprised of OCR results, is connected to parent bounding boxes that have at least 70% pixel overlap (a threshold tuned empirically). Each parent can have up to five OCR results, ranked by bounding box size. This ensures that GazePointAR prioritizes important textual information, such as product and brand names, which tend to be larger in the user's field-of-view, while ignoring less important, smaller details like promotional blurbs. For example, as shown in Figure 2, when a user asks "How much is this?" while holding a bottle of Naked Mighty Mango juice, possible parent layer objects include "person" and "bottle", with "bottle" having child layer objects such as "Naked", "Mighty", "Mango", "290", and "calories".
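A minimal Python sketch of this parent-child grouping follows. The names (`Parent`, `attach_text`) and the axis-aligned box format are hypothetical, and one detail is an assumption: the paper does not state what the 70% overlap is measured against, so the sketch uses the fraction of the OCR box's own area that falls inside the parent box.

```python
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap_fraction(child: Box, parent: Box) -> float:
    """Fraction of the child's area lying inside the parent (assumed metric)."""
    inter = (max(child[0], parent[0]), max(child[1], parent[1]),
             min(child[2], parent[2]), min(child[3], parent[3]))
    child_area = area(child)
    return area(inter) / child_area if child_area > 0 else 0.0


@dataclass
class Parent:
    label: str  # object or celebrity label
    box: Box
    children: list[str] = field(default_factory=list)


def attach_text(parents, ocr_items, threshold=0.7, max_children=5):
    """Attach up to max_children OCR strings to each parent, largest box first."""
    for parent in parents:
        inside = [(text, box) for text, box in ocr_items
                  if overlap_fraction(box, parent.box) >= threshold]
        inside.sort(key=lambda item: area(item[1]), reverse=True)
        parent.children = [text for text, _ in inside[:max_children]]
    return parents
```

Sorting by box area before truncating is what lets large brand text win over small print when more than five strings overlap the same object.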
Gaze tracking and gesture recognition. To capture the user's eye gaze and pointing gesture, we customized MRTK's built-in gaze and pointer modules. For gaze, we designed a white sphere that follows the user's gaze from a fixed distance (i.e., 2 meters) and is overlaid in their field-of-view. This allows us to retrieve 3D gaze coordinate data and also provides visual feedback to the user about their system-inferred gaze.
For pointing, we implemented a finger-pointing gesture to supplement the base palm-pointing gesture, since extending the arm and index finger is a more typical pointing gesture [23]. Performing a pointing gesture creates a ray that extends away from the user's hand until a collision with an object in the physical world occurs. To achieve this, we integrated MRTK's spatial awareness into GazePointAR to detect collisions between user inputs and spatial meshes generated in real-time.
As the HoloLens captures an image, GazePointAR simultaneously logs the locations of both the user's gaze and pointing gesture. To convert 3D gaze and pointing gesture coordinates to their corresponding pixel locations on the captured image, we use projection.
Query assembly and pronoun replacement. Using the ML-generated results and pixel coordinates of gaze and pointing gesture, GazePointAR assembles a coherent phrase to replace the user-spoken pronoun. To accomplish this, we employ a state diagram, which encompasses the differences in pronouns in our taxonomy.
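The paper does not detail the projection step, so the following Python sketch only illustrates the general idea with a standard pinhole camera model. It assumes the 3D point has already been transformed into the camera's coordinate frame (x right, y down, z forward), and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical parameters, not HoloLens calibration values.

```python
def project_to_pixel(point_cam, fx, fy, cx, cy):
    """Project a 3D camera-frame point (meters) to pixel coordinates.

    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    Returns None for points at or behind the camera plane.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    # Perspective divide, then shift by the principal point.
    return (fx * x / z + cx, fy * y / z + cy)
```

For a 1920x1080 image with the principal point at the center, a gaze point straight ahead at 2 meters, `project_to_pixel((0, 0, 2), 1000, 1000, 960, 540)`, lands on the image center.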
If a pronoun is singular, GazePointAR computes whether any input coordinates fall within any parent (i.e., object recognition and celebrity recognition results) bounding boxes. If so, GazePointAR takes that parent object's child layer (i.e., OCR results) and creates the following phrase: "[parent] with text that says [children]". Otherwise, GazePointAR takes the five nearest child layer texts from each input coordinate, computes a union, orders them by distance, and uses the five closest to build the following phrase: "[OCR Result
If the pronoun is plural, GazePointAR expands the gaze and pointing gesture pixel coordinates into bounding boxes with width and height equivalent to half of the captured image's width and height. Then, GazePointAR computes whether any input bounding boxes have at least 70% overlap with any parent bounding boxes. The rest of the procedure is the same as with singular pronouns.
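The singular-pronoun hit test and the plural-pronoun box expansion can be sketched in Python as follows. All names are hypothetical. Two details are assumptions: the expanded box is centered on the input coordinate, and a parent with no attached text is described by its label alone. The OCR-only fallback is left to the caller because its phrase template is not fully recoverable from the text.

```python
def contains(box, point):
    """True if the pixel coordinate lies inside the axis-aligned box."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def expand_to_box(point, image_w, image_h):
    """Plural pronouns: grow a point into a box half the image's width and height."""
    x, y = point
    # Assumption: the box is centered on the gaze or pointing coordinate.
    return (x - image_w / 4, y - image_h / 4, x + image_w / 4, y + image_h / 4)


def singular_phrase(parents, points):
    """parents: (label, box, child_texts) tuples; points: gaze/pointing pixels."""
    for label, box, children in parents:
        if any(contains(box, p) for p in points):
            if children:
                return f"{label} with text that says {' '.join(children)}"
            return label  # assumption: no text attached, use the label alone
    return None  # no parent hit; caller falls back to the nearest OCR text
```

With a gaze point inside the "bottle" box from the earlier example, the result reads "bottle with text that says Naked Mighty Mango", which then replaces the pronoun in the spoken query.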
Answering the query. GazePointAR assembles the final query by combining the user-spoken query, the ML-generated phrase, and text from the five most recent query-answer pairs. The final result is processed by OpenAI's GPT-3 [70], which takes 1.87 ± 0.43 seconds to complete. The output is displayed as text and read aloud. If there are no ML results or GPT-3 cannot process the modified query, GazePointAR responds "Sorry, I did not understand your question." Users can ask follow-up questions or provide additional information as appropriate.
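One way to keep a rolling window of the five most recent query-answer pairs is a bounded deque, sketched below in Python. The `Conversation` class and the "Q:/A:" prompt layout are hypothetical; the paper does not specify how the history is formatted for GPT-3.

```python
import re
from collections import deque


class Conversation:
    """Keeps the most recent query-answer pairs and builds the final prompt."""

    def __init__(self, max_pairs: int = 5):
        # A deque with maxlen silently drops the oldest pair when full.
        self.history = deque(maxlen=max_pairs)

    def record(self, query: str, answer: str) -> None:
        self.history.append((query, answer))

    def build_prompt(self, query: str, pronoun: str, phrase: str) -> str:
        # Replace only the first whole-word occurrence of the pronoun.
        pattern = rf"\b{re.escape(pronoun)}\b"
        resolved = re.sub(pattern, lambda _: phrase, query, count=1)
        lines = [f"Q: {q}\nA: {a}" for q, a in self.history]
        lines.append(f"Q: {resolved}\nA:")
        return "\n".join(lines)
```

Carrying earlier pairs in the prompt is what would let a follow-up such as "Is it healthy?" resolve "it" from conversation history rather than from the scene.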

STUDY 1: THREE-PART LAB EVALUATION OF GAZEPOINTAR
To evaluate GazePointAR and explore the potential of context-aware VAs in wearable AR, we conducted two studies: (1) a laboratory study to compare GazePointAR to two state-of-the-art query systems and examine how participants generate and use their own context-sensitive queries; and (2) a first-person diary study using GazePointAR in the real world. We report on the first study below.
For the lab study, we sought to address three primary research questions: (1) How do users initially perceive and use a multimodal, context-aware VA for pronoun disambiguation? (2) How does performance compare to traditional VAs? (3) What types of queries do users want to perform with a context-aware VA, and how well does GazePointAR support these queries? As initial work, our primary aim was not to quantitatively examine GazePointAR's performance but rather to observe how participants reacted to and used a fully-functional, context-aware query system for AR glasses.
To address these questions, we conducted a three-part, within-subjects laboratory study with 12 participants. In Part 1, we asked participants to complete a common query task with GazePointAR as well as two state-of-the-art commercial systems: Google Voice Assistant (voice input) and Google Lens (image+text input). In Part 2, participants completed three additional context-dependent query tasks with GazePointAR, which were designed to highlight different aspects in our design space (e.g., pronoun use, gaze, gesture, and conversation history). Finally, in Part 3, participants brainstormed and tried their own context-sensitive queries.

Participants
We recruited 12 participants via mailing lists, social media, and snowball sampling. Participants were screened via a demographic questionnaire, which asked about prior experiences with VAs, AR, and AI chat systems. Given the reliance on gaze and speech in our study, we excluded participants who reported visual or auditory disabilities, had a history of seizures or epilepsy, or were not fluent in English. All twelve participants indicated at least some previous experience with VAs, including Amazon Alexa, Apple Siri, Google Voice Assistant, Microsoft Cortana, and Samsung Bixby. Most (9/12) had not previously used AR headsets or glasses; those who had (3/12) mentioned Google Glass, Microsoft HoloLens, and Meta Quest Pro. Finally, all participants indicated at least some familiarity with AI chat systems, with six stating that they use them at least once a week (two participants marked never). Most commonly, participants mentioned ChatGPT (8/12) and customer support chatbots (2/12).

Procedure
The in-person laboratory study took place on a university campus and lasted 60 minutes. Instructions were presented orally with backing slides to improve comprehension. Consent and background forms were emailed in advance; written consent was taken in person. All sessions were video recorded for post hoc analysis. Because we were interested in candid reactions, we did not tell participants that we created GazePointAR.
Tutorial. After consenting, participants completed a short tutorial about each VA system: GazePointAR, Google VA, and Google Lens. The tutorial order was counterbalanced but the query task was the same: "Your task is to find the price of this bottle of Naked Mighty Mango juice" (Figure 2). During the tutorial, participants could ask questions of the study facilitator and, for GazePointAR, configure the AR headset fit and calibration. The study commenced once each participant was comfortable with all three VA systems.
Part 1: Comparing VAs. Part 1's goal was to examine how participants constructed queries for a common VA scenario: cooking. Specifically, we asked participants to "find a marinara pasta recipe that uses this jar of Rao's Marinara sauce; the more specific, the better" (Figure 3) using each of the VA systems, whose order was again counterbalanced. For each VA system, we encouraged participants to construct the query to best leverage the system's input modality (e.g., taking a picture for Google Lens, gazing or pointing for GazePointAR). The search task was deemed complete when the participant had found, from their perspective, a satisfactory recipe. After using each VA, participants filled out a System Usability Scale (SUS) [12,74] questionnaire and answered interview questions regarding their experience. At the end of Part 1, we asked participants to rank the three systems in terms of perceived intelligence, helpfulness, naturalness, and overall preference. We then asked follow-up questions to justify rankings.
Part 2: Context-sensitive Queries with GazePointAR. While Part 1 examined differences in query behavior depending on modality and technology, Part 2 specifically focused on examining context-sensitive queries with GazePointAR. We asked participants to complete three tasks that, based on our own usage of GazePointAR, benefited from context-dependent queries and pronoun disambiguation: (1) Write a simple math equation on a sheet of paper and ask GazePointAR if it is mathematically accurate; (2) Use GazePointAR to find the cost difference between two items; (3) Use GazePointAR to find more information about a person in a magazine article (Figure 4). Again, at the end of Part 2, we asked participants to remark on their GazePointAR experiences and the additional search tasks.
Part 3: Design Probe and Co-design. Finally, in Part 3, participants helped co-design the future of context-aware VA systems. Using a design probe method similar to Mauriello et al. [59], participants first watched five video clips of GazePointAR being used across diverse scenarios: cooking, math, language translation, recycling materials, and asking if there are dangerous items nearby (Figure 5). After viewing and discussing the design probe videos, participants brainstormed and then actually attempted their own context-sensitive queries, a study task that is only possible with a fully-functional prototype like GazePointAR.

Data and Analysis
We analyzed three sources of data: interview transcripts, observations from the user study sessions, and the post-task questionnaires. For the qualitative data, we used reflexive thematic coding [10,11].
The first author, who facilitated all user study sessions, created an initial codebook by reviewing study transcripts. The entire team then collaboratively iterated on the codebook while checking for bias and coverage. With a final codebook consisting of 34 codes, the first author coded participants' quotes, after which the team discussed the resulting themes. While this exploratory study focused on participants' reactions to GazePointAR, we also collected quantitative data from Part 1 to compare GazePointAR with existing systems. For SUS scores, we converted survey responses, which are on a scale of 0-40 when summed, to a range of 0-100 [12]. We then conducted a Friedman test as an omnibus test, followed by pairwise Wilcoxon signed-rank tests corrected with Holm's sequential Bonferroni procedure for statistical significance. See Figure 6 for a summary of quantitative results.
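The two quantitative steps can be made concrete with a short Python sketch. It assumes the standard SUS scoring scheme (ten items answered 1-5, odd-numbered items positively worded, even-numbered items negatively worded) and the standard step-down form of Holm's procedure; the function names are hypothetical and the p-values in the usage note are made up for illustration, not study results.

```python
def sus_score(responses):
    """Convert ten 1-5 SUS answers (questionnaire order) to a 0-100 score."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly ten items")
    total = 0
    for index, answer in enumerate(responses):
        # Items 1, 3, 5, ... contribute (answer - 1); items 2, 4, ... (5 - answer).
        total += (answer - 1) if index % 2 == 0 else (5 - answer)
    return total * 2.5  # the 0-40 sum scaled to 0-100


def holm_reject(p_values, alpha=0.05):
    """Holm's sequential Bonferroni: True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject
```

For three pairwise comparisons with illustrative p-values 0.01, 0.04, and 0.03, only the first survives correction: 0.01 is below 0.05/3, but 0.03 exceeds 0.05/2, which stops the procedure.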

Findings
We report key findings, including how VA input modality influenced perceived performance and query formation, the queries participants generated using GazePointAR, and successes and failures of GazePointAR in various scenarios. We denote each participant as P# (e.g., P1 for participant 1). Quotes have been lightly modified for concision and clarity.

Part 1: Comparing VAs.
In Part 1, participants completed an open-ended query task to find a recipe for a specific marinara sauce with the three different VA systems (Figure 3).We first provide overall reactions before analyzing query formations, perceived intelligence, naturalness, and helpfulness, task completion time, and usability.
Overall.Overall, participants preferred using Google VA (  =1.7; SD=0.7) and GazePointAR (1.8; SD=0.9) over Google Lens (2.6; SD=0.7)-lower is better, range is 1-3.For Google Lens, participants emphasized that while taking photos was familiar (3/12) and lessened the specificity of their queries compared to voice-only systems (3/12), manually capturing an image and supplying written text felt tedious (3/12) and unnatural (2/12).As P2 stated, "I had to take a picture and then add more information... It's like an extra step, right?Is this necessary?".Similarly, P4 said, "Google Lens is the most unnatural, because sometimes you have to type extra context, and I feel like that's just another hurdle.".Finally, the quality of Google Lens' responses influenced opinions: four participants were initially guided to a recipe for making marinara sauce rather than using Rao's Marinara sauce.Two participants mentioned losing confidence in Google Lens due to poor responses.
For Google VA, participants appreciated the straightforward (6/12), quick (4/12), and hands-free (2/12) nature of the system.Additionally, four participants emphasized that, compared to Google Lens and GazePointAR, it was easier to review query responses, visit different links, and decide on the best answer themselves.As P5 said, "Google voice assistant displayed a typical Google search result [on the phone], which gives me a lot of options... clicking into them allows you to try until you find the recipe that you're satisfied with."Half of the participants also mentioned the familiarity of Google VA and the results interface.For limitations, participants noted that Google VA requires queries to be highly specific (4/12),   necessitates accurate pronunciation of complex words like "Rao's" (4/12), and leads to longer queries, which are laborious to say (3/12).P12 aptly summarized theses issues by stating, "You have to be more specific and have to say a lot more...I also think that a lot of people might mispronounce Rao's."One participant (P3) felt strongly about voice-edit capabilities-as Google VA only allows query iteration through text but not voice-based editing.
Finally, for GazePointAR, participants felt that it was simpler (8/12) and faster (8/12) to interact with as well as more natural (7/12) and human-like (6/12) to speak to than Google VA and Google Lens. In part, this was because participants could reduce the specificity of their queries with GazePointAR's context-awareness features. As P10 said, "When speaking to GazePointAR, I am giving it a voice input while also interacting with the product that I am talking about. Perceptually, this is the most natural way of speaking, which is why we do this when talking to other people as well." Another said: "When you're talking to someone, you point to or look at something and say 'what is this?' They can see what you're pointing to or looking at, which is exactly what the headset is doing... I was also able to receive an answer quickly without having to look through web pages." (P4). However, the most common criticism (8/12) was that GazePointAR provided only a single answer rather than an interactive, explorable list like a traditional search engine. Participants also requested more transparency about the gaze and pointing gestures the system captured and the image GazePointAR took for scene processing, and they wanted citations in the query response and editable queries.
Query formations. Beyond overall reactions, we also explored how participants formed queries with the three systems. When examining query length, unsurprisingly, the two multimodal systems had shorter queries on average: Google Lens (avg=1.3 words long; SD=0.5) and GazePointAR (avg=6.3; SD=1.8) than Google VA (avg=8.4; SD=2.2). With Google Lens, all participants took a picture of the sauce jar then supplied additional text, including "recipe" (9/12) and "recipe using" (3/12). With GazePointAR, all participants used the pronoun "this" along with gaze but did not use pointing. P2 reasoned that "If you're pointing at something, you have to use your hand. This implies that you still have use of your hands during some tasks. Also, because the jar is so close, the system shouldn't need pointing to tell what I'm talking about." Finally, with Google VA, all participants used proper nouns, including various formations of "Rao's homemade Marinara sauce". Full queries are in Appendix 1.

Figure 6: The mean and standard deviation of task time, usability, perceived intelligence, helpfulness, naturalness, and overall preference. Task Time is in seconds. Usability is 0-100; higher is better. Rankings are 1-3; lower is better. For statistical significance, one asterisk (*) is p < 0.05; two asterisks (**) is p < 0.01.
Perceived intelligence, helpfulness, and naturalness. For perceived intelligence, participants ranked GazePointAR the highest with avg=1.5 (SD=0.8), followed by Google VA (2.0; SD=0.7), then Google Lens (2.5; SD=0.7). A majority of participants (8/12) reasoned that GazePointAR "recognized things I am talking about just from my gaze and pointing" (P3), while for Google VA and Google Lens, "instead of it figuring things out itself, I have to provide everything" (P12). For perceived helpfulness, participants ranked Google VA the highest with avg=1.3 (SD=0.6), followed by GazePointAR (2.1; SD=0.7), then Google Lens last (2.7; SD=0.5). Half of the participants reasoned that Google VA displays multiple options and images in a familiar UI, which helped them decide on a satisfactory answer.
For perceived naturalness, participants ranked both GazePointAR and Google VA highly with avg=1.6 (SD=0.8) and 1.7 (SD=0.5) respectively, followed by Google Lens (2.8; SD=0.6). Participants generally equated naturalness to the ease with which the query was constructed (10/12). As P12 said, "I wish I can say queries with and without pronouns, because whichever comes to mind first, that's the one I want to say." Given the simplicity of the search task, P5, P11, and P12 indicated that the high specificity demanded by Google VA is not much of a concern; however, as search queries become more complex, Google VA can quickly fall behind other systems. As one example, three participants were unsure how to pronounce "Rao's" so felt more comfortable saying "this". While seven participants felt GazePointAR was most natural, P12 emphasized that humans are conversationally adaptable and have learned how to speak to modern VAs: "GazePointAR was definitely the most human-like if we mean most 'natural' and 'human-like' in terms of speaking to another person; however, if we say 'natural' as in speaking to a machine, then Google Voice Assistant wins".
Task completion time. While we allowed participants to decide for themselves when they had reached a satisfactory query answer, task time is still an interesting metric and central to information retrieval [30]. On average, the fastest completion was Google VA (avg=26.3 secs; SD=12.2) followed by GazePointAR (37.4 secs; SD=11.6) then Google Lens (60.7 secs; SD=28.3). For both Google VA and Google Lens, participants primarily spent time clicking and viewing links to find a satisfactory recipe, while with GazePointAR, participants received a direct answer but were delayed by query and image processing. To form the query, Google Lens took the longest as participants had to input both an image and textual content; for both Google VA and GazePointAR, participants could form queries hands-free, which increased interaction speed.
System usability. Finally, for the SUS questionnaire, participants gave Google VA a higher usability score (avg=80.0; SD=14.3) than Google Lens (66.3; SD=14.8) and GazePointAR (62.1; SD=20.0); higher is better, range is 0-100. Various factors influenced usability, including familiarity with the Google suite, autonomy in choosing a satisfactory answer from the Google UI, naturalness in coming up with and vocalizing queries, and task completion time.
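For readers unfamiliar with how a 0-100 SUS score is obtained, the sketch below applies the standard SUS scoring rule to one participant's ten item responses. It assumes the conventional ten-item, 1-5 Likert form of the questionnaire; the function name `sus_score` and the example responses are illustrative and are not data from the study.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for one participant's ten 1-5 Likert responses."""
    if len(responses) != 10:
        raise ValueError("SUS expects exactly ten item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be on a 1-5 scale")
    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded: contribute (response - 1).
            total += response - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded: contribute (5 - response).
            total += 5 - response
    # The summed contributions (0-40) are scaled to the 0-100 range.
    return total * 2.5


# Hypothetical responses for illustration only.
example_responses = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]
print(sus_score(example_responses))
```

Per-system averages such as those reported above would then be the mean of these per-participant scores.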

Part 2: Context-sensitive Queries.
While Part 1 explored differences between VA systems, Part 2 focuses specifically on GazePointAR and three context-sensitive queries: solving a math equation, comparing costs between items, and finding information about a celebrity (Figure 4). We did not guide participants in how to complete the queries, so our findings are based on participants' initial instincts. For all tasks, participants chose to use gaze+speech rather than pointing, which they felt was unnecessary (7/12) and like extra work (6/12). In a few instances, participants relied on conversation history; for example, P1 asked "How much do these cost?", then, after receiving the prices of two items, they asked "What's the cost difference?". Below, we report on participants' query formations and their overall reactions across tasks.
Solving a math equation. Interestingly, all participants constructed this query similarly: using the pronoun "this", which felt most natural (9/12). As P10 said, "the equation I wrote is right there, but I don't want to say the whole thing out loud... being able to just look and say 'this' and have it read the equation is pretty useful." All but one participant preferred using a context-sensitive query and pronouns compared to vocalizing the whole equation. Some participants (5/12) mentioned feeling unsure where to look to properly capture the equation during their query: "having to keep my gaze on the equation is more difficult than a jar, since I know I have to fix my gaze, but I am not sure where I should look" (P3).
Comparing costs between two items. Unlike the math equation, participants constructed this query using two different pronouns: ten used the pronoun "these" and two used "them". Currently, GazePointAR only supports one pronoun per query. Five participants felt that constructing a comparative query with multiple pronouns would have felt more natural, such as "Compare the cost of this to that." As P1 stated, "when there are exactly two objects, I feel like I will more likely say 'this or that' rather than 'these'". Similar to the math task, participants were unsure where to look to communicate intent (i.e., multiple object referents) with GazePointAR. Participants also reiterated wanting more system transparency to understand what GazePointAR was capturing for the context-sensitive query: "It is impressive that it can figure out multiple objects, but it will likely be more incorrect when trying to guess multiple objects I am talking about, so I really want to know what it thought I meant" (P5).
Finding information about a celebrity. For this task, the query construction was most varied: five used the pronoun "this" (e.g., "Who is this?"), four "her" (e.g., "Tell me about her."), and three "she" (e.g., "Who is she?"). Seven participants specifically mentioned how helpful pronouns were with this task: "if you are looking at something you don't know, like a photo of a person, the only way to ask a question is by saying 'who is she' or 'who is he'" (P11).

Part 3: Design Probe and Co-design.
Finally, for Part 3, we showed five video clips of GazePointAR and then invited participants to co-brainstorm and try their own context-sensitive queries (Figure 5). Below, we first report on reactions to our design probe and then describe participant-generated queries and how well GazePointAR performed.
Reactions to design probes. Overall, participants believed that GazePointAR has many uses, as many referents are difficult to describe in words. As P10 said: "although I use voice assistants almost every day to play music or something, I now realize that many things I look at are difficult to clearly describe in text... since with this people can now input their environment easily, I think it will make speaking to voice assistants easier in many everyday activities." P3 was surprised with the range of supported queries. Additionally, participants expressed a particular interest in the societal impact examples, such as the hazardous object clip (7/12), which shows an accessibility example where a user is asking "Anything dangerous here?" while looking ahead, and the recycling clip (4/12), which shows a user asking "What goes in these trash bins?". After viewing the hazardous object probe, P5 said, "all you have to do is use pronouns and it can process objects in a person's field-of-view... that's great for blind people, which I really like." Participants summarized that when a visual referent is either unknown or difficult to vocalize, pronouns become especially useful.
Brainstorming and trying queries. For the co-design task, participants generated a total of 32 queries (see Appendix 3) and used gaze (32/32), pointing (6/32), and conversation history (1/32). Queries in which participants used pointing gestures had pronouns "that" (4/6), "there" (1/6), and "they" (1/6), which were all referring to objects far away from the user. Conversation history was used when asking follow-up questions to find more information about a celebrity. Most queries (23/32) were aimed at deriving information about an object or person, including an object's name and price, a location's distance, and a person's name and accomplishments. Other queries included foreign language translation (4/32) like "How do you pronounce this?", object comparison (3/32), and confirmation of the correctness of a user's action (e.g., "Can I put this [trash] in here [recycling trash bin]?") (2/32). In analyzing pronoun usage, participants most commonly used "this" (16 occurrences), followed by "that" (8), "s/he" or "him/her" (5), and "they" (1).
GazePointAR provided a satisfactory answer for 13 of the 32 queries, including "Who is s/he [person]? What is his/hers [musician] top hit?" and "What's happening over there?". Many of the unanswerable queries were due to lack of information, such as limitations in object recognition (e.g., while an object localization model can recognize a car, it does not know the make and model of the car) and missing access to information online (e.g., the price of an item may vary and GPT does not have access to store-specific information).
Other unanswerable queries were due to GazePointAR's inability to handle multiple pronouns in a single query (e.g., "Tell me the price difference between this and that.") or past referents (e.g., "Who was s/he again?"). Participants suggested that GazePointAR should capture gaze over time. P3 added that this would remove the need for dwelling on a referent, which would allow users to gaze more naturally and improve the system's overall usability. While P10 was in favor of this feature, they also expressed privacy concerns. P5 went even further and said GazePointAR should record objects near the gaze point to support scenarios where the gaze target is not the object in question (e.g., "What is the object next to that chair").

Study 1 Summary
Participants appreciated GazePointAR for its simplicity, naturalness, and human-likeness. When using GazePointAR, participants mostly relied on gaze to keep the interaction hands-free and efficient, while occasionally using pointing gestures and conversation history. Participants preferred to speak pronouns, especially when referents had difficult-to-pronounce, long, or unknown names. In some cases, including pronouns in a query felt less natural (e.g., "What can I make with this?" vs. "What can I make with Rao's Marinara sauce?"). In terms of limitations, we found that GazePointAR should support multiple pronouns, provide more answer options and explanations when answering queries, use more robust ML models, and that users could tire due to explicit gazing. Participants suggested several features for improvement: capturing gaze information over time, communicating to the user about captured images, gaze, pointing, and citations used in deriving answers, and displaying an explorable search result similar to Google.

GAZEPOINTAR PROTOTYPE 2
Informed by Study 1 findings and our own experiences using GazePointAR, we created a second GazePointAR prototype with three advancements: first, we replaced Google Cloud Vision's Object Localization model with YOLOv8 [41]; second, we redesigned the multimodal contextual phrase generator using prompt engineering [79]; third, and finally, we updated the chat completions API to leverage GPT-3.5. We describe these advancements below and then discuss our five-day first-person diary study using GazePointAR version 2 "in the wild" [81].
Updating GazePointAR's object recognizer. For the initial GazePointAR prototype, we chose Google Cloud Vision's Object Localization model, as it enabled rapid prototyping. However, a key limitation of this model is that it categorizes an object as "packaged goods" if it cannot precisely identify the object, which confused both GPT-3 and our participants. In this iteration of GazePointAR, we instead employed a state-of-the-art YOLOv8 model trained on the MS COCO dataset [51] by building a local API server using FastAPI and Docker, and tunneling the local API using Localtunnel. This increased ML services' runtime to 3.75 ± 0.31 seconds (+11.28%) and the overall runtime to 7.94 ± 0.38 seconds (+5.73%).
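To make the reported overheads concrete, the short calculation below backs out the approximate prototype 1 runtimes implied by the percentages. It assumes each percentage is the relative increase of the v2 mean over the v1 mean; the resulting baselines are derived values under that assumption, not figures reported in this section, and the function name is illustrative.

```python
def baseline_from_increase(new_value, percent_increase):
    """Recover the earlier value given the new value and its relative increase in percent."""
    return new_value / (1 + percent_increase / 100)


# Reported v2 means (seconds) and relative increases (percent).
ml_v1 = baseline_from_increase(3.75, 11.28)
overall_v1 = baseline_from_increase(7.94, 5.73)

# Implied v1 means and the absolute slowdown introduced by the YOLOv8 pipeline.
print(f"ML services: about {ml_v1:.2f} s before, {3.75 - ml_v1:.2f} s slower now")
print(f"Overall: about {overall_v1:.2f} s before, {7.94 - overall_v1:.2f} s slower now")
```

Under this reading, the move to a self-hosted model adds well under half a second to each query.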
New contextual phrase generator. In GazePointAR v1, our phrase-generator automatically replaced query pronouns with ML results using a hierarchical heuristic model. In the revised GazePointAR prototype, we instead use prompt engineering that leverages GPT, rather than heuristics, to integrate all pieces of information together. This enabled GazePointAR v2 to support multiple pronouns, since the entirety of the original query was captured in the prompt. For example, if the user asks "I love this cloth. Who designed it?", rather than creating the modified query "I love clothing with text that says [brand name] cloth. Who designed it?", GazePointAR includes the user's original query as raw information in the engineered prompt (see Figure 8). Note: to supply gaze and pointing gesture information, we still treat the YOLOv8 object recognition and celebrity recognition results as parent layer and OCR results as child layer to create the phrase. Additionally, as part of the prompt, we asked GazePointAR to briefly explain its answers in an attempt to enhance explainability.
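The following is a minimal sketch of how such a prompt could be assembled: a referent phrase is built from the parent layer (object or celebrity label) and the child layer (OCR text), and the user's original query is passed through unmodified. The function names, the instruction wording, and the prompt layout are assumptions made for illustration; the actual engineered prompt is the one shown in Figure 8.

```python
from typing import Optional, Sequence


def referent_phrase(parent_label: Optional[str], ocr_text: Optional[str]) -> str:
    """Combine the parent layer (object or celebrity label) with the child layer (OCR text)."""
    if parent_label and ocr_text:
        return f'{parent_label} with text that says "{ocr_text}"'
    if parent_label:
        return parent_label
    if ocr_text:
        return f'text that says "{ocr_text}"'
    return "an unrecognized object"


def build_prompt(query: str,
                 gaze_phrase: Optional[str] = None,
                 pointing_phrase: Optional[str] = None,
                 history: Sequence[str] = ()) -> str:
    """Assemble a hypothetical prompt that keeps the original query intact."""
    parts = ["You are a voice assistant on an AR headset. "
             "Answer briefly and explain your answer in one sentence."]
    if gaze_phrase:
        parts.append(f"The user is looking at: {gaze_phrase}.")
    if pointing_phrase:
        parts.append(f"The user's hand gesture indicates: {pointing_phrase}.")
    if history:
        parts.append("Earlier conversation: " + " | ".join(history))
    # The query is included verbatim, so every pronoun in it survives.
    parts.append(f"User query (unmodified): {query}")
    return "\n".join(parts)


phrase = referent_phrase("clothing", "[brand name]")
print(build_prompt("I love this cloth. Who designed it?", gaze_phrase=phrase))
```

Because no pronoun is substituted before the LLM sees the query, a query with several pronouns needs no special handling at this stage; resolving each pronoun is left to the model.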
GPT-3.5. Lastly, with the introduction of GPT-3.5, we updated GazePointAR to use gpt-3.5-turbo, which has been trained on more up-to-date data and is more efficient than GPT-3 [68].

STUDY 2: GAZEPOINTAR DEPLOYMENT
After iterating on GazePointAR, we carried out a first-person, five-day diary study [22]. While informed by related first-person study methods like autoethnography [21,27] and autobiography [63], we explicitly use the term "diary study" as the other methods tend to span longer periods of time. The diary study enabled us to evaluate the potential of an always-available, multimodal wearable VA system in the real world. The lead researcher utilized GazePointAR v2 in their day-to-day activities while documenting their interactions. We report our process and findings.

Procedure
The researcher wore a Microsoft HoloLens 2 continuously running GazePointAR v2. Because GazePointAR requires an Internet connection, the HoloLens was connected to either a mobile hotspot or public Wi-Fi networks. Over five days, GazePointAR v2 was used four hours a day across various settings, including indoor locations like homes, offices, gyms, cafes, restaurants, shopping centers, libraries, cinemas, grocery stores, and hospitals, as well as outdoor areas such as sidewalks, parks, university campuses, and public transit stations. To document their interactions, the lead researcher used HoloLens' internal video recording feature and kept a pen and notebook for journaling insights and observations.

Findings
In total, the lead researcher asked 48 queries, of which GazePointAR provided 20 satisfactory answers. Prompt engineering appeared to enhance the performance of GazePointAR in several ways: (1) GPT seems to recognize the importance of the user's gaze target when resolving ambiguous queries, giving it priority; (2) GPT seems to consider objects similar to the gaze target when answering queries; (3) the response is typically one sentence, and it includes a concise justification for its answer selection. Even with queries it could not answer, GazePointAR seemed to often accurately interpret user inputs and intentions, suggesting its performance was not inherently poor. For a full list of queries, see Appendix 4. Below, we present key findings including overall reflection on having an always-available context-aware VA, the types of queries asked, GazePointAR's response, and perceived limitations.
Overall experience. From simple tasks such as retrieving the rating of a new coffee shop and comparing health benefits of food items to more complicated tasks such as suggesting an allergy-friendly menu item and finding lost keys, the lead researcher set out to "stress test" GazePointAR v2 in the wild. They attempted to use GazePointAR naturally as an everyday assistant, looking around and posing queries as they arose. In their journal, the researcher wrote: "conversing with GazePointAR felt like a friend was tagging along, helping me." Perhaps the most surprising use was when, at a store, they asked: "This is a bit outside my price range... can you recommend a similar brand?" while looking at a piece of clothing. GazePointAR not only grasped the broader context but also identified the gaze target as clothing, determined its brand, and then recommended similar brands. However, the lead researcher recounted several instances where they felt self-conscious using GazePointAR, especially in public settings, mentioning that speaking out loud while wearing a bulky headset drew unwanted attention. This became more apparent in settings where people are typically quiet, such as libraries, hospitals, and movie theaters. Additionally, the lead researcher noted that after extended use spanning more than fifteen minutes, their eyes became tired from dwelling on referents.
Query Analysis. When analyzing the queries, we identified five categories: (1) asking for more information about a referent, such as its usage, price, and rating (21 queries); (2) asking for recommendations, such as a drink at a cafe (11); (3) asking for directions on how to proceed, such as navigating to a location or following step-by-step instructions (9); (4) asking about personal information, such as a schedule (4); and (5) asking about past actions, such as "Did I take this vitamin today?" (3). When thinking about why they used a pronoun, the lead researcher wrote "I'm just realizing that many objects and their features are difficult to describe in words... an apple is an apple, but how do you describe how rotten it looks to a machine? Or what about a clothing stain if I want to know how to get rid of it? Also, sometimes, I don't even know the words. When I was in Chinatown, the restaurant name was only written in Chinese. How else can I ask besides saying 'is this the right place?'" In crafting queries, the lead researcher employed various pronouns, with "this" being the most common (21 occurrences), mirroring Study 1 participants. Other pronouns include "it" (6), "that" (4), "here" (4), "there" (1), "these" (1), and "s/he" (1). While the lead researcher felt that the list of supported pronouns was exhaustive, 13 queries did not have pronouns in our taxonomy, and instead had first- and second-person pronouns (12/13), or no pronoun at all ("What's for sale today?"). For multimodal input, the lead researcher found themselves relying solely on gaze rather than pointing. When asked why, they said that "gaze was easier and hands free" (a similar reason to that given by participants in Study 1) and that "pointing in public spaces felt awkward."
Interestingly, the lead researcher often used first-person pronouns, "I" (33 occurrences), "me" (8) and "my" (7), as well as the second-person pronoun "you" (10). They observed that GazePointAR's human-like nature led them to use full sentences in their queries, which often included first- and second-person pronouns. However, this often resulted in longer queries, which contradicts findings from Study 1. To explain this inconsistency, the lead researcher wrote, "with regular voice assistants, I feel like I'm speaking commands, while to GazePointAR, I feel like I should have conversations with it. So to Alexa, for example, I would say phrases like 'price of an [item]', while to GazePointAR, I want to speak in full sentences like 'Can you tell me the price of this [item]?'". As a result, 31 queries had more than one pronoun. Finally, as part of their long queries, the lead researcher seemed to instinctively incorporate additional context. For example, when asking "I want to eat something light before my commute... can you suggest me a place?", the lead researcher clarified their preference for a light meal and implied that the time is probably early morning.
Query Answers. GazePointAR successfully addressed 20 of the 48 queries posed by the lead researcher (Figure 9). For example, when asked "Can you recommend me something from here?", GazePointAR read text information on a menu and recommended a drink. Additionally, when asked "I love this cloth. Who designed it?", GazePointAR not only replied with the designer's name but also provided brief information about the designer. GazePointAR even provided brief explanations, such as "the user looked at an <object> when asking this question", which improved understanding of the information GazePointAR captured. In contrast, the 28 failed queries (Figure 10) were most commonly due to missing object categories in our object recognition model and how we capture users' gaze. For example, when asked "How can I use this equipment?" at a gym, our object recognition model failed to recognize the different exercise equipment. Additionally, when asked "I'm looking for my keys... where did I leave it again?", GazePointAR was unable to figure out the lead researcher's referent, as it does not store any information over time. Similarly, GazePointAR still had trouble with some combinations of pronouns, such as "Which is healthier, this or that?". To fully tackle these queries, GazePointAR needs more data, such as gaze over time and improved ML results.

Study 2 Summary
In summary, the lead researcher appreciated GazePointAR for its natural, companion-like qualities, but noted its limitations in real-world settings due to insufficient information access. GazePointAR struggled with time-dependent queries, primarily those containing referents in the past (e.g., "That was a really cool car! Tell me more about it."), which require gaze history, or multiple referents (e.g., "Which is the healthier option? This or this?"), which require a shift in gaze while speaking. Additionally, while the lead researcher employed various pronouns instinctively, they also used many first- and second-person pronouns, which led to lengthier, full-sentence queries. Furthermore, the lead researcher relied solely on gaze interaction, avoiding pointing due to the additional physical effort and its impracticality in public. Lastly, extended dwelling caused fatigue. To improve, the lead researcher suggested capturing and storing gaze data over time, and using machine learning models with more object categories.

DISCUSSION
By utilizing gaze, gesture, and conversation history along with an LLM, GazePointAR advances the state-of-the-art in context-aware VAs. Both the user study (Study 1) and the diary study (Study 2) highlight key benefits, including more natural query formation, always-available interaction, and human-like "assistant" qualities. Below, we discuss current challenges and future opportunities for context-aware VAs like GazePointAR.
Capturing gaze information over time. In both studies, some queries were unanswerable due to how GazePointAR captures gaze information: at a single moment immediately after the query has been said. Future systems should instead track gaze continuously. This would enable users to shift their gaze, promoting more natural gaze behavior and reducing fatigue from explicit gaze. Continuous gaze tracking would also let users look at multiple referents across time, and the collected gaze pattern can assist an LLM in disambiguating queries with plural pronouns (e.g., "Which is cheapest among these?") or multiple pronouns (e.g., "Which is healthier, this or that?"). Moreover, storing gaze information for later reference, even for objects no longer in sight, would be beneficial. A key challenge is to find a suitable way to present temporal gaze data in a processable format for the LLM. One solution may be to pre-process raw gaze data into features such as fixations and saccades [20,85], and then represent them as text for an LLM to perform referent prediction. Of course, introducing continuous gaze tracking on an AR headset may also provoke privacy concerns for both users and bystanders [40], an additional area of future work.
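One way to picture this pre-processing step is the sketch below, which uses a simple dispersion-threshold fixation detector (a common textbook approach, swapped in here for illustration and not a method used in GazePointAR) and then renders the fixations as text lines an LLM could read. The thresholds, the sample format `(time_ms, x, y)`, and the label mapping are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple

Sample = Tuple[int, float, float]  # (time in ms, gaze x, gaze y) -- assumed format


class Fixation(NamedTuple):
    start_ms: int
    end_ms: int
    x: float
    y: float


def _dispersion(window: List[Sample]) -> float:
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(samples: List[Sample],
                     max_dispersion: float = 1.0,
                     min_duration_ms: int = 100) -> List[Fixation]:
    """Group consecutive samples that stay within max_dispersion for min_duration_ms."""
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        # Grow the window until it spans at least min_duration_ms.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration_ms:
            j += 1
        if j >= n:
            break  # Not enough remaining data for another fixation.
        if _dispersion(samples[i:j + 1]) <= max_dispersion:
            # Extend the window while the samples remain tightly clustered.
            while j + 1 < n and _dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            window = samples[i:j + 1]
            cx = sum(s[1] for s in window) / len(window)
            cy = sum(s[2] for s in window) / len(window)
            fixations.append(Fixation(window[0][0], window[-1][0], cx, cy))
            i = j + 1
        else:
            i += 1  # Likely part of a saccade; slide the window forward.
    return fixations


def fixations_to_text(fixations: List[Fixation], labels: Dict[int, str]) -> str:
    """Describe each fixation in one line; labels maps fixation index to an object name."""
    return "\n".join(
        f"{f.start_ms}-{f.end_ms} ms: looked at {labels.get(k, 'unknown object')}"
        for k, f in enumerate(fixations)
    )


# Hypothetical gaze trace: a dwell near (0, 0), a jump, then a dwell near (10, 10).
trace = [(0, 0.0, 0.0), (50, 0.1, 0.0), (100, 0.0, 0.1), (150, 0.1, 0.1), (200, 0.0, 0.0),
         (250, 10.0, 10.0), (300, 10.1, 10.0), (350, 10.0, 10.1), (400, 10.1, 10.1), (450, 10.0, 10.0)]
found = detect_fixations(trace)
print(fixations_to_text(found, {0: "apple", 1: "banana"}))
```

In a query such as "Which is healthier, this or that?", the ordered lines give the model two candidate referents with timing information, rather than the single gaze snapshot the current prototype provides.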
Ensuring user autonomy in choosing an answer. GazePointAR currently chooses one best answer and reads it out to the user. While this is efficient, balancing interaction speed with user autonomy in choosing answers remains a challenge. Study 1 participants preferred a Google-like UI for exploring options, while the lead researcher in Study 2 highlighted the awkwardness of having to stand still and interact with mid-air gestures in public. Moreover, the lead researcher was satisfied with GazePointAR's concise answers and explanations. A possible solution could be to first offer the top answer verbally with a brief explanation and then a Google-like UI as an option for further exploration. To further reduce cognitive load, UI panels should be glanceable [55], gaze-adaptive [52,75], or show different detail levels [24,52].
Enhancing explainability. Our study findings reinforce prior research, emphasizing the growing necessity for explainable AI (XAI) in designing everyday AI-driven experiences using wearable AR [2,3,31,90]. Our initial steps included prompting an LLM to explain its responses. While this approach was quick and effective, future context-aware VAs should also visually present the captured images, user inputs, ML results, and predicted referents used to derive an answer. Again, to limit cognitive overload and UI clutter, we imagine first presenting a concise explanation followed by an option to receive more information.
Supporting instinctive queries. Our study findings suggest that while pronouns can facilitate human-VA interaction, they are not always needed and may complicate query formation. For example, in Part 1 of Study 1, some participants preferred explicit queries such as "What can I make with Rao's Marinara sauce?" over using the pronoun "this". The way individuals use pronouns in queries seems to be based on instinct and preference, which affects query ambiguity. To handle a wider range of queries, from those without pronouns to those with many, and from unambiguous to ambiguous, we integrated prompts into GazePointAR v2. This enables an LLM to process the original query, not one altered by simple heuristics, and supply ambiguous queries with relevant information. A context-aware VA should support whatever query a user thinks of first, and our work shows promise in achieving this.
Enhancing machine learning capabilities. Other queries were unanswerable because GazePointAR's object recognition model failed to identify referents. This became more apparent in Study 2, as many real-world objects are not included in YOLOv8's object categories, such as gym equipment, breeds of dogs, and types of cars. Improvements in ML algorithms [5,49,88] and the use of transformative tools like Google Lens' reverse image search or advanced multimodal LLMs such as GPT-4 [69] may help resolve this issue. Moreover, because many queries asked in both studies pertained to recommendations and personal data, context-aware VAs may benefit from access to personal (e.g., calendar) and online (e.g., ratings) information. Again, system designers must balance this need with the potential risks to privacy.
Designing a more robust study. While Study 2 led to unique insights not obtainable from a lab study, it only involved the lead researcher using GazePointAR in-the-wild, which may lead to subjective results. Future research should include more participants using a context-aware VA outside the lab.

CONCLUSION
In this paper, we present GazePointAR, a context-aware multimodal VA for wearable AR capable of answering pronoun-driven ambiguous queries. In our two studies, participants appreciated GazePointAR for its naturalness and human-likeness, and ability to refer to objects that are difficult to pronounce or describe. However, participants also noted several limitations, including not collecting and storing gaze data over time, lack of autonomy and explainability, the inability to support queries with multiple or past referents, and missing object categories in GazePointAR's object recognition model. Future context-aware VAs should support innate, instinctive, and natural gaze and gesture input, as well as speech, enabling users to ask any query spontaneously.

Figure 2: System overview and implementation details of GazePointAR

Figure 3: Cooking scenario and the three VAs used in Part 1 of the study.

Figure 4: Usage scenarios in Part 2 of the study.

Figure 5: Design probes in Part 3 of the study. See supplementary materials for the videos.

Figure 7: A subset of the scenarios participants came up with during Part 3 of Study 1. The top row shows recreations of answerable queries while the bottom rows highlight example queries that returned unsatisfactory responses.

Figure 9: Example queries from the first-person diary study (Study 2) which GazePointAR answered accurately.

Figure 10: Example queries from the first-person diary study (Study 2) which GazePointAR answered inaccurately.