WorldPoint: Finger Pointing as a Rapid and Natural Trigger for In-the-Wild Mobile Interactions

Pointing with one's finger is a natural and rapid way to denote an area or object of interest. It is routinely used in human-human interaction to increase both the speed and accuracy of communication, but it is rarely utilized in human-computer interactions. In this work, we use the recent inclusion of wide-angle, rear-facing smartphone cameras, along with hardware-accelerated machine learning, to enable real-time, infrastructure-free, finger-pointing interactions on today's mobile phones. We envision users raising their hands to point in front of their phones as a "wake gesture". This can then be coupled with a voice command to trigger advanced functionality. For example, while composing an email, a user can point at a document on a table and say "attach". Our interaction technique requires no navigation away from the current app and is both faster and more privacy-preserving than the current method of taking a photo.


INTRODUCTION
Finger pointing is quintessentially human. It is the earliest communicative gesture that develops in infants [18] and all cultures have been found to point with the hands [16]. Pointing is a fast and convenient way to denote an area or object of interest in the real world, which greatly facilitates human-human communication [15]. Unfortunately, in human-computer interactions, pointing is rarely utilized. In the prior work that has utilized pointing interactions (discussed in greater depth later), systems are either room-scale fixed setups (e.g., "Put that There" [11]) or virtual / augmented reality experiences (e.g., [28,52]). Underexplored, however, is incorporating finger pointing into conventional smartphone interactions. Fortuitously, it is increasingly common for smartphones to feature both wide-angle, rear-facing cameras and hardware to accelerate machine learning models. These are the key technical ingredients to make real-time finger pointing feasible on phones.
More specifically, we envision interactions wherein users utilize smartphone apps as usual, but can raise their hand in front of the phone to point to an item. As one example, a user could be composing an email on their phone, and then point to an object in the environment and say "attach" (Figure 1). Importantly, this interaction never requires leaving the email client or indeed any presses to the touchscreen, and the user is free to continue composition or send the email following the interaction. We believe such lightweight interactions maximize the speed and convenience of finger pointing. Further, by segmenting the object of interest and removing surrounding content, user privacy is enhanced. Figure 2 shows a side-by-side storyboard comparison of WorldPoint and iOS 16's equivalent mechanism (see also Video Figure).
In this paper, we describe our proof-of-concept implementation, which runs entirely on a smartphone. Rather than running continuously, which would be too power intensive, we instead periodically (at 1 Hz) run a lightweight model that checks only for the binary presence of a hand in front of the smartphone. If a hand is detected, we then run a more intensive model that produces a 3D hand pose at 4 Hz. We then test to see if the user is forming a valid pointing gesture, and if so, we increase tracking to ~20 Hz and ray cast the finger vector into the scene. We then perform image-based segmentation to "cut out" target objects from the scene. We evaluated our system's pointing accuracy through a user study and found a mean Euclidean error of 15.1 cm, roughly twice the error of innate human pointing (7.7 cm). Nonetheless, our WorldPoint prototype is still sufficiently accurate to, e.g., point to a document lying on a table 2 m away.

Fig. 2. A real-world comparison of UX flows (WorldPoint vs. iOS 16) for a user attaching a receipt to an email. Note that WorldPoint requires no clicks to the screen and takes 6 seconds, while iOS requires 4 clicks and 11 seconds. Elapsed time is taken from the Video Figure. Ellipsis denotes time gaps >1 second. Note that the iOS attachment contains miscellaneous other content, including potentially private materials. The user could spend additional time better positioning the camera or cropping the photo after capture, further slowing the interaction.

COMPARISON TO CONTEMPORARY METHODS & WORLDPOINT DESIGN GOALS
The storyboard in Figure 2 provides one example of how users currently capture or otherwise engage with real-world content in iOS when in an app (which is broadly equivalent to the mechanism in Android 12). In the provided example, a user wishes to attach a paper receipt to an email reimbursement request. In the Gmail app (v6.0.230205) running on iOS 16.3.1, this requires the user to click the attachment icon, then click the camera icon, then take a photo of the item of interest, then confirm by pressing "Use Photo", after which the whole photo is inserted into the email. The interaction takes approximately 11 seconds (see Video Figure). Apple's Mail app also requires 4 clicks and has similar completion times. If the user wished to crop out surrounding content, multiple additional clicks and swipes would be required. Furthermore, the above interaction sequence takes users away from their application context where the content is desired.
The awkward design of the above interaction motivated several key design goals of WorldPoint: 1) We wished to create an interaction method that did not require navigating away from the current application and losing important context. 2) An intuitive method, closely matching how humans already communicate with one another. 3) A method that was faster than methods found on smartphones today. 4) The targeted object is explicitly selected for unambiguous actions (i.e., only one object is selected, rather than a scene). Finally, 5) the greater scene is obscured so as not to reveal unnecessary information and better preserve user privacy.
Revisiting the interaction sequence in Figure 2, WorldPoint requires no clicks, no navigating away from applications, and takes roughly half the time. A combined finger-pointing gesture and voice command allows the user to insert their receipt in situ, with minimal visual distraction. The elapsed time of both interaction sequences is shown in Figure 2 and the Video Figure. Two other smartphone pointing methods worth mentioning are pointing the camera at an object and either taking the center-most object or having the user tap on the desired object with a finger. While eminently possible, such approaches have to be explicitly launched and then occupy the screen. In contrast, WorldPoint can run in the background across all applications, quickly activated when needed. When active, WorldPoint does not take over the screen, preserving task context.

RELATED WORK
The sociocultural, cognitive, and biomechanical aspects of human finger pointing have been extensively studied. A review of this literature is beyond the scope of this work, but we point readers to excellent surveys by Wilson [62], and Butterworth and Jarret [13]. In this section, we focus on human pointing in human-computer interactions, which has been long studied, especially for targeting distant objects (also called distal pointing) [53,60]. We emphasize this is an extensive literature and a full review is not possible. Instead, we use exemplary systems to convey the research landscape.

Pointing with Devices
There are innumerable devices that humans have invented to help facilitate or mediate their pointing, from telescoping poles to laser pointers [48,60]. Digital handheld pointing devices range from commercial products such as the Nintendo WiiMote [53,56] to research devices like "Soap" [10]. Now popular are VR/AR controllers [45,57,59] that allow their users to point in mid-air in 3D virtual space. Wearable form factors, including armbands [30], wristbands [25,36,37] (including the seminal pointing interaction work "Put-That-There" [11]), rings [64], and even head-mounted cameras [2] have been explored for pointing.

Finger Pointing on Touchscreens
The success of direct manipulation interfaces was due in no small part to the intuitiveness of pointing (i.e., poking) at 2D interface elements. Since then, researchers have looked at more advanced methods beyond 2D touch input. For instance, Wang et al. [61] enabled more precise 2D selection by leveraging the intersection of two 2D finger rays (using the touch ellipsoid to determine finger pitch). Researchers have also looked at ways to do 3D finger pointing on 2D touchscreens, including Xiao et al. [63] and TouchPose [1], which use capacitive image data to estimate the 3D angle of fingers. Holz and Baudisch showed that knowledge of 3D finger angles can be used to increase 2D touch targeting accuracy [32]. More closely related to our work, iOS 16 allows users to long press and drag on an object in a 2D photo, which segments the object. In the accessibility domain, people with visual impairments can explore the pass-through scene on the touchscreen by tapping each object [27,40]. Finally, researchers such as Hürst and van Wezel [34] have explored interacting with on-screen virtual objects in a mobile pass-through AR context.

Hand Pointing in VR/AR
Although WorldPoint is not a VR/AR method (at no time is the user ever shown a pass-through scene), we note there is considerable research in this space that merits a brief review. Early work by Poupyrev et al. [51,52] explored finger pointing as a virtual object selection and manipulation technique. Microsoft HoloLens offers a related "Point and Commit" interaction technique [46], though the ray extends from the hand and not a finger. Pointing in VR is also important for human-human interaction in collaborative telepresence applications [42]. Recent advances in computer vision now allow modern VR/AR systems to offer unencumbered 3D hand tracking for interaction [29]. Unique in the literature, however, is an interaction technique for mobile phones that combines finger pointing into the real world with voice commands. Further, our implementation runs on off-the-shelf smartphone hardware and is both real-time and infrastructure-free, which is rare. Additional features, such as semantic segmentation of pointed physical objects (and not virtual elements), further differentiate this work from prior systems.

Hand Pointing in Mobile AR
Closely related to this work are mobile AR finger-pointing interactions. Vincent et al. [58] describe several fundamental methods, including "direct touch" (on the screen) of an AR object of interest, as well as a "screen-centered crosshair", where the user moves the phone such that the object of interest is centered in the field of view of the camera. Even more similar to WorldPoint are methods that utilize the hands in front of the phone to "point" at objects in a real-world scene. For example, Hürst and van Wezel [34] used markers attached to the fingers to evaluate translating, scaling, and rotating virtual objects in a mobile AR app where the hands operate in front of the phone, though finger pointing is not implemented. Bai et al. [9] also demonstrated finger tracking in front of a phone, though only 2D tip detection is performed.
In contrast to all of the latter systems, WorldPoint computes a true 3D finger ray cast into a 3D scene. Even more critically, the user experience is very different. All of the prior systems utilize a pass-through AR view that occupies the phone's screen. WorldPoint does not displace current applications, and instead provides a small and segmented preview of only the currently pointed object (i.e., no scene). In fact, we envision users not looking at the screen at all when triggering WorldPoint. Instead, users can look and point at the object of interest, and simply speak a command. Only then does the user need to return to looking at the screen and continue their task. In this way, it is actually more similar to WorldGaze [41], a smartphone interaction technique using a user's gaze ray and spoken commands.
WorldPoint could utilize either an index finger ray cast (IFRC) or an eye-finger ray cast (EFRC), potentially as a user-toggleable option. However, we chose to use IFRC for our proof-of-concept implementation for three important reasons. First, IFRC better matches how humans point naturally. Second, IFRC is more ergonomic: most users hold their phone at an angle and below the height of their head, which means IFRC can be done in front of the phone without having to lift the hands to near head height for distant targets. Finally, IFRC only requires opening the rear-facing camera (to capture the pointing hand and scene), whereas EFRC additionally requires the user-facing camera to ascertain the 3D position of the user's eyes as well, adding additional processing and power draw.

IMPLEMENTATION
We now describe our proof-of-concept implementation. An overview of our pipeline is provided in Figure 3.

Proof-of-Concept Hardware
We selected an iPhone 12 Pro Max as a proof-of-concept platform. Using its rear-facing camera and LiDAR sensor, this device can provide paired RGB and depth images via Apple's ARKit Dev API at 30 FPS with approximately a 65° field of view. This device also features a 120° ultra-wide angle camera [3], but this is not yet supported by Apple's ARKit API (though indications online suggest this is forthcoming). We also note that while this iPhone contains a rear-facing LiDAR sensor to capture depth data, other LiDAR-less smartphones offer similar depth maps derived from deep learning, SLAM, and other methods (see, e.g., Android's equivalent ARCore Raw Depth API [21]).
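To make the data flow concrete, the sketch below shows one way to request paired RGB and depth frames from ARKit on a LiDAR-equipped iPhone; the class and variable names are illustrative assumptions and are not taken from our implementation.

```swift
import ARKit

// Minimal sketch (assumed structure): receive paired RGB + LiDAR depth frames from ARKit.
final class DepthCaptureSession: NSObject, ARSessionDelegate {
    let session = ARSession()

    func start() {
        let config = ARWorldTrackingConfiguration()
        // sceneDepth is only available on LiDAR-equipped devices (e.g., iPhone 12 Pro Max).
        if ARWorldTrackingConfiguration.supportsFrameSemantics(.sceneDepth) {
            config.frameSemantics.insert(.sceneDepth)
        }
        session.delegate = self
        session.run(config)
    }

    // Delivered at up to 30 FPS with an aligned color image and depth map.
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        let rgb: CVPixelBuffer = frame.capturedImage            // 1920x1440 color frame
        let depth: CVPixelBuffer? = frame.sceneDepth?.depthMap  // Float32 depth in meters
        _ = (rgb, depth)  // downstream stages consume these (rgb, depth) pairs
    }
}
```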

Wake Gesture
Like wake words (e.g., "hey Siri", "hey Google"), wake gestures [50] must be sufficiently unique so as not to trigger falsely or by accident. Although finger pointing is natural and common, it is uncommon for users to perform this gesture in front of their phones at close range, and thus we believe it can serve as a good wake gesture. Example WorldPoint wake gestures can be seen in Figures 1, 9, 10, 11, 12, and 13. This corresponds to the phone held at a comfortable reading distance, with the arm intentionally extended in front of the body as a trigger. This is most comfortable with the arm kept below the shoulder and with the elbow slightly bent. Note this keeps the arm considerably lower, and thus more comfortable, than if we had employed an EFRC pointing method.

Hand Detection
The first step of our software pipeline is to detect whether a hand is present in front of the smartphone (Figure 3, Detecting hand @ 1 Hz). For this, we use MediaPipe's Palm Detector [23] running as a TensorFlow Lite [12] model [24], with a confidence setting of 0.5. To conserve power, we downsample the 1920×1440 camera frames to 256×256 and run our model at 1 Hz, sleeping the rest of the time. If a hand candidate is detected, we then examine the bounding box to test if the hand is sufficiently large to be the user's. This eliminates other, distant hands in the scene (i.e., from other people) as well as user hands that are held too close or too far from the phone. If the hand passes our checks, we move to the next stage of our pipeline.
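As a concrete illustration of this gating check, the sketch below filters palm detections by bounding-box size; the thresholds and type names are illustrative assumptions, not the exact values used in our pipeline.

```swift
import CoreGraphics

// Sketch of the hand-size gate described above. The palm detector is assumed to
// return a bounding box normalized to [0, 1] within the 256x256 input frame.
struct PalmDetection {
    let boundingBox: CGRect   // normalized coordinates
    let confidence: Float
}

func isLikelyUsersHand(_ detection: PalmDetection,
                       minConfidence: Float = 0.5,
                       minArea: CGFloat = 0.03,       // rejects distant (other people's) hands
                       maxArea: CGFloat = 0.60) -> Bool {  // rejects hands held too close
    guard detection.confidence >= minConfidence else { return false }
    let area = detection.boundingBox.width * detection.boundingBox.height
    return area >= minArea && area <= maxArea
}
```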

Hand Pose & Pointing
With a candidate hand detected, we increase our sampling rate to 4 Hz. We run MediaPipe's Hand Landmark Model [24] (also as a TFLite model) on the candidate bounding box (confidence setting of 0.7; Figure 3, Index finger position @ 20 Hz). If a hand pose is generated, we then test to see if it is held in a pointing pose. For this, we use joint angles to test if the index finger is fully extended and the other fingers are angled and tucked in. If the pose passes our check, we continue to the next step of our pipeline. At this point, we also indicate to the user their "wake gesture" has been detected and tracked with a small onscreen icon (see also Example Uses section).
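The sketch below illustrates one way such a joint-angle test could be written against MediaPipe's 21-landmark hand model; the angle thresholds are illustrative assumptions rather than the tuned values from our implementation.

```swift
import Foundation
import simd

// Sketch of the pointing-pose check: index finger straight, remaining fingers curled.
// Landmark indices follow MediaPipe's 21-point hand model; thresholds are illustrative.
func angle(at b: SIMD3<Float>, from a: SIMD3<Float>, to c: SIMD3<Float>) -> Float {
    let u = simd_normalize(a - b), v = simd_normalize(c - b)
    return acos(simd_clamp(simd_dot(u, v), -1, 1)) * 180 / .pi
}

func isPointingPose(_ lm: [SIMD3<Float>]) -> Bool {
    guard lm.count == 21 else { return false }
    // Index finger (MCP 5, PIP 6, DIP 7): nearly straight at the PIP joint.
    let indexStraight = angle(at: lm[6], from: lm[5], to: lm[7]) > 160
    // Middle, ring, and pinky (MCP/PIP/DIP triples): clearly bent, i.e., tucked in.
    let othersCurled = [(9, 10, 11), (13, 14, 15), (17, 18, 19)].allSatisfy {
        angle(at: lm[$0.1], from: lm[$0.0], to: lm[$0.2]) < 120
    }
    return indexStraight && othersCurled
}
```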

Finger Ray Casting
With a hand now detected and held in a pointing pose, we increase our sampling rate to 20 Hz to provide a more responsive user experience. To compute a 3D vector for where the finger is pointing, we use the index finger's MCP and PIP keypoints, following the most common hand-rooted method of index finger ray casting (IFRC) [17] (see also Section 3.5 for a discussion of finger pointing methods and why we selected IFRC). We found this joint combination to be the most stable during piloting, though we note that other joints and even other methods are possible, such as regressing on the index finger's point cloud.
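A minimal sketch of this ray construction is shown below, with the MCP and PIP keypoints assumed to have already been lifted to 3D camera-space coordinates; the type and function names are illustrative.

```swift
import simd

// Sketch of the IFRC ray: originate at the index-finger MCP and pass through the PIP.
struct FingerRay {
    let origin: SIMD3<Float>      // index MCP, in camera/world space (meters)
    let direction: SIMD3<Float>   // unit vector toward the PIP (and beyond, into the scene)
}

func indexFingerRay(mcp: SIMD3<Float>, pip: SIMD3<Float>) -> FingerRay? {
    let d = pip - mcp
    guard simd_length(d) > 1e-4 else { return nil }   // reject degenerate keypoints
    return FingerRay(origin: mcp, direction: simd_normalize(d))
}
```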
Next, in order to ray cast the pointing vector into the scene and have it correctly intersect with scene geometry, we require 3D scene data (i.e., a 2D image is insufficient). For this, we use Apple's ARKit API, which provides paired RGB and depth images (Figure 4, RGB and Depth). From these sources, we can compute a 3D point cloud in real-world units. We use Apple's Metal Framework [5] to parallelize this computation. Once composited, we extend a ray from the index finger into the point cloud scene. As the point cloud is sparse, we identify the scene point lying within a threshold distance of the ray (Figure 4, Point Cloud), rather than requiring an actual collision.
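The sketch below shows one way to perform this tolerance-based intersection against the sparse point cloud, taking the closest qualifying point along the ray; the 5 cm tolerance and the nearest-along-ray choice are illustrative assumptions.

```swift
import simd

// Sketch: rather than requiring an exact hit, take the nearest cloud point (searching
// outward along the ray) that lies within a tolerance of the finger ray.
func pointedScenePoint(origin: SIMD3<Float>,
                       direction: SIMD3<Float>,          // unit vector
                       cloud: [SIMD3<Float>],
                       tolerance: Float = 0.05) -> SIMD3<Float>? {
    var best: (point: SIMD3<Float>, along: Float)? = nil
    for p in cloud {
        let v = p - origin
        let t = simd_dot(v, direction)
        guard t > 0 else { continue }                    // ignore points behind the hand
        let offAxis = simd_length(v - t * direction)     // perpendicular distance to the ray
        if offAxis <= tolerance, t < (best?.along ?? .greatestFiniteMagnitude) {
            best = (p, t)
        }
    }
    return best?.point
}
```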

Image Segmentation
There are several different ways the finger-pointed location in a scene can be utilized, which we elaborate in Example Uses (Section 6). As a proof-of-concept implementation, we use DeepLabV3 [14] trained on 21 classes from Pascal VOC2012 [19]. This model provides masked instance segmentation and runs alongside the rest of our pipeline at 20 FPS on our iPhone 12 Pro Max. For flat rectangular objects, such as receipts and business cards, we take advantage of Apple's built-in Rectangle Detection API [7]. Lastly, we note there are many other techniques for image segmentation, both classical and deep learning based, which are nicely summarized in [47].
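As a simplified illustration of the masking step, the sketch below keeps only pixels sharing the semantic class of the pointed pixel in the DeepLabV3 label map; the data layout and function name are assumptions, and a real pipeline would additionally restrict the mask to the connected component containing the pointed pixel.

```swift
// Sketch: look up the class at the pointed pixel in the DeepLabV3 label map and
// build a binary mask of pixels with the same class (class 0 = Pascal VOC background).
func maskForPointedObject(labelMap: [[UInt8]],              // H x W class IDs
                          pointedPixel: (x: Int, y: Int)) -> [[Bool]]? {
    guard pointedPixel.y >= 0, pointedPixel.y < labelMap.count,
          pointedPixel.x >= 0, pointedPixel.x < (labelMap.first?.count ?? 0) else { return nil }
    let targetClass = labelMap[pointedPixel.y][pointedPixel.x]
    guard targetClass != 0 else { return nil }               // pointed at background; no object
    return labelMap.map { row in row.map { $0 == targetClass } }
}
```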

Speech Triggers
To avoid the Midas Touch problem [35], finger pointing is best combined with an independent input modality that acts as a trigger or clutch. For this, we felt spoken commands were a natural complement (see example sequence in Figure 1). We enumerate many illustrative commands in the subsequent Example Uses section. To implement this functionality, we use Apple's Speech Framework [6] to register keywords and phrases, which then trigger event handlers for specific functionality.
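A minimal sketch of such keyword routing with the Speech framework is shown below; the class name and handler wiring are illustrative assumptions, and feeding microphone audio into the recognition request (plus permission handling) is omitted.

```swift
import Speech

// Sketch: scan each partial transcription for registered command words and invoke
// the matching handler. Audio capture and authorization are handled elsewhere.
final class CommandListener {
    private let recognizer = SFSpeechRecognizer()
    private let handlers: [String: () -> Void]
    private var task: SFSpeechRecognitionTask?

    init(handlers: [String: () -> Void]) { self.handlers = handlers }

    func start(with request: SFSpeechAudioBufferRecognitionRequest) {
        task = recognizer?.recognitionTask(with: request) { [weak self] result, _ in
            guard let self = self,
                  let text = result?.bestTranscription.formattedString.lowercased()
            else { return }
            for (keyword, action) in self.handlers where text.contains(keyword) {
                action()   // e.g., "attach" -> insert the segmented object into the email
            }
        }
    }
}

// Hypothetical usage:
// let listener = CommandListener(handlers: ["attach": attachPointedObject,
//                                           "copy": copyPointedObject])
```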

Commercial Practicality
Our implementation demonstrates that the constituent technical components are possible and present on modern smartphones. However, architecturally speaking, WorldPoint would be engineered in a totally different manner by smartphone makers. For example, restrictions in iOS mean that our pipeline must run as a full-blown user application and not a background process. Further limitations in the ARKit API mean that we must open a video stream, as opposed to capturing periodic and low-resolution photos, even when running our hand detection process at 1 Hz. For these and many other architectural reasons, our pipeline currently consumes an estimated 4.2 W when running and tracking a pointing finger. This was calculated using a USB power monitor when the phone was fully charged, measuring the difference between the phone open to the home screen (1.6 W) vs. running our app (5.8 W). If we assume pointing interactions needing our full pipeline last on average 5 seconds, this would consume 5.83 mWh of power (or 0.041% of the iPhone's 14,130 mWh battery). However, we include these figures for reference only, as we do not believe they are indicative of how such an interaction technique would be implemented or perform in reality.
Despite prototyping limitations, we still aimed to limit power consumption in several ways to better mimic a commercial implementation. First off, our pipeline does not run when the phone screen is off (which is generally most of the day). When the phone is on, we use a series of gated processes with increasing energy burdens, not only in terms of processing frequency (1 Hz, 4 Hz, and then finally 20 Hz), but also model complexity (from binary hand detection to 3D hand landmarks + object segmentation + voice keywords). We also note that the need for the full video stream + LiDAR data only arises after a finger pointing gesture has been detected. When no finger is pointing, our process only runs once per second, and this check could occur on a high-efficiency co-processor (in much the same way as always-on audio-based wake word detection occurs on modern hardware).
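This gating logic can be summarized as a small state machine, sketched below; the rates mirror those described above, while the structure itself is illustrative.

```swift
// Sketch of the gated duty cycle: each stage runs only when the cheaper preceding
// stage succeeds, so the expensive 20 Hz path is active only while a pointing pose is held.
enum PipelineStage {
    case idle            // screen on, no hand:  palm detection @ 1 Hz
    case handCandidate   // plausible hand:      3D hand landmarks @ 4 Hz
    case pointing        // pointing pose held:  landmarks + depth + segmentation @ 20 Hz

    var samplingHz: Double {
        switch self {
        case .idle:          return 1
        case .handCandidate: return 4
        case .pointing:      return 20
        }
    }

    func next(handDetected: Bool, pointingPoseHeld: Bool) -> PipelineStage {
        if pointingPoseHeld { return .pointing }
        if handDetected     { return .handCandidate }
        return .idle
    }
}
```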

EVALUATION
Human pointing accuracy (not mediated by a device) has been found in prior work to have an angular error of around 4-10° [13]. However, finger pointing accuracy using a smartphone held in a user's non-dominant hand has not previously been studied. Furthermore, we could not find any prior experiment that could offer a direct comparison to WorldPoint (i.e., similar range, scale, and task). For this reason, we ran a separate, matched study to establish baseline human pointing performance, which we discuss next. After this, we describe our main study evaluating the accuracy of pointing with our system.

Unmediated Finger Pointing Accuracy
We recruited 7 participants (mean age 24.3; six identified as female and four as male) to study innate human finger pointing accuracy without a mediating device such as a smartphone. One participant was left-handed, but we asked them to point with their right hand. We affixed a 5 mW red laser pointer to participants' index fingertips with electrical tape. The laser was turned on, and participants were allowed to adjust it until they felt the laser dot accurately represented their pointing location. This was the last time participants saw the laser dot in the study. From this point on, participants wore laser safety glasses that blocked the red dot, but not the environment.
Participants stood in front of a wall with a 5 × 3 grid of targets (50 cm interval; 85/135/185 cm from the floor), seen in Figure 5. Participants completed our procedure at three different distances from the wall, marked by tape on the floor: 80, 160, and 240 cm. Participants were asked to hold a smartphone in their left hand at a typical reading distance. The smartphone ran our study interface, which requested wall targets one at a time (15 possible targets, 3 repeats, random presentation order, one distance at a time). Participants pointed their finger at the requested target and, when confident in their pointing direction, announced aloud "next". This prompted the experimenter, on a laptop, to remotely trigger the smartphone (over a WiFi socket) to record the requested target, a photo, and a depth map. In later analysis, we used these data to manually mark the laser dot and target (visible in the RGB image, and located in 3D space using the paired depth map) and compute the Euclidean error. In total, this procedure produced 945 trials (7 participants × 15 targets × 3 repeats × 3 distances).

WorldPoint Accuracy
To evaluate pointing accuracy when using WorldPoint, we used an equivalent procedure to our previous study (such that we can directly compare results). Specifically, we use the same 5×3 grid arrangement of targets at the same three distances. No laser was worn this time, and participants held a smartphone running our pipeline in their non-pointing hand. As before, participants were told to hold the smartphone at a typical reading distance and were further briefed on how to perform the pointing wake gesture. The study interface ran on the smartphone and showed the next target when the wake gesture was correctly invoked (i.e., after passing our various false positive rejection heuristics).
As before, participants pointed to the target requested on the smartphone study interface, and when satisfied spoke aloud "next". The experimenter, on a laptop, remotely triggered the phone to record several pieces of data computed on the device in real-time: requested target, hand landmarks, pointing vector, RGB image, and depth map. Importantly, the study interface at no time showed participants any visual feedback for correctness (other than whether the pointing gesture was invoked or not), nor any live video/images from the camera. This design decision was to prevent any user adaptation to tracking errors.
For this study, we recruited 10 participants (6 of whom completed our other study; mean age 22.2; two identified as female and three as male). One user was left-handed, but they were asked to hold the phone in their left hand and point with the right. In total, we captured 1800 pointing trials (10 participants × 15 targets × 4 repeats × 3 distances).

Results
We first manually inspected all of the study data to codify any errors. Among the 945 trials from the unmediated finger pointing study (i.e., no smartphone involved), we could not locate the laser dot in 35 trials, and these were dropped from the analysis. Either the laser was pointing out of the scene or was too faint to be seen in the captured photo. We also inspected all 1800 trials captured in our WorldPoint accuracy study, finding 30 frames with obvious and gross hand tracking errors (an error rate of 1.7%), which we excluded from our later spatial accuracy analysis. There were also a handful of extreme outliers, which we filtered by dropping any trials more than 3 standard deviations from the mean in each distance condition (35 trials; 1.9% of trials). The latter errors could be detected in real-time and ignored in a fully developed pipeline.
Starting first with unmediated finger pointing accuracy, we found our participants had 16.6, 31.4, and 50.1 cm of Euclidean distance error at 80, 160, and 240 cm, respectively. Note this is 2D Euclidean distance error in the plane of the wall (where the ray intersects). However, it is known that humans have an offset between where they think they are pointing and where their finger vector is actually pointing [20,33,38,43,44]. For this reason, it is important to calculate a per-user offset and correct for this discrepancy. Such a one-time calibration would also have to occur for WorldPoint, perhaps as part of a setup wizard. To compute and apply this post-hoc offset correction, we followed the procedure used in [32,65]. Briefly, this procedure computes the mean Euclidean offset for each user at each distance and subtracts this systematic offset from all of their trials.
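Concretely, the correction can be written as below: for each user and distance, the mean 2D error vector in the wall plane is subtracted from that user's trials (a minimal sketch, assuming the per-trial error vectors have already been computed).

```swift
import simd

// Sketch: remove a user's systematic pointing offset at a given distance by
// subtracting their mean error vector (in the wall plane) from all of their trials.
func offsetCorrected(_ errors: [SIMD2<Float>]) -> [SIMD2<Float>] {
    guard !errors.isEmpty else { return [] }
    let mean = errors.reduce(SIMD2<Float>.zero, +) / Float(errors.count)
    return errors.map { $0 - mean }   // residual error used in the reported results
}
```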
After offset correction, unmediated finger pointing Euclidean distance error drops to 5.2 cm (SD = 5.0) at 80 cm, 7.1 cm (SD = 5.0) at 160 cm, and 10.7 cm (SD = 8.0) at 240 cm. With the same one-time offset correction procedure applied to the WorldPoint study data, our system achieved a Euclidean distance error of 9.2 cm (SD = 6.0) at 80 cm, 13.9 cm (SD = 10.0) at 160 cm, and 22.1 cm (SD = 15.0) at 240 cm. These results are shown in Figure 6. We also provide scatter plots of unmediated finger pointing and WorldPoint pointing in Figures 7 and 8.

Contextualizing Performance
Unsurprisingly, our system is not as accurate as unmediated finger input. We hypothesize this is due to a range of factors, including hand landmarking inaccuracies, noise in the depth map, and even extra cognitive load from holding and looking at a smartphone. Fortunately, the first two factors are almost certainly going to improve over time, and our proof-of-concept pipeline should be considered a baseline. Nonetheless, our current WorldPoint pipeline is sufficiently accurate to point to objects roughly 25×25 cm in size, even at a range of 2.4 m. For reference, this is roughly equivalent to a standard-sized piece of paper (US Letter or A4). At closer ranges, WorldPoint can target even smaller objects. We believe this is already sufficiently accurate to be useful.
Overall, these results are similar to prior studies investigating index finger ray casting (IFRC). Corradini et al. [17] reported laser-pointer intersection-to-target accuracies of roughly 8.6 cm and 16.2 cm at standing distances of 150 cm and 250 cm, respectively. This is very similar to the unmediated pointing error we found in our study (7.1 cm at 160 cm and 10.7 cm at 240 cm). Lastly, human pointing accuracy not mediated by a device has been found to have an angular error of around 4-10° (see survey in [13]).

Fig. 9. In this example sequence, a user is composing an email on their smartphone (A1 & B1). Without exiting the email client, the user points with their finger at a statue they have found (A2). This acts as a "wake gesture" and the phone begins to track what the user is pointing to via its rear-facing camera (C2) and LiDAR, offering a small onscreen preview (B2). Simultaneously, the phone begins to listen for spoken commands. The user says "attach" (A2), which causes the statue to be added to the email (B3). The user then drops their hand (A3) and continues on their way. A1-3: External reference photos; B1-3: Interface screenshots; C1-3: Views from a phone camera, unseen by the user. See also Figure 1 for different steps of the data pipeline.

EXAMPLE USES
We believe the most compelling use of WorldPoint is to augment traditional smartphone applications as a background process, as opposed to taking over the screen with a new interface. In this section, we describe four example categories of use. For each, we built one or more demo applications running our pipeline to help illustrate the potential of the use case (please also see Video Figure). Across all of our demos, we use a Siri-themed pointing hand icon to indicate to the user their wake gesture has been detected (akin to the Siri icon that appears following the "Hey Siri" wake word on iOS devices). We briefly conclude with other ideas we considered, but did not implement.

"A ach" from World
We created a WorldPoint-augmented email demo with the ability to quickly and conveniently attach images of objects in the real world, such as a document or meal. While composing an email (Figure 9, A1), users simply raise their hand to point to an object (Figure 9, A2). In addition to the WorldPoint icon, a preview of the attachment appears on the screen (Figure 9, B2). If the user wishes to attach an image of this object to their email, they simply say aloud "attach" (Figure 9, A2), without the need for any wake word. This interaction can be repeated in rapid succession for many attachments, or the user can end the interaction by releasing the pointing pose or dropping their hand (Figure 9, A3). Such an attach-from-world interaction need not be limited to an email client and is broadly applicable to any application capable of handling media, including messaging, social media, and note-taking apps.

Fig. 10. A user is preparing to sell their car. They go outside to take a photo, listening to their favorite music (1). The user points at the car and says "copy" (2), which copies an image of the car to the iPhone's clipboard. After returning to their office, the user can seamlessly paste it into an advertisement they are designing on their laptop using Apple's Universal Clipboard feature (3). A1-3: External reference photos; B1-3: Interface screenshots; C1-2: Views from a phone camera, unseen by the user.

"
Copy" from World Whereas our "attach" interaction directs media into the foreground application, it is easy to envision an application-agnostic, system-wide, copy-from-world-to-clipboard interaction.More speci cally, at any time, even when not in an application capable of receiving media, the user can point to an object and say "copy".This copies an image of the pointed object to the system clipboard for later use (Figure 10, B2).For this demo, we use iOS' UIPasteboard API.This means Apple's Universal Clipboard (where you can, for example, copy something on your iPhone, but paste it on your Mac) works with our WorldPoint demo (Figure 10, B3).

"Add to [App]"
WorldPoint could also support more semantically-specific interactions, such as pointing to a business card and saying "add to contacts" (Figure 11, A2) or pointing to a grocery item and saying "add to shopping list" (Figure 12, A2). As before, the latter interactions could happen while the user is in any application (without any need to navigate away from the current task), and the captured information would be passed to the app associated with the spoken command. As a simple demo of this functionality, we store captured objects in different lists. We did not, however, write code to parse text (OCR) from captured objects, though obviously this functionality exists.

Search & Information Retrieval
WorldPoint could also prove useful for search and information retrieval tasks for objects in the world. For instance, a user could be walking down the street scrolling through their Twitter feed, and while passing a restaurant, point to it and say "What's good to eat here?" (Figure 13, A1-3), "What's the rating for this place?" or "What time does this close?" In a similar fashion, a user could point to a car parked on the street and ask "What model is this?" or "How much does this cost?" Or, more generally, they could point to an electric scooter and say "Show me more info". These examples were canned demos for illustration (e.g., the restaurant sign was preregistered using Apple's ARKit ARImageAnchor API [4]), but in practice, WorldPoint would be tightly coupled to existing services like Google's Voice Assistant and Magic Lens [22].

Other Ideas Not Implemented
There were many other interaction ideas for WorldPoint we considered, but did not build into a demo. For example, in human-human interactions, finger pointing can be used to address and issue commands to other humans (e.g., "you go there"), and could likewise work for smart objects (e.g., "on" while pointing at a TV or light switch). Sharing of media is also possible, such as looking at a photo or listening to music on a smartphone, and then pointing to a TV and saying "share", "play here" or similar. It may even be possible to use technologies such as UWB to achieve AirDrop-like file transfer functionality by pointing to a nearby device. Users could also ask questions about the physical properties of objects, such as "How big is this?" or "How far is this?". A drawing app could even eye-dropper colors from the real world using a finger pointing interaction (e.g., "this color").

Other Pointing & Segmentation Modes
Our pipeline implemented a masking approach to tightly segment an object of interest (Figure 14B; see also Section 4.6). This not only brings natural focus to the pointed object but also serves to better preserve privacy (messy bedroom, sensitive documents, other people, etc.). However, this is only one way finger pointing could be utilized by end-user applications. For example, and perhaps most straightforward, is to take a small crop centered on the finger point (Figure 14C). This could be done in pixels (e.g., 200 × 200) or real-world units (e.g., 20 × 20 cm). Alternatively, the whole photo could be captured with a "laser dot" (Figure 14D) or "flashlight" effect (Figure 14E) applied to the image to denote a focus point. Another option entirely is for users to perform a lasso action, where they enclose a region in the scene by moving their finger point. The lasso can be terminated by releasing the pointing pose (e.g., transitioning to a fist or open hand gesture) or simply dropping the hand out of view. The lasso could be drawn on the whole image (Figure 14F), or used to "cut out" the lassoed area (Figure 14G).
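As one example of these alternatives, the pixel-crop mode (Figure 14C) is sketched below; the 200 × 200 window matches the example above, and the helper name and coordinate handling are illustrative assumptions.

```swift
import UIKit

// Sketch: crop a fixed pixel window centered on the pointed location (Figure 14C).
func cropAroundPoint(_ image: UIImage, center: CGPoint, side: CGFloat = 200) -> UIImage? {
    let rect = CGRect(x: center.x - side / 2, y: center.y - side / 2, width: side, height: side)
    guard let cropped = image.cgImage?.cropping(to: rect.integral) else { return nil }
    return UIImage(cgImage: cropped, scale: image.scale, orientation: image.imageOrientation)
}
```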

LIMITATIONS
By running on unmodified, off-the-shelf hardware, WorldPoint must accept some limitations. For instance, Apple's ARKit framework only supports the iPhone's wide-angle lens, but not the ultra-wide lens. This means our pointing gestures have to operate in a smaller volume in front of the phone, even though the device is equipped with a more suitable sensor. Similarly, on iOS, it is not possible to create a background process with intermittent access to the cameras and LiDAR sensor. This, of course, is not an inherent limitation: OEMs such as Apple have the ability to perform such commercial-level integration and optimization.
A more innate limitation is the occasional self-occlusion of a pointing hand captured from an ego-centric viewpoint. Put simply, in some cases we can easily see the hand, but not the finger that is pointing. This most often occurs when the user holds their hand in front of the phone, but rotates their wrist to point at something on the non-phone side of their body. This is mitigated in our present pipeline by not registering the hand as a valid wake gesture. However, this can cause frustration among users, who have to try pointing again, potentially adjusting their stance or body direction.
Another limitation is accuracy, both in finger pointing and in object segmentation. In our user study of a matched task, we found that human finger pointing had a mean error of 7.7 cm. With WorldPoint, the error was 15.1 cm under the same conditions, suggesting our system added roughly 7.4 cm of imprecision. To close this accuracy gap in the future, we believe more advanced hand pose models and higher-resolution sensors will have to be utilized. Nonetheless, our current system is sufficiently accurate for nearby interactions, such as pointing to a document on a table 2 m away, but is insufficient for smaller objects at long ranges.
Additionally, we used an off-the-shelf model, which is reasonably accurate at detecting and masking 21 classes. However, object masking is not perfect, creating graphical artifacts, and the class set is more limited than we might want in practice. Fortunately, deep learning researchers continue to make impressive strides in accuracy and runtime performance, so this capability looks only to improve over time.
We found that our WorldPoint app consumed approximately 4.2 W of power when running our full pipeline at full speed. This is a significant level of power draw and is simply not feasible to run continuously on a smartphone. However, our full pipeline does not need to run continuously. If pointing interactions last five seconds on average, this equates to 5.83 mWh of power, which has negligible impact on the iPhone 12 Pro Max's 14,130 mWh battery (i.e., each interaction would consume 0.041% of the battery). More important is the background process that periodically checks for the presence of a raised hand in front of the phone. This might run on the order of every second and would have to be very tightly optimized. It would likely have to follow a similar strategy as wake word detection on modern smartphones, where a dedicated co-processor monitors an always-on microphone.
Finally, in terms of validation limitations, we evaluated the pointing accuracy of our pipeline, but not the example interactions we developed. Also, as noted in Evaluation, we did not find sufficiently similar experiments in prior work (distance, device, target, task, etc.) that would provide a direct comparison, so we instead designed a paired study measuring unmediated human pointing accuracy.

Fig. 3. WorldPoint pipeline with key steps shown in the data pipeline (top) and user interface (bottom).

Fig. 4. From left to right: 1) RGB image with corners of the hand bounding box (green dots), hand landmarks (red dots), and index finger joints (blue dots) overlaid. 2) Depth map. 3) Composited point cloud with points intersecting the finger ray highlighted in light blue. 4) Object segmentation result. See also Figure 12 for external reference photos of the user and scene, as well as screenshots of the UI during this interaction.

Fig. 5. Wall with 5×3 grid of targets (50 cm interval) used for measuring the accuracy of unmediated finger pointing and pointing via WorldPoint. Participants repeated the study procedure at three distances, marked on the floor.

Fig. 6. The Euclidean error distance of unmediated finger pointing and pointing via WorldPoint. Error bars are standard deviations.

Fig. 7. Scatter plots of unmediated finger pointing trials at 80, 160, and 240 cm distances with per-user offset correction. The fifteen targets are shown as black crosshairs, and each colored dot is a single participant trial. An external reference photo of the setup can be seen in Figure 5.

Fig. 8. Scatter plots of all WorldPoint trials at 80, 160, and 240 cm distances with per-user offset correction. The fifteen targets are shown as black crosshairs, and each colored dot is a single participant trial. An external reference photo of the setup can be seen in Figure 5.

Fig. 11. In this example, a user points to a business card and says "add to contacts". A1-3: External reference photos; B1-3: Interface screenshots; C1-3: Views from a phone camera, unseen by the user.

Fig. 12. In this example, a user points to a beverage and says "add to shopping list". A1-3: External reference photos; B1-3: Interface screenshots; C1-3: Views from a phone camera, unseen by the user (see also Figure 4).

Fig. 13. While surfing the ACM Digital Library, a user passes a new restaurant. To learn more, they simply point at the restaurant and ask "What's good to eat here?". The phone summons relevant information. A1-3: External reference photos; B1-3: Interface screenshots; C1-3: Views from a phone camera, unseen by the user.

Fig. 14. A user's finger point could support different segmentation styles. A) View from phone camera; B) masked instance segmentation (what we implemented); C) crop of the pointed region; D) "laser dot" rendered onto the full image; E) flashlight effect; F) lassoing on the world image; and G) lasso mask.