UbiPose: Towards Ubiquitous Outdoor AR Pose Tracking using Aerial Meshes

Tracking the position and orientation, or pose, of a viewing device enables AR applications to accurately embed virtual content in physical spaces. Mobile OSs track pose by matching device camera images against street-level imagery. Thus, pose tracking is often unavailable at off-street pedestrian locations. UbiPose enables pose tracking at such locations using aerial meshes, generated from satellite imagery, that are likely to be more widely available at these locations. However, matching a camera image against an aerial mesh can be error-prone, even with modern neural matchers. These neural components are also compute-intensive. UbiPose contains a novel pose tracking pipeline that runs entirely on a mobile device using fast-path optimizations designed to accept or reject pose estimates in many cases, without sacrificing accuracy. Experiments on real-world traces show that it achieves tracking accuracy comparable to AR pose tracking in iOS in places where that is available, and is able to track pose accurately in places where it is not.


INTRODUCTION
Augmented reality (AR) applications place virtual content in a user's view of the physical world. These applications can revolutionize the way we interact with the physical world. They can deliver contextual cues that enrich our experience of physical spaces, such as museums, monuments and other historical sites. They can also promote safety during walking or driving by alerting us to impending dangers beyond our visual field of view. Motivated by this potential, the major mobile OSs, iOS and Android, contain mature AR application development platforms, ARKit [9] and ARCore [36] (respectively).
The central primitive that enables AR is the ability to track viewer pose: the position and orientation of the viewing device (a smartphone camera, or a headset), expressed in world coordinates. With accurate pose, applications can correctly align virtual content in the physical space captured within the viewer. Inaccurate alignment can adversely impact the user experience. In this paper, we focus on outdoor pose tracking, which both ARKit and ARCore support.
Both have built-in modules that provide continuous AR pose tracking. At a high level, these modules contain two components [8,36]: a localizer that estimates the current pose of a viewing device, and, because localization can be expensive, a tracker that tracks viewer movements over short timescales to update viewer pose in between localizer invocations. ARKit and ARCore use visual-inertial odometry (VIO) trackers, which fuse camera images with inertial sensor readings, and visual localizers that determine the pose of a viewer's camera based on a corpus of pre-collected images.
The visual localizers in ARKit and ARCore use terrestrial (or surface-level) imagery. This kind of imagery powers Google Street View and Bing Streetside. At scale, cameras attached to specially-outfitted vehicles periodically collect this imagery by sweeping large parts of the planet. As such, visual localizers in these AR platforms are likely to be unavailable in locations where at-scale imagery collection is difficult: pedestrian areas in corporate or university campuses, outdoor shopping areas, apartment complexes, and so on (§2).
To address this shortcoming, we explore the design and implementation of UbiPose, an on-device AR pose tracker that uses aerial imagery collected by earth observation satellites or aircraft (§3). This imagery, pre-processed into a 3D aerial mesh, is widely available in, for example, Google Earth [38]. Using aerial meshes can increase the availability of AR pose tracking to areas where terrestrial imagery is unavailable. By one estimate [48], Google Earth has 3× the coverage of Google Street View. To our knowledge, no prior work has explored AR pose tracking using aerial meshes (§6).
Challenges and Contributions. Visual localization using aerial meshes presents two challenges. The first challenge is ensuring accuracy of the estimated pose. Visual localizers match visual features in a device camera image to visual features in collected imagery [52,69]. Knowing the 3D positions of features in visual imagery (obtained offline), a localizer can estimate camera pose using the matched features. An aerial mesh is a compact 3D representation of the physical world generated from aerial 2D imagery. As such, it lacks some visual detail, so matching features in a 2D image to features in a 3D mesh can result in significant pose errors (§3.2).
Recent work [67] has shown that modern neural feature extractors and matchers can localize a camera image by matching it to images rendered from a terrestrial mesh. Aerial meshes are of poorer quality than terrestrial meshes, so UbiPose cannot directly use this approach (§3.2). Matching against features in the entire rendered corpus can result in false positives, leading to poor pose estimates at the tail (e.g., 95th percentile).
To address this, our first contribution is the design of a VIO-assisted localizer (§3.3). Instead of pre-rendering images and extracting features offline, UbiPose renders multiple images (for robustness to VIO errors) at the location determined by VIO, then extracts features from the rendered images and uses these to match and localize. This significantly improves tail localization accuracy by scoping the feature search.
However, this implies that UbiPose must render, extract, match, and localize on the mobile device. Especially with heavyweight neural extractors and matchers [24,74], this can easily overwhelm mobile device compute resources.
Our second contribution is the design of optimizations that reduce UbiPose's resource footprint (§3.4). Two optimizations, Lift-and-Project and Early-Exit, respectively allow fast-path acceptance and rejection of pose estimates by reusing successful matches from recent camera frames. A third, Fused-Match, projects features from multiple rendered images into one to reduce the number of invocations to the matcher when full-path processing is necessary.
Summary of Results. On traces collected in three major metropolitan areas in North America, UbiPose's accuracy is comparable to ARKit's (§4). In many of these locations, ARKit is unavailable: there, UbiPose is able to obtain pose estimates with the same accuracy (about 1 m positioning error and 1° orientation error at the tail) as in locations where ARKit is available. This is encouraging, and suggests that aerial-mesh based pose tracking can increase pose tracking coverage. UbiPose runs on modern mobile device hardware with modest resource footprints, and its optimizations reduce latency by 2× and power consumption by 20%.

BACKGROUND AND MOTIVATION
We first introduce background and terminology, then describe shortcomings in the state-of-practice in pose tracking.
Pose. This term refers to the position and orientation of an object along six degrees of freedom (6DoF): three translational (x, y, z) and three rotational (roll, pitch, yaw) axes [28]. Robotics, autonomous driving, and mixed reality applications need to estimate the precise pose of objects for motion planning or for rendering virtual objects in a scene.
Augmented Reality (AR). AR applications place virtual objects or markers in the physical world, as viewed through a device such as a smartphone or a headset. Indoor AR applications can enrich the viewing experience in museums, aid navigation in unfamiliar spaces, or enhance shopping in malls and stores [3,23,34,44,61,70,83]. Outdoor AR can enhance visitor experience at landmarks, aid pedestrian navigation, or gamify exercise [47,64,80,85,92].
As an example of outdoor AR, at WWDC 2020 Apple first demonstrated an AR art installation in San Francisco (Fig. 1). This installation used AR to create an immersive experience that inserted virtual art into the physical environment. Users could use their iPhones to view the installation and to interact with the virtual art piece. The ability to precisely position and orient the virtual art was crucial to the success of this demonstration. Without the proper alignment of virtual objects relative to the physical environment as seen by the viewing device (the iPhone in this case), users would have had a jarring and disjointed experience.
AR Pose Tracking. To accurately position a virtual object in the physical world, AR applications need a precise estimate of the pose of the viewing device (viewer pose) in the global coordinate system. Since the viewer can move through the physical space, AR applications must track viewer pose continuously and accurately to enable a smooth viewing experience. We call this the AR pose tracking problem.
In this paper, we focus on outdoor AR pose tracking. Of particular interest to us is the ubiquity of outdoor AR pose tracking. Ubiquitous pose tracking: (a) works on commodity mobile devices without assuming additional sensors or other specialized equipment, and (b) can track pose well in a wide range of outdoor physical spaces.
The State-of-practice in Outdoor AR Pose Tracking. While much research has explored outdoor AR pose tracking (§6), the state of practice in ubiquitous pose tracking is represented by Apple's ARKit [8] and Google's ARCore [36]. Apple used ARKit to develop the virtual art installation described above [4], and we use that as an example to explain how AR pose tracking enables such applications.
(1) In ARKit, the art installation app developer must provide a 3D model of the art piece, and a geo-anchor (the position in world coordinates at which to place the art piece).
(2) When a user opens the app on their iPhone, the app invokes ARKit and instantiates an AR session. ARKit then downloads a map for localization that contains visual features corresponding to the surrounding environment, together with feature positions in world coordinates.
(3) As the user moves the camera, ARKit captures camera images, matches them with features in the map, and uses these to estimate the pose of the camera in world coordinates. Using the pose, and a model for the camera and its parameters, ARKit can place the virtual art piece in the corresponding pixel locations on the iPhone display.
ARKit uses street-level terrestrial imagery to obtain features of the environment [8].
Gaps in the State-of-practice. To understand the accuracy and ubiquity of outdoor AR pose tracking, we evaluated ARKit in three metropolitan areas (§4).
Accuracy. ARKit achieves 0.5-1.1 m median position error, 0.9-2 m 95th-percentile (p95) position error, 0.5-1.2° median rotation error, and 1.1-2.3° p95 rotation error. To ensure good user experience, it is necessary to have low tail error in addition to having low median error, so we consider p95 errors throughout our work. In general, then, ARKit is remarkably accurate even at the tail.
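To make the reported metrics concrete, the following is a minimal sketch of how per-frame position and rotation errors, and their median and p95 summaries, could be computed. The text does not specify the exact error definitions used; the Euclidean distance and the angle of the relative rotation are common conventions and are assumptions here, as are all function names.

```python
import numpy as np

def position_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth positions."""
    return float(np.linalg.norm(np.asarray(t_est, float) - np.asarray(t_gt, float)))

def rotation_error_deg(R_est, R_gt):
    """Angle, in degrees, of the relative rotation between two 3x3 rotation matrices."""
    R_rel = np.asarray(R_est, float).T @ np.asarray(R_gt, float)
    # Clip to guard against values slightly outside [-1, 1] due to rounding.
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

def summarize(errors):
    """Median and 95th-percentile (p95) of a sequence of per-frame errors."""
    e = np.asarray(errors, float)
    return {"median": float(np.median(e)), "p95": float(np.percentile(e, 95))}
```

Reporting p95 alongside the median exposes occasional large errors that a median alone would hide, which is the tail behavior the text emphasizes.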
Ubiquity. While ARKit's GeoTracking works well in many places, it is limited to specific areas. It appears to use terrestrial imagery [4] collected for the Look Around [5] feature in Apple's Maps. This is similar to Google Maps' Street View [43], or Bing Maps' Streetside [56]. At scale, these companies capture most of this kind of imagery using vehicles driving on public streets or other areas accessible to vehicles. This means that ARKit pose tracking is unlikely to be available in pedestrian-only areas: outdoor malls, parts of college and corporate campuses, large apartment complexes, vehicle-restricted urban centers, amusement parks, and so on.
Apple's Maps app supports the ability to test for ARKit availability at a given location, by setting a geo-anchor at that location on a map. This allows us to check unavailability by virtually visiting the location. To demonstrate how widespread ARKit unavailability can be, we virtually visited off-street locations (parking lots, side streets) within about 20-30 blocks in the downtown areas of two cities in North America (locations omitted for anonymity). In each block, we tested one location for availability. Figs. 2 and 3 show that ARKit unavailability can be pervasive in these areas.
Android ARCore's pose tracking is also unavailable in some locations. ARCore exports an interface to test for availability of its visual positioning system (VPS), but in our experience, this interface is unreliable: it reports availability even in areas that we verified were unavailable. Instead, we physically visited four qualitatively different locations to determine ARCore availability. Tbl. 1 reports position and heading errors at these locations as reported by ARCore, as well as the corresponding errors at a street-side location near these locations. In each of these cases, errors at these locations are up to 10× greater than at the street-side; the latter has accuracy consistent with ARKit (§4). At these locations, ARCore appears to be estimating pose from GPS and its compass because imagery is unavailable for visual positioning. GPS is ubiquitous, but its position error (≤ 8 m 95% horizontal error) [31] is an order of magnitude larger than ARKit's.
Towards Ubiquitous AR Pose Tracking. This analysis suggests that outdoor AR pose tracking is not yet as ubiquitous as needed for widespread use of AR applications, both on iOS and Android. In this paper, we explore techniques to significantly increase the coverage of AR pose tracking beyond locations where street-level imagery is available.

UBIPOSE DESIGN
We now describe the design and implementation of UbiPose.

Increasing Availability of Pose Tracking
Goal. A mesh is a compact representation of a three-dimensional object: it approximates object surfaces using a mesh of polygons. The size of the polygons (most meshes use triangles or quadrangles) represents a trade-off between accuracy and compactness: smaller polygons result in a higher-resolution, more accurate mesh, but the resulting mesh can be larger in size than those using larger polygons. Beyond capturing the surface geometry of an object using polygons, mesh representations usually contain associated surface color and texture attributes. As such, meshes are often used to compactly store, and accurately render, 3D objects, for a wide variety of uses.
To generate a mesh, one can use sensors ranging from 2D cameras to 3D sensors such as stereo or RGB-D cameras and LiDARs. Today, most mesh generation at scale relies on 2D images. Photogrammetry [76] can generate a mesh representation of a (static) 3D object from a sequence of 2D images. Color mapping and texturing algorithms automatically project color and texture from images to the mesh surfaces.
An aerial mesh is a mesh of an outdoor space captured from aerial 2D imagery, obtained either using earth-observation satellites or specially equipped aircraft. (In contrast, a terrestrial mesh is captured using surface-level imagery.) Fig. 4 shows a picture of an aerial mesh obtained from Google Earth [38]. As such, aerial meshes can capture the structure of locations inaccessible to at-scale methods for capturing surface-level imagery, such as the vehicles used to capture Google Street View [43]. Examples of such locations include off-street plazas and malls, as well as privately-owned residential, corporate, and industrial complexes. By one estimate, aerial meshes from Google Earth cover up to 36 million square miles [48] (3× more than Street View) of the Earth's surface, or 98% of the human-inhabited regions of the Earth.
Challenges. While aerial meshes potentially provide greater coverage, we know of no work that has demonstrated accurate AR pose tracking using these. So, UbiPose must address two important challenges: (a) how to track AR pose using aerial meshes (§3.2) and (b) how to do so accurately (§3.3), and with minimal resources on mobile devices (§3.4).

Aerial Mesh Based Pose-Tracking
UbiPose's approach to pose-tracking using an aerial mesh is qualitatively different from that of prior work. Pose-tracking requires a fundamental primitive, image localization (estimating the pose of a given image), for which prior work has considered two broad approaches (Fig. 5).
Imagery-based Localization. Both ARKit and ARCore track pose using terrestrial imagery. While we do not know the details of their approach, other recent work has described a commonly used approach for large-scale localization [52,69] that relies on Structure-from-Motion (SfM). We begin by briefly describing SfM; [33] has more detail. In its simplest form, SfM, given a sequence of images from a camera, finds matching features between each camera pair, and uses this to estimate the 3D positions of each feature point in each image. These feature points form a localization map. Then, given a query image, imagery-based localization matches features in the image with features on the map, and uses the 3D positions of those feature points to localize the position of the query image.
Practical systems construct these SfM maps offline [52,69], but because SfM is compute-intensive, they scale map generation by tiling space and constructing SfM maps for each tile separately. They perform localization either on the mobile device, or offload localization to the cloud.
SfM from Aerial Mesh. As discussed in §3.1, UbiPose exploits the broad availability of aerial meshes to enable ubiquitous pose tracking. We designed a plausible approach for UbiPose that builds upon SfM, consisting of the following steps (top panel in Fig. 5) [...]
Unfortunately, this strawman does not perform well (§4): it has high median and p95 error, for three reasons. First, the quality of the SfM map depends on the positions on the mesh of the rendered images in step 1. Second, SfM does not exploit the fact that pixels in the rendered images already have associated 3D points (obtained from the aerial mesh); instead it infers their 3D positions, and these estimates are likely to increase error. Finally, and perhaps most important, images from a camera are different from images rendered by a mesh, since the latter represent a fundamentally different modality; sometimes, feature matching can find very few matches between a camera image and a rendered image (Fig. 6).
Mesh-Based Localization. MeshLoc [67] is a recent approach to localizing a camera image directly on a dense mesh. It circumvents the problem identified in Fig. 6 by leveraging the observation that modern neural local feature extractors, such as SuperPoint [24], are robust enough to match camera image features to features in rendered mesh images. At a high level, MeshLoc works as follows (middle panel in Fig. 5):
(1) Extract features offline by rendering images from the mesh. Obtain each feature's 3D position from the mesh.
(2) Given a query image, match that image's features with those obtained from step 1, then use the 3D locations of the matched features to localize the camera image.
MeshLoc uses state-of-the-art neural feature matching [74] and achieves, on a terrestrial mesh, p95 position and orientation error of about 0.5 m and 5° respectively [67].
MeshLoc on an Aerial Mesh. We evaluated MeshLoc on an aerial mesh. While it achieves low median error, it exhibits high p95 position and orientation error (§4). Images from an aerial mesh have visibly poorer visual quality than those from a terrestrial mesh in some places (Fig. 7), resulting in fewer matches. Because of the height at which they are captured, aerial meshes have fundamentally lower resolution and can contain distortions because they view vertical surfaces at an angle (unlike terrestrial images).
UbiPose's Approach. UbiPose builds upon these two approaches, but deviates from them in a fundamental way (bottom panel of Fig. 5): rather than extract features offline, it extracts features online entirely on the mobile device. More precisely, using a coarse pose estimate, it extracts features from the mesh that are likely visible at the estimated pose. In contrast, the approaches described above (SfM, MeshLoc) must search for features to match across all features extracted offline. This can result in false positives, especially for aerial meshes. UbiPose's design results in more robust feature matching, because the coarse pose scopes feature generation.
UbiPose uses visual-inertial odometry (VIO) to continuously obtain coarse pose estimates. VIO systems integrate measurements from visual sensors, such as cameras, with inertial sensors, such as accelerometers and gyroscopes, to produce the pose of a camera relative to a starting (or anchor) pose [45]. VIO is a mature technology and modern mobile OSs have integrated highly optimized VIO capabilities in the last few years [26].
UbiPose's AR pose tracking works, at a high level, as follows. It uses, as the camera pose, the VIO estimate relative to an anchor with known world coordinate position. VIO generates estimates at the frequency of the camera (30 fps). On a longer time-scale (e.g., once a second), it repeatedly invokes its localization pipeline to obtain a current estimate of the camera pose in world coordinates. If it is able to obtain a high quality pose estimate, it uses this as the VIO anchor for subsequent frames.
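The anchoring scheme described above can be illustrated with a short sketch, assuming poses are represented as 4×4 homogeneous camera-to-frame transforms. The representation and the function names (`make_pose`, `world_pose`) are assumptions for illustration and are not taken from UbiPose's implementation.

```python
import numpy as np

def make_pose(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(R, float)
    T[:3, 3] = np.asarray(t, float)
    return T

def world_pose(anchor_world, anchor_vio, current_vio):
    """Express the current VIO pose in world coordinates.

    anchor_world: camera-to-world pose from the localizer at the anchor frame.
    anchor_vio:   camera-to-VIO-frame pose reported by VIO at the same frame.
    current_vio:  camera-to-VIO-frame pose reported by VIO now.
    """
    # Camera motion since the anchor frame, expressed in the anchor camera frame.
    relative = np.linalg.inv(anchor_vio) @ current_vio
    return anchor_world @ relative
```

Between localizer invocations, only `current_vio` changes (at camera rate), so the per-frame cost is one matrix product; a new high-quality localization simply replaces `anchor_world` and `anchor_vio`.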

UbiPose Tracker
Alg. 1 describes UbiPose's (un-optimized) pose tracker. Invoked periodically every T seconds (T = 1 in our implementation) with the camera image captured at that instant as the query image, it first (Line 2) obtains the current pose estimate using VIO (§3.2). It treats this as a coarse pose estimate of the mobile device; VIO can drift over time, so by itself it is insufficient for accurate AR pose tracking. Moreover, as described above, VIO by itself produces poses relative to an anchor; in Line 2, P_V is a global pose relative to some anchor. We describe below how we obtain anchor poses.
Rendering. Line 3 renders K images from the aerial mesh at the VIO estimated pose. We discuss the details of rendering in §3.5, but rendering is fast, requiring only 30 ms on a mobile device. UbiPose renders K images, since rendering a single image produces poor results (§4), both due to errors in VIO, and because an aerial mesh has lower resolution and is often distorted (Fig. 7) so it produces fewer matches. UbiPose could have sampled K images at slightly different angles relative to the VIO pose P_V. This strategy increases the effective field-of-view (FoV), but does not lead to increased matches, since the query image has a fixed FoV, and these extra images overlap little with the query image since their orientation is different. UbiPose renders K images slightly differently: one with the virtual camera at P_V, one by moving the virtual camera forward 1 m along the viewing direction, and one by moving it backward by 1 m along the viewing direction. Since the virtual camera resolution is fixed, this results in more detail in the rendered image and/or additional features. Even though rendering is fast, K should be as small as possible to minimize resource usage; UbiPose uses K = 3. This strategy, empirically, produces accurate tracking while keeping resource usage within limits.
The Localizer. Line 5 of Alg. 1 invokes a localizer which returns a pose estimate based on the rendered images. UbiPose accepts this pose estimate based on its quality. One test for quality is the inlier ratio, the fraction of feature matches that correspond to the most likely pose as determined by RANSAC [30], as described below. Another is proximity to the VIO pose. When the inlier ratio is higher than a threshold δ_L and the pose estimate is within a distance threshold δ_P, UbiPose accepts the estimate, and uses this as an anchor for subsequent VIO estimates.
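The tracker steps described above (render K = 3 images around the VIO pose, localize, then accept or reject) can be sketched as follows. This is an illustrative reconstruction from the prose, not Alg. 1 itself: the `render` and `localize` callables, the 4×4 camera-to-world pose representation, the +Z viewing direction, and the threshold values a caller passes in are all assumptions.

```python
import numpy as np

def virtual_camera_poses(P_V, step_m=1.0, forward_axis=(0.0, 0.0, 1.0)):
    """Return three camera-to-world poses: at P_V, then 1 m forward and 1 m backward.

    forward_axis is the viewing direction in the camera frame; +Z is an
    assumed (OpenCV-style) convention.
    """
    poses = []
    for offset in (0.0, step_m, -step_m):
        shift = np.eye(4)
        shift[:3, 3] = offset * np.asarray(forward_axis, float)
        # Right-multiplying moves the camera along its own viewing direction.
        poses.append(P_V @ shift)
    return poses

def accept(inlier_ratio, P_est, P_V, delta_L, delta_P):
    """Quality test: enough inliers and close enough to the VIO pose."""
    dist = float(np.linalg.norm(P_est[:3, 3] - P_V[:3, 3]))
    return bool(inlier_ratio > delta_L and dist < delta_P)

def track_once(query, P_V, render, localize, delta_L, delta_P):
    """One tracker invocation; returns (pose, accepted)."""
    rendered = [render(P) for P in virtual_camera_poses(P_V)]
    P_est, inlier_ratio = localize(query, rendered)  # hypothetical localizer interface
    if accept(inlier_ratio, P_est, P_V, delta_L, delta_P):
        return P_est, True    # becomes the new VIO anchor
    return P_V, False         # keep using the VIO estimate
```

Passing `render` and `localize` in as callables keeps the control flow testable without a mesh renderer or neural models.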
The localizer (Alg. 2) is novel: it extracts features online (bottom panel of Fig. 5). Line 8 invokes a feature extraction module Extract on each of the rendered images, as well as the query image. Our current implementation uses SuperPoint [24], a robust neural feature extractor. Line 9 matches features between the query image and each rendered image using a feature matching module Match. Our current implementation uses SuperGlue [74], another neural model for feature matching. UbiPose can be easily extended to use other neural network models for feature extraction and matching.
Finally, Line 12 estimates the pose using PnP [49] and RANSAC [30]. This latter algorithm returns an inlier ratio, which estimates what fraction of matches do not correspond to outliers (often resulting from noise). UbiPose uses this to estimate the quality of the pose estimate, as described above. Using inlier ratios as an estimate of localization quality is fairly standard in the localization literature [14,22].
A hybrid design. VIO is fast, but drifts over time. The localizer produces good accuracy, but doesn't allow per-frame operation. UbiPose uses its localizer to periodically generate an accurate anchor, and uses VIO to calculate poses relative to the latest anchor in real-time.
Accuracy and Performance. As we show in §4, this approach results in accurate AR pose tracking. Relative to prior approaches, it is novel in rendering and extracting features online using VIO pose estimates. UbiPose's approach is necessary because straightforward extensions of prior approaches to aerial meshes do not produce accurate tracking (§4.3).
Table 2: Summary of optimizations in the optimized tracker.
Unfortunately, the tracker and localizer (Algs. 1 and 2) are resource hungry. Specifically, on a modern mobile GPU (the NVIDIA Jetson Xavier NX [62]), neural feature extraction and matching, using SuperPoint and SuperGlue, require approximately 35 ms and 150 ms per invocation respectively, and Alg. 2 invokes the former 4 times and the latter 3 times. The next section describes how UbiPose optimizes the localizer to reduce UbiPose's resource footprint, essential both to reduce the latency of pose tracking and to reduce power consumption.
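As a rough worked example of why the un-optimized localizer is expensive, the per-invocation figures above can be multiplied out. The sketch below counts only the neural components (it ignores rendering and PnP/RANSAC time), and the fast-path operation counts are the ones the text states later for Lift-and-Project (two Extract, one Match); the function name is an assumption.

```python
EXTRACT_MS = 35    # approx. SuperPoint latency per invocation, from the text
MATCH_MS = 150     # approx. SuperGlue latency per invocation, from the text

def neural_latency_ms(n_extract, n_match):
    """Approximate time spent in neural extraction and matching only."""
    return n_extract * EXTRACT_MS + n_match * MATCH_MS

full_path_ms = neural_latency_ms(n_extract=4, n_match=3)  # Alg. 2 as described
fast_path_ms = neural_latency_ms(n_extract=2, n_match=1)  # Lift-and-Project accept
print(full_path_ms, fast_path_ms)  # prints "590 220"
```

Under these approximate figures, the neural components alone take about 590 ms of each 1 s tracking period on the full path, versus about 220 ms when the fast path accepts.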

The Optimized Tracker
Overview. Tbl. 2 summarizes these optimizations. Two of these, Lift-and-Project and Early-Exit, represent fast paths for acceptance and rejection of pose estimates, respectively. Fused-Match optimizes the basic tracker when the fast paths do not lead to a conclusive acceptance or rejection.
These optimizations work together as follows:
(1) UbiPose runs the query image Q on Lift-and-Project. If that produces a good pose, it accepts that and returns.
(2) If not, it runs Early-Exit, which reuses computations from Step 1, but determines if the pose estimate is likely to be bad.
Lift-and-Project. This optimization (Fig. 8) reuses matched features from previous camera frames.
Key Idea. Consider a camera image c_i. Suppose some feature f matched a feature f′ in a rendered image (using Match). From this match, UbiPose can estimate the 3D position of f (it lifts f into 3D space). Now, consider a camera image c_j captured a short while later. UbiPose projects f onto a feature f′′ in c_j using the VIO estimates when c_i and c_j were captured. This exploits the accuracy of VIO over short timescales. Our key insight is that f′′ can be used to match features in c_j using very cheap feature-matching (e.g., nearest neighbor matching [58]), since f′′ is of the same modality (i.e., from a camera image) as features in c_j. The following paragraphs describe this approach in detail.
The Lift Cache. At the core of Lift-and-Project is the Lift Cache, a cache of lifted features and their corresponding 3D positions from recent camera images. Initially, this cache is empty. To fill this cache, UbiPose follows a Full path (Tbl. 2) to obtain a pose estimate. As part of this path, it obtains all the inlier feature matches from RANSAC (e.g., Line 5 in Alg. 1). It stores all inlier feature vectors and their corresponding 3D positions, as well as the current VIO pose, in the Lift Cache.
Using the Lift Cache. Alg. 3 (and Fig. 8) describes how UbiPose uses the Lift Cache once populated. Given a query image Q, it renders (Line 3) a single image (unlike the 3 images in Alg. 1), extracts features from Q and the rendered image, and matches them (Line 4). The goal of this step is to lift the matched 2D features in Q to 3D (Line 5), leveraging the fact that every pixel in the rendered image has an associated 3D position obtained from the mesh. Next, the algorithm projects each cached feature to Q, using the VIO pose obtained when storing the feature (Line 6). These projected features, and the lifted features in Line 5, are all from camera images. Instead of using Match, UbiPose [...] If it decides to reject the pose estimate (using the same criteria as in Line 5 of Alg. 1), UbiPose proceeds to check if it should exit early, described below.
If, however, it accepts the pose estimate, this fast path terminates (Line 10), requiring only one localization, one render, two Extract, and one Match operation (Tbl. 2). Before exiting, it adds the inlier matches to the Lift Cache (Line 11) and performs cache management operations (Line 12).
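The geometric core of Lift-and-Project can be sketched with a pinhole camera model. The sketch below is illustrative only: it assumes a rendered depth map is available for lifting, uses 4×4 camera-to-world poses with +Z as the viewing direction, and uses mutual nearest-neighbor matching on L2 descriptor distance as one example of cheap matching. All function names and the `max_dist` value are assumptions.

```python
import numpy as np

def lift(pixels, depth, K, cam_to_world):
    """Back-project (N, 2) pixels [u, v] of a rendered image to 3D world points."""
    pixels = np.asarray(pixels, float)
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth[v.astype(int), u.astype(int)]          # depth along the optical axis
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (cam_to_world @ pts_cam.T).T[:, :3]

def project(points_world, K, cam_to_world):
    """Project (N, 3) world points into an image; also flag points in front of the camera."""
    pts = np.hstack([np.asarray(points_world, float), np.ones((len(points_world), 1))])
    pts_cam = (np.linalg.inv(cam_to_world) @ pts.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0
    z = np.where(in_front, pts_cam[:, 2], 1.0)        # avoid dividing by non-positive depth
    u = K[0, 0] * pts_cam[:, 0] / z + K[0, 2]
    v = K[1, 1] * pts_cam[:, 1] / z + K[1, 2]
    return np.stack([u, v], axis=1), in_front

def nearest_neighbor_match(desc_a, desc_b, max_dist=0.7):
    """Mutual nearest-neighbor matches (i, j) between two descriptor sets."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    a_to_b = d.argmin(axis=1)
    b_to_a = d.argmin(axis=0)
    return [(i, int(j)) for i, j in enumerate(a_to_b)
            if b_to_a[j] == i and d[i, j] <= max_dist]
```

In this sketch, cached 3D points would be projected into the query image with the camera pose implied by VIO, and their stored descriptors matched against the query image's descriptors with `nearest_neighbor_match` instead of a neural matcher.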
Managing the Lift Cache. UbiPose evicts points in the cache that have not recently contributed to matches. For each feature point in the cache, it keeps track of how many subsequent camera images the point was visible in, and in how many of those it contributed to an inlier match. It evicts points whose ratio of contribution count to visibility count is below a certain threshold.
Early-Exit. Visual feature matching works well in environments with sharply defined static structures (such as buildings). If, however, at some locations, such structures are occluded (e.g., due to a tree), it may be difficult to get good visual localization. With aerial meshes updated on the timescale of months [35], feature matching may also not work well if the environment has changed (e.g., if trees have shed leaves, or been cut down). In these cases, UbiPose can avoid work by short-circuiting localization.
Our key insight is that UbiPose can reuse computations in Lift-and-Project to exit localization early if it is unlikely to be able to localize at that location. The inlier ratio, returned by RANSAC, is a signal for the quality of localization. In previous steps, a high inlier ratio signals a high quality localization. In Early-Exit, if the inlier ratio returned in Line 8 of Alg. 3 is below another threshold δ_E, UbiPose uses the VIO pose estimate and returns without proceeding further. Early-Exit is conservative, since Line 8 uses matches both from the Lift Cache and from Match on the single rendered frame.
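The way the two fast paths share one localization result can be sketched as a three-way decision. This is a minimal illustration under stated assumptions: the thresholds follow the text's δ_L, δ_P and δ_E, but their values are not given in the text, the sketch assumes δ_E is below δ_L, and the names `Decision` and `decide` are hypothetical.

```python
from enum import Enum

class Decision(Enum):
    ACCEPT = "accept"          # Lift-and-Project fast path: use the estimated pose
    EARLY_EXIT = "early_exit"  # localization unlikely to succeed: use the VIO pose
    FULL_PATH = "full_path"    # inconclusive: continue with further processing

def decide(inlier_ratio, dist_to_vio_m, delta_L, delta_P, delta_E):
    """Classify the outcome of the single fast-path localization."""
    if inlier_ratio > delta_L and dist_to_vio_m < delta_P:
        return Decision.ACCEPT
    if inlier_ratio < delta_E:
        return Decision.EARLY_EXIT
    return Decision.FULL_PATH
```

Because both tests read the same inlier ratio, Early-Exit adds no rendering, extraction, or matching work beyond what Lift-and-Project already performed.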
Thus, when UbiPose decides to Early-Exit on a query image Q, it re-uses the one render, two Extract, one Match and one localization operation performed for Lift-and-Project (i.e., it incurs no additional operations).
Fused-Match. If Lift-and-Project does not produce an acceptable pose estimate, and UbiPose does not Early-Exit, it can try to localize using the basic tracker in Alg. 1. To do so, it would invoke the following steps:
(1) Render images R_2 and R_3 and invoke Extract on them.
(2) Invoke Match pairwise on Q and R_2 and R_3.
(3) Feed these and the matches from Line 4 and Line 7 to the localizer, and decide whether to accept the resulting estimate or not (using the criteria in Line 5 of Alg. 1).
Because Match uses a neural feature matcher like SuperGlue, it is compute-intensive, so reducing the number of invocations in Step 2 above can reduce latency and resource usage. Fused-Match is an optimization that reduces the two Match invocations in Step 2 to one. It leverages the fact that the exact 3D positions of all features can be obtained from the aerial mesh. So, Fused-Match projects all features obtained in Step 1 to a single image plane, then runs Match once on that image (Step 2). This saves a Match invocation, but the way we have structured our optimizations, Fused-Match requires an additional invocation to the localizer relative to Alg. 1. Localization is faster than neural feature matching, so this results in higher overall efficiency.
Naïvely fusing feature points can result in multiple features from different images being projected to the same or nearby pixels. This can adversely impact Match accuracy, since matching relies on feature uniqueness. Fused-Match filters out a feature if it has already projected a feature to a nearby pixel. This results not only in more accurate matching, but also faster matching, since matching takes time proportional to the number of features.
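The filtering step described above can be sketched as a greedy pass over features that have already been projected onto the common image plane. The separation radius, the brute-force distance check, and the function name are assumptions for illustration; the text does not specify how "nearby" is defined.

```python
import numpy as np

def fuse_keypoints(keypoint_sets, min_separation_px=4.0):
    """Greedily keep features that do not land near an already-kept feature.

    keypoint_sets: list of (N_i, 2) pixel arrays, one per rendered image,
                   already projected onto the common image plane.
    Returns a list of (set_index, feature_index) pairs for the kept features.
    """
    kept_xy = []
    kept_ids = []
    for s, pts in enumerate(keypoint_sets):
        for i, p in enumerate(np.asarray(pts, float)):
            if kept_xy:
                dists = np.linalg.norm(np.asarray(kept_xy) - p, axis=1)
                if dists.min() < min_separation_px:
                    continue  # a feature already occupies this neighborhood
            kept_xy.append(p)
            kept_ids.append((s, i))
    return kept_ids
```

The brute-force check is quadratic in the number of features; a grid or k-d tree lookup would be a natural replacement if the feature count were large.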

Other Details
Obtaining Initial Pose. To produce a pose estimate in world coordinates, VIO needs an anchor. When an app initiates AR pose tracking, UbiPose needs to generate an initial anchor pose. It could have used GPS position and heading, but GPS can be erroneous in obstructed environments. Instead, it uses visual localization on the aerial mesh to obtain the initial anchor. Through relatively simple coordinate transformations, whose details we omit, it is possible to convert aerial mesh coordinates to world coordinates.
To obtain the initial anchor, UbiPose uses the GPS position and heading from the mobile device at the location at which the user initiates AR pose tracking to render five images at angles of 15° from the mesh, each with a 30° field-of-view. It proceeds to obtain a pose estimate using steps similar to Alg. 1: it invokes Extract on all images and Match on every pair, then runs the localizer. If the localizer returns a low inlier ratio, UbiPose rejects the pose and repeats with another camera image.⁴ This approach is an order of magnitude more accurate than simply using the GPS position, and UbiPose finds a good initial pose within 2-3 frames on average (§4).

Model compression. The state-of-the-art neural models UbiPose uses for extraction and matching, SuperPoint and SuperGlue, are too resource-intensive for the mobile device. We converted both models to a TensorRT [63] compatible format (as currently available, they use PyTorch [68] and the ONNX Runtime [25]). Using TensorRT, we quantized them to FP16 precision and used kernel auto-tuning to optimize them for our target hardware, the Jetson NX board. These two optimizations enable UbiPose to run both models efficiently on the mobile platform without significant loss of accuracy.

Rendering. To render an image from a mesh, UbiPose uses OpenGL, whose renderer takes the size of the image desired. The renderer also uses the mobile device camera parameters (the extrinsic and intrinsic matrices) to render images with content, quality and perspective similar to the camera image. Rendered images use ambient light on the mesh, with shadows disabled, in order to properly capture texture and color. All of these help increase the efficacy of feature extraction and matching. To achieve fast rendering, we re-implemented a Python renderer [54] in C++ and exploited the mobile GPU.
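For the initial-anchor step and the renderer's camera parameters described above, the view geometry can be sketched as below. The text does not say how the five views are arranged, so this sketch assumes they are spaced 15° apart and centered on the reported heading, and that the 30° field-of-view is horizontal; both function names are hypothetical.

```python
import math


def candidate_headings(gps_heading_deg, num_views=5, step_deg=15.0):
    """Yaw angles (degrees, in [0, 360)) centered on the reported heading.

    Assumption: views are evenly spaced by step_deg around the heading.
    """
    half = (num_views - 1) / 2.0
    return [(gps_heading_deg + (i - half) * step_deg) % 360.0
            for i in range(num_views)]


def focal_length_px(image_width_px, fov_deg):
    """Pinhole focal length in pixels for a given horizontal field-of-view."""
    return (image_width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)


print(candidate_headings(10.0))
print(round(focal_length_px(1920, 30.0), 2))
```

Under these assumptions, adjacent views overlap by half their width, so a building facade near the true heading appears in at least two rendered images even when the reported heading is off by several degrees.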

EVALUATION
We quantify UbiPose's performance using real-world traces.

Methodology
Implementation. Our implementation of UbiPose uses OpenCV [15] for image processing, OpenGL [90] for rendering, TensorRT [63] to compress models and Colmap [76] to estimate poses from 2D-3D correspondences. Our total implementation is over 7500 lines of C++ code.

Mobile Platform. Our implementation runs entirely on an NVIDIA Jetson Xavier NX (6-core 64-bit ARM CPU and 384-core Volta GPU) and we use this for all our experiments. Many mobile devices today have hardware comparable to, or better than, the Jetson NX. For example, Apple's Vision Pro [7] headset has an Apple M2 processor [6], which runs at a higher clock rate and contains more cores than the Jetson.

Trace Collection. We developed a simple data collection app on iOS, which allows us to collect traces of the camera images (at 1920×1440 resolution), the VIO poses of the camera, the camera intrinsics and extrinsics, as well as poses of the ARGeoAnchor (if ARKit's GeoTracking service is available). To capture real-world image quality, we collected all images using off-the-shelf iPhones. We used SensorLog [77] to collect other sensor data including GPS and heading information. We physically visited each location, collected the data and generated ground-truth using SfM (described below) on the traces. Evaluating each trace took several hours to a day, depending on trace length.

⁴ This runs during session initialization, which can take several seconds [8].

Table 3: Pose accuracy at locations where ARKit is available.
San Jose | Apartment | 0.5 (0.9) | 0.9 (1.9) | 0.5 (0.9) | 0.9 (1.4)
Trace Locations. We used this app to collect traces at about 17 different locations in three different U.S. metropolitan areas: Los Angeles, CA; San Jose, CA; and Pittsburgh, PA. In these areas, we collected traces at university campuses, shopping centers, streets, corporate campuses, and apartment complexes by walking with the camera held forward-facing in landscape mode. Our evaluation focuses on diversity in geography (3 cities) and location types (5). ARKit was available in less than half of these locations (at least one of which was off-street). All our evaluations use these traces, or a subset thereof. For the areas covered by our traces, the mesh sizes ranged from 30 to 160 MB, with an average of 65 MB. These are well within storage limits on modern mobile devices.

Aerial Meshes. We extracted 3D meshes from Google Earth [38] using Chrome [37], loaded them into Blender [12] using MapsModelImporter [55], and then exported them in the Wavefront format with texture mapping (.obj and .mtl files).
Metrics. We evaluate UbiPose using two primary metrics: the error in meters of the estimated camera position and the error in degrees of its orientation. For each trace, we compute the median and p95 values for each metric across all frames for which UbiPose invokes Alg. 3. Additionally, in some experiments, we also quantify UbiPose's median and p95 latency of estimating the pose of a frame, and the median and p95 power draw during the processing of a trace. For the latter, we leverage built-in power tracing in the NX.

Estimating Accuracy. To evaluate accuracy, we need pose ground-truth. Possible approaches to generating pose ground-truth include GNSS-RTK, OptiTrack and LiDAR-SLAM. In dense built environments targeted by AR applications, RTK can have high error due to reflections. OptiTrack [65] is more suitable for indoor settings [66]. LiDAR-SLAM can drift by tens of centimeters and requires calibrating the camera and the LiDAR, which can introduce more measurement error. Instead, we use pseudo ground-truth generated using SfM. Often used in the localization literature [13,72], this method may not perfectly estimate absolute error, but enables us to compare UbiPose with ARKit against a common reference. Pseudo ground-truth can be susceptible to local minima [13]. To overcome this, Brachmann et al. [13] suggest choosing evaluation thresholds large enough that the variation in the pseudo ground-truth is less likely to affect the measured performance. We choose p95 to account for such variations.
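The two accuracy metrics and their per-trace summaries can be sketched as follows. The text does not spell out the exact formulas, so this sketch makes common choices that should be read as assumptions: position error is the Euclidean distance, orientation error is the angle of the relative rotation between estimate and ground truth, and p95 uses linear interpolation between order statistics. All function names are hypothetical.

```python
import math


def position_error_m(p_est, p_gt):
    """Euclidean distance between estimated and reference positions."""
    return math.dist(p_est, p_gt)


def orientation_error_deg(r_est, r_gt):
    """Angle (degrees) of the relative rotation R_est^T R_gt.

    Rotations are 3x3 nested lists. trace(R_est^T R_gt) equals the sum
    of element-wise products of the two matrices.
    """
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding
    return math.degrees(math.acos(cos_angle))


def percentile(values, q):
    """q-th percentile (0..100) with linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    frac = rank - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def summarize(errors):
    """Return (median, p95) of a list of per-frame errors."""
    return percentile(errors, 50), percentile(errors, 95)
```

Reporting p95 alongside the median exposes the occasional large localization error that a median alone would hide.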
UbiPose Accuracy.To evaluate the accuracy of UbiPose, we generate an SfM model of the trace using Colmap.This produces a set of camera poses for each image and a point cloud.We align this point cloud to the aerial mesh using iterative closest point, ICP [11].This enables us to map the SfM's image poses to the mesh's coordinate system, and hence to evaluate the accuracy of UbiPose's poses against the ground truth provided by the SfM model.
ARKit accuracy. We use the same SfM model to estimate ARKit accuracy as well, but need a way to transform the camera pose exposed by ARKit's pose tracker (called GeoTracking). The tracker does not expose camera pose in world coordinates, but instead provides the pose of the geo-anchor (§2) and the camera pose in the AR session coordinates. To address this, we first calculate the relative pose between the geo-anchor and the camera at each instant. Then, we align the first image's pose in the trace with the same image in the SfM model. This enables us to transform each ARKit pose estimate into SfM space, where we can estimate its error.
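The two-step transformation above can be sketched with 4×4 homogeneous rigid transforms. This is an illustrative sketch under assumptions, not the authors' code: poses are taken to be frame-to-session (or frame-to-SfM) transforms stored as nested lists, and all function names are hypothetical.

```python
def translation(x, y, z):
    """Pure-translation transform, used here only to build examples."""
    return [[1.0, 0.0, 0.0, float(x)],
            [0.0, 1.0, 0.0, float(y)],
            [0.0, 0.0, 1.0, float(z)],
            [0.0, 0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t):
    """Inverse of a rigid transform [R | p], i.e. [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def arkit_poses_in_sfm(anchor_poses, camera_poses, sfm_first_pose):
    """Map per-frame ARKit camera poses into the SfM coordinate system.

    anchor_poses[i], camera_poses[i]: geo-anchor and camera poses in AR
    session coordinates at frame i. sfm_first_pose: SfM pose of frame 0.
    """
    # Step 1: camera pose relative to the geo-anchor at each instant.
    relative = [matmul(rigid_inverse(a), c)
                for a, c in zip(anchor_poses, camera_poses)]
    # Step 2: fix the anchor in SfM space so that frame 0 coincides.
    align = matmul(sfm_first_pose, rigid_inverse(relative[0]))
    return [matmul(align, r) for r in relative]
```

By construction the first transformed pose equals the SfM pose of the first image, so any error measured on later frames reflects how ARKit's estimates evolve relative to that shared starting point.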

ARKit Comparison
We first compare UbiPose against ARKit using 17 traces whose average duration is 259 s, ranging from 129 s to 346 s. ARKit is available⁵ at some locations and not in others, so we discuss these separately.

Locations where ARKit is available. Tbl. 3 shows the results of seven traces, labeled A-G, at locations where ARKit was available. These locations span public streets in cities (A-C), universities (D-E), and a public street through an apartment complex (G). One of these, F, is in a pedestrian-only zone on a university campus, evidence of terrestrial imagery collection using pedestrians or bicycles.
UbiPose achieves about 0.5 m median and a little over 1 m p95 positioning error. Its median orientation error is, in most cases, 1° or less, and its p95 orientation error ranges from 1.2° to 2°. UbiPose's performance is not strongly correlated with the city or the type of location in which a trace was collected. For example, across traces D-F, at a university in Los Angeles, it has both the lowest and one of the highest median positioning errors. In contrast, ARKit exhibits median positioning errors of 0.5 m to almost 2 m in some cases, and p95 positioning errors well over 1 m. Its orientation errors, however, are low: with a couple of exceptions, the median (p95) orientation error is 0.5-0.9° (less than 2°).
Tbl. 3 also shows in bold, for each trace, which approach has better position and orientation accuracy. Generally, UbiPose's positioning accuracy is better than ARKit's, and its orientation accuracy is slightly worse. ARKit's better orientation results may be a result of better fusion of its inertial sensors; UbiPose can be extended to exploit these sensors, and we have left this to future work.
There are some exceptions. In G, ARKit is better both in position and orientation. In C and F, UbiPose is better than ARKit in both. In these, ARKit indicated that it had low confidence in its pose estimates. For all three of these cases, the difference comes down to the quality of the imagery: in G the mesh quality is poor, and in the other two, it is better than the terrestrial imagery (we are not sure why, but we note that F is at an off-street location on campus, so imagery there was likely obtained using pedestrians). This suggests that a hybrid approach which matches both aerial and terrestrial imagery might give good uniform AR pose tracking performance.
Overall, we conclude that, at least for the traces we have studied, UbiPose's accuracy is comparable to that of ARKit.

Locations where ARKit is not available. Tbl. 4 shows UbiPose's accuracy at locations where ARKit is not available. These span all three of our cities, and are from a range of locations: corporate and university campuses, apartment complexes, and outdoor shopping areas.
In these locations, ARKit cannot track pose, but UbiPose is able to do so uniformly well. Its median positioning error ranges from 0.27 m to 0.61 m and its p95 positioning error is less than 1.2 m. Its median orientation error ranges from 0.27° to almost 1°, and its p95 orientation error is below 1.8°. These are qualitatively consistent with UbiPose performance in Tbl. 3, as one might expect: UbiPose accuracy shouldn't correlate with where terrestrial imagery is available.
In these locations again, accuracy does not seem to correlate with city or type of location, at least from our samples. Overall, these results suggest that UbiPose can increase coverage of AR pose tracking to locations where ARKit is not available, such as those in Figs. 2 and 3. More important, it can do so without impacting accuracy of tracking in those locations, an important consideration for app developers whose users expect uniform quality of experience.

Comparison With Other Approaches
SfM from Aerial Mesh. This builds an SfM model from rendered aerial mesh images, and localizes a query image using the SfM model (§3.2). SfM usually uses SIFT features and nearest-neighbor matching (SIFT+NN). Because SuperPoint features and SuperGlue matching (SP+SG) perform better on meshes [67], we built an SfM model that uses these.
Tbl. 5 depicts results from this experiment. In this experiment, unlike the prior one, we present the results of image localization alone. In other words, we do not use VIO estimates to track pose. We do this to understand whether UbiPose could have used this approach instead of its localizer.
SIFT+NN exhibits unacceptable performance (Tbl. 5), with median positioning error of over 30 m, and median orientation error of 90°. SIFT features do not generalize to aerial mesh rendered images, since these have a very different visual appearance than camera images (§3.2). On the other hand, SP+SG has very low median position and orientation error, but its p95 errors are completely unacceptable (over 50 m and over 100° respectively). As discussed earlier, this results from distortions and low resolution of aerial meshes.

MeshLoc on Aerial Mesh. MeshLoc performs visual localization by extracting features offline from a terrestrial mesh, then matching the query image online against those features. In this section, we evaluate a MeshLoc-like approach, but use an aerial mesh to generate features.
Tbl. 6 shows the accuracy of localizing a query image purely using MeshLoc. The table shows results for traces in which MeshLoc performance deviates significantly from UbiPose. Thus, for traces not listed in Tbl. 6, MeshLoc has accuracy comparable to UbiPose's localizer. MeshLoc compares well with UbiPose in median position and orientation error across all of these traces. However, its p95 errors are generally worse, and in some cases substantially so. For example, in G, it has a p95 position error of over 7 m and an orientation error of over 8°. Other traces where both of these values are high include J, A and E. On a terrestrial mesh, by contrast, its p95 position error is under 0.5 m [67]. As discussed in §3.2, we attribute this to the lower resolution of, and distortions in, the aerial mesh. To be robust to these, UbiPose renders multiple mesh images for feature extraction and matching at the estimated pose.

Quantifying The Optimized Tracker
Fast Path Invocations. Tbl. 7 shows the statistics, across all our traces, of the percentage of frames which benefited from Lift-and-Project, and from Early-Exit. Both optimizations are crucial for UbiPose. Over 70% of frames across all traces, and over 90% of frames in one of them (I), use Lift-and-Project. A smaller percentage of frames on average (about 7%) use Early-Exit, but in one of our traces, B, it is used 46% of the time. Thus, while Lift-and-Project is uniformly useful, Early-Exit is absolutely critical for at least one of our traces. As we show below, these fast path invocations result in lower latency and lower power consumption.

Impact of Optimizations. Tbl. 8 compares UbiPose to the basic tracker in Alg. 1.
Accuracy. In theory, our optimizations can potentially degrade accuracy. For example, Lift-and-Project projects matches from previous frames assuming short-term VIO pose stability, which can introduce error if that assumption is violated. However, at least for two of our traces (representing traces with ARKit and traces without), relative to the un-optimized tracker, UbiPose has comparable median and p95 position and orientation errors.
Latency. To measure the latency, we instrumented the C++ implementation of UbiPose to record the time required for pose tracking for each image. UbiPose's median latency per frame is 2× smaller than that of the un-optimized version. UbiPose currently localizes camera frames every second, so our optimizations free up resources for other tasks on the mobile device. In contrast, without these optimizations, the device would be busy most of the time running UbiPose. UbiPose's p95 latency is higher than the un-optimized tracker because of its additional localizer invocation (§3.4). Individually, the median latency for Lift-and-Project is 220 ms, for Early-Exit is 90 ms, and for Fused-Match is 420 ms. However, the latency speed-ups of UbiPose's optimizations cannot be attributed to individual modules since they rely on each other (Early-Exit and Fused-Match rely on Lift-and-Project).
Power consumption. Because we target mobile deployments, power usage is an important factor to consider. UbiPose consumes 5.7 W (median) and 6.9 W (p95), both of which are lower than the basic algorithm's power consumption of 7 W (median) and 7.8 W (p95). For context, the iPhone 12 Pro Max's GPU has comparable average power consumption [32]. UbiPose's improvements come from the Lift-and-Project optimization that reuses feature points, thereby reducing the number of neural network model inferences required.

Other Results
Memory footprint. UbiPose needs 700 MB for the mesh, 1 GB for TensorRT-generated NN models, and 20 MB working memory. These are well within the Jetson's 8 GB RAM.

Initial Pose. Tbl. 9 displays the accuracy of initial pose estimation (§3.5). For context, it also shows the GPS error at that location. UbiPose's initial pose is 5× more accurate than GPS location, and it finds an acceptable initial pose within 2-3 frames on average, with a p95 position error of 2.3 m and p95 orientation error of 2.5°.

One-shot Neural Matchers. Recent fast one-shot neural matchers [20,81] perform Extract and Match in one step, and are plausible candidates for UbiPose's localizer. However, LoFTR [81], when used for visual localization using an aerial mesh, incurs a position error of 10 m (70 m) and orientation error of 22.9° (145°) on trace F (and comparable error on trace H, omitted for brevity), too high to be useful for UbiPose.

Generalizing to other Cameras. Our experiments use iPhone cameras. AR headsets have qualitatively different cameras; for instance, the Hololens has a grayscale camera. To test whether UbiPose generalizes to these, we collected a trace using a grayscale stereo camera with lower resolution and different calibration and auto-exposure algorithms. Unmodified UbiPose achieves, on this trace, a position error of 0.2 m (0.5 m) and orientation error of 0.4° (1.7°). This result suggests that UbiPose may be able to generalize to AR headsets.

LIMITATIONS AND FUTURE WORK
Mesh Availability. The degree to which UbiPose enables ubiquitous pose estimation depends on mesh availability, which in turn depends on mesh providers. In UbiPose's experiments, we used the aerial mesh from Google Earth [38]. Prior work [10] shows that Google Earth covers 97% of the U.S.'s and 86% of Canada's metro areas [18,39]. In some suburban or rural areas, Google Earth only provides satellite images (2D) but not meshes (3D). In these, UbiPose will not be able to provide accurate pose estimates, but neither can terrestrial imagery [59].

Mesh Freshness. The freshness of the aerial mesh can impact the accuracy of pose estimation. We have observed in some of our experiments that stale meshes containing trees with or without leaves, or new construction, can result in fewer matched features between camera images and the mesh rendered image, leading to degraded accuracy or even localization failure (§3.4). Localization using terrestrial imagery can be similarly impacted by stale images. One might assume that terrestrial imagery would always have higher freshness, but anecdotal evidence suggests otherwise. In one of our evaluation locations, the aerial mesh was from 2022, the street-level imagery from 2016. In another, street-level imagery was from Nov 2022, aerial from May 2022.

Quality of aerial mesh. While our experiments use the highest quality mesh available from Google Earth, we find that aerial meshes have poorer quality than terrestrial meshes. Our techniques in §3 essentially compensate for this quality difference (§3.3). If higher quality aerial meshes are available in the future, UbiPose might be able to achieve accurate, and cheaper, pose estimation.
To illustrate the quality differences between aerial and terrestrial meshes, we took a terrestrial mesh from Aachen, Germany (used by MeshLoc [67]) and obtained an aerial mesh for the same location. By analyzing these, we found three factors that impact UbiPose's localization accuracy. Aerial images have a different perspective than a mobile device camera image; this negatively impacts feature correspondence. Because aerial images are captured from a distance, aerial meshes often lack texture, resulting in poorer feature matching. For a similar reason, the 3D positions of points in the aerial mesh can be inaccurate, relative to a terrestrial mesh. This reduces the accuracy of PnP [49] (§3.3).

Other Designs and Extensions. Future work can explore offloading computation from lower-end mobile devices to edge and/or cloud [41,89] to maintain performance SLOs. We have left it to future work to integrate our implementation into iOS and Android. Apple doesn't provide guidance on how to run

RELATED WORK
Sparse feature-based visual localization.Existing visual localization approaches [42,50,69,72,73,76,82] represent scenes using sparse features [21], often in the form of Structure-from-Motion (SfM) point clouds containing local features extracted from database images.These approaches use different feature extraction (e.g., SIFT [51], ORB [71] and SuperPoint [24]) and matching techniques (e.g., nearest neighbor [58], SuperGlue [74] and D2-Net [29]).As we show in §4 they are unsuitable for use with aerial meshes.UbiPose uses multi-view matching techniques differently than the existing approaches [19,53]; it matches feature points across two modalities and does not need to triangulate to find the 3D positions of the feature points.
Mesh and Dense 3D model based localization. LandscapeAR [17] estimates pose to within 100 m using digital elevation models (DEMs). Zhang et al. [93] use learned features and view synthesis to generate reference poses for a given query image, but this takes 10-20 s per frame. MeshLoc [67] uses terrestrial meshes for neural feature extraction and matching. UbiPose obtains high pose accuracy in areas where terrestrial meshes may not be available and can run fast on mobile devices.

SLAM based visual localization. SLAM [28] simultaneously estimates pose and builds a 3D map of an environment. Visual SLAM [19], like SfM, builds sparse 3D feature maps of the environment. If these maps were widely available, they could potentially enable ubiquitous pose tracking, but would have the same drawback as terrestrial imagery: at scale, they could only be collected using vehicles [1]. UbiPose, using aerial meshes, enables wider coverage for AR pose tracking.
Other camera pose estimation and tracking approaches.

CONCLUSIONS
UbiPose extends coverage of AR pose tracking on mobile devices to areas where terrestrial imagery is not available. It uses aerial meshes and a novel tracker and localizer that uses VIO estimates to render images from the aerial mesh on the mobile device, then performs feature extraction and matching. It uses several optimizations to reduce the high cost of neural extraction and matching, and achieves accuracy comparable to ARKit in areas where that is available. In off-street locations without terrestrial imagery, UbiPose is able to accurately track pose while ARKit cannot.

Figure 2: ARKit unavailability in the downtown area of Arcadia, CA, USA.

Figure 4: A screenshot of an aerial mesh from Google Earth.

Figure 5: Different approaches to AR pose tracking at scale.
(a) Render images from different viewpoints on the aerial mesh; (b) Use these rendered images to obtain an SfM model; (c) Use the SfM model to estimate the camera pose for AR pose tracking by matching camera image features to those in the SfM map.

Figure 6: Cross-modality matching between a camera image and an image from an aerial mesh results in relatively few matches.

Figure 7: An image rendered from a terrestrial mesh (left) and from an aerial mesh (right). Notice the distorted arch on the right side of the aerial mesh image.

Algorithm 2: Localizer
1 Function Localizer(Query image Q, List of rendered images R, Initial correspondences C)
7   Invoke Extract on Q and each rendered image in R;
8   for each rendered image r in R do
9     Invoke Match between Q and r;
10    Add new 2D to 3D correspondences from matches based on OpenGL depth map;

(3) If it does not Early-Exit, UbiPose uses Fused-Match to reduce Match invocations.

Figure 8: Illustration of the Lift-and-Project optimization.

Table 1: ARCore reported position (in m) and heading (in °) errors at the location and on the nearest street.
Compute the current pose estimate P_V using VIO; Use P_V to render K images R_1, ..., R_K from aerial mesh;

Algorithm 3: Optimized Algorithm
1 for each query image Q do
2   Compute the current pose estimate P_V using VIO;
3   Use P_V to render one image R_1 from aerial mesh;
4   Run Extract on Q and R_1 and Match on the pair;
6   Project matches M from the Lift Cache;

Table 5: Pose Accuracy by SfM from Aerial Mesh.

Moreover, position accuracy is not a predictor for orientation accuracy: N has low positioning error, but high orientation error, while H has low error along both dimensions. Accuracy depends entirely on the mesh quality, which is likely a function of when, and from what height, it was collected. It is also a strong function of the degree to which the environment has enough visual features (obtained from street markings, buildings etc.) to enable matches. This is true for ARKit as well; in wide open spaces, both ARKit and UbiPose may not be able to track pose as accurately.

Table 6: Accuracy comparison between UbiPose and MeshLoc. Best alternative in bold.

Table 7: Percentage of frames that benefit from optimizations.

Table 8: Latency of UbiPose in pose tracking with optimized and basic algorithm.

Table 9: Initial pose accuracy of UbiPose.

and optimize third-party ML models on Apple's neural engine [40]. Android's TensorFlow Lite quantization tools [84] don't improve SuperGlue performance in our experiments, and TensorRT, which we rely upon for model compression, doesn't support Apple hardware as a back-end. Finally, both UbiPose and ARKit estimate pedestrian-carried camera poses; localization of aerial (e.g., drone) camera images [21] is an interesting direction for future work. Future work can also train feature extractors and matchers for cross-modality matching, detect VIO drift to optimize visual localization invocations, and explore extensibility to AR headsets.