Maestro: The Analysis-Simulation Integrated Framework for Mixed Reality

Mixed reality devices with near-eye displays unlock new possibilities for innovation and user experiences. Mixed reality applications require a new unified framework that enables seamless analysis of the real world and simulation of realistic virtual content. Designing such a framework faces various challenges, including the huge programming effort of building analysis and simulation pipelines, and inconsistencies between real-world and virtual content caused by the end-to-end process across the pipelines. This paper proposes Maestro, an analysis-simulation integrated framework for mixed reality applications. Maestro provides a programming model for effective application representation and control, aiding runtime optimization. The Maestro runtime takes an object-level execution approach to minimize misalignment, integrating the simulation and analysis pipelines so that applications process individual objects based on their latency sensitivity. Our evaluation shows that Maestro improves streaming accuracy by up to 1.6× compared to existing frameworks and effectively expresses nine qualitatively distinct workloads expected in prospective mixed-reality applications.


INTRODUCTION
Ally wants to purchase a new sofa and uses a furnishing app [11] on a mixed reality (MR) device with a near-eye display, such as Google Glass [4] or Oculus Quest [20], to place various virtual sofas in her living room. She then plays a pet breeding game to play with and feed a virtual cat [50]. The latest breakthroughs in deep neural networks (DNNs) enable smooth and accurate interactions in such MR apps, including placing a virtual sofa with hand gestures, locating food to give instructions to the virtual cat, and realistically rendering the sofa and cat with light estimation [22,40,52]. However, the active use of complex DNNs in an MR app likely causes a subtle but irritating mismatch between the virtual content and the real scene, such as the sofa moving slightly out of sync with the hand, due to high execution latency, making such DNNs challenging to adopt.
The core challenges in building such a DNN-enabled MR app lie in the seamless integration of two separate computation pipelines: (i) an analysis pipeline to analyze the surroundings with multiple DNNs, and (ii) a simulation pipeline to present the virtual world on the display.
• Inconsistency between the real scene and virtual contents. The simple black-box integration of the two pipelines induces continuous inconsistency between the virtual contents and the real scene (seen by bare eyes or video pass-through) [48]. We observe that such inconsistency results from periodic frame-level execution and synchronization. First, frame-level analysis and simulation, i.e., completing the analysis of an entire input frame and then performing the corresponding simulation, often result in high end-to-end latency. Frame analysis with multiple DNNs incurs high processing latency, while the simulation can start only after the whole frame analysis completes, causing perceptual quality degradation. Second, the simulation pipeline runs at its own duty cycle and is loosely synchronized with the analysis pipeline. Hence, the simulation does not start right after the analysis but rather with a considerable delay (e.g., 33 ms for a 60 FPS display). Such delay results in misalignment between the real scene and the virtual content, especially with fast camera movements or content changes. We observed that 9 out of 11 users notice even a single frame (e.g., 9-16 ms) of delay in real-world usage.
• Absence of a unified framework. Developing and integrating the two pipelines involves huge programming effort. Tab. 1 summarizes existing MR frameworks. The pipelines are often developed with different systems: the former with mobile DNN frameworks [18,43,44] such as TensorFlow Lite, and the latter with game engines such as Unity [3] and Unreal Engine [1], each of which has a steep learning curve. MR apps require sophisticated implementations in each framework for fast multi-DNN execution with complicated 3D virtual scene simulation. Recently, game engines have begun to support DNNs with XR SDKs [9,12,20] to reduce programming effort. However, they are limited to outdated task-specific DNNs (e.g., hand pose) [21], and state-of-the-art inference acceleration techniques [43] are hardly applicable.
We propose Maestro, an analysis-simulation integrated framework for DNN-enabled MR apps. Maestro offers a programming abstraction to represent the analysis and simulation operations of MR apps with a simple graph while giving cues for runtime optimization (§4.2). Maestro provides a highly optimized runtime engine that accelerates the execution of the graph to minimize perceivable inconsistency (§5). It efficiently handles all underlying technical complications, such as tight synchronization between the two pipelines and operator scheduling across heterogeneous mobile processors. We implement the system with a mobile multi-DNN framework [43] and Unreal Engine [1], and show that our runtime significantly reduces the inconsistency (up to 1.6× higher streaming accuracy) compared to the typical black-box integration of the two pipelines. We also demonstrate that our programming interface easily expresses nine distinct MR apps (§6).

Runtime engine for high consistency. Our runtime engine accurately aligns the real scene and the corresponding virtual contents by minimizing the end-to-end latency of analysis and simulation. Our key approach is object-level execution and on-demand synchronization, distinguished from prior frame-level periodic processing. First, it uses an object as the unit of analysis and simulation. This enables prioritizing the processing of important objects in a scene, e.g., fast-moving objects, reducing the end-to-end delay for such objects. This minimizes the inconsistency between the real scene and virtual content, enhancing the user's perceptual quality. Second, it supports fast synchronization between the analysis and simulation pipelines such that an individual object can be simulated immediately once its corresponding analysis is complete, rather than waiting until the entire frame analysis is over.
We devise two mechanisms to realize our approach: (i) object-level scheduling and (ii) object-level simulation. First, the scheduling mechanism identifies the operator subgraphs involved in analyzing and simulating individual objects. It then prioritizes the subgraphs based on object importance (e.g., movement speed or eye gaze) and allocates them to heterogeneous mobile processors to minimize the processing delay. Second, our object-level simulation mechanism enables the system to instantly incorporate changes to objects during an ongoing simulation cycle, a feature lacking in previous simulation engines. This capability allows new objects to be simulated in real time and seamlessly overlaid onto unchanged parts of a scene, ensuring efficient object-level scheduling with negligible delay.

Programming interface. Our graph abstraction is an intuitive extension of existing dataflow abstractions, but it is unique in (i) supporting both DNN-based analysis and simulation in a single graph and (ii) facilitating object-level analysis and simulation. In particular, the graph comprises four types of operations: (a) analysis with DNNs, (b) conversion for pre- or post-processing, (c) branch for explicit object-wise control flow, and (d) simulation to the virtual world. The branch and simulation operations serve as important hints for runtime optimization, such as identifying the operations associated with each object. We found that our abstraction model is able to express nine qualitatively different apps, encompassing workloads expected in prospective MR (e.g., virtual home and office) [26], ranging from analyzing distinct objects to analyzing multiple properties of a single object, such as human posture and facial and hand landmarks [15].

We summarize our contributions as follows:
• Maestro is the first to tackle the inconsistencies arising from the end-to-end analysis and simulation process in MR devices, such as see-through and pass-through head-mounted displays. We envision this endeavor towards integrating both pipelines as a cornerstone for enhancing the realism and immersion of future MR apps.
• We design the runtime to take object-level execution to minimize the inconsistency between real-world and virtual content. Maestro achieves up to 1.6× higher streaming accuracy.
• We propose a comprehensive programming abstraction that accommodates both DNN-based analysis and simulation. We show that our API facilitates prospective mixed-reality scenarios without any framework-level modification.
• We implement Maestro on Unreal Engine [1] and a multi-DNN inference engine [43] (atop TensorFlow Lite [28]), and evaluate them with 3 apps built with state-of-the-art DNNs on 3 flagship mobile devices that share computational characteristics with future MR devices.

MOTIVATION: USER PERCEPTION IN MR
Background. Mixed reality apps are compatible with a range of device platforms. First, head-mounted displays (HMDs) include pass-through and see-through devices. Pass-through devices, such as the Meta Quest series [20] and Apple Vision Pro [25], present the real world through images captured by external cameras. See-through devices, such as Microsoft HoloLens [17] and Google Glass [4], overlay virtual contents on a transparent screen on top of the real world seen by the bare eye. Second, commodity mobile phones offer mobile augmented reality (MAR) capabilities, such as Pokémon GO [7]. MAR displays both virtual content and the real world, captured through the rear camera, on the device's screen.

Table 2: User-perceived inconsistencies between real-world and virtual contents. A and S represent the analysis and simulation latency, respectively. For example, with see-through devices, users perceive reality in real time with the bare eye (0† ms) along with virtual contents rendered on screen after analysis and simulation (A+S), resulting in inconsistencies from the (A+S) latency difference.

Content type        See-through   Pass-through   Mobile AR
Real (bare-eye)     0†            -              -
Real (screen)       -             S‡             A+S
Virtual             A+S           A+S            A+S

† We consider human latency in processing visual stimuli as a baseline (zero) for simplicity.
‡ Pass-through devices immediately simulate (i.e., render) the pass-through camera frame, unlike mobile AR, where rendering can be deferred to align with the analysis output. Any delay in rendering could cause side effects like motion sickness because of proprioception (i.e., the sense of self-movement, force, and body position).

User perception. User perception in MR differs depending on the platform. Fig. 1 illustrates the perception models, and Tab. 2 details the user-perceived latencies of real-world and virtual contents on each platform. On HMDs, users directly perceive reality, with near-zero latency on see-through devices. This results in noticeable inconsistencies with virtual content generated through the end-to-end process of subsequent analysis and simulation (columns 1 & 2). The inconsistency may not be a problem in MAR, where the user's perception remains within the display due to attention tunneling [58], and the real world (i.e., the input camera frame) is displayed aligned with the virtual contents (column 3).

DESIGN CHALLENGES
A framework for DNN-enabled mixed reality apps needs to minimize user-perceived inconsistencies between the real world and virtual content. Unfortunately, no existing framework supports streamlined execution across the analysis and simulation pipelines to achieve this goal, especially for pass-through and see-through devices.


Strawman Design
Baseline integrated framework. We first design an integrated baseline framework to explore the design challenges. It harnesses mobile DNN frameworks, which support diverse DNNs, and combines them with game engines' high-quality simulation capabilities.
For the analysis pipeline, we choose mobile DNN frameworks [28,43,44] that support custom DNNs on various processors (e.g., EdgeTPU [13], Hexagon DSP, and HTA [34]). Other frameworks cater to DNN analysis pipelines, but they do not align with our purpose. First, XR SDKs, such as ARCore [12] and the Oculus SDK [21], offer features like surface detection and hand tracking; however, their lack of generalizability restricts the use of custom DNNs. Second, AR frameworks, such as MediaPipe [41,54], offer a simple abstraction for expressing apps more effectively but suffer from runtime challenges due to their frame-level processing scheme.
To bolster the simulation pipeline, we adopt game engines, such as Unreal and Unity [1,3]. They are crafted to facilitate high-quality rendering on mobile platforms and to dynamically reflect the user's head motion in real time.

Framework operation. The integrated framework consists of two pipelines: an analysis pipeline for DNN executions, and a simulation pipeline for simulation tasks such as animation, physics, and input processing, alongside rendering. Fig. 2 depicts the operation of these two pipelines.
The analysis pipeline operates at the camera frame rate, for example, every 33 ms for 30 FPS. Each frame undergoes preprocessing, such as YUV-to-RGBA conversion, and is input into a DNN inference engine. The inference outputs then undergo post-processing, including tasks that generate bounding boxes and human poses. Meanwhile, the simulation pipeline is triggered at each display refresh, known as vertical sync (VSync). When activated, the pipeline executes three tasks: (i) input, gathering information related to virtual contents; (ii) simulation, composing the virtual contents; and (iii) rendering, rasterizing (i.e., drawing) the scene onto the display.

End-to-end latency breakdown. Fig. 3 provides a breakdown of the e2e latency. Following the completion of a DNN inference, its output is fed into the next closest simulation step, introducing a delay to synchronize with VSync. In modern game engines, simulation (on the CPU) and rendering (on the GPU) are pipelined to mitigate the critical path of processing a single frame within a tight display interval.
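To make the synchronization delay concrete, the sketch below models the timing just described; the additive latency model and helper names are ours for illustration, not part of any engine API.

#include <cmath>

// Delay from analysis completion until the next VSync tick.
// t_done: timestamp when DNN post-processing finishes (ms).
// vsync_interval: display refresh interval, e.g., 16.6 ms at 60 Hz.
double SyncDelayMs(double t_done, double vsync_interval) {
  double phase = std::fmod(t_done, vsync_interval);
  return (phase == 0.0) ? 0.0 : vsync_interval - phase;
}

// E2E latency of one frame under frame-level execution: all stages are
// serialized, and the result becomes visible only after the next VSync.
double E2ELatencyMs(double preprocess, double inference, double postprocess,
                    double simulate, double render, double vsync_interval) {
  double analysis_done = preprocess + inference + postprocess;
  return analysis_done + SyncDelayMs(analysis_done, vsync_interval) +
         simulate + render;
}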

Runtime Challenges
This section illustrates the challenges in designing an integrated framework. We focus on two sources of extra delay: (i) frame-level processing in multi-DNN analysis (C1) and (ii) the synchronization delay to the simulation (C2).

Frame-level Analysis (C1).
The first challenge stems from the frame-level execution model of the analysis pipeline. An entire input frame is analyzed with multiple DNNs for various tasks (e.g., person and object recognition for information overlay, hand tracking for interaction, and surface detection for virtual object placement), and all analysis results are conveyed to the simulation step in one go. The more DNNs the app employs for frame analysis, the larger the delay grows. For example, in the MediaPipe person pose estimation app [16], rendering two people on a Galaxy S20 device takes ≈83 ms with the GPU. Rendering the pose of the first person incurs an unnecessary delay of 39 ms to wait for the pose analysis of the second person.
The frame-level analysis precludes more frequent rendering opportunities to update partial analysis results, causing higher inconsistency between the real world and virtual contents. The challenge is exacerbated by a notable discrepancy between camera and display rates (Tab. 3). MR device displays typically have a screen refresh interval of ≈8-16 ms, which is 2-4× faster than the camera frame rate that triggers DNN computation. This disparity results in multiple missed rendering opportunities for every frame analysis.

Synchronization Delay (C2).
The second challenge stems from the loose synchronization between the analysis and simulation pipelines, which reflects only sporadic analysis results. Fig. 4 shows the DNN computation and e2e latency of an object detection app. We implemented the app using an Unreal Engine 4.27 Blueprint script and TensorFlow Lite v2.9.2 on a Google Pixel 4, with SSD-MobileNetV2 running on the Google Edge TPU (Pixel Neural Core). Fig. 4 (left) illustrates that the e2e latency reaches up to 4× the DNN computation latency. Also, Fig. 4 (right) indicates a step-wise behavior (groups 1 and 2) in the e2e latency. For group 1, the DNN analysis is completed near the start of the next simulation cycle. For group 2, the DNN processing finishes when a new simulation cycle has just started; in this case, the results are not reflected until the next simulation cycle, incurring an additional delay (∼16.6 ms for a 60 Hz display).

This issue arises from the inherent nature of the pipelines. MR hardware, specifically near-eye displays, operates periodically at a fixed rate, often incorporating a low-persistence mode for motion blur reduction [33]. Furthermore, the mobile DNN execution pipeline is recognized for its unpredictability, attributed to the performance heterogeneity of DNN accelerators and latency fluctuations [43].

Perceptual Impact of E2E Latency
Two factors, besides the e2e latency itself, contribute to the inconsistency between real-world and virtual contents.

Real-world object movement. Each real-world object exhibits its own unique movement, which significantly influences perception. Fig. 5 shows a comparative example of the object keypoint similarity (OKS) [51] across different e2e latencies. An object with faster movement tends to experience a sharp decline in quality as the e2e latency increases, whereas a slower-moving object maintains overall quality. This highlights that each object has its own sensitivity to latency, depending on its appearance and movement characteristics.

Device movement. The movement of the device introduces a global misalignment that significantly influences perceptual quality. To understand the user impact of e2e latency when visualizing various DNN outcomes, such as object detections and key points, we conducted a user study. The study aims to measure the perceptual latency threshold while users move the device freely. Unlike other MAR apps, we directly render the camera frame to simulate inconsistency similar to the pass-through case shown in Tab. 2.
We recruited 11 participants, all of whom were familiar with AR apps (4 females, 7 males, aged 22-37). We utilized a mobile AR app featuring three DNN tasks on three devices with varying FPS. For the study, we employed the just-noticeable-difference method [45,55] with the staircase procedure [35]. Participants were tasked with distinguishing between a baseline of a one-frame delay (i.e., the inevitable minimum delay) and a target delay, starting from two frames. Fig. 6 shows the result. Users report that even a single-frame delay is noticeable in over 82% of cases, irrespective of task and device FPS, under typical device movement in real-world scenarios. The one-frame delay in visually complex tasks such as segmentation and key-point detection is always observable on two of the devices (Pixel 6 and Galaxy S21). We anticipate that this discrepancy may be exacerbated in scenarios involving near-eye displays and moving real-world objects.

MAESTRO DESIGN & OVERVIEW
Maestro is designed to meet three core objectives: (i) comprehensive support for both analysis and simulation pipelines, (ii) high consistency between real-world and virtual contents, and (iii) letting developers build and optimize MR apps without intricate technical knowledge.
In this light, Maestro takes a parallel approach to designing a unified runtime engine and abstraction model. The key novelty lies in leveraging the virtual scene of the game engine as the external state of analysis to decouple the dependencies between the two pipelines. Fig. 7a depicts the overall design concept. In Maestro, the analysis pipeline asynchronously updates the state of virtual objects in the scene, while the actual simulation process (e.g., rendering) is delegated to the simulation engine. This approach contrasts with the previous frame-level (serial) analysis pipeline, which synchronously processes both pipelines frame by frame, resulting in large e2e latency and underutilization of the display rate. Each node of the app graph represents one of four building blocks: analysis (N_A), conversion (N_C), branch (N_B), and simulation (N_S).

Approach: Object-Level Execution
Maestro centers its execution on a new processing unit, an object, instead of a frame. It reduces the e2e latency of individual objects and prioritizes salient objects, such as those that are moving quickly or being interacted with, to minimize the perceptual inconsistencies between virtual content and the real world. The advantage of object-level execution is two-fold.
• Rapid analysis-simulation of each object. Maestro identifies the subset of operations from the app graph that affects the analysis and simulation of a single object. These subsets are then processed in a batch without interspersing operations that process other objects (C1). This allows the analysis outcome of each object to take effect and be simulated without additional synchronization delay (C2).
• Prioritizing latency-sensitive objects. Maestro supports apps in defining a unique priority for each object, taking into account their latency-sensitivity trade-off. This allows various ways of prioritizing real-world objects (e.g., based on movement speed and content variability) [31,65,66].
Maestro ensures a single simulation process for individual objects through a parallel pipeline design. This implies that updates for objects from the same camera frame may be simulated in different iterations of the game engine running at the display rate.

Abstraction: MaestroGraph

Maestro models an MR app as a directed acyclic graph called MaestroGraph. Fig. 8 shows three examples. The graph starts with a source node (indicating a camera input) and ends with simulation nodes that create or modify virtual content, while multiple intermediate nodes process the input data. The graph comprises four types of operators and their connections.

Analysis (N_A) represents the DNN execution of the analysis pipeline. It investigates various attributes of the user's surroundings, such as bounding boxes and landmarks of real-world objects like chairs, humans, and specific locations [32,38,40,61].

Conversion (N_C) represents any connective operation between components, such as data type conversion between other building blocks.

Branch (N_B) represents a dynamic execution depending on the input content. It is distinguished from the others in that its progress is determined at runtime depending on the outcome of the analysis. We currently support two types: (i) the ForEach node takes an array of objects and propagates each object to the following node (for instance, passing each of the detected faces from a face detection DNN to a face recognition DNN); at runtime, Maestro dynamically duplicates the child subgraphs of the ForEach node depending on the input array size. (ii) The Switch node routes the input to the following node with a predicate.

Simulation (N_S) represents one update from the app to the virtual scene for simulation. The simulation node is the key element that allows asynchronous execution. It updates the state of the virtual object in the virtual scene (e.g., a Level in Unreal Engine) to delegate the actual simulation (e.g., animation, physics, and app logic) and rendering operations to the game engine. For instance, the node can relocate or create a new virtual sofa close to the surface indicated by the user's index finger in the pet breeding app. The engine subsequently renders the scene with the updated object.

Architecture Overview
Fig. 9 describes the architecture overview and operational flow. Maestro consists of four main components: (i) oTask analyzer, (ii) oTask scheduler, (iii) oTask matcher, and (iv) object-level simulator. Given the graph, the oTask analyzer partitions it into oTasks (subsets of operators for handling individual objects) in the offline phase (§5.1). In the run-time phase, the execution planner initiates a request with each new camera frame. The oTask scheduler creates a schedule to execute oTasks based on their importance to maximize perceptual consistency (§5.2), and the execution planner dispatches the constituting operations of the oTasks to workers that are mapped to heterogeneous processors. The oTask matcher correlates a new object update with existing virtual objects for correct representation (§5.3). Finally, the object-level simulator immediately updates the state of the corresponding object to further reduce the latency to display (§5.4).

MAESTRO ENGINE
This section details how Maestro's runtime tackles the challenges in realizing the object-level approach.
• Maestro must identify oTasks from the graph that relate the processing operations sharing a target object (§5.1).
• Maestro must schedule oTasks based on their sensitivity to latency and minimize e2e latency (C1) (§5.2).
• Maestro must integrate object updates from consecutive input frames to represent the entire scene (§5.3).
• Maestro must simulate each virtual object without delay (C2) (§5.4).

oTask Analyzer
oTask analyzer partitions the graph into multiple oTasks to isolate the processing of each object. The analyzer utilizes the key characteristic of the simulation node (N_S) to identify each object; it is the only node type that can update virtual objects. Hence, the preceding operators of an N_S represent the processing for that object. It is not efficient to process each object entirely independently because some operators, such as pre-processing, are sharable across multiple objects. Therefore, it is necessary to share those operators while preserving the distinction between individual objects' computations. We define an oTask as a set of nodes that share the same reachable set of N_S nodes. Fig. 10 shows an example result of partitioning: the nodes highlighted in red and green share the same reachable simulation nodes and thus form the shared oTask_det and the per-object oTask_pose^(i), respectively. oTask_det, which contains the ForEach node (indicated in the figure), can reach an unbounded number of N_S^(i) nodes in theory, so it is distinguished from the oTask_pose^(i)s.

Alg. 1 describes the partitioning algorithm. The analyzer takes a MaestroGraph and identifies the nodes that change the reachable set of objects (i.e., diverging nodes, as well as the N_S nodes). Marking them as seed nodes (L5), it sorts them in topological order (L6). Then, iterating through the seeds in reverse order, it groups all operations that are not yet visited and that converge to the seed node into an oTask and stores it (L8-10). Finally, it collects the edges across oTasks to maintain the global dependency (L11).

Algorithm 1: oTask partitioning.
1:  procedure Partition(graph)
2:      otask_list ← [ ]
3:      visited ← {}
4:      // seeds: diverging nodes and N_S nodes
5:      seeds ← GetSeedNodes(graph)
6:      seeds ← TopologicalSort(graph, seeds)
7:      seeds ← Reversed(seeds)
8:      for seed in seeds do
9:          otask ← GetPredecessors(seed, graph, visited)
10:         otask_list.Append(otask)
11:     otask_edges ← GetInterOTaskEdges(otask_list, graph)
12:     return (otask_list, otask_edges)
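As an illustration of the grouping criterion (nodes sharing the same reachable set of N_S nodes form one oTask), the following is a minimal, self-contained sketch assuming an adjacency-list graph with topologically ordered node ids; all identifiers are illustrative rather than Maestro's internals.

#include <map>
#include <set>
#include <vector>

struct GraphNode {
  bool is_simulation;          // true for N_S nodes
  std::vector<int> children;   // ids of downstream nodes
};

// Group nodes by their reachable set of simulation nodes: nodes with the
// same reachable set form one oTask, so sharable operators (e.g.,
// pre-processing) stay in a single group while per-object subgraphs split.
std::map<std::set<int>, std::vector<int>> PartitionIntoOTasks(
    const std::vector<GraphNode>& graph) {
  // Assumes node ids are topologically ordered (children have larger ids),
  // so a reverse sweep visits every child before its parents.
  std::vector<std::set<int>> reachable(graph.size());
  for (int id = static_cast<int>(graph.size()) - 1; id >= 0; --id) {
    if (graph[id].is_simulation) reachable[id].insert(id);
    for (int child : graph[id].children)
      reachable[id].insert(reachable[child].begin(), reachable[child].end());
  }
  std::map<std::set<int>, std::vector<int>> otasks;  // reachable set -> members
  for (int id = 0; id < static_cast<int>(graph.size()); ++id)
    otasks[reachable[id]].push_back(id);
  return otasks;
}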

oTask Scheduler
oTask scheduler receives the partitioned oTasks and schedules them to workers. The scheduling decisions are two-fold: (i) determine the order of oTasks, and (ii) assign nodes to processors (abstracted as workers) for actual processing.

Problem formulation. The scheduler aims to minimize the misalignment between real-world objects and their virtual representations by ordering the executions of oTasks. To quantify the misalignment, we use a proxy representing the displacement per time, Δ_o, for an object o in a scene s. The e2e latency T_o for the object o is determined by the execution order of the oTasks. The scheduler seeks the order of execution that minimizes the impact induced by the latency, formulated as the product of the displacement and the e2e latency:

    minimize Σ_{o ∈ s} Δ_o · T_o

Scheduling process. Determining the optimal order is NP-hard because oTasks on shared paths affect multiple objects [60]. Instead, the scheduler adopts a depth-first search (DFS) mechanism to approximate the optimal order using a path selection process with a cost. The cost of a path rooted at an oTask s represents the importance of all objects on that path, formulated as a recursive sum of the product of the displacement Δ_o and the e2e latency T_o per oTask:

    cost(s) = Δ_s · T_s + Σ_{c ∈ children(s)} cost(c)

Δ_o is calculated from a user-defined policy (e.g., optical flow) in the branch node. The policy is formulated as a function that measures the difference of the inputs of an oTask compared to the previous frame. For example, the magnitude of the optical flow for key points within a bounding box (the input of the oTask specific to the object) can be used to measure the displacement level of the object [48]. After the calculation, all Δ_o values for a branch are min-max normalized, i.e., 0 for the minimum and 1 for the maximum. If not applicable, they are set to 1 by default.
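The following is a compact sketch of the cost-guided depth-first ordering implied by the formulation above; the OTask structure, the tree-shaped dependency assumption, and all names are illustrative simplifications rather than Maestro's internals.

#include <algorithm>
#include <vector>

struct OTask {
  double displacement;           // normalized displacement (delta) of the object
  double latency;                // profiled e2e latency estimate T
  std::vector<OTask*> children;  // oTasks that depend on this one
};

// cost(s) = delta_s * T_s + sum of the children's costs (recursive importance).
double Cost(const OTask* s) {
  double c = s->displacement * s->latency;
  for (const OTask* child : s->children) c += Cost(child);
  return c;
}

// Depth-first order: descend into the highest-cost path first, so that
// latency-sensitive objects reach their simulation nodes earliest.
void OrderDFS(OTask* s, std::vector<OTask*>& order) {
  order.push_back(s);
  std::vector<OTask*> kids = s->children;
  std::sort(kids.begin(), kids.end(),
            [](const OTask* a, const OTask* b) { return Cost(a) > Cost(b); });
  for (OTask* child : kids) OrderDFS(child, order);
}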
Fig. 11 shows an example process. Starting from the root, the scheduler compares the cost of candidate paths, e.g., the path through oTask_det alone versus the path continuing into the per-object oTask_pose^(0..n)s. Based on the calculated cost in Fig. 11a, it takes the latter path (step 1), moves to its children, reserves the oTask_pose^(0..n)s in order (step 2), and then schedules the remaining oTasks as a last step (step 3). The scheduler re-adjusts the order if any reserved oTasks after a branch node, such as the oTask_pose^(0..n)s, are determined. For example, after processing oTask_det, the scheduler determines the order of the oTask_pose^(0..n)s based on the amount of displacement estimated by the optical flow magnitude (i.e., the MovingObjectFirst policy).

Worker assignment. Given the order of oTasks, the scheduler assigns the internal nodes of the oTasks to workers. We borrow the HEFT algorithm [59], a widely adopted heterogeneous scheduling method for minimizing makespan, for operator assignment. The scheduler considers operators' dependencies and may insert independent nodes out of the oTask order for better utilization. Fig. 12 shows an example operator-level schedule of two oTasks.
Handling inefficiencies from priority. Strictly assigning oTasks in the original order may incur inefficiency: undetermined oTasks block subsequent executions, causing underutilization. Fig. 12a shows an example. If the scheduler stops after oTask_det to prioritize the yet-undetermined oTask_pose^(i)s, the system underutilizes workers, such as the NPU and the simulation worker.
The scheduler employs an uncertainty bound to take the middle ground between maximizing utilization and respecting the order, by injecting low-priority oTasks that do not disrupt the high-priority oTasks. The bound is defined as the maximum expected latency of the in-order schedule across all workers. Fig. 12b illustrates the concept. The scheduler additionally assigns a low-priority oTask out-of-order within the bound for better utilization. Thus, our scheduler can improve throughput while respecting the given priorities.

Latency profile. The planner tracks the latency of each node on its compatible workers with momentum [43]. The scheduler approximates the latency of an oTask (i.e., T_o) as the makespan of a single node-level schedule.
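A minimal sketch of the uncertainty-bound check, assuming profiled per-worker finish times; the function name and the queue model are illustrative.

#include <algorithm>
#include <vector>

// Uncertainty bound: the maximum expected finish time of the in-order
// (priority-respecting) schedule across all workers. A low-priority oTask
// may fill a worker's idle slot out-of-order only if it is expected to
// complete within this bound, so it cannot delay a high-priority oTask.
// Assumes at least one in-order oTask has been scheduled.
bool CanInjectOutOfOrder(double idle_start, double otask_latency,
                         const std::vector<double>& inorder_finish_times) {
  double bound = *std::max_element(inorder_finish_times.begin(),
                                   inorder_finish_times.end());
  return idle_start + otask_latency <= bound;
}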

oTask Matcher
oTask matcher receives oTasks from the branch node and temporally matches them with the previous ones. The matching has two purposes: (i) to allow the scheduler to compute the displacement Δ_o, and (ii) to help each oTask update the proper virtual object. Maestro triggers matching upon each execution of the branch node to automatically handle a dynamic number of objects.
Object-level updates without underlying system support could result in incorrect scene representation owing to inaccurately overriding the previous frame's result. Fig. 13 describes an example. Given a scene with two people (top), an uncoordinated update can cause problems like (a), adding another virtual person, or (c), not keeping the content for the right person. This requires a proper mapping between the previous oTask for the corresponding left person (i.e., oTask^(0)_prev) and the current oTask (i.e., oTask^(0)_curr). The matcher employs an object-matching mechanism to avoid this problem. It finds a many-to-one matching between the previous and current oTasks in a branch using similarity. The problem is formulated as maximum weighted bipartite matching [39] between two sets of vertices (oTasks) with weighted edges (similarities), which is solvable in polynomial time [49]. The matcher computes the similarity from the oTask inputs, then assigns the corresponding virtual object to each oTask for the correct subsequent simulation update.
We take Fig. 13 as an example. The matcher computes the distance between the bounding boxes of the previous and current oTasks. Since the left previous box is closer to the current one, the current node overrides the left person's content, as shown in (b).
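A simplified sketch of this temporal matching step: Maestro solves maximum weighted bipartite matching [39], while this stand-in uses a greedy nearest-center pairing over bounding-box inputs purely for illustration; all names are assumptions.

#include <cmath>
#include <limits>
#include <vector>

struct Box { float cx, cy; int virtual_object_id; };

// For each current box, inherit the virtual object of the closest previous
// box so the simulation node updates the correct person (Fig. 13b). The
// per-current greedy choice is consistent with many-to-one matching.
void MatchOTasks(const std::vector<Box>& prev, std::vector<Box>& curr) {
  for (Box& c : curr) {
    double best = std::numeric_limits<double>::max();
    for (const Box& p : prev) {
      double d = std::hypot(c.cx - p.cx, c.cy - p.cy);
      if (d < best) { best = d; c.virtual_object_id = p.virtual_object_id; }
    }
  }
}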

On-demand Object-Level Simulator
Object-level simulator coordinates the simulation worker and the game engine to update the virtual object (i.e., run N_S) without synchronization delay. A fundamental approach would be re-designing an adaptive-rate pipeline that utilizes variable refresh rate technologies (e.g., NVIDIA G-Sync [6]) to remove the VSync delay. However, such an approach is infeasible on MR devices (i.e., near-eye displays) since they are designed to operate at a fixed rate [33].
A naive approach to avoiding the VSync delay (C2) is to separately simulate each object upon completion of its N_S during the stall time before VSync. The simulation CPU thread has enough stall time (e.g., 12 ms out of 16.6 ms (60 Hz) for object detection on Google Pixel 4) since the rendering operations on the GPU serve as the primary bottleneck. However, this leads to duplicated outcomes due to the irreversible nature of the rendering process. Fig. 14a shows an example. As virtual objects exist within the game engine's scene, the engine periodically simulates them, resulting in duplicated objects rendered by both the engine (prev) and the separate process (curr).
During the rendering (i.e., rasterization) process, each pixel keeps only the foremost color (also known as z-buffering), making it impossible for a second rendering pass to revert or alter pixels already drawn onto the buffer by the first. Instead, the simulator restricts the engine's pipeline to render the virtual scene without the virtual objects that depend on oTasks expected to finish within the frame. As shown in Fig. 14b, this allows the simulator to update any virtual object without delay while ensuring a single draw call per object within the display interval. We empirically show that the stall time is available even in high-quality virtual scenes (see §7.1.4).
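A sketch of the per-object visibility restriction described above, with hypothetical hooks; the real implementation sits inside the Unreal rendering pipeline, and these structures are illustrative only.

#include <vector>

struct VirtualObject { bool pending_update; bool visible; };

// Before the engine's periodic render pass: hide objects whose oTask is
// expected to finish within this frame, so the engine does not draw a
// stale duplicate of them.
void PrepareFrame(std::vector<VirtualObject>& scene) {
  for (VirtualObject& obj : scene)
    if (obj.pending_update) obj.visible = false;
}

// During the VSync stall time, once the object's oTask completes: apply the
// fresh analysis result and re-enable rendering, so each object is drawn
// exactly once per display interval with up-to-date state (Fig. 14b).
void OnOTaskComplete(VirtualObject& obj) {
  obj.pending_update = false;
  obj.visible = true;
}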

IMPLEMENTATION
We implemented Maestro on Unreal Engine v4.27 as a plugin, with a total of 14.2k lines of C++ code. It utilizes the open-source multi-DNN framework Band [43], also integrated as a plugin. We spawn one DNN worker per accelerator, such as the DSP and NPU, and two CPU workers. CPU workers are pinned to big cores using a thread affinity mask to minimize latency fluctuations from the ARM big.LITTLE architecture. We dedicate the GPU to rendering for the simulation pipeline to avoid contention with DNN execution [67].
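A minimal Linux/Android sketch of pinning a worker thread with an affinity mask, as described above; the big-core ids are assumptions and should be derived from the SoC topology rather than hard-coded.

#include <sched.h>

// Pin the calling worker thread to the (assumed) big cores; on Linux and
// Android, pid 0 targets the calling thread.
void PinSelfToBigCores() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(6, &mask);  // illustrative big-core ids; query the SoC topology
  CPU_SET(7, &mask);  // in practice instead of hard-coding them
  sched_setaffinity(0, sizeof(mask), &mask);
}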

Programming API
Maestro provides a programming API based on the abstraction in §4.2. It guides users in properly connecting nodes since all nodes are strongly typed. We support the pre-defined operator nodes listed in Tab. 4 and customizable nodes for simple yet flexible development.

Graph creation. Lst. 1 shows an example of building the light-estimation feature of the pet breeding app (Fig. 8a) with MaestroGraph. This process analyzes the real world's light direction and updates the virtual light's direction to render light and shadow effects on virtual objects.

Listing 2: ForEach control node for pose estimation. The node takes an array of bounding boxes and duplicates the processing sequence at runtime based on the given number of elements.
Table 5: Built-in run-time policies provided by the Maestro API. They allow users to decide how to prioritize dynamic objects from the branch node.

MovingObjectFirst    Prioritizes objects based on movement. It utilizes the magnitude of optical flow [31].
ChangingObjectFirst  Prioritizes objects based on content variability. It utilizes the magnitude of the average pixel difference [67].
As the first step, a developer defines the Source node, which emits the camera input frame to child nodes (L1). The developer can create nodes with built-in definitions (L4-7) or with a lambda expression (L9-10).
Once the nodes are created, the developer makes the connections between graph nodes. L14-17 show connections made with Next, which connects the output of a node to the next.
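To illustrate how strong typing catches mis-wired graphs at compile time, here is a self-contained toy version of the node/Next idea; these stand-in classes and types are not the actual Maestro API.

#include <functional>
#include <iostream>

struct Frame { int id; };
struct LightDir { float yaw, pitch; };
struct Rotation { float yaw, pitch; };

// Strongly typed node: the Out of one node must match the In of the next,
// so a mis-wired graph fails to compile instead of failing at runtime.
template <typename In, typename Out>
struct Node {
  std::function<Out(const In&)> op;            // the node's operation
  std::function<void(const Out&)> downstream;  // next node, if connected

  template <typename NextOut>
  Node<Out, NextOut>& Next(Node<Out, NextOut>& n) {  // type-checked wiring
    downstream = [&n](const Out& v) { n.Run(v); };
    return n;
  }
  void Run(const In& v) {
    Out out = op(v);
    if (downstream) downstream(out);
  }
};

int main() {
  // Analysis node: estimate the light direction from a camera frame (stub DNN).
  Node<Frame, LightDir> analyze{[](const Frame&) { return LightDir{0.3f, 1.2f}; }};
  // Conversion node: map the DNN output into an engine rotation.
  Node<LightDir, Rotation> convert{[](const LightDir& d) { return Rotation{d.yaw, d.pitch}; }};
  // Simulation node: update the virtual light (printed here as a stand-in).
  Node<Rotation, int> simulate{[](const Rotation& r) {
    std::cout << "rotate virtual light to " << r.yaw << ", " << r.pitch << "\n";
    return 0;
  }};
  analyze.Next(convert).Next(simulate);
  analyze.Run(Frame{0});  // the source emits one camera frame
}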
Finally, the developer updates the external parameters of the nodes. This includes assigning the DNN model a node should run, defining the DNN type to parse the output tensor, and designating which virtual light object should rotate (L20-22).

Object priority specification. A developer can specify the current frame's estimated displacement of an object (Δ_o) at the branch node. For instance, Lst. 2 L6 describes part of the pose estimation app, which detects a pose per bounding box corresponding to a person. The MovingObjectFirst policy estimates the displacement of each oTask under the ForEach node based on the average magnitude of the optical flow of its key points.
The user can either plug in the predefined policies (listed in Tab. 5) or design a custom policy as a lambda expression that can access the node's previous and current inputs (e.g., boxes) and any runtime output (e.g., the camera frame). The system provides the previous frame's inputs based on oTask matching.
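A hedged sketch of a custom displacement policy together with the min-max normalization described in §5.2; the policy signature (a callable over the previous and current inputs) is an assumption based on the description above, not the exact Maestro API.

#include <algorithm>
#include <cmath>
#include <vector>

struct BoundingBox { float cx, cy, w, h; };

// Custom displacement policy: estimate per-object displacement as the
// center shift of the matched bounding box between consecutive frames.
auto center_shift_policy = [](const BoundingBox& prev, const BoundingBox& curr) {
  return std::hypot(curr.cx - prev.cx, curr.cy - prev.cy);
};

// Min-max normalization across all objects of a branch, mapping the
// fastest object to 1 and the slowest to 0 (1 by default when degenerate).
std::vector<float> Normalize(std::vector<float> deltas) {
  if (deltas.empty()) return deltas;
  float lo = *std::min_element(deltas.begin(), deltas.end());
  float hi = *std::max_element(deltas.begin(), deltas.end());
  for (float& d : deltas) d = (hi > lo) ? (d - lo) / (hi - lo) : 1.0f;
  return deltas;
}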

EVALUATION
We conduct experiments on three mobile devices with different DNN accelerators and display rates; Tab. 6 lists them. These flagship mobile devices are surrogates for MR devices that share the same ARM-based mobile platform; we discuss this further in §9.
Baselines. We compare one existing approach and one variation of Maestro as baselines:
• w/o On-demand (w/o Od): a baseline that simulates with the synchronization delay. Each N_S waits for the next iteration of the simulation (C2).
• MediaPipe-Like (MP-Like): a baseline that analyzes the entire scene, like AR frameworks [54], implemented on the simulation engine. We distribute DNN operations across available DNN accelerators to increase their utilization, since MediaPipe executes operations on the fastest processor, resulting in severe contention and underutilization in complex scenarios.

Application scenarios. We evaluate the following app scenarios:
• Object Detection [29]: a single-DNN workload representing a typical MR scenario where the object detector draws bounding boxes around objects on the UI. Implemented with SSD-MobileNetV2 [42,53].
• Pose Estimation (Fig. 8b): the multi-person pose estimation scenario. An object detector first looks for target people, and a pose estimator runs for each box to detect and draw postures. Implemented with SSD-MobileNetV2 [42,53] and MoveNet-Lightning [23].
• Face Mesh [14]: the multi-person face mesh scenario. A face detector first searches for target faces to run a face mesh model on. Implemented with RetinaFace-MobileNetV2 [37,42] and FaceMesh [46].
We evaluate each scenario with three videos. Each video is selected from public datasets, such as MPII Human Pose (MPII) [30] and ImageNetVID (INV) [56], and contains up to 6 objects. Both the pose estimation and face mesh scenarios use the MovingObjectFirst policy (see Tab. 5).

Metric. We compare two metrics:
• Streaming accuracy [62]: an online accuracy measured after latency. We measure each app's target task (e.g., mean average precision (mAP) for object detection and pose estimation, and mean average distance (MAD) for face mesh) against offline processing results with the same DNNs as ground truth, to remove the impact of the model's correctness. For the face mesh task, results are normalized to Maestro's result per video.
• End-to-end latency: the latency from camera input to display. Since we simulate per oTask, we use the e2e latency per oTask. We measure the software-level end of the latency (i.e., the end of VSync from the render thread).

Performance Evaluation
In this section, we answer the following questions:
• How well does Maestro improve consistency between virtual objects and physical scenes (§7.1.1)?
• To what extent does Maestro reduce the end-to-end latency of the analysis-simulation process (§7.1.2)?
• How does Maestro implement object-level execution behind the scenes (§7.1.3)?
• Is Maestro extensible to accommodate high-quality virtual contents (§7.1.4)?
7.1.1 Streaming Accuracy. We measure the streaming accuracy [62] to estimate the level of consistency between the two scenes. First, Fig. 15a shows the accuracy for each person in the Basketball video for face mesh. Compared to MP-Like on Pixel 6, Maestro improves the accuracy by up to 1.53× on the fastest object because our object-level execution adaptively prioritizes the most important person based on displacement. In the Basketball video, on which the face mesh app was evaluated, all people move throughout the entire duration, so Maestro improves at a similar rate for all persons on average (1.5×). Fig. 15b reports results for pose estimation on the Gym Workout video. The video presents two people, where one moves throughout the entire duration and the other stays nearly still. Maestro outperforms MP-Like by up to 1.2× for the moving person because it prioritizes the most important person based on displacement. Fig. 17 shows a single-frame example from the video. The moving person mostly makes sudden vertical movements, resulting in a smaller gap compared to the continuously moving Basketball video. However, as shown in Fig. 17b, this movement can lead to a sudden drop in consistency that is noticeable without our asynchronous simulation approach.
Fig. 15c shows the results with a single object on pose estimation. Maestro demonstrates an accuracy enhancement of up to 1.6× compared to the MP-Like baseline on the K40G in the Aerobic video. This underlines the impact of the latency improvement from our object-level simulation.
Fig. 15d displays the results on object detection. The Birdwatching video shows shaky footage of a flock of birds. The device movement creates large displacements for the entire frame/scene, resulting in overall lower accuracy compared to the Aerobic video. Despite the 120 FPS display of the K40G, MP-Like performs worse due to the performance imbalance of DNN processors (e.g., MediaTek NPU and Qualcomm DSP) for a single DNN.

7.1.2 End-to-end Latency.
Object-level latency. Fig. 16 compares the e2e latency of each object, depending on how each baseline processes it. In pose estimation (b), Maestro significantly outperforms MP-Like, with nearly 42% and 30% reductions in the median e2e latency for processing the first and second objects, respectively. This result illustrates Maestro's ability to display the first person approximately 3.5 frames earlier than the baseline, enhancing resilience to abrupt object movement. Fig. 16c shows the results on face mesh. w/o Od exhibits an additional ≈11 ms delay compared to Maestro, equivalent to one frame on the 90 Hz device. As the video presents a varying number of people (from 1 to 3), the frame-level latency of MP-Like fluctuates with that number.

Frame-level latency. The e2e latency of the last object in Fig. 16 from Maestro denotes the entire input-frame processing time. This result shows that the scheduler effectively schedules additional oTasks out-of-order to realize a depth-first approach without throughput degradation.

7.1.3 Verification of Object Prioritization. Fig. 18 shows the timeline with face mesh on the Baseball video to demonstrate how Maestro prioritizes latency-sensitive objects. When tracking the fast-moving person, Maestro prioritizes its execution ahead of the slowly moving one, which prevents a rapid accuracy drop compared to MP-Like and shows the effectiveness of our MovingObjectFirst policy. As expected, both Maestro and MP-Like achieve low MAD on standing people (bottom).

7.1.4 Simulation Pipeline Support. The concurrent support of realistic virtual content is one of the key requirements for future MR apps. To evaluate the simulation engine's performance alongside the analysis pipeline, we assessed object detection using three MR scenes of various complexities: UI Only, Soul: Cave [5] (a mobile high-end natural environment with 1k-resolution textures), and Big Office [19] (a desktop production-quality office with up to 4k-resolution textures). Fig. 19 shows the rendering FPS of Maestro compared to the baseline. Unlike the MP-Like baseline, which is bounded by the camera input rate, Maestro effectively sustains its rendering FPS relative to simulation-only conditions. Maestro sustains the simulation pipeline at the highest display FPS achievable within scenes intended for high-end mobile gaming. It only incurs a 14% degradation compared to pure simulation (the vanilla game engine) in the high-quality scene, where the engine alone also failed to meet the hardware capability of 90 Hz.

Expressibility of MaestroGraph
We assessed the expressibility of MaestroGraph by reproducing 9 existing scenarios of the state-of-the-art AR framework, MediaPipe [54]. These scenarios cover computationally distinct DNN analysis workloads that future MR scenarios (e.g., virtual home and office) [26] commonly require, with up to 5 DNNs.
Tab. 7 details the list of these apps. With the help of Maestro, minimal effort is required from developers to build DNN-enabled MR apps. In particular, implementing the 9 qualitatively distinct apps involved no changes to the core of Maestro. Only six custom nodes were added in total. The additions were primarily for the pre-/post-processing of DNN results, such as introducing a new anchor type for a new detector (e.g., BlazeFace [32]). We implemented the apps in the list in top-down order and counted the newly added nodes (Framework Changes in Tab. 7).
We also measure the graph model's complexity (e.g., the number of vertices and edges) compared to MediaPipe. Maestro greatly reduces (by 75%) the number of graph elements needed to represent the most complex Holistic scenario. Although the representational granularity (i.e., run the DNN or parse the tensor) is similar, the actual simulation process (e.g., individual key point rendering) greatly increases the complexity.

7.2.1 Resource Utilization. We finally report the resource utilization of Maestro compared to MediaPipe using the Android Studio Profiler [8]. Tab. 8 shows the CPU and memory consumption of both frameworks in the object detection scenario. Maestro shows similar CPU usage (51%) compared to the baseline. Maestro uses 3.3× more memory, but mostly due to consumption by the simulation pipeline (e.g., memory for common materials like textures and shaders).

DISCUSSION
Will challenges persist with future hardware? Display technologies strive for higher refresh rates to improve user immersion. Because the inherent switching time of OLEDs is less than 0.1 ms, we believe displays will continue to get faster [36]. On the other hand, obtaining a higher FPS from a camera necessitates costly, sensitive sensors, which is impractical for a commercial device, or results in darker images due to the lack of light per frame [64]. The synchronization challenge (C2) may become less significant with faster displays. However, we anticipate that the gap between display and camera rates (currently 2-4×) associated with frame-level processing (C1) will likely remain, or even increase.

Limitation of the object-level simulator. Our object-level simulation mechanism utilizes the stall time of the simulation CPU thread. We show that our current design applies to high-quality virtual scenes (see §7.1.4). We anticipate that our current approach extends to future MR scenarios unless they require CPU-intensive tasks like real-time physics simulation.

FUTURE WORK
Extension to saliency-based prioritization. The inconsistency that users perceive may include other factors, such as user saliency and the visual representation of the virtual content. While our policy centers on modeling displacement levels (Δ_o), it is adaptable to prioritizing objects according to saliency (such as central vision or gaze [57]) or to systematically dropping the processing of objects expected to have negligible impact on user perception.

Evaluation on mixed-reality devices. Our current quantitative performance evaluation on mobile platforms is extensible to MR devices. We leave a study of user quality of experience on HMDs as future work, due to the absence of (i) external API access (specifically NNAPI [10]) to DNN accelerators and (ii) raw pass-through camera frames in commercially available standalone MR devices to date, such as Meta Quest 2 [20] and Pico 4 [27]. Nevertheless, Maestro can easily be made compatible with MR devices like HMDs and AR headsets given their shared ARM-based platform (e.g., Qualcomm Snapdragon XR2) with the hardware used in our study.
Our runtime approach, object-level execution, is expected to perform even better on MR devices. Visual-inertial odometry for tracking opens up possibilities for estimating device movements. This estimation enables readjusting the app outputs by accounting for the inverse movement, effectively eliminating the influence of device movement. Consequently, it further increases the importance of executing each object based on its priority.

CONCLUSION
This paper introduces Maestro, a framework for mixed reality applications that integrates the analysis and simulation pipelines. It offers a programming model that allows developers to designate important objects to enhance perceptual quality. The Maestro runtime utilizes this information and adopts object-level execution to maximize the overall consistency between the real scene and the displayed virtual scene. Maestro is the first to encompass the end-to-end process of integrating both pipelines, especially to mitigate the inconsistencies in HMDs. We believe it will serve as a critical cornerstone for exploring future mixed reality applications.

Figure 1: User perception in mixed reality. Users perceive greater visual inconsistencies in HMDs, where they directly observe reality with the bare eye along with virtual contents.

Figure 3: E2E latency breakdown. The simulation and rendering are duty-cycled at the display rate [33] while the DNN is executed at the camera frame rate. I abbreviates input. For simplicity, only a single processing sequence is shown.

Figure 5: Each object has a different end-to-end latency and alignment trade-off. A person who works out (right) requires faster attention than one who rests (left). The e2e latency and alignment relation of two objects from the same 60 FPS video (MPII-166 [30]).

Figure 6: A single frame of delay with device movement is perceivable. (left) The x-axis indicates the difference in frames between the baseline and the target, i.e., a single frame is 11 ms at 90 Hz. (right) The user-study app running on Pixel 4.

Figure 7: Overall design concept of Maestro (§4).

Figure 9: Architectural overview of Maestro.

Figure 10: Example result of oTask partitioning (§5.1).

Figure 11: oTask-level schedule process. Example cost computation from the root (a), and the decided order (b).

Figure 12: Node-to-worker assignment process. Our uncertainty bound mechanism (b) prevents under-utilization of workers while keeping the oTask order (a).

Figure 13: Representing the real world with virtual contents from multiple input frames. Example cases of overriding the current person's pose onto the current scene with two people. (a)-(c) show possible cases: no overriding, overriding the correct one from the previous frame, and overriding the wrong one.

Figure 14: Object-level simulation process. Our simulator uses the VSync stall time (blue) to simulate each object on demand (b) and restricts rendering of objects involved in the current processing to avoid duplicated rendering results (a).

Figure 15: Streaming accuracy of each object. Maestro performs better in all cases, especially in complex scenarios with fast, diverse movement patterns and more real-world objects. The top row shows the mean average distance (MAD) normalized to Maestro (lower is better). The bottom row shows the mAP (higher is better).

Figure 16: End-to-end latency of each oTask. Maestro outperforms in most cases and can prioritize the execution of each object. Measured on Google Pixel 6; reports the latency for two and three objects in Gym Workout (pose estimation) and Basketball (face mesh), respectively. Each object has a different dotted line, and the entire frame is solid. The latency of the last object of Maestro denotes the frame processing latency.

Figure 17: Visual impact of the on-demand simulation. Maestro asynchronously simulates each object to reduce errors from sudden movement (i.e., the sitting movement of the right person). Measured on Google Pixel 4 in Gym Workout.

Figure 18: Prioritizing latency-sensitive objects. Maestro prioritizes important (e.g., fast-moving) objects (top) over static objects (bottom). The x-axis indicates the video frame. Each y-axis indicates the mean average error to the ground truth, normalized by the inter-pupillary distance. Measured on Xiaomi K40 Gaming with face mesh on the Baseball video (varying movements with three people).


Figure 19: Rendering FPS of 3D scenes. Maestro simultaneously supports the simulation and analysis pipelines. Measured on Google Pixel 6.


Table 1: Existing mixed reality frameworks are predominantly single-pipeline-centric, with development focused either on DNN-based analysis or on simulation pipelines.

Table 3: Display and camera frame rates of MR devices.

Table 4: Built-in nodes provided by the Maestro API.

Listing 1: A partial snippet of the light estimation process from pet breeding, represented by the Maestro API.

1 // Assign the object-specific importance policy to the
2 // `ForEach` node, prioritizing objects based on movement
3 new ForEach<BoundingBox, const CameraFrame *>(
4     BuiltinPolicy::MovingObjectFirst());

Table 6: List of hardware used in the evaluation.

Table 8: Average resource utilization on Google Pixel 6 while processing object detection on camera, measured using Android Studio Profiler [8].