PaperToPlace: Transforming Instruction Documents into Spatialized and Context-Aware Mixed Reality Experiences

While paper instructions are one of the mainstream media for sharing knowledge, consuming such instructions and translating them into activities is inefficient due to the lack of connectivity with the physical environment. We present PaperToPlace, a novel workflow comprising an authoring pipeline, which allows authors to rapidly transform and spatialize existing paper instructions into an MR experience, and a consumption pipeline, which computationally places each instruction step at an optimal location that is easy to read and does not occlude key interaction areas. Our evaluation of the authoring pipeline with 12 participants demonstrated the usability of our workflow and the effectiveness of using a machine learning based approach to help extract the spatial locations associated with each step. A second within-subjects study with another 12 participants demonstrates the merits of our consumption pipeline by reducing the effort of context switching, delivering segmented instruction steps, and offering hands-free affordances.


ABSTRACT
While paper instructions are a mainstream medium for sharing knowledge, consuming such instructions and translating them into activities can be inefficient due to the lack of connectivity with the physical environment. We propose PaperToPlace, a novel workflow comprising an authoring pipeline, which allows the authors to rapidly transform and spatialize existing paper instructions into an MR experience, and a consumption pipeline, which computationally places each instruction step at an optimal location that is easy to read and does not occlude key interaction areas. Our evaluation of the authoring pipeline with 12 participants demonstrates the usability of our workflow and the effectiveness of using a machine learning based approach to help extract the spatial locations associated with each step. A second within-subjects study with another 12 participants demonstrates the merits of our consumption pipeline to reduce context-switching effort by delivering individual segmented instruction steps and offering hands-free affordances.

INTRODUCTION
Paper-based instructions are common for knowledge sharing. Such instructions are often related to tasks that require users to interact with multiple objects spatially distributed in an environment. For example, when following a recipe, a user may need to interact with multiple kitchen appliances like the cooktop, fridge, and microwave. When following a safety manual, a compliance manager may need to interact with various machines on the factory floor.
However, performing a task while consuming instructions can be tedious as the text is typically disassociated from the user's physical environment. Thus, a user has to balance reading the instructions, figuring out what they mean in the environment, and performing the task, which can be cognitively demanding [37]. For example, when frying a piece of steak while following a cookbook, one needs to frequently switch between the cookbook and the pan to check the searing technique, temperature, etc. This switch can be costly if the user places the cookbook somewhere peripheral so that it does not obstruct the task area. The user might end up spending more time trying to navigate the text and environment than performing the task. This problem is made worse if the user forgets some important information like temperature or duration, and has to repeatedly come back to the instruction to double check.
Consumer Augmented Reality (AR) and Mixed Reality (MR) offer a unique opportunity to address this document-activity disassociation by overlaying digital elements on top of the environment. These approaches are becoming more accessible, and studies have demonstrated their potential for training workers to conduct tasks that are spatial in nature [44]. While prior works investigated the affordances of virtual guidance for conducting spatial tasks [48], and how to integrate such guidance in MR [33], they have not explored how document contents and their associated consumption experience could be constructed in MR. To that end, Microsoft Dynamics 365 Guides [12] is an industry solution to help enterprise users manually create instructions and anchor them in an MR experience.
Across these MR instruction systems, the placement of the virtual instructions is often decided by the authors, and cannot be dynamically adapted to real-world contexts. This assumption could lead to undesirable experiences in both the consumption phase and the authoring phase. For the consumption phase, a static MR instruction could be mistakenly placed at a location too far from the user's task, at a distance that is difficult to read, or at a position that occludes key objects that the user is interacting with. For authoring, the author has to spend time associating and placing an instruction with its corresponding physical object. This process is time consuming and has to be repeated for every new set of instructions, even though the physical layout of objects might not change much over time. While some prior works, e.g., FLARE [41], showed the usefulness of creating a persistent AR layout, real-world instructional activities are frequently changing (e.g., users might move from one place to another depending on the procedural step in the instructions), causing such a static layout to be infeasible.
We propose PaperToPlace, a novel end-to-end workflow that transforms paper instructions into a context-aware instructional MR experience by segmenting monolithic documents; associating instruction steps with real-world anchoring objects; and optimally placing the virtual instruction steps so that they are easy to read and revisit while completing the tasks. To realize this goal, PaperToPlace consists of an authoring and a consumption pipeline. With the authoring pipeline, the author can simply take a snapshot of the paper document by leveraging a mobile camera (Fig. 1a). Our system then segments the text in the document into individual instruction steps. The author can manually edit these segments, and associate each step with the spatial location where the relevant activities will occur. To help with this association task, we designed a machine learning (ML) approach, where a fine-tuned language model was used to suggest the relevant spatial locations to the author. Our consumption pipeline is designed to render these instruction steps and place them in the associated spatial locations. To ensure users could easily consume the spatialized instruction, the placement of each step in MR is optimized using a probabilistic optimization approach based on pre-created environmental models and the tracked gaze and hands. Fig. 1c shows an example of a spatialized instruction step, where the step "And microwave on high for 30 seconds" is tagged with "microwave" and is optimally rendered in front of the user while not occluding their view as they set the heating time.
We prototyped PaperToPlace on Meta Quest Pro [11], and conducted two within-subjects studies to evaluate the authoring pipeline with 12 participants, and the consumption pipeline with another 12 participants. We demonstrated the usability of our authoring pipeline, and the effectiveness of using an ML-based approach to help the authors extract the spatial location associated with each step. We then illustrated the effectiveness of the consumption pipeline for reducing context-switching effort, delivering the segmented instruction steps, and offering hands-free affordances.
With the assumption that the spatial profiles (i.e., the environmental geometry and associated semantic labels, see Sec. 4.1) are available, we contribute the design and evaluations of:
• An authoring pipeline that allows users to transform paper instructions into a spatialized MR experience;
• A consumption pipeline that can computationally place the virtual instruction steps in the optimal place without either occluding the user's view or leading to large degrees of context switching.

RELATED WORK
This paper is motivated by prior work on incorporating instruction experiences into MR and designing context-aware MR experiences.

Integrating Instruction Experiences into MR
MR has been widely used for augmenting document consumption experiences [29,30,59]. Augmenting instructional documents, however, is still challenging due to the need to connect and integrate with real-world scenes and activities [42,64]. Many prior works have explored the use of MR to augment a procedural instruction experience, an important asynchronous collaboration task. For example, ProcessAR [33] proposed in-situ procedural AR instructions that could be rapidly created by experts, and used to teach novices through spatial and temporal demonstrations (e.g., information about how to move a tool in the temporal domain and orient it in the spatial domain). However, the placement of textual instructions was not explored. CAPturAR [68] introduced an MR tool that helps users rapidly author context-aware applications by referring to recorded activities. Commercial tools such as Microsoft Dynamics 365 Guides [12] enable experts to author an MR instruction experience by enacting the guidance, placing the instruction in the designated space, and recording the tool operations.
Although these works explored the design of MR-based instruction experiences, existing paper instructions are usually left behind, resulting in unnecessary time and effort to redesign a usable instruction workflow. Additionally, existing MR instruction experiences are often not able to dynamically adapt to the changing environmental context of real-world activities (e.g., [12,33]), causing user frustration when virtual graphics occlude interaction tasks. PaperToPlace is novel in that it supports reusing existing paper documents that are designed by professional writers in a reader-centered way [69], and can transform such documents into a spatialized and context-aware MR experience that is adapted to both the user's needs and the environmental characteristics.

Designing Context-Aware MR Experiences
Context-aware MR systems aim to show "the 'right' information, at the 'right' time, in the 'right' place, in the 'right' way" [40], which requires understanding both human and environmental contexts.
Prior research explored various computational approaches to realize this goal. For example, Lindlbauer et al. [53] used the real-time cognitive load, estimated by pupil dilation, to decide when and where the application should be shown, as well as how much information should be delivered (i.e., level of detail) in an MR system. Lang et al. [51] used simulated annealing to place virtual agents by considering key anchoring surfaces in the environment identified by a pre-trained Mask R-CNN. Liang et al. [52] used a similar approach to build a scene-aware virtual pet which could behave naturally in the real world (e.g., respond to a food bowl). Yu et al. [71] proposed an interactive and context-aware furniture recommendation MR system by considering the real-world scene (i.e., spatial context) and the learnt furniture compatibility in a latent space (i.e., category context). ScalAR enabled designers to author a semantically adaptive AR experience [60]. Liu et al. [54] attempted to generate suggestions for the arrangement of work surfaces in HoloLens by capturing users' habitual behaviors of interacting with objects on the work surface. Similar to our work, SemanticAdapt [32] used an optimization approach to automatically adapt MR layouts between different environments by considering the virtual-physical semantic connections. However, the target applications were only related to information consumption (e.g., consuming news feeds) and did not consider those related to real-world activities.
Inspired by this prior work, PaperToPlace demonstrates a novel consumption pipeline that leverages a similar computational approach to analyze the tracked gaze and hand position, as well as the anchoring surfaces of the key objects in a target environment.

PRELIMINARY NEEDS-FINDING STUDY
To understand the pain-points for consuming existing paper instructions (i.e., monolithic, non-segmented, non-spatialized, and non-context-aware), we conducted a needs-finding study, rooted in participant observations [23,67] and semi-structured interviews.

Participants, Tasks, and Procedure
We recruited four participants (age, M = 24.75, SD = 2.87, two females, two males). Participants were required to complete the designated task using paper instructions. Specifically, PP1 and PP4 were required to make coffee with a coffee machine by following the user manual [7] (Fig. 2a). PP2 and PP3 were required to make a chocolate microwave cup cake from an existing recipe [14] (Fig. 2b). These tasks were chosen since they are common activities in an office kitchen; require participants to read the textual instructions; and could be completed in a reasonable time for an unpaid study. We also used the existing instructions [7, 14] created by professional writers to minimize the impact of non-professional writing styles. All participants reported little (PP3, PP4) to no (PP1, PP2) prior knowledge of the designated task. Next, we conducted a semi-structured interview, focusing on "what are the pain-points while performing the designated tasks using the given paper instruction, and why?" Finally, we brainstormed potential designs of an MR experience for consuming instruction documents. We explained the concept of MR to PP3 and PP4, who were not familiar with it. Participants were encouraged to sketch their imagined design on an iPad Canvas. The study took on average 30 min (SD = 3.25 min).

Findings
Overall, we identified three pain-points through the study.
The overwhelming amount of information or lack of necessary details in the instructions can impact the usability. During the semi-structured interviews, two participants (PP1, PP3) pointed out that the sometimes overwhelming amount of information and irrelevant content could be distracting. For example: "there was a lot of information on the document. And it wasn't easy for me to know where and what information I should be looking at" (PP1) and "the first setback was too much information" (PP3). To handle the potentially overwhelming amount of information, PP2 first skimmed the document in search of relevant content: "I am first attracted to see where the bullet points are. [...] And then if I just skim through the first two or three points, I understand that this is not relevant. So I'm just skipping those sections completely." While designing the MR experience, PP1 incorporated such insights into her design (Fig. 3c) and commented: "I would rather the MR just gives me one small step every time. For example, I am making coffee and reaching the step two, and in the virtual instruction, it will say like coffee making step two, and then here will be just a instruction with just a few sentences." On the other hand, PP2 and PP4 believed the lack of details for certain steps could impact their ability to perform the sub-task. For example: "I've never cracked an egg so I don't know how to do it. [...] If it is some things that I've never actually done [and the instruction document does not tell me how], I might actually be confused" (PP2). In essence, different users likely require different levels of information based on their prior experience with the intended task.
There are missing connections between the instruction step and real-world activities. All participants identified a need for establishing spatial connections between instructions and real-world objects or activities. For example, during the design phase of the study, PP2 emphasized that "having [a virtual] arrow [in MR] that can help me and connect me to the object is helpful." On the other hand, most participants suggested that overwhelming spatial guidance might be unnecessary and could lead to visual disturbance, similar to their concerns about the potentially overwhelming amount of information already in some instructional documents. Reflecting further on his suggestion to include virtual arrows, PP2 highlighted some potential downsides of this approach, noting that "[having useless spatial indicators] is going to be overloaded. I know these basic things, and I don't need pointers to see like a spoon or a mug." PP1's design sketches implied an alternative approach to establish connections between the instruction step and real-world objects by placing the virtual step close to the coffee machine, without occluding the user's view (Fig. 3c).
Frequent context-switching between the instruction document and real-world activities should be minimized. By analyzing the video recording, we found that users perform frequent context-switching between documents and real-world activities. For example, while PP2 was conducting the task, he naturally commented: "ok! we have flour, sugar, coco powder, egg would be somewhere there, vanilla extract [...]", thereby demonstrating frequent context-switching between instructions and real-world objects as he checked off items from the recipe list (Fig. 3a). Participants generally used two types of strategies to minimize context-switching: holding the document, or switching to it, while performing the task. Specifically, we found that PP1 tended to hold the documents while performing tasks, although occasionally this approach was inconvenient for steps requiring two hands (Fig. 2a). In contrast, other participants tended to place the document on the countertop while reading the content, and move the document to a new place when switching to other steps (Fig. 2b). To understand how participants used the documents to perform individual steps, we manually labelled the timestamps of the recorded video while participants were switching between documents and real-world activities, or otherwise. Through this process, we observed that the participants relied heavily on the documents to perform tasks. Specifically, while participants were attempting to complete a particular step, the document was still frequently referenced even though the participants had already read it at the beginning of each step (Fig. 3b).

Design Considerations
We identified three fundamental Design Considerations (DCs) by analyzing the data from our preliminary needs-finding process.
[DC1] Delivering only the segmented instruction could enhance the information consumption experience. We show that the participants expect to consume relevant information corresponding to their current activities, yet existing paper documents usually deliver all information to users at the same time. An improved MR instruction consumption experience could create novel and flexible ways to segment the document, such that only relevant information is delivered to users for each associated step.
[DC2] Optimally placing instruction texts next to the areas of interactive activities might be helpful for the intermittent and repetitive information consumption experience. While participants generally read the entirety of an instruction step before performing the corresponding actions, the instruction step may need to be consumed repeatedly (Fig. 3b). Therefore, the placement of the virtual instruction step in MR should consider the spatial location where the relevant task would occur.
[DC3] The right level of spatial guidance could help users associate instructions with spatialized key objects. While a few existing works (e.g., [48,49]) suggested the usefulness of spatial guidance for MR-based instructional experiences, our participants emphasized the importance of moderate spatial guidance, with limited visual disruptions. Thus, usable spatial guidance, without causing overwhelming visual disturbance, should be provided at the beginning of each instruction step.

PAPERTOPLACE SYSTEM OVERVIEW
Based on Sec. 3.2, we designed PaperToPlace (Fig. 4), comprising two pipelines: (1) an authoring pipeline for an author to rapidly and easily create a spatialized MR instruction experience from existing paper-based instructions (Fig. 4a-g, Sec. 5), and (2) a consumption pipeline for enabling a consumer to explore context-aware, spatialized instruction steps in MR (Fig. 4i, Sec. 6). While we use cooking tasks in an office kitchen as a running example for the design and evaluations, our approach could be transferred to other types of instruction documents.

Assumptions and System Walkthrough
We consider an environment to be a typical workspace (e.g., the kitchen) for supporting a procedural task (e.g., baking a cake). Each environment contains multiple physical key objects, which are defined as the important, stationary objects that are usually attached to the environment permanently (e.g., the fridge and microwave). We did not consider non-stationary objects (e.g., a mug) due to the lack of support for real-time arbitrary object tracking with Quest Pro [11]. Each key object contains one or multiple anchoring surfaces, which are virtual surface(s) that describe the approximated geometry of the objects. We use these surfaces to determine the placement of instruction in MR.
[Figure caption residue: the mask indicates the located document; the size of the "X" button is approximately the same as the index fingertip, to make sure the key object is easy to manipulate.]
Examples of environments where spatial profiles could be created in advance include an office kitchen, a rental house, and a factory. To realize this assumption, we implemented an MR interface on the Quest Pro [11] that allows each anchoring surface (represented by a 2D plane) and the associated key object to be easily declared (Fig. 5a). The spatial anchor APIs [16] were used to ensure the declared anchoring surfaces are persistent in the environment (Fig. 5b).
With these assumptions, the goals of the two pipelines of PaperToPlace are described below:
• Authoring Pipeline. Given an existing paper-based instruction document, the author would first segment the document content into smaller steps, each of which will only be associated with one key object. For each step, the author needs to identify the metadata, including: (i) the text of each instruction step, and (ii) the key object that the step should be anchored on. Together, such metadata makes up our document profile, i.e., the essential elements to recreate a spatialized MR experience from existing paper documents.
• Consumption Pipeline. While consuming the document in MR, the instruction steps float in mid-air and will be optimally attached to one of the anchoring surfaces of the associated key object, based on user context, i.e., real-time user interaction data in MR such as the tracked eye gaze and hand joints. For example, the step "boil a cup of water in the microwave for 5 min" should be attached to one of the anchoring surfaces of the "microwave" key object and not impact the user's interactions.
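The document profile described above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `InstructionStep` and `DocumentProfile`, the helper `steps_for`, and the second example step are hypothetical and only illustrate the two pieces of metadata named in the text (step text and anchoring key object).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstructionStep:
    """One segmented step: (i) its text and (ii) the key object it anchors on."""
    text: str
    key_object: str


@dataclass
class DocumentProfile:
    """Hypothetical container for the ordered steps of one instruction document."""
    steps: List[InstructionStep] = field(default_factory=list)

    def steps_for(self, key_object: str) -> List[InstructionStep]:
        # Return the steps anchored on the given key object, in document order.
        return [s for s in self.steps if s.key_object == key_object]


# Illustrative profile; the first step text is taken from the running example,
# the second is made up for demonstration.
profile = DocumentProfile(steps=[
    InstructionStep("Boil a cup of water in the microwave for 5 min", "microwave"),
    InstructionStep("Take the milk out of the fridge", "fridge"),
])
```

Keeping exactly one key object per step mirrors the constraint stated for the authoring pipeline; a step that spans several objects would first be split by the author.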

Application Scenarios
We consider two user roles, where the author uses the authoring pipeline to rapidly create an instruction MR experience and the consumer uses the consumption pipeline to consume the authored MR experience while completing tasks. We target co-located and asynchronous collaborations [47], where the instructions are authored and consumed in the same environment. Specific application scenarios are described below.
Author and Consumer are Different Users. For a specific procedural task, PaperToPlace could be used to facilitate asynchronous collaborations between experts and novices. For example, company administrators could use PaperToPlace for training new employees to use the provided facilities (e.g., coffee machines and fridges in the shared office kitchen). Chidambaram et al. [33] also demonstrated the usefulness of using a similar instruction MR experience to teach novices assembly mechanics.
Author and Consumer are the Same User. While we differentiate two user roles, it is possible that the author and consumer are the same user. As the cognitive processes for consuming instructions usually occur in working memory, which is constrained by both time and processing capacity, it is often necessary for one to revisit the procedures repeatedly for the same task [21,42]. For example, while cooking the same meal, it is common for the user to refer back to the cookbook each time upon starting a new step. Fig. 3b confirmed such patterns, where participants repeatedly referred to the instructions while completing a specific step. In this scenario, the user could author a personalized and spatialized MR experience, which they could use repeatedly when preparing the same meal in the future; for example, a user could customize a recipe to take into account their preferences for spice level.

AUTHORING PIPELINE
The authoring pipeline extracts the document profile from an existing paper document to create an MR experience rapidly and easily.

Document Capture and Parsing
One question for document reuse is how to enable users to rapidly capture and extract the document profile from an existing instruction document. Inspired by mobile applications that allow users to capture and analyze scanned documents (e.g., Adobe Scan [3] and Tab [17]), we similarly enable authors to simply take a snapshot of the instruction document to generate a document profile (Fig. 4a-d).
The author can then adjust the scanned region to crop out unnecessary components (e.g., titles, etc.) as needed (Fig. 4c). PaperToPlace then leverages OCR services by Google Vision API [13] to parse the scanned image into machine readable text, due to its ability to extract paragraph structure in the parsed text using full text annotations [9]. By default, we segment each paragraph as one step in the instructions. However, the author can re-segment the steps and fix errors in a dedicated mobile interface (Fig. 4g).
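The default segmentation and the author's corrections can be sketched as plain list operations. This is an illustrative sketch only: it assumes the parsed OCR output has already been flattened to text with blank lines between paragraphs (the OCR service itself returns structured annotations, which are not modelled here), and the function names `segment_into_steps`, `merge_steps`, and `split_step` are hypothetical.

```python
import re
from typing import List


def segment_into_steps(ocr_text: str) -> List[str]:
    """Treat each blank-line-separated paragraph as one instruction step."""
    paragraphs = re.split(r"\n\s*\n", ocr_text)
    # Collapse line breaks and repeated whitespace inside each paragraph.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


def merge_steps(steps: List[str], index: int) -> List[str]:
    """Author edit: merge step `index` with the step that follows it."""
    if not 0 <= index < len(steps) - 1:
        raise IndexError("no following step to merge with")
    merged = steps[index] + " " + steps[index + 1]
    return steps[:index] + [merged] + steps[index + 2:]


def split_step(steps: List[str], index: int, char_index: int) -> List[str]:
    """Author edit: split step `index` into two steps at `char_index`."""
    left = steps[index][:char_index].strip()
    right = steps[index][char_index:].strip()
    if not left or not right:
        raise ValueError("split position must leave text on both sides")
    return steps[:index] + [left, right] + steps[index + 1:]


text = "Mix flour and sugar\nin a mug.\n\nMicrowave on high for 30 seconds."
steps = segment_into_steps(text)
```

Each edit returns a new list rather than mutating the input, which makes it straightforward to re-run key object prediction on only the steps that changed.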

Selecting the Model and Key Objects
To extract the document profile, we designed a manual and an ML-assisted approach to help authors rapidly and easily associate key objects with each step. Our ML-assisted approach leverages a pre-trained language model for a specific environment to predict the key object that is associated with each step. After transforming the existing paper document into machine readable text, the authors need to select the model for the target environment (Fig. 4e). For environments without a pre-trained model, the manual approach enables authors to manually extract the metadata of each step.
The author then selects the key objects that exist in the target workspace (i.e., the set of available key objects) (Fig. 4f). First, this step ensures the key objects contained in the extracted document profile align with the spatial profile of the intended environment. For example, a cooking instruction step such as "boil a cup of water" could be executed either in a typical household kitchen on a cooktop, or in an office kitchen that only provides a microwave. Second, setting the available key objects also provides prior knowledge to help increase the accuracy while predicting key object associations. For example, if an office kitchen only has a microwave, the key object for "boil a cup of water" should not be predicted as oven, even though oven is a possible label for the pre-trained model (Sec. 5.3).

Creating Document Profile
Creating a document profile requires two types of metadata: (1) The text of each procedural step. While by default we consider and segment each sentence as one instruction step (Sec. 5.1), the author could overwrite the system-segmented results by segmenting, merging, and deleting specific step(s) (Fig. 4h). When a specific step is modified or a new merged step is generated, the associated key objects will be re-predicted (if the ML-supported mode is used). Although some instruction steps might be associated with multiple or no key objects, such flexibility allows the author to split or merge the target step(s). In response to [DC1], additional flexibility is also provided for the author to modify the generated text of each step to ensure that the right information with the right level of detail could be delivered to the consumers.
(2) The key object that the step is associated with. The authors can use either a manual or an ML-assisted approach to determine the key object associated with each step. Fig. 4g shows a dedicated interface with segmented instructions, where the authors can use the drop-down menus to select the associated key objects, with the color scale indicating the confidence of our ML predictions (if applicable).
While manually assigning each step to a key object could work robustly, predicting key objects using an ML-assisted approach that requires a pre-trained model is challenging, due to the need for a dataset and ground truth labels. Creating such a dataset that could be generalized to all procedural instructions is not realistic, and labeling each step with a ground truth key object is also difficult and time consuming. Instead of preparing such a dataset for all instruction documents, we chose to focus on a domain-specific dataset that is publicly available. Alternatively, the dataset could be created via vendors or crowdsourcing.
We describe our methods for generating such a pre-trained model below. Although our running example is based on cooking instructions, the overall approach could be transferred to other types of instruction documents, provided the unlabeled dataset is available.
Dataset: We used RecipeNLG [22] for training purposes, which contains more than 2.2M cooking recipes where each recipe includes multiple ordered instruction steps. The process of aggregating all steps from all recipes yielded an unlabelled dataset with 19.5M steps, with an average of 11.54 (SD = 7.13) words per step.
Rule-Based Labelling of Training Instruction Steps: Instead of manually labelling each step, we used a rule-based approach to label steps that contain the exact words of the predefined key objects. For example, we label the step "boiling a cup of water in the microwave for 5 min" as "microwave", yet the step "boiling a cup of water for 5 min" will not be labeled, and thus will not be included as part of our training dataset. We iteratively selected nine key objects that exist in a typical kitchen: blender, cabinet, coffee maker, countertop, fridge, microwave, oven, sink, toaster. Such selections also ensure a reasonable number of instruction steps for subsequent model fine-tuning purposes. We generated a dataset where each of the nine labels is associated with 218 instruction steps (i.e., the labeled dataset contains 218 × 9 = 1962 instruction steps).
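The rule-based labelling can be sketched as a whole-word match against the nine key object names. This is an illustrative sketch rather than the authors' code: the function name `rule_based_label` is hypothetical, and leaving steps that mention more than one key object unlabeled is an assumption, since the text does not say how such steps were handled.

```python
import re
from typing import Optional

# The nine key objects listed in the text.
KEY_OBJECTS = [
    "blender", "cabinet", "coffee maker", "countertop", "fridge",
    "microwave", "oven", "sink", "toaster",
]


def rule_based_label(step: str) -> Optional[str]:
    """Label a step with a key object only if it contains that exact word."""
    lowered = step.lower()
    matches = [
        name for name in KEY_OBJECTS
        # \b keeps "oven" from matching inside words such as "proven".
        if re.search(r"\b" + re.escape(name) + r"\b", lowered)
    ]
    # Assumption: ambiguous steps (several key objects) stay unlabeled.
    return matches[0] if len(matches) == 1 else None


labeled = rule_based_label("boiling a cup of water in the microwave for 5 min")
unlabeled = rule_based_label("boiling a cup of water for 5 min")
```

Steps for which the function returns `None` would simply be excluded from the training set, matching the behaviour described for the step without an explicit key object.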
Training: Due to the limited dataset size, we fine-tuned the model using the output classification layer of a 12-layer BERT model for uncased vocabulary, which has been used for generating contextual language embeddings [36]. We used the Adam optimizer with the learning rate, $\epsilon$, and batch size set to $2 \times 10^{-5}$, $10^{-8}$, and 32, respectively, as recommended by Devlin et al. [36]. 80%, 10%, and 10% of the dataset are used for training, validation, and testing, with the number of steps for each key object balanced across the sets.
Model Performance: We demonstrated an overall 83.57% testing accuracy by considering the label generated by our rule-based approach as the ground truth (Fig. 6a). Additionally, we manually labelled the associated key objects on the testing dataset to limit the impact from any errors generated by our rule-based approach, which led to an 82.13% overall accuracy (Fig. 6b).
Model Execution: The pre-trained model is used for predicting key objects from the segmented text of each step. To enhance the accuracy of the predicted key objects, we use prior knowledge provided while specifying the available key objects (Sec. 5.2). Specifically, the final assigned key object is the ML predicted key object with the highest confidence score that also belongs to the set of available key objects of the target environment.
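The model execution rule, picking the highest-confidence prediction among the available key objects, can be sketched as a filtered arg-max. The function name `assign_key_object` and the example confidence values are hypothetical; the classifier that produces the per-label confidences is not modelled here.

```python
from typing import Dict, Iterable, Optional


def assign_key_object(
    confidences: Dict[str, float], available: Iterable[str]
) -> Optional[str]:
    """Return the available key object with the highest confidence score."""
    allowed = set(available)
    candidates = {k: v for k, v in confidences.items() if k in allowed}
    if not candidates:
        # None of the model's labels exists in the target environment.
        return None
    return max(candidates, key=candidates.get)


# Made-up confidences for "boil a cup of water" in an office kitchen
# that has a microwave but no oven.
scores = {"oven": 0.6, "microwave": 0.3, "sink": 0.1}
chosen = assign_key_object(scores, available=["microwave", "sink", "fridge"])
```

In this example the top overall prediction, "oven", is discarded because it is not among the available key objects, so the step is assigned to "microwave".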

CONSUMPTION PIPELINE
The consumption pipeline aims to spatialize each step by anchoring it at the optimal position next to the key object. For example, consider how the instruction "microwave on high for 30 seconds" should be attached to a microwave. An ideal location would be at the front surface of the microwave door. A less ideal location would be at the front of the input panel because the instruction might get in the way when the user tries to set the timer (see examples in Fig. 1c, Fig. 4i, and Fig. 20 in Appendix B.2).

Interaction Design
Our consumption pipeline provides dedicated interaction metaphors based on the preliminary findings (Sec. 3.2).
Navigating Between Individual Steps. Consumers can use hand menus to easily and rapidly switch between steps (Fig. 1b). We adopted the suggestions from [DC1] and the conceptual design of Fig. 3c, which advocate delivering the right level of information only at the right time. Therefore, PaperToPlace only renders the current instruction step along with a task completion progress bar. When a new step is triggered, PaperToPlace first anchors the virtual label in front of the consumer, since consumers usually need to read an instruction step before they proceed to execute it (Fig. 3b).

$p_{eye}$: The midpoint of the left and right eyes in world coordinates.
$d^{f}_{eye}$: The forward direction of the gaze, averaged over the left and right eye gaze.
$a$: Representation of an instruction step placement with respect to an anchoring surface $s$, with indices $i$ and $j$ along its width and height, where $i \in [0, W_s)$ and $j \in [0, H_s)$.
$p_a$: The position of the instruction step in world coordinates.
$\angle(u, v)$: The angle between vectors $u$ and $v$.
Table 1: Notations of the key parameters and functions.
Animating Spatial Guidance. To address [DC3], we decided not to use persistent visual guidance (e.g., virtual arrows) [48,49], which might cause unnecessary visual disturbance. Instead, we use an animated flying effect, where the virtual step "flies" toward the key object after the consumer has initially read the instruction step. This design leverages the fact that a motion effect can direct the consumers' attention, and can implicitly and rapidly offer visual guidance toward the spatialized key object without causing overwhelming disturbance while consumers are executing the steps [45].
Placement of Instruction Steps. PaperToPlace places and anchors the instruction step on one of the anchoring surfaces of the key objects while not occluding the important region. This design emphasizes [DC2], which suggests connecting instructions with real-world contexts, and is convenient when consumers refer back to the instruction step repeatedly while completing it (Fig. 3c). If the consumer dislikes the label position, they can request a new position on demand using a mid-air pinch gesture. We also allow consumers to use pinch-and-drag gestures to manually move the step to their preferred place (Fig. 1d). Such feedback in turn informs future decisions when placing instruction steps.

Problem Formulation
We formulated the process of optimally placing the instruction step as an optimization problem, where we used the tracked hands and gaze, as well as the anchoring surfaces defined by the spatial profile, to search for the optimal placement for each step. Table 1 summarizes the notations of the key parameters.

Representations of an Instruction Step Placement. We first discretized each anchoring surface into a grid of virtual cells along its width and height, where each cell has a dimension of 3 × 3. We assumed that the center of each virtual step should be aligned with the center of a cell on the anchoring surface.
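As an illustration of this discretization, the following minimal sketch maps a cell index to the world-space position of its center on a planar anchoring surface. The parameterization of a surface by an origin corner and two unit axes, the function names, and the unit of `cell_size` are assumptions for illustration rather than details taken from the system.

```python
def grid_shape(width, height, cell_size):
    """Number of whole cells along the width and height of a surface."""
    return int(width // cell_size), int(height // cell_size)


def cell_center(origin, u_axis, v_axis, i, j, cell_size):
    """World-space center of cell (i, j).

    origin: one corner of the surface (x, y, z).
    u_axis, v_axis: unit vectors along the surface width and height.
    """
    du = (i + 0.5) * cell_size  # offset along the width
    dv = (j + 0.5) * cell_size  # offset along the height
    return tuple(o + du * u + dv * v
                 for o, u, v in zip(origin, u_axis, v_axis))
```

A candidate placement then reduces to a surface identifier plus the pair of indices, which keeps the search space finite.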

Representations of the Step Label Rotation. The reading angle is a critical factor when consuming documents [55]. It is therefore important to determine the rotation of the step, such that the text is always perpendicular to the user's looking direction (i.e., the virtual text should be delivered facing toward the user's eyes). To address this, we rotated the anchoring surfaces by using the potential looking direction (i.e., the vector pointing from the eyes to the step position). Algo. 1 shows how we compute the rotation of the step for horizontally placed anchoring surfaces (e.g., a countertop) and vertically placed anchoring surfaces (e.g., the front surface of a fridge). Fig. 7 demonstrates two examples where the steps are appropriately rotated.
Algorithm 1: Computing the rotation of the instruction step.
Importance Map. One goal when placing an instruction step onto the associated anchoring surface(s) is to find the optimal placement that will not occlude the consumer's interactions with the key object. While Lang et al. [51] presumed that the centroid of a key object is the most important area and should not be occluded by virtual MR agents, this assumption is invalid in our problem, as real-world interactions are highly dynamic. For example, while interacting with a microwave, the critical area might be the region on top of the keypad when inputting the cooking time, or the center area when the user is checking whether the food is cooked. We used an importance map to represent the importance of each possible cell on the anchoring surface, where a larger value implies a higher probability that the corresponding position is being interacted with, and therefore should not be occluded by the virtual step. The values of the importance map are determined using near real-time data provided by the MR headset. By leveraging the pre-created spatial profile, we used the tracked gaze and hands to infer the importance of each possible position on the anchoring surface(s). Intuitively, the areas near the hands, which usually indicate the regions that consumers are interacting with, might be more important and should not be occluded by the steps. Sec. 6.3 describes the computation of the importance map.

Importance Map on Anchoring Surfaces
Approximate the Importance of Each Frame. The contextual data of each frame refers to the tracked gaze and hand joints at a specific time instant. Inferring the importance map from the contextual data of a single frame is less reliable, as real-world activities are highly dynamic. Yet, contextual data collected from a set of frames are practically not equally important. Therefore, while aggregating contextual data across a set of frames, it is important to consider the relative importance of individual frames. PaperToPlace uses the tracked eye behaviors to approximate the relative importance of the contextual data of each frame. The first intuition is based on the instantaneous angular speed of gaze, where a slower angular speed implies a higher importance. For example, while inputting the time on a microwave, the consumer might fixate on the keypad, during which the contextual data could offer meaningful clues for approximating the importance map. In contrast, the consumer might rapidly saccade around the environment while finding ingredients, during which the contextual data might be less meaningful. While [56,57] attempted to design closed-form solutions for classifying saccades and fixations using the speed of gaze, such eye behaviours usually vary across users and tasks. Our second intuition is based on the observation that the contextual data from a more recent frame might be more useful for indicating the interactions in the subsequent task episode. Algo. 2 shows the computation of the frame weight, which is used to quantify the relative importance of the contextual data at each frame. Experimentally, we aggregated the 90 most recent frames, which is approximately 1 second of past contextual data. We approximated the frame weight based on the instantaneous angular speed of the left and right gaze and the timestamp of each frame. Notably, slower eye movements (i.e., smaller angular speeds of the left and right gaze) and a more recent timestamp lead to a larger frame weight. Algo. 2 relies on two normalization functions: a min-max normalization (to [0, 1]), and a normalization such that the list sums to 1.
Algorithm 2: Approximating the importance of an individual frame.
Approximate the Importance Map from Tracked Hands. To compute the overall importance map from a set of frames, we first approximated the importance map from the contextual data collected at each frame. This is realized in Algo. 3, which approximates it from the projected points of the 15 tracked hand joints [10] on the associated anchoring surfaces. Fig. 8d shows an example of the importance map from a single frame, approximated from the tracked hands in Fig. 8b.
Generate the Overall Importance Map. While we assumed that the areas that are not occluded by hands might be less important, the importance assigned to each cell on the anchoring surfaces might not be equal. For example, while inputting the time using the keypad of a microwave, the further the step is placed from the areas occluded by hands, the lower the probability that it occludes the important areas. Therefore, the importance map should model how important a specific pixel on the anchoring surface is, instead of whether that pixel is important. Inspired by Lang et al. [51], we approximated the importance at each possible placement on the anchoring surface(s) generated by the left and right hand respectively (Algo. 3). Eqn. 1 describes this process in terms of the L2-distance from a placement to the closest placement(s) where the computed importance of the individual frame (Algo. 3) is 1.
The overall importance map is finally computed by aggregating the maps generated by the left and right hands on each frame using the frame weights previously computed in Algo. 2 (see Algo. 3). Notably, if no area is occluded by the hands, the importance map is assigned a preset value. Fig. 8e shows an example of the overall importance map.
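The two intuitions behind the frame weight and the distance-based importance can be combined in a short sketch. This is an illustrative approximation, not a reproduction of Algo. 2 and Algo. 3: the additive combination of gaze slowness and recency, the linear decay with L2-distance, and taking the maximum over the two hands are all assumptions, and all function names are hypothetical.

```python
import math


def min_max(values):
    """Min-max normalize a list to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sum_to_one(values):
    """Scale a list so that it sums to 1 (uniform if the total is zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values) for _ in values]
    return [v / total for v in values]


def frame_weights(speed_left, speed_right, timestamps):
    """Slower gaze and more recent frames receive larger weights."""
    mean_speed = [(l + r) / 2.0 for l, r in zip(speed_left, speed_right)]
    slowness = [1.0 - s for s in min_max(mean_speed)]
    recency = min_max(timestamps)
    return sum_to_one([s + r for s, r in zip(slowness, recency)])


def hand_importance(hand_mask):
    """Per-frame map: 1 on cells covered by a hand, decaying with L2-distance."""
    rows, cols = len(hand_mask), len(hand_mask[0])
    hand_cells = [(r, c) for r in range(rows) for c in range(cols)
                  if hand_mask[r][c]]
    if not hand_cells:
        return [[0.0] * cols for _ in range(rows)]
    dist = [[min(math.hypot(r - hr, c - hc) for hr, hc in hand_cells)
             for c in range(cols)] for r in range(rows)]
    d_max = max(max(row) for row in dist)
    if d_max == 0:
        return [[1.0] * cols for _ in range(rows)]
    return [[1.0 - d / d_max for d in row] for row in dist]


def overall_importance(left_masks, right_masks, weights):
    """Weighted aggregation over frames, normalized to [0, 1] at the end."""
    rows, cols = len(left_masks[0]), len(left_masks[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    for left, right, w in zip(left_masks, right_masks, weights):
        li, ri = hand_importance(left), hand_importance(right)
        for r in range(rows):
            for c in range(cols):
                total[r][c] += w * max(li[r][c], ri[r][c])
    flat = min_max([v for row in total for v in row])
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]
```

In this sketch a frame recorded during a fixation near the end of the window dominates the aggregate, which matches the stated intent that recent, stable gaze should count more.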

Constraints and Costs
To solve the optimization problem of placing the instruction step on the anchoring surface(s), we need to model the constraints such that the placed step minimally occludes the user's view and is not too far from the user's focus of attention.
Total Cost. We designed the overall cost ($C(a)$, Eqn. 2) as a weighted sum of the visibility cost ($C_v(a)$), the readability cost ($C_r(a)$), the hand angle cost ($C_h(a)$), and the preference cost ($C_p(a)$). $w_v$, $w_r$, $w_h$, and $w_p$ are the weights associated with each cost. Experimentally, we set them to 0.24, 0.24, 0.24, and 0.28, respectively, with a slight emphasis on consumers' preference. We provide the rationale behind the design of each cost below.
$$C(a) = w_v C_v(a) + w_r C_r(a) + w_h C_h(a) + w_p C_p(a) \tag{2}$$
(i) Visibility Cost. The visibility cost measures how much of the key areas of the anchoring surfaces is occluded by a step placement, and penalizes placements in which the step occludes important areas. Eqn. 3 defines the visibility cost in terms of an occlusion map and the relative importance of each discretized cell on the anchoring surfaces computed by Algo. 3. Notably, the occlusion map is assigned 1 at a pixel when that pixel is occluded by the step as seen from the center of both eyes (Fig. 9). Fig. 9b shows an example occlusion map when the instruction step is placed next to the wall behind the sink. We finally normalized the visibility cost to be independent of the dimensions of the anchoring surface(s).
(ii) Readability Cost. We penalized solutions in which the delivered step is too far from the user's attention. To model this constraint, we used the eye tracking results and measured the L2-distance from the placed step to the weighted average of the looking direction of both eyes. Eqn. 4 defines the readability cost, which is normalized by the maximum L2-distance between two arbitrary solutions on the anchoring surfaces, computed as the maximum distance within the convex hull consisting of all vertices of the anchoring surfaces. Additionally, the readability cost needs to enforce that the instruction step is placed within binocular vision (i.e., approximately ±60°) [58], to minimize the need to move the head in order to read the instructions. We used a coefficient to penalize the cost function, which is set to 1 if the angle between the direction from the eyes to the step and the forward direction of the gaze is below 60°; otherwise, the coefficient is set to that angle.
(iii) Hand Angle Cost. We modeled the observation that instruction documents are usually held by and placed in front of the consumers' hands (Fig. 2). We first computed the angle between the forward direction of the hand and the direction vector pointing from the hand to the attempted solution, for the left and right hand respectively, on each frame. We then formulated the overall hand angle cost by aggregating the angle costs generated by each frame (Eqn. 5).
(iv) Preference Cost. While PaperToPlace could determine the optimal step placement based on near-real-time context data, we retained the flexibility for users to specify their preferred placements. Eqn. 6 defines the preference cost in terms of the preferred step placement, in world coordinates, for a given step. We used a designated empty value to indicate that the user has not manually fixed the step placement. Similar to the readability cost, we used the maximum L2-distance to normalize the preference cost.
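The structure of the total cost can be sketched as follows. Only the weighted sum and the four weight values come from the text; the specific form of `visibility_cost` (mean of occlusion times importance), the use of degrees for the penalty coefficient, and all names are assumptions made for illustration.

```python
import math

# Weights reported in the text, with a slight emphasis on preference.
WEIGHTS = {"visibility": 0.24, "readability": 0.24,
           "hand_angle": 0.24, "preference": 0.28}


def total_cost(costs, weights=WEIGHTS):
    """Weighted sum of the four individual costs for one placement."""
    return sum(weights[name] * costs[name] for name in weights)


def visibility_cost(occlusion, importance):
    """Assumed form: importance summed over occluded cells, per cell."""
    pairs = [(o, m) for o_row, m_row in zip(occlusion, importance)
             for o, m in zip(o_row, m_row)]
    return sum(o * m for o, m in pairs) / len(pairs)


def angle_deg(u, v):
    """Angle between two 3D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def binocular_penalty(eye_to_step, gaze_forward, limit_deg=60.0):
    """1 inside the binocular field; otherwise the angle itself (assumed degrees)."""
    angle = angle_deg(eye_to_step, gaze_forward)
    return 1.0 if angle < limit_deg else angle
```

Because the weights sum to 1, the total stays in [0, 1] whenever each individual cost is normalized to [0, 1].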

Optimizations
We aimed to minimize the total cost by searching for the optimal placement $\hat{a}$ for a specific step (i.e., $\hat{a} = \arg\min_a C(a)$). Finding the optimal placement by computing the cost for every possibility is impractical due to unnecessary latency and computational overhead. We instead used simulated annealing to approximate the global optimum [50]. Notably, to ensure that the sampled neighbouring solution is on the target anchoring surface, Eqn. 8 should be satisfied, where $i_{k+1}$ and $j_{k+1}$ are the cell indices of the attempt at iteration $k+1$, and $W_s$ and $H_s$ denote the numbers of cells along the width and height of the anchoring surface.
$$0 \le i_{k+1} \le W_s, \quad 0 \le j_{k+1} \le H_s, \quad i_{k+1}, j_{k+1} \in \mathbb{N} \tag{8}$$
Although this method performs well while making moves within a single anchoring surface, the attempted solutions will not move across surfaces. To address this, when Eqn. 8 is violated, we choose the placement on the neighbouring anchoring surface that is closest to the current placement attempt. Fig. 10a demonstrates an example of how the neighbouring solution is chosen when a placement moves across surfaces. To prevent converging at a local minimum, we choose a random move in the global search space if $C(a_{k+1}) > C(a_k)$ [50].
Metropolis-Hastings Sampling and Simulated Annealing. We first selected a random placement on one of the anchoring surface(s) and used the Metropolis-Hastings algorithm [61] to sample the subsequent attempt. The probability of accepting the new attempt $a_{k+1}$ is determined by the Metropolis criterion. Eqn. 9 defines the transition kernel of the Markov chain, where $T(k)$ indicates the temperature, which decays over the iterations.
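The search described above can be illustrated with a generic simulated annealing loop over a single discretized surface. This is a simplified sketch rather than the system's implementation: the geometric cooling schedule, the four-neighbour moves, the rejection of out-of-bounds moves (instead of crossing to a neighbouring surface), and the `cost` callable are all assumptions.

```python
import math
import random


def simulated_annealing(cost, width, height, iterations=500,
                        t0=1.0, decay=0.98, seed=0):
    """Approximate argmin of cost((i, j)) over a width x height grid."""
    rng = random.Random(seed)
    current = (rng.randrange(width), rng.randrange(height))
    best = current
    temperature = t0
    for _ in range(iterations):
        di, dj = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        candidate = (current[0] + di, current[1] + dj)
        in_bounds = 0 <= candidate[0] < width and 0 <= candidate[1] < height
        if in_bounds:
            delta = cost(candidate) - cost(current)
            # Metropolis criterion: always accept improvements, and accept
            # worse moves with probability exp(-delta / temperature).
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current = candidate
                if cost(current) < cost(best):
                    best = current
        temperature *= decay  # assumed geometric cooling schedule
    return best
```

Early iterations accept many worse moves and explore the surface, while later iterations behave almost greedily, which is why far fewer evaluations are needed than with an exhaustive scan of every cell.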
Notably, we used an empirical definition of the temperature $T(k)$. To better demonstrate the merits, we compared a greedy approach that tries every possibility (Fig. 10b) with the simulated annealing approach (Fig. 10c). Fig. 10a visualizes the total cost at each pixel on the anchoring surfaces with the greedy approach, where a darker area indicates a lower cost. We showed that the greedy approach and the simulated annealing approach need to make 2650 and 49 attempts, respectively, before finding the global minimum.
author such MR experience with both interfaces rapidly (Fig. 11e), PA10 held a neutral opinion, as she expected a fully automated system to extract the key objects associated with each instruction step. Overall, 10 participants were satisfied with the MR experience they authored (Fig. 11e). Examples include: "a cool way to transfer knowledge" (PA2) and "useful to see [the instruction] while cooking" (PA12). A few participants gave reasons for not being fully satisfied with the authored MR experience. For example, "I think I made some mistake when I author it. So it guided me to the wrong spot" (PA5) and "I have to do [the tasks with authored experience] by myself, to see what it is like for me to experience that first before having a novice do it so" (PA12). These comments imply a lack of ways for authors to iteratively revise the authored experience inside MR.
Can ML Support a Faster Authoring Experience (RQ2)? The RM-ANOVA showed a reduced overall TCT ($F_{1,22} = 16.66$, $p < .001$, $\eta_p^2 = .43$, Fig. 11c) and average TCT (ART: $F_{1,22} = 35.98$, $p < .001$, $\eta_p^2 = .77$, Fig. 11d) for authoring each step while using the ML-assisted mode, compared to the manual approach. Most participants echoed these observations. For example: "it could help me saving time, because [the key objects have] already [been] filled out [...] it speeds up by maybe a half a second [for each instruction]" (PA3) and "it helps me to save a lot of time! And that makes it a lot more convenient!" (PA4).
Can ML Support an Easier Authoring Experience (RQ3)? The RM-ANOVA demonstrated a higher SUS score ($F_{1,22} = 8.65$, $p = .008$, $\eta_p^2 = .28$, Fig. 11a) and a lower TLX score (ART: $F_{1,22} = 17.24$, $p = .002$, $\eta_p^2 = .61$, Fig. 11b) for the ML-assisted mode, compared to its manual counterpart. Most participants appreciated the convenience and helpfulness brought by the predicted key objects. First, nearly all participants believed that the ML helped reduce the effort of tagging key objects. For example: "ML brought less effort, I just need to check if the predicted key object is correct or not [...] Even if I still need to check it, I don't have to pay 100% of attention. I don't have to do all the thing. I just have to do part of the thing" (PA3). In particular, the real-time prediction of the new key object while modifying a specific step was favored by some participants. For example: "when I saw one of the step to be very long and [are associated with two key objects] [...] capable of predicting key object after being modified is obviously helpful! And also the opposite feature where you could just like combine two tasks, followed by generating predicted key object! [...] it gives more flexibility while segmenting the instruction step" (PA10).
Second, some participants highlighted how the ML-assisted mode helped their mental thought process, e.g., "[...] a mental map of how I will spatially move across at different instances [by looking at the predicted key objects]" (PA4) and "it could help me make a decision" (PA3). In particular, PA5 praised the real-time prediction of the key object while revising the instruction: "[while adding or editing the steps based on existing paper instruction], the real-time predicted location could help create a more clear instruction step. For example, if I type 'heat the water', and the predicted location is oven for somehow, then I might just type 'heat the water in the microwave' to make the instruction more clear". However, PA10 held an opposite opinion: "I just read the instruction step and then check if this assigned [key object] was countertop or not, and changed it to countertop rather than going and checking the other options. [...] but I did not read [the predicted key object] first." Finally, a few participants pointed out the merits of using a color scale to visualize the confidence of the predicted key object. For example: "color could be helpful for conveying uncertainty" (PA9) and "color confidence is important to me. If it's red, I would be more aware of checking whether this is actually correct or not" (PA12).

User Study 2: Consumption Pipeline
The second study aims to evaluate the consumption pipeline and attempts to address: "how could PaperToPlace help consumers complete the designated activities faster and more easily?"
Participants and Procedures. PC1-PC12 (age: M = 27.83, SD = 6.55, incl. six males and six females) were recruited. None of the participants were familiar with the designated tasks. We also built a baseline experience where consumers could read the existing monolithic instruction document inside MR (Fig. 12). Instead of asking participants to read a paper document, rendering a virtual monolithic document attached to the touch controller minimizes the impact of confounding factors caused by the discomfort of the headset. We designed two counterbalanced sessions (Fig. 15) to minimize the impact of prior learning experience and task familiarity. Before each session, T1 was used to help participants learn and become familiar with the system. During the session, participants were instructed to complete T2/T3 using the baseline and PaperToPlace. Participants were invited to fill out the NASA TLX [46] and SUS [26] questionnaires at the end of each session, followed by a semi-structured interview.
Measures. To understand the performance of context switching, we defined each episode as the interval between the time when participants stopped a task to seek instructions and when they returned to the task.
Results and Discussions. Overall, most participants believed that PaperToPlace could help consumers complete the designated tasks faster and more easily (Fig. 13c). Quantitatively, we demonstrated a higher overall SUS score ($F_{1,22} = 4.44$, $p = .046$, $\eta_p^2 = .17$, Fig. 13a) and a lower perceived workload (ART: $F_{1,22} = 18.52$, $p = .001$, $\eta_p^2 = .63$, Fig. 13b). Based on participants' feedback, we now discuss how PaperToPlace could help participants complete the designated tasks faster and more easily. First, most participants explicitly highlighted the benefits of finding optimal placements without occluding the key interaction areas. For example: "it is useful to bring the instruction step to me by just a pinch." (PC10), "I like the position of instruction step!" (PC4), and "I think it is useful! And especially the function of where you pinch again, it will move to another location, so it can ensure [the step] will never block your sight" (PC8). More examples of instruction step placements can be found in Fig. 20e-h. However, PC6, who disclosed her ADHD [34], suggested: "I was distracted with [the PaperToPlace], because [when the step occasionally was not anchored on the optimal position] the information was here and I was there. It reminded me of the moments that I forgot what I was supposed to do, or what I have to do." Second, most participants acknowledged the helpfulness of establishing connections between instructions and key objects. For example: "I like how it took me to the sink because this activity has to be near the sink. That's a very helpful on spatial understanding!" (PC4), "Although we know where the fridge is, having that is really convenient to just not give anything a thought and do things as per the instructions" (PC10).
Finally, five participants also mentioned the hands-free merits of PaperToPlace, compared to the baseline where consumers need to hold the virtual document. For example: "[baseline] is more cumbersome, because I need to free one hand and make the hand very clean to make sure that the hand is clean to touch the controller" (PC5).
Segmented Instructions Help Find the Relevant Information More Easily. All participants suggested that the segmented instructions are helpful, e.g., "[instructions] need to be as concise and as short as possible to be read at the same time. [PaperToPlace] did its job!" (PC5). Most participants pointed out the merits of reduced stress while translating instructions into real-world activities. For example: "I have more calmness [with PaperToPlace], because [with baseline] I was seeing everything all at once, and that was giving me the feeling I'm in a hurry." (PC7), and "looking at the entire document at once was so hard that I forgot where I have to keep following from start to finish to find where I was. But [PaperToPlace] gave me one by one instructions, which is super easy!" (PC3)

LIMITATIONS AND FUTURE WORKS
We identify our limitations from four perspectives.
(1) Enabling an Iterative Authoring Process. We observed that during authoring, participants tended to segment the instruction steps by only considering whether a single key object is associated with the specific step, without weighing other factors (e.g., the density of the information contained in a single step when viewed inside MR). However, this cannot ensure a satisfactory MR experience from the perspective of the consumers. Future work might investigate an iterative authoring workflow that allows authors to refine their authored document profile inside MR while piloting the created instructional MR experience.
(2) Transforming Richer Metadata into MR Experiences. We consider that the metadata of each instruction step only contains the text of the step and the key objects that the virtual step should be anchored on (Sec. 4.1). This might not be realistic for real-world instruction documents with heterogeneous kinds of metadata, such as duration information, caveats that usually require the consumers' attention, and notifications from environmental sensors. For example: "I would like to have warning text, like 'do not use detergent', maybe show up in a different color or something" (PC4). Future research might investigate the richer metadata that needs to be augmented inside the MR experience, and methods that use existing language models to extract such metadata and transform it into spatialized and context-aware MR experiences.
(3) Automatically Switching between Instruction Steps and Triggering the Position Update of the Instruction Label. We currently require consumers to explicitly click the virtual button to switch to the next instruction step, and to pinch to update the current position of the instruction step on demand (Sec. 6.1). While participants with some prior MR experience (e.g., PC10) felt it is "easy and useful" to use the virtual hand menu and pinch gesture, others (e.g., PC1) were frustrated by occasional failures of pinch gesture detection and virtual button clicking. Future work might consider designing a state machine that specifies how to switch to the subsequent step automatically based on the user's activities, which might be inferred from face (e.g., [28]), body (e.g., [5]), and environmental (e.g., [20,24,27]) sensor data.
(4) Supporting a Broader Range of Applications. While many participants believed "cooking demonstrates [PaperToPlace] very well" (PC4), we only evaluated cooking instructions, due to the poor quality of the passthrough capabilities of Quest Pro [11], the limited availability of datasets to fine-tune a language model for alternative instruction activities, and limited study resources. Future work might explore other activities, with more powerful language models such as GPT and prompt engineering techniques being used to create document profiles. Participants also emphasized the value of adaptive placements (e.g., "the adaptive placement of instruction is definitely useful for paper cutting! I don't want to cut my hand. And I wanted the instruction to be always besides my hand" (PC9)) and reduced context switching (e.g., "in the gym, where I need some instructions to teach me how to use the equipment, such as how you hold the gears with good postures. [...] with [baseline], it is less efficient and [I have] to stop in between and read the instructions" (PC2)), which might be transferred to other activities. Another direction is to investigate support for finer-grained tasks that involve moving objects, leading to a dynamically changing spatial profile. For example, "if you are doing PCB soldering, it might be hard to track that tiny component and to pinpoint the exact location on the PCB board. But if [PaperToPlace] can do that, it will be super helpful!" (PC8). This requires high-quality passthrough and the capability to track the real-time location of the electronic components, which are considered non-static key objects. Instead of using Quest Pro, future researchers might consider a more recent higher-end headset, e.g., Vision Pro [18].

CONCLUSION
We present and evaluate PaperToPlace, comprising an authoring pipeline, which allows authors to rapidly transform existing paper instructions into an MR experience, and a consumption pipeline, which enables consumers to view spatialized instructions using a context-aware approach. Two within-subject studies with two different cohorts of 12 participants demonstrate the usability and effectiveness of the proposed authoring and consumption workflows.

Step Descriptions
• Spray microwave-safe container (e.g., mug, ramekin, or egg cooker) with cooking spray or wipe lightly with vegetable oil.
• Whisk eggs, milk, salt and pepper in container (or whisk ingredients in another bowl and pour into microwave container). If using a mug or ramekin, cover with plastic wrap, pulling back small area for venting. If using an egg cooker, place lid on cooker base, lining up notches. Twist to secure.
• Microwave on Medium-High (70% power) for 90 seconds, stirring several times during cooking.
• Cover and let stand for 30 seconds to 1 minute before serving. Eggs will look slightly moist, but will finish cooking upon standing.

A EXPERIMENTAL TASKS
The selected tasks for the final user study (Sec. 8) include:
• (T1) Microwave Scrambled Eggs [4];
• (T2) Quick Microwave-Poached Eggs on Avocado Toast [15];
• (T3) Instant Mac 'n' Cheese [1].
Table 2, Table 3, and Table 4 list the specific instruction steps of the experimental tasks T1, T2, and T3, respectively. Notably, T1 was used as the training task for participants to learn and become familiar with the interfaces (Table 2). Assuming the results from OCR are fully correct (i.e., all text of all instruction steps could be successfully extracted), the overall accuracies of the associated key objects predicted by our pre-trained language model for the experimental tasks are 50% (T1), 84.62% (T2), and 80.00% (T3). Notably, the overall accuracies for T2 and T3 are close to our benchmark result obtained while fine-tuning the BERT model in Sec. 5.3, which is 82.13%. This suggests that the results yielded by the evaluation of the authoring pipeline (Sec. 8.1) are generalizable to some extent. Although the overall accuracy of T1 is far lower than our benchmark result due to the relatively short instruction document, the instructions of T1 were only used for participants to familiarize themselves with the given interfaces (either on iPad or inside MR), and the data yielded by T1 was excluded from our evaluation results in Sec. 8.

B USER STUDY RESULTS
This section presents supplementary material for Sec. 8. Fig. 15 shows the specific tasks and interface conditions that were assigned while evaluating the authoring and consumption pipelines. Notably, T1 was used for training purposes.
We also provide visualizations of the survey responses for the authoring pipeline evaluation (Sec. B.1) as well as the consumption pipeline evaluation (Sec. B.2).

B.1 Evaluations of Authoring Pipeline
Fig. 16 demonstrates the NASA TLX responses for each perceived workload from participants PA1-PA12, while using the manual and ML-assisted interfaces to extract the document profile from the designated paper instructions. Fig. 18 provides the supplementary survey results of the SUS questionnaires. For Fig. 16 and Fig. 18, we use "Manual" and "ML" to indicate the interface conditions in which the manual and ML-supported approaches, respectively, are used to extract the associated key objects from each instruction step.

Step Descriptions
• In a small bowl, combine the basil leaves and sea salt, and set aside.
• Take a tomato from the fridge.
• And gently clean a tomato with your hand to help remove dirt and bacteria. Do not use detergent, soap, or bleach.
• Then cut the tomato into parallel thin slices working from the top of the tomato towards the bottom and set aside.
• Take an avocado from the fridge.
• Cut and mash the avocado in a small bowl. Squeeze the lemon into the avocado and spread liberally on the toast. And put tomato on top of the toast.
• Crack 1 egg into the mug, cover with a small plate, and microwave on high for 30 seconds.
• Take the mug out of the microwave, lift the plate carefully (to let steam escape) and check the egg.
• If the white is not completely set, cover and continue to microwave in 10-second intervals until the egg white is opaque. (The time varies with the power of the microwave and may take up to 60 seconds).
• Carefully pour off the water in the mug, using a slotted spoon to keep the egg from falling out.
• Transfer the egg to one of the slices of avocado toast.
• Sprinkle the toasts with the seed mixture and serve immediately.
Table 3: Experimental cooking recipe for making an avocado toast with a microwave-poached egg (T2).

Step Descriptions
• Find a mug that holds twice the volume of your dry pasta - the bigger, the better.
• Add the macaroni.
• Add some water.
• Cover with cling film and pierce 3 times.
• Stand the mug in a microwave-proof bowl to catch any spillages, and cook in the microwave on high for 2 minutes. The liquid will bubble up and over the sides, so tip any liquid from the bowl back into the mug (be careful as it will be very hot) and give it a good stir.
• Leave to stand for 1 minute.
• Repeat twice more or until the pasta is cooked (it may take longer depending on the pasta).
• Then remove from the microwave.
• Repeat twice more or until the pasta is cooked (it may take longer depending on the pasta).
• Stir through the butter, cheese and spinach or Marmite, if using.
• The heat from the pasta should melt the cheese and wilt the spinach, but if not, pop back in the microwave for 30 seconds.

C CODEBOOK AND THEMES FROM QUALITATIVE DATA ANALYSIS
We used thematic analysis [25], together with deductive and inductive coding [39], to analyze the qualitative data collected from the preliminary needs-finding study (Sec. 3) and the final user study (Sec. 8). As part of the supplementary material, we attach the resulting codebooks in Fig. 21, Fig. 22 and Fig. 23. Notably, "Count" refers to the number of quotes for each theme or code. It is also possible that multiple codes are assigned to one quote.

D ETHICAL DISCLAIMER
This work has been approved by the Institutional Review Board (IRB). All Personally Identifiable Information (PII), such as faces, has been intentionally removed (e.g., pixelized or blurred) in this manuscript as well as in all accompanying videos. Before each user study, we obtained participants' consent for video and audio recordings, as well as for the collection of heterogeneous behavior data. All participants have consented to and acknowledged that the data and results presented in this manuscript will be published and presented publicly for research purposes. While monetary incentives were not provided, all participants had the opportunity to try out and experience state-of-the-art MR technologies, and to learn more about our research.
Figure 21: The codebook that resulted from our qualitative analysis of interview data for the preliminary needs-finding study. "Count" refers to the number of quotes for each theme or code. It is possible that multiple codes are assigned to one quote.
Codebook entries (theme or code, count): Needs of multimedia support, 1; Benefits of deciding optimal placements of instruction step for the discussed applications, 8; Improvements of current system for future applications, 5.
Figure 23: The codebook that resulted from our qualitative analysis of interview data for the consumption pipeline evaluations. "Count" refers to the number of quotes for each theme or code. It is possible that multiple codes are assigned to one quote.

Figure 1: Overview of PaperToPlace: (a) The author creates an MR experience by taking a snapshot of a paper document, with an optional ML-supported pipeline for associating key objects with each instruction step; (b) The consumer can browse the spatialized instruction steps using a hand menu; (c) The step is placed at an optimal location to minimize context switching and prevent occlusion of important interaction areas (e.g., not occluding the touchpad while setting the time on a microwave); (d) The consumer can "pinch-and-drag" the step to refine the system placement. Steps (b-d) show the first-person MR view.

Figure 2: Preliminary needs-finding tasks. (a) Making a cup of coffee using a coffee machine (PP1). (b) Making a chocolate cake in a mug with a microwave (PP3).

Figure 3: Preliminary needs-finding study results. (a) Example context switching while PP2 was attempting to map instructions to real-world objects. (b) The annotated timestamps showing participants' current focus as either the document or real-world activities. (c) PP1's design for an instructional MR experience while using a coffee machine.
Figure 4 panel labels: (a-d) Reuse and capture the paper instruction document; (g) Extract Document Profile; (h) Spatial Profile; (i) Connecting Step with Key Object, shown as (i1) Third-Person View and (i2) First-Person MR View.
Document profile annotations: Merging: merge the current step into the previous step; Segmenting: add an empty "next" step to segment the current step; Deleting: delete the current step; the text of the step; the key object that the instruction step is associated with, where the color of the text field indicates the confidence of the ML prediction. The spatial profile contains the transforms of anchoring surfaces, along with the tags of associated key objects.

Figure 4: PaperToPlace system overview. We assume a spatial profile (h, red block) was pre-created. The author uses the authoring pipeline (a-g, blue blocks) to extract the document profile for the MR experience. With the consumption pipeline (i, green block), the instruction steps are displayed based on the environmental context (loaded via the spatial profile) and the user's context.

Figure 5: (a) Creation of an anchoring surface, visualized as a semi-transparent mask, using touch controllers. (b) Examples of anchoring surfaces in our experimental kitchen. Both scenes were captured as first-person MR views.

Figure 6: Confusion matrices of the fine-tuned BERT model, with the ground truth generated by the rule-based method (a) and by manual labeling (b).
Notation: $N_w$, $N_h$: number of discretized cells along the width and height of the anchoring surface $s$. $q_s$: rotation, represented by a quaternion, of the surface $s$. $\mathbf{u}_s$, $\mathbf{r}_s$, $\mathbf{f}_s$: the up, right, and forward directions of the surface $s$.
We use $P = (s, i, j)$ to indicate the placement of a step, where $i$ and $j$ indicate the index of the cell along the width and height of the surface $s$. The world position of the attempted step placement is $\mathbf{p}_P = \mathbf{p}_s^{tl} + 0.03 \cdot i \cdot \mathbf{r}_s - 0.03 \cdot j \cdot \mathbf{u}_s$, where $\mathbf{p}_s^{tl}$ is the position of the top-left vertex of surface $s$.
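The mapping from a cell index on an anchoring surface to a world position can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `placement_to_world`, its parameter names, and the use of NumPy are assumptions, and the interpretation of 0.03 as the cell edge length in metres is also an assumption. Only the structure of the formula (top-left vertex, plus the width index along the right direction, minus the height index along the up direction) is taken from the text.

```python
import numpy as np

CELL_SIZE = 0.03  # constant from the formula; treating it as metres is an assumption


def placement_to_world(top_left, right, up, i, j, cell_size=CELL_SIZE):
    """Return the world position of cell (i, j) on an anchoring surface.

    top_left: world position of the surface's top-left vertex.
    right, up: unit vectors giving the surface's right and up directions.
    i, j: cell indices along the width and the height of the surface.
    """
    top_left = np.asarray(top_left, dtype=float)
    right = np.asarray(right, dtype=float)
    up = np.asarray(up, dtype=float)
    # Move i cells to the right and j cells downward (against the up vector).
    return top_left + cell_size * i * right - cell_size * j * up


# Hypothetical vertical surface whose top-left corner is 1 unit above the origin.
pos = placement_to_world([0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], i=2, j=3)
```

Because the origin is the top-left vertex, increasing the height index moves the placement downward, which is why the second term is subtracted.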

Figure 8: Examples of importance map. (a) Third-person view; (b) First-person view through MR; (c) First-person view of the key object (sink) containing four anchoring surfaces; (d) Importance map from a single frame; (e) Overall importance map from a set of frames. The red and blue points in (d) and (e) indicate the projected key joints of the tracked left and right hands on the anchoring surfaces, respectively.
Algorithm 3: Algorithm for computing the importance map (function GetMap).

$$I = \begin{cases} 1, & \text{if the pixel is occluded by the placed step,} \\ 0, & \text{if the pixel is not occluded by the placed step.} \end{cases}$$
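The binary occlusion definition above can be illustrated with a small sketch. It assumes the anchoring surface is discretized into a rectangular pixel grid and that the placed step covers an axis-aligned rectangle on it; the function name `occlusion_map`, its parameters, and the overlap score computed at the end are hypothetical and are not taken from the paper.

```python
import numpy as np


def occlusion_map(height, width, top, left, step_h, step_w):
    """Binary map: 1 where a pixel is covered by the placed step, else 0."""
    occ = np.zeros((height, width), dtype=np.uint8)
    # Clamp the step rectangle to the surface bounds before filling it.
    r0 = min(max(top, 0), height)
    r1 = min(max(top + step_h, 0), height)
    c0 = min(max(left, 0), width)
    c1 = min(max(left + step_w, 0), width)
    occ[r0:r1, c0:c1] = 1
    return occ


# A 4x6 surface with a 2x3 step whose top-left corner is at row 1, column 2.
occ = occlusion_map(4, 6, top=1, left=2, step_h=2, step_w=3)

# Assumed use: weight the occluded pixels by an importance map to score a placement.
importance = np.zeros((4, 6))
importance[1, 2] = 0.5
overlap = float((occ * importance).sum())
```

Multiplying the two maps element-wise is one plausible way to penalize placements that cover frequently used interaction areas; the cost actually used by the system may be defined differently.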

Figure 9: Example of occlusion map. (a) An example placement of the instruction step anchored next to the sink; (b) Visualization of the generated occlusion map, see Fig. 8c for the corresponding real-world scene.
Figure 10: (a) Example placement attempts (the green trace). The darker color of the anchoring surface indicates a lower   ; (b-d) Example optimized cost over each iteration using the greedy algorithm (b) and the simulated annealing approach (c). To increase readability,  scale is applied for -axis. We only show the traces before the current cost reaches the global minimum.
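The difference between the two search strategies named in the Figure 10 caption can be sketched on a toy cost grid. This is a generic illustration of greedy descent and simulated annealing, not the paper's optimizer: the 4-neighbour move set, the geometric cooling schedule, the parameter values, and the example cost grid are all assumptions.

```python
import math
import random


def neighbors(cell, rows, cols):
    """4-connected neighbours of a cell that stay inside the grid."""
    r, c = cell
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(a, b) for a, b in cand if 0 <= a < rows and 0 <= b < cols]


def greedy(cost, start):
    """Move to the cheapest neighbour until no neighbour is cheaper."""
    rows, cols = len(cost), len(cost[0])
    cur = start
    while True:
        best = min(neighbors(cur, rows, cols), key=lambda n: cost[n[0]][n[1]])
        if cost[best[0]][best[1]] >= cost[cur[0]][cur[1]]:
            return cur
        cur = best


def simulated_annealing(cost, start, t0=1.0, cooling=0.95, steps=500, seed=0):
    """Accept worse moves with probability exp(-delta / t); return the best cell seen."""
    rng = random.Random(seed)
    rows, cols = len(cost), len(cost[0])
    cur = best = start
    t = t0
    for _ in range(steps):
        nxt = rng.choice(neighbors(cur, rows, cols))
        delta = cost[nxt[0]][nxt[1]] - cost[cur[0]][cur[1]]
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur = nxt
            if cost[cur[0]][cur[1]] < cost[best[0]][best[1]]:
                best = cur
        t = max(t * cooling, 1e-9)  # geometric cooling with a floor to avoid division by zero
    return best


# Toy grid: (0, 1) is a local minimum, (2, 2) is the global minimum.
cost = [[2, 1, 4],
        [3, 9, 5],
        [6, 7, 0]]
stuck = greedy(cost, (0, 0))
annealed = simulated_annealing(cost, (0, 0))
```

On this grid the greedy search stops at the local minimum next to its start, whereas simulated annealing can occasionally accept a worse cell and may therefore reach a cheaper placement, at the price of more iterations.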

Figure 11: Results of authoring pipeline evaluations. (a-d) The overall SUS, weighted TLX scores, task completion time (TCT) of extracting document profiles, and the average task completion time for deciding each instruction step with segmented instruction steps; (e) Survey results of how participants assessed the overall authoring pipeline and the authored MR experience.

Figure 12: Baseline scene. Third-person (a) and first-person view through MR (b).

Figure 13: Consumption pipeline evaluation results. (a-b) Overall SUS and TLX scores; (c) Survey results of whether participants considered the overall consumption experience of PaperToPlace to be faster and easier than the baseline. "B/L" and "P2P" indicate the baseline and PaperToPlace conditions.

Figure 14 :
Figure 14: Results of context switching evaluations. (a) The total number of context switches while using the monolithic document and PaperToPlace. The average time (b), $d_h$ (d), and $\theta_h$ (f) during each episode; the total time (c), $d_h$ (e), and $\theta_h$ (g) during all episodes while completing the task. "B/L" and "P2P" indicate the baseline and PaperToPlace conditions.

... to the task. During each episode, we analyzed (i) the time; (ii) the path distance of head movement ($d_h$); and (iii) the angular change of the forward direction of the head ($\theta_h$). Data outside these intervals are out of our scope, as the performance of real-world activities could be affected by participants' prior cooking experience. The same approaches as in Sec. 8.1 were used to analyze the questionnaire responses and participants' qualitative feedback. The study took 59.40 min on average (SD = 5.80 min).

Results and Discussions. Overall, most participants believed that PaperToPlace could help consumers complete the designated tasks faster and more easily (Fig. 13c). Quantitatively, we found a higher overall SUS score ($F_{1,22} = 4.44$, $p = .046$, $\eta_p^2 = .17$, Fig. 13a) and a lower perceived workload (ART: $F_{1,22} = 18.52$, $p = 0.001$, $\eta_p^2 = .63$, Fig. 13b). Based on participants' feedback, we now discuss how PaperToPlace could help participants complete the designated tasks faster and more easily.

Context Awareness Reduces the Effort of Context-Switching. PaperToPlace reduces the average time (ART: $F_{1,22} = 49.18$, $p < .001$, $\eta_p^2 = .82$, Fig. 14b), $d_h$ ($F_{1,22} = 57.29$, $p < .001$, $\eta_p^2 = .72$, Fig. 14d) and $\theta_h$ (ART: $F_{1,22} = 51.19$, $p < .001$, $\eta_p^2 = .82$, Fig. 14f) of each episode. While PaperToPlace on average leads to more frequent document readings ($F_{1,22} = 15.77$, $p < .001$, $\eta_p^2 = .42$, Fig. 14a), the accumulated time (ART: $F_{1,22} = 9.20$, $p = .020$, $\eta_p^2 = .42$, Fig. 14c), $d_h$ ($F_{1,22} = 5.30$, $p = .030$, $\eta_p^2 = .19$, Fig. 14e) and $\theta_h$ (ART: $F_{1,22} = 7.48$, $p = .019$, $\eta_p^2 = .40$, Fig. 14g) over all episodes are reduced.

First, participants pointed out the convenience of repeatedly referring back to the instructions with PaperToPlace. For example: "[with baseline], I need to check it back and forth every time while trying to grab food from the fridge [...] [PaperToPlace] gives me feeling like [the instruction] is just on my side. It's like always on my side, like right beside my head" (PC9) and "because [the step] would always be right there with just a little bit of information I need, I think it'd be very useful" (PC11). Most participants explicitly highlighted the benefits of finding optimal placements that do not occlude the key interaction areas. For example: "it is useful to bring the instruction step to me by just a pinch." (PC10), "I like the position of instruction step!" (PC4), and "I think it is useful! And especially the function of where you pinch again, it will move to another location, so it can ensure [the step] will never block your sight" (PC8). More examples of instruction step placements are shown in Fig. 20e-h.
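The two head-movement measures analyzed per episode (path distance of the head and angular change of its forward direction) could be computed from tracked headset samples roughly as follows. This is a hedged sketch: the function names, the per-frame sampling, and the choice to accumulate the change between consecutive samples are assumptions, since the text does not spell out the exact computation.

```python
import numpy as np


def head_path_length(positions):
    """Sum of distances between consecutive head positions in an episode."""
    p = np.asarray(positions, dtype=float)
    return float(np.linalg.norm(np.diff(p, axis=0), axis=1).sum())


def head_angular_change(forwards):
    """Sum of angles (degrees) between consecutive head forward directions."""
    f = np.asarray(forwards, dtype=float)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)  # normalize each direction
    # Row-wise dot products of consecutive directions, clipped for numerical safety.
    cos = np.clip(np.einsum("ij,ij->i", f[:-1], f[1:]), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).sum())


# Hypothetical episode with three tracked samples.
path = head_path_length([[0, 0, 0], [3, 4, 0], [3, 4, 2]])
angle = head_angular_change([[1, 0, 0], [0, 1, 0], [-1, 0, 0]])
```

Summing over consecutive samples captures back-and-forth head motion that a simple start-to-end difference would miss, which matters when a participant repeatedly glances between the document and the task.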

Fig. 17 shows the NASA TLX responses for each perceived workload dimension from participants PC1-PC12, while using the baseline and PaperToPlace interfaces to perform the designated tasks. Fig. 19

Figure 15: Study design for evaluating PaperToPlace. Each participant needs to conduct session 1 and session 2 in order. T2 and T3 were used for formal evaluation while T1 was used for training purposes.

Figure 16: Survey results of NASA TLX questionnaires. We use "Manual" and "ML" to indicate the interface conditions in which the manual and ML-supported approaches, respectively, are used for extracting the associated key objects from each instruction step.

Figure 19: Survey results of SUS questionnaires of consumption pipeline evaluations. We use "B/L" to refer to the baseline interface, and "P2P" to indicate PaperToPlace, which delivers spatialized and context-aware instruction steps. To increase readability, we cluster the survey results of positive statements (Q1, Q3, Q5, Q7, Q9) into subplot (a), where a higher level of agreement indicates a better user experience. The survey results of negative statements (Q2, Q4, Q6, Q8, Q10) are clustered into subplot (b), where a lower level of agreement indicates a better user experience.
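The split into positive (odd-numbered) and negative (even-numbered) statements mirrors how an overall SUS score is conventionally computed. The sketch below follows the standard SUS scoring rule on a 1-5 agreement scale; the function name `sus_score` is hypothetical, and the paper does not state that its scores were computed with this exact code.

```python
def sus_score(responses):
    """Standard SUS score (0-100) from ten responses on a 1-5 agreement scale.

    responses[0] is Q1, responses[1] is Q2, and so on.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = 0
    for index, value in enumerate(responses):
        if not 1 <= value <= 5:
            raise ValueError("each response must be between 1 and 5")
        if index % 2 == 0:
            total += value - 1  # positive statements (Q1, Q3, ...): higher is better
        else:
            total += 5 - value  # negative statements (Q2, Q4, ...): lower is better
    return total * 2.5


neutral = sus_score([3] * 10)
best = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])
```

Reversing the even-numbered items before summing is what makes a higher total indicate better perceived usability for both kinds of statements.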

Figure 20: First-person view through MR of examples of placing instruction steps next to the key objects with our consumption pipeline. Example key objects include fridge (a), countertop (b), sink (c), and microwave (d).

Table 2: Experimental cooking recipe for making basic microwave scrambled eggs (T1).
Figure 22: The codebook that resulted from our qualitative analysis of interview data for the authoring pipeline evaluations. "Count" refers to the number of quotes for each theme or code. It is possible that multiple codes are assigned to one quote.
Codebook entries (theme or code, count): Segmented Instruction Helps Finding the Relevant Information Easier, 48; Quality of video see-through (such as visual distortion etc.), 14.