Heads-Up Multitasker: Simulating Attention Switching On Optical Head-Mounted Displays

Optical Head-Mounted Displays (OHMDs) allow users to read digital content while walking. A better understanding of how users allocate attention between these two tasks is crucial for improving OHMD interfaces. This paper introduces a computational model for simulating users’ attention switches between reading and walking. We model users’ decision to deploy visual attention as a hierarchical reinforcement learning problem, wherein a supervisory controller optimizes attention allocation while considering both reading activity and walking safety. Our model simulates the control of eye movements and locomotion as an adaptation to the given task priority, design of digital content, and walking speed. The model replicates key multitasking behaviors during OHMD reading while walking, including attention switches, changes in reading and walking speeds, and reading resumptions.


INTRODUCTION
Improving usability and safety is crucial in designing better interactions for Optical Head-Mounted Displays (OHMDs). Understanding the dynamics of attention switching is essential to tackling these challenges effectively. OHMDs provide users with convenient access to information during their everyday tasks [82], particularly when they are on the move. Nonetheless, achieving efficient content comprehension on the go is not always easy. A core cognitive problem is illustrated in Figure 1. When reading on OHMDs, users must constantly alternate their attention between the digital content and physical surroundings [83]. This challenge is exacerbated by head perturbations during walking, which can disrupt reading performance [8]. To optimize both reading experience and walking safety, HCI research needs to better understand how users allocate attention during mobile multitasking situations. However, users' attention switching strategies can be affected by several factors, including the individual and their priorities [30], walking speed [25], the environment [43], and the design of the OHMD interface [57,83]. To gauge this complex interplay of factors, leveraging computational models emerges as a promising approach, with which human behaviors can be framed as an adaptation to given constraints [26,54].
Figure 1: We present a computational model of how users allocate attention when interacting with Optical Head-Mounted Displays (OHMDs) while walking. When using OHMDs, reading competes with the need to attend to the environment. Users must control walking to adapt to external situations, such as walking around a slippery area after spotting a warning sign. Our model makes accurate predictions of attention switching strategies in such scenarios.
Computational models could offer insights into the problem of attention switching. Prior work has modeled multitasking behavior computationally in other safety-critical areas, such as driving [35]. However, their findings do not readily extend to the context of interacting with OHMDs, which poses distinctive challenges. OHMDs seamlessly integrate digital content with the users' physical surroundings within their field of view. When users access information on OHMDs' transparent displays, they simultaneously perceive portions of the external environment, which can change dynamically. Consequently, this constant influx of surrounding environment updates creates competition for attention with the reading task on OHMDs [32,61,83]. As a result, users frequently switch their attention, potentially impairing reading performance and introducing safety risks.
In this paper, we present a predictive model to shed light on users' walking and reading behaviors with OHMDs. This model simulates, on a moment-by-moment basis, how users switch their attention as they walk and perceive information from an OHMD. Building upon the theory of computational rationality [26,54], we frame attention switching behavior as a sequential decision-making process constrained by cognitive, perceptual, and motor factors. Within this framework, we consider the user as a rational agent striving to apply the optimal strategy for balancing attention between OHMD content and the surrounding environment. In this multitasking scenario, the agent's primary goal is to maintain efficient reading and environmental awareness, and walk properly. Unlike some previous models of multitasking, which necessitated explicit specifications of how switching occurs [63,65], our approach enables us to predict such strategies, or policies. Specifically, we can predict how these strategies adapt to various factors relevant to OHMD interaction, including walking speed, task priorities, and interface design.
We found that users' attention switching can be effectively modeled as a hierarchical control problem with sparse rewards. In particular, we model user behavior for reading on OHMDs while walking across three levels: 1) At the supervisory level, a controller decides the task prioritization by allocating attention; 2) At the task level, individual task models control the agent's behavior within each specific task, utilizing working memory to track task status; 3) At the motor level, we address the question of 'how to look' and 'how to walk': this involves managing eyeball movements to acquire relevant information from a scene, and managing walking dynamics, including lateral movement and walking speed control.
Our model is evaluated through four studies. Study 1 examines attention switches as an adaptation to task priority and walking speed. Study 2 assesses visual perception and readability during walking, comparing reading speed ratios with human data. Study 3 investigates reading resumption after attention switches, analyzing behaviors across three OHMD layouts against human data. Finally, Study 4, based on Zhou et al.'s work [83], explores how attention adapts in a realistic task, assessing the agent's walking speed adjustments for optimal reading efficiency, with comparisons to human data in terms of attention allocation, walking speed, and reading. The results show that our model successfully captures the key trends in human performance and could be used to predict users' behaviors of reading while walking. The predictions could further guide the design of adaptive interfaces that optimize the information acquisition experience using OHMDs.
To sum up, we propose the first simulation model of OHMD interaction, called Heads-Up Multitasker, which can closely resemble user behaviors in terms of attention allocation, walking, and reading. It features an agent with pixel-based visual perception that works directly in MuJoCo [76], a physics engine renowned for its flexibility in modeling diverse interaction scenarios. Our technical contributions are detailed as follows:
• A hierarchical reinforcement learning (HRL) architecture with novel POMDP formulations, effectively capturing key cognitive aspects of multitasking on OHMDs, such as reading, walking, memory, oculomotor control, and visual perception. This flexible, modular design allows for easy adaptation to diverse scenarios by altering specific modules, illustrated in our studies on varied walking controls, agent-user modeling, and OHMD layout adjustments.
• A flexible simulation environment in MuJoCo that allows modeling OHMD scenarios with high fidelity, including pixel-based visual perception, attention deployment based on oculomotor control, and walking control. Unlike prior work [12,35,36], our model works directly in MuJoCo without needing hand-crafted state/action spaces for the visual scene.
• Empirical and simulation-based evaluations in the context of reading on OHMDs while walking, covering attention switches between OHMDs and the environment, walking speed control, reading, and resumption behaviors.
Section 2 reviews the relevant literature on OHMDs, attention switching, and computational modeling of interactive behavior. Section 3 introduces the hierarchical model overview, and Section 4 delves into its detailed implementation. Section 5 provides an overview of the four studies, which are presented in the following sections. Sections 10 and 11 discuss the potential applications of these findings and limitations. Finally, Section 12 provides the conclusion.

RELATED WORK
This work is positioned at the intersection of three research areas: 1) Studies of how users acquire information on OHMDs while on the go, 2) computational models of multitasking and attention switching, and 3) reinforcement learning (RL) based models of human behavior.

Information Acquisition on OHMDs while On-The-Go
While information can be accessed through various modalities while on the move [50,81], the visual modality has a critical role in information acquisition. Visual information holds distinct advantages that cannot be easily replicated by the auditory channel, primarily due to audio information's inherently sequential and fleeting nature [57,82,83]. Many papers have illustrated the benefits of acquiring information by reading on OHMDs while walking [23, 37, 44, 53, 59-61, 82, 83], including in comparison to the common practice of reading on mobile phones in mobile scenarios [66,67].
The transparent displays offer users the distinct advantage of accessing on-screen information without compromising their awareness of their surroundings. However, prior research has indicated that this advantage may come at the cost of increased cognitive load because users are required to simultaneously monitor their environment while engaging in tasks [49,67]; in fact, users often struggle to maintain sufficient attention when multitasking on the move [6,57]. Furthermore, there is an established trade-off between awareness of the surroundings and reading: improving one tends to come at the cost of the other [37,49]. Nevertheless, this trade-off could potentially be influenced by design factors. For instance, Zhou et al. discovered that enhancing default text spacing can lead to improved reading performance [83].
Moreover, the see-through nature of OHMDs may impact the ease of perceiving displayed content in certain environmental conditions, depending on factors such as lighting, background color, or texture [18,19].
Recent research has also explored diverse methods for presenting information to users while they navigate and engage in tasks such as reading [61], text editing [23], or learning from videos [57,58]. While much of the existing research relies on empirical studies to evaluate interface designs [31], simulation models could complement these efforts by providing a means to assess designs before user testing, thus enabling the optimization of designs and improvements in accessibility [48]. Our studies demonstrate that our model could predict reading behaviors across various OHMD designs without relying on empirical user data, offering insights into optimal text layouts for walking readers.

Computational Models of Multitasking and Attention Switching
In multitasking, models like ACT-R [2] and EPIC [46] have been pivotal in explaining, analyzing, and predicting performance and attention switches during multitasking. Key developments include Salvucci et al.'s integration of a general executive into the ACT-R framework for better subtask management [62], and their further enhancements for modeling task switching [65]. A significant addition to this field is the Threaded Cognition theory [64], which posits that cognitive processes do not always transpire sequentially, but rather operate in parallel threads or streams, each representing different cognitive tasks or processes that can occur concurrently. This theory offers predictions regarding how multitasking behavior may result in interference, or lack thereof, depending on the specific tasks involved.
Multitasking has also been framed as an optimization problem of determining how individuals engaged in multitasking can best allocate their limited resources to maximize overall task performance [51]. Cognitive Constraint Modeling (CCM) is an approach to understanding human cognitive processes by identifying and modeling the constraints in such optimization [27]. These constraints may include limitations in cognitive resources, such as attention or working memory capacity, biases in decision-making, or other factors influencing how information is processed and decisions are made. This approach is relevant to our model, as we build on the concept of boundedly optimal control [27,54].
However, the aforementioned approaches require significant effort when creating models of multitasking situations. Earlier models predefined user behaviors for specific tasks and environments, which limited adaptability [54]. Instead, our POMDP-based method frames user behaviors as sequential decisions, allowing emergent strategies that are optimally adaptive to environmental and task-related changes. Reinforcement learning (RL) is employed to derive these strategies, providing a dynamic, data-driven approach that removes the need for manually setting user behavior rules [71]. This results in a flexible modeling framework for various interaction and multitasking challenges [12,15,29,34,35,40,41].
Users' decision-making and behaviors are complex cognitive processes when multitasking, adapting to various constraints, including environmental factors, task demands, and cognitive and physiological limitations [34,35,54]. Our model builds on the foundational principles of hierarchical and boundedly optimal control [20], as illustrated in Jokinen et al.'s work [35], but extends to a novel task context. Unlike Jokinen et al.'s model, which is focused on driving, our model is specifically designed for the unique challenges of reading on OHMDs while walking.

Modeling Interactive Behavior with Reinforcement Learning
Simulation-based science has emerged as a valuable approach in HCI research [48], with a growing interest in adopting Reinforcement Learning (RL) to simulate human interaction.This trend is informed by computational rationality theory, which interprets human behavior through the lens of bounded optimality and advocates for RL methodologies [54].Our work is positioned in this category of behavioral models.
Prior work in simulating human visual attention using reinforcement learning often involved manually crafted vision sensors and state-action spaces [12,36,55], limiting application scope. For example, models like Gaze-based Selection [12] and Adaptive Feature Guidance [36] used encoded visual features within a 2D system. We extend previous approaches by adopting a 'pixel-based' agent paradigm from recent RL research, allowing direct processing of camera views that simulate user vision [14,74]. This method autonomously discerns visual cues from raw sensory inputs and offers a more versatile and adaptive framework for modeling dynamic visual attention scenarios.
While the domain of multitasking research is expansive, models combining walking and information acquisition from OHMDs have, to our knowledge, yet to be developed. This applies both to time-tested models such as ACT-R [3] and to the burgeoning reinforcement learning paradigms. Yet, it is noteworthy that hierarchical RL has found application in replicating the intricacies of human decision-making processes [9,21,34,35].

Summary
Earlier research employing hand-crafted policies faces difficulties in addressing the unique complexities of reading on OHMDs while walking, a task characterized by continuous scene changes and dynamic walking movements. To alleviate such complexities in oculomotor and locomotion control, we take inspiration from recent RL research building 'pixel-based' agents. These agents directly learn policies from the visual sensory inputs rendered in physics simulators [14,74], eliminating the need for modelers to manually define action and state spaces for the external environment.

MODEL OVERVIEW: DEFINING A HIERARCHICAL CONTROL PROBLEM
As previously explained, we use a hierarchical structure to decompose the user's multitasking problem. All submodels are defined as optimal sequential decision-making processes with bounds [22,26,54]. The bounds include internal factors, such as constrained visual attention, memory capacity, and task priority, and external conditions, including digital content designs on OHMDs and walking speed. Some of these bounds, particularly the limited visual attention and memory, induce partial observability and uncertainty in the decision-making process. For instance, while reading, the agent cannot fully perceive changes in the environment, leading to uncertainty regarding walking safety. Similarly, when scanning the environment, the agent cannot read and may forget the last read word due to memory loss. These assumptions lead us to formulate the problems as Partially Observable Markov Decision Processes (POMDPs). A POMDP is represented by a tuple $\langle S, A, O, T, R \rangle$ defined by sets of states $S$, actions $A$, and observations $O$, environmental transition dynamics $T$, and a reward function $R$. At timestep $t$, an agent is in a state $s_t \in S$ and receives an imperfect observation $o_t \in O$ of the state, takes an action $a_t \in A$ based on the observation, transitions into a new state $s_{t+1} \in S$ via the transition dynamics $T$, and receives a reward $r_t = R(s_t, a_t, s_{t+1})$.
The primary reason for modeling users' visual attention behaviors as POMDPs lies in their character as sequential decision-making processes, as highlighted in various studies [12,34,35,73]. Although parallel processing is possible in some scenarios, sequential task processing often emerges as a more efficient strategy for multitasking [16]. Consequently, employing POMDP as the core mechanism in our model is more appropriate. Intuitively, sequential decision-making implies that each choice a user makes, whether concentrating on reading or environmental cues, carries implications for future decisions [21]. For instance, overly focusing on reading might result in missing critical environmental information, whereas excessive attention to surroundings could impede effective reading. Additionally, the POMDP, focusing on partial observability, assumes that agents observe their world indirectly through sensory apparatus [54]. This concept fits well with scenarios where users must strategically choose their attention, i.e., where and when to look, aligning with real-world constraints like limited perception and cognitive capacity.
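To make the POMDP interaction loop concrete, the following is a minimal Python sketch of a toy attention problem. It is not the paper's model: the hidden hazard state, the `hazard_prob` parameter, the action names, and all reward values are illustrative assumptions. It only shows the mechanics described above, namely a hidden state, an observation that depends on where attention is deployed, and a reward received at each timestep.

```python
import random


class ToyAttentionPOMDP:
    """Toy POMDP (hypothetical): the hidden state is whether a hazard lies ahead."""

    def __init__(self, hazard_prob=0.2, seed=0):
        self.rng = random.Random(seed)
        self.hazard_prob = hazard_prob  # assumed chance that a hazard appears per step
        self.hazard = False             # hidden state s_t

    def step(self, action):
        """Apply an action ('read' or 'scan'); return (observation, reward)."""
        if action == "read":
            # Reading pays off unless a hazard is being ignored (assumed values).
            reward = -5.0 if self.hazard else 1.0
            observation = None          # the environment is not observed while reading
        elif action == "scan":
            reward = 0.0
            observation = self.hazard   # scanning reveals the hidden state
            self.hazard = False         # and lets the agent deal with the hazard
        else:
            raise ValueError(f"unknown action: {action}")
        # Transition dynamics T: a new hazard may appear if none is present.
        if not self.hazard and self.rng.random() < self.hazard_prob:
            self.hazard = True
        return observation, reward


def run_episode(env, policy, steps=20):
    """Roll out a policy that maps (timestep, last observation) to an action."""
    total, last_obs = 0.0, None
    for t in range(steps):
        action = policy(t, last_obs)
        last_obs, reward = env.step(action)
        total += reward
    return total


if __name__ == "__main__":
    # A fixed hand-written policy that scans every third step, for illustration only.
    periodic = lambda t, obs: "scan" if t % 3 == 2 else "read"
    print(run_episode(ToyAttentionPOMDP(seed=1), periodic))
```

A learned policy would replace the hand-written `periodic` rule; the point of the sketch is that reading too long risks an unobserved hazard, while scanning too often forgoes reading reward.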
Figure 2 presents an overview of our suggested framework for modeling the multitasking problem of reading while walking. The hierarchical model is structured such that higher-level models set targets for the lower-level models (emphasized with red arrows; these correspond to the actions $a^{SC}$, $a^{R}$, $a^{S}$), in addition to updating the internal state. Notably, the supervisory level and task level models are called sequentially, with the frequency specified by a long-range timestep. Based on observations of the internal state (full descriptions of state information can be found in the Supplementary Material), the Supervisory Controller chooses a task via a binary decision $a^{SC}$. Correspondingly, the agent will either read content on the OHMD or scan the environment to maintain awareness of its surroundings. The Read task outputs a target word index to the Oculomotor Control, while the Scan task outputs an objective, corresponding to either the lane to walk on or the walking speed, to the Locomotion Control. The motor level models additionally get observations $o^{OC}$, $o^{LC}$ of the external environment, and interact with it by actuating physical motors (via actions $a^{OC}$, $a^{LC}$). These motor level models are called with a higher frequency defined by a short-range timestep. In Section 4 the superscripts {SC, R, S, OC, LC} are generally omitted to avoid unnecessary clutter, as the context indicates which submodel is discussed.
Learning long-horizon tasks in RL can be challenging, especially if the agent receives rewards only once a task is accomplished (i.e., the reward function is sparse). This introduces an issue known as the credit assignment problem, meaning it is difficult for the learning algorithm to attribute rewards earned from successful outcomes to specific actions [47]. In hierarchical RL, long-horizon tasks can be broken down into simpler subtasks through temporal abstraction, where tasks are performed at different time scales [13,38]. This abstraction allows for the seamless integration of intricate tasks, such as the precise oculomotor control needed to fixate on individual words, into broader, more encompassing activities, such as comprehending a sentence while reading. Our proposed approach achieves this by allowing the lower-level motor control models to operate on a faster time scale than the higher-level models.
Figure 2: The model consists of a three-level hierarchical structure with a Supervisory Controller (SC), task level models Read (R) and Scan (S), and motor level models for Oculomotor Control (OC) and Locomotion Control (LC). These submodels receive observations from the underlying internal and external states, and interact with them through actions while receiving rewards (black arrows). The red arrows highlight how higher-level models set targets for lower-level ones: SC determines the focus task (via $a^{SC}$), while R and S establish specific targets $a^{R}$ and $a^{S}$ for directing motor level models. For higher-level models, these targets act as guiding actions. Conversely, for lower-level models, they are treated as observations that influence and define agents' tasks.
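The two time scales can be illustrated with a small Python sketch. It assumes a one-dimensional gaze position and hand-written stand-ins for the learned policies; the names `motor_step` and `run_hierarchy`, the target values, and the number of short-range steps per long-range step are hypothetical and chosen only to show how a higher level sets a target that a lower level pursues at a faster rate.

```python
def motor_step(gaze, target, max_move=1.0):
    """One short-range step: move the gaze toward the target by at most max_move."""
    delta = target - gaze
    delta = max(-max_move, min(max_move, delta))
    return gaze + delta


def run_hierarchy(supervisor, targets, n_long_steps=4, n_short_per_long=5):
    """Run a two-timescale control loop and return (task, gaze) per long-range step."""
    gaze = 0.0
    trace = []
    for k in range(n_long_steps):
        task = supervisor(k)                 # supervisory level: choose the task
        target = targets[task]               # task level: set a target for the motor level
        for _ in range(n_short_per_long):    # motor level: several fast steps per decision
            gaze = motor_step(gaze, target)
        trace.append((task, gaze))
    return trace


if __name__ == "__main__":
    # Alternate between reading and scanning; positions are arbitrary illustrative units.
    alternate = lambda k: "read" if k % 2 == 0 else "scan"
    print(run_hierarchy(alternate, {"read": 3.0, "scan": -2.0}))
```

In this arrangement a reward granted at the long-range level only needs to be attributed to a handful of high-level choices, rather than to every individual motor step.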

MODEL DETAILS AND IMPLEMENTATION
To model the dynamics of reading while walking, we created a user model and a simulation environment in MuJoCo [76]. MuJoCo is an efficient physics simulation engine often used in RL research. It provides a flexible model specification format, making it possible to create simulations for a wide range of modeling problems, ranging from robotics to biomechanics.
Our simulation presents a multitasking challenge where the simulated user (agent) reads text on an OHMD while walking on a predefned path, and navigates the environment by either switching lanes (Study 1) or controlling the walking speed (Study 4).The agent is rewarded for reading on the OHMD, walking, and attending to the environmental information properly.When training the agent, we want to learn an optimal policy for switching attention between the reading task and scanning the environment, such that the expected cumulative sum of gained rewards is maximized.The models were trained with PPO2 [70] individually, although they are evaluated together.
Figure 3 shows the user model. It consists of a torso and an eye, with a camera embedded in the eye to mimic the field of view ahead of the user. The camera has a field of view of 90 degrees, and the user may perceive its environment through this camera as an 80x80 pixel RGB-D image. We implemented ocular movement control by using two motors to rotate the eye horizontally and vertically. The user sees a grid of 3x4 cells from its point of view, simulating textual content presented on the OHMD. Instead of giving the user actual words to read, we simplify the task by replacing words with rectangular cells that visually signify words [10]. This bypasses the need to learn to recognize words when training the agent.
Scope: After exploring different options, we simplified the multitasking problem by excluding detailed locomotion control and decided to focus on the (more critical) oculomotor control problem. The human walking process involves complex neuromechanical controls of speeds, navigation, and balance maintenance [72]. We collapsed these into one-dimensional translational movement and walking speed control, deeming this sufficient for our case. We represent the reading task as a sequence of eye fixations through a series of rectangles, where the rectangles represent words. This approach is based on the understanding that words can be viewed as separable units with clearly defined boundaries from one another [10]. Such simplification allows us to focus our model on general reading patterns, such as reading resumption, without getting entangled in the nuanced perception of individual words. While it is common in natural reading behaviors to scan through and occasionally skip words, we found that a linear traversal suffices to cover fundamental cases, effectively representing the base reading scenario with considerably less complexity. Furthermore, our model simplifies visual perception by limiting visual stimuli to those within the foveal and near peripheral view. This is because the primary aim of our model is to simulate how users switch between reading on OHMDs and interpreting environmental signs. These tasks require a fixed gaze to process visual information, in contrast to the tracking of dynamically moving objects, which is the main function of peripheral vision. To support this focused approach, we have restricted the eye camera's field of view to 90 degrees in both the horizontal and vertical dimensions. This not only covers a substantial portion of the peripheral area, ensuring the inclusion of essential visual stimuli, but also allows for larger visual angles for objects. This aspect is crucial, given that our model downsamples visual observations to an 80x80 pixel resolution. Such truncation and focus enhance the agent's ability to perceive and process relevant visual inputs effectively, balancing the model's complexity with its trainability.
In addition, while oculomotor control has three degrees of freedom [1], we have designed our simulation to focus on the movements of a single eyeball, limiting it to horizontal and vertical movements. This grants the agent enough flexibility to perceive all items in the environment without unnecessarily complicating the model. Finally, although humans can switch attention both voluntarily and involuntarily [36,77], we only consider voluntary attention switches. Correspondingly, all agents' attention switches in our simulations are task-driven.

Supervisory Controller (SC)
When using OHMDs, users must frequently switch their attention between the device and their environment. This is driven by various factors, such as safety concerns (for instance, the need to check for a red light to avoid jaywalking [68]) and voluntary interests (for instance, curiosity about a nearby sign [69]). A critical challenge for the agent is to learn the optimal timing for these attention switches, such that disruptions are minimized and the efficiency of information gathering is maximized [39].
The Supervisory Controller determines how attention should be allocated at each long-range time step. We employ a POMDP to model attention allocation as a sequential decision-making process, under uncertainty about when the agent should switch attention to the environment to walk properly. The attention is allocated based on an observation of the internal states {Current task, Reading progress, Walking speed, Remaining timesteps}. The binary decision indicates whether the agent chooses to read on the OHMD or scan the environment for updates. Upon each decision, the agent receives a reward $r = w_{\text{read}} \times r_{\text{read}} + w_{\text{walk}} \times r_{\text{walk}} - w_{\text{switch}} \times c_{\text{switch}} - c_{\text{time}}$. Our reward function considers both the utilities in reading and walking, as well as the cost caused by attention switches. It is a combination of positive rewards for reading $r_{\text{read}}$ and walking $r_{\text{walk}}$, as well as a cost component for switching attention $c_{\text{switch}}$ and a time penalty $c_{\text{time}}$:
• $r_{\text{read}}$ is granted if the agent makes reading progress in the current step.
• $r_{\text{walk}}$ is related to walking; it can be granted according to walking speed, or according to whether the agent walks on the correct lane.
• $c_{\text{switch}} = c_{\text{resume}} + c_{\text{error}}$: Switching attention to the environment interrupts reading. To resume reading, users need to scan the OHMD content and try to refocus on the previously read position. This process costs time, and users may make errors. $c_{\text{resume}}$ refers to the normalized time cost in this process, and $c_{\text{error}}$ refers to the normalized error cost.
• $c_{\text{time}}$ is a step-wise time penalty.
To control the relative importance of each reward component, we introduce weights $w_{\text{read}}$, $w_{\text{walk}}$, $w_{\text{switch}}$ for reading, walking, and attention switches, respectively. A higher weight for reading $w_{\text{read}}$ results in the agent focusing more on the reading task and less on the walking task. Conversely, a higher $w_{\text{walk}}$ encourages the agent to prioritize walking, which may result in the agent attending to the environment more often, and hence reading slower. A higher value for $w_{\text{switch}}$ means the agent is less likely to switch attention to the environment. This reward function formulation provides flexibility for modelers to design agents with different task preferences by modifying the reward component weights. We further evaluate the model's attention allocation behavior in Study 1, and compare the behavior against human data in Study 4. The detailed reward values are specified in the Supplementary Material.
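As a rough illustration of how such a weighted reward trades off reading, walking, and switching, the Python sketch below computes one supervisory-level reward. The function name, argument names, default weights, and the time penalty value are assumptions made for this example; the values used in the paper are given in its Supplementary Material and are not reproduced here.

```python
def supervisory_reward(read_reward, walk_reward, resume_time_cost, resume_error_cost,
                       w_read=1.0, w_walk=1.0, w_switch=1.0, time_penalty=0.1):
    """Weighted reward for one supervisory decision (all default values are illustrative)."""
    # Switching cost combines the normalized time cost and error cost of resuming reading.
    switch_cost = resume_time_cost + resume_error_cost
    return (w_read * read_reward
            + w_walk * walk_reward
            - w_switch * switch_cost
            - time_penalty)


if __name__ == "__main__":
    # Same step outcome under a reading-focused and a walking-focused weighting.
    outcome = dict(read_reward=1.0, walk_reward=0.5,
                   resume_time_cost=0.2, resume_error_cost=0.1)
    print(supervisory_reward(**outcome, w_read=2.0, w_walk=0.5))
    print(supervisory_reward(**outcome, w_read=0.5, w_walk=2.0))
```

Raising `w_switch` in this sketch makes every switch more expensive, which is the mechanism by which an agent trained on such a reward would become less inclined to leave the reading task.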

Read (R)
Our model exhibits two different reading behaviors depending on its state. If the agent is reading, it will fixate on the next word. When resuming a paused reading task, the agent scans words on the OHMD to locate the last read word, influenced by a decaying memory mechanism, similar to humans [83]. This concept aligns with Li et al.'s exploration of memory decay in smartphone menu selection [40]. However, unlike their uniform spacing layout, our scenario focused on varied text spacings on OHMDs [83]. We adopt Li et al.'s forgetting function concept and modify it to accommodate the complexities introduced by varied text layouts.
We employ POMDP to model the reading resumption behavior as a sequential decision-making process, under uncertainty about which word is the exact word to resume reading from because of memory loss.Initially, we explored basic probabilistic models, such as spatial Gaussian distributions, to predict the gaze relocation position.Although these models could predict relocation errors, they fail to simulate the sequential process of decision-making ('Is this the word I'm looking for?') and the associated time costs during this process.In contrast, a POMDP-based model addresses these aspects comprehensively.It captures not only the spatial re-entry position but also the subsequent gaze trajectory.
In the POMDP formalism, the visual search is conducted based on an observation of the internal states {Fixation word index, Belief of the last read word position, Remaining timesteps}. In particular, Belief of the last read word position conveys information about how likely the currently fixated word is to be the word the agent last read before switching attention to scan the environment. To model the belief updating process during visual search, we employ Bayesian inference (see [34,35]). In our deterministic transition model, the belief update is $b(s_{t+1}) \propto p(o_{t+1} \mid s_t, a_t)\, b(s_t)$, where $b(s_t)$ and $b(s_{t+1})$ are the current and next state beliefs, respectively. The likelihood function $p(o_{t+1} \mid s_t, a_t)$ calculates the probability of an observation given the state and last action [7]. Memory decay is modeled by $b(s_t) = \rho \times b(s_t) + (1 - \rho) \times m(s_t)$, with $m(s_t)$ as the position memory for the last read word, and $\rho$ (between 0.5 and 1) representing the degree of memory decay, inferred from human data. This approach allows for adaptable Bayesian belief updates to accommodate various levels of memory loss.
The Bayesian belief update is defined as follows. The agent's uncertainty about the last read word is modeled using a Gaussian distribution [5,40]: $p \sim \mathcal{N}(\mu, \sigma)$, where $p$ represents the probability distribution and $\mu$ refers to the true last read word's index. The uncertainty $\sigma$ rises over time through a forgetting function in which the constant 4.5 is obtained from Li et al.'s work [40]. Given that there are no revisits to the target item (i.e., the last read word), the time-dependent term of this function is simplified to $t^{-0.5}$ [36,40], where $t$ indicates the time elapsed since the agent last read the word before attention switches. When scanning words, $t$ increases, which leads to memory loss and increasing uncertainty, represented by a growing $\sigma$. When resuming reading, the agent scans new words to examine whether they are the last read word. The belief is updated using the likelihood function, which we formalize as a Gaussian distribution [42] giving the probability of fixating on the current word given the position of the last read word. Its standard deviation is inferred from humans' foveal vision size [75] and scaled by a parameter inferred from human data. It suggests users' re-entry position based on their vague memory of the last read word's location, following the practice that users are more likely to re-enter from words that are closer to the last read one.
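A discrete Bayesian belief update of this general kind can be sketched in a few lines of Python. The sketch makes several simplifying assumptions that are not taken from the paper: words are indexed along one dimension, both the prior and the likelihood are Gaussian-shaped weights normalized over the word indices, and the mixing parameter `rho` and all numeric values are placeholders. It is meant to show the mechanics of a prior that flattens with growing uncertainty, a multiplicative update, and a decay mix, not the exact formulation used in the model.

```python
import math


def gaussian_weights(center, sigma, n_words):
    """Gaussian-shaped weights over word indices 0..n_words-1, normalized to sum to 1."""
    weights = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n_words)]
    total = sum(weights)
    return [w / total for w in weights]


def bayes_update(belief, likelihood):
    """Posterior proportional to likelihood times prior, renormalized."""
    posterior = [b * l for b, l in zip(belief, likelihood)]
    total = sum(posterior)
    return [p / total for p in posterior]


def decay_mix(belief, memory, rho):
    """Blend the current belief with a position memory; rho is an assumed mixing weight."""
    return [rho * b + (1.0 - rho) * m for b, m in zip(belief, memory)]


if __name__ == "__main__":
    n_words = 12                                   # e.g. a 3x4 grid flattened to 12 words
    sharp = gaussian_weights(5, 2.0, n_words)      # memory shortly after the switch
    vague = gaussian_weights(5, 4.0, n_words)      # memory after a longer interruption
    print(max(sharp), max(vague))                  # the vaguer prior has a lower peak
    likelihood = gaussian_weights(6, 1.0, n_words) # evidence from fixating near word 6
    posterior = bayes_update(sharp, likelihood)
    print(posterior.index(max(posterior)))
```

With a wider prior, the same fixation evidence moves the posterior further from the true index, which is one way a longer interruption can translate into resumption errors.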
Figure 4: This figure shows two cases. Case 1 is a successful reading resumption case where the agent correctly scans and selects the last read word, 'that'. The belief keeps increasing as the agent approaches the right word. Case 2 is unsuccessful. Here, the agent selects the wrong word because of memory loss. The value of the word 'to' decreases over time when no scan actions are made.
While real human users may apply complex visual sampling behaviors in a visual scan task, our model abstracts the visual scan into five actions: move the fixation one word to the 1) left, 2) right, 3) above, or 4) below, or 5) terminate the search and select the currently fixated word as the word previously read. We apply a reward function similar to that of Li et al. [40]. Upon each action, the agent receives a reward $r = -c + b$, where $c = 0.1$ is a time penalty that encourages the agent to scan quickly for the last read word, and $b$ is a bonus term given to the agent when it terminates the search. If the agent finds the correct last-read word, $b = 10$; otherwise, $b = -10$. These rewards give the agent enough incentive to select the last read word quickly.
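As a minimal sketch of this reward structure, the following computes the per-action reward and the undiscounted return of one search episode. The function names are hypothetical, and the sketch assumes that the bonus is zero for non-terminating scan actions, which the text implies but does not state explicitly.

```python
TIME_PENALTY = 0.1    # per-action cost stated in the text
CORRECT_BONUS = 10.0  # bonus for selecting the last read word
WRONG_BONUS = -10.0   # bonus (a penalty) for selecting any other word


def resume_step_reward(terminated, selected_index=None, target_index=None):
    """Reward for one action of the resume-reading search."""
    bonus = 0.0  # assumption: scan actions carry no bonus
    if terminated:
        bonus = CORRECT_BONUS if selected_index == target_index else WRONG_BONUS
    return -TIME_PENALTY + bonus


def episode_return(n_scans, selected_index, target_index):
    """Undiscounted return: n_scans scan actions, then one terminating action."""
    scans = sum(resume_step_reward(False) for _ in range(n_scans))
    return scans + resume_step_reward(True, selected_index, target_index)
```

Under these assumptions, four scans followed by a correct selection return about 9.5, while the same search ending on a wrong word returns about -10.5, so the time penalty matters mainly for choosing among strategies that are likely to succeed.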
Figure 4 demonstrates how the agent scans words on OHMDs to resume reading.It shows two cases: a successful one and an unsuccessful one, where resumption fails due to memory loss.We further evaluate the model's reading resumption performance with human data in Study 3 and Study 4.

Scan (S)
If the Supervisory Controller chooses to scan the environment, this model is called and remains active until it terminates the scan. The scanning behavior is based on an observation $o$ = {Environment event index, Fixation index, Environment information, Remaining timesteps} of the internal states. While scanning the environment, the agent has three actions: 1) move fixation to non-urgent events in the environment, 2) move fixation to the urgent event in the environment, and 3) terminate the scan, allowing the Supervisory Controller to return to the reading task. Upon each action, the agent receives a reward $r = -c + b$, where $c = 0.1$ is a step-wise time penalty that encourages the agent to scan environmental information quickly. The bonus term $b = 10$ is awarded to the agent for fixating on, i.e., scanning, the urgent event in the environment, which may be a notification to change lanes (Study 1) or a sign that must be read before continuing walking (Study 4).

Oculomotor Control (OC)
While walking in its environment, the agent experiences dynamic and noisy visual scenes when viewing reading materials on the OHMD, similar to humans [8]. This, combined with inherent human ocular noises like saccadic noise [12], poses challenges to the agent's oculomotor control. The agent must learn to accurately rotate its eyeball to counter these noises and maintain fixation on the words.
The eyeball's ability to rotate along horizontal and vertical axes enables the agent to observe words and signs in the 3D MuJoCo environment efficiently. The eye movement is based on an observation of the external simulated environment and the agent's "physical" manifestation, $o$ = {Vision perception, Proprioception, Remaining timesteps}. Additionally, this model receives the Read model's output, Target word index $w$, which specifies a fixation target. Vision perception refers to the simulated user's first-person viewpoint captured by the eye camera. This pixel image helps the agent perceive both words and signs, and infer the target's position. Proprioception grants the user awareness of its eyeball rotation angles. The action determines the target angles for the eye's horizontal and vertical rotations, actuated by two position motors. These rotations, mirroring human capabilities, span a continuous space with angles up to 90 degrees.
The transition function is $T(s_t, a_t, s_{t+1}) = P(s_{t+1} \mid s_t, a_t)$; it describes the probability of achieving a new eye rotation target $s_{t+1}$ from a prior angle $s_t$ under the oculomotor action $a_t$. This simulation includes human-like eye movements with stochastic elements like saccadic noise [12]. For a current target $s_t$ and action $a_t$, the new target $s_{t+1}$ follows a Gaussian distribution whose mean is the expected result of $a_t$ and whose standard deviation is 0.08 times a term representing oculomotor variance [12]. The agent's reward includes a distance-based shaping term, $0.1 \times (e^{-10 \times d} - 1)$, with $d$ as the angular distance to the target. This term is zero on target and becomes more negative as the distance grows, encouraging rapid fixation. A bonus $b = 10$ is awarded for successful fixation over a fixed duration.
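The noisy transition and the shaping term can be illustrated with a short sketch. It assumes the exponential form $0.1 \times (e^{-10 d} - 1)$ for the shaping term; the function names and the `oculomotor_variance` argument are hypothetical stand-ins for quantities the text names only in words.

```python
import math
import random

NOISE_SCALE = 0.08  # multiplier on the oculomotor variance term, from the text


def noisy_rotation(intended_angle, oculomotor_variance, rng):
    """Sample the achieved rotation angle around the intended one."""
    return rng.gauss(intended_angle, NOISE_SCALE * oculomotor_variance)


def shaping_term(angular_distance):
    """Distance-based shaping: 0 on target, approaching -0.1 far from it."""
    return 0.1 * (math.exp(-10.0 * angular_distance) - 1.0)


# Usage: one noisy saccade toward 0.30 rad and the shaping it would earn.
rng = random.Random(0)
achieved = noisy_rotation(0.30, oculomotor_variance=0.30, rng=rng)
shaping = shaping_term(abs(achieved - 0.30))
```

Because the shaping term is bounded between -0.1 and 0, it only nudges the policy toward the target, while the fixation bonus of 10 dominates the return.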
Prior work demonstrated that walking-induced rotational head perturbations degrade users' visual perception when reading from head-mounted displays [8]. To simulate this effect more realistically for OHMDs, we incorporate rotational perturbations on the displayed words, following Grossman et al.'s suggestion [24]. These perturbations, modeled on two axes (horizontal yaw and vertical pitch), are represented as integrated sinusoidal waves, causing periodic up/down and left/right word movement. The amplitudes of the sinusoidal perturbations for pitch and yaw are scaled by a weight that adjusts the intensity of the simulated perturbations and is inferred from human data later. Additionally, real-world walking does not produce perfect sinusoidal waves, so we introduce stochasticity through a noise component, modeled as a zero-mean Gaussian distribution [33]. Its standard deviation is inferred from human data. Other sinusoidal parameters are empirically based on Grossman et al. [24] (refer to Supplementary material). The model's reading degradation due to walking, compared to human data, is elaborated in Study 2 and Study 4.
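A minimal sketch of a noisy sinusoidal perturbation on one axis is given below. It is an assumption-laden illustration: the amplitude, frequency, and noise values are placeholders rather than the parameters taken from Grossman et al. [24], the function name is hypothetical, and the sketch omits the integration step mentioned in the text.

```python
import math
import random


def perturbation_angle(t, amplitude, frequency_hz, noise_std, rng):
    """Angular offset of the displayed text at time t (seconds) on one axis."""
    clean = amplitude * math.sin(2.0 * math.pi * frequency_hz * t)
    return clean + rng.gauss(0.0, noise_std)  # zero-mean Gaussian noise


# Usage: sample a pitch trace at 100 Hz for one second (placeholder values).
rng = random.Random(0)
pitch_trace = [perturbation_angle(i / 100.0, amplitude=0.5, frequency_hz=2.0,
                                  noise_std=0.01, rng=rng)
               for i in range(100)]
```

A second call with its own amplitude and frequency would produce the yaw axis, and scaling `amplitude` by a common weight mimics the intensity parameter that is later inferred from human data.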

Locomotion Control (LC)
This model determines the agent's locomotion behaviors, including lateral lane-changing and speed control. Lane-changing is guided by a Locomotion instruction from the Scan model, signaling the correct lane. The action is binary: continue in the current lane or switch. Upon each action, the agent receives a reward that combines a cost term and a bonus term. The cost penalizes walking on the incorrect lane: it contributes −0.1 to the reward when the agent is on the incorrect lane and 0 when the agent is on the correct lane. The bonus term, set at 10, is awarded to the agent for successfully switching to or maintaining the correct lane. Walking speed control also follows a Locomotion instruction from the Scan model, dictating the desired speed. Here, the action varies continuously between 1.0 m/s and 1.5 m/s, covering normal human walking speeds [4]. The reward is a step-wise incentive, set at 0.1, for maintaining the expected speed.

OVERVIEW OF STUDIES
In the context of reading on OHMDs while walking, our model simulates a broad spectrum of complex cognitive processes, spanning from high-level attention allocation and middle-level task completion to low-level motor execution. The key research question in our studies was: Can our model reasonably capture human behavior characteristics compared to human data? We first conducted

STUDY 1: ATTENTION SHIFTS UNDER DIFFERENT WALKING SPEEDS AND AGENTS WITH DIFFERENT TASK PRIORITY
The objective of Study 1 is to showcase how the computationally rational agent adapts its policies and behaviors in response to a range of factors: cognitive constraints (e.g., limited attention resources), environmental conditions (e.g., walking speed), and task priorities (balancing between reading and walking). In this study, the simulated agents' task is to read on the OHMD and change lanes when they observe a sign. While numerous factors potentially affect users' attention switching behaviors, including the external environment's complexity, risk level, and environmental settings [78], in this study we select walking speed and task priorities as two factors that we believe are both common and representative.

Method
In the simulation, we categorize the walking speed into three levels: fast, moderate, and slow, structured in a 3 : 2 : 1 ratio. Moreover, we trained three agents with different task priorities by configuring different reward weights in their reward functions (refer to Section 4.1 and Supplementary Material):
• Shakespeare: The agent that values the reading task the most among the three agents.
• Olaf: The most cautious agent, which values the walking task the most and prioritizes walking correctly on the assigned lane.
• Norman: The average agent, which has the most balanced preference for reading and walking tasks.
The metrics included in the study are:
1) Number of attention switches.
2) Reading speed (words per simulation step).
3) Walking error rate (%): the percentage of time the agent walks on the incorrect lane.
4) Percentage of reading interruption positions (start/end vs. middle of text lines): this metric denotes where agents interrupt their reading when switching attention.
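The first three metrics can be computed from a per-step simulation log, as the following sketch shows. The log format (one attention label, one lane flag per simulation step, and a word count) and the function names are assumptions for illustration; the sketch also counts every change of attention target as a switch, which is one possible reading of the metric.

```python
def count_attention_switches(attention):
    """Number of steps at which the attention target changes."""
    return sum(1 for prev, cur in zip(attention, attention[1:]) if prev != cur)


def walking_error_rate(on_correct_lane):
    """Percentage of steps spent on the incorrect lane."""
    wrong = sum(1 for ok in on_correct_lane if not ok)
    return 100.0 * wrong / len(on_correct_lane)


def reading_speed(words_read, n_steps):
    """Words read per simulation step."""
    return words_read / n_steps


# Usage with a hypothetical 8-step episode.
attention = ["ohmd", "ohmd", "env", "ohmd", "ohmd", "env", "env", "ohmd"]
on_lane = [True, True, True, True, True, True, False, False]
switches = count_attention_switches(attention)
error_rate = walking_error_rate(on_lane)
speed = reading_speed(words_read=12, n_steps=len(attention))
```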

Results
Figure 5 and Figure 6 show the simulated multitasking behaviors of the three agents for the different walking speed levels. It is important to note that the simulated results can differ significantly across walking speeds in certain metrics. These variations do not necessarily reflect realistic human behaviors. However, the purpose of this study is only to showcase the model's flexibility in simulating multitasking behaviors when adapting to the given rewards (task priority) and bounds (walking speed).
Our study shows how attention switching behaviors adapt to the personal task priority. As shown in Figure 5 and Figure 6, when agents' task priority shifts from reading to walking, they perform more attention switches to the environment, resulting in slower reading speed and fewer walking errors. Moreover, when walking is more important, the agent interrupts its reading less often in the middle of a sentence, and instead switches its attention to the environment at the start/end of a text line. As for the walking speed factor, faster walking speed leads to more walking errors, more attention switches, and slower reading speed.
These results align with our intuitive understanding: individuals prioritizing walking safety over reading will naturally pay more attention to their surroundings to maintain environmental awareness. Therefore, the walking performance may be improved. On the other hand, a faster walking speed results in a rapidly changing environment, increasing uncertainty and chances of missing information, leading to more frequent walking errors. Individuals need to compensate for safety by allocating more attention to their surroundings. However, frequent attention switches interrupt the reading, resulting in slower reading speed. Furthermore, agents who value reading take care in choosing where to pause their reading, making it easier to resume later. These results validate our model's capability to simulate how agents adapt their attention switching behaviors to the rewards (task priority) and bounds (walking speed).

[Figure caption] The plots show how agents adapt their attention switching behavior to rewards (task priorities) and bounds (walking speed). We trained three agents: Shakespeare, Norman, and Olaf, using varying reward weights for the reading and walking tasks. Interestingly, less focus on reading narrowed the difference in reading interruption positions across agents (a). For specific reading and walking behaviors, increasing emphasis on walking enhanced environmental awareness at the cost of reading performance, and vice versa (b, c, d). Furthermore, a reduction in walking speed requires fewer attention switches to the surroundings, enhancing reading and improving walking performance (b, c, d).

STUDY 2: READING SPEED DECREMENT DUE TO WALKING PERTURBATIONS
The objective of Study 2 is to assess the model's ability to replicate users' reading experiences while walking with OHMDs. Specifically, it examines how walking perturbations affect readability, comparing the model's predictions to actual human data. Other than interruptions caused by attention switches, the physical act of walking can also adversely affect reading experiences on OHMDs. Borg et al. [8] found that rotational head perturbations generated by walking can deteriorate readability on OHMDs, leading to a degradation in reading performance. While Study 1 validates the model's capability to simulate the effects of attention switches on reading, this study aims to evaluate its ability to simulate the continuous impact of walking on reading performance through OHMDs. While researchers have identified various factors that impact readability on OHMDs, including display contrast and external brightness [79], the influence of walking perturbations is particularly significant due to their consistent occurrence and close relationship with walking movements [8].

Method
Prior work [8] focused its analysis on the effect of walking on reading using numerical time data rather than textual content. Therefore, we cannot use their data to evaluate text-reading tasks.
In this study, we collect human data from a controlled study as described below:
Participants: 12 volunteers (4 females, M = 24.8 years, SD = 2.9) from the university community participated in the study. They had normal or corrected-to-normal vision and normal walking abilities. All participants were fluent in English at the university level.
Experimental design: The study adopted a within-subject design, centering on a single independent variable, mobility, in two conditions: walking or standing on a treadmill.
Task: Participants in the study were tasked with reading aloud while using an NReal device. Initially, a training trial was conducted to familiarize participants with the task of reading through the NReal device while simultaneously walking on the treadmill. Subsequently, they engaged in two data collection trials, one for each of the walking and standing conditions. The order of these mobility conditions was counterbalanced using a Latin Square design within subjects to eliminate order effects. For safety during walking, participants were instructed to hold onto a table affixed to the treadmill throughout all trials. Each trial concluded once the participant finished reading the assigned content.
Material: The reading materials were generated by ChatGPT (https://chat.openai.com/) with prompts of "University level reading materials with topics in culture, science, and technology". Each reading material had approximately 300 words. To ensure the quality and appropriateness of the reading materials, we conducted a thorough review process. Three co-authors carefully examined all generated texts to ensure they were coherent, contextually appropriate, and free from complexities that could confound the readability. We further assessed the articles using the Flesch-Kincaid Grade Level metric. The evaluation revealed a consistent level of complexity, with an average score of 16.55 (SD = 1.64). This score indicated that the articles' reading level was suitable for college students.
Procedure: The study was conducted in a quiet room with consistent indoor lighting to provide a consistent user experience. Upon entering the room, participants were briefed about the study process. They were then familiarized with the OHMD and with how to walk safely on the treadmill, after which they completed the three trials. The entire experiment lasted approximately 10-15 minutes. This study has received approval from the university's Institutional Review Board (IRB) for human subjects research.
Apparatus: Participants wore the NReal Light glasses [80] (weight = 106 grams, FOV = 52-degree diagonal, resolution = 1920×1080 pixels) as the OHMD device. In the air casting mode, its screen is 115 inches diagonal at 3 meters such that participants can read comfortably and clearly. The words displayed on the NReal were mirrored from Google Slides on a Huawei P40 [28], which was connected to the NReal by a wire. Users could easily hold the phone and use a sliding gesture to swap the digital content displayed on the NReal device. Participants walked on a Spirit Fitness [17] treadmill, which has a flat table on the front and supports a speed range from 0.8 km/h to 6.0 km/h. The use of a treadmill was a deliberate control choice designed to isolate the effect of head perturbations on readability, allowing us to gather clean data on the relationship between natural head movements and reading performance, following the prior work [8].
Measure: We measured participants' reading speed ratio by comparing their reading speed while walking to that while standing. Participants were asked to read aloud to ensure clear recognition and pronunciation. This method ensures that participants actively engage with the reading material, and it accurately reflects the reading performance without complex skip-reading or similar strategies [83]. Reading speed was determined by dividing the number of words read by the elapsed time. To evaluate the reduction in reading speed, we employed the reading speed ratio, which normalizes for time and ease of word recognition, thereby reducing the impact of individual differences in reading speeds.
OHMDs' text oscillations caused by head perturbations are formalized as integrated noisy sinusoidal waves (as described in Section 4.4) and simulated in MuJoCo, as illustrated in Figure 7. We estimated the agent's reading speed both in the standing scenario, where no perturbations were applied, and in the walking scenario, where perturbations were applied. The evaluated metric is the ratio of reading speed under the walking condition to that under the standing condition. We obtained the simulated data from the trained RL agent and compared them against human data. To demonstrate our model's flexibility and alignment with human data, we used parameter inference to tune the model parameters with human data. Following prior work by Li et al. [40], we first separated the human data into two parts: a parameter-inference dataset and a testing dataset. We randomly sampled 50% of the human data as the parameter-inference dataset, which was used to estimate the parameters for the walking perturbation noise model described in Section 4.4, and kept the remaining 50% as the testing dataset. We used grid search to optimize the two parameters, the perturbation amplitude weight (range: [0, 1], step size: 0.01) and the noise standard deviation (range: [0, 0.015], step size: 0.001), based on the reading speed ratio in the parameter-inference dataset. The optimal parameters were determined by minimizing the mean absolute error for the ratio. We repeated the same procedure 500 times. It is worth noting that our model's RL training phase did not incorporate human data. Instead, the policy was learned through the agent's interactions with the environment, where the parameters to be tuned were sampled randomly from their respective ranges. This approach resulted in a policy that is applicable across a wide range of parameter values, which were then fine-tuned in the parameter inference stage. The purpose of this fine-tuning was to align our model more closely with average users' behaviors.
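The leave-half-out procedure with grid search can be sketched as follows. This is a simplified illustration rather than the study's pipeline: `simulate` stands in for running the trained agent with a given parameter pair, and the toy function, the sample ratios, and all names below are hypothetical.

```python
import random
from itertools import product


def frange(start, stop, step):
    """Inclusive grid of values from start to stop in increments of step."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def mean_abs_error(prediction, observed):
    """Mean absolute error between one predicted ratio and observed ratios."""
    return sum(abs(prediction - o) for o in observed) / len(observed)


def leave_half_out_fit(human_ratios, simulate, grid, rng):
    """Fit on a random half of the data, then score on the held-out half."""
    shuffled = list(human_ratios)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    inference, testing = shuffled[:half], shuffled[half:]
    best = min(grid, key=lambda p: mean_abs_error(simulate(*p), inference))
    return best, mean_abs_error(simulate(*best), testing)


# Grid ranges and step sizes follow the text; everything else is a placeholder.
grid = list(product(frange(0.0, 1.0, 0.01), frange(0.0, 0.015, 0.001)))


def toy_simulate(weight, noise_std):
    """Hypothetical stand-in for the simulated reading speed ratio."""
    return 1.0 - 0.3 * weight - 5.0 * noise_std


ratios = [0.80, 0.82, 0.79, 0.81, 0.80, 0.83]  # made-up human ratios
best_params, test_error = leave_half_out_fit(ratios, toy_simulate, grid,
                                             random.Random(0))
```

Repeating `leave_half_out_fit` with different random splits, for example 500 times, yields a distribution of held-out errors instead of a single estimate.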

Reading
Figure 8 shows the comparisons between human testing datasets and simulated results over 500 repetitions of leave-half-out cross-validation. The results indicate that the reading speed ratio of our simulated agent corresponds to the reading speed ratio of human participants. As the ratios may not be normally distributed (the Shapiro-Wilk test gives W=0.993, p=0.0257 for human data, and W=0.981, p<0.001 for simulated data), we used a non-parametric permutation test to evaluate the null hypothesis that the means of these ratios are equal. The test indicates that the null hypothesis cannot be rejected (statistic=0.00254, p=0.365). This supports our model's ability to simulate the behavior observed in human participants, wherein walking-induced head perturbations impair visual perception, leading to a reduced reading speed compared to when standing. This high-fidelity replication is achieved by simulating an agent walking in the simulator, where its pixel-based visual perception is realistically disrupted by the head movements.
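For readers unfamiliar with permutation tests, a minimal two-sided test on the difference of means can be sketched as below. The function name, the number of resamples, and the sample values are assumptions for the example; the study's exact test configuration is not specified in the text.

```python
import random


def permutation_test_means(a, b, n_resamples, rng):
    """Two-sided permutation test for the difference of two sample means."""
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabeling under the null hypothesis
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)


# Usage with made-up reading speed ratios.
human = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83]
simulated = [0.80, 0.82, 0.79, 0.81, 0.80, 0.82]
statistic, p_value = permutation_test_means(human, simulated, 2000,
                                            random.Random(0))
```

A large p-value means that the observed difference in means is typical of random relabelings, so the null hypothesis of equal means cannot be rejected.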

STUDY 3: READING RESUMPTION PERFORMANCE ACROSS DIFFERENT OHMD TEXT LAYOUTS
Study 3 has three objectives. First, it assesses the model's accuracy in simulating reading resumption affected by line spacing in OHMDs, compared to human reading resumption data. Second, it explores the model's predictive capabilities for user behavior in novel design conditions without relying on human data. Third, the study involves ablation studies to understand the contributions of specific modules within our model to prevent unnecessary model complexity. While design factors such as font size, text color, and character count per line are undoubtedly important, we focus on the line-spacing factor due to its crucial impact on reading performance in the context of reading on OHMDs while walking [61,83].

Method
The reading resumption task dataset from [83] comprises observations from 12 participants. Participants were asked to read on OHMDs while walking in an area where signs were attached to fixed locations on the walls. Participants were required to read the content of these signs as they approached them. After glancing up at the sign, participants resumed the reading task on OHMDs by scanning OHMD words to locate where they left off. Two reading resumption metrics are evaluated: time cost and error rate. The time cost denotes the duration from the start of the visual scan to the selection of a word. The error rate is the ratio of erroneously chosen left-off words relative to the total words read. The two metrics were measured across three text layouts, each having a standard font size of 30:
• L0: The interline spacing is 0 (no space left between text lines).
• L50: The interline spacing is 50.
• L100: The interline spacing is 100.
The reading resumption behavior is modeled as a sequential word scan process with a memory module, where the belief of the left-off word's position decays over time (as described in Section 4.2). In the reference study [83], the exact elapsed time between participants' switching their attention away and back to OHMD words was short and not specified in the dataset. Given this, we assumed it to be constant and assigned it as 1 second. Hence, the actual memory decay happens when the agent is scanning OHMD words to resume reading. We modeled the three text layouts (L0, L50, and L100) in MuJoCo, and presented them to the agent from a distance of 3 meters (as shown in Figure 9), mirroring the study setup in [83]. During the simulation, we randomize the left-off word's position for the agent to locate. We then record the time taken from the onset of the agent's visual scan to its conclusion upon word selection. The error rate is calculated as the word-wise distance between the word selected by the agent and the actual target word, divided by the total number of words read.

8.2.1 Reading Resumption Time Cost And Error Rate: Parameter Inference. To evaluate our model's alignment with human data, we applied the leave-half-out cross-validation as described in Section 7.2. We used grid search to optimize the two parameters of the reading resumption model, the degree of memory decay ([0.5, 1], step size of 0.05) and the weight on the likelihood's standard deviation ([1, 5], step size of 0.5), based on the reading resumption time cost and error rate in the parameter-inference dataset. The optimal parameters were determined by minimizing the sum of the mean absolute errors for both normalized reading resumption time cost and error rate across the three text layouts (i.e., L0, L50, L100). For visualization, we proportionally scale the simulated data to the human data for each validation. We repeated the same procedure 500 times.
Figure 10 shows comparisons between the human data and simulated results across the three text layouts over 500 repetitions of leave-half-out cross-validation. The average RMSE is 0.58 s for the reading resumption time cost and 3.03% for the error rate. As Shapiro-Wilk tests indicated that our simulated model's outputs were not normally distributed, we used non-parametric Kruskal-Wallis tests to evaluate whether there were significant differences between layouts. Kruskal-Wallis tests indicated that both the time costs and error rates differed significantly across layouts (H=570.8 and p<0.001 for time costs, and H=753.5 and p<0.001 for error rates). Post-hoc Dunn's tests revealed that all layouts differed significantly in both metrics (p<0.001 for all layouts when corrected for multiple comparisons). Hence, we could conclude that the behavior produced by our model replicates the trend seen in human data, where time costs and error rates decrease as the line spacing increases.
The parameter inference (PI) method was employed to more closely align our simulation results with the behaviors of average users. To further demonstrate our model's capacity to predict user behaviors without any human data calibration, we conducted tests using default values for the two parameters, as illustrated in Figure 10. The middle bar in the figure (labeled 'Simulated Data: without PI') indicates that, without parameter inference using human data, the model still captures the trends across layouts. While the mean simulation results deviate slightly more from human data, the model reflects the trends in both resumption time cost (Kruskal-Wallis H=74.6 and p<0.001) and error rate (H=19.8 and p<0.001). These results demonstrate that our model is capable of generating simulations that conform to human behavior, even when not explicitly calibrated with human data, offering the potential for predictions in scenarios lacking human data.
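To make the layout comparison concrete, the Kruskal-Wallis H statistic can be computed from pooled ranks as in the sketch below. It uses average ranks for ties but omits the tie-correction factor, the function names are hypothetical, and the sample time costs are made up for illustration.

```python
def average_ranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic without tie correction, computed on pooled ranks."""
    pooled = [v for group in groups for v in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)


# Usage: hypothetical resumption time costs (seconds) for L0, L50, and L100.
h_statistic = kruskal_wallis_h([[2.9, 3.1, 3.4], [2.4, 2.6, 2.7],
                                [1.9, 2.0, 2.2]])
```

A larger H indicates that the rank sums of the layouts differ more than expected under the null hypothesis of identical distributions; the p-value is then read from a chi-squared distribution with (number of groups − 1) degrees of freedom.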

Ablation Studies.
To the best of our knowledge, no existing models have been designed to simulate users' reading resumption behaviors using OHMDs. Given this novelty, there is no natural baseline model to compare against. To gain a deeper understanding of the critical components in our model, we conducted ablation studies with two primary goals: first, to evaluate the impact of model components whose effectiveness might not be immediately obvious or observable; and second, to assess the unique, innovative aspects of our model. With these objectives, we conducted ablation studies on the Position Memory Module (PMM) and the Re-entry Position Module (RPM). As described in Section 4.2, the PMM characterizes how users progressively forget their last reading position over time during the visual search, and the RPM characterizes how a user selects a re-entry word from the approximate area where they stopped reading. We conducted two distinct ablation studies to separately assess the contributions of the PMM and RPM to the overall model. For the PMM, we set its weight to 0 in the formulation. Thus, the Bayesian belief update changes from Equation 1 to $b(s_{t+1}) \propto O(s_{t+1}, a_t, o_t)\, b(s_t)$. For the RPM, we replaced its Gaussian distribution with a uniform distribution. Without the PMM, an agent's belief remains static during the visual search since its memory does not decay over time; without the RPM, the agent's belief is unaffected by text line spacings because it does not account for words' spatial position differences.
With all other components kept constant, we retrained the two models and evaluated their performance using the same parameters generated for the abovementioned leave-half-out cross-validation. The results, plotted in Figure 10, show a dramatic degradation in performance. Without the PMM, RMSE increases to 1.12 for time cost and 13.63% for error rate. Apart from the large deviation from human data, the ablation results fail to produce the expected descending trend across different layouts. Similarly, without the RPM, RMSE increases to 1.11 for the time cost and 18.51% for the error rate, accompanied by large deviations from the human data. These performance deteriorations highlight the crucial roles of both the PMM and RPM in our model. Without them, the agent cannot develop effective visual searching strategies that balance time efficiency and accuracy, and fails to replicate human reading resumption behaviors across layouts.

STUDY 4: READING ON OHMDS WHILE WALKING
Prior work has indicated that reading on OHMDs while walking adversely impacts both activities, leading to frequent attention switches, interruptions in reading, and slower walking [83]. The goal of Study 4 is to evaluate whether our model can accurately simulate the effects of such multitasking, in alignment with human data from a study by Zhou et al. [83].

Method
Since prior work did not provide data on attention allocation and walking speed control, we replicated their experiment while tracking participants' visual attention allocation and walking speeds.
The design and methodology of our study are detailed as follows:
Participants: Twelve university students (5 females, 7 males, M = 26.25 years, SD = 2.49) participated. All had normal or corrected-to-normal vision and normal walking abilities and were proficient in English.
Experimental design: The study adopted a within-subject design, centering on a single independent variable: the vertical text spacing in OHMDs, specifically at level L100. This choice was informed by prior research demonstrating that L100 is most effective for reading on OHMDs while walking [83].
Task: To determine preferred walking speed, participants were asked to walk twice along a predefined rectangular path. Then, to determine standard reading speed, they were asked to read an article on the OHMD while standing. After these, participants undertook a task that involved walking, reading, and attending to environmental signs. This 2-round rectangular path had eight signs (four per round), each displaying different information. During the task, participants were asked to read both the texts on OHMDs and the signs aloud, quickly and accurately while walking, ensuring comprehension. They were allowed to stop and resume walking as necessary. Each trial started at the start line and ended after the participant completed the two rounds of walking, regardless of whether they had finished reading the text on the OHMD.
Material: To avoid potential confounding effects, we used four English articles from the AceReader application, identical to those in Zhou et al.'s work. These articles all had the same difficulty level of 8th grade, ensuring ease of readability for all participants. The average article length was 360 words (SD=11.1). The environmental signs used were also the same as in the prior study.
Procedure: We first provided an example task to familiarize participants with the experiment as part of the training session. In the actual data collection experiment, each participant completed three trials, with reading and sign materials randomized and unique for each trial. The study has received approval from the university's Institutional Review Board (IRB) for human subjects research.
Apparatus: The Meta Quest Pro [45] was employed both as the digital text display and for data collection, including eye-tracking and motion tracking. Text was displayed with a 30-pixel font size and 100-pixel interline spacing as suggested by Zhou et al. [83]. Participants pressed buttons on the headset's controller to turn pages.
Measures: Metrics are categorized into three sections: 1) Attention allocation: where and when attention was allocated to environmental signs or OHMDs. Specifically, attention allocation on the physical route, and the percentage of time spent focusing on these signs (denoted as attention allocation on signs). 2) Walking: percentage of the preferred walking speed (%PWS). 3) Reading: the reading speed ratio (same as in Study 2) and the reading resumption time cost and error rate (same as in Study 3).
To evaluate our model's alignment with human data, we applied the leave-half-user-out cross-validation and grid search to optimize four parameters with the collected human data: one with range [0, 1] (step size of 0.01), one with range [1, 1.5] (step size of 0.05), and the two reading resumption parameters (same as in Study 3). The optimal parameters were determined by minimizing the sum of the mean absolute errors for the metrics, including the percentage of attention allocation, reading speed ratio, percentage of preferred walking speed (%PWS), reading resumption time cost, and error rate.
Figure 12 shows the comparison between the human data and simulated results over 500 repetitions of leave-half-user-out cross-validation. Our focus is on accurately capturing human multitasking behaviors in three key aspects: attention allocation, walking, and reading. To assess attention allocation to environmental signs, we measured the percentage of attention allocation to signs. The impact on reading is evaluated through the reading speed ratio (the percentage of preferred reading speed), along with metrics for reading resumption time cost and error rate. These factors consider the influence of both walking and attention switches as discussed in Study 2 (Section 7.2) and Study 3 (Section 8.2). For walking, we analyzed the effect of multitasking using the percentage of preferred walking speed (%PWS), following the methodology of Zhou et al.'s work [83].

Similar to the approach in Section 7.2, we used permutation tests to evaluate the null hypotheses that the means of our simulated metrics match the means of the human metrics. The null hypothesis cannot be rejected for reading resumption cost (statistic = 0.016, p = 1), error rate (statistic = 0.009, p = 0.18), attention allocation (statistic = -0.001, p = 0.13), and percentage of preferred walking speed (%PWS; statistic = -0.0003, p = 1), suggesting that our model closely replicates human behavior. However, the null hypothesis is rejected for the reading speed ratio (statistic = 0.024, p < 0.001; all reported p-values have been Bonferroni corrected for multiple comparisons), even though the average metric is close to the human data. This discrepancy likely arises because computationally rational models aim to represent average human behaviors, as noted by Oulasvirta et al. [54], whereas real human data often exhibit considerable variability that this approach is not designed to capture.
The results from our study indicate that the model effectively simulates user behaviors in the realistic task of reading while walking. It captures key phenomena such as users slowing down to counteract reading difficulties caused by walking, as well as the dynamics of attention switching and the subsequent resumption of reading through visual search processes.
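For readers unfamiliar with the test used above, the following is a minimal sketch of a two-sided permutation test on the difference in means, followed by a Bonferroni correction. It is a generic illustration under our own assumptions (function names, number of permutations, and the add-one p-value estimate are not taken from the paper). Because Bonferroni-adjusted p-values are capped at 1, adjusted values of exactly 1, such as those reported above, are an expected outcome of the correction.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test_mean_diff(sample_a, sample_b, n_permutations=10000, seed=0):
    """Two-sided permutation test; returns (observed mean difference, p-value)."""
    rng = random.Random(seed)
    observed = mean(sample_a) - mean(sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one estimate so that the p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


def bonferroni(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

In this setting, `sample_a` would hold the simulated values of one metric and `sample_b` the corresponding human values, with one test per metric and the correction applied across all five tests.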
Figure 13 further compares moment-by-moment attention allocation trajectories for simulated and human users along spatial and temporal dimensions. This comparison exemplifies our model's ability to sequentially simulate human multitasking behaviors, including attention distribution and locomotion control.

DISCUSSION
This paper has shown that hierarchical supervisory control can effectively simulate a number of key dynamics in interacting with OHMDs. The model generates behaviors by estimating the optimal policy that maximizes rewards under its beliefs about the world and its internal capacities. Because it is based on deep reinforcement learning, our model can simulate human behaviors within complex state spaces. It interprets pixel-based visual inputs, which is crucial for guiding users' cognitive processes in dynamic activities such as reading on OHMDs while walking. This approach not only achieves high-fidelity simulation but also captures the adaptability of users during interactions. By modeling the interplay of oculomotor control and locomotion control, we replicated the following empirical phenomena in the context of reading on OHMDs while walking: 1) walking-induced head perturbations decrease readability and reduce reading speed on OHMDs; 2) reading resumption performance improves with increased OHMD text spacing, as evidenced by reduced time costs and error rates; 3) users tend to stop near static environmental signs to read them; and 4) there is a trade-off between walking speed and reading ease, with users preferring to slow down to minimize the impact of head perturbations and enhance their reading experience.
We posit that the model could serve as a useful tool to help design more usable and safe OHMD interactions. For example, our model allows researchers to derive quantitative predictions about OHMD interface designs at a much finer granularity. Zhou et al. [83] tested three text layouts using a coarse interval of 50. In contrast, our model allows exploration of finer layout nuances, such as intervals of 25, including configurations that were not previously tested. This enables a more comprehensive investigation into optimal layout configurations, potentially identifying an ideal layout between L100 and L125 as shown in Figure 14, a possibility suggested but not empirically confirmed by Zhou et al.
Furthermore, our model can simulate various realistic 3D environments, extending beyond the simplified 2D spaces often used in previous research. This feature is crucial for studying visual attention allocation when using OHMDs in diverse scenarios, such as navigating a busy street or a quiet rural path. The integration of oculomotor control and pixel-based visual perception operating in MuJoCo enhances this realism.
Another key advantage of our model is its potential to predict optimal moments when users should switch their attention to the environment, informing the design of adaptive OHMD interfaces. This could be particularly useful in timing the presentation of digital content to minimize its interference with tasks like walking. While current research in this domain often relies on deep learning methods [11], our model offers a more interpretable alternative with little dependence on human data.

LIMITATIONS AND FUTURE WORK
In Study 1, the lane-changing scenario represents an effective model for exploring attention switches in response to environmental stimuli. While this scenario serves as a foundational example, we acknowledge its limitations in capturing the full complexity of real-life interactions. Future work could expand upon this by incorporating more realistic tasks, such as navigating a virtual cityscape rendered in MuJoCo.
Our evaluation methodology primarily utilized aggregated metrics, following established conventions in prior research [12,34,35,40,41]. While our model is capable of generating moment-by-moment predictions (as shown in Figures 5 and 13), it was primarily designed to capture average behavioral trends over time. Recognizing this limitation, future work should enhance the model by incorporating additional parameters, allowing for a more detailed moment-by-moment evaluation.
Beyond these points, our work incorporated several assumptions that can be revisited to further improve and extend the model. First, a more detailed neuromechanical locomotion control could be added to simulate human gait and speed control more accurately. Second, the rotating eyeball enables the agent to process symbolic words or environmental cues as rectangular visual objects by fixating on them until a threshold is surpassed. However, this abstracted vision falls short of real reading, where words have diverse shapes and actual semantic meanings. Moreover, the model does not presently process the environment in order to react to it. Future work could enhance visual perception by using general-purpose segmentation methods to detect, recognize, or react to different objects in the environment.

CONCLUSION
In conclusion, we introduce a computationally rational model that simulates the human multitasking behaviors of reading on OHMDs while walking. The model, grounded in the theory of boundedly optimal control, utilizes POMDPs to represent human behaviors as sequential decision-making processes, and learns policies through reinforcement learning. To manage the complexity of modeling multitasking with high fidelity, we implemented a hierarchical RL structure consisting of supervisory control, subtask management, and motor control. Across four studies, we evaluated the model's overall performance and the efficacy of its key components. The results affirm its capability to replicate users' key multitasking behaviors, especially attention allocation, reading dynamics, and walking behaviors. This model, therefore, offers a useful foundation for guiding future research.

Figure 2: The model consists of a three-level hierarchical structure with a Supervisory Controller (SC), task-level models Read (R) and Scan (S), and motor-level models for Oculomotor Control (OC) and Locomotion Control (LC). These submodels receive observations from the underlying internal and external states, and interact with them through actions while receiving rewards (black arrows). The red arrows highlight how higher-level models set targets for lower-level ones: SC determines the focus task (via ), while R and S establish specific targets and for directing motor-level models. For higher-level models, these targets act as guiding actions. Conversely, for lower-level models, they are treated as observations that influence and define agents' tasks.
(a) Third person view (b) First person view (c) The two-lane path (d) The rectangle path (e) The city landscape

Figure 3: We built a simulated environment in MuJoCo with an OHMD, a user, and signs (the red rectangle) to look at. The eyeball can be rotated to focus freely on either the OHMD (the black grid) or the environment. The user model can 'walk' on predefined paths. Our model leverages pixel-based image inputs for training the agent's visual attention policies, eliminating the need for hand-crafting complex state-action pairs across various visual scenes. This approach features a flexible simulation environment, demonstrated through various configurations: (c) a two-lane path for Study 1, (d) a rectangle path for Study 4, and (e) a city landscape for more realistic scenarios.

Table 1: Summary of selected parameters in our model. (a) refers to the fixation position's coordinate values after saccades. (b) refers to the probability of each word being the target last read word. As time progresses, rises, leading to a rise in . The agent becomes more uncertain about the true position of the last read word. (c) refers to the probability of observing the currently fixated word. (d) represents the noise caused by the walking perturbation.
Distribution:
Oculomotor ∼ N ( , ) | saccade's destination coordinates | 0.08 × | [12]
Position Memory ∼ N ( , ) | last read word's index | 4.5 / (1 + −0.5 ) | [40]
Word Observation ∼ N ( , ) | fixated word's index | × | [75]
Walk Perturbation Noise ∼ N (0, ) | 0 | inferred from human data

Figure 4: The agent's simulated reading resumption scan paths. When reading resumes after an interruption, the agent must find the right location in the text to continue reading. This figure shows two cases. Case 1 is a successful reading resumption in which the agent correctly scans and selects the last read word, 'that'. The belief keeps increasing as the agent approaches the right word. Case 2 is unsuccessful. Here, the agent selects the wrong word because of memory loss. The value of the word 'to' decreases over time when no scan actions are made.

Figure 5: The simulated moment-by-moment attention switch trajectories of three agents in Study 1, highlighting how agents' strategies adapt to varying reward weights for the reading and walking tasks. From top to bottom, as the reward weight for the walking task increases, there is a noticeable increase in the number of attention switches and in attention allocation on environmental signs, from Shakespeare to Norman and Olaf. Consequently, the reading task is interrupted more frequently, extending the time required for its completion.

Figure 6: Simulation results for Study 1. The plots show how agents adapt their attention switching behavior to rewards (task priorities) and bounds (walking speed). We trained three agents, Shakespeare, Norman, and Olaf, using varying reward weights for the reading and walking tasks. Interestingly, less focus on reading narrowed the difference in reading interruption positions across agents (a). For specific reading and walking behaviors, increasing emphasis on walking enhanced environmental awareness at the cost of reading performance, and vice versa (b, c, d). Furthermore, a reduction in walking speed requires fewer attention switches to the surroundings, enhancing reading and improving walking performance (b, c, d).

Figure 7: The MuJoCo simulation in Study 2. The panels demonstrate how head perturbations are implemented in the simulator: (a) depicts a static scene, while (b) and (c) illustrate shaking effects in different directions.

Figure 8: The average reading speed ratio (SD), comparing walking to standing conditions in Study 2, after parameter inference with 500 repetitions of leave-half-user-out cross-validation.

Figure 9: Three text layouts in Study 3. The white sphere stands for the agent's eyeball, and the yellow ray represents the agent's current line of sight. Here, the agent is reading the second word (rectangle).

Figure 10: Comparison of human data against simulated data with and without PI, PMM, or RPM in Study 3, showing the average time and error rate (SD). When the model is intact (with PMM and RPM), it produces an average RMSE of 0.58 for reading resumption time and 3.03% for reading resumption error rate across three text layouts. Without PMM (or RPM), the model shows a larger discrepancy, fails to capture the trend in the human data, and produces higher RMSEs of 1.12 (or 1.11) and 13.63% (or 18.51%). The model without parameter inference (PI) produces less accurate predictions than the model with PI.

Figure 11: Comparison of Study 4's settings: On the left is the agent's simulated first-person perspective; on the right is a first-person-perspective screenshot from the human user via Oculus Casting [52]. Both scenarios involve the same task of navigating a rectangular path and reading texts on OHMDs, while also paying attention to environmental cues.

Figure 13: Comparisons of moment-by-moment attention allocations between simulation and human data in Study 4. (a) Physical route trajectories contrast the attention patterns of simulated agents and human users, where human data was captured by the motion tracker in the headset. Larger dots indicate longer viewing times at a position. With a constant sampling rate of 1 fps, sparse dots suggest faster movement. (b) Temporal attention allocation trajectories display the distribution of attention over time for both simulated and human data. These demonstrate that the model can effectively simulate humans' tendency to decelerate or halt for sign reading, as well as attention switches between environmental signs and OHMD reading texts over time.

Figure 14: The predicted normalized reading resumption time and error rate across six layouts.L25, L75, and L125 were not validated in the empirical work.The intersection point between L100 and L125 indicates an optimal layout that minimizes both users' reading resumption time cost and error rate.
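The intersection point referred to in this caption can be located by linear interpolation between adjacent layouts. The sketch below is illustrative only: the layout spacings (25 to 150 in steps of 25) are our reading of the six layouts, and the two curves are made-up placeholder values rather than the model's predictions.

```python
def find_crossings(xs, curve_a, curve_b):
    """Return x positions where two sampled curves cross, by linear interpolation."""
    crossings = []
    for i in range(len(xs) - 1):
        d0 = curve_a[i] - curve_b[i]
        d1 = curve_a[i + 1] - curve_b[i + 1]
        if d0 == 0:
            # The curves coincide exactly at a sampled point.
            crossings.append(float(xs[i]))
        elif d0 * d1 < 0:
            # Sign change: the difference reaches zero inside this interval.
            t = d0 / (d0 - d1)
            crossings.append(xs[i] + t * (xs[i + 1] - xs[i]))
    if curve_a[-1] == curve_b[-1]:
        crossings.append(float(xs[-1]))
    return crossings


# Assumed layout spacings and placeholder normalized curves (not the paper's data).
layouts = [25, 50, 75, 100, 125, 150]
placeholder_curve_a = [1.0, 0.875, 0.75, 0.5, 0.25, 0.125]
placeholder_curve_b = [0.0, 0.125, 0.25, 0.25, 0.5, 0.75]
crossing_points = find_crossings(layouts, placeholder_curve_a, placeholder_curve_b)
```

With real predictions, each curve would first be normalized to a common range so that resumption time and error rate are comparable on one axis.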

Table 2: An overview of the four studies evaluated in this paper.
Attention allocation, walking and reading | Human vs. simulated
three separate studies to evaluate the contributions of the key components of our model: Supervisory Controller, Read model, and Oculomotor Control. Following these, we conducted a unified study to evaluate the overall model performance. The studies are specified in