Imperfect Surrogate Users: Understanding Performance Implications of Augmentative and Alternative Communication Systems through Bounded Rationality, Human Error, and Interruption Modeling

Nonspeaking individuals with motor disabilities frequently rely on augmentative and alternative communication (AAC) systems that allow users to communicate through a text entry interface coupled with a speech synthesizer. Such systems are notoriously difficult to evaluate with end-users. However, recent research has proposed envelope analysis as a method to estimate text entry rates and keystroke savings by simulating the interaction of an expert surrogate user entering sentences on a conceptual word-predictive text entry system. While only a part of the evaluation process of an AAC system, this method enables AAC designers to benefit from quantitative insights early on in the design process. This paper extends prior work by (1) demonstrating how to incorporate natural language generation, such as sentence generation, in such analyses; (2) presenting a model of an imperfect surrogate user that incorporates bounded rationality, human error, and interruptions to provide a more realistic simulation of text entry behavior; and (3) demonstrating how to estimate model parameters by observing users' actual typing behavior. We validate the model with data collected from eight participants using an AAC system on a touchscreen.


INTRODUCTION
Nonspeaking individuals with motor disabilities are heavily reliant on augmentative and alternative communication (AAC) systems to communicate. Such systems provide nonspeaking users with means to communicate via a speech synthesizer. Predictive text entry AAC systems support access techniques, such as eye gaze, dwell mouse click, and touchscreen, and provide text predictions in the form of word, phrase, and sentence predictions. These features enable literate AAC users to potentially increase their text entry rates.
However, evaluating AAC systems with actual users poses a challenge since the user group is highly heterogeneous, with individual access needs, technical solutions, and personal support.

A limitation of prior work [21,23] was that model parameters for envelope analysis had to be estimated by the designer. We present a method for estimating such parameters from actual user behavior at runtime and use this method to validate our model with eight users.

Paper Structure
This paper has two main objectives: (1) to extend prior work [23] on a conceptual design of a word predictive text entry system to a text entry system with word and sentence prediction functions; and (2) to propose and apply the imperfect surrogate user model to perform envelope analysis and estimate parameters from actual user data at runtime. To achieve these objectives, the rest of this paper is structured as follows. First, we review prior work in AAC system design, bounded rationality, human error, and interruptions for text entry modeling. Second, we present a function structure model for the design of a word and sentence predictive text entry system for AAC. We calculate the upper bounds on error-free and optimal expert text entry rate and keystroke savings for both able-bodied and AAC users, based on parameters obtained from the literature. Next, we introduce the imperfect surrogate user model and use it to carry out envelope analysis to understand the potential performance impact of incorporating human performance factors, along with text entry strategies, on text entry rates and keystroke savings. We then explain how to estimate parameters for the model by observing actual user behavior. We use this method to validate the imperfect surrogate user model with data collected from eight participants using an AAC system on a touchscreen tablet PC. Finally, we discuss the implications of this work and conclude.

RELATED WORK
The literature has long considered approaches for evaluating text entry methods (e.g. [59]), such as expanded rehearsal interval training [58], representative stimulus sentences [28,50], stimulus sentence presentation styles [24], and composition tasks [14,51]. It has also been recognized that for text entry methods to be successful, they need to consider wider issues beyond merely improving entry and error rates (e.g. [19,20]). However, unlike other text entry domains, AAC also brings its own unique design issues.

Challenges in AAC Systems Design
The study of AAC has always been challenging. In general, the demand for research-driven technological development is enormous, especially in obtaining insights from the processes underpinning basic cognitive, motor, sensory-perceptual, and linguistic functions and utilizing them to maximize human-computer interaction efficiency through the implementation of AAC devices and methods [22,25]. Moreover, the lack of researchers, engineers, and technical developers [30] results in a large number of unanswered questions and technical problems [55], especially when AI technologies, such as NLG models [6,44], are involved as design materials.
User experience studies of AAC systems often employ qualitative empirical approaches, such as field studies, which can take many weeks to several months to produce outputs [4,29]. Other methods of evaluating user experience involve questionnaires [33]. However, such post-hoc evaluation methods may fail to capture immediate feedback on user experience, or aspects not listed in the questionnaire. For example, Black et al. [4] point out that users do not always select the correct prediction once it appears on the system, but researchers fail to understand the users' intention behind this action. Moreover, it can be challenging for AAC users to think aloud while using the system. Video analysis can capture the entire interaction process [53] and may provide insights into the intention behind user actions. However, this approach is time-consuming, requiring researchers to analyze the video frame-by-frame. Hence, efficient methods to understand and identify AAC user performance for iterative improvement of NLG-based AAC systems are still lacking.

Computational Models for Text Entry
The idea of viewing interaction through the lens of a computational model is not new, but has recently been invigorated through the establishment of computational interaction [36] and the development of new mathematical tools to model interaction, such as Bayesian methods [54]. The concept of computational models for text entry, commonly used in the design of general text entry systems, can also assist in the design of AAC systems without the extensive involvement of AAC users. These text entry models typically focus on two main directions. The first direction is related to Fitts' law (or FFitts law [3] for touchscreen-based research), which has been extensively researched for non-predictive text entry, modeling user typing speed on different keyboard layouts using different typing methods, such as two-thumb text entry on mini-QWERTY mechanical keyboards [9], stroke-based OPTI II soft keyboards [40], and stylus-based QWERTY soft keyboards [48]. These models quantitatively simulate the time cost of each click or gesture stroke movement from one key to another, taking into account the distance of each movement and the interaction methods.
The second direction of text entry modeling focuses on predictive text entry features, which heavily involve decision-making processes. Instead of focusing on calculating the time cost of the finger moving between keys via Fitts' law-based models, which can be impacted by system layouts and typing methods, these studies [21,23] investigate how predictive text features can increase or decrease the text entry rate and keystroke savings at a function level.
It is particularly important to investigate complex systems at a functional level, understanding how multiple functions and interaction points in a complex system mutually impact user performance and lead to different system efficiency and effectiveness. For example, a research question could be: what is the best time for a user to check word predictions, and when should a user give up on them [23]? Every keystroke takes time from users, which is particularly important to consider in the case of AAC users. The trade-off here is that, although correct predictions can save valuable keystrokes for users, having the user check predictions generated from too little user input may cost the user extra time, as further user input is required to generate the user's expected predictions. Hence the goal is to type the correct text with minimum effort using prediction functions. This is a typical task analysis (TA) issue, which is at the heart of this paper.
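This trade-off can be made concrete with a small expected-value sketch. The function below is our own illustration, not a model from the cited work, and the parameter values are purely hypothetical:

```python
# Hypothetical sketch: expected net time saved by checking a word-prediction
# list once. All parameter values below are illustrative, not reported data.

def expected_gain(p_correct, letters_saved, t_key, t_react):
    """Expected time (s) saved by one prediction check.

    p_correct     probability the list contains the intended word
    letters_saved keystrokes avoided if the prediction is selected
    t_key         seconds per keystroke
    t_react       seconds to scan the prediction list
    """
    # Checking always costs t_react; a hit saves letters_saved keystrokes
    # minus the one keystroke needed to select the prediction.
    return p_correct * (letters_saved - 1) * t_key - t_react

# Early in a long word the expected gain can be positive; for a short word
# the fixed scanning cost dominates and checking does not pay off.
assert expected_gain(0.7, 6, 0.60, 1.20) > 0   # long word, good accuracy
assert expected_gain(0.7, 2, 0.60, 1.20) < 0   # short word: checking costs more
```

The sign of this expected gain is exactly what the strategy parameters introduced later are meant to control.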
Design researchers have been building user models for TA, such as the KLM (Keystroke-Level Model) [7], MHP (Model Human Processor) [57], and GOMS (Goals, Operators, Methods, and Selection rules) [8], using psychological theories and simulation modeling since at least the 1980s. These interaction models investigate how users reason and make decisions when using complex interfaces, with the intention of allowing different design elements or design configurations to be tested prior to the development of a working system or before carrying out user studies [27,41,43].
In this vein, Kristensson and Müllners [23] propose a computational model at a functional level, including three parameters describing text entry strategy, to simulate and analyze the impact of text entry strategies on text entry rate and keystroke savings in rational and error-free settings. This model enables an explanation of the mechanism for why word prediction is typically not useful for an able-bodied user.
However, these studies share a common limitation in that they assume text entry is error-free and that the surrogate user is an expert, which may not necessarily reflect reality.

Human Performance Factors in Text Entry
Empirical studies in bounded rationality [12,17], human error [11,13,24], and interruptions [5] have shown that such human factors concerns have a negative impact on a user's text entry rate.
The concept of bounded rationality is derived from behavioral economics and public policy for decision-making [46]. The main assertion of bounded rationality is that people, limited by time, knowledge, and resources, make satisfactory decisions instead of maximizing utility [45]. Specifically, Quinn and Zhai [38] note that text entry suggestions come with a cognitive cost, while Sarcar et al. [42] adopt a computational rationality model to develop an ability-based optimization text entry system for smartphones.
Interruptions frequently occur in daily life. Pielot et al. [37] report that, on average, participants received 63.5 mobile notifications per day, such as messages and emails. Borst et al. [5] propose an interruption model of memory-for-problem-states, verified with text entry experiments, which integrates three factors: interruption duration, interrupting-task complexity, and moment of interruption.
Human error is one of the most common human factors concerns in text entry, and is more thoroughly studied than the other factors. For instance, the autocorrect function is designed to reduce the impact of human errors. Further, as Banovic et al. [1] point out, making typing errors when entering text is inevitable, and correcting errors is time-consuming. As a result, typists may slow down to reduce typing errors. Accordingly, Banovic et al. [1] propose a computational model to estimate the effects of risk aversion to errors on expert typing speeds for QWERTY mobile touchscreen keyboards with or without autocorrect [2].
In summary, prior research has demonstrated that bounded rationality, human error, and interruptions are three significant human factors concerns that adversely affect text entry performance. Computational models have been proposed for general human-computer interaction tasks, with each model focusing on a specific human factor. Moreover, specific computational models for text entry have been proposed to estimate the upper bound of expert text entry rates under error conditions. However, there is currently no computational model that integrates all three factors to estimate the performance of non-expert users for word and sentence prediction systems, which is particularly important for AAC design.

FUNCTION STRUCTURE MODEL: TEXT ENTRY STRATEGIES FOR PREDICTIVE AAC SYSTEMS
The function structure model allows designers to understand system functions and the data flows between them. This can then be used to derive a human-computer interaction flowchart for envelope analysis [23]. We adopt this model to illustrate the function descriptions of a predictive AAC text entry system. Specifically, the overall function Generate Sentence is decomposed into six main functions: Type Key, Predict Current Word, Predict Next Word, Select Word Prediction, Predict Sentence, and Select Sentence Prediction. These functions are connected by signal flows, represented by text in different fonts and different types of lines (see Figure 1). The signal flows provided by the system or the user fall into four categories. First, the user input-related signal flows, including Key Press, Word Selection, and Sentence Selection, are the user's physical actions. Second, Word Hypotheses, Sentence Hypotheses, and Observation are the user's mental actions. Third, Language Context is the text prediction-related internal information defined and generated by the system, such as language models and machine learning algorithms. Fourth, Word and Sentence are system-predicted text selected by the user. Based on this categorization, performance analysis of this joint user-system gives rise to three user-simulating parameters and three system-simulating parameters, respectively. The user-simulating parameters define the time cost of the user's physical and mental actions:

Key Typing Time (T_key) - Type Key actions include entering letters (Key Press), selecting a word prediction (Word Selection), and selecting a sentence prediction (Sentence Selection). This parameter is the time duration between two contiguous keystrokes entered by the user without any considerable mental processing time. Although this duration can be estimated by Fitts' law [26] based on the layout of the keyboard, to simplify this model we assume the time cost of every keystroke is identical.

Reaction Time for Word Predictions (T_rw) - This parameter reflects Select Word Prediction. It is the substantial mental-action time cost that allows the user to read through the word prediction list (Observation) and determine whether or not to select a prediction (Word Hypotheses). We assume the time cost is identical every time the user checks the list.

Reaction Time for Sentence Predictions (T_rs) - This parameter reflects Select Sentence Prediction: the time the user spends reading and deciding whether to select a sentence prediction (Observation and Sentence Hypotheses). As sentence length impacts reaction time, we estimate this parameter by multiplying the sentence length in words, n_s, by the reaction time for word predictions (i.e., T_rs = n_s · T_rw). Alternatively, this parameter can also be estimated empirically via real users.

Fig. 2. The flowchart for the overall text entry strategy on an NLG system equipped with word prediction and sentence prediction functions. The white rectangles indicate the start and the end of a complete sentence entry task. The white parallelogram is the targeted text. The teal rectangles with solid lines show specific statuses, and the teal rectangles with dashed lines are specific user actions. The pink rhombuses denote strategy decisions. The yellow rounded squares represent repeated user operations, whose details are illustrated in Figure 3 and Figure 4.
In addition, the system simulates outcomes based on Language Context, which defines the likelihood of the prediction functions obtaining the correct text when the surrogate user inputs new text via a single keystroke (e.g., a letter, a predicted word, or a predicted sentence) in a simulated text entry task. This likelihood is the proportion of queries that yield correct predictions, ranging between 0 and 1 [23].
As described in the function structure model, there are two interaction points at which prediction functions can boost the text entry rate: (1) when typing a word, the system predicts the currently typed word; and (2) when a word is completed, the system predicts the next word and the entire sentence. We parameterize the current word, next word, and sentence prediction functions to accommodate various language models through three parameters:

Current Word Prediction Accuracy (a_cw) - The current word prediction function provides a list of guesses for the word currently being entered, based on the typed letters and context information, displayed as a list of words on the system. This design aims to help the user finish an expected word by entering the minimum number of letters. To simplify this model, we assign a value between 0 and 1 to reflect the prediction accuracy.

Next Word Prediction Accuracy (a_nw) - The next word prediction function predicts the next word based on previous entries and context information, assisting the user to quickly form a sentence. Similarly, we assign a value between 0 and 1 to estimate the accuracy. This function may appear indistinguishable from the current word prediction function in the interface, as it also shows a list of predicted words on the system; however, the underlying algorithms are different. Thus we separate word prediction accuracy into two parameters.

Sentence Prediction Accuracy (a_s) - Information retrieval-based sentence generation and large language model-based (LLM-based) sentence generation are two mainstream sentence prediction approaches with different attributes. The former retrieves text from a limited data set and produces sentence suggestions, while the latter, such as ChatGPT [34], produces sentence predictions based on prompts. However, the acceptance of predicted words and generated sentences produced by LLMs remains an open question, as individuals may have different levels of acceptance of AI-suggested text, which leads to different a_s values. In addition, the imperfect surrogate user focuses on simulating and identifying non-expert users' performance as affected by human factors concerns. Therefore, the mathematical simulation of specific word and sentence prediction functions is out of the scope of this research. Accordingly, to simplify this model, we assign a value between 0 and 1 to estimate the accuracy, which we derive empirically from an existing AAC text entry system [56]. We discuss this procedure in Section 3.2.
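As a rough sketch of how such an accuracy parameter can drive a simulation, each prediction query can be treated as a Bernoulli trial whose success probability is the accuracy parameter. The function name below is ours, and the accuracy value is illustrative:

```python
import random

# Sketch: a single prediction query modeled as a Bernoulli trial with
# success probability equal to the accuracy parameter (here 0.71, an
# illustrative current-word prediction accuracy).
def prediction_hit(accuracy, rng):
    """True if one simulated query yields the intended text."""
    return rng.random() < accuracy

rng = random.Random(42)
trials = 100_000
hits = sum(prediction_hit(0.71, rng) for _ in range(trials))
hit_rate = hits / trials   # converges to the accuracy parameter over many queries
```

Under this reading, the accuracy parameter is simply the long-run proportion of queries that yield the correct prediction, matching the definition above.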

Text Entry Strategy Modeling
There are three parallel interaction points when interacting with a text prediction system: (1) letter-by-letter typing; (2) selecting a word prediction; and (3) selecting a sentence prediction. The main goal of studying text entry strategy is to minimize the cost of a poor guess (checking the prediction list but ending up without a satisficing result) and to maximize the entry rate (quickly finishing the sentence by saving keystrokes). The strategy is defined by whether, and when, to refer to the word predictions and the sentence predictions. In other words, the strategy determines how users arrange their physical and mental actions. Kristensson and Müllners [23] demonstrate that, in word predictive text entry systems, such strategies have a significant impact on text entry rate and keystroke savings. They propose three text entry strategy parameters for word prediction:

Minimum Word Length (w_min) - The minimum word length strategy restricts the use of predictions to words above a certain length, w_min. The idea behind this parameter is to only refer to predictions for longer words, to save keystrokes.

Type-then-Look for Word Predictions (w_look) - The prediction success rate increases with each new letter typed in a word. This parameter defines the number of letters that need to be typed before looking at the word prediction list. Holding off the use of the word prediction function during the initial letters' entry increases the reliability of predictions and reduces the time spent checking the prediction list.

Perseverance for Word Predictions (w_pers) - The system is unlikely to produce accurate predictions for every word, so the user is unlikely to pursue word predictions indefinitely. This strategy parameter assumes the user checks the predictions every time a new letter is typed.
If the correct prediction is not obtained by the nth letter, the prediction is abandoned. This parameter defines the number of letters that are typed before stopping to check the word prediction list.

Similarly, for the sentence prediction function, we define three new corresponding parameters:

Minimum Sentence Length (s_min) - A correct prediction for a long sentence can produce large keystroke savings. This parameter limits the use of predictions to sentences above a certain length, s_min, in words.

Type-then-Look for Sentence Predictions (s_look) - Consistently typing words increases the reliability of sentence prediction. This parameter defines the number of words that need to be typed before looking at sentence predictions.

Perseverance for Sentence Predictions (s_pers) - Checking long sentence predictions often takes longer than checking short word predictions. Hence, users may discard sentence predictions once a certain number of words has been typed. This parameter defines this cut-off strategy.

Figure 2 illustrates the overall sentence entry strategy on the NLG text entry system equipped with word and sentence prediction functions. Repeated steps are summarized and modularized for simplicity and clarity, and are presented as yellow rounded squares in the graph. The details of these modules are shown in the Entering the Current Word model (see Figure 3) and the Entering the Next Word model (see Figure 4), respectively.
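The word-prediction strategy parameters above can be sketched as a simple decision rule. The parameter names (w_min, w_look, w_pers) are our shorthand for Minimum Word Length, Type-then-Look, and Perseverance, and the thresholds are interpreted one plausible way:

```python
# Sketch of the word-prediction strategy decision described above.
# Threshold semantics are one plausible interpretation, not a spec.

def should_check_word_predictions(word_len, typed, w_min, w_look, w_pers):
    """Decide whether to scan the word-prediction list now.

    word_len  intended word's length in letters
    typed     letters of the word typed so far
    """
    if word_len < w_min:   # word too short to bother with predictions
        return False
    if typed < w_look:     # keep typing before the first look
        return False
    if typed > w_pers:     # give up on predictions for this word
        return False
    return True

# With w_min=3, w_look=2, w_pers=5: check only after letters 2..5 of a long word.
assert not should_check_word_predictions(8, 1, 3, 2, 5)
assert should_check_word_predictions(8, 3, 3, 2, 5)
assert not should_check_word_predictions(2, 2, 3, 2, 5)
```

The sentence-prediction parameters (s_min, s_look, s_pers) admit the same rule, with words typed in place of letters typed.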

Parameter Allocation for Surrogate Users and Predictive AAC Systems
Envelope analysis essentially simulates the user and system interaction and calculates the performance within an envelope of parameterized conditions. Therefore, the choice of these parameters can affect the results of the estimations. In this study, the diversity of system prediction accuracy based on various language models is out of the scope of this paper. Instead, we aim to investigate the impact of text entry strategy on system efficiency by using fixed parameters for both surrogate users and conceptual AAC systems. This approach strikes a balance between over-parameterization and simplicity, and is based on an "uninformative prior" to avoid the need for elaborate distributional assumptions that may be challenging to justify.
To regulate the prediction accuracy of the system (i.e., a_cw, a_nw, a_s), we utilize an existing AAC system that integrates word and sentence prediction functions [56], along with a publicly available fictional AAC-like communications dataset [49]. From this corpus, we randomly select 100 sentences, which we use for the following envelope analyses and real-user text entry analyses. These sampled sentences vary in length from one to ten words, with an average length of 5.13 words. To calculate the prediction accuracy using Equation 3, we type each sample sentence on the AAC system and log the predicted words and sentences, resulting in a_cw = 0.71, a_nw = 0.58, and a_s = 0.44. We illustrate the impact of each model on net entry rates and changes in keystrokes by envelope analyses for each condition. The text entry rate is measured in words per minute (wpm) and uses the convention that one word is five characters long, including spaces.
We create an example AAC surrogate user (T_key = 0.60 s, T_rw = 1.20 s, T_rs = 1.20 × 5.13 = 6.16 s, where 5.13 is the average sentence length in the dataset) using available AAC user parameter values from the literature, listed in Table 1.
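The wpm convention above is easy to misread, so a minimal sketch may help. Using the example surrogate user's keystroke time and the five-characters-per-word convention:

```python
# Sketch: entry rate in wpm under the five-characters-per-word convention,
# using the example AAC surrogate user's keystroke time from above.
T_KEY = 0.60   # seconds per keystroke

def entry_rate_wpm(chars_entered, seconds):
    """Entry rate in words per minute (1 word = 5 characters, incl. spaces)."""
    return (chars_entered / 5) / (seconds / 60)

# Letter-by-letter baseline: every character costs one keystroke.
text = "hello world"   # 11 characters, i.e. 2.2 conventional words
baseline = entry_rate_wpm(len(text), len(text) * T_KEY)
print(round(baseline, 1))  # → 20.0
```

This reproduces the expected ceiling for pure letter-by-letter typing at 0.60 s per keystroke: 60 / (5 × 0.60) = 20 wpm, before any reaction-time or prediction effects.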

Envelope Analyses and Surrogate Users via KLM
We explore the viable efficacy, in terms of communication rate, of a wide range of text entry strategies on this NLG system through quantitative envelope analyses. The fundamental mechanism of this analysis is simulating the user's performance on the computational system model and calculating the time cost and keystrokes of a given task via the keystroke-level model (KLM) [7], which includes three operators: physical, cognitive, and system. We ignore the system operator in this analysis, since modern predictive text entry systems respond nearly instantaneously when producing predictions. Then, according to the text entry strategy flowcharts (see Figures 2, 3, and 4), we estimate the time cost of a task, T_task, as the sum of the keystroke and reaction time costs:

T_task = n_key · T_key + n_rw · T_rw + n_rs · T_rs,

where n_key is the number of keystrokes and n_rw and n_rs are the numbers of times the user checks the word and sentence predictions, respectively. T_task is thus influenced by the entry strategy parameters w_min, w_look, w_pers, s_min, s_look, and s_pers. Kristensson and Müllners [23] reveal that text entry strategies on a word predictive system significantly impact the entry rate. They also indicate that keystroke savings do not guarantee savings in time. By carefully selecting the range of strategy parameters, a positive net entry rate can be achieved, where the net entry rate is the entry rate difference between assisted text entry and straightforward letter-by-letter typing. A positive net entry rate indicates that the predictive system improves typing performance. Accordingly, building on the prior study [23], the motivation behind this analysis is to identify which strategies for an AAC text entry system equipped with word and sentence prediction functions can possibly improve the text entry rates and keystroke savings of letter-by-letter typing, and to understand whether the sentence prediction function has a positive impact on the entry rate. We examine three conditions: (i) use word predictions only; (ii) use sentence predictions only; and (iii) use mixed predictions.

Fig. 5. System evaluation with different prediction approaches. In Figures 5a-e, we use an AAC user's parameters, while in Figure 5f, we use an able-bodied user's parameters for comparison. In Figures 5c-f (word and sentence predictions as the prediction method), the values of the fixed text entry strategy parameters are selected by envelope analyses comparing each pair of parameters against the net entry rate; w_look = 0, w_pers = 2, s_look = 0, and s_pers = 2 yield the maximum net entry rate. We use the same approach for selecting the fixed values for Figures 5a and 5b.

Figure 5a shows that, when using only the word prediction method, checking predictions for words with two to six letters within the first two letters' input leads to a higher net entry rate. Figure 5b suggests that, when using only the sentence prediction method, checking sentence predictions within the first two words' input for sentences with four to seven words is optimal. Figure 5c shows that for AAC users, the optimal strategy is to check word predictions for words with lengths of three to six and sentences with lengths of four to seven, yielding net entry rates in the range of 8 to 9 wpm. Figure 5d shows that frequently checking the sentence predictions reduces the stability of the results observed in Figure 5c. Figure 5e reveals that increasing reliance on word and sentence predictions leads to higher keystroke savings. Finally, Figure 5f shows that for able-bodied users, the optimal strategy is to check predictions for words with lengths between three and five and sentences with lengths between four and six, yielding net entry rates in the range of 25 to 28 wpm.
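A minimal reading of this KLM-style accounting can be sketched in code, using the example surrogate user's parameters; the keystroke and prediction-check counts below are hypothetical, chosen only to illustrate why keystroke savings need not translate into time savings:

```python
# Sketch of KLM-style time-cost accounting: physical keystrokes plus
# mental word/sentence prediction checks. Parameter defaults are the
# example surrogate user's values (T_key = 0.60 s, T_rw = 1.20 s,
# T_rs = 6.16 s); the counts in the example are hypothetical.
def task_time(n_key, n_rw, n_rs, t_key=0.60, t_rw=1.20, t_rs=6.16):
    """Total task time in seconds."""
    return n_key * t_key + n_rw * t_rw + n_rs * t_rs

# A 19-character sentence typed letter-by-letter, with no checks...
baseline = task_time(19, 0, 0)   # 11.4 s
# ...versus hypothetically saving 11 keystrokes at the cost of three word
# checks and one sentence check: fewer keystrokes, yet more time.
assisted = task_time(8, 3, 1)    # 14.56 s
assert assisted > baseline
```

Here the assisted entry uses fewer than half the keystrokes but still takes longer, which is the keystroke-savings-versus-time distinction the analysis turns on.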
Simulations of the full parameter set are conducted for different combinations of the text entry strategy parameters: w_min ranging from 2 to 10, w_look from 0 to 5, w_pers from 2 to 10, s_min from 2 to 10, s_look from 0 to 10, and s_pers from 2 to 10. This combination covers a wide range of possible text entry strategies. The envelope analyses reveal the following discoveries, which are new findings in relation to prior work [23]:

Extensively using word predictions after typing a few letters increases the text entry rate. In condition (i), when only word prediction is involved, the system is equivalent to a word predictive system. It reproduces a result similar to a previous study using the able-bodied surrogate user [23], in that word perseverance w_pers has a limited influence on the entry rate when type-then-look w_look < 2. This is because, with 71% current word prediction accuracy and 58% next word prediction accuracy (i.e., a_cw = 0.71 and a_nw = 0.58, as listed in Section 3.2), statistically 92% of words can be accurately predicted within the first two letters if the word is at the start of a sentence (i.e., 0.71 + (1 − 0.71) × 0.71 = 0.92) and 96% of words can be accurately predicted within the first two letters if the word is not at the beginning (i.e., 0.58 + (1 − 0.58) × 0.71 + (1 − 0.58) × (1 − 0.71) × 0.71 = 0.96). However, when w_look > 3, w_pers < 4 yields a higher net entry rate, which is not observed when using the able-bodied surrogate user. This is because the marginal benefit from selecting expected predictions decreases as the word approaches completion (after four letters are typed, in this case). Regular checking of predictions consumes more time for AAC users (i.e., a larger T_rw), resulting in a faster decline compared to able-bodied users. Hence, we fix w_pers = 2 and alter the minimum word length w_min and type-then-look for word predictions w_look to observe their effects on the net entry rate. As shown in Figure 5a, the entry rate strongly depends on the choice of these two parameters. The red hot colors with net entry rates above average indicate that when w_look < 2 and w_min is 3-5, the net entry rate reaches its maximum.

Sentence prediction strategy greatly impacts the text entry rate. In condition (ii), the analysis of sentence prediction shows that sentence perseverance s_pers has little influence on the entry rate when type-then-look s_look < 3, as 82% of sentences are correctly predicted within the first three words (i.e., 0.44 + (1 − 0.44) × 0.44 + (1 − 0.44)² × 0.44 = 0.82). However, when type-then-look s_look > 3, sentence perseverance s_pers < 4 yields a higher net entry rate. This is not observed with the able-bodied surrogate user either. We conjecture this is for the same reason as for word prediction. Hence, we set s_pers = 2 to investigate the impact of s_min and s_look on the net entry rate. Figure 5b shows that, when only the sentence prediction function is available, s_min of 4-6 and s_look < 1 yield a high positive net entry rate.

Word prediction and sentence prediction together improve the text entry rate. The AAC system allows users to adopt both word predictions and sentence predictions in tandem. To understand whether the combined use of both prediction functions can still positively impact the entry rate, in condition (iii), we alter the usage frequency of both prediction functions by changing w_look and s_look. Figure 5c shows that the combination of the two functions yields a higher range of net entry rates (from −0.01 to 9.6 wpm) than either function individually, as shown in Figure 5a and Figure 5b (word prediction from 0.02 to 4.5 wpm and sentence prediction from −1.4 to 8.6 wpm). In addition, frequently using the word and sentence prediction functions after typing the initial few letters/words produces high net entry rates.

Extensively using predictions decreases performance consistency. Figure 5d shows the standard deviation of the net entry rate. The red hot color indicates a high standard deviation, and the cool blue color represents a low standard deviation. A low value is preferred, as it indicates a more consistent level of performance. Extensive use of sentence prediction (i.e., a small s_look) yields a higher standard deviation.

Keystroke savings do not necessarily translate into positive net entry rates. Figure 5e shows that text entry strategies that make extensive use of predictions (i.e., small w_look and s_look) maximize the keystroke savings. However, this strategy yields a low net entry rate, as shown in Figure 5c. In contrast, the strategies that yield medium keystroke savings have a higher net entry rate (compare Figures 5c and 5e), which indicates that the keystroke savings metric is not linearly correlated with the net entry rate.

It is meaningful to design for individual users. By comparing Figures 5c and 5f, we observe that, when adopting the same text entry strategy, the AAC surrogate user yields a much lower net text entry rate than the able-bodied surrogate user. This emphasizes that individual differences can lead to very different performance outcomes. We also find that the optimal strategies that yield the best net entry rate differ, due to the difference between the AAC surrogate user and the able-bodied surrogate user in typing speed and reaction time.
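The cumulative prediction probabilities quoted above can be recomputed with a few lines, under the same independence assumption the in-text arithmetic makes:

```python
# Recomputing the cumulative prediction probabilities quoted above: the
# probability of a correct prediction within the first k queries, assuming
# independent queries (the same assumption as the in-text arithmetic).
def hit_within(accuracies):
    """P(at least one hit) over a sequence of per-query accuracies."""
    miss = 1.0
    for a in accuracies:
        miss *= 1.0 - a
    return 1.0 - miss

a_cw, a_nw, a_s = 0.71, 0.58, 0.44
p_start = hit_within([a_cw, a_cw])        # sentence-initial word, first 2 letters
p_mid = hit_within([a_nw, a_cw, a_cw])    # mid-sentence word, first 2 letters
p_sent = hit_within([a_s, a_s, a_s])      # sentence within first 3 words
print(round(p_start, 2), round(p_mid, 2), round(p_sent, 2))  # 0.92 0.96 0.82
```

The mid-sentence word gets three queries within two letters because the next-word query fires before any letter is typed, followed by two current-word queries.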

IMPERFECT SURROGATE USER MODEL KLM-BEI
This section introduces the imperfect surrogate user model KLM-BEI: a keystroke-level model augmented with models of bounded rationality, human error, and interruptions. It is an extension of the conventional task analysis model KLM. While conventional KLM is useful as a "cheap and cheerful" initial estimation model, it is nonetheless a very simple model that can only approximate the time cost of a task decomposed into unit tasks in a perfect context: no interruptions, a single method, error-free expert performance, and so on [41]. Actual AAC users, however, are not always capable of achieving a goal in an optimal way, as errors and interruptions inevitably occur and affect performance. To account for this, we introduce the imperfect surrogate user model KLM-BEI to address these inherent uncertainties in the task analysis. The model regards the user, the interactive system, and the environment as a joint system that involves decision-making in the presence of action execution failures and environmental interruptions.
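As a baseline for the extension, recall that a conventional KLM estimate is simply the sum of fixed operator times over a sequence of unit tasks, assuming error-free expert performance. The sketch below illustrates this; the operator set and times are illustrative placeholders rather than the calibrated values used in our simulations.

```python
# Minimal KLM-style estimate: a task is a sequence of unit operators, each
# with a fixed time cost. The times below are placeholders, not calibrated values.
KLM_OPERATORS = {
    "K": 0.28,  # keystroke / tap
    "P": 1.10,  # point to a target on the screen
    "M": 1.35,  # mental preparation
}

def klm_time(operators):
    """Predicted time (seconds) for an error-free expert performing the sequence."""
    return sum(KLM_OPERATORS[op] for op in operators)

# E.g., mentally preparing and then tapping two keys: klm_time(["M", "K", "K"])
```

KLM-BEI keeps this additive time accounting but interposes the bounded rationality, human error, and interruption checks described below before each action is charged.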
An overview of the model is illustrated in Figure 6. The system interaction simulator includes three user action stages in turn: the decision-making action stage, which corresponds to the text entry strategy (i.e., selecting a word prediction, selecting a sentence prediction, or typing letter-by-letter) and in which the system checks bounded rationality (i.e., whether the user selects the correct prediction or ignores it); the keystroke action stage, in which the system checks human error (i.e., whether the key is typed correctly); and the actual keystroke action execution, in which the system calculates the time cost of the action using a model reminiscent of KLM. Further, interruptions are monitored throughout the whole interaction process, and their time cost is also calculated by the model. It is worth mentioning that, in Figures 2, 3, and 4, the teal rectangles with dashed lines are user actions that require human performance factor checks.
We now describe the rest of the key components of this model:

Fig. 6. An overview of the imperfect surrogate user model KLM-BEI. This model incorporates uncertainties into the KLM model, including bounded rationality, human error, and interruption. These specific models are introduced in Figures 7, 9, and 10, respectively. The three modules checking human performance are highlighted by dashed-line boxes, which are aligned with the user actions represented as teal rectangles with dashed lines in Figures 2, 3, and 4. The correlated flowchart indicating the simulation steps is shown in Figure 13.
We investigate the impact of each model on net entry rates and changes in keystrokes through envelope analyses for each condition, adopting the same selected sentences for simulation and the same system settings introduced in Section 3.2.

Bounded Rationality Model
Models of goal-directed planning that take the cost of computation into account are described as models incorporating bounded rationality, a term coined by Herbert Simon [47]: decisions are made to be satisfactory, rather than optimal, within some constraints. We introduce a bounded rationality model for envelope analysis to illustrate human decisions in a simple interaction task, such as entering text in a predictive system.
As shown in Figure 7, the simple goal-oriented task starts with an upcoming action. The cognitive processor then decides whether the surrogate user executes an optimal (i.e., rational) or an irrational action. This process is determined by the parameter rational rate r = 1 − n_irr / n_all, where n_irr is the number of irrational actions and n_all is the number of all actions under an error-free condition. A higher r represents a higher rationality level of actions. An optimal action brings the user to the goal directly, while an irrational action requires extra actions. For example, assume the goal is to type the word 'beautiful' in the predictive text entry system; the user has typed the letter 'b', and accordingly the system presents the word prediction suggestion 'beautiful'. A rational action is to choose the word prediction suggestion to complete the entry, whereas an irrational action is to type the next letter 'e', which still pursues the goal but demands several extra keystrokes to complete the word. The extra actions are all actions that follow an irrational action until the goal is achieved, regardless of whether the user subsequently engages with a word prediction suggestion or types letter-by-letter.
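The cognitive-processor check can be read as a Bernoulli draw per action. The sketch below replays the 'beautiful' example under the simplifying assumption that a correct suggestion is available after every typed letter; the function and its names are our illustration, not part of the actual simulator.

```python
import random

def keystrokes_for_word(word, rational_rate, rng):
    """Count keystrokes to enter `word`, assuming the correct word prediction
    is offered after each typed letter. A rational action selects the
    suggestion (one more tap); an irrational action types the next letter
    and re-checks, accumulating the extra keystrokes discussed above."""
    keystrokes = 1  # the first letter is always typed
    for _ in word[1:]:
        if rng.random() < rational_rate:
            return keystrokes + 1  # rational: select the suggestion
        keystrokes += 1  # irrational: type the next letter instead
    return keystrokes  # fell through to full letter-by-letter entry
```

Under this sketch, a fully rational surrogate (rational rate 100%) enters 'beautiful' in two keystrokes ('b' plus the selection), whereas a fully irrational one needs all nine.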
The comparison among Figures 8a-d shows that different bounded rationality levels can impact the text entry rate due to extra keystrokes. A higher rationality level (i.e., a higher rational rate) leads to more efficient use of the system utility and fewer extra keystrokes. Specifically, Figure 5c (rational rate = 100%), Figure 8a (90%), and Figure 8b (50%) show three different distributions of net entry rate with respect to word and sentence entry strategies. In general, surrogate users with lower rationality levels tend to produce lower text entry efficiency. In addition, the strategy parameter configurations that produce the fastest text entry rate can change under different rationality conditions. In other words, a user's optimal strategy may change depending on the user's level of rationality.

Fig. 8. The impact of bounded rationality and text entry strategies on typing efficiency in terms of net text entry rate and extra keystrokes. Figure 8a shows that when 90% of the actions are rational, the optimal strategy is to check predictions for words with lengths of three to five and sentences with lengths of four to six, resulting in a net entry rate range between 7 and 9 wpm. Figure 8b shows that when 50% of the actions are rational, the optimal strategy is to check predictions for words with lengths of three to four and sentences with word lengths of four to seven, yielding a net entry rate range between 4 and 7 wpm. Figure 8c shows that when 90% of the actions are rational, increasing the frequency of word and sentence prediction usage leads to more extra keystrokes, with a maximum of two extra keystrokes. Finally, Figure 8d shows a similar trend but with a higher maximum of eight extra keystrokes.

Human Error Model
Erroneous behavior is an inherent part of human performance [41]. Although there are many ways to categorize error types, such as slips, knowledge-based mistakes, and rule-based mistakes [16,32,39], these errors share the same attribute: task execution deviates from the goal. In this study, we assume errors can be spotted during the sentence entry task and that a set of correcting actions is executed once the error is observed. As illustrated in Figure 9, the task starts with an upcoming action. The motor processor decides whether the surrogate user executes an expected or an unexpected action. An expected action contributes to accomplishing the goal, while an unexpected action requires a set of corrective steps, i.e., additional actions. In a predictive text entry system, unexpected actions include typing unexpected letters or selecting an unwanted word or sentence suggestion.
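The motor-processor check admits an equally small sketch. The correction policy below (notice the error, delete, retype) is an assumption for illustration; the model itself only requires that an unexpected action triggers a set of corrective steps.

```python
import random

def keystroke_cost(error_rate, rng):
    """Keystroke cost of one intended action. An unexpected action
    (probability `error_rate`) is assumed to be noticed immediately and
    corrected by one deletion plus one retype, i.e., two corrective
    keystrokes on top of the erroneous one."""
    if rng.random() < error_rate:
        return 3  # erroneous tap + delete + retype
    return 1  # expected action
```

Richer policies (e.g., noticing an error only several keystrokes later) would charge more corrective keystrokes per error, which is what drives the larger corrective-keystroke counts reported below.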
Fig. 11. The impact of human error and text entry strategies on typing efficiency (the net text entry rate and the corrective keystrokes). Figure 11a shows that when 10% of the actions are erroneous, the optimal strategy is to check predictions for words with lengths of three to four and sentences with lengths of four to five, resulting in a net entry rate range between 5 and 7 wpm. Figure 11b shows that when 50% of the actions are erroneous, the optimal strategy is to check predictions for sentences with word lengths less than five, yielding a net entry rate between −1 and 1 wpm, while word prediction has limited impact in this case. Figure 11c shows that when 10% of the actions are erroneous, increasing the frequency of sentence prediction usage leads to more corrective keystrokes, with a maximum of five corrective keystrokes. Finally, Figure 11d shows a similar trend but with a higher maximum of 19 corrective keystrokes.
We define the human error rate e to express the probability of an action being erroneous: e = k_unexpected / k_all, where k_unexpected is the number of keystrokes belonging to unexpected actions and k_all is the overall number of keystrokes. A lower e represents higher user expertise in using the system. We investigate the impact of human error on typing efficiency via two sets of parameters with different error rates. Comparing Figure 5c (error rate 0) and Figure 11a (error rate 10%), it is evident that human error greatly constrains the net entry rate, even with only 10% erroneous operations. The comparison between Figure 11a (10%) and Figure 11b (50%) shows that a high error rate can not only lead to a very different optimal typing strategy (i.e., the hot red area indicating a relatively high entry rate changes its distribution) but also dramatically decrease the text entry rate: the maximum net entry rate in Figure 11b is 1.7 wpm, compared with 7.7 wpm in Figure 11a. Figures 11c-d reveal that the reason behind this phenomenon is that a high human error rate leads to more corrective keystrokes in general. In addition, highly frequent use of the sentence prediction function (i.e., a smaller type-then-look value) increases the corrective keystrokes, while the use of the word prediction function has a limited impact on corrective keystrokes.

Fig. 12. The impact of interruption and text entry strategies on the net text entry rate. Figure 12a shows that with a 10% interruption rate and a five-second interruption event, the optimal strategy is to check predictions for words with lengths of three to seven and sentences with lengths of four to six, resulting in a net entry rate range between 6 and 8 wpm. Figure 12b shows that with a 50% interruption rate and a five-second interruption event, the optimal strategy is to check predictions for words with lengths of three to six and sentences with word lengths of four to six, yielding a net entry rate range between 5 and 8 wpm. Figure 12c shows that with a 10% interruption rate and a 50-second interruption event, the optimal strategy is to check predictions for words with lengths of three to seven and sentences with word lengths of four to six, resulting in a net entry rate range between 5 and 7 wpm. Figure 12d shows that with a 50% interruption rate and a 50-second interruption event, the optimal strategy is to check predictions for words with lengths of three to six and sentences with word lengths of four to six, yielding a net entry rate range between 4 and 6 wpm.
Figure 14a shows that, as expected, the highest net entry rate is achieved with the maximum rational rate (100%) and the minimum human error rate (0). The steep zero-crossing line in this figure indicates that the human error rate exerts more influence on the net entry rate than the rational rate: when the error rate exceeds 60%, the net entry rate cannot be positive. On the one hand, this implies that, regardless of the efficiency of the prediction function, once the human error rate remains high it is very difficult to increase the entry rate through text prediction algorithms; a better design direction would instead be to improve the user experience so as to reduce the human error rate. On the other hand, it also reveals that bounded rationality has a relatively lower impact on the entry rate than human error: within the full range of rational rates, the net entry rate can achieve a positive value if the human error rate is well managed. Figure 14b shows a similar result in that the keystroke savings can only be positive when the human error rate is below 40%. Further, with a higher rational rate, the system has a higher tolerance to human error in terms of saving keystrokes.

Fig. 14. Human performance factors evaluation with the optimal text entry strategy. The black dashed line marks the zero-crossing, above which predictions provide a performance gain.
Having first simulated an imperfect surrogate user by fixing the human performance factors, we now investigate the impact of using different text entry strategies on performance, specifically the text entry rate and keystroke savings. The table in Figure 15 shows the parameter configurations for the envelope analysis.

Fig. 15. Imperfect user simulation with fixed human performance factors. Figure 15a shows that when considering all the impacts of the listed human performance factors, the optimal strategy is to check predictions for words with lengths of three to four and sentences with word lengths of four to five, resulting in a net entry rate range between 3 and 5 wpm. Figure 15b shows that in this setting, relying more on word and sentence predictions leads to higher keystroke savings, with a maximum keystroke saving of 61%.

Check for every sentence regardless of length after typing 2 words, and keep checking until a prediction is selected or no predictions remain.
Check for long words (more than 5 letters) after typing 3 letters, and keep checking until a prediction is selected or none remains; also check for words in phrases after finishing the previous word.
Check for long sentences (more than 5 words) after typing 3-4 words, and keep checking until a prediction is selected or no predictions remain.

Table 2. Descriptions of the eight participants' text entry strategies. These overall strategies are extracted from our interviews with the participants and from inspecting log files. In the text entry tasks the participants tended to exhibit consistent overall performance, but at the sentence level they may have adopted flexible strategies to suit their needs.
The human performance factors have a significant impact on performance. Compared to the perfect (oracle) surrogate user simulation (see Figures 5c and 5e), the simulation of the imperfect surrogate user (see Figures 15a and 15b) demonstrates a notably lower maximum net entry rate and keystroke savings, and a smaller range of text entry strategy parameters that produce positive net entry rates.

RUNTIME ESTIMATION OF PARAMETERS AND VALIDITY
A natural question is how to validate the overall model. However, since the model is generative by definition, it creates all conceivable operating points given a particular parameter configuration [21,23]. Thus, assuming the parameter interactions are valid, if the parameters are accurate, then the correct corresponding possible operating points will be generated. Therefore, a more meaningful and practical question is how to estimate such parameter values by observing the runtime behavior of the joint human-computer system during use. Providing such a function allows designers to rapidly estimate appropriate parameter values, which can subsequently be used in envelope analysis.

Table 3. Real user parameters extracted from the eight participants using KLM-BEI.
In contrast to prior work [21,23], we have developed a runtime parameter estimation function in our system. The user types sentences freely, and by observing the behavior of the joint system, our system automatically estimates the parameter values. When the user has typed a complete sentence, the user's goal is assumed to have been to arrive at that sentence. Thus, the system can estimate the parameters on a per-sentence basis.
Erroneous actions can be estimated by examining deletion actions. Since deletions relate to text correction, erroneous actions can be estimated as the actions the user took to input text that was subsequently deleted.
Irrational actions can be estimated by examining any additional actions taken by the user to type the text that could have been avoided had the user noticed and used any suitable text prediction suggestions that were provided by the system.
We estimate interruption time by assessing whether the user's reaction time is longer than the time we expect the user to need to type the next key.This is achieved by examining the time interval between two subsequent actions.If the interval is longer than the keystroke typing time and the time required to assess text prediction suggestions, then we consider an interruption event to have occurred.We then calculate the interruption time as the time between the start of the user's interruption and the time when they resume their typing task.In envelope analyses, we can estimate the resumption time (i.e., the interruption cost) by using Equation 7, as it, along with the interruption event time, adds up to the total interruption time.
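The estimation steps described above can be sketched as a single pass over a timestamped action log. The log format, field names, and fixed thresholds below are illustrative simplifications of our logging; in particular, the constant per-key and assessment times stand in for the fuller timing model, and the excess gap stands in for the interruption time.

```python
def estimate_runtime_parameters(events, expected_key_time, assess_time):
    """Estimate an error rate and interruption statistics from a log of
    (timestamp_seconds, action) pairs, where action is 'key' or 'delete'.
    Deletions approximate erroneous actions; an inter-action gap longer
    than typing-plus-assessment time is counted as an interruption."""
    deletions = sum(1 for _, action in events if action == "delete")
    error_rate = deletions / len(events) if events else 0.0

    threshold = expected_key_time + assess_time
    interruptions, interrupted_time = 0, 0.0
    for (t0, _), (t1, _) in zip(events, events[1:]):
        gap = t1 - t0
        if gap > threshold:
            interruptions += 1
            interrupted_time += gap - threshold  # time beyond normal typing
    return error_rate, interruptions, interrupted_time
```

In the real system, richer per-action records (pressed key, displayed predictions) allow the irrational-action count to be estimated as well, by checking whether a suitable suggestion was on screen when the user typed a letter instead.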

Validity
We have two goals for validating the KLM-BEI model. The first is to validate that the KLM-BEI model can be used to extract, at runtime, parameters that are affected by human performance factors, such as the rational rate, human error rate, interruption rate, and interruption time. The second is to validate that, by adopting the proposed KLM-BEI model, envelope analyses can produce more accurate estimations than a previous approach [23] for individual users with the aid of light-touch data collection. To achieve these goals, we recruited eight participants by convenience sampling. The participants were literate, able-bodied users aged 20-35 with at least three years of experience using text entry systems on touchscreen devices. The reason for not recruiting AAC users is that the experiment aimed to validate that the KLM-BEI model can be used to understand user performance in support of system design, rather than to directly analyze user performance for a specific system design. Accordingly, participants were asked to type 100 sentences on an AAC text entry system with word and sentence prediction functions [56], installed on a touchscreen tablet PC (Dell XPS 13 2-in-1 with a 13" 3:2 3K (2880x1920) touchscreen).
The text entry performance was logged and analyzed via the KLM-BEI model built into the AAC system. The recording is a text file in which the time of each user action, the pressed key, the predicted words and sentences, and the displayed text were logged for calculating the text entry rate and keystroke savings. In addition, the rational rate, human error rate, interruption rate, and interruption time were extracted from this log.
The same sentences described in Section 3.2 were reused, which were randomly selected from a crowdsourced AAC-like communications dataset [49].Ten additional sentences were randomly selected from the dataset and provided to the participants to allow them to familiarize themselves with the word prediction and the sentence prediction functions of the system.
Participants were told to type freely during the text entry task (that is, they were given no instructions about text entry strategies for using predictions) and were invited to a short interview after the task about the text entry strategy they had adopted. Table 2 shows descriptions of the text entry strategies the eight participants adopted.
We observed three main types of text entry strategies. The first type used a mix of word and sentence predictions. For example, Participants 2, 3, and 8 relied extensively on predictions for long words and/or words in phrases, and for medium and long sentences; further, Participants 5 and 7 actively used word and sentence predictions for almost every word and sentence. The second type mainly used sentence predictions. For example, Participant 1 used predictions for long sentences and only checked word predictions when they realized they were typing a long word, and Participant 6 depended strongly on sentence prediction but only checked word predictions when they felt the system could predict them accurately. The third type did not use any prediction functions: Participant 4 was almost completely reliant on letter-by-letter entry.
Table 3 shows that the KLM-BEI model can identify parameters that are affected by the three types of human performance factors and can extract user parameters from the interaction log. These parameters are then applied back to the proposed KLM-BEI model and the conventional KLM model for envelope analyses. The results reveal a substantial improvement: the new model's envelope analyses yield more accurate estimations of text entry rate and keystroke savings than the conventional model. Figure 16 is an example of using the parameters extracted from Participant 1 for envelope analyses. The corresponding actual text entry strategy of Participant 1 is highlighted by purple circles in Figures 16a-d. Comparing Figures 16a and c, and Figures 16b and d, with the real user's text entry rate and keystroke savings, the envelope analyses using KLM-BEI produce substantially improved estimations (for example, an entry rate of 25 wpm and keystroke savings of 22%) that are closer to the actual measurements (for example, an entry rate of 27 wpm and keystroke savings of 14%). The prior analysis overestimated both (for example, 48 wpm for entry rate and 42% for keystroke savings). The other participants' simulations using their real user parameters show similar improvements. That is, the KLM-BEI-assisted envelope analysis yielded more accurate estimations.

DISCUSSION AND CONCLUSIONS
The rapid development of large language models (LLMs), such as ChatGPT, brings great opportunities to predictive AAC text entry system design for users with motor disabilities. However, such systems also introduce many complexities that make it difficult for designers to know a priori how to set parameters to appropriate values, such as the number of word and sentence suggestions, and to understand the requirements on various subsystems, such as the accuracy required for word auto-complete. As system complexity increases, it is not viable to rely solely on traditional text entry experiments, as such experiments can only test a few operating points. Further, some parameters that govern the joint human-system outcomes (entry rates, error rates, keystroke savings, and so on) are latent in the sense that they are directly connected to user strategies in, for example, leveraging word and sentence suggestions. Since we cannot directly control user strategies in experiments, we need to simulate various outcomes to assess which operating points our NLG-based AAC text entry systems may realize.

Fig. 1. Function structure model for NLG-based AAC text entry systems with word and sentence prediction functions. The fonts indicate different element types: sans serif text in rounded rectangles indicates the main functions; bold text aligned with solid lines indicates users' physical actions; normal text aligned with dashed lines represents users' mental actions; italic text aligned with dotted lines represents the system's internal information; and italic bold text aligned with dash-dot lines represents the system outputs.

Fig. 3. The flowchart of the Entering the Current Word Strategy Model, a module in the overall text entry strategy model. The big yellow rounded square corresponds to the yellow rounded square with correlated text in Figure 2. The white parallelogram is the targeted text. The teal rectangles with solid lines show specific statuses, and those with dashed lines are specific user actions. The pink rhombuses denote strategy decisions.

Fig. 4. The flowchart of the Entering the Next Word Strategy Model, a module in the overall text entry strategy model. The big yellow rounded square corresponds to the yellow rounded square with correlated text in Figure 2. The teal rectangles with dashed lines are specific user actions. The pink rhombuses denote strategy decisions.

Fig. 13. The flowchart of the Check Human Performance Factors Model. (*): This could be a corrective action, meaning the actual executed action depends on the type of correction the user is carrying out. (**): The goal refers to the goal in the bounded rationality model (Figure 7) and the human error model (Figure 9).

Fig. 16. Entry rate (ER) and keystroke savings (KS) estimations via the imperfect surrogate user model KLM-BEI and the conventional model KLM, using real human performance factors extracted from Participant 1. Figure 16a shows that when using Participant 1's text entry parameters and human performance factors (i.e., using the KLM-BEI model), the optimal strategy is to check predictions for words with lengths larger than six and sentences with word lengths of five to eight, resulting in a net entry rate range between 25 and 27 wpm. Figure 16b shows that in this setting, relying more on word and sentence predictions leads to higher keystroke savings, with a maximum keystroke saving of 35%. Further, Figure 16c shows that when only considering Participant 1's text entry parameters but ignoring human performance factors (i.e., using the KLM model), the optimal strategy is to check predictions for words with lengths larger than six and sentences with word lengths of five to eight, resulting in a net entry rate range between 50 and 55 wpm. Figure 16d shows that in this setting, relying more on word and sentence predictions leads to higher keystroke savings, with a maximum keystroke saving of 51%. Purple circles mark the estimated entry rate and keystroke savings under each model with respect to the overall text entry strategy adopted by Participant 1, showing that KLM-BEI reflects reality better than KLM.
Reaction Time for Sentence Predictions. Similarly, this parameter is a representation of Select Sentence Prediction, estimating the mental action time cost for processing