A Meta-Bayesian Approach for Rapid Online Parametric Optimization for Wrist-based Interactions

Wrist-based input often requires tuning parameter settings to account for between-user and between-session differences, such as variations in hand anatomy, wearing position, and posture. Traditionally, users either work with predefined parameter values that are not optimized for individuals or undergo time-consuming calibration processes. We propose an online Bayesian Optimization (BO)-based method for rapidly determining the user-specific optimal settings of wrist-based pointing. Specifically, we develop a meta-Bayesian optimization (meta-BO) method that differs from traditional human-in-the-loop BO: by incorporating meta-learning of prior optimization data from a user population with BO, meta-BO enables rapid calibration of parameters for new users with a handful of trials. We evaluate our method with two representative and distinct wrist-based interactions: absolute and relative pointing. On a weighted-sum metric that consists of completion time, aiming error, and trajectory quality, meta-BO improves absolute pointing performance by 22.92% and 21.35% compared to BO and manual calibration, respectively, and improves relative pointing performance by 25.43% and 13.60%.


INTRODUCTION
In addition to its ubiquity in the HCI literature [2,10,29,43,63,64], wrist-based input has been posited in the industry as one of the key candidates for driving future freehand interactions in AR¹. Akin to conventional interaction devices such as a mouse, which are characterized by parameters like the transfer function [52] and input filter variables [12], wrist-based devices also possess various parameters for setting transfer functions, device calibration, or other input functionalities [11]. The parameter settings for the interactions with these input devices significantly influence the user experience and performance [13,101]. However, the one-design-fits-all parameter setting strategy of traditional input devices (e.g., mouse, keyboard, touch-button, etc.) does not suffice for wrist-based devices for several reasons. Unlike other interaction techniques, a good parameter setting for wrist-based interactions varies significantly across different users owing to their physical, behavioral, and preference-based differences [29]. For example, some users prefer to perform wrist motions with a wider range, while others prefer narrower wrist motions. Further complicating matters, the optimal parameter setting for a user may change based on how the device is worn, hand positions, and environmental factors. A user may wear the device in slightly different positions each time, necessitating unique parameter settings for each use. To identify the best user-specific and session-specific parameter values, users typically undergo a manual calibration process to determine a reasonable setting [28,29]. However, calibration requires extra dedicated time from users, delaying the intended interaction [1,29,32,44,96], and is often designed manually by developers, which does not guarantee an optimal outcome.
[Figure caption, partially recovered: An ideal situation where the sensed wrist angle is aligned with the user's intention. (c) A more realistic scenario where the device is worn with a subtle rotational difference, hence the sensed direction is less than the user's intended direction. (d) We use an angular correction parameter to compensate for such rotational difference; it is applied as described in Equation 8.]
Human-in-the-loop (HitL) optimization [31,85] has the potential to automatically identify optimal parametric settings while users engage in the intended interaction [15,45,104]. Among various optimization methods, Bayesian optimization (BO) has gained popularity for HitL applications due to its versatility and effectiveness [8,17,31,49,103]. Although BO has proven to be more sample-efficient than many other algorithms [8], it still requires many iterations and a long duration to converge. For instance, Chan et al. [15] spent an hour optimizing a pointing interaction using BO.
How can we enhance the efficiency of BO for its application in online and rapid human-in-the-loop applications? Our solution augments BO with prior experience, enabling it to proactively explore parameter areas with the most potential for promising user performance. We adopt our solution from meta-Bayesian optimization (meta-BO) [3,23,54,91,93,100], an emerging paradigm of BO-based methods for rapid optimization by utilizing datasets gathered from similar optimization tasks². By leveraging the experience of optimizing the parameter settings for a group of users in advance, meta-BO can efficiently optimize for new users for the same interaction. While meta-BO has gained growing attention in machine learning research, its potential utility in HCI remains largely unexplored. We extend a specific meta-BO method called Transfer Acquisition Function (TAF) [100], which builds independent population models using the past optimization data and then combines these population models with incoming data for rapid optimization of interaction parameters in online settings. Our approach, which we term Transfer Acquisition Function+ (TAF+), extends TAF for parametric optimization in interaction applications by enabling designers to control weighting both in balancing an application's multiple objectives (e.g., speed and accuracy in pointing) and between prior population data and new incoming data in real-time deployments. TAF+ thus enables design flexibility for achieving diverse objectives and efficient adaptation.
We demonstrate the efficacy of our TAF+-powered workflow for rapid parametric optimization of two common and distinct wrist-based pointing interactions: absolute pointing (Figure 1) and relative pointing (Figure 2). Absolute pointing involves moving the cursor using forearm motion sensed by a wrist inertial measurement unit (IMU), similar to prior work [34,65]. Relative pointing involves moving the cursor using relative wrist rotation [28,43,77]. Pointing is a particularly relevant challenge for AR interfaces, and absolute and relative pointing represent the two fundamental types of pointing. Combined with the significance of wrist-based input for future AR interfaces, wrist-based absolute and relative pointing interactions are a timely and relevant problem to solve.
We conduct a study that evaluates our meta-BO method TAF+ along with the two baseline procedures mentioned above: manual calibration and standard Bayesian optimization. The results show that meta-BO led to significantly better performance on a weighted-sum metric that consisted of completion time, aiming error, and trajectory quality. Specifically, meta-BO improves absolute pointing performance by 22.92% and 21.35% compared to BO and manual calibration, respectively, and improves relative pointing performance by 25.43% and 13.60% compared to BO and manual calibration, respectively. In summary, we make the following key contributions:
• Introducing a novel meta-BO method called Transfer Acquisition Function+ (TAF+) as an online, sample-efficient parametric optimization approach in HCI: TAF+ extends TAF by enabling designer control of (a) weighting multiple task objectives, and (b) tuning the importance between prior population data and the current user's real-time data.
• Demonstrating the efficacy of TAF+ to identify user-specific, optimal parametric settings for two distinct forms of wrist-based pointing: TAF+ outperformed two established baselines, manual calibration and standard BO.

RELATED WORK

2.1 Wrist-based interactions
Inertial measurement units (IMUs) are commonly used to detect hand motion. Dipietro et al. [20] and Perng et al. [72] proposed various glove-mounted IMU systems, and later research proposed wrist-worn form factors [2,10,43,63,64]. Previous works have deployed IMUs for detecting human activity [2,86], projecting raycasting with the detected motion [33,34,65,71], and gesture recognition [51,59]. Our absolute pointing approach is similar to Nancel et al. [65], where forearm movements sensed by an IMU control a cursor's position in a 2D interface. Our relative pointing approach uses wrist rotations to control a cursor's relative motion. Wrist rotation tracking via the wristband has typically used outside-in or inside-out sensing. Outside-in uses sensors such as EMG [40,79], electrical impedance tomography [109], and capacitive sensing [75] for inferring wrist angles or gestures. Inside-out sensing uses wrist-worn cameras [29,43,106]. For our relative pointing, we use a similar device and method as RotoWrist [77], which consists of a wrist-worn IR sensor array that tracks the wrist's relative angles.

Calibration
In HCI, calibration is a procedure for setting up system parameters so the device interaction can work properly, such as tuning a sensor's internal values [25,74,105] or setting parameters based on user-dependent features [16,53]. Calibration has been widely applied, such as for touchscreen interactions [53], gaze input [73], and wearable devices [58,102]. However, efficient calibration for the transfer function of input devices remains a significant challenge in HCI due to the vast parameter space [52].
Given our focus, we review procedures pertaining to wristband device calibration and pointing transfer function calibration. Wrist-based pointing often needs to identify a function that maps sensor values to cursor position. WristWhirl [29] requires users to move their wrists to the maximum along two axes, and then defines a mapping accordingly. Similarly, WristText [28] requires several wrist rotations to capture the maximum sensor values. Our manual calibration baseline for absolute pointing uses a similar approach where the users identify their preferred operation ranges via forearm motion. While absolute pointing involves a one-to-one mapping between input and output (cursor) displacement, relative pointing (e.g., mouse pointing or video game sensitivity) involves a CD-gain-based transfer function that varies cursor motion speed based on input motion speed [11,52]. This function is typically pre-determined using trial-and-error [11] or heuristic iteration [52,108] and is uniform across all users, with an option for the user to fine-tune it themselves (as in Windows, Mac, and Linux devices). Our manual calibration baseline for relative pointing uses a similar approach where we instruct the users to fine-tune the velocity transfer function starting with a fixed predetermined setting.

Human-in-the-Loop optimization
Since such calibration procedures do not necessarily result in optimal settings, Human-in-the-Loop (HitL) optimization approaches have been proposed. HitL optimization is a general framework in which users serve as the evaluation function, and a mathematical parametric optimization procedure aims to efficiently identify the optimal parameter setting [17,31,49,103]. Among many optimization methods, Bayesian optimization (BO) is a computational method that has been developed over decades [62,84]. Prior work has used BO as a HitL method [39,47,76] for minimizing temporal error [57], increasing animation realism [9], game engagement [41], and haptic distinguishability [56], tuning hearing aids [68,69], and optimizing pointing transfer functions [15]. However, BO is mainly seen as an offline design tool [48,56,104] since it requires a large number of iterations, resulting in long-duration user sessions. In the 3D target-selection transfer function optimization of Chan et al. [15], BO required participants to spend 60-90 minutes. Consequently, no prior work has employed BO for real-time target selection interactions with end-users. Another recent HitL approach, AutoGain [52], proposed updating the transfer function based on submovement errors. However, AutoGain requires a dedicated session of 30 minutes to converge and is constrained to the single objective of minimizing aiming error. Our meta-BO method is aimed at overcoming these issues and performing rapid, online multi-objective optimization.

Meta-learning for BO
Meta-learning is a general concept that aims to improve the learning speed of a system by leveraging the prior experience of similar tasks [46,82,92,94].Successful meta-learning implementations have been proposed for recognition and reinforcement-learning tasks using deep neural nets [6,21,26,67,78,81].
In the context of BO, meta-Bayesian optimization (meta-BO) is a machine-learning paradigm consisting of different implementations that use prior optimization data to improve the speed of ongoing optimization [3,18,54,95]. Owing to its recency, meta-BO has not yet been employed to solve HCI problems. Given its promise of speeding up standard BO, we employ meta-BO for our problem. There are multiple ways to apply meta-BO, and our goal was to select one that would fit HitL tasks like ours. The first approach is to fit all the prior data into a unified Gaussian Process (GP) model [5,7,91,107]. However, the model-fitting cost grows cubically ($O(n^3)$) with the number of observations $n$, making the computation time for suggesting the next design impractically long. Incorporating a sparse GP potentially allows for better scalability by only retaining a smaller representative dataset [88,97]. However, new challenges and uncertainties arise from constructing such complex techniques; for instance, determining the appropriate number of datapoints for the sparse GP, tuning hyperparameters effectively, managing increased model complexity, and balancing computational efficiency with accurate uncertainty estimates. The second approach is to replace elements in BO with neural networks trained on prior data [36,89,90,93,99]. However, it potentially requires a relatively larger amount of data to pre-train the networks. Moreover, it generally does not offer explainability, which is crucial for HitL applications. The final approach, the one that we adopt, is the weighting-based solution [55,80,100], which stores separate GP models, each model being derived from the data of a previous task (in our case, a participant session). The next design suggestion is decided based on a weighted aggregation of all the previously gathered GPs' information. This approach has low computational complexity, can work with small amounts of prior data, and offers higher explainability to the users with the opportunity of observing the result generated by each model. Among several implementations along this line of research [24,37,55,80], our method extends the TAF approach of Wistuba et al. [100]. They demonstrate TAF with a single objective and a naive test function. Our TAF+ approach extends this to our multi-objective HitL scenario, which presents new challenges.

PRELIMINARIES: EXISTING METHODS
To appropriately explain TAF+, we first introduce the key concepts for BO and meta-BO in this section. Given that TAF+ is built upon an existing meta-BO method called Transfer Acquisition Function (TAF), we also provide an overview of TAF and its limitations.

BO using Expected Improvement as the acquisition function
BO identifies the optimal parameter setting that maximizes or minimizes an objective function (e.g., completion time) over iterations.
In each iteration, BO selects a parameter setting (denoted as $x$) to be evaluated ($f(x)$), resulting in an objective function value $y = f(x)$. BO has two key elements: the acquisition function, which determines which $x$ should be evaluated in each iteration, and the surrogate model of the true objective function $f$.
Since each evaluation of $f$ is expensive (e.g., through user interaction), BO relies on an acquisition function, which is cheaper to evaluate [38], to determine the next setting to be evaluated. In each iteration, BO samples many parameter settings and calculates their acquisition values (their "worth values") with the acquisition function. The $x$ with the highest acquisition value is picked for the actual evaluation. To generate the acquisition value of a given $x$, the acquisition function relies on BO's other element, the surrogate model. This surrogate model is typically a Gaussian Process (GP) regression [70,83], which generates the predicted objective value ($\hat{y}$) and its associated uncertainty (i.e., variance)³ for a given $x$. After the actual evaluation of $x$, each BO iteration results in an observation, an $(x, y)$ pair; all the observations are then fit into the surrogate model (i.e., GP). As BO accumulates more data through iterations, its GP becomes more accurate to the true function $f$, enabling the acquisition function to make better predictions. For more details on BO, please see [27].
Among common acquisition functions [27,98], we select Expected Improvement (EI) as the base acquisition function since it is used in the TAF paper, upon which we develop our TAF+ algorithm. An intuitive way to understand Expected Improvement ($EI(x)$) is that it calculates the amount of potential improvement in the objective function from the current best observation. A formal definition, written here for an objective to be maximized, is:

$$EI_t(x) = \mathbb{E}\left[\max\left(\hat{y} - y^{+},\, 0\right)\right] \quad (1)$$

where $EI_t(x)$ is the acquisition value at $x$ for the current iteration $t$, $\hat{y} = \hat{f}(x)$ is the predicted objective value at $x$ based on the GP model, and $y^{+}$ is the best-observed performance thus far over the whole optimization history $\mathcal{H}$, consisting of all previous datapoints $\{(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})\}$.
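As a minimal sketch of how Expected Improvement can be evaluated in practice, the snippet below uses the standard closed form for a Gaussian prediction. It assumes a maximization problem and that the surrogate returns a predictive mean and standard deviation at a candidate setting; the function name and arguments are illustrative and not taken from the paper.

```python
import math


def expected_improvement(mu, sigma, y_best):
    """Closed-form EI under a Gaussian prediction N(mu, sigma^2).

    Assumption: the objective is maximized, so improvement is mu - y_best.
    mu, sigma: predicted mean and standard deviation at a candidate setting.
    y_best: best objective value observed so far.
    """
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(mu - y_best, 0.0)
    z = (mu - y_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - y_best) * cdf + sigma * pdf
```

For a fixed predicted mean, a larger predictive standard deviation yields a larger EI, which is what lets the acquisition function favor unexplored settings.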

Meta-BO
Meta-learning is a paradigm of machine learning focused on achieving fast adaptation in a given task by leveraging prior data of similar tasks [35]. In the context of HitL BO, each task is to identify the optimal parameter setting of a user using BO. Thus, the goal of meta-learning for BO (i.e., meta-BO) is to leverage the optimization data of previous users to enhance the efficiency of BO for the new user(s). There are generally two phases in meta-BO (Figure 3). The first phase is population modeling, which involves gathering data from a set of users. We run HitL BO on each user, resulting in one GP model (i.e., surrogate model) per user. We define these models as population models. In the second phase, adaptation, meta-BO is deployed on the new users. In particular, a new GP is constructed by fitting the observations of the new user, while leveraging the population models when possible. We call this new GP model the adaptation model since it aims to "adapt" the previous GP models to the new user.

Transfer Acquisition Function (TAF): a meta-BO method
BO needs to search randomly in the initial iterations since its surrogate GP model does not have information to guide a meaningful search. Transfer Acquisition Function (TAF) is a specific meta-BO method that addresses this limitation by utilizing previous population models as an informative prior for guiding the search. Specifically, TAF is an acquisition function that considers both the adaptation model built upon the observations of the current user and the previously gathered population models. Thus, even when the adaptation model has no (or limited) information, population models can still guide the optimizer in selecting a setting ($x$) that is likely to lead to good performance ($y$). Similar to the purpose of the regular acquisition functions in the BO process, the $x$ with the highest TAF value will then be evaluated by the new user in each iteration. The gathered observation is fit into only the adaptation model, further improving its predictions in later iterations. TAF aggregates the Expected Improvement ($EI$) value from the current adaptation model as well as from all the population models for a given parameter setting $x$. Here we define the $EI$ calculated from the population models as Population Expected Improvement ($PEI$) to differentiate it from the $EI$ calculated from the current adaptation model. $PEI$ is calculated similarly to $EI$ (see Equation 1, and refer to section 7 of the original paper [100] for more details). This leads to one $EI$ value and $N$ $PEI$ values for a given $x$, where $N$ is the number of population models. TAF then computes a weighted combination of these values to obtain the acquisition function value at $x$:

$$TAF(x) = \frac{w_{N+1}\, EI(x) + \sum_{i=1}^{N} w_i\, PEI_i(x)}{\sum_{i=1}^{N+1} w_i} \quad (2)$$

where $EI(x)$ is the Expected Improvement from the adaptation model, while $PEI_i(x)$ is the Population Expected Improvement based on the $i$-th population model. $i \in \{1, \ldots, N\}$ denotes the index of the population models. Finally, $w_i$ are the weights on each $PEI_i$, and $w_{N+1}$ is the weight assigned to the $EI$. We define this summary across different models as the "between-model combination" (Figure 3).
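The between-model combination described above can be sketched as a weighted average. This is only an illustration: it assumes the combination is normalized by the sum of the model weights, and the function name and argument layout are hypothetical.

```python
def taf_value(ei, pei_values, weights):
    """Weighted combination of one EI value and N PEI values at one setting.

    ei: Expected Improvement from the adaptation model.
    pei_values: list of N Population Expected Improvement values.
    weights: N population-model weights followed by the adaptation-model weight.
    Assumption: the result is normalized by the sum of all weights.
    """
    if len(weights) != len(pei_values) + 1:
        raise ValueError("expected one weight per population model plus one")
    total = sum(weights)
    if total == 0.0:
        return 0.0
    population_part = sum(w * p for w, p in zip(weights[:-1], pei_values))
    return (weights[-1] * ei + population_part) / total
```

With a zero weight on the adaptation model, the value reduces to the weighted mean of the population models' PEI values, which mirrors the early-iteration behavior described in the text.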

3.3.1 Model weights. TAF computes a weighted combination of the $EI$ and $PEI$ values using weights $w$. A higher weight means that this model's $EI$ (or $PEI$) value is more valuable or reliable. Here, we define such weights on models as "model weights". Following Wistuba et al. [100], we use a variance-based method for determining these weights. An intuitive explanation is that the weight of a model is based on the confidence of its prediction at $x$. Low variance from a particular model (see the red area of each model in Figure 3) indicates higher confidence, so its resulting weight is higher in the TAF computation (Equation 2). On the contrary, when the variance is large, the model has less confidence in its prediction, so the corresponding weight should be lower⁴.
The population models generally do not have a large variance since they are already fitted with data from the previous optimization processes. However, at the beginning of an adaptation, the adaptation model has no or very few datapoints, so the variance is high overall, leading to a low model weight. Hence, TAF initially relies more on the population models. Figure 3 shows an example. As the adaptation model is fitted with more observations from the current user, its variances decrease over iterations. Consequently, TAF gradually increases the adaptation model's weight after ample iterations, achieving personalization.
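To make the shift from population models to the adaptation model concrete, the sketch below maps predictive variances to weights. The exact weighting formula is not given in this section, so inverse-variance weighting is used purely as an assumption; the variances are made-up numbers for illustration.

```python
def variance_based_weights(variances, eps=1e-12):
    """Map each model's predictive variance at a setting to an unnormalized weight.

    Assumption (not from the paper): weight = 1 / variance, so confident
    models (low variance) receive higher weights.
    """
    return [1.0 / (v + eps) for v in variances]


def weight_shares(weights):
    """Normalize weights so they sum to 1, for easier comparison."""
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical variances: two population models, then the adaptation model.
early = weight_shares(variance_based_weights([0.02, 0.03, 1.00]))  # few user observations
late = weight_shares(variance_based_weights([0.02, 0.03, 0.01]))   # many user observations
```

Under these assumed numbers, the adaptation model's share (the last entry) is small in `early` and the largest in `late`, matching the qualitative behavior described above.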

Limitations of TAF.
Prior works have evaluated the performance of TAF with various single-objective testing functions. TAF significantly outperformed standard BO while being computationally lightweight [93,100]. However, TAF has two major limitations that make it unsuitable for realistic HitL problems. First, it was designed to handle a single objective. Yet, a realistic interaction usually involves multiple objectives, and it is unclear how TAF should address them. Second, TAF does not allow proactively shifting weights from population models to the adaptation model. Although TAF gradually increases the weights of the adaptation model with more observations, there are scenarios in which we want TAF to rely on the adaptation model in earlier iterations. For example, when the new user exhibits behaviors different from all the population models, shifting the importance to the adaptation model allows for a more efficient user-specific adaptation. However, TAF does not have a mechanism to support that.

TRANSFER ACQUISITION FUNCTION+ (TAF+) AND ITS ACCOMPANYING WORKFLOW
We develop our Transfer Acquisition Function+ (TAF+) algorithm by leveraging TAF. Accompanying the algorithm is its workflow (see Figure 5). The first three steps are offline steps to be performed by the developer before deploying the method on the device. The final step is the online adaptation that the end-users experience while performing the pointing interactions with the device.

[Figure 3 caption, partially recovered; panel labels: "No data yet (or limited data)", "Value of TAF(0.7)". ... ($i$ is the index of the models). In PHASE 1, population models are constructed per user using optimization data. Each model predicts the user performance $\hat{y}$ (red line), the corresponding acquisition value ($PEI_i$), and the uncertainty of this prediction (the red area) for a specific $x$. In PHASE 2, a new adaptation model is created for the new user. To derive the acquisition value, TAF computes the between-model combination across all models (the adaptation model's $EI$ and the population models' $PEI_i$) based on the model weights. Model weights are denoted as $w_i$, and they are computed based on the variance (width of the red area) of each prediction. An $EI$ (or $PEI_i$) with higher uncertainty has a lower weight. The example computes TAF at $x = 0.7$, where the adaptation model has very high uncertainty in early iterations, so the TAF value is mainly determined by the population models. As the adaptation model gains more observations, TAF will gradually be dominated by the adaptation model, leading to the user-specific optimal result.]

Here, we first introduce TAF+ and then provide an overview of each step of the workflow. Lastly, we compare the TAF+ workflow with other baseline methods.

Transfer Acquisition Function+ (TAF+)
The main idea of TAF+ is to mitigate the limitations of TAF with two crucial extensions: dynamically handling multiple objectives and proactively balancing the weights of the population models and the adaptation model.

Extension 1: Dynamically handling multiple objectives. In realistic interactions, there is usually more than one objective function, which limits the direct application of TAF. A naive solution is a weighted-sum approach that transforms multiple objectives into a single objective, where a set of weights on each objective must be predefined. The resultant weighted-sum objective can then be used for both population modeling and adaptation phases, thereby enabling the application of TAF to multi-objective settings. Despite its simplicity, this approach has various limitations for realistic tasks. The weight assignment needs to be done arbitrarily by the designer beforehand, and there is no flexibility to tune the weights later. In practice, the designer will likely need to adjust the weights since the predefined weights may not be ideal.
As shown in Figure 4, our TAF+ takes a different approach. Instead of predetermining a fixed set of weights for objective functions, our population modeling is performed in a multi-objective manner. That is, our population models generate multiple acquisition values, one for each objective, instead of only a single value, for a given $x$. Then, in the adaptation phase, TAF+ dynamically combines these acquisition values into one value per model according to the weights assigned to the objective functions. Such weights can be adjusted whenever needed.
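The effect of keeping per-objective acquisition values and combining them late can be sketched as follows. The candidate names, acquisition values, and weight sets are hypothetical; the point is only that the same stored values can be re-weighted after population modeling without re-running it.

```python
def within_model_combination(acq_per_objective, objective_weights):
    """Combine one model's per-objective acquisition values into a single value."""
    if len(acq_per_objective) != len(objective_weights):
        raise ValueError("one weight per objective is required")
    return sum(o * a for o, a in zip(objective_weights, acq_per_objective))


def best_candidate(candidates, objective_weights):
    """Return the candidate whose combined acquisition value is highest."""
    return max(
        candidates,
        key=lambda name: within_model_combination(candidates[name], objective_weights),
    )


# Hypothetical per-objective acquisition values (time, error, trajectory quality).
candidates = {"x_a": [0.9, 0.2, 0.1], "x_b": [0.1, 0.3, 0.8]}
time_first = [0.8, 0.1, 0.1]
quality_first = [0.1, 0.1, 0.8]
```

In this made-up example, `best_candidate(candidates, time_first)` selects `x_a`, while `best_candidate(candidates, quality_first)` selects `x_b`, so changing the objective weights changes which setting is suggested next.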
We denote the acquisition values on different objectives as $EI_j$ or $PEI_{i,j}$, where $j$ is the index of the objectives. During the adaptation phase, for each model, TAF+ first combines these values of different objectives into a single acquisition value ($EI^{+}$ or $PEI^{+}_i$, where the + sign indicates that this is an aggregated value) based on the weights of the objectives. We refer to this combination as the "within-model combination" to differentiate it from the between-model combination (Figure 3), and we define these weights of the objectives as "objective weights" to differentiate them from the weights on models. This feature enables a designer to dynamically adjust the objective weights at any time, even after the population modeling. Designers can even dynamically tailor the objective weights for different users or contexts. In subsection 4.2, we illustrate a workflow that allows designers to identify the optimal objective weights based on the users' subjective ratings (see step 2 in Figure 5). Subsequently, TAF+ combines all the acquisition values across different models into a final value based on the model weights. Such a between-model combination is identical to that in TAF. To this point, TAF+ can be formally written as:

$$TAF^{+}(x) = \frac{w^{+}_{N+1}\, EI^{+}(x) + \sum_{i=1}^{N} w^{+}_i\, PEI^{+}_i(x)}{\sum_{i=1}^{N+1} w^{+}_i} \quad (3)$$

where $EI^{+}$ and $PEI^{+}_i$ are weighted sums of the $EI_j$ and $PEI_{i,j}$ values, respectively, from several objectives, and $i \in \{1, \ldots, N\}$ denotes the index of the population models. Assuming $M$ different objectives, we can denote $EI^{+}$ and $PEI^{+}_i$ as:

$$EI^{+}(x) = \sum_{j=1}^{M} o_j\, EI_j(x), \qquad PEI^{+}_i(x) = \sum_{j=1}^{M} o_j\, PEI_{i,j}(x) \quad (4)$$

where $j \in \{1, \ldots, M\}$ denotes the index of the objectives, and $o_j$ is the objective weight of the $j$-th objective function. Furthermore, in contrast to TAF, where each population or adaptation model is associated with a single objective and has one model weight, TAF+ considers multiple objectives. Consequently, each model has multiple weights, one per objective for each $EI_j$ or $PEI_{i,j}$. Formally, we denote these weights as $w_{i,j}$, where $j$ is the index of the objectives. During the within-model combination in TAF+, similar to the process of deriving $EI^{+}$ or $PEI^{+}_i$, the objective weights are utilized to combine multiple weights into one single weight $w^{+}_i$:

$$w^{+}_i = d(t) \sum_{j=1}^{M} o_j\, w_{i,j}(x) \quad (5)$$

where $o_j$ is the same objective weight shared with Equation 4, and $w_{i,j}(x)$ is the $i$-th model's weight on the $j$-th objective function. Each $w_{i,j}(x)$ value is calculated in a manner similar to TAF, as described in subsubsection 3.3.1. $d(t)$ is a decay factor applied only to the population models (no decay is applied to the adaptation model, $i = N+1$), which is elaborated in subsubsection 4.1.2 and Equation 6.

Extension 2: Proactively balancing the weights of the population models and the adaptation model. The decay factor $d(t)$ scales the weights of the population models; $d(t) = 0$ means that the adaptation relies fully on the adaptation model. An ideal decay should update over iterations: it may not be effective in the early iterations since the adaptation model has very limited information, but it should increase over iterations as more data becomes available from the current user. The decay $d(t)$ is thus an iteration-dependent function with two hyperparameters: $c_1$, the iteration number after which the decay takes effect, and $c_2$, which determines the rate of the decay. We can formally describe $d(t)$ as:

$$d(t) = \begin{cases} 1, & t < c_1 \\ \max\left(0,\; 1 - c_2\,(t - c_1)\right), & t \geq c_1 \end{cases} \quad (6)$$

where $t$ is the count of the iteration, $c_1$ is a positive integer (or 0), and $c_2 \in (0, 1]$. There is no decay when the iteration count is less than $c_1$. After the $c_1$-th iteration, the decay starts with the rate of $c_2$; i.e., every iteration, the scalar $d(t)$ decreases by $c_2$. Once the scalar reaches 0, it stays at 0, so the $w^{+}_i$ of every population model stays 0 as well, and hence the adaptation starts relying fully on the adaptation model.
With this extension, TAF+ allows actively determining how the population models should decrease their importance. For instance, in a population where all the users exhibit high similarity, the optimal design for the new user is likely to be highly similar to the population models. Hence, the designer can set $c_2$ to 0, allowing the population models to consistently guide the current optimization. On the other hand, when there is a higher diversity between users, a designer can leverage the population models in the initial steps and let the adaptation quickly develop based on the current data. In such a case, the designer can set $c_1$ and $c_2$ so that the decay leads to faster adaptation. Our workflow has an additional step to derive the optimal $c_1$ and $c_2$ (see step 3 in Figure 5). In contrast, TAF does not have a similar mechanism, making it potentially unsuitable for interactions requiring fast adaptation.
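The iteration-dependent decay described above can be sketched as a small function. This is one reading of the prose (no decay before the onset iteration, then a linear decrease per iteration, clamped at zero); the parameter names `start_iteration` and `rate` are illustrative stand-ins for the two decay hyperparameters.

```python
def decay(t, start_iteration, rate):
    """Scalar applied to population-model weights at iteration t.

    Assumption: the scalar stays at 1 until start_iteration, then drops by
    `rate` per iteration and is clamped at 0.
    """
    if t < start_iteration:
        return 1.0
    return max(0.0, 1.0 - rate * (t - start_iteration))


def decayed_population_weights(weights, t, start_iteration, rate):
    """Scale population-model weights only; the adaptation weight is untouched."""
    scale = decay(t, start_iteration, rate)
    return [scale * w for w in weights]
```

A rate of 0 keeps the population models fully in play at every iteration, while a larger rate hands control to the adaptation model within a few iterations after the onset.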

Potential generalizability of TAF+ for HitL optimization: TAF+ shares the foundational principles of BO, a versatile parametric optimization method that makes minimal assumptions about the task [27,84] and has proven its generalizability over a wide range of HCI applications [9,15,17,47,57,87]. Extended from BO, the only essential assumption of our TAF+ is that while users have individual differences, there exist parameter ranges that generally lead to good user performance. This is a common assumption in HCI and design, where a certain range of parameters or designs is considered effective for the broader user base despite individual disparities. Thus, TAF+ can be used as a general approach for other HitL problems as well. On the rare occasions where there are no overlapping traits at all between the population models, designers can use the TAF+ workflow to forecast this outcome (see subsection 4.5) and use other methods instead. We demonstrate the potential generalizability of TAF+ with a series of simulations, as presented in subsection 4.6.

Overview of the TAF+ workflow
Figure 5 provides an overview of the TAF+ workflow, which contains three steps to prepare the population models followed by the deployment in the adaptation phase. The first step of the TAF+ workflow entails building population models that generate multi-objective predictions. This step eliminates the need to predefine the objective weights for multiple objectives in advance and enables the flexibility of setting them later. The second step focuses on identifying the optimal objective weights corresponding to the maximum subjective ratings. By optimizing the objective weights, the designers can actively steer the optimization to explore the parts of the Pareto frontier that maximize the user's feedback. In the third step, the optimal decay hyperparameter settings (Equation 6) for the gathered population models are identified using grid search in simulations. Different decay hyperparameter settings are tested in simulation to identify which setting leads to the optimal simulated adaptation performance. Finally, we deploy meta-BO, our TAF+, in adaptation on the end-users, where the population models, optimal objective weights, and the decay hyperparameters are utilized. We detail each step below.

Step 1: Population modeling
We performed a data collection in which users went through HitL optimization guided by multi-objective BO. The data of each user is then used to construct a GP model, which predicts the values of the multiple objective functions for a given parameter setting. Note that this step does not incur any additional costs for end-users because the end-users only experience the adaptation phase (see Figure 5).

Step 2: Objective weight optimization
TAF+ transforms the optimization problem from multiple objectives into a single objective based on objective weights. The three selected objectives involve intrinsic trade-offs, as is the case for other input devices. To identify the optimal weights, the subjective ratings of each user's Pareto-optimal designs are obtained and used to identify the weight set that results in the designs with the highest subjective rating. This step is a "population-level" weight optimization because its goal is to identify the weight setting that captures the highest ratings across all users.
4.4.1 Objective weight optimization for a single user: Consider a simplified single-user scenario for a better understanding of our method (Table 1). Our procedure involves sampling a list of possible weight sets (e.g., [0.1, 0.1, 0.8], [0.1, 0.2, 0.7], ... for a problem with three objectives). Each weight set is applied to the objective values of all of this user's Pareto-optimal designs to identify the design on the Pareto frontier with the highest weighted-sum objective value. We then record the user's rating for this particular design as the score of this weight set. By trying out all the weight sets and comparing their corresponding ratings, we can identify the optimal objective weights that lead to the highest user rating.
With a single user, we could conclude the objective weight assignment by assigning weights in accordance with the highest user rating. However, at the population level, different users may favor the objectives differently. We therefore need to find the objective weight assignment that is the best across all users in the population. For instance, for the user presented in Table 1, design B has the highest user rating, and its third objective function has the highest value. Intuitively, this suggests that the user favors the third objective, so assigning the objective weights as [0.1, 0.1, 0.8] is a reasonable and straightforward solution. However, at the population level, different users potentially favor different objectives, so it is unlikely that one can intuitively identify a single objective that leads to the highest ratings for all. Therefore, a grid search is a more thorough solution. More details are below.

Figure 5: TAF+ workflow (figure labels: a new user; rapid, online optimization; optimal setting for the new user). Subsection 4.2 provides the details and explanations of each step. Similar to other meta-learning workflows, the first step in ours is population modeling, and finally, TAF+ is deployed on the end-users (adaptation). The TAF+ workflow has two additional steps (steps 2 and 3) for deriving the optimal objective weights and the decay hyperparameters.

Table 1: The table shows a set of Pareto-optimal designs in a three-objective optimization problem. We demonstrate how to obtain the optimal objective weights leveraging the user ratings. The second column shows that designs A, B, and C have different objective value sets. Columns 3-5 demonstrate how we can calculate the weighted-sum objective values when a set of objective weights is given. Different objective weights lead to different optimal designs. For instance, under weights = [0.7, 0.2, 0.1], design A is the best design. However, under weights = [0.2, 0.3, 0.5], design B is the optimal design. Since we have gathered the user's ratings on each design, we can compare which objective weight setting leads to the design that corresponds to a higher user rating. In this example, the second weight set ([0.2, 0.3, 0.5]) is the most preferred among the three weight settings because it leads to B as the final design, whose user rating is the highest, 100.

Objective weight optimization across all users:
To obtain the best weight setting across all users, we leverage the process described in subsubsection 4.4.1 for each user in the population. In particular, for a given weight set (e.g., [0.1, 0.1, 0.8], [0.1, 0.2, 0.7], [0.1, 0.3, 0.6], ...), we run the aforementioned process to obtain the weighted-sum optimal design among the Pareto-optimal designs of each user. Then, all the users' subjective ratings corresponding to the resultant optimal designs are combined into a final score for that objective weight set. Finally, the final scores of all weight sets are compared, and the weight set that leads to the highest net user rating is identified. Appendix A.1 elaborates on the need for searching for the best objective weight configuration in this manner through a small example scenario. Appendix A.2 also provides the detailed algorithm that is explained here.
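The population-level search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of Appendix A.2: the function names, the toy designs and ratings, the choice to sum ratings as the "final score", and the exclusion of zero weights from the grid are all hypothetical.

```python
from itertools import product


def weight_grid(n_objectives=3, ticks=10):
    """Weight vectors on a 1/ticks grid that sum to 1.

    Assumption: every weight is at least 1/ticks (no zero weights).
    """
    return [
        tuple(c / ticks for c in combo)
        for combo in product(range(1, ticks), repeat=n_objectives)
        if sum(combo) == ticks
    ]


def best_design(designs, weights):
    """Pareto-optimal design with the highest weighted-sum objective value."""
    return max(
        designs,
        key=lambda d: sum(w * o for w, o in zip(weights, d["objectives"])),
    )


def population_weight_search(population, grid):
    """Score each weight set by summing, over users, the rating of the
    design that the weight set selects for that user."""
    scores = {
        weights: sum(best_design(designs, weights)["rating"] for designs in population)
        for weights in grid
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical Pareto-optimal designs and ratings for two users.
user_1 = [
    {"name": "A", "objectives": (0.9, 0.2, 0.1), "rating": 40},
    {"name": "B", "objectives": (0.2, 0.3, 0.9), "rating": 100},
    {"name": "C", "objectives": (0.4, 0.8, 0.3), "rating": 60},
]
user_2 = [
    {"name": "D", "objectives": (0.8, 0.5, 0.2), "rating": 100},
    {"name": "E", "objectives": (0.1, 0.4, 0.9), "rating": 30},
]

grid = weight_grid()
best_weights, scores = population_weight_search([user_1, user_2], grid)
```

In this toy population, no single objective dominates for both users, so the search has to find a compromise weight set that selects each user's highest-rated design.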

Step 3: Decay hyperparameter optimization
TAF+ has a set of hyperparameters that decrease the weights of the population models (Equation 6). This step aims to identify the optimal hyperparameter setting through simulations. In these simulations, we take one population model at a time and treat it as a new user. Meanwhile, we treat the remaining models as population models to conduct TAF+. We then perform a grid search over different sets of < 1 , 2 > values to identify the < 1 , 2 > pair that yields the best performance at the population level. This optimal hyperparameter setting can then be used for subsequent TAF+ runs. This simulation can be an effective pre-check step before deploying meta-BO. Future practitioners can utilize such simulations to foresee the potential efficacy of meta-BO for their own tasks. In rare cases, users may behave completely differently for a particular interaction. In such cases, this step informs the practitioners that TAF+ would not outperform BO regardless of the hyperparameter setting, so they can deploy a standard BO instead. Furthermore, practitioners can utilize the simulations to observe the performance of meta-BO with different numbers of population models and learn that larger user groups may be needed in certain cases.
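The leave-one-out structure of this step can be sketched as below. This is only a skeleton under stated assumptions: `simulate` is a hypothetical callback standing in for one simulated TAF+ adaptation run (the decay rule of Equation 6 is not reproduced here), and the names `h1` and `h2` are placeholders for the two decay hyperparameters.

```python
from itertools import product
from statistics import mean


def leave_one_out_grid_search(models, grid_1, grid_2, simulate):
    """Return the hyperparameter pair with the best mean held-out score.

    models:   list of population models (any objects).
    simulate: hypothetical callable (held_out, population, h1, h2) -> float,
              where a higher value means better simulated adaptation.
    """
    results = {}
    for h1, h2 in product(grid_1, grid_2):
        scores = []
        for i, held_out in enumerate(models):
            # Treat one model as the "new user" and the rest as the population.
            population = models[:i] + models[i + 1:]
            scores.append(simulate(held_out, population, h1, h2))
        results[(h1, h2)] = mean(scores)
    best_pair = max(results, key=results.get)
    return best_pair, results
```

Running the same loop with a standard BO baseline in place of `simulate` would give the comparison that tells a practitioner whether meta-BO is likely to help for their task.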

Evaluating TAF+'s viability via simulations on synthetic test functions
Before applying TAF+ to the target interactions, we present synthetic simulations that evaluate TAF+ in multi-objective problems with common test functions, such as Sphere 5 , Branin 6 , and Hartmann 3D 7 . Our simulations use the same objective functions and number of parameters as relative pointing. We evaluated TAF+'s performance under 6 different objective weight configurations, highlighting its advantage of offering high flexibility for the designers to fine-tune the objective weights when needed. We further evaluated TAF+'s performance with five levels of user group similarity, showing its effectiveness even when users differ strongly. We also evaluated the potential generalizability of TAF+ with four different functions and with five population model sizes.
Our simulation analysis shows that TAF+ always converged to global optimality and outperformed BO in various conditions.This provides evidence for its potential in a wide range of interactions.Appendix B presents the detailed procedure and results of our simulations.
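For readers unfamiliar with the named test functions, the sketch below gives the textbook definitions of Sphere and Branin. It assumes the standard forms and constants; how these functions were shifted, scaled, or combined into synthetic "users" is described in Appendix B and is not reproduced here.

```python
import math


def sphere(x):
    """Sphere function: sum of squares, with its minimum of 0 at the origin."""
    return sum(xi ** 2 for xi in x)


def branin(x1, x2):
    """Branin function with the commonly used constants."""
    a = 1.0
    b = 5.1 / (4 * math.pi ** 2)
    c = 5 / math.pi
    r = 6.0
    s = 10.0
    t = 1 / (8 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * math.cos(x1) + s
```

Both functions are minimization problems in their standard form; negating them turns them into the maximization setting used in the rest of this paper.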

Summary
To summarize, TAF+ improves over TAF in two aspects so as to handle HitL scenarios: it can flexibly handle multiple objectives, and it can proactively decay the weights of prior models as the user progresses in the adaptation phase. Further, TAF+ achieves its goal of converging faster than standard BO in the synthetic simulations.

INTERACTIONS
We study two distinct and representative wrist-based pointing interactions: absolute pointing and relative pointing. While these two interactions share the same task and objective functions, they use different hardware (IMU vs. infrared sensors), different body parts (forearm vs. wrist angular motion), different device parameters (forearm yaw-pitch vs. wrist angle), different transfer functions (linear vs. sigmoid), and different parameter counts (2 vs. 4). We purposely chose these two cases to demonstrate the effectiveness and the potential generalizability of the meta-BO approach. Below, we first describe the shared details and then the two interactions.
5.1 Task and software interface
The task was a target selection task (see Figure 6), where participants were asked to move the cursor to the target and select it by performing a double pinch. The pinches were detected using a highly accurate active electrical sensing approach, the same technique used in ElectroRing [42]. Participants were asked to select the targets "as quickly and accurately as possible".

Interface:
Our goal is to deploy meta-BO online while the user performs pointing in a real-world interface.We thus design our study interface with varying target sizes and distances in a grid to resemble a real-world interface (see Figure 6 a).The targets are circular, arranged in an 8 × 4 grid.The diameter of each circle is uniformly sampled from a range of [20,35] mm.The default color of all the circles is light grey, and the target circle is highlighted in blue.
Upon selection, the target turns red. The system then randomly samples a different circle as the next target. The cursor does not reset to the origin between selections. We apply a one-euro filter on the cursor position to overcome jitter [12].

Objective functions
The two interactions share the same objective functions, which we aim to optimize. To prevent bias toward any objective function during multi-objective optimization, these objective functions were further linearly normalized into the range of [−1, 1]. We converted the problem into a maximization problem, so 1 is the best performance and −1 is the worst 8 . Similar normalization was also done in prior work [15].
1. Normalized completion time (NCT): Completion time (CT) is the duration from the moment that the cursor leaves the previous target to when the selection is complete, a typical way to assess input efficiency. Our targets varied in size and distance; a standard way to counterbalance the effect of these variances is to average the performance over many selections (e.g., [15]). However, since we aimed to use the minimal number of selections for the optimization process, we normalized the completion time with the Index of Difficulty (ID) [61], and thus $NCT = CT / ID$.
2. Trajectory aiming error (TAE): TAE is a crucial metric for evaluating pointing accuracy [15,52]. We follow AutoGain [52], which used the Persistence1D algorithm [50] to segment a full trajectory into sub-movements and then calculated the aiming error of each ballistic sub-movement. For simplicity, we assumed the implicit aiming point to always be the target position. After excluding the unaimed, interrupted, and non-ballistic movements, we calculate the aiming error (overshoot and undershoot) of each sub-movement as the distance between the cursor position at the local minimum speed and the closest edge of the target (Figure 6a). Summing up all the aiming errors ($e_i$, where $i \in [1..n]$ is the $i$-th ballistic sub-movement), $TAE = \sum_{i=1}^{n} e_i$.

3. Trajectory travel distance (TTD): TTD measures the amount of detour of a selection trajectory. A selection could be fast and accurate, but if the cursor travel distance from the previous to the current target is longer than the shortest distance, it indicates a potentially skewed transfer function. This metric is expressed as $TTD = d_{travel} / d_{ideal}$, where $d_{travel}$ is the measured travel distance and $d_{ideal}$ is the ideal distance. In absolute pointing, if the transfer function values misalign with the user's assumed ratio, the user would take extra distance to move the cursor along the intended direction. In relative pointing, a rotational misalignment in the x-y mapping may result in deviations in motion. While TTD may be correlated with the previous objective functions, the correlation is only partial, so TTD captures additional useful information. Note that there are intrinsic trade-offs among the selected objectives.
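The three metrics and their normalization can be sketched as follows. This is an illustrative sketch rather than the study code: the Shannon formulation of the Index of Difficulty, the use of the straight-line start-to-target distance as the ideal distance, the clipping, and the `worst`/`best` bounds used for normalization are all assumptions, and the sub-movement segmentation (Persistence1D) is taken as given.

```python
import math


def index_of_difficulty(distance, width):
    """Fitts' ID, assuming the Shannon formulation log2(D / W + 1)."""
    return math.log2(distance / width + 1)


def normalized_completion_time(ct, distance, width):
    """NCT = CT / ID."""
    return ct / index_of_difficulty(distance, width)


def trajectory_aiming_error(errors):
    """TAE: sum of the aiming-error magnitudes of the ballistic sub-movements."""
    return sum(abs(e) for e in errors)


def trajectory_travel_distance(path, start, target):
    """TTD: travelled path length divided by the straight-line distance."""
    travelled = sum(math.dist(p, q) for p, q in zip(path, path[1:]))
    return travelled / math.dist(start, target)


def normalize_for_maximization(value, worst, best):
    """Linearly map [worst, best] onto [-1, 1] so that 1 is the best value.

    For metrics where smaller is better, pass worst > best.
    Values outside the bounds are clipped (an assumption).
    """
    scaled = 2 * (value - worst) / (best - worst) - 1
    return max(-1.0, min(1.0, scaled))
```

For example, `normalize_for_maximization(2.0, worst=3.0, best=1.0)` maps a mid-range completion time to 0, while a value at the `best` bound maps to 1.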

Optimization iteration
Based on a 4-person pilot test, we decided to have 6 target selections in each optimization iteration. The first two selections were treated as "practice" because the users may still be adapting to the new setting. We took the average value of the three objective functions (see subsection 5.2) over the last four selections as the final objective values of that iteration.

Absolute pointing
Our absolute pointing interaction utilizes an IMU on the wrist to detect the absolute position of the forearm (Figure 1).The challenge here is to identify the ideal function that maps the forearm positions to the cursor's 2D positions.

Device and interaction:
The interaction linearly maps the IMU <yaw, pitch> to <x, y> coordinates on the interface. Initially, the <yaw, pitch> corresponding to the user's preferred central forearm position in the air is mapped to <0, 0>. The system then linearly updates the cursor's x & y positions based on yaw & pitch, respectively. Ideally, each <x, y> coordinate corresponds to a specific <yaw, pitch> pair.

Design parameters:
The transfer functions for determining cursor position are:
$$x(t) = x(0) + s_x \, (yaw(t) - yaw(0))$$
$$y(t) = y(0) + s_y \, (pitch(t) - pitch(0))$$
where $x(0)$ and $y(0)$ denote the cursor's centered x and y positions in the scene. $x(t)$ and $y(t)$ stand for the cursor's x and y positions at time $t$. $yaw(0)$ and $pitch(0)$ are the centered yaw and pitch values. $yaw(t)$ and $pitch(t)$ are the yaw and pitch values at timestamp $t$. The two design parameters that need to be determined are the transfer scalars $s_x$ and $s_y$ (Table 2). For different users, different motion ranges may be optimal (which may also depend on the task), thus requiring per-user parameter optimization.
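A minimal sketch of such a linear absolute mapping is given below. The function and argument names are hypothetical, and the sign conventions and units (angles in degrees, positions in scene units) are assumptions for illustration.

```python
def absolute_cursor_position(yaw, pitch, yaw0, pitch0, scale_x, scale_y, x0=0.0, y0=0.0):
    """Map the forearm's yaw/pitch offset from the centered pose to a cursor position.

    yaw0, pitch0:     centered forearm pose recorded at the start.
    scale_x, scale_y: the two transfer scalars to be optimized per user.
    x0, y0:           cursor position that corresponds to the centered pose.
    """
    x = x0 + scale_x * (yaw - yaw0)
    y = y0 + scale_y * (pitch - pitch0)
    return x, y
```

A user with a small comfortable motion range would need larger scalars to reach the scene boundaries, which is why the two scalars are optimized per user.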

Relative pointing
Our relative pointing (Figure 2) interaction is similar to mouse pointing where users move the cursor by clutching [66].The cursor's direction is determined by the sensed wrist movement's angle.The cursor's moving velocity is determined by a velocity function that takes the wrist's relative velocity as input.The user pinches to initiate cursor motion, rotates the wrist to move the cursor in the desired direction, and releases the pinch to end cursor motion.The user clutches and performs this repeatedly to reach the target.

Device and sensing:
We used the same approach as Salemi Parizi et al. [77] to detect wrist angles via infrared (IR) sensors on a wristband. The detected wrist flexion/extension and radial/ulnar deviation at each timestamp were mapped to the wrist plane's x and y coordinates, respectively (see Figure 2b). With measurements at two consecutive timestamps, we derive the wrist velocity ($v$) and the detected wrist movement's angle ($\theta$).
5.5.2 Determining cursor's moving direction: One important factor is the wearing position of the device. Ideally, the device should be placed exactly perpendicular to the body (Figure 2b). Then, moving the cursor along the sensed angle $\theta$ would match the user's intention. Yet, users wear the sensor slightly differently. The slight angular difference between the ideal and worn positions can cause the cursor to move at unintended angles (see Figure 2c). We introduced an angular correction parameter $\theta_c$, which is a value that is directly added to the sensed wrist angle such that $\hat{\theta} = \theta + \theta_c$, where $\hat{\theta}$ is the corrected angle, $\theta$ is the sensed angle, and $\theta_c$ is the correction parameter. After the correction, the cursor moves as intended; Figure 2d shows an effective correction. Thus, $\theta_c$ is an important design parameter that directly impacts $\hat{\theta}$. Here, we define our relative transfer function:
$$x(t) = x(t-1) + f(v) \cos\hat{\theta}$$
$$y(t) = y(t-1) + f(v) \sin\hat{\theta}$$
where $x(t)$, $y(t)$ are the cursor x, y at timestamp $t$. $f(v)$ determines the cursor velocity based on wrist velocity $v$.

Determining cursor's velocity:
As in [65], our velocity function $f(v)$ is a sigmoid function and consists of 3 parameters ( 1 , 2 , 3 ), which define the sigmoid curve properties (detailed in Table 3). In total, there are 4 design parameters to be optimized: the aforementioned angular correction and the 3 parameters in the velocity function. From an optimization perspective, relative pointing is more complex than absolute pointing. In addition to different users having different wrist rotation ranges, tiny variations in how the device is worn may lead to large sensing variations due to how IR sensors work. Additionally, since only the relative wrist motion matters, the user's arm is positioned downwards here (e.g., [60]), which enables relaxed use (see Figure 2a).
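One cursor update of such a relative mapping might look like the sketch below. The exact sigmoid used in the study is specified in Table 3 and is not reproduced here; the form `p3 / (1 + exp(-p1 * (v - p2)))`, the roles assigned to `p1` and `p2`, and the omission of an explicit time step are assumptions made only for illustration.

```python
import math


def sigmoid_velocity(v, p1, p2, p3):
    """Hypothetical sigmoid velocity function.

    p3 scales the output, p1 sets the steepness, and p2 sets the midpoint.
    """
    return p3 / (1 + math.exp(-p1 * (v - p2)))


def relative_cursor_step(x, y, wrist_velocity, sensed_angle, correction, params):
    """Advance the cursor by one step along the corrected wrist direction.

    sensed_angle and correction are in radians; params is (p1, p2, p3).
    """
    angle = sensed_angle + correction          # angular correction parameter
    speed = sigmoid_velocity(wrist_velocity, *params)
    return x + speed * math.cos(angle), y + speed * math.sin(angle)
```

With a correction of zero the cursor follows the sensed direction directly; a non-zero correction rotates every step by the same amount, compensating for a rotated wearing position.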

IMPLEMENTING TAF + WORKFLOW FOR OUR WRIST-BASED INTERACTIONS
In section 4, we introduced the details of each step in the TAF+ workflow. Here, we report the process, data collection, and results of each step with our target interactions.

Step 1: Population modeling via a user data collection
For each interaction, participants performed the task while the parameters were changed every iteration based on multi-objective BO using Expected Hypervolume Improvement [19] and BoTorch [4] 9 . The objective functions and design parameters were detailed in section 5. We ran 25 and 40 iterations for absolute and relative pointing, respectively. These numbers reflect the complexity of the tasks. The collected data, i.e., 25 and 40 pairs of parameter settings and their objective values, were used to construct the population models.
6.1.1 Procedure. We recruited 14 participants, 5 male, 8 female, and 1 non-binary, aged 22 − 57 (M = 28.5). None of them had any experience with VR. The data collection took 2 hours. Because we aim to analyze the two interactions independently, all participants went through relative pointing first and then absolute pointing, rather than mixing the order of the interactions. To familiarize participants with the task, we provided 10 "practice iterations", in which we uniformly sampled design instances from the entire design space. There was a 1-minute break every 10 iterations and a 5-minute break between the two interactions.
6.1.2 Deriving population models. We derived 14 population models for each interaction. Appendix C shows the plots of the hypervolume increase of the two interactions at each iteration. On average, there were 4.8 (s.d. = 0.78) and 5.79 (s.d. = 1.12) Pareto-frontier settings for absolute pointing and relative pointing, respectively. This showed there are intrinsic trade-offs between the selected objectives.

Step 2: Objective weight optimization based on user ratings
We determine the ideal values of the objective weights through optimization as described below.
6.2.1 Gathering user ratings. We compare the quality of the Pareto-optimal settings based on the participants' subjective ratings. For each participant, after they finished the 25 and 40 iterations in the previous step, we extracted all the Pareto-optimal parameter settings among these samples. Then, we asked the participants to perform pointing with these settings and rate them subjectively. The instruction was: "For each of the following designs, please rate how much you agree with this statement: [This design is easy to use]. Please rate from 1 to 100; 1 stands for strongly disagree, and 100 means strongly agree." Participants interacted with each Pareto-optimal design twice in a randomized order. We took the average of the two ratings on the same parameter setting as its final rating. If the rating scale is coarse (e.g., only 5 or 7 levels), we may end up with many weight sets that reach identical results when running the objective weight optimization. We therefore provided a 1-100 scale to gain the most granular information, even if the users are not able to be exact in their assessments. To avoid a scenario where a user who rates all designs highly dominates the optimization process, we normalized each user's ratings to a fixed range, such that the lowest and highest ratings of a user would be 1 and 100.
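The two data-preparation operations in this step, extracting the Pareto-optimal settings and rescaling each user's ratings to [1, 100], can be sketched as follows. The helper names are hypothetical, all objectives are assumed to be maximized, and the handling of a user who gives every design the same rating is an assumption that the text does not specify.

```python
def pareto_front(points):
    """Indices of the non-dominated points, assuming every objective is maximized."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(qk >= pk for qk, pk in zip(q, p)) and any(qk > pk for qk, pk in zip(q, p))
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


def normalize_ratings(ratings, low=1.0, high=100.0):
    """Rescale one user's ratings so the lowest maps to `low` and the highest to `high`."""
    lo, hi = min(ratings), max(ratings)
    if hi == lo:
        # Assumption: a user who rates every design identically maps to `high`.
        return [high for _ in ratings]
    return [low + (r - lo) * (high - low) / (hi - lo) for r in ratings]
```

Normalizing per user keeps a participant who only uses the upper end of the scale from outweighing one who uses the full range when the ratings are later combined across users.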

Results of objective weight optimization:
We observed that 5 users treated the 1-100 range as 10 levels (rating only in multiples of ten), 4 users treated it as 20 levels (rating only in multiples of five), and 5 participants provided more fine-grained ratings (such as 73 or 92). This shows that the 1-100 scale allows each user to be flexible about the granularity they want to use for their scores. Following the procedures described in subsection 4.4, we sampled all the weight combinations to the first decimal and then performed the population-level objective weight optimization. The resulting optimal weight settings of [NCT, TAE, TTD] for absolute and relative pointing were [0.4, 0.3, 0.3] and [0.5, 0.3, 0.2], respectively. The resulting objective weights highlighted the consistent importance of speed (indicated by NCT) in pointing interactions.

Step 3: Decay hyperparameter optimization via simulations
We simulated TAF+ with different < 1 , 2 > values with the population models and observed which value pair results in the best performance. We set the 1 values to be [1,2,3,4,5,6,7,8,9] and the 2 values to be [0.1, 0.2, 0.3]. We created a list of < 1 , 2 > pairs of all the possible combinations. Additionally, we included two baselines in the simulation. The first was a standard single-objective Bayesian optimization where the objective was the weighted-sum objective. The second baseline was the TAF that handled multiple objectives using Equation 4 and no decay on the population model. For each < 1 , 2 > pair, we ran 14 simulations. In each simulation, we singled out a population model and treated it as the new test user. We then excluded this model and utilized the remaining 13 population models to work with TAF+. We set the number of optimization iterations to 10. Thus, each < 1 , 2 > pair resulted in 14 (users) × 10 (iterations) datapoints.

Figure 7: Result of the decay hyperparameter optimization (panels: absolute pointing and relative pointing; x-axis: iteration number; y-axis: net performance over multiple objectives). We simulated the user performance over 10 iterations with many combinations of hyperparameter values. Note that the TAF (multi-objective) condition handled multiple objectives with Equation 4. We derived the global maximum from a grid search of the GP prediction. The results showed that TAF+ with a proper decay setting led to the best performance. An extended version of this simulation is presented in Appendix D.

6.3.1 Results of the simulation. The simulation results are shown in Figure 7. With 1 = 2, 2 = 0.3, TAF+ achieved the best performance for both interactions. Since there were too many combinations, we only show the performance of the best setting and the two baselines. TAF+ with the optimal hyperparameters outperformed the standard BO. Additionally, TAF (multi-objective) had benefits in the early iterations but struggled to improve quickly, indicating that directly deploying TAF without a mechanism to balance the population models and the adaptation model may hinder adaptation efficiency. This issue would not exist if all the users were highly similar; the new user's optimal design would then be highly aligned with the population. Yet, this assumption may not hold in certain scenarios, e.g., our interactions.

Summary
Through this workflow, we derived (1) 14 population models, (2) optimal objective weights based on subjective ratings, and (3) the optimal setting of the hyperparameters (1 and 2), which will be employed in the next adaptation phase.
7 ADAPTATION: EVALUATING META-BO ON NEW USERS
We evaluated the efficacy of our meta-BO (TAF+) algorithm and workflow with 11 new participants for both pointing interactions. We evaluated against two baseline procedures: standard BO and manual calibration.
10 The hyperparameter settings of the adaptation model followed the ones used in population modeling: a single-task GP with Matern 5/2 kernel and inferred noise levels. For computation efficiency, at each iteration, we generated 1024 parameter settings as candidates using a Sobol grid.

Experimental design
11 entirely new participants were recruited for the evaluation study: 6 males and 5 females, aged 23 − 42 (M = 29.5). None of them had any experience pointing in VR or using VR. We conducted a within-subjects study with 2 independent variables: optimization procedure (meta-BO vs. standard BO vs. manual calibration) and iteration (10 iterations) for each of the pointing interactions. The procedures were counterbalanced using a Latin square.

Standard BO.
For the BO procedure, we had 5 initial random samplings and 5 optimization iterations. These values were determined from a 4-participant pilot test. To ensure a fair comparison, BO optimizes for the same weighted-sum objective obtained in Step 2, which TAF+ also optimizes for. Other hyperparameter settings are the same as in population modeling.

Manual calibration.
Different from meta-BO and BO, the users adjust the weight settings based on their perception, unrelated to the weighted-sum objective. For absolute pointing, an established way to calibrate for similar interactions is to ask the users to indicate their preferred operational range, and then map the maximum detected values to the boundary of the interface (e.g., [28,29]). We followed the same approach, mapping the user's comfortable horizontal (yaw) and vertical (pitch) forearm motion range to the scene boundaries along x and y.
For relative pointing, we first determined the angular correction parameter for each user by using the radial-ulnar deviation of the user (corresponding to vertical cursor motion), since it has a smaller range than flexion-extension. The users were asked to perform radial-ulnar deviation to move the cursor back and forth between a top and a bottom target (Figure 6b). The difference between the recorded direction and the ideal vertical direction determined the correction parameter. Next, we calibrated the velocity function. Since 3 controls the overall scale of the transfer function, we viewed this parameter as the most critical parameter. Users did 12 random target selections during which they were allowed to adjust 3 by using a slider as many times as they wanted, similar to the sensitivity tuning of a computer mouse. As for 1 and 2 , we used TAF+ to propose the 1 and 2 values assuming there is no adaptation model; i.e., these are the best suggestions based on the population data. The values are 0.914 ( 1 ) and 1.438 ( 2 ). Unlike meta-BO and BO, manual calibration occurs before the data gathering. Once manual calibration was complete, the parameter values were fixed and did not change throughout the iterations.

Study procedure
Similar to step 1, participants performed relative pointing first, followed by absolute pointing. Instructions were the same as before. Different from step 1, each procedure only had 10 iterations. Since there were 3 procedures (meta-BO, BO, and Manual) and 10 iterations for each procedure, every pointing interaction had 30 iterations in total. After each procedure, we asked the participants to fill out the NASA-TLX questionnaire to assess their subjective workload. Afterward, we conducted an open-ended interview to understand the participants' experience.

Results
The weighted-sum performances at each iteration for both pointing interactions are shown in Figure 8.
A simple main effects analysis found a significant difference between the procedures (F(2, 30) = 1.652, p = 0.027). Pairwise comparisons showed significant differences between meta-BO and BO, and between meta-BO and Manual (both p < 0.05), which indicated that meta-BO resulted in higher overall performances than the other two procedures. Overall, meta-BO enables performances that are on average 22.92% and 21.35% higher than BO and manual calibration across the 10 iterations. The simple main effects analysis also showed a significant difference between the iterations (F(9, 100) = 2.171, p < 0.001).
Another pairwise comparison between the iterations within each procedure found that for the meta-BO procedure, the performance did not significantly improve beyond iteration 6. On the other hand, BO still made significant improvements up to iteration 9. This indicates that meta-BO converges faster to optimal performance. We conducted a pairwise comparison between procedures for each iteration (Figure 8), which shows that meta-BO outperforms BO and Manual in several iterations. The detailed numbers for Figure 8 are provided in Appendix E.

Relative pointing:
A 2-way repeated-measures ANOVA showed no statistically significant interaction between procedure and iteration (F(1.922, 29.403) = 1.623, p = 0.188). A simple main effects analysis found a significant difference between the procedures (F(2, 30) = 7.231, p = 0.005). Pairwise comparisons showed significant differences between the meta-BO procedure and the BO procedure (p = 0.006) and between the meta-BO procedure and the Manual procedure (p = 0.033), which indicated that meta-BO resulted in higher overall performances than the other two procedures. Averaging the performance of 10 iterations, meta-BO enables 25.43% and 13.60% better performances than BO and Manual in relative pointing. A simple main effects analysis also showed a significant difference between the iterations (F(9, 100) = 44.795, p < 0.001). Pairwise comparisons between the iterations within each procedure showed that for meta-BO, the performance did not significantly improve after iteration 6. Meanwhile, BO improved up to iteration 7. Similar to absolute pointing, meta-BO converges faster to optimal performance. We conducted a pairwise comparison between procedures for each iteration (Figure 8), which shows that meta-BO outperforms BO and Manual in several iterations.

Other analyses:
We found only one significant difference in the NASA-TLX questions: for absolute pointing, both meta-BO and BO led to significantly lower frustration than the Manual procedure, indicating that meta-BO delivers better or comparable user experiences. This is mainly because the users needed to invest more effort in the calibration process, but it did not lead to a better experience. From the interviews, we further learned that the participants' experience is influenced more by the Index of Difficulty of the targets than by the transfer function. More detailed analyses on perceived workload and user experience are presented in Appendix F. We further analyzed the individual metrics derived from the three procedures for both interactions and found that meta-BO generally led to comparable or significantly better performances than the baselines. Please refer to Appendix G for more details and the plots of the individual metrics. Finally, each user's performance is separately presented in Appendix H, and we visualized the objective function of absolute pointing in Appendix I.

Findings and discussion
Overall, we found that meta-BO allowed for significantly better performances than standard BO and manual calibration. For absolute pointing, standard BO started with a lower initial performance, which is not surprising since it was not informed by prior population models. Standard BO further took more time to converge in both absolute and relative pointing. Even though meta-BO continued to improve until iteration 6, we can see that it reached near-peak performance by iteration 3, which translates to just 18 selections. Thus, meta-BO overcomes the slow start and slow convergence issues of standard BO for our tasks.
With manual calibration, since the participant calibrates it in the beginning according to their preference, the expectation would be that it will perform well at least in the beginning. While there are a few indications that the manual calibration procedure may have marginally better performance than BO, the difference is not significant. Thus, manual calibration does not guarantee an excellent result. This could be because users may not fine-tune to the extent of finding the optimal design. Even though the manual calibration parameters did not change, its performance improved over time, presumably because the participants were adapting over time. However, despite user adaptation, meta-BO consistently outperforms manual calibration for iterations 3-8. Given that meta-BO as an online procedure can match explicit manual calibration in the beginning and outperform it in intermediary trials, it is a viable candidate to replace explicit manual calibration procedures in real-world deployments.

Figure 8: The performance (weighted-sum objective value over multiple objective functions) in 10 iterations. We have normalized all the objective functions into the range of [0, 1]. The error bar shows a 95% confidence interval. The red * (meta-BO and BO), the black * (meta-BO and Manual), and the blue * (Manual and BO) signs denote a significant difference between two procedures at that iteration. Note that we show "the best" performance reached from the beginning to each iteration. Since we are comparing the optimal performance given the same number of iterations, this is the conventional way of showing and comparing the performance. Also note that we have converted it to a maximization problem (see subsection 5.3). The detailed numbers of this figure are presented in Appendix E.

DISCUSSION
In this work, we propose a novel HitL technique for rapid, online, personalized parametric optimization for wrist-based input. Adapting or optimizing transfer functions is known to be challenging; wrist-based interaction adds further difficulties due to wearability and posture factors. We tackled the two most representative and distinct wrist-based input interactions. With just 14 users for population modeling, meta-BO outperforms the existing manual calibration and standard BO approaches for new users. This demonstrates the specific utility of meta-BO for interactions that benefit from personalized parametric settings. Meta-BO can eliminate dedicated calibration routines for wearable device interactions and help users rapidly attain optimal settings. Given the results from two different pointing applications and a series of simulations, and the fact that meta-BO does not make any overbearing assumptions (subsubsection 4.1.3), this work provides a meta-BO workflow that HCI researchers and practitioners can apply to their own applications and other problem contexts. We also showed how TAF+ extended the TAF approach for HitL applications. This involved deriving optimal objective weights based on users' subjective ratings and configuring decay hyperparameters through simulations to balance the population and adaptation models. The optimal weight derivation occurs after population modeling in our workflow, allowing greater flexibility to tune the objective weights when needed. There are multiple open questions and limitations that future work can address.
Encountering new users: Meta-BO utilizes the similarities between prior users and new users to converge to optimal settings faster. To account for a new user who is drastically different, and for the potential of negative transfer, we introduced decay parameters that let the optimization proactively rely on the observations from the new user. Our results show that with only 14 population models, meta-BO converges faster on new users on average. Further, we analyzed the individual users in Appendix H and found P5 (Table 14) to be one example of a user whose optimal parameters are drastically different from those of others. As P5's performance plotted in Figure 25 shows, meta-BO is still able to converge to a performance comparable to the baselines. It also shows the benefits of having a diverse initial user set. Of course, when a user behaves even more extremely, such as wearing the device completely wrongly, or when their optimal parameter setting lies beyond the parameter range, TAF+ cannot adapt. To address this challenge, future work could consider developing mechanisms to diagnose a new user's performance in real time and switch to standard BO or manual optimization.
Enhancing efficiency and scalability, and determining the objective weights: To enable more efficient calibration of wrist-based interactions, it is worth exploring more advanced normalization techniques to determine the quality of a setting with fewer selections. In addition, although TAF/TAF+ is relatively lightweight, its computation cost increases linearly with the number of models, which makes it difficult to scale up the population models. This issue could be mitigated by calculating the per-model acquisition values and model weights in parallel. Further, the current approach of deriving optimal objective weights from the subjective ratings of the population may not be suitable for all new users. Future research could investigate deriving each individual user's optimal objective weights from user feedback during adaptation; preferential Bayesian optimization [30] may be incorporated in this direction.
Potential co-adaptation: In the evaluation of TAF+, there were only a few observations in each adaptation procedure, and each parameter setting was evaluated only once, making it hard to detect user learning. Future work should consider developing methods that have the user revisit certain parameter settings to better estimate user learning. More advanced computational methods are also needed to infer the user's learning.
Multi-objective TAF: Meta-BO is an emerging topic with many open research questions and opportunities, and the field of HCI can benefit highly from it. In addition to applying TAF+ to other applications, future research can investigate other meta-BO methods for HitL optimization. One potential direction is extending TAF to multi-objective tasks by changing the base acquisition function to Expected Hypervolume Increase (EHVI). We ran a simulation with this approach using the population models, and the results showed that TAF outperformed standard multi-objective BO, as plotted in Appendix J. However, multi-objective TAF would require the end-users to engage with extreme designs that heavily prioritize one objective while neglecting others. Further, one would need a heuristic to select one design from the Pareto frontier, which may or may not be the one preferred by the user. Alternatively, the user would need to manually determine one final setting among the Pareto frontier through trials, which introduces more effort for the users. Finally, the time required for computing the multi-objective TAF to yield the next parameter setting during adaptation is massive for each iteration, because the computation complexity is the total number of models multiplied by the complexity of computing EHVI from one model. Thus, it would be unsuitable for online adaptations, which require a fast turn-around time.

CONCLUSION
In this paper, we present an online, fast-converging parametric optimization procedure through meta-Bayesian optimization. We introduce a novel meta-BO algorithm, TAF+, and a tailored workflow to meet the unique requirements of human-in-the-loop problems. We apply TAF+ and its workflow to two distinct wrist-based interactions. The positive results and the outcome of each step showcase the effectiveness and efficiency of meta-BO compared to conventional calibration procedures and state-of-the-art BO. Calibration is a common practice among HCI practitioners and researchers, and calibration procedures are typically created by developers or designers. Crafting an effective calibration procedure is challenging and often specific to a particular device or interaction. Moreover, our study indicated that manual calibration does not always yield optimal results. Our success in wrist-based pointing demonstrates that meta-BO holds significant potential as a general calibration method across various interactions and devices, eliminating the difficulties associated with designing calibration procedures and achieving a better user experience. We encourage future research to explore the application of meta-BO for optimizing parameter settings across diverse applications, both within HCI and beyond. We anticipate this work will pave the way for more personalized and adaptive user interfaces.

A DETAILS OF OBJECTIVE WEIGHTS OPTIMIZATION

A.1 A demonstrative example of objective weights optimization with two users
To explain the need for the across-user objective weight optimization approach in step 2 of our workflow, we provide an example where two users are involved. User A's and User B's Pareto-optimal performances are presented in Table 4 and Table 5, respectively. Examining only User A's profile (Table 4), design B has the highest user rating, and its third objective has the highest value. Therefore, setting the weights to [0.1, 0.1, 0.8] is a straightforward configuration that leads to the highest user rating. On the other hand, examining only User B's profile (Table 5), design A has the highest user rating, and its first objective has the highest value, so one could settle on the weights [0.8, 0.1, 0.1]. However, at the group level, applying either [0.1, 0.1, 0.8] or [0.8, 0.1, 0.1] to both users at the same time would leave the other user with a suboptimal design (i.e., a design not associated with the highest user rating). In contrast, with an appropriate search, setting the objective weights to [0.4, 0.2, 0.4] would allow both users to end up with the designs that have the highest ratings. With more users taken into consideration, it becomes increasingly challenging to directly see which objective weight configuration is optimal for the group. Hence, a grid search is an easier and more principled approach.

A.2 Algorithm for Objective Weight Optimization
Algorithm 1 presents the details of objective weight optimization.
Table 5: The table shows a set of Pareto-optimal designs in a three-objective optimization problem with an example User B. Intuitively, setting the objective weights to [0.8, 0.1, 0.1] would lead to design A, which has the highest user rating (100). However, this weight setting is not optimal for the other user (presented in Table 4). While not intuitive, setting the objective weights to [0.4, 0.2, 0.4] leads to the designs with the highest ratings for both.
Algorithm 1 Optimize objective weight configuration based on user ratings.

Inputs:
  $j \in [1, J]$: there are $J$ users in total; $j$ stands for the $j$-th user among all users.
  $N_j$: the total number of Pareto-optimal designs for the $j$-th user. For instance, the 5th user tries out 10 parameter settings, and 3 of them result in Pareto-optimal performance; then $N_5 = 3$.
  $i \in [1, N_j]$: denotes the $j$-th user's $i$-th Pareto-optimal design.
  $f_{j,i}[1..M]$: the $j$-th user's $i$-th Pareto-optimal objective value set, which contains $M$ objective functions.
  $r_{j,i}$: the $j$-th user's rating for the $i$-th Pareto-optimal design.
Outputs: optimal_weight$[1..M]$, an optimal weight setting for the $M$ objectives.
Initialize: weights ← all possible non-zero weight combinations to the first decimal point.

for $j \in [1, J]$ do
    $r_{\max} \leftarrow \max_i r_{j,i}$ ⊲ Get the maximum rating of user $j$.
    $r_{\min} \leftarrow \min_i r_{j,i}$ ⊲ Get the minimum rating of user $j$.
    for $i \in [1, N_j]$ do ⊲ Normalize the user rating to be in the range of [1, 100].
        $r_{j,i} \leftarrow 1 + 99 \cdot (r_{j,i} - r_{\min}) / (r_{\max} - r_{\min})$
    end for
end for
for weight ∈ weights do ⊲ Start the actual optimization. Loop over all the weight settings.
    score_of_this_weight = 0
    for $j \in [1, J]$ do ⊲ Loop over every user.
        $i^* \leftarrow \arg\max_i \sum_{m=1}^{M}$ weight$[m] \cdot f_{j,i}[m]$
        score_of_this_weight ← score_of_this_weight + $r_{j,i^*}$
    end for
end for
optimal_weight ← the weight with the highest score_of_this_weight ⊲ Get the optimal weight setting.
return optimal_weight ⊲ Return the optimal weight setting.
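The grid search of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the linear rescaling of ratings to [1, 100], the rule of crediting each weight setting with the normalized rating of the design it would select (the one with the highest weighted sum), and the two-user data are all hypothetical choices made for illustration.

```python
from itertools import product


def weight_grid(num_objectives, step=0.1):
    """All weight vectors with non-zero entries on a `step` grid that sum to 1."""
    ticks = round(1 / step)
    return [tuple(c / ticks for c in combo)
            for combo in product(range(1, ticks), repeat=num_objectives)
            if sum(combo) == ticks]


def normalize_ratings(ratings, low=1.0, high=100.0):
    """Linearly rescale one user's ratings into [low, high] (assumed formula)."""
    r_min, r_max = min(ratings), max(ratings)
    if r_max == r_min:
        return [high for _ in ratings]
    return [low + (high - low) * (r - r_min) / (r_max - r_min) for r in ratings]


def optimize_weights(users):
    """users: list of (pareto_objective_vectors, ratings), one entry per user."""
    num_objectives = len(users[0][0][0])
    normalized = [normalize_ratings(ratings) for _, ratings in users]
    best_weight, best_score = None, float("-inf")
    for weight in weight_grid(num_objectives):
        score = 0.0
        for (designs, _), norm in zip(users, normalized):
            sums = [sum(w * f for w, f in zip(weight, d)) for d in designs]
            chosen = max(range(len(designs)), key=sums.__getitem__)
            score += norm[chosen]  # rating of the design this weight selects
        if score > best_score:
            best_weight, best_score = weight, score
    return best_weight, best_score


# Hypothetical Pareto-optimal designs (three objectives) and ratings.
USER_A = ([(0.9, 0.5, 0.2), (0.3, 0.5, 0.9), (0.6, 0.6, 0.5)], [40, 90, 60])
USER_B = ([(0.9, 0.4, 0.3), (0.2, 0.5, 0.8), (0.5, 0.6, 0.5)], [100, 50, 70])

weight, score = optimize_weights([USER_A, USER_B])
print(weight, score)
```

With this made-up data, a weight setting exists under which both users receive their top-rated design, mirroring the two-user example in A.1; which grid point is returned depends on iteration order when several settings tie.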
Table 6: The parameters of the base function used in our Simulation 1 and Simulation 2. For each parameter, the table lists the corresponding objective functions and the parameter's range.

B EVALUATING TAF+'S VIABILITY VIA SIMULATIONS ON SYNTHETIC FUNCTIONS
We evaluate the performance of our proposed TAF+ in multi-objective tasks through simulations with commonly used testing functions. For the evaluation of TAF in single-objective problems, please refer to Wistuba et al. [100] and Volpp et al. [93]. Here, we aim to address four goals through these simulations:
• Simulation 1: Evaluating the robustness of TAF+ under diverse objective weight configurations.
• Simulation 2: Validating the effectiveness of TAF+ under varying levels of user similarity.
• Simulation 3: Investigating the generalizability of TAF+ across different testing functions.
• Simulation 4: Investigating the performance of TAF+ with different numbers of population models.

B.1 Base function for simulation 1 and simulation 2
In Simulations 1 and 2, we utilize the same base function, which has four design parameters and three objective functions. The number of parameters and functions aligns with our relative pointing interaction. Specifically, we have adopted the 2-dimensional Sphere function as our choice for objective functions. In our Sphere function, $x$ represents the selected value of a design parameter, and $c$ denotes the center of the Sphere function. This Sphere function has its maximum objective value ($f = 1$) when the selected $x$ is right at the center position; i.e., $(x_1, x_2) = (c_1, c_2)$. As the parameter values deviate from this central point, the objective values gradually decrease. We incorporate an additional coefficient, set to 8, to accelerate the rate of decay in the function. We chose the Sphere function for two reasons. Firstly, it effectively simulates the real-world scenario where a user's performance gradually decreases as they deviate from the optimal design parameter setting. Secondly, the Sphere function is widely recognized and used for similar optimization evaluations.
We then utilize the Sphere function to construct the base function, which contains four parameters ($x_1$, $x_2$, $x_3$, $x_4$) and three objective functions ($f_1$, $f_2$, $f_3$). Details are provided in Table 6. We first create three distinct Sphere functions, each having a unique center position ($c_1$, $c_2$). Consequently, there is no single parameter setting that maximizes all three Sphere functions simultaneously. Furthermore, the first two parameters in the base function ($x_1$, $x_2$) contribute to the first Sphere function, ($x_2$, $x_3$) contribute to the second Sphere function, and ($x_3$, $x_4$) contribute to the third Sphere function. Parameters $x_2$ and $x_3$ are each shared by two Sphere functions. This decision was made to introduce trade-off scenarios: there exists no single $x_2$ or $x_3$ value that can optimize both of the Sphere functions it contributes to. Thus, when performing multi-objective optimization, the outcome comprises a series of Pareto-optimal values rather than a single optimal value, which simulates the trade-offs in real-world design challenges. The range of the parameters is [0, 1]. Finally, when obtaining the output $f(x)$ from the Sphere functions, we add a noise value to mimic humans' noisy performance. The noise value is sampled from a Gaussian distribution, denoted as $\epsilon \sim \mathcal{N}(\mu, \sigma^2)$, where $\mu = 0$ and $\sigma = 0.05$.
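The construction of the base function can be sketched as follows. This is only an illustration under assumptions: the inverted quadratic form of the Sphere function is one plausible reading of the description (maximum of 1 at the center, decay coefficient of 8) and may differ from the expression actually used, and the center positions are placeholder values that are not taken from the paper.

```python
import random

DECAY = 8.0  # decay coefficient reported in the text


def sphere(x1, x2, center, decay=DECAY):
    """Inverted 2-D sphere: 1 at `center`, decreasing with distance (assumed form)."""
    c1, c2 = center
    return 1.0 - decay * ((x1 - c1) ** 2 + (x2 - c2) ** 2)


# Placeholder centers (hypothetical). The shared parameters x2 and x3 have
# different preferred values in the two Sphere functions they contribute to.
CENTERS = [(0.3, 0.3), (0.6, 0.4), (0.5, 0.7)]


def base_function(params, noise_sd=0.05, rng=random):
    """Map four parameters to three noisy objective values."""
    x1, x2, x3, x4 = params
    pairs = [(x1, x2), (x2, x3), (x3, x4)]  # x2 and x3 are shared
    return [sphere(a, b, c) + rng.gauss(0.0, noise_sd)
            for (a, b), c in zip(pairs, CENTERS)]


print(base_function((0.3, 0.3, 0.4, 0.7)))
```

Setting `noise_sd=0.0` gives the noise-free objective values, which is convenient for checking that no single parameter setting maximizes all three objectives at once.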

B.2 Generating synthetic users and user groups
In our simulations, we shift and scale the base function to create a group of testing functions, which is a common approach [24]. We also refer to these generated testing functions as user functions or synthetic users. Each created user function can be seen as a unique user, as it has its own set of Pareto-optimal parameter settings and the corresponding objective values. More specifically, we shift every parameter value in the base function ($x_1$, $x_2$, $x_3$, $x_4$) by a different amount. Figure 9(a) shows an example. The magnitudes of these shifts are sampled from a uniform distribution within given ranges.

We can formally denote this shifting as $x_i = x'_i + \delta_i$, where $x'_i$ is the original parameter value, $x_i$ is the shifted value, $i \in \{1, \dots, 4\}$ (each $i$ represents a parameter), and the shift for each parameter is $\delta_i \sim U(-\text{shift\_range}, \text{shift\_range})$.
In addition to shifting the base function, we also scale the objective function values to further generate diversity. A scalar is directly multiplied by the objective value; see Figure 9(b) for an example. The scalar is drawn from a uniform distribution centered at 1. We denote this function scalar as $S \sim U(1 - \text{scale\_range}, 1 + \text{scale\_range})$. When the scalar ($S$) exceeds 1, the maximum value of the function goes beyond 1. Conversely, if $S$ is less than 1, it reduces the function output value. This scaling simulates the different levels of performance exhibited by different users.
To simulate different scenarios where the user groups have various levels of similarity, we sample the function shift ($\delta$) and scalar ($S$) from different ranges (shift_range and scale_range). In the simulations below, we deployed different values of shift_range and scale_range. A small range naturally generates a group of functions with high similarity, simulating a highly similar user group. An example is illustrated in Figure 9(c). A large range leads to a group of highly diverse functions, mimicking a group of very diverse users. See Figure 9(d) for an example.
In each of the following simulations, we randomly sample 20 shift ($\delta$) and scaling ($S$) values to create 20 distinct synthetic users. Among these 20 synthetic users, 10 are designated as "population users," and we perform population modeling on them. The remaining 10 functions are considered "new users." TAF+ is subsequently applied to these new synthetic users.
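The shift-and-scale procedure for generating synthetic users can be sketched as below. This is a minimal illustration, assuming a noise-free stand-in base function; the names `make_synthetic_user` and `toy_base`, the seed, and the sampling range of 0.2 are hypothetical choices for the example.

```python
import random


def toy_base(params):
    """Stand-in base function with three objectives (hypothetical, noise-free)."""
    x1, x2, x3, x4 = params
    return [1.0 - (x1 - 0.3) ** 2 - (x2 - 0.3) ** 2,
            1.0 - (x2 - 0.6) ** 2 - (x3 - 0.4) ** 2,
            1.0 - (x3 - 0.5) ** 2 - (x4 - 0.7) ** 2]


def make_synthetic_user(base_fn, shift_range, scale_range, num_params=4, rng=random):
    """Create one user function by shifting the inputs and scaling the outputs."""
    shifts = [rng.uniform(-shift_range, shift_range) for _ in range(num_params)]
    scale = rng.uniform(1.0 - scale_range, 1.0 + scale_range)

    def user_fn(params):
        shifted = [p + d for p, d in zip(params, shifts)]
        return [scale * value for value in base_fn(shifted)]

    return user_fn


rng = random.Random(0)  # fixed seed so the sketch is reproducible
users = [make_synthetic_user(toy_base, 0.2, 0.2, rng=rng) for _ in range(20)]
population_users, new_users = users[:10], users[10:]
print(population_users[0]((0.5, 0.5, 0.5, 0.5)))
```

A larger sampling range spreads the users' optima and performance levels further apart, which is how the simulations vary the similarity within a user group.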

B.3 Simulation 1: Validating the robustness of TAF+ under various objective weight settings
One important feature of TAF+ is that it allows designers or developers to dynamically assign the objective weights when performing weighted-sum optimization in the adaptation phase. Simulation 1 aims to investigate whether TAF+ can maintain stable performance across varying objective weights. Note that the three Sphere functions in our base function have distinct optimal parameter values, and two parameters ($x_2$, $x_3$) are each shared by two Sphere functions. Thus, altering the objective weights in weighted-sum optimization results in different sets of optimal parameter settings. Since our focus is on assessing the impact of objective weight assignments, we intentionally chose the smallest sampling range to create the user group: both shift_range and scale_range are set to 0.01.
Step 1: Population modeling via user data collection. Following our workflow (see subsection 4.2), we performed multi-objective BO to construct 10 population models. The progress is plotted in Figure 10.

B.4 Simulation 2: Validating the effectiveness of TAF+ under different user group similarities
In the second simulation, we aim to understand the performance of TAF+ when dealing with different levels of similarity in the user group. To that end, we modify the sampling ranges of shifting (shift_range) and scaling (scale_range) when generating the testing functions. Different sampling ranges result in functions with different levels of similarity. The sampling ranges in this simulation are 0.05, 0.1, 0.2, and 0.3. In particular, a shifting range of 0.3 can create a huge diversity, given that the whole parameter range is only 1. Similar to Simulation 1, we sampled 20 functions for each sampling range; 10 serve as population users, and the remaining 10 serve as new users.
Step 1: Population modeling via user data collection. Similar to the previous simulation, for each sampling range, we performed multi-objective BO to construct 10 population models. The resulting hypervolumes are plotted in Figure 14 and Figure 15. Despite the increasingly higher variations between functions (indicated by the error bars), multi-objective BO can effectively explore the Pareto frontier for each group.
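To make the Pareto frontier and hypervolume notions concrete, the sketch below filters non-dominated points and computes the dominated area for a two-objective maximization problem. It is a simplified illustration only: the simulations use three objectives, for which the hypervolume computation is more involved, and the function names, reference point, and sample points here are hypothetical.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep only the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points, reference=(0.0, 0.0)):
    """Area dominated by the front and bounded below by `reference`.

    Assumes two objectives and that every point lies above the reference.
    """
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, previous_y = 0.0, reference[1]
    for x, y in front:
        # Each front point adds a horizontal strip above the previous one.
        area += (x - reference[0]) * (y - previous_y)
        previous_y = y
    return area


observed = [(1.0, 0.2), (0.6, 0.6), (0.2, 1.0), (0.5, 0.5)]
print(pareto_front(observed))
print(hypervolume_2d(observed))
```

A growing hypervolume over iterations indicates that the optimizer keeps finding designs that extend the frontier.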
Adaptation: Evaluating TAF+ on new users. We applied TAF+ to 10 new user functions. Standard BO and random sampling are set as baselines. The resulting performances of TAF+ are presented in Figure 16 and Figure 17.
Findings. TAF+ overall performs well across different sampling ranges. Upon closer examination, we noted that greater diversity, such as sampling from ranges of 0.2 or 0.3, results in a lower starting performance for TAF+. This outcome is natural, as higher sampling ranges introduce potentially larger differences between the population users and the new users, necessitating more iterations for TAF+ to adapt to the new user functions. However, despite the lower starting point, TAF+ still converges to the maximum objective value within 10 iterations. This highlights TAF+'s potential even when applied to an interaction where users exhibit high diversity in their preferences and performances.

B.5 Simulation 3: Investigating the generalizability of TAF+ across different testing functions
From the previous two simulations, we demonstrated the consistent performance of TAF+ when facing the three-sphere base function. In the third simulation, we investigate the efficacy of TAF+ when facing other, potentially more challenging, base functions. These base functions simulate different interactions. There are three base functions in this simulation:
• Relocated Spheres: This base function is similar to the one used in the previous simulations and contains 3 Sphere functions. However, the Spheres here have different center locations that are further apart.
• Spheres + Branin.
• Sphere + Branin + Hartmann 3D: This base function is more complicated still. It contains a Sphere, a Branin, and a Hartmann 3D function. Hartmann 3D takes 3 parameters as input and is also a widely recognized testing function. Note that the second parameter ($x_2$) in this base function is shared by all three functions, which adds more complexity to this task. Details of this function are in Table 9.

B.6 Simulation 4: Investigating the performance of TAF+ with different numbers of population models
In this simulation, we aim to evaluate the performance of TAF+ when it has various numbers of population models. We utilized the 3-Sphere base function, the same as in Simulations 1 and 2 (see Table 6).
The sampling ranges (shift_range and scale_range) are set to 0.2, and the results are presented in Figure 20. We do not present the details of population modeling and decay hyperparameter setting, as they are the same as in Simulations 1 and 2.
Adaptation: Evaluating TAF+ with different numbers of population models. The resulting performance over iterations is shown in Figure 20. Overall, with more population models, TAF+ potentially has a better starting performance. However, regardless of the number of population models, with sufficient iterations and a proper decay hyperparameter configuration, TAF+ can still converge toward optimality.

B.7 Overall findings of the simulations
The above simulations collectively demonstrate the robustness of TAF+ under diverse conditions, highlighting its adaptability and generalizability. Simulation 1 examines TAF+ under varying objective weights, and the positive results highlight the flexibility of adjusting the objective weights when deploying TAF+. This flexibility is particularly useful when conducting weighted-sum optimization for real-world interactions, where designers may need to fine-tune the weighting of objectives. Simulation 2 assesses TAF+'s performance under different levels of user diversity, and the results show TAF+'s effectiveness with diverse user groups, highlighting its ability to adapt to different user performances and preferences. Simulation 3 validates TAF+'s generalizability across a variety of base functions. Despite the increasing difficulty of the functions, TAF+ retains promising performance, whereas standard BO may not be able to converge efficiently. Simulation 4 shows the impact of different numbers of population models. While fewer population models may lead to a lower initial performance, TAF+ can still converge more efficiently than standard BO. To summarize, the results and findings provide evidence that TAF+ is adaptable and capable of handling a wide spectrum of problems with different levels of difficulty and different conditions.
Finally, to further assess the computation complexity, the first 3 simulations with 10 population models were run on a ThinkPad X1 Carbon with Ubuntu 22.04.3 OS and an Intel i7-10150U CPU. Throughout these simulations, TAF+'s computation time for each iteration was 6.17 seconds (s.d. = 1.17).

C HYPERVOLUME INCREASE IN POPULATION MODELING
Figure 21 shows the hypervolume of both interactions at each iteration during the population modeling.

D EXTENDED SIMULATION FOR THE DECAY HYPERPARAMETER OPTIMIZATION
To validate the convergence of the optimization methods, we extended the decay hyperparameter optimization.Figure 22 shows the extended simulation for the decay hyperparameter optimization.
The result shows that TAF+, TAF, and BO all converge near the global max given enough iterations.

Figure 22: Resulting performance under simulation for absolute pointing and relative pointing (x-axis: iteration number; y-axis: net performance over multiple objectives). The max performance was derived by a grid search.

E DETAILS OF THE EVALUATION
Here, we provide the detailed numbers of the evaluation, which complement Figure 8. Table 10 and Table 11 list the means and standard deviations of weighted-sum performance in absolute pointing and relative pointing, respectively. We also found that the overall performance is better than the performance estimated in step 3 (decay hyperparameter optimization; see Figure 7). We examined the performance gathered in step 1 (population modeling) and compared it to the performance gathered in adaptation. For each user, we first derived the highest weighted-sum performance, and then we compared the optimal performance of the population users and the users in the adaptation phase. With a t-test, we found a significant difference (t(23) = −3.576, p = 0.02) between the population user group (mean = 72.87, s.d. = 0.114) and the adaptation user group (mean = 0.87, s.d. = 0.11). The difference in performance levels may cause the overall difference in absolute pointing between Figure 7 and Figure 8. The positive result in Figure 8 also validates the efficacy of TAF+ when the users have different levels of performance.

F ANALYSIS ON THE PERCEIVED WORKLOAD AND USER EXPERIENCE
We present the detailed results of the NASA-TLX questionnaire in Table 12 and Table 13 for absolute pointing and relative pointing, respectively. We performed one-way repeated-measures ANOVA with Greenhouse-Geisser correction on each question individually. The only statistically significant difference was found in the 6th question, which is related to perceived frustration ("How insecure, discouraged, irritated, stressed, and annoyed were you?"), in absolute pointing (F(1.26) = 6.735, p < 0.05). Pairwise comparison with Bonferroni correction showed that both meta-BO and BO led to significantly lower frustration levels than Manual (both p < 0.05). In the interviews, most participants noted that manual calibration is time-consuming and that its failure to bring a better experience is frustrating and unwanted. We did not find significant differences for the remaining questions in either interaction. Overall, these results indicate that meta-BO delivers a comparable or better experience to the users than the other procedures.

F.1 User experience
Based on the user interviews, we learned that most users felt both TAF+ and BO gradually proposed improving designs. Interestingly, six users felt BO proposed abrupt parameter settings. P1 mentioned, "[...] this round (BO) has more noticeable changes. It suddenly changed speeds (the transfer function) [...] It was a bit unexpected." P5 made a similar remark, "It feels a bit more unstable than the previous ones (TAF+ and Manual). Overall, it still improves, but sometimes behaves strangely." Such abrupt changes were not mentioned with the TAF+ and Manual procedures.
While the performance differences between the techniques are clear, participants found it somewhat difficult to state which technique adapted better and faster. Our experiment involves selecting targets of different distances and sizes, which further adds to the challenge of differentiating the techniques subjectively when the user is focused on the task and not on perceiving the differences. Prior work that analyzed user experience in pointing interactions faced similar challenges. For example, Casiez and Roussel [11] investigated the impact of various transfer functions. They summarized that the users could not tell the differences between different transfer functions despite the significant differences in performance. Casiez et al. [14] conducted a thorough investigation of the impact of different types of transfer functions, focusing only on performance metrics. Lee et al. [52] compared various transfer function adaptation techniques and found few significant differences when analyzing NASA-TLX scores.

G PERFORMANCE OF EACH PERFORMANCE METRIC
Figure 23 and Figure 24 show the individual performance metrics throughout 10 iterations for absolute pointing and relative pointing, respectively. We ran one-way repeated-measures ANOVA on each iteration across all metrics and found that the meta-BO procedure led to either comparable or significantly better results than the other baselines.
For normalized completion time in absolute pointing: meta-BO outperformed BO (p < 0.01) at the third, sixth, and eighth iterations; meta-BO also outperformed Manual at the fourth iteration. For aiming error and trajectory travel distance in absolute pointing: meta-BO outperformed BO (p < 0.01) at the third iteration, and also outperformed Manual at the fourth iteration.
For normalized completion time in relative pointing: meta-BO outperformed BO (p < 0.01) at the third, sixth, and seventh iterations; meta-BO also outperformed Manual at the fourth iteration. For trajectory travel distance in relative pointing: meta-BO outperformed both BO and Manual (p < 0.01) at the third iteration. These significant differences are also marked in Figure 23 and Figure 24. The red * (meta-BO and BO) and the black * (meta-BO and Manual) signs denote a significant difference between the two procedures at that iteration.

H PERFORMANCES AND OPTIMAL PARAMETER SETTINGS OF EACH USER
Figure 25 shows the performance of each user in absolute pointing, and Table 14 presents all the users' optimal parameter settings for absolute pointing. 10 users start with better performance with meta-BO than with the other baselines. Only P5 had a lower initial performance; still, the performance of meta-BO quickly increased after the first 2 iterations for this user. Upon examining the optimal parameter setting of each user (Table 14), we can see that most users' optimal settings have relatively high values for two of the parameters, which is aligned with Figure 27. P5 is exceptional since their optimal parameter setting is drastically different from the others, which explains their lower initial performance in Figure 25 (also denoted as P5). However, TAF+ can still identify an optimal setting for P5 that is different from those of other participants, highlighting its capacity to deal with drastically different new users.
Figure 26 shows the performance of each user in relative pointing, and Table 15 shows the optimal parameter settings for relative pointing of each user. In the first iteration, 4 users have higher performance with meta-BO than with the other procedures. Later, at the 3rd iteration, meta-BO results in the best performance for 8 users, which highlights TAF+'s strength in faster adaptation. Table 15 further shows that there exist good parameter ranges for most users. For instance, the best range for two of the parameters is around 0.7 to 1. However, P4's optimal setting has a particularly low value for one parameter, which results in a slightly lower starting performance when using TAF+ (P4 in Figure 26). Again, despite the diversity, TAF+ is able to identify the unique optimal setting, showing its ability to handle the diversity within the user group.

Figure 1: Our absolute pointing interaction using a wrist-worn device.The sensed yaw (green) and pitch (red) values map to the cursor's x and y coordinates.

Figure 3: Overview of the Transfer Acquisition Function (TAF) with an example of 2 population models. TAF is a weighted sum of several acquisition values generated by the currently constructed model and the models gathered in advance, indexed by model. In PHASE 1, population models are constructed per user using optimization data. For a specific parameter value, each model predicts the user performance (red line), the corresponding acquisition value, and the uncertainty of this prediction (the red area). In PHASE 2, a new adaptation model is created for the new user. To derive the acquisition value, TAF computes between-model combination values across all models (the adaptation model and the population models) based on the model weights. The model weights are computed from the variance (width of the red area) of each prediction: predictions with higher uncertainty receive lower weights. The example computes the TAF at a parameter value of 0.7, where the adaptation model has very high uncertainty in early iterations, so the TAF value is mostly determined by the population models. As the adaptation model gains more observations, TAF is gradually dominated by the adaptation model, leading to the user-specific optimal result.

Figure 4: Transfer Acquisition Function+ (TAF+) in an example task with two objectives. Both the adaptation model and the population models generate the predicted performances of the two objectives (red/blue lines), the corresponding acquisition values (indexed by objective and by model), and the uncertainty of each prediction (shown as the red/blue areas). The two major phases (population modeling and adaptation) are identical to TAF (Figure 3). To handle multiple objectives, TAF+ first performs a within-model combination: within each model, it summarizes the acquisition values of the different objectives into a single weighted-sum value and summarizes the weights of the different objectives into a single model weight, both based on the objective weights. Once every model has a summed acquisition value and a model weight, TAF+ performs a between-model combination, similarly to TAF, deriving the final TAF+ value.
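The two combination steps described for TAF+ can be sketched as below for a single candidate parameter setting. This is a loose illustration under explicit assumptions and not the paper's formulation: the inverse-variance weighting, the normalization by the total weight, the way the decay scalar multiplies the population weights, and all names and numbers are hypothetical.

```python
def within_model(acq_values, per_objective_weights, objective_weights):
    """Collapse one model's per-objective values into a single value and weight."""
    acq = sum(ow * a for ow, a in zip(objective_weights, acq_values))
    weight = sum(ow * w for ow, w in zip(objective_weights, per_objective_weights))
    return acq, weight


def taf_plus(adaptation, population, objective_weights, decay=1.0):
    """Combine models at one candidate setting.

    Each model is (acquisition_values, prediction_variances), one entry per
    objective. `decay` in [0, 1] scales the population-model weights.
    """
    def summarize(model):
        acq_values, variances = model
        # Assumption: higher predictive variance means a lower weight.
        weights = [1.0 / v for v in variances]
        return within_model(acq_values, weights, objective_weights)

    acq, weight = summarize(adaptation)
    total, norm = weight * acq, weight
    for model in population:
        acq, weight = summarize(model)
        weight *= decay  # decay = 1 keeps the population weight unchanged
        total += weight * acq
        norm += weight
    return total / norm


# Early iteration: the adaptation model is very uncertain, so the confident
# population model dominates the combined value.
adaptation_model = ([0.2, 0.4], [100.0, 100.0])
population_models = [([1.0, 1.0], [0.1, 0.1])]
print(taf_plus(adaptation_model, population_models, [0.5, 0.5]))
```

Lowering `decay` toward 0 shifts the combined value toward the adaptation model, which is the intended effect when the new user differs strongly from the population.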

CHI '24, May 11-16, 2024, Honolulu, HI, USA. Y.-C. Liao, et al.
4.1.2 Extension 2: Proactively balancing the weights of population models and the adaptation model. The second extension is meant to cope with potentially high user diversity. The decay term in Equation 5 denotes the decay of the weights on the population models. This decay is a scalar value in the range [0, 1], which is directly multiplied by the original weight values. A decay value of 1 means there is no decay.
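The effect of the decay on the model weights can be sketched in a few lines. The text states only that the decay is a scalar in [0, 1] multiplied directly onto the population-model weights; the renormalization step and the function name below are assumptions added for illustration, and how the decay value itself is scheduled over iterations is not shown here.

```python
from typing import List, Sequence, Tuple


def apply_decay(adaptation_weight: float,
                population_weights: Sequence[float],
                decay: float) -> Tuple[float, List[float]]:
    """Scale population-model weights by `decay`, then renormalize (assumed step)."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    scaled = [decay * w for w in population_weights]
    total = adaptation_weight + sum(scaled)
    if total == 0.0:
        raise ValueError("all weights are zero after decay")
    return adaptation_weight / total, [w / total for w in scaled]


# Hypothetical weights: one adaptation model and two population models.
no_decay = apply_decay(0.2, [0.4, 0.4], decay=1.0)    # weights unchanged
half_decay = apply_decay(0.2, [0.4, 0.4], decay=0.5)  # population influence reduced
full_decay = apply_decay(0.2, [0.4, 0.4], decay=0.0)  # adaptation model only
```

A decay of 1 leaves the weights untouched, while a decay of 0 removes the population models from the combination entirely.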

Figure 6: Our study interface. (a) The main study interface. The smaller black dot is the cursor, the red circle is the previous target, and the blue circle is the current target. (b) The interface for calibrating relative pointing. The cursor only moves along the y-axis during calibration when the participant performs radial-ulnar deviation motions. (c) Example transfer functions for relative pointing.

Figure 8: The performance (weighted-sum objective value over multiple objective functions) over 10 iterations. All objective functions are normalized to the range [0, 1]. The error bars show 95% confidence intervals. The red * (meta-BO and BO), the black * (meta-BO and Manual), and the blue * (Manual and BO) signs denote a significant difference between two procedures at that iteration. Note that we show the best performance reached from the beginning up to each iteration. Since we are comparing the optimal performance given the same number of iterations, this is the conventional way of showing and comparing performance. Also note that we have converted the problem to a maximization problem (see subsection 5.3). The detailed numbers for this figure are presented in Appendix E.
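The "best performance reached up to each iteration" convention used in Figure 8 amounts to a running maximum over the per-iteration objective values. The following sketch illustrates it; the function name and the numbers are hypothetical and are not taken from the study data.

```python
from itertools import accumulate
from typing import List, Sequence


def best_so_far(values: Sequence[float]) -> List[float]:
    """Running maximum: best objective value reached up to each iteration."""
    return list(accumulate(values, max))


# Hypothetical normalized weighted-sum values (maximization) over five iterations.
per_iteration = [0.42, 0.55, 0.51, 0.63, 0.60]
curve = best_so_far(per_iteration)
```

The resulting curve never decreases, so a weaker trial at a later iteration does not lower the reported performance.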

Figure 9: Here we use a typical Gaussian distribution to illustrate how we generate a group of testing functions by shifting and scaling a base function: (a) demonstration of shifting a base function, (b) demonstration of scaling a base function, (c) an example user group with a higher similarity, (d) an example user group with higher diversity.
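Generating a group of test functions by shifting and scaling a base function, as illustrated in Figure 9, can be sketched as below. The Gaussian-shaped base function follows the caption; its mean and width, the uniform sampling within plus or minus the given range, and the centering of the scale factor at 1 are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Tuple


def base_function(x: float, mu: float = 0.5, sigma: float = 0.15) -> float:
    """Gaussian-shaped base function with peak value 1 at x = mu (assumed shape)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def make_user_function(shift: float, scale: float) -> Callable[[float], float]:
    """Shift the base function along x and scale its output."""
    def user_function(x: float) -> float:
        return scale * base_function(x - shift)
    return user_function


def sample_user_group(n_users: int, shift_range: float, scale_range: float,
                      seed: int = 0) -> List[Tuple[float, float, Callable[[float], float]]]:
    """Draw one (shift, scale, function) triple per simulated user.

    Small ranges give a group with higher similarity; large ranges give
    a group with higher diversity.
    """
    rng = random.Random(seed)
    group = []
    for _ in range(n_users):
        shift = rng.uniform(-shift_range, shift_range)
        scale = 1.0 + rng.uniform(-scale_range, scale_range)
        group.append((shift, scale, make_user_function(shift, scale)))
    return group


similar_group = sample_user_group(5, shift_range=0.01, scale_range=0.01)
diverse_group = sample_user_group(5, shift_range=0.3, scale_range=0.3)
```

Each simulated user then has an optimum near 0.5 plus that user's shift, with a peak height equal to that user's scale.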

Figure 10: The hypervolume per iteration during the population modeling in Simulation 1. The random condition serves as a baseline. The error bars denote 95% confidence intervals. The global maximum hypervolume was determined by a grid search with no noise involved.
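For readers unfamiliar with the hypervolume indicator reported in Figure 10, the sketch below computes it for the simplified case of two objectives under maximization. The restriction to two objectives, the reference point at the origin, and the example points are assumptions made for illustration; they do not reproduce the simulation setup.

```python
from typing import Iterable, Tuple


def hypervolume_2d(points: Iterable[Tuple[float, float]],
                   ref: Tuple[float, float] = (0.0, 0.0)) -> float:
    """Area dominated by `points` relative to `ref` (two objectives, maximization)."""
    # Keep only points that improve on the reference point in both objectives,
    # then sweep from the largest first objective to the smallest.
    candidates = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: p[0],
        reverse=True,
    )
    area = 0.0
    best_y = ref[1]
    for x, y in candidates:
        if y > best_y:
            # Add the horizontal strip newly covered by this point.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Hypothetical observations: three non-dominated points and one dominated point.
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv_front = hypervolume_2d(front)
hv_with_dominated = hypervolume_2d(front + [(1.0, 1.0)])
```

Adding a dominated point leaves the hypervolume unchanged, which is why the indicator only grows when an iteration finds a design that extends the Pareto front.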

Figure 11: The resulting performance of TAF+ when the objective weights are [1, 0, 0] and [0, 1, 0], and the functions' shifting and scaling (S) in the population modeling are sampled from a range of 0.01. The error bars indicate 95% confidence intervals. The global maximum objective value was determined by a grid search with no noise involved.

Figure 12: The resulting performance of TAF+ when the objective weights are [0, 0, 1] and [0.33, 0.33, 0.34], and the functions' shifting and scaling (S) in the population modeling are sampled from a range of 0.01. The error bars indicate 95% confidence intervals.

Figure 13: The resulting performance of TAF+ when the objective weights are [0.5, 0.3, 0.2] and [0.3, 0.5, 0.2], and the functions' shifting and scaling (S) in the population modeling are sampled from a range of 0.01. The error bars indicate 95% confidence intervals.

Figure 14: The hypervolume per iteration during the population modeling in Simulation 2. The random condition serves as a baseline. The error bars denote 95% confidence intervals, and the maximum volume is determined by a thorough grid search.

Figure 16: The resulting performance of TAF+ when the functions' shifting and scaling (S) are sampled from ranges of 0.05 and 0.1. The error bars indicate 95% confidence intervals.

Figure 18: The resulting performance of TAF+ when the functions' shifting and scaling (S) are sampled from ranges of 0.05 and 0.1. The error bars indicate 95% confidence intervals.

Figure 19: The resulting performance of TAF+ when the functions' shifting and scaling (S) are sampled from ranges of 0.2 and 0.3. The error bars indicate 95% confidence intervals.

Figure 20: The resulting performance of TAF+ when the number of population models varies.

Figure 23: The performance of individual objective functions in absolute pointing. The error bars show 95% confidence intervals. The red * (meta-BO and BO) and the black * (meta-BO and Manual) signs denote a significant difference between the two procedures at that iteration.

Figure 25: The performance of absolute pointing for each individual user. The x-axis is the number of iterations, and the y-axis is the weighted-sum final objective value. Note that we have converted the problem into a maximization problem by normalizing the objective functions.

Figure 26: The performance of relative pointing for each individual user. The x-axis is the number of iterations, and the y-axis is the weighted-sum final objective value. Note that we have converted the problem into a maximization problem by normalizing the objective functions.

Table 4: This table shows a set of Pareto-optimal designs in a three-objective optimization problem for an example User A. Intuitively, setting the objective weights to [0.1, 0.1, 0.8] would readily yield design B, which has the highest user rating (100). However, when the second user (presented in Table 5) is taken into consideration, [0.1, 0.1, 0.8] would yield a suboptimal design (i.e., one not associated with the highest rating) for that user. In contrast, setting the objective weights to [0.4, 0.2, 0.4] leads to the design with the highest rating for both users.

Table 7.
• Spheres + Branin: This base function has two Sphere functions and a Branin function (footnote 17). Branin is a commonly used function for testing optimization, and it is considered more challenging than Sphere. Details are in Table 8.

Figure 15: The hypervolume per iteration during the population modeling in Simulation 2. The random condition serves as a baseline. The error bars denote 95% confidence intervals, and the maximum volume is determined by a thorough grid search.

Table 10: The mean and standard deviation of the weighted-sum performance for each procedure in absolute pointing.

Table 11: The mean and standard deviation of the weighted-sum performance for each procedure in relative pointing.

Table 12: The mean and standard deviation of the raw NASA-TLX scores in absolute pointing. One-way repeated-measures ANOVAs showed that the only significant difference was in the sixth question, where both meta-BO and BO led to significantly lower scores than the Manual procedure.

Table 13: The mean and standard deviation of the raw NASA-TLX scores in relative pointing. No significant differences were found for any question.