Benefits of Human-AI Interaction for Expert Users Interacting with Prediction Models: a Study on Marathon Running

Users with extensive domain knowledge can be reluctant to use prediction models. This also applies to the sports domain, where running coaches rarely rely on marathon prediction tools for race-plan advice for their runners' next marathon. This paper studies the effect of adding interactivity to such prediction models, to incorporate and acknowledge users' domain knowledge. In think-aloud sessions and an online study, we tested an interactive machine learning tool that allowed coaches to indicate the importance of the earlier races feeding into the model. Our results show that coaches deploy rich knowledge when working with the model on runners familiar to them, and that their adaptations improved model accuracy. Coaches who could interact with the model displayed more trust in, and acceptance of, the resulting predictions.


INTRODUCTION
While there is mounting evidence that it is possible to build accurate and effective prediction models, it is less well studied how users interact, or wish to interact, with such models, and whether these models will be accepted by and useful to their users [23]. Due to the nature of these data-driven prediction models, it is hard for users to understand the inner workings of the model and thus to judge the quality of its outcomes. On top of that, for users with rich domain knowledge in particular, it is critical that the model acknowledges and resonates with their own way of working and thinking. Being knowledgeable in a domain implies that one has established routines for handling cases and tasks, and support from data-driven or AI tools does not necessarily blend in easily. For example, healthcare professionals have shown difficulty incorporating such tools into their routines and question their competence and reliability [14,32].
Over the last years, various solutions have been proposed to improve AI-assisted decision making by calibrating trust and fostering 'appropriate reliance' [50]. After all, ideally users are facilitated to understand the support tool well enough to decide when and when not to rely on its advice. To this end, the field of Explainable AI (XAI) studies how we can and should explain the model (e.g., [35,45]) in order to inform users on various aspects, such as how an outcome is generated, what the uncertainty of the outcome is, and why the actual outcome is given over other outcomes (e.g., why the diagnosis is pneumonia and not flu). This facilitates users' understanding and thus proper decision making. Yet, many XAI techniques are technical and not very intuitive, even for expert users [30]. Another solution that facilitates human-machine collaboration is Interactive Machine Learning (IML) [16,60], which goes one step further than XAI in the sense that it not only informs users but also gives them control. Through IML, and similarly in Explainable Active Learning (XAL) [20,61], users are facilitated to debug and teach the model (e.g., [34,52]), potentially enabling them to use it more effectively and in a more meaningful way [4,23] and fostering and calibrating trust [20,61].
Few studies in IML, however, have studied users with domain knowledge working on tasks that are realistic and relevant to them. While several studies have shown different experiences between experts and novices when working with IML [18,23,64], this work typically relies on participants recruited from a general audience (typically through Amazon MTurk), not on domain experts whose daily job is to work on these tasks. We argue that to increase ecological validity and understand the actual dynamics of interacting with an ML solution, it is important to recruit domain experts and provide them with a realistic and relevant task. Not only the fact that they know the domain, but also that they care about the specific case and outcome, might importantly change their interaction needs. This paper presents a study with knowledgeable and invested participants, namely marathon running coaches.
The domain of marathon running provides a good case for exploring the needs and effects of interactivity with knowledgeable users in a realistic setting, for two reasons. First, data from previous marathons are publicly available and models for setting realistic yet challenging marathon finish times have been developed and validated [53]. Thus, we can provide actual coaches with realistic models to make predictions for their own runners, for whom they have good background knowledge. The data also enable us to test coaches' ability to work with models for unfamiliar runners (i.e., how IML is typically tested) and to see to what extent that differs from predictions made for familiar runners in terms of trust, acceptance and accuracy of model predictions.
Second, marathon running is a suitable domain since the majority of coaches are currently not keen on using data-driven support. Although they usually have access to their runners' data, they typically do not analyze them, nor do they actively seek the support of models to do so. Instead, they rely on their experience and intuition, fine-tuned over years of coaching [11,21,38,47]. Running coaches usually have extensive domain knowledge on running and physiology, and knowledge about the runner that is typically not incorporated in models, such as the runner's psychological traits, their motivations to run, c.f. [43], injury history, and personal context. This implies that good human-AI collaboration is key, since prediction models and coaches can bring in complementary information. However, this collaboration is also challenging. For instance, human experts often tend to rely on too many, and sometimes irrelevant, factors when making predictions [48], leading to sub-optimal predictions. On the other hand, models might miss relevant factors and can benefit from leveraging the personal knowledge, experience, and intuitions of coaches.
The current paper presents a user study of an interactive machine learning (IML) system designed to support running coaches in planning upcoming races for marathon runners. Through think-aloud sessions and an online study (n=71), we explored coaches' interactions with the system, and tested it against a non-interactive system. The results show that allowing coaches to interact with the model benefits both the system and the coaches: it leads to improved levels of acceptance and trust from coaches, and it reveals the potential of coaches' feedback to improve model accuracy. We conclude the paper by discussing the results and drawing implications for future studies and the design of interactive systems.

RELATED WORK AND RESEARCH QUESTIONS

Marathon Running, Pacing Strategies and Data
Running is increasingly popular as a recreational sport [49], and participation in endurance events such as the marathon has been growing steadily over the last few decades [62]. Most runners use wearable monitoring to track and optimize their running performance [28,44]. Tracking running activities can provide runners with useful feedback, including instructional, motivating or challenge-oriented feedback [12]. Sharing these data with others is also increasingly commonplace and encouraged by popular platforms such as Strava [2] and RunKeeper [1]. In addition, many marathon races provide, often publicly, official race-time data for 5 kilometer intervals throughout the race. These so-called 5k split-times provide a pacing profile of the race, which has the potential to support runners and their coaches when training for and planning their next marathon. For example, models learned from training session information [15,31,59] or from the marathon performances of similar runners [53] have been shown to accurately predict challenging yet realistic marathon finish-times and provide pacing advice for a next race [6].
Running-related data may be particularly useful when preparing for a race. Runners, and their coaches, can learn from their previous performances. Several data-driven models have been developed to generate performance predictions for future events. Riegel's early work [46] predicts finish-times based on the finishing times of shorter races, using linear regression techniques. More recent efforts have explored more advanced machine learning approaches [10] and predict performance using wider sets of data, including training determinants [15,31] and past race performances of similar runners [53].
When preparing for a long-distance race such as a marathon, it is particularly important to plan a suitable pacing strategy, or pace profile [19], for the race itself. Inexperienced runners often show a positive-split, meaning that they start too fast, often resulting in exhaustion towards the end of the race, also referred to as hitting the wall [56]. An even-split or slightly negative-split, running faster at the end than at the beginning, is associated with optimal finish times. A good pacing strategy is of key importance: given the same training preparation and fitness of the runner at the start, a suitable pacing strategy can still greatly influence a runner's performance. The actual race-plan is directly derived from the target finish time combined with the pacing profile, which results in the actual pace that a runner needs to achieve at every moment of the race.

On the Acceptance of Interactive Machine Learning
Support from machine learning systems for decision making is not necessarily accepted by users. For instance, studies in healthcare settings reveal challenges for the adoption of clinical decision support systems. Several barriers to use have been identified, including a lack of fit with current workflows, making such systems inefficient to use, and scepticism about their competence and reliability [14,32]. Health coaches have expressed similar concerns, adding that such support potentially distracts from the personal experiences of a client, which they consider to be an essential part of effective coaching [47]. Making these systems interactive potentially allows for a better human-AI collaboration in which the user is facilitated to understand and teach the AI where needed. Interactive Machine Learning (IML) [60] creates a collaborative setting in which a system and a user can interact more openly and, crucially, where users can provide feedback to guide, and potentially improve, the operation of the underlying model. In IML systems, human feedback is incorporated in the model training process [16,17]. There is a broad range of possible interactions, for example feature selection, model selection or model steering [16]. Enabling users to interact with machine learning models has been shown to be beneficial, as it allows for incorporating users' knowledge, potentially improving the accuracy of the models as well as users' trust [4,9,24,57,58].
Stumpf and colleagues [57] find that users, when asked in the right way, can provide rich feedback to a machine learning system that can lead to improved model accuracy [58]. Similarly, Amershi et al. [4] show, by reviewing user studies in the field of IML, that users wish to express rich domain knowledge in order to improve models, beyond just labelling data. At the same time, there may also be drawbacks to interactivity, such as anchoring effects and increased cognitive workload [20]. Other research even shows that soliciting user feedback may result in reduced perceived system accuracy [27]. This emphasizes the value of explicitly studying and understanding users' interactions with models, in order to improve the design of interactive systems [4]. With this work we aim to understand running coaches' experiences while interacting with support systems, and the degree to which model interactivity influences their acceptance of and trust in those systems.
• RQ1: Does model interactivity improve coaches' acceptance of the model's recommendation and their trust in the model?
Very often, studies in IML use fictional tasks or unrepresentative participants as a proxy for realistic situations. It is notoriously difficult to recruit a large number of participants who have sufficient knowledge and experience to perform a domain-specific task. Thus, studies with intended end-users on realistic tasks typically remain small in sample size, c.f. [7]. In studies involving larger samples, participants are mostly drawn from a general audience through tools such as Amazon MTurk, which implies that the dynamics of the interactions between users and system do not necessarily resemble a real-life situation where domain experts work with realistic cases. Still, those studies reveal insights on the potential differences between knowledgeable and novice users when interacting with an AI. Specifically, when users are supported by IML, those who are familiar with a task (a Tic-tac-toe game) perceive lower control and higher difficulty in giving feedback to the system compared to those unfamiliar with the task [23]. Other work suggests that users with larger (self-reported) domain expertise typically compare the explanation of an AI with their own rationale, resulting in a more critical judgement of the AI compared to novices [25,64]. More domain knowledge has also been shown to make users less sensitive to an AI's suggestion, because they are more confident in their own decision [18]. Furthermore, the extent to which users trust and accept a support system has been shown to be subject to personal investment in the task [5,25]. Specifically, when participants are not personally responsible for a certain task, they tend to rely more on decision support systems [5].
Therefore, we aim to understand how coaches' interactions with a prediction tool, and their tendency to rely on it, vary when working with data of their own pupils or of unfamiliar runners. Working with their own pupils potentially results in stronger, and perhaps more biased, opinions regarding the recommendation, and a stronger need to interrogate and control the model. Accordingly, we consider the following research questions:
• RQ2: Do coaches show different acceptance and trust levels when considering familiar or unfamiliar runners in the model?
• RQ3: Is the effect of model interactivity on acceptance and trust (RQ1) different when coaches are interacting with data of familiar runners, compared to unfamiliar runners?
We extend existing work by using a substantial sample of knowledgeable and invested participants. Including coaches' own pupils in the system allows for high ecological validity; at the same time, comparing coaches' interactions across familiar and unfamiliar runners provides insight into how users' interactions and evaluations differ when working on a realistic task compared to a more fictitious one.

Accuracy of Predictions by Models and Humans
Besides coaches' trust and acceptance, there is an additional perspective on whether or not model interactivity is to be considered successful. That is, we examine the model's accuracy, and the extent to which its predictions improve when coaches interact with it. In IML in general, incorporating user feedback is recognized as beneficial for improving system performance [4,16,17]. Indeed, as coaching is an inherently interpersonal [47] and complex [8] process, coaches potentially contribute unique and important knowledge that can add considerable value to an ML system trained on instances that may be limited to a narrow range of observations and sensor data. They know their runners, including their current form, character, personal context and injury history. At the same time, model predictions have been shown to generally offer equal or better performance than human predictions [42]. This holds for quantitative prediction tasks in a variety of contexts, including diagnosing diseases, predicting student performance or the fit of a job applicant [22], and this may be true for the task of marathon finish time prediction, which is, after all, a quantitative prediction task. When working on such tasks, coaches may be subject to biases, c.f. [29,63], for example a tendency to incorporate too many cues in their predictions [48], which may result in over-fitting. Observing coaches' interactions with the model, specifically their level of interactivity with familiar compared to unfamiliar runners, what they change and how they explain their contributions, provides insight into the knowledge that coaches aim to bring in. Furthermore, we can measure how these contributions actually affect the prediction accuracy. Accordingly, in this work we consider the following additional question:
• RQ4: Is model accuracy improved by coaches interacting with it?
We preregistered this study on AsPredicted.

AN IML SYSTEM TO SUPPORT MARATHON PLANNING
In this section, we describe the marathon prediction model that we use in the current work, and the means of interaction with this model.

A Case-Based Reasoning Model for Marathon Prediction
Smyth and Cunningham [53] developed a Case-Based Reasoning (CBR) model to predict suitable target finish times and pacing strategies for marathons. Briefly, CBR is a machine learning technique that solves new problems by reusing the solutions of similar cases that have been solved previously and are stored in a case base [3]. Smyth and Cunningham [53] applied this idea to marathon running by building a case base of past races, pairing a non personal-best (nPB) race of a runner with a subsequent personal-best (PB) race, and predicting a finish-time for a new target runner by reusing the PB races of runners with nPB races similar to the target runner's. The approach also recommends a pacing plan based on the pacing profiles of the PB races of these similar runners. The intuition behind the approach is that runners who previously ran a marathon with a similar time and pacing strategy (their nPB), and who subsequently improved (PB), provide a good foundation for a prediction and recommended pacing profile for a target runner. The advantages of this approach are three-fold. Cross-validation studies demonstrate that it is capable of generating reasonably accurate finish-time predictions [53]; in addition, the idea of reusing past races of similar runners is intuitively appealing [3], making it straightforward to explain to coaches and runners alike. Moreover, the 5k split times are publicly available from previous marathon races, so no personal tracking data is required and coaches are able to find their own pupils in this data. In this way an understandable interactive machine learning tool can be built that provides predictions for realistic cases and for both familiar and unfamiliar runners.
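As an illustration, the core retrieve-and-reuse step of such a CBR predictor can be sketched as follows; the feature representation (raw split-times) and the neighbourhood size k are illustrative assumptions, not the exact configuration of [53]:

```python
import numpy as np

def predict_finish_time(target_npb, case_base, k=10):
    """Predict a PB finish time for a target runner from their nPB race.

    target_npb: feature vector of the target's non-PB race
                (e.g., 5k split-times plus finish time).
    case_base:  list of (npb_features, pb_finish_time) pairs, built from
                runners who subsequently improved on their nPB race.
    """
    k = min(k, len(case_base))
    # Retrieve: distance between the target nPB and every stored nPB race.
    dists = [np.linalg.norm(np.asarray(target_npb) - np.asarray(npb))
             for npb, _ in case_base]
    nearest = np.argsort(dists)[:k]
    # Reuse: average the PB finish times of the k most similar runners.
    return float(np.mean([case_base[i][1] for i in nearest]))
```

In the same spirit, the recommended pacing plan would be derived from the pacing profiles of those same k nearest PB races.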
Based on this case-based reasoning approach, we developed a system to support running coaches in determining a suitable target finish time and planning an appropriate pacing strategy for their runners' upcoming races. The original CBR model [53] generates recommendations for a runner based on their performance in a single previous marathon. As most marathon runners will have run more than one marathon, a practical implementation of the CBR model for use by coaches should be able to combine the results of several marathons. We extended the CBR model to incorporate multiple previous marathons, by querying the original model for each previous marathon separately and then combining the resulting recommendations using adjustable weights per marathon.
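A minimal sketch of this extension, assuming the combination is a weighted mean of the per-marathon predictions (the exact combination rule is an assumption here):

```python
def combine_predictions(per_race_predictions, weights):
    """Combine per-marathon CBR finish-time predictions into a single
    recommendation, using one coach-adjustable weight per past race.

    The non-interactive condition corresponds to equal weights.
    """
    total = sum(weights)
    if total == 0:
        # All races weighted out: fall back to equal weights.
        weights = [1.0] * len(per_race_predictions)
        total = float(len(per_race_predictions))
    return sum(p * w for p, w in zip(per_race_predictions, weights)) / total
```

With equal weights the result is simply the mean of the per-marathon predictions, which is exactly the behaviour the non-interactive condition exposes.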
As we were targeting Dutch running coaches, the dataset on which the model was built consisted of data from the three largest marathons in the Netherlands: Amsterdam, Eindhoven and Rotterdam, from 2008 until 2019. This dataset included 63,000 race records from 24,000 unique runners. Each race record included, among other features, the 5k split-times, the finish time, age, and gender.

The Interaction
Building an interaction with a model that is meaningful and useful for users starts from understanding users' interaction needs [57], including how users tend to question [36] and wish to steer the model [4,51]. To make the marathon prediction model interactive, we combined knowledge of marathon planning with the characteristics of the case-based reasoning approach. First, we identified possible means to adapt and steer the model, based on the procedure of CBR. For example, in CBR an important parameter is the number of similar cases to consider, and it would be technically possible to allow coaches to select and deselect cases (i.e., comparable runners). Moreover, in our extended version we can allow coaches to control which prior races of the target runner to include, using an adjustable weight per marathon. We compared these interaction possibilities to the way of working of running coaches, based on exchanges with running coaches from our own network and domain knowledge of marathon planning. We concluded that one type of interaction particularly resonated with coaches' approach to marathon planning: indicating which previous races are representative for the upcoming race, and thus should be included in the prediction. As a result, the running coaches in the interactive condition were provided with controls to weight the model's inputs. For each previous marathon race of a target runner, the coaches were able to set a slider to indicate how representative that race was with respect to predicting the upcoming performance (see Figure 1). We developed the interface iteratively, based on insights from the running coaching domain, discussions of designs within the team, and adaptations after testing with actual coaches in our think-aloud sessions.
This type of interactivity resembles the feedback assignment and model inspection interface elements of an IML system as proposed by Dudley et al. [16]. By default, the sliders were set at position 0.5, and could be changed in increments of 0.1, to values ranging between 0 and 1. Note that the default position of the sliders resembles the equal-weight policy of the non-interactive condition. The model output changed in real-time with slider movement, thereby providing coaches with immediate feedback and encouraging them to experiment with the model as they adjusted slider positions and observed their impact on predicted finish times. Such short and incremental interaction cycles are a key element of IML applications [4] and facilitate even users with low expertise to experiment with and improve an ML model. The final interface is shown in Figure 1.
The non-interactive condition was similar to the interactive condition, but used equal weights on the previous races of the target runner. Such an equal-weighting policy typically leads to good performance, often outperforming human-adjusted models [13]. This highlights the value of the non-interactive condition: it is neither necessarily inferior to the interactive condition, nor unrealistic.

Procedure in the Study
We embedded the model in an online survey, which was used in 7 initial think-aloud sessions and then in a subsequent online study. The coaches (71 in total) signed an informed consent form on the first page, followed by a page with a brief introduction to the study. Subsequently, the coaches selected their runners by searching for their own pupils in our database. When more than five races were available for a specific runner, we only presented the five most recent ones. We added the option to enter marathon data by hand, in case coaches could not find their runners in our database, or if some races were missing. The coaches worked consecutively with four runners during the study, and were randomly assigned to start either with two of their own pupils they had just selected, or with two runners unfamiliar to them. After collecting data from 12 coaches in the online study, we observed a large drop-out at the runner selection page. These coaches could not find their own runners, and apparently filling in the data by hand was too much effort. Therefore, we decided to provide participants with an option to continue without familiar runners. Of the remaining 59 participants, 22 used this option and were presented with four unfamiliar runners. The final numbers of observations are given in Table 1.
We did not assign unfamiliar runners randomly to the coaches, but aimed to reuse runners selected by previous participants, to make sure that these were comparable, relevant and timely cases. This also avoids confounds, since otherwise unfamiliar runners would likely be less coached. In addition, we aimed to keep the number of marathons similar across the familiar and unfamiliar runners and within each coach, to avoid confounding. So, when a new participant started, we assigned them the most recently selected runners with an equal number of marathons to their own selected runners. At the beginning of the experiment, when no previously selected runners were available, we drew runners from our own prepared list of suitable cases.
The coaches assessed all four runners consecutively. For each runner, the coaches followed the same procedure. On the first page, the runner was briefly introduced by their gender, age and previous marathon performances (including year and location of races, finish times and 5k split-times, with graphs similar to the layout in Figure 1). On this page, the coaches were asked to recommend an initial finish time and a suitable pacing strategy to go with it. The purpose of this initial prediction was to provide a baseline against which to calculate acceptance (RQ1-3).
On the next page, the model presented its recommendation for the runner, including finish time and pace strategy (see Figure 1). At the top of the page, we provided an explanation of the model, including the general idea of CBR and a concrete example, which unfolded when participants clicked on it. The coaches in the interactive condition were able to change the model inputs on this page (see Figure 1). In the non-interactive condition, the sliders were not presented, and the model thus used equally weighted inputs. After seeing the model's recommendation, we asked the coaches again for their recommendations on target finish time and pacing strategy.
To summarize, for each runner the following predictions were made regarding finish time and pacing profile: the coach's initial prediction (before seeing the model), the model's default prediction (equal weights), the adapted model's prediction (interactive condition only), and the coach's final prediction (after seeing the model).
The think-aloud sessions (n=7) were facilitated by one or two researchers. The sessions were all audio recorded and the researchers took notes based on their observations. We aimed to distract the participants as little as possible. While the participants were following the steps in the survey, we were attentive to usability issues. We more actively solicited their thoughts and the motivations for their actions on the pages where the model presented its recommendations, where they were adapting the model, and where they were providing their recommendations on finish time and pacing strategy.
During the think-aloud sessions we observed that coaches mostly focused on setting a suitable finish time; the interaction for setting the pacing strategy was less intuitive and less important to them. Also in the online study, a large number of participants did not use the option to adapt the pacing strategy. Since this resulted in limited data about coaches' intended pacing strategies, we decided to focus our quantitative analysis solely on the coaches' predictions regarding finish time.
At the end of the survey, after assessing all four runners, the coaches completed a questionnaire to evaluate their trust in the system, coaching experience, several self-efficacy measures, and other demographic information, as discussed in more detail below.

Acceptance and Trust.

To answer RQ1-3, we measured acceptance and trust. For the non-interactive condition, we calculated acceptance similarly to Weight on Advice (WoA), as used as a measure of advice taking [55,65] and of trust in AI work [37]:

Acceptance = (Coach_final - Coach_initial) / (Model - Coach_initial)

In the interactive condition, it makes more sense to calculate to what extent the coaches accepted the final (adapted) model:

Acceptance = (Coach_final - Coach_initial) / (Model_adapted - Coach_initial)

Acceptance is 1 when coaches fully adopted the (adapted) model's outcome, and it is 0 when they stuck with their initial finish time. If they revised their initial prediction in the direction of the model's prediction, this is expressed as a fraction between 0 and 1, relative to the distance between the model's and the coach's initial prediction. The distribution of Acceptance is given in Figure 2. Note that in the raw data we also observed a few negative values, indicating that coaches disagreed even more with the model after seeing its prediction. We capped those values at 0, as they indicate no acceptance.
We observed that most coaches rounded their target finish time to a multiple of five minutes. For example, for one of the runners of Coach 46, the model predicted a finish time of 3h42, and the coach filled in 3h40 and indicated in the open text field "The model and I are totally aligned!", indicating a full degree of acceptance. This type of feedback occurred frequently. Therefore, Acceptance is set to 1 if the final prediction of the coach is equal to the model's prediction rounded to the nearest 5 minutes.
The final distribution of Acceptance (Figure 2) shows that the majority of the coaches either fully accept (Acceptance = 1) or do not accept at all (Acceptance = 0) the model's output. In the analysis, we will consider Acceptance both as a continuous measure and as a binary measure (rounded at .5).
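The Acceptance measure, with the negative-value cap and the five-minute rounding rule described above, can be sketched as follows (all times in minutes; this is an illustrative reconstruction, not the study's analysis code):

```python
def acceptance(coach_initial, coach_final, model):
    """Weight-on-Advice style acceptance of the model's prediction.

    `model` is the default prediction in the non-interactive condition
    and the coach-adapted prediction in the interactive condition.
    """
    # Full acceptance when the coach's final time equals the model's
    # prediction rounded to the nearest 5 minutes (coaches tend to round).
    if coach_final == round(model / 5.0) * 5:
        return 1.0
    if model == coach_initial:
        # No distance between initial prediction and advice to move along.
        return 1.0 if coach_final == model else 0.0
    a = (coach_final - coach_initial) / (model - coach_initial)
    return max(0.0, a)  # cap negative values (moving away) at 0
```

For the Coach 46 example above, a model prediction of 3h42 (222 min) and a coach's final answer of 3h40 (220 min) yield full acceptance, since 222 rounds to 220.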
To evaluate trust, we used a questionnaire consisting of 12 questions on a 7-point scale, translated from [41]. We used only the items that applied to our case, which were the subcategories Competence (e.g., "The model performs its role of predicting marathon performances well."), Willingness to Depend (e.g., "I feel I can count on the model when I need a suitable finish time and pace strategy."), and Subjective Probability of Depending - Follow Advice (e.g., "If I had to set a suitable finish time and pace strategy, I would want to use the model again."). The other subcategories were not applicable, for example those regarding the willingness to make purchases, and were therefore omitted. Factor analysis revealed two factors. The first factor loaded highest on items from the subcategory Competence (labelled Perceived Competence, Cronbach's α = 0.917). The second factor loaded highest on items from the other two categories, i.e., Willingness to Depend and Subjective Probability of Depending (labelled Willingness to Depend, Cronbach's α = 0.910).
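For reference, the internal-consistency statistic reported here (Cronbach's α) can be computed from a respondents-by-items score matrix as follows; this is the standard textbook formula, not code from the study:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # per-item variance
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of sum scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)
```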
Note that per coach, we measured Trust once (n=71), but Acceptance was measured four times, that is, once per assessed runner (n=71x4=284, also see Table 1).

Accuracy.
In order to determine whether model accuracy was influenced by the adaptations of the coaches (RQ4), we calculated the accuracy of the predictions Model_initial (the model's initial prediction) and Model_adapted (the model prediction after the coach's adaptations). We used a cross-validation approach: for all unfamiliar runners we omitted the best race in the data set (i.e., the fastest finish time) and used this as ground truth (G), after which we compared that with the model's prediction (P) based on the other races. It was only feasible to calculate Accuracy for the unfamiliar runners (n=186, see Table 1), as omitting the best race from the data of the familiar runners (n=98) would probably have been recognized by the coaches, who know the best races of their pupils. We calculated the Percent Error as follows: Percent Error = |G − P| / G × 100%.

The rationale for omitting the best race should be considered within the context of the task. Coaches, and thus the prediction tool, face the task of setting a suitable target finish time that is challenging yet realistic. The runner's best recent race (out of a maximum of 5 most recent races) serves as a representative measure for this; after all, it shows what the runner has recently been capable of. If the model is able to determine this finish time as a realistic target based on the other recent races, we say this is an accurate prediction. Comparing the Accuracy of Model_initial versus Model_adapted then shows whether coaches were able to improve the Accuracy of the model's prediction by interacting with the model. This analysis was only possible for the unfamiliar runners in the interactive condition (n=102, see Table 1).

We can also apply the Accuracy measure to the coaches' predictions, that is, Coach_initial and Coach_final (the coach's initial and final predictions). This provides insight into the quality of the coaches' predictions and whether working with the (interactive or non-interactive) model can improve these predictions. It also allows us to relate the coaches' prediction Accuracy to relevant covariates such as coaching experience. However, the task of the coaches (i.e., determining a suitable target finish time for the next marathon), while aligned, is not identical to the procedure of the model (i.e., predicting the best recent race based on other recent races). Therefore, we only analyze the Accuracy of coaches' predictions in an exploratory manner.
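The leave-best-race-out evaluation can be sketched as follows. The function names are ours, and `predict` is a placeholder for the underlying prediction model, which we do not reproduce here:

```python
def percent_error(ground_truth: float, prediction: float) -> float:
    """Absolute percent error between the held-out ground truth G
    (the runner's best recent finish time) and the prediction P."""
    return abs(ground_truth - prediction) / ground_truth * 100

def evaluate_runner(race_times, predict):
    """Leave-best-race-out evaluation for one unfamiliar runner.

    The fastest of the (up to 5) most recent races is held out as
    ground truth; `predict` maps the remaining races to a finish time.
    """
    g = min(race_times)       # best race = fastest finish time
    rest = list(race_times)
    rest.remove(g)            # the model only sees the other races
    return percent_error(g, predict(rest))
```

For example, with finish times of 100, 110, and 120 minutes and a naive `predict` that returns the fastest remaining race, the error is |100 − 110| / 100 × 100 = 10%.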

Covariates.
Other variables are likely to play a role in the acceptance, trust, and adaptation of models, including coaches' self-efficacy, demographic variables such as age or gender, and coaching experience. We aimed to explore, and statistically control for, the potential effects of these variables by including the following self-report measures:

• Self-efficacy measures:
-Self-efficacy of setting target finish times by oneself (1 question, 7-point scale): I believe I am able to set a suitable finish time and pace strategy without the help of a model.
-Self-efficacy of providing appropriate input to the model (interactive condition only, 2 questions, 7-point scale): I believe I am able to determine which races are representative for a runner, and I believe I am able to adjust the model such that it gives a meaningful prediction.
-General computer self-efficacy (translated from [39], factor scores based on 6 questions, 10-point scale, Cronbach's α = 0.948).

• Demographics and Experience:
-Age, gender, and education level.
-Experience as a coach / runner, through multiple questions: years being a coach, years being a runner, number of runners coached, and number of marathons participated in.
-Previous experience with using tools or models for setting target finish times (factor scores based on 3 questions, 7-point scale, Cronbach's α = 0.937); e.g., As a running coach I often use (calculation) models.
-Familiarity with the runner (for familiar runners only, 1 question per runner, 5-point scale).
A relatively high correlation was found between the following five variables: age, years being a coach, years running, number of runners coached, and number of marathons raced. We decided to omit age and used factor analysis on the remaining four experience measures (Cronbach's α = 0.66) to construct a factor score as a measure of 'Coaching Experience'.
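For reference, Cronbach's α for a set of items can be computed from the item variances and the variance of the summed scale. A minimal sketch (our own, not the authors' code):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents, k_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of sum)."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()   # per-item sample variances
    total_var = x.sum(axis=1).var(ddof=1)     # variance of the summed scale
    return k / (k - 1) * (1 - item_vars / total_var)
```

Perfectly correlated items yield α = 1; weaker inter-item correlations drive α down, which is why the four experience measures (α = 0.66) form a less internally consistent scale than the computer self-efficacy items (α = 0.948).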

Participants
Running coaches were recruited through social media and news messages on running platforms, by contacting running associations and marathon organizations, and via our personal network. We started with think-aloud sessions. After 7 participants (mean age 58 years, 2 female), we terminated data collection as we agreed on saturation. The sessions lasted approximately 1-1.5 hours, and we compensated the coaches with a €10 voucher. We asked the coaches to fill in the online survey, including working with the marathon prediction tool, while thinking aloud. We presented them all with the interactive model, because we mainly sought to understand their interactions. We continued the study online, and 68 coaches completed the study. Of the 7 coaches that participated in the think-aloud sessions, 3 fully completed the questionnaires and could thus be included. This resulted in a total of 71 participating running coaches, of which 28 were female; mean age was 45 years (ranging from 19 to 69 years), and mean working experience as a coach was 5 years (ranging from 0 to 28 years). The online study lasted approximately 30 minutes, and we compensated the online participants by raffling one €25 voucher per 5 participants, which winners received by e-mail. The participants were randomly assigned to either the interactive (n=39) or the non-interactive (n=32) condition (see Table 1).

Results of the think-aloud sessions
The think-aloud sessions showed that the data and the model triggered coaches to extensively reflect on the runners and their performances. The task of determining a suitable finish time and pace strategy was clearly relevant to them, and the presented data were intuitive, as they did not hesitate to start talking about the data and the runner immediately. When presented with their own pupils, it was often clear that these data were top of mind. Most coaches immediately filled in the target finish time they used in their training sessions, and they often knew their runners' previous marathon performances by heart. So, our system clearly resonated with the running coaches' daily practice.
Coaches aimed to express rich knowledge when adapting the model. When interpreting data of their own pupils, they gave lengthy reflections on the runner's background and motivations ("This runner has had a heart attack recently, if she will ever run a marathon again, it is for fun rather than performance" (P6)), how well they were prepared for specific races, their character (e.g., "I know this runner is stubborn" (P1)), and their approach to races ("We should challenge this runner, because she's too conservative herself" (P6)). Rather surprisingly, when assessing the data of the unfamiliar runners, their reflections and possible interpretations were about as lengthy. They were eager to use all information at hand to understand the runner and her performances, such as age ("Given the relatively old age, there is not much room for improvement" (P1)) and how different performances were distributed over time ("This runner improved greatly within a relatively short amount of time, that's a great achievement" (P3)). They were trying to find possible explanations for anomalies or trends in the pacing data ("Here she hit the wall, maybe it was hot weather, or she started too fast" (P7)). Thus, the results show rich domain knowledge at play, even when knowledge of the runner was limited. Overall, the coaches stated they would be more likely to accept the model's recommendation for the unfamiliar cases, compared to familiar cases, because they had relatively little information to oppose it.
Most coaches (5 out of 7) freely experimented with setting the sliders and observing their effects before committing to their final input. Some coaches (3 out of 7) were keen to adapt the model such that it would fit their own ideas; for example, one coach stated: "I hope the model understands it now" (P5). Coaches could adequately explain their intentions when setting the sliders. For example, one coach stated: "By making this race very representative, I say: this is how I like the runner to approach the next race" (P6). Notably, coaches rarely set the sliders to zero (i.e., not representative at all), because they believed that "even bad performances are in a way representative for their capabilities" (P2).
Lastly, some coaches (3 out of 7) reflected on the applicability of the model for different types of runners. For example, one coach explained: "I think this tool is very useful for those runners who are highly driven by performance, but I know many other runners who mainly want to be healthy and enjoy the race." (P4) To test the assertion that the model might be more applicable to some types of runners' motivations and less to others, we added an additional question in the online study, measuring the coach's perception of each familiar runner's type of motivation (nine motivation types, translated from [40], such as life meaning or competition, multiple answers possible). Some small usability issues emerged; for example, the process of selecting runners at the beginning of the survey turned out not to be very intuitive. Also, the visualization of the graphs was not always clear (e.g., the labels on the axes). Based on the participants' feedback and our own observations, we improved the usability of the survey and the model interface.
As the results from the think-aloud sessions suggested that coaches are able to successfully interact with the model, we changed neither the model nor the means of interaction. We will use the data from the think-aloud sessions, supplemented with the answers in the open text fields of the online study, to interpret and enrich the quantitative findings of the online study.

Interactions
Coaches in the interactive condition made ample use of the interactivity option; on average they interacted with the model 25 times per runner (i.e., changing the slider positions; median = 16.5, range 0-105). Figure 3 shows that the number of interactions is right-skewed, and that coaches interacted most with the model when assessing the first runner in the experiment, probably because they were still exploring the functionality of the tool in the first trial, with fewer but more focused interactions later. Interaction levels were also high for the third runner in the experiment, which is when coaches switched from familiar to unfamiliar runners, or vice versa. Analysis using multilevel regression (random intercept per coach) supports these results in that the number of interactions was indeed significantly lower for runners assessed later. The number of interactions did not vary significantly across coach characteristics (i.e., covariates such as gender and experience), nor between familiar and unfamiliar runners.

Effects of Model Interactivity on Trust and Acceptance (RQ1)
For understanding coaches' Acceptance, we use a multilevel regression model (random intercept per coach), as the data contain repeated observations, with each coach assessing four runners. For Perceived Competence and Willingness to Depend, the two Trust components, we fit regular regression models, as there is a single measurement per coach. For an overview of the regression results, see Table 2.
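A multilevel model of this form can be fit with, for example, statsmodels. The data below are fabricated for illustration, and the variable names are ours, not the authors':

```python
import pandas as pd
import statsmodels.formula.api as smf

# Fabricated long-format data: one row per (coach, runner) assessment,
# four runners per coach. Condition is a between-coach factor;
# familiarity varies within coach.
df = pd.DataFrame({
    "coach_id":    [1]*4 + [2]*4 + [3]*4 + [4]*4,
    "interactive": [1]*4 + [0]*4 + [1]*4 + [0]*4,
    "familiar":    [1, 1, 0, 0] * 4,
    "acceptance":  [0.55, 0.62, 0.48, 0.70,
                    0.30, 0.41, 0.25, 0.38,
                    0.66, 0.58, 0.72, 0.50,
                    0.22, 0.35, 0.40, 0.28],
})

# Acceptance ~ condition + familiarity, with a random intercept per coach
# to account for the repeated observations.
model = smf.mixedlm("acceptance ~ interactive + familiar",
                    data=df, groups=df["coach_id"])
result = model.fit()
print(result.params["interactive"])
```

The random intercept absorbs coach-level differences in baseline acceptance, so the fixed effects are not inflated by the fact that each coach contributes four correlated observations.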
First, answering RQ1, the results show a strong positive effect of model interactivity on coaches' Acceptance and Perceived Competence of the model. Coaches in the interactive condition were more likely to accept the model's recommendation (β_int = 0.266, p < 0.001), and their Perceived Competence of the model also increased significantly when the model was interactive (β_int = 0.780, p < 0.001). As in the think-aloud sessions, the qualitative data in the online study showed that coaches often explicitly appreciated the ability to adapt the model, e.g., "Without my adjustments the model did not make sense, but by eliminating the race from Eindhoven, we're getting somewhere" (Coach 53, familiar runner, interactive condition). Even some coaches working with the non-interactive model expressed a desire to adjust the model, e.g., "This runner is coming back from a serious injury, so her old races are not representative at all. I wonder what would happen if the model were only based on her last two races" (Coach 4, familiar runner, non-interactive condition), even though the participants in the non-interactive condition were not aware of the existence of an interactive model. The qualitative data also revealed why some coaches deliberately decided not to accept the model. Oftentimes this was related to the model predicting a Personal Best, whereas they had a different goal for their runners, for example because they were recovering from an injury, or because of their age, as one coach explained: "it is fine to aim at a PR, but at his age, he should just treasure his fitness level and enjoy the audience cheer" (Coach 11, unfamiliar runner, non-interactive condition).
For Willingness to Depend, the other Trust component, the effect of interactivity was significant but smaller (β_int = 0.515, p = 0.041, see Table 2). However, the regression model itself was not significant, as illustrated by the negative adjusted R², and the effect of interactivity on Willingness to Depend was not robust.

Effects of Runner Familiarity on Acceptance (RQ 2 & 3)
Our results point to a difference between familiar and unfamiliar runners regarding coaches' Acceptance. First, answering RQ2, coaches are more inclined to accept the model when working with unfamiliar runners (β_fam = −0.121, p = 0.030, see Table 2). Indeed, coaches regularly reflected in their explanations that "having limited information about the runner makes that I now rely more on the model's recommendation" (Coach 1, unfamiliar runner, interactive condition). When working with familiar runners, coaches were much more informed and had stronger opinions, illustrated by expressions like "I know this runner and I know what he is able to do" (Coach 37, familiar runner, non-interactive condition). Regarding RQ3, the results suggest that model interactivity is more important when coaches are working with data of their own runners. The increase in Acceptance as a result of model interactivity was stronger for familiar runners (from 0.28 to 0.51) compared to unfamiliar runners (from 0.50 to 0.61), though this interaction effect was not significant in the regression model (β_int*fam = 0.193, p = 0.082, see Table 2).

Effects of Covariates on Trust and Acceptance
Covariates play a role in both coaches' Acceptance and Trust levels (see Table 2). Coaches with more experience were less inclined to accept the model (β_coachExp = −0.098, p = 0.013). Higher self-efficacy regarding setting finish times by oneself led to increased Acceptance (β_finishEff = 0.075, p = 0.008), and having more experience with data and modelling led to lower Acceptance (β_dataExp = −0.100, p = 0.005). Regarding Trust, women showed higher Perceived Competence than men (β_gender = 0.569, p = 0.015), and lower educated coaches showed less Perceived Competence than higher educated coaches (e.g., comparing High School with Undergraduate, β_highschool = −1.065, p = 0.004). Lastly, higher scores on computer self-efficacy resulted in higher levels of Perceived Competence of the model (β_compEff = 0.243, p = 0.043). While these effects may not all have straightforward explanations, together they do suggest that coaches with different backgrounds, experience, and self-efficacy levels may respond differently to support tools.

Additional Analysis on Model Adaptation and Acceptance
There is one caveat in our measure of model acceptance. Coaches in the interactive condition had the possibility to steer the model fully towards their own opinions, in which case we would observe full acceptance of the model, even though the coach did not take the model predictions into account at all. To illustrate this problem, we provide four distinct cases of full Acceptance in Table 3. To see if this issue influenced our results, we re-ran the analysis while controlling for the extent to which coaches adapted the model towards their own initial prediction. This measure is given in the last column of Table 3, with 0 representing coaches not adapting the model at all and 1 representing coaches adapting it to exactly their initial prediction. Based on this measure, we are able to classify all cases as described in the previous paragraph (see Table 4). Analyzing the mean Acceptance (see the last column in Table 4) using multilevel regression with these classes as predictors, we find that classes 1 and 2, representing coaches who did not or only to some extent adapted the model towards their own initial prediction, show significantly higher mean Acceptance compared to the coaches in the non-interactive condition (β = 0.211, p = 0.032 and β = 0.187, p = 0.032, respectively). This shows that merely the possibility to adapt the model, regardless of using it for one's own sake, already increases coaches' Acceptance. This adds to our previous analysis of Acceptance, representing a cleaner effect of interactivity on Acceptance, as it excludes those coaches who actively steered the model towards their own opinions (class 3) and beyond (class 4).
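The extraction of this document lost the exact adaptation formula. A ratio consistent with the verbal definition (0 = model left unchanged, 1 = model adapted to exactly the coach's initial prediction) would be (Model_initial − Model_adapted) / (Model_initial − Coach_initial); the sketch below encodes this reading, which is our assumption rather than the authors' published formula:

```python
def adaptation(model_initial: float, model_adapted: float,
               coach_initial: float) -> float:
    """Extent to which the coach steered the model toward their own
    initial prediction (all values are finish times, e.g., in seconds).

    0.0  -> model not adapted at all;
    1.0  -> model adapted to exactly the coach's initial prediction;
    >1.0 -> model adapted beyond the coach's initial prediction.

    NOTE: this ratio is our reconstruction of the paper's verbal
    definition, not the authors' formula.
    """
    return (model_initial - model_adapted) / (model_initial - coach_initial)
```

For example, a coach who predicted 3:00 (180 min) for a runner the model initially put at 3:20 (200 min) scores 0.0 if they leave the model alone, 1.0 if they adapt it to exactly 3:00, and above 1.0 if they push it past 3:00.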

Effects of Coaches' Interactions on Accuracy (RQ4)
As discussed in Section 4.2.2, we calculate the Accuracy of the model predictions by testing how closely the (adapted) model output resembles the (omitted) best performance, for unfamiliar runners in the interactive condition only (n=102).
In answer to RQ4, the model Accuracy was indeed significantly improved by the coaches, from the initial model (mean error = 3.14%) to the adapted model (mean error = 2.33%; paired t-test, p = 0.018). As coaches apparently could improve the model, this raises the question of what coaches actually changed, and which domain knowledge they were aiming to express. Analyzing the final positions of the sliders after coaches' interactions, we found that more recent races are typically indicated as more representative (linear regression predicting slider position based on year of the race, β = 0.042, p < 0.001). This resonates with the qualitative data, where coaches frequently indicated "I decided to make older races less representative" (Coach 46, unfamiliar runner, interactive condition). Furthermore, in line with the results of the think-aloud sessions, the qualitative data of the online study painted a rich picture of the knowledge that coaches employed for their predictions and adjustments to the model. Frequently mentioned topics include a runner's mental strength, their ability to keep a constant pace, their training intensity, whether runners have reached their maximal ability yet or still have something to gain, injuries, and personal circumstances unrelated to running. When coaches were assessing the data of unfamiliar runners, they had no access to information other than the runner's previous performances, pacing, age, and gender. We observed, however, that coaches were actively trying to make sense of the data nevertheless, for example: "There is clearly something going on with this lady. Maybe she stopped training, or she has a persistent injury? To make a sensible prediction, I would need more information. For now, I would say a finish time of 4 hours should be suitable" (Coach 45, unfamiliar runner, non-interactive condition).
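A paired comparison of this form can be sketched as follows; the error values below are simulated for illustration, not the study data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated paired percent errors, one pair per unfamiliar runner (n = 102):
# the adapted model reduces the initial error by a small amount on average.
initial_error = rng.normal(loc=3.14, scale=1.2, size=102).clip(min=0.0)
adapted_error = (initial_error
                 - rng.normal(loc=0.8, scale=0.6, size=102)).clip(min=0.0)

# Paired t-test: the same runners appear under both predictions,
# so the observations are pairs, not independent groups.
t_stat, p_value = stats.ttest_rel(initial_error, adapted_error)
print(f"initial {initial_error.mean():.2f}% vs "
      f"adapted {adapted_error.mean():.2f}%, p = {p_value:.4f}")
```

Using a paired rather than an independent-samples test matters here: it removes between-runner variance in how predictable each runner is, which is typically much larger than the improvement from the coach's adaptation.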

Exploratory Analysis on Coaches' Prediction Accuracy
Did coaches themselves also improve after interacting with the model? As discussed in Section 4.2.2, we should carefully consider the context of the task. The coaches were asked to provide a "challenging yet realistic finish time" for the runner's next marathon (see Figure 1). Thus, the Accuracy of coaches' predictions actually shows the extent to which the coaches' recommendations are aligned with the runner's best recent performance. We can run this analysis for both the interactive and non-interactive conditions for unfamiliar runners (n = 186, see Table 1). We find that the Accuracy of coaches' initial predictions (mean error = 3.77%) improved after assessing the model (mean error = 3.41%; paired t-test, p = 0.009). This improvement was similar across the interactive and non-interactive conditions. This suggests that being able to interact with the model does not necessarily improve coaches' learning process, but that, in general, assessing the model did contribute towards better predictions. No covariates (such as number of interactions or sequence number) could explain the improvement in accuracy.

Additional Analysis on Runners' Motivations
Based on insights from the think-aloud sessions, we hypothesized that the marathon prediction tool might be more applicable to runners driven by performance, compared to runners with other motivations such as fun or life meaning. To explore this, we tested whether the Acceptance level of coaches in the online study differed across the nine possible types of runners' motivations indicated by the coach (e.g., 'life meaning', 'competition', 'recognition', 'weight concern', as described in [40]). Regression reveals that Acceptance is significantly lower only for those runners whose motivation is 'life meaning' (19 out of 98 runners, β = −0.303, p = 0.013). For the other runners' motivations, we did not find an effect.

Conclusion
Coaches were keen to deploy their knowledge of the specific runner to improve the model, by determining how representative the runner's past race performances were for future performance. Model interactivity improved running coaches' levels of Trust in and Acceptance of the model (RQ1). Coaches showed higher Acceptance levels when working with unfamiliar runners compared to familiar runners (RQ2), and longer coaching experience was related to lower levels of Acceptance. In addition, our results suggest that model interactivity is most appreciated when coaches are working with familiar runners; however, that effect was not significant (RQ3). When coaches interacted with the model, the model's prediction Accuracy improved (RQ4), and they employed rich knowledge to do this: about running in general, about the data in terms of pacing profiles and past performances, and about the specific runner if the runner was familiar to them.

DISCUSSION
Prediction models have the potential to improve and support expert decision making. Users' domain expertise makes interaction with these models both key and challenging. Few studies in interactive machine learning have actually studied expert users and realistic cases. In this paper we took the case of marathon running, which provided us with validated prediction models [53] and data for realistic cases from previous marathons. Moreover, we were able to reach out to a large body of running coaches with different levels of expertise. We explored the ways in which coaches wish to, and actually do, interrogate a predictive model. We investigated how coaches interpret these running data and model predictions for both familiar and unfamiliar runners, the extent to which they are inclined to accept and trust these predictions, and whether that is influenced by model interactivity. We also investigated how model accuracy was affected by the coaches' model adaptations.
We observed high involvement of coaches during the think-aloud sessions, and extensive and detailed reflections in the open text fields of the online survey. This shows that coaches cared about the task and the runner at hand. Our results provide insight into two aspects of the knowledge level of our participants: first, their knowledge about a specific case (comparing familiar with unfamiliar runners); second, their general domain knowledge (measured by coaching experience). Both aspects influenced their experience and behavior when interacting with an AI. More specifically, we found that coaches were significantly more inclined to accept the model's recommendation when working with unfamiliar runners, because they lacked the required knowledge to deviate from the model, whereas with familiar runners they relied more on their own opinions about which target finish time was suitable. Furthermore, coaches with more coaching experience showed lower levels of acceptance in general, suggesting that these coaches rely more on their own experience when making predictions. This shows that both case knowledge and domain knowledge add to a more critical judgement of the AI's recommendation. While similar effects have been found in prior research [18,25,64], these effects are potentially even larger in a realistic setting with invested users compared to a setting with participants working on tasks that do not resemble their daily practice. It highlights the importance of ecologically valid user tests for future evaluation studies in IML.
In our analysis we focused mostly on how coaches estimated a realistic but challenging finish time. Coaches did not interact much with the pacing strategy controls, probably because the given task was focused on finish time, and because the pacing strategy widget was not part of the IML interface and did not affect the finish time directly. However, this does not mean that coaches did not consider pacing: coaches did use the pacing profiles of earlier races to interpret past performances, and especially for unfamiliar runners this was the primary source of information, as we found in our qualitative results. Future work should consider how to include pacing more in the decision process, as it might support the coaches even better in determining a realistic finish time.
Giving users the possibility to interact with an AI can both help and harm decision making. Expert users potentially have important domain knowledge to add to the AI; at the same time, they might over-rely on irrelevant factors when making predictions. Our study shows that coaches can improve the model. And beyond accuracy, acceptance is also an important factor to consider, since under-reliance on, or non-use of, models may also result in sub-optimal outcomes. Our results show that users appreciate the ability to steer the model, and being able to do so improves their levels of trust and acceptance. To enable and foster effective and satisfying human-AI collaboration, we argue that having and keeping a human in the loop is beneficial [26,33], from both the model's and the user's perspective.
Though our study focuses on the specific domain of marathon running, our results could generalize to other domains. Obviously, other 'cyclic' sports with similar data, such as lap times in speed skating (a domain in which CBR also seems to work well [54]) or GPS data from cycling sports, could employ a similar modeling approach and support coaches interactively in predictive modeling. As data-driven approaches in sports progress towards coach-centric dashboarding [21], we believe that integrating such predictive tooling into those dashboards will be the next advancement in supporting coaches in their daily practice. But our results can also generalize outside of the sports domain. Our IML task is simple and generic: users indicate the importance of different inputs going into a machine learning model and get direct feedback on how this changes the model output. Any domain that allows for such an IML approach with direct feedback would likely show the same benefits, and we expect that trust and acceptance will similarly increase together with model accuracy.

Figure 1 :
Figure 1: Example of the main page of the survey, where the model prediction is presented and the coaches are asked to reconsider their advice. This example shows the interactive model; the non-interactive model did not show the sliders, nor the instruction to adapt the model.
• Coach_initial: Coach provides an initial prediction based on previous marathon performances (introductory page)
• Model_initial: Model provides an initial prediction (model page)
• Model_adapted: Final model prediction after adaptation by the coach (model page; NB: interactive condition only)
• Coach_final: Coach may revise their initial prediction after assessing (and, when interactive, adapting) the model's output (model page)

Figure 2 :
Figure 2: Distribution of Acceptance, coloured by condition: interactive or non-interactive model.

Figure 3 :
Figure 3: Distribution of coaches' Number of Interactions with the model, split by position of runner in the experiment.NB: Coaches in the interactive condition only (n=39).

Table 1 :
Number of observations, split by conditions.

Table 2 :
Multilevel regression for Acceptance, random intercept per coach.For Trust variables: regular regression.
The four cases in Table 3 involve quite different actual model adaptations. In case 1, Coach 49 did not adapt the model but did change her final prediction to match the model prediction, resulting in full Acceptance. In case 2, Coach 24 adapted the model somewhat in the direction of her initial prediction and accepted that intermediate result as her final prediction. Both cases are clear situations in which the model influences the coach's opinion. In case 3, Coach 3 adapted the model completely towards her own initial prediction, yielding a full score on our Acceptance measure while essentially ignoring the model entirely. In case 4, Coach 47 adapted the model towards and beyond their original prediction and fully accepted that new outcome. Cases 3 and 4 are problematic, as they show up as full model Acceptance but are in the end the result of a coach not accepting the model prediction at all. We argue that only if coaches took (some of) the model's predictions into account when adapting and accepting the model does this show the value of model predictions for coaches.

Table 3 :
Examples of coach's and model's predictions for target finish times (h:mm) for different cases of Model Adaptation.

Table 4 :
Mean Acceptance by adaptation case.