What and When to Explain?

Explanations in automated vehicles help passengers understand the vehicle's state and capabilities, leading to increased trust in the technology. Specifically, for passengers of SAE Level 4 and 5 vehicles who are not engaged in the driving process, the enhanced sense of control provided by explanations reduces potential anxieties, enabling them to fully leverage the benefits of automation. To construct explanations that enhance trust and situational awareness without disturbing passengers, we suggest testing with the people who will ultimately use such explanations, ideally under real-world driving conditions. In this study, we examined the impact of various visual explanation types (perception, attention, and perception + attention) and timings (provided continuously or only in risky scenarios) on the passenger experience under naturalistic driving scenarios, using actual vehicles with mixed-reality support. Our findings indicate that visualizing the vehicle's perception state improves perceived usability, trust, safety, and situational awareness without adding cognitive burden, even without explaining the underlying causes. We also demonstrate that the traffic risk probability can be used to control the timing of explanation delivery, particularly when passengers are overwhelmed with information. Our study's on-road evaluation method offers a safe and reliable testing environment and can easily be customized for other AI models and explanation modalities.


INTRODUCTION
In highly automated driving, drivers are no longer required to take over driving-related tasks. For example, vehicles with SAE Level 4 can operate independently under limited conditions, while those with SAE Level 5 can drive autonomously under all conditions [92]. This allows drivers to engage in non-driving-related tasks (NDRTs), such as relaxing, working, texting, or viewing multimedia content. Despite the ease and convenience of

RELATED WORK

2.1 Explanation Research for Automated Vehicles
Public anxiety regarding automated vehicles has led to hesitation in their adoption [50]. Low trust in automated vehicles can also increase worry and lower acceptance [53]. In this context, explanations can help alleviate potential negative experiences and assist people in understanding the capabilities of automated systems. For example, Koo et al. [46] demonstrated that appropriate explanation content could help drivers overcome their anxiety and build trust in automated vehicles. Similarly, explanations provided during or after a ride can mitigate the negative experiences of passengers by offering a greater sense of control [76]. While previous studies highlighted the importance of explanations using simulation environments and high-level contextual information, they were less aligned with current practices in the development of explainable driving algorithms. Nevertheless, recent studies have shown that explanations more directly tied to driving states or decisions themselves, such as vehicle perception or path-planning information [13,18], can improve user experience and trust. These attempts suggest that established design considerations for in-vehicle explanations, such as explanation content [30], timing [89], modality [46,75], aesthetics [22], and the visualization methods applied [12], can be integrated into real-world environments when combined with proper algorithms to generate explanations.
Explainable AI (XAI) describes the hidden intention behind the decision-making of a model. When applied to automated vehicles, XAI models can be used to design more transparent and trustworthy automated vehicles by explaining the reasons behind their driving decisions. In addition to accuracy and precision, the success of XAI models depends on various design factors, including the content of the explanation and the visualization methods applied. For example, many models use heatmap-based attention visualization to show the regions of an image that an algorithm focuses on when making a decision [41,58], whereas others use textual explanations [42,43] or other graphical representations such as arrows [94]. Although most XAI models are designed to provide operational or tactical driving explanations, they can also provide other driving-related information, such as accident risk [65]. Wiegand et al. [90] also emphasized the importance of explaining the machine perception itself, such as the state of the vehicle sensors and object detection, based on the mental models of passengers. Although AI algorithms offer methods for delivering explanations in automated vehicles, relatively limited research has been conducted to assess their effectiveness. Only a few studies have tested these algorithms with human participants, and only with videos played on a monitor [68,69], which is far from an actual riding experience. In addition, most of these algorithms have rarely been implemented on physical platforms. Because human drivers and passengers ultimately use such algorithms, it is important to examine the actual impact of these explanations on the passenger experience when developing truly helpful explainable AI models.
Based on prior work, we aim to evaluate visual explanations for the perception and attention states of SAE Level 4 and 5 automated vehicles. We first considered visual explanations because explanations on a windshield display (WSD) may not interfere with or alarm passengers when they are not watching it, an important feature for Level 4 and 5 automated vehicles in which passengers do not have to maintain the full situational awareness required for driving. Among visual explanations, we tested perception information to further validate the effectiveness it has shown under laboratory conditions [11,12] and in light of the mental model presented by Wiegand et al. [90]. Additionally, we included saliency-based attention information, given that its impact has rarely been tested with humans despite its direct relation to the driving decisions made by neural network-based algorithms. Among the different types of attention, we presented the vehicle's attention in predicting traffic risk, as explaining traffic risk is important in reducing discomfort in automated vehicles [28].

2.2 Driving Simulators for Automotive UI/UX Research
Automotive user interfaces for manually driven vehicles have traditionally focused on promoting safe driving behavior. Driving simulators, which demonstrate behavioral validity by reproducing driver performance patterns similar to real-world conditions, such as speed maintenance [26,95] and lane-keeping behavior [72], have been widely used for designing and testing these interfaces. However, as the automotive industry transitions to automated vehicles, particularly those at Society of Automotive Engineers (SAE) Levels 4 and 5, the emphasis has shifted towards building a satisfactory passenger experience in terms of trust and acceptance [84]. Because driving simulators in laboratories are inherently safe, concerns have been raised that they may not faithfully reproduce the experience of being in an automated vehicle. Hock et al. [35] specifically highlight the potential impact on trust measurement, noting that 'the inherently safe environment may influence measurements of trust in automation [21]' and that 'participants who are more immersed may experience a more realistic feeling of trust [74]'.
One possible solution is to conduct outdoor experiments on actual roads. While simulators provide immersive and reproducible testing environments, evaluating interfaces in real-road settings can yield more ecologically valid results, as all automotive interfaces are ultimately integrated with actual vehicles on the roads. By applying the Wizard-of-Oz paradigm [15] and hiding the wizard driver under the seat [73] or behind a partition [2,83,88], actual vehicles can be transformed into automated driving simulators to test the experiences of passengers and pedestrians without safety or ethical concerns. The wizard driver in on-road simulators can also be hidden by connecting the physical system with a virtual [27,34,61,63] or augmented reality environment [63,96,97], where participants cannot see the driver. These platforms are particularly effective for testing advanced interfaces, as they allow the augmentation of automotive UI/UX services in real-road testing environments.
On-road environments, despite their inability to simulate accident-critical scenarios, can provide a more realistic experience of traffic risks than inherently safe indoor environments. Meanwhile, risk significantly impacts automated driving experiences and attitudes towards explanations. As the risk level increases, reliance on automation requires a higher degree of driver trust, particularly during initial interactions [85]. Consequently, passengers' experiences with explanations in automated vehicles can vary depending on traffic risk levels [30,55]. Recent research by Goldman and Bustin [28] even emphasizes the importance of explaining the risk scenario itself in reducing passenger discomfort.
In this study, we tested visual explanations using actual vehicles under real-road conditions, thereby leveraging the two benefits of on-road experimentation: a naturalistic driving scenario and a realistic experience of traffic risk.In particular, we explored the presentation of vehicle-interpreted risky areas as part of attention information and used the vehicle's interpretation of traffic risk as a means to determine the timing of explanation delivery.

ON-ROAD EXPLANATION TEST METHOD

3.1 In-car Extended Reality for Explanation Visualization
Our system extends the on-road platform MAXIM [96,97] and adopts the Wizard-of-Oz method [15] for exploring self-driving scenarios without safety issues or ethical concerns. The vehicle was driven by a "wizard" driver placed in the driver's seat, and the study participant sat in the front passenger seat while wearing a VR head-mounted display (HMD) (Figure 1). We used a Varjo VR-2 device for our system (1440x1600 per-eye resolution, 87° horizontal FoV, 90Hz). In the extended reality environment, developed using the Unity 3D framework, the participant appears to sit in the driver's seat: the driver is removed, and the 360° streaming camera image surrounds the graphical model of a vehicle to form the Wizard-of-Oz-based self-driving experience (Figure 2). Because the participant sees a graphically rendered vehicle, the user interfaces of the vehicle can be easily augmented through extended reality. Based on previous studies' reports on the strengths of WSDs for information delivery in automated vehicles [10-12], we used a simulated WSD as the method for providing visual explanations in the vehicle. Since the surrounding image and the WSD are streamed independently, any delay from the explanation algorithms does not impact the overall simulator experience. Because the video see-through MR environment viewed by the participant is identical to the video being fed into the machine-learning algorithms generating the explanations, the platform enables contact-analog registration [64].
The use of extended reality technologies in moving vehicles poses a tracking challenge for HMDs. The base station used to track the HMD is incompatible with a moving platform, and the IMU embedded in the HMD does not distinguish between the motions generated by the user and those of the vehicle [60,62]. To address this issue, we constrained the translational movement of the HMD and calibrated its horizontal rotation to reflect only the rotation of the user in the reference frame of the vehicle. Specifically, we set a base IMU to track the orientation of the vehicle so as to distinguish its rotation from that of the HMD in a global reference frame. Before the experiment began, the horizontal orientation of the vehicle was captured by an IMU sensor placed in the vehicle to set the offset for calibration (Figure 3 (a)). During the drive, the HMD of the user was calibrated using a compensation angle, i.e., the difference between the current orientation of the vehicle and the IMU offset (Figure 3 (b)). To correct for the accumulated IMU drift, the base station in the vehicle recalibrated the orientation of the HMD when the vehicle stopped for a certain amount of time, such as while waiting at a traffic light (Algorithm 1).
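As an illustration, the yaw-compensation step described above can be sketched in a few lines. This is a simplified sketch with hypothetical function names; the actual implementation runs inside the Unity pipeline and performs the drift recalibration of Algorithm 1.

```python
# Sketch of the HMD yaw compensation (hypothetical helper names).
# Assumption: yaw angles are in degrees.

def wrap_deg(angle):
    """Wrap an angle to the (-180, 180] range."""
    return (angle + 180.0) % 360.0 - 180.0

def compensated_hmd_yaw(hmd_yaw_global, vehicle_yaw_now, vehicle_yaw_offset):
    """Return the HMD yaw expressed in the vehicle's reference frame.

    hmd_yaw_global     -- raw HMD yaw from its embedded IMU (global frame)
    vehicle_yaw_now    -- current vehicle yaw from the base IMU
    vehicle_yaw_offset -- vehicle yaw captured before the drive (calibration)
    """
    # Compensation angle: how far the vehicle has rotated since calibration.
    compensation = wrap_deg(vehicle_yaw_now - vehicle_yaw_offset)
    # Subtracting it leaves only the user's head rotation inside the cabin.
    return wrap_deg(hmd_yaw_global - compensation)
```

For example, if the vehicle has turned 20° since calibration while the HMD reports a global yaw of 100°, the head rotation relative to the cabin is 80°.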

3.2 Explanation Algorithms
Figure 4 provides an overview of the algorithms used to generate the in situ explanations. The front view of the vehicle is captured in Unity 3D and then sent to a Python environment where machine-learning algorithms generate visual explanations. The outcomes are returned to Unity 3D to be visualized in the mixed-reality environment. On the Python side, the streamed front view undergoes two parallel processes: 1) semantic segmentation, which forms part of the perception state of the vehicle, and 2) a 3D CNN for traffic risk prediction with Grad-CAM, which is used both as a visual explanation of attention and as a means to modulate the explanation timing. Depending on the experimental conditions, the two types of explanations were provided separately or together to form the three explanation conditions tested (perception, attention, and perception + attention).
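The per-frame dispatch to the two parallel branches could be sketched as below. The model functions are hypothetical stand-ins for the actual networks, and thread-based dispatch is one plausible realization of the parallelism described above, not the confirmed implementation.

```python
# Sketch of running both explanation branches on one front-view frame
# in parallel (placeholder model functions; illustrative only).
from concurrent.futures import ThreadPoolExecutor

def run_segmentation(frame):
    # Stand-in for the semantic segmentation network (perception).
    return {"mask": "segmentation-map-for-" + frame}

def run_risk_cnn(frame):
    # Stand-in for the 3D CNN + Grad-CAM branch (risk + attention).
    return {"risk": 0.42, "attention": "grad-cam-for-" + frame}

def explain(frame):
    """Submit both branches concurrently and wait for their outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        seg_future = pool.submit(run_segmentation, frame)
        risk_future = pool.submit(run_risk_cnn, frame)
        return seg_future.result(), risk_future.result()
```

Running the branches concurrently keeps the slower branch from delaying the other's overlay update.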

3.2.1 Perception (Segmentation).
As Wiegand et al. [90] noted regarding the significance of explaining machine perception, we presented passengers with a semantic segmentation map as part of the vehicle's perception state. In a video-based experiment, segmentation information provided in automated vehicles increased driver situational awareness [11]. To offer semantic segmentation by superimposing segmented objects on the WSD, we incorporated PIDNet [93] to represent the state of the vehicle's machine perception. A PIDNet-S model trained on the Cityscapes dataset with a small number of parameters [14] demonstrated an mIoU of 78.6% and 93.2 FPS. We assigned the color yellow to the labels car, truck, bus, motorcycle, bicycle, caravan, trailer, person, and rider while omitting other classes such as sky, road, sidewalk, building, vegetation, parking, traffic signs, and traffic lights (see Figure 5). The color coding and removal of classes were intended to provide an adequate amount of information, preventing passengers from being visually overloaded. We also aimed to avoid the need for passengers to decipher the meaning of each color, which could cause additional cognitive load.
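The class filtering and single-color overlay described above could be sketched as follows, assuming a NumPy label map from the segmentation network. The class ids and the alpha value here are illustrative, not the actual Cityscapes train ids or the values used in the system.

```python
# Sketch of the class-filtered yellow overlay for the WSD.
import numpy as np

HIGHLIGHT_CLASSES = {13, 14, 15, 17, 18, 24, 25}  # vehicles + people (example ids)
YELLOW = np.array([255, 255, 0], dtype=np.uint8)

def build_overlay(label_map):
    """Return an RGBA overlay: yellow where a highlighted class was
    detected, fully transparent elsewhere."""
    h, w = label_map.shape
    overlay = np.zeros((h, w, 4), dtype=np.uint8)
    mask = np.isin(label_map, list(HIGHLIGHT_CLASSES))
    overlay[mask, :3] = YELLOW
    overlay[mask, 3] = 180  # semi-transparent so the road scene stays visible
    return overlay
```

Keeping non-highlighted classes fully transparent is what prevents the overlay from occluding the scene.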

3.2.2 Traffic Risk Prediction (Attention/Explanation Timing). We developed a custom 3D CNN model to predict the traffic risk probability (Figure 4, lower), which was used to control the explanation timing. The 3D CNN model takes an image volume with a width of eight frames and classifies whether the video contains an accident. We trained the model with the Car Crash Dataset [3], resulting in a 93.74% validation accuracy. We considered the sigmoid output of the classification model as the vehicle's interpretation of the in situ traffic risk, providing explanations when the probability was greater than .5 for conditions with risk-adaptive explanations.
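The timing rule above reduces to gating the explanation on the classifier's sigmoid output. A minimal sketch, with the frame-buffering details assumed and the model passed in as a callable:

```python
# Sketch of the risk-adaptive timing rule: show the explanation only
# when the predicted risk probability exceeds .5 (per the paper).
from collections import deque

RISK_THRESHOLD = 0.5
WINDOW = 8  # the 3D CNN consumes an image volume of eight frames

class RiskGate:
    def __init__(self, predict_risk):
        self.predict_risk = predict_risk  # frames -> sigmoid probability
        self.frames = deque(maxlen=WINDOW)

    def should_explain(self, frame):
        """Buffer frames; once the window is full, show the explanation
        only when the predicted risk exceeds the threshold."""
        self.frames.append(frame)
        if len(self.frames) < WINDOW:
            return False
        return self.predict_risk(list(self.frames)) > RISK_THRESHOLD
```

In the "always" conditions this gate is simply bypassed and the explanation is rendered every frame.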
Grad-CAM [80] was also used to visualize the vehicle's attention, showing what prompted it to judge whether a given driving scenario was hazardous. Grad-CAM computes the back-propagated gradients up to the final convolutional layer to generate a class activation map. The Grad-CAM output was color-coded, omitting low-saliency regions (<.8) to prevent the information overload of a heatmap covering the entire WSD (Figure 6).
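The saliency cutoff could be implemented as a simple masking step on the normalized activation map, as sketched below. The min-max normalization and the use of the masked map as an alpha channel are assumptions; only the .8 cutoff comes from the text.

```python
# Sketch of the Grad-CAM saliency cutoff: values below .8 become fully
# transparent so the heatmap does not cover the whole WSD.
import numpy as np

def mask_low_saliency(cam, cutoff=0.8):
    """Normalize a Grad-CAM map to [0, 1] and zero out low-saliency
    regions; the result can serve as the heatmap's alpha channel."""
    cam = cam.astype(np.float32)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    cam[cam < cutoff] = 0.0
    return cam
```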

3.3 Sanity Test for ML-predicted Risk as a Predictor of Passenger Arousal
We conducted a pilot study to investigate the relationship between traffic risk probability and passenger arousal, aiming to explore the potential of traffic risk as an unobtrusive, indirect predictor of passenger arousal. Physiological signals, such as electrodermal activity (EDA) or pupil diameter, change in response to the cognitive demands or arousal that a task might induce (e.g., event-related EDA [16,56], task-evoked pupillary responses (TEPR) [20]). Our study focused on these physiological responses to varying levels of traffic risk while in automated vehicles rather than responses to specific tasks. We did not differentiate between sources of arousal, which may be cognitive load, fear, anxiety, or demand for situational awareness, but assessed how external risks, as quantified by the risk prediction algorithm, influenced the arousal of the passenger watching the environment. The Granger-causality test was applied to determine whether traffic risk probability could significantly predict the EDA and pupillary responses, indicators of passenger arousal. The subsequent subsections delve into further details of the pilot experiment.

3.3.1 Pilot Experiment Settings.
We recruited five participants (two female) with an average age of 24.6 years (SD = 0.89). The participants were seated in front of a screen with Shimmer3 sensors attached to their index and middle fingers, measuring their electrodermal activity at 16Hz. A Tobii Pro X2-60 eye-tracker was also set below the screen to capture the participants' gaze activity and pupil dilation at 60Hz. While factors such as lighting conditions can influence pupil dilation, the experiment was conducted indoors, ensuring consistent illumination.
Initially, the participants observed nine dot stimuli for eye-tracking calibration. They then watched a ten-minute nature relaxation video to stabilize the EDA signal and establish a baseline for pupil dilation. Following this, participants were asked to watch a 15-minute recording of naturalistic driving. Participants were tasked to view the video as if they were passengers in automated vehicles, focusing on the overall experience rather than annotating each specific traffic event.
3.3.2 Result and Analysis. We conducted a Granger-causality test with the predicted traffic risk and the recorded physiological responses. The Granger-causality test evaluates whether the 'effect' variable is influenced by past and present values of the 'cause' variable. Similar methodologies have been used in studies by Lavanuru et al. [51] and Ghouali et al. [25], investigating the causality between physiological response and perceived workload, and between cardiorespiratory and myogalvanic signals during driving tasks, respectively. As we considered the task of watching the video, which simulated the experience of riding in automated vehicles, to be a holistic experience rather than a task with time-specific events, we analyzed the physiological responses recorded over the entire 15-minute duration of the naturalistic drive.
The results of our Granger-causality analysis indicate that traffic risk probability can significantly predict the EDA signal (Table 1). Although traffic risk probability was not a significant predictor of pupil dilation for some participants, it consistently served as a significant predictor of the passengers' EDA signal. Since an increase in EDA is indicative of heightened arousal states, including stress, workload, and anxiety [56,57], our findings suggest that traffic risk probability could potentially be used to predict moments of passenger arousal due to external risks when riding automated vehicles.

3.4 Experimental Conditions
The study comprises seven distinct conditions: the default condition without an explanation (condition 1) plus the combinations of three explanation types (perception, attention, and perception + attention) and two explanation timings (provided continuously or only when it is risky) (see Figure 7).

3.5 Implementation Note
We implemented the proposed system on a computer with an Intel® i9@2.50GHz CPU, 128 GB of RAM, and an RTX 3090 GPU. We used the Unity 3D environment to provide the participants with an extended reality environment guaranteed to render at least 30 frames per second (fps). The Python side computed the segmentation and attention maps at a framerate greater than 15 fps. Because both sides transfer images through TCP socket communication, any visual explanation can be added to our system with an appropriate communication configuration. The detailed system implementation, including the TCP-based communication framework, is available at https://github.com/GWANGBIN/WW2E.

Fig. 7. Seven experimental conditions for the user study. Three explanation types (perception, attention, and perception + attention) and two explanation timings (continuously presented (always) and only when it is risky (if risky)) were tested.

USER STUDY
We conducted a user study to compare the passenger experience when algorithmic explanations of automated vehicles' perception and attention states were provided. To ensure ecologically valid experimental settings, the study was conducted on actual roads with a wizard experimenter under a naturalistic driving scenario. We investigated usability, trust, perceived safety, situational awareness, cognitive load, preference, and other factors associated with the acceptance of automated vehicles by exposing participants to a variety of explanation settings.
In addition, as indications of arousal, we assessed the physiological responses of participants during the ride.

4.1 Participants
We recruited 30 participants (8 female) with an average age of 28.4 years (SD = 8.34, min = 19, max = 50). Since we assumed highly automated vehicles with SAE Levels 4 and 5, we did not restrict participation to driver's license holders. Of the participants, 23 held driver's licenses, with an average of 6.72 years of driving experience (SD = 7.43, min = 1, max = 30). All participants were Korean nationals, and the user study was approved by the Institutional Review Board.

4.2 Procedure
The user study was conducted with the following experimental protocol and driving scenarios.

4.2.1 Protocol.
Initially, the participants were instructed about the experiment and wore an E4 wristband. We opted for the E4 over the Shimmer3 used in our preliminary test, as it offered a firmer body attachment and allowed for multi-modal physiological measurements (PPG sampled at 64Hz and EDA sampled at 4Hz).
Participants then filled out questionnaires regarding their age, driving experience, and trust propensity. Following this, they experienced 8-12 min of naturalistic driving along the routes shown in Figure 8, traversed in ascending order.
During the ride, participants wore a Varjo VR-2 HMD and were instructed to behave like passengers in highly automated vehicles, without needing to control the vehicle. Though a wizard driver controlled the vehicle during the experiment, we informed participants that this driver was present primarily for safety and regulatory reasons and would only intervene with the vehicle's operations at the beginning or end of the experiment or to handle specific experimental scenarios. This was done to prevent the participants from perceiving the vehicle as non-automated during the experiment, despite any subtle auditory cues that the wizard driver might have produced. We believe that the presence of the 360° camera and the machine-generated explanations provided to the passengers also supported the deception that they were in an automated vehicle.
Participants were exposed to seven different explanation conditions. These included the default condition without an explanation and three explanation types (perception, attention, and both) for each of the two explanation timings (continuously provided and provided only when conditions are evaluated as risky). Using a balanced Latin square, the explanation condition was counterbalanced to ensure that the driving route and order of the experimental conditions did not influence the results. After each condition, participants provided their responses. The experiment concluded with a semi-structured interview in which participants numerically rated their preferences, explained their reasoning, and suggested improvements. On average, the entire experiment took approximately 2 hours and 30 minutes per participant. We informed participants they could halt the experiment if they experienced discomfort from motion sickness, yet no such requests were made during the study.
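For seven conditions, a balanced Latin square can be built with the standard Williams-design construction sketched below; because seven is odd, the mirrored sequences must be appended, yielding 14 orderings. How the 30 participants were assigned to these sequences is not detailed in the text, so this is an illustration of the construction only.

```python
# Sketch of a balanced Latin square (Williams design): each condition
# precedes every other condition equally often across the rows.

def balanced_latin_square(n):
    """Return the presentation orders (conditions numbered 1..n).
    For odd n, the mirrored rows are appended, giving 2n orders."""
    rows = []
    for subject in range(n):
        row = []
        j, k = 0, 0
        for i in range(n):
            if i % 2 == 0:
                val = (subject + j) % n
                j += 1
            else:
                k += 1
                val = (subject + n - k) % n
            row.append(val + 1)
        rows.append(row)
    if n % 2 == 1:
        rows += [list(reversed(r)) for r in rows]
    return rows
```

For n = 7 the first order is 1, 7, 2, 6, 3, 5, 4, i.e., the familiar zig-zag over the condition list.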

4.2.2 Driving Scenarios.
To ensure naturalistic experiments and maintain external validity concerning road types, we diversified the types of roads within the experimental site. These were counterbalanced over experimental conditions. Routes #1 and #7 are urban roads, each 2.4km long with a speed limit of 60km/h and 12 crosswalks (10 equipped with traffic lights). Routes #2 (#6) and #5 are arterial roads, each 3.1km long with a speed limit of 70km/h and 6 crosswalks with traffic lights. Routes #3 and #4 are local highways, 2.9km long, with a speed limit of 80km/h and 10 traffic-light-controlled crosswalks. The experiments were conducted from 9 am to 6 pm to account for varying traffic volumes while ensuring ample light for the 360° camera.

4.2.3 Automation Wizard.
Rather than providing an experimental protocol to various drivers, one of the authors (a 30-year-old male with 10 years of driving experience) served as the automation wizard, fully understanding the study objectives (we referred to [17]). The wizard driver was instructed to cautiously follow the designated route, maintain 50-80% of the speed limit, and avoid abrupt lane changes, sudden acceleration, or deceleration. However, responses to unpredictable road events, such as reducing speed for an inappropriately overtaking vehicle, were acknowledged as inevitable.

4.3 Measurement
We collected questionnaires, interviews, and physiological responses to triangulate each method's results.

4.3.1 Questionnaire. Usability was tested based on the System Usability Scale (SUS) [7]. The SUS evaluates the usability of a system with 10 questionnaire items on a 1-5 Likert scale, transformed into a 0-4 scale for a total of 100 points. A system is considered to have acceptable (above average) usability when its SUS score is greater than 68 [54]. Passenger trust towards automated vehicles was assessed using the scale of trust in automated systems, which comprises 12 questionnaire items [37]. Situational awareness was assessed using the Situational Awareness Rating Technique (SART) with a 7-point Likert scale [86], and cognitive load was measured using the mental demand item of NASA-TLX [31] on a 0-10 point scale. While most of the measures were adapted from questionnaires with confirmed reliability and validity, the mental demand item is a single-item subscale of NASA-TLX. We intended to capture the immediate mental demand with a minimal item right after the experiment, before participants reported on the detailed experience, to exclude the effect of the cognitive load of answering the survey itself. However, the limited reliability of a single-item questionnaire should also be noted.
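The SUS scoring described above (1-5 items mapped to 0-4 contributions, summed, and scaled to 100) could be sketched as follows; the standard SUS convention that odd-numbered items are positively worded and even-numbered items negatively worded is assumed here.

```python
# Sketch of SUS scoring: ten 1-5 ratings -> a 0-100 usability score.

def sus_score(responses):
    """responses: ten 1-5 ratings; items at even indices (items 1, 3, ...)
    are positively worded, those at odd indices negatively worded."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 item responses")
    total = 0
    for i, r in enumerate(responses):
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5  # 40 raw points scaled to 0-100
```

A uniformly neutral response (all 3s) scores 50.0, while the threshold for acceptable usability cited above is 68.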
We also measured the dependence, understandability, familiarity, and propensity to trust as a way to model the acceptance of automated vehicles in terms of the Reliability/Competence, Understanding/Predictability, Familiarity, and Propensity to Trust subscales based on Q1-6, Q7-10, Q11-12, and Q15-17 of the trust in automation scale provided by Körber [47], respectively. Attitudes towards technology, self-efficacy, anxiety, willingness (behavioral intention), and perceived safety, each of which was also used to form our acceptance model, were measured using the Attitude Towards (Using) Technology, Self-Efficacy, Anxiety, Behavioral Intention (to use the Vehicle), and Perceived Safety subscales of the AVAM questionnaire [33], i.e., Q13-15, Q16-18, Q19-21, Q22-23, and Q24-26, respectively.

4.3.2 Physiological Response. Using the E4 wristband, we measured the participants' physiological responses to triangulate the results of the questionnaires and interviews. The measurements included body temperature, heart rate (HR), and electrodermal activity (EDA). Heart rate was calculated from the PPG signal for analysis. The EDA signal was preprocessed by omitting data with values of less than .05 and smoothing with a Gaussian window of width 8, leaving repeatedly measured sample frames from 23 participants. We then decomposed the EDA signal into phasic and tonic components using the MATLAB-based Ledalab software [5,6].
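The preprocessing steps described above could be reproduced roughly as sketched below, with SciPy standing in for the original tooling. The mapping from an 8-sample window to a Gaussian sigma is an assumption made for illustration, as is dropping (rather than interpolating) the sub-threshold samples.

```python
# Sketch of the EDA preprocessing: samples below .05 are treated as
# artifacts and dropped, then the signal is smoothed with a Gaussian
# approximating an 8-sample window.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def preprocess_eda(eda, floor=0.05, window=8):
    eda = np.asarray(eda, dtype=float)
    eda = eda[eda >= floor]   # discard low-amplitude artifact samples
    sigma = window / 4.0      # assumed window-width-to-sigma conversion
    return gaussian_filter1d(eda, sigma=sigma)
```

The phasic/tonic decomposition itself was done in Ledalab, so it is not reproduced here.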

4.4 Results
The subsections below describe the results for each aspect of the passenger experience. Descriptive statistics for the survey results are given in Table 2. All measurements underwent skewness and kurtosis normality checks and were compared between conditions using a two-way repeated-measures analysis of variance and Holm post-hoc analysis (Figure 9). We also checked the internal reliability of each questionnaire; all questionnaires showed valid internal consistency, with Cronbach's alpha higher than the acceptable threshold of 0.7 [78].
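The analysis pipeline could be sketched as below: a two-way repeated-measures ANOVA over explanation type and timing, followed by Holm-corrected pairwise comparisons. Synthetic ratings stand in for the survey data, statsmodels/pandas stand in for the original analysis software, and the sketch covers only the 3×2 factorial cells (the baseline condition sits outside that design).

```python
# Sketch of the two-way RM-ANOVA + Holm post-hoc analysis (synthetic data).
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
rows = []
for subject in range(30):
    for etype in ["perception", "attention", "both"]:
        for timing in ["always", "if_risky"]:
            # Synthetic effect: perception rated higher than the others.
            base = 4.0 if etype == "perception" else 3.0
            rows.append({"subject": subject, "type": etype, "timing": timing,
                         "rating": base + rng.normal(scale=0.5)})
df = pd.DataFrame(rows)

# Two within-subject factors: explanation type and explanation timing.
anova = AnovaRM(df, depvar="rating", subject="subject",
                within=["type", "timing"]).fit()

# Holm correction over a set of post-hoc p-values (illustrative values).
reject, adj_p, _, _ = multipletests([0.002, 0.040, 0.300], method="holm")
```

`anova.anova_table` reports F, degrees of freedom, and p for the two main effects and their interaction, matching the F-statistics quoted in the subsections below.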
Table 2. Descriptive statistics for the survey results from study participants are shown. Bold highlights the best case, while underline indicates conditions higher than the baseline. For most measures, a higher value represents a better experience, but for cognitive load and preference, a lower value denotes a better result.

Trust. Upon comparing each condition, it was found that condition 2 (perception, always) yielded the highest trust, significantly higher than conditions 3 (attention, always) and 6 (attention, if risky), i.e., t = 3.458, p = .010 and t = 4.078, p = .001, respectively. Condition 7 (perception + attention, if risky) ranked second highest and was trusted significantly more than conditions 3 and 6, i.e., t = 3.014, p = .038 and t = 3.635, p = .006, respectively. Some participants trusted the vehicle more without explanations, as failure cases had a greater impact on their trust than the vehicle's abilities. P25 and P29 mentioned that imperfect information led to distrust, while P7 and P21 felt anxious about or distrusted the vehicle when its perception was not perfect. Participants also noted specific failure cases beyond the experimental vehicle's design capabilities, such as not looking at traffic lights (P11) or not checking the left and right sides of the car (P15).

Perceived Safety.
Perception and perception + attention information were preferred over attention information in terms of perceived safety, i.e., F(2) = 6.819, p = .002, with t = 3.367, p = .004 and t = 2.998, p = .008. The interaction effect shows that risk-adaptive explanations enhance perceived safety only when combined with the perception + attention explanation, i.e., F(2) = 3.951, p = .025. Specifically, explanation condition 2 (perception, always) was perceived to be the safest, significantly more so than conditions 3 (attention, always) and 6 (attention, if risky), with t = 3.452, p = .011 and t = 4.173, p < .001, respectively. Explanation condition 7 (perception + attention, if risky) was rated the second safest, and only these two conditions ranked higher than the baseline condition without an explanation.
In most cases, risk-adaptive explanations had adverse effects on perceived safety. This is because the moment individuals experience a driving hazard does not necessarily correspond with the algorithmic decisions. For example, participants expressed concerns when the vehicle's judgment of a traffic hazard differed from their own: "The vehicle's judgment of a traffic hazard differed from my perspective as a driver" (P16, P25, P29), and when explanation timings were irrelevant: "Some of the explanation timings were irrelevant; they were offered notwithstanding the actual risk" (P19, P30). They also felt less safe when the vehicle did not provide an explanation despite imminent danger, fearing it would not handle the issue appropriately.

Situational Awareness (SART). Providing perception information resulted in higher situational awareness than attention information, i.e., F(2) = 5.885, p = .005, with t = 3.425, p = .003. Explanation condition 5 (perception, if risky) promoted the highest situational awareness, followed by condition 2 (perception, always). Condition 7 (perception + attention, if risky) supported the third-highest situational awareness, and the other three conditions were not superior to the baseline condition without an explanation. Overall, risk-adaptive explanations supported higher situational awareness than continuous presentation. Participants appreciated selective information delivery in high-risk scenarios (P20, P22, P24), but some found the abrupt appearance of information disruptive (P14). A mismatch between perceptions of risk and risky driving conditions contributed to negative experiences with risk-adaptive explanations.
Regarding the SART subscales, the demand subscale varied significantly depending on the explanation type, F(2) = 8.237, p < .001. Conditions with perception information scored significantly higher than those with attention information, with t = 3.817, p < .001.

Cognitive Load.
Providing perception information resulted in the lowest cognitive load, i.e., F(2) = 5.120, p = .009. By contrast, attention information exhibited the highest cognitive load, significantly higher than perception information, with t = 3.328, p = .018, and higher than the default condition, albeit without a statistically significant difference.
The implicit nature of attention information, which required a deliberate interpretation process, led to increased cognitive demand, as observed in the interviews. Participants found it difficult to understand the attention information (P10) and noted that perception (segmentation) information was more direct, while attention information needed interpretation (P18, P26). They also had to interpret why the vehicle paid attention to specific areas (P30).

Preference (Rank).
The passenger experience also created different preferences among the explanation options, i.e., F(2) = 5.607, p = .006. Provision of the perception information was preferred, i.e., t = 3.337, p = .004 (the measure is the preference rank, so lower values indicate a higher rank). Explanation condition 2 (perception, Always) ranked the highest (M = 2.97, SD = 1.88) and significantly higher than condition 3 (attention, Always), i.e., t = 3.284, p = .020. Conditions 4, 5, and 7 were less favored than the default condition without an explanation.

Physiological Responses.
Although most of the signals were statistically insignificant, the phasic EDA differed significantly among the conditions, i.e., F(6) = 2.232, p = .044 (Figure 10). Specifically, the Holm post-hoc analysis showed that the phasic EDA for condition 4, in which the perception and attention information were continuously displayed (M = .164, SD = .198), was significantly higher than that for condition 1, the default without any explanations (M = .071, SD = .110), with t = 3.104, p = .049.
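For reference, the Holm step-down procedure used in the post-hoc analyses can be sketched as follows; the function name and the raw p-values are illustrative, not part of the study's analysis code.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjustment of raw p-values.

    Returns adjusted p-values in the original order, clipped to 1.0
    and forced to be non-decreasing along the sorted order, as the
    step-down procedure requires.
    """
    m = len(p_values)
    # Process p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Step-down multiplier (m - rank) for the rank-th smallest p.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Illustrative raw p-values from three pairwise comparisons.
adjusted = holm_adjust([0.004, 0.020, 0.049])
```

With three comparisons, the smallest p-value is multiplied by 3, the next by 2, and the largest by 1, which is why a family of pairwise tests can remain significant after correction.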
Li et al. [56] have reported that phasic EDA, also referred to as the skin conductance response, is the signal most responsive to cognitive load. Our results indicate that condition 4, which presents perception + attention information constantly, induces the highest cognitive load, consistent with the highest self-reported cognitive load. In addition, the insignificant difference in physiological responses between conditions 1 (default) and 7 (perception + attention, if risky) suggests that such arousal can be alleviated by delivering the explanation only when a driving scenario is evaluated as hazardous.
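As an illustration of how a tonic/phasic split of an EDA signal can be obtained, the following is a minimal moving-average baseline sketch; it is a simplification for exposition, and dedicated toolboxes (e.g., cvxEDA) implement more principled decompositions than this.

```python
def split_eda(signal, window):
    """Split an EDA signal into tonic (slow baseline) and phasic
    (fast response) parts using a centered moving-average baseline.

    signal -- list of skin-conductance samples
    window -- moving-average window length in samples
    """
    n = len(signal)
    half = window // 2
    tonic = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        segment = signal[lo:hi]
        tonic.append(sum(segment) / len(segment))  # local baseline
    # The phasic component is the residual above the baseline.
    phasic = [s - t for s, t in zip(signal, tonic)]
    return tonic, phasic
```

A flat signal yields a zero phasic component, while short bursts above the baseline appear in the phasic residual, which is the portion compared across conditions.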

Lessons From Study Participants.
During interviews, participants suggested ways to enhance in-vehicle explanation visualization. Participant P4 recommended sharpening the contour of the segmentation map and removing its color fill to avoid interfering with visibility, while P21 advocated user-customizable explanations with diverse visualization options. Also, P30 suggested on-demand explanations, P5 advised continuous display of the segmentation map coupled with attention-map presentation only under hazardous conditions, and P29 proposed employing a distinct color code on the segmentation map to denote hazardous situations. These suggestions provide intriguing prospects for designing in-vehicle explanations that are less visually and cognitively demanding.
We also observed that people are insensitive to minute alignment errors between their viewpoints in the physical and virtual vehicles. Although our system was designed to properly track the position and orientation of an HMD in a moving vehicle, the drift of the IMU sensor and intrinsic inaccuracies in the HMD system caused an angular deviation from the participant's reference viewpoint. Most participants instinctively corrected for this by adjusting their head orientation. However, they remained oblivious to their physical head orientation until the deviation was rectified using a VR base station. This observation is consistent with findings from VR redirected-walking experiments, which manipulated undetectable gains [67].

What Type of Explanation? Sharing the Perception State of Automated Vehicles Was Favored Over Attention Information in Most Passenger Experience Measures (RQ1).
The provision of perception information through the WSD was deemed to have the highest usability, trust, and perceived safety among the explanation conditions tested. It fostered greater situational awareness without increasing cognitive burden. This result is consistent with prior research [11, 12] showing that segmentation visualization promotes passenger trust and situational awareness while reducing cognitive load. On the other hand, despite being the most widely employed among AI engineers and dataset experts, the saliency-based attention map (Grad-CAM) was less effective in promoting the end-user passenger experience than the perception state itself, or in some measures, than the condition without explanations. The most frequently mentioned problem regarding the attention heatmap was its indirectness. One must interpret why the vehicle is paying attention to a given object in terms of the object's behavior and the potential consequences of the situation. Since the driving scene changes rapidly, individuals may be unable to absorb and analyze implicit information quickly enough. Hence, explanations should be sufficiently clear, either by providing direct and obvious information or by applying additional algorithms to translate indirect explanations into a more human-centered format.

When to Explain? Traffic Risk-adaptive Explanations Can Be Effective When the Amount of Information Is Overwhelming (RQ1).
Explanation timing had no main effect, but it had an interaction effect with the explanation type on trust and perceived safety. Specifically, risk-adaptive explanations improved the levels of trust and perceived safety when combined with the perception + attention explanation type, whereas they had adverse effects for the perception-only and attention-only explanation conditions. Such enhanced passenger experiences also led to greater user preferences. Participant responses indicate that providing explanations based on predicted traffic risk can prevent information overload, particularly when excessive amounts of visual information are presented. The lower arousal of participants, measured by phasic EDA, in condition 7 (perception + attention, if risky) compared with condition 4 (perception + attention, Always) supports the idea that risk-adaptive explanations can be an effective strategy for reducing passenger burden in automated vehicles. Moreover, risk-adaptive explanations supported higher situational awareness for all types of explanations, despite the reduced amount of information delivered.

While numerous studies have reported the role of explanations in fostering trust and acceptance of automated vehicles, the specific aspects of passenger experience affected by explanations, and how they translate into acceptance, have not been actively modeled. In our research, we focused on the provision of perception and attention state information, as well as the timing of these provisions, and how they can lead to automated vehicle acceptance, mediated by passenger experience and other UX-related measures. Referring to Körber [47] and Choi and Ji [9], we established a latent growth model to understand how provisioning explanations affected passenger experience and the perceived capabilities of the vehicles (Figure 11). The model fits well, with the comparative fit index (CFI) at .961 (>.9), the Tucker-Lewis index (TLI) at .957 (>.9), and Bollen's relative fit index (RFI) at .916 (>.9). The provision of perception and attention information positively affects the perceived capabilities of vehicles and passenger experience, with perception information having a greater impact than attention information. In addition, the model indicates that situation awareness and familiarity with automated vehicles are the most important factors in determining the perceived capabilities of automated vehicles and passenger experience.
We connected the latent growth model to a structural equation model designed to represent the relationship between perceived capabilities, passenger experience, and acceptance. Referring to Hewitt et al. [33], the model views acceptance in three ways: willingness (behavioral intention to use the vehicle), self-efficacy, and attitudes toward the technology. The model fits well, with the CFI at .953 (>.9), TLI at .920 (>.9), and Bollen's RFI at .953 (>.9). The structural equation model reveals that, under automated vehicle explanation scenarios, perceived safety, trust, the user's propensity to trust, and understandability are the most crucial factors in facilitating user acceptance of automated vehicles. Since propensity to trust is an individual factor, explanations should be designed to enhance automated vehicle acceptance by promoting perceived safety and trust among the passenger experience factors and understandability among the user-perceived capability factors.
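The incremental fit indices reported for these models (CFI and TLI) are computed from the fitted model's and the null baseline model's chi-square statistics and degrees of freedom. The sketch below uses illustrative values, not the study's model output.

```python
def cfi(chi2_m, df_m, chi2_0, df_0):
    """Comparative fit index from the model (m) and null-baseline (0)
    chi-square statistics; values above ~.9 indicate acceptable fit."""
    d_m = max(chi2_m - df_m, 0.0)        # model misfit beyond its df
    d_0 = max(chi2_0 - df_0, d_m)        # baseline misfit (at least d_m)
    return 1.0 - d_m / d_0 if d_0 > 0 else 1.0

def tli(chi2_m, df_m, chi2_0, df_0):
    """Tucker-Lewis index (non-normed fit index) from the same inputs."""
    r0 = chi2_0 / df_0                   # baseline chi-square per df
    rm = chi2_m / df_m                   # model chi-square per df
    return (r0 - rm) / (r0 - 1.0)

# Illustrative values: a model chi-square of 50 on 40 df against a
# baseline chi-square of 500 on 45 df.
print(cfi(50, 40, 500, 45), tli(50, 40, 500, 45))
```

Unlike the CFI, the TLI penalizes model complexity through the chi-square-per-degree-of-freedom ratios, which is why the two indices are conventionally reported together.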
Nonetheless, it is important to note that the model, though fitted with 210 data samples, reflects the experiences of just 30 participants. While the model aptly represents the study participants' experiences, it may not universally represent all passengers, and the results should be interpreted in the context of this limitation.

Understanding the Scope of Explanations and Managing Error Types are Crucial for Building Trust in Automated Vehicles
Ensuring that passengers understand the scope of explanations and managing different types of errors are essential for building trust in automated vehicles. Some participants pointed out that the car did not look at the left and right sides of the vehicle, which was outside the scope of our WSD explanation. Participants' reactions to different types of errors in explanations were also noteworthy. When the vehicle did not appear to perceive a particular object, participants questioned its capabilities. However, they were not as concerned when the vehicle's segmentation was erroneously superimposed on roads, trees, or traffic lights. Participants either did not detect the error, believed the vehicle segmented the image due to exceptional circumstances, or did not care about the error, interpreting it as possible precautionary behavior. Thus, regarding "what to explain," they were more accepting of false-positive errors (perceiving vehicles or pedestrians when none were present) than false-negative errors (failing to perceive vehicles or pedestrians that were present).
Regarding "when to explain," participants were more tolerant of false negatives (not explaining even when a situation was dangerous). Those who did not prefer risk-adaptive explanations questioned the vehicle's criteria for judging a situation as risky. Their complaints primarily concerned false positives, where the vehicle provided an explanation in situations they did not perceive as risky. Conversely, they were not as concerned about false negatives, where the vehicle did not provide an explanation despite a perceived danger; they believed the vehicle had coped with the situation safely and did not perceive it as a hazard. While explanations should be designed to minimize errors, a rigorous investigation into how individuals perceive different types of errors can be leveraged to enhance the passenger experience, which may vary depending on how individuals are engaged in monitoring tasks.
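The timing errors discussed here can be quantified by comparing the moments at which explanations were shown against the moments passengers judged risky. The following is a minimal sketch with hypothetical labels; the data structures are assumptions for illustration, not instruments used in the study.

```python
def timing_error_rates(explained, perceived_risky):
    """Compute false-positive and false-negative rates of explanation
    timing against passenger-perceived risk.

    explained[i]       -- True if an explanation was shown at moment i
    perceived_risky[i] -- True if the passenger judged moment i risky
    Returns (false_positive_rate, false_negative_rate).
    """
    pairs = list(zip(explained, perceived_risky))
    fp = sum(1 for e, r in pairs if e and not r)       # explained, no perceived risk
    fn = sum(1 for e, r in pairs if not e and r)       # silent despite perceived risk
    negatives = sum(1 for _, r in pairs if not r)
    positives = sum(1 for _, r in pairs if r)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr
```

Separating the two rates matters here precisely because participants weighted them asymmetrically: a high false-positive rate drew complaints, while false negatives were largely forgiven.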

LIMITATIONS & FUTURE WORK
6.1 On-road Simulators Can Be Complemented by Indoor Experiments and Actual Implementation
Our system, while enhancing ecological validity by allowing experiments in actual vehicles on real roads, still has limitations compared to genuine self-driving cars. For instance, it constricts the participant's view with an HMD that suffers from latency, a reduced field of view, and a lower frame rate compared to human vision. Moreover, our system cannot distinguish between the translational movement of the HMD and that of the vehicle, thus restricting the HMD's movement. Prolonged use also leads to rotational disorientation of the HMD due to IMU drift, as discussed in section 4.4.8. To create on-road simulators with higher fidelity, additional sensors such as OBD-II, GPS, and GNSS could be used to track the motion of both the HMD and the vehicle [63].
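The yaw-drift correction this discussion refers to can be sketched as a simple compensator that re-anchors the HMD yaw against the vehicle's reference heading whenever the vehicle is near-quiescent. The function name, gain, and threshold below are assumptions for illustration, not our system's actual implementation.

```python
def compensate_yaw(hmd_yaw, vehicle_yaw, accel, compensation,
                   acc_threshold=0.1, gain=0.02):
    """One update step of a simple HMD yaw-drift compensator.

    When the vehicle's acceleration magnitude is below acc_threshold
    (near-stationary, so the yaw offset is mostly IMU drift), pull the
    compensation term toward the HMD-vehicle yaw offset; otherwise
    hold it, since the offset may reflect a genuine head turn.
    Returns (corrected_yaw, new_compensation).
    """
    offset = hmd_yaw - vehicle_yaw
    if abs(accel) < acc_threshold:
        # Low-pass update toward the observed offset.
        compensation += gain * (offset - compensation)
    return hmd_yaw - compensation, compensation
```

Run repeatedly while the car is stopped, the compensation converges to the drift offset, so the corrected yaw returns to the vehicle's reference frame without an explicit base-station recalibration.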
Additionally, our method may be less suitable for automated vehicles of SAE Level 3 or lower, since the car is controlled by a wizard driver and the passenger passively experiences the drive without any possibility of intervention. As these vehicles allow for driver take-over, safety measures related to driving, such as performance during and after take-over [8, 45], or critical scenarios that can lead to potential crashes, should be tested using indoor simulators. Thus, the choice of a simulator platform should depend on the experiment type, considering the trade-off between naturalistic scenarios and fully controllable settings. Indoor simulators and on-road testing methods can complement each other in designing explanations for automated vehicles.
The limitations of Wizard-of-Oz automated driving simulations should also be noted. Detjen et al. [17] expressed concerns that the discovery of the wizard could influence passengers' perceptions and behaviors. In contrast, Schneider et al. [77] found in their WoZ study, conducted on actual roads, that informing passengers about the wizard driver did not affect their experience of the automated vehicle simulation. In our study, we did not ask whether each participant had noticed the non-automated nature of the experiment. Consequently, the passenger behavior observed may not precisely reflect behavior in real automated vehicles. Furthermore, the use of a head-mounted display (HMD) in our mixed-reality-based Wizard-of-Oz study may have prevented participants from actively engaging in certain non-driving-related tasks (NDRTs), such as eating or drinking. Therefore, implementing visual explanations on an actual windshield display (WSD) should be considered to further validate results from Wizard-of-Oz studies.
6.2 Effect of NDRTs, Information Quantity, and Explanation Modality Should Be Further Explored.
Since working memory is a finite resource, the task of understanding in-vehicle explanations can compete with ongoing NDRTs. Our study is limited in that it did not incorporate complex NDRTs like eating, drinking, reading, or interacting with multimedia, owing to experimental constraints such as the HMD. However, the quantity of information and the modality of explanations should be carefully designed, depending on the NDRT type, to effectively deliver the explanation to the passenger without imposing additional cognitive load.
Passengers in automated vehicles engage in NDRTs that require visual, auditory, and motor skills [82]. While motor tasks, such as those involving handheld devices, were key considerations in designing take-over requests [87], they become less significant in highly automated vehicles that do not require human intervention. Instead, the impact of NDRTs on human memory resources and the quantity of information conveyed through different channels become crucial. For instance, auditory channels can effectively relay information in vehicles when passengers are engaged in NDRTs [4], as they do not distract from the visual attention necessary for tasks like reading and watching [32]. Therefore, further exploration of textual [43], auditory [19], or sonic explanations [23] could improve the passenger experience in automated vehicles. Given that our platform is based on the Unity 3D engine, incorporating audio-based applications such as speech recognition [49], text-to-speech generation [4], and natural language understanding [38, 48] would establish an environment suitable for testing verbal explanations or natural interactions [1] with automated vehicles.

Passenger Perceived Risk and Demand for Explanation Are More than Binary.
We evaluated the explanation conditions using a risk-prediction model trained on the Car Crash Dataset [3] to classify driving scenarios into two categories based on the probability of a risk. However, driving scenarios and passenger responses are often more complex and nuanced than such a binary framework can capture. For instance, Li et al. [55] derive situational risk from three scenarios: speed, traffic, and abnormal behaviors, while Wiegand et al. [89] identify six categories to describe unexpected driving behaviors. Drawing from these systematic approaches to situational classification and established research on driving situation analysis [71, 91], self- and external interruptions [24, 36, 81], and driver interruptibility [39, 40, 44], the moments when passengers require explanations can be detected more accurately.
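One small step beyond a single binary cut-off, which also addresses the "abrupt appearance" complaint reported earlier, is a two-threshold (hysteresis) trigger that keeps an explanation visible until the risk estimate drops well below the activation level. The thresholds below are assumed values for illustration, not those of our risk model.

```python
def hysteresis_trigger(risk_probs, on_threshold=0.7, off_threshold=0.4):
    """Decide per time step whether to show an explanation.

    Turns the explanation on when the risk probability rises to
    on_threshold, and off only after it falls to off_threshold,
    avoiding rapid flicker near a single binary cut-off.
    """
    showing = False
    decisions = []
    for p in risk_probs:
        if not showing and p >= on_threshold:
            showing = True            # risk high enough: start explaining
        elif showing and p <= off_threshold:
            showing = False           # risk clearly subsided: stop
        decisions.append(showing)
    return decisions
```

For a risk trace hovering around the cut-off, a single-threshold rule would toggle the display on and off every few frames, whereas the hysteresis rule produces one sustained explanation episode.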

Passenger Experience with Vehicle Attention Can Be Enhanced by a More Accurate Saliency Map.
Despite its popularity for describing the behavior of deep-learning models, displaying an attention map to the passengers did not enhance the passenger experience and was not preferred over displaying the perception information of the vehicle. However, we stress that this outcome should be taken in light of our particular saliency map configuration, which could be enhanced by employing alternative algorithms. Whereas the saliency map applied in the current study is generated by a CNN model that predicts the risk probability, driving a car requires a more comprehensive visual analysis than predicting the likelihood of a traffic accident. Consequently, saliency maps generated by algorithms that cover more complex activities of automated vehicles, such as end-to-end driving [41], may improve the passenger experience. In addition, CNN-based solutions may produce poor saliency maps for driving decisions, i.e., they highlight irrelevant regions such as bushes over the road horizon and along the edges of the road [52]. Incorporating more accurate saliency maps can enhance the passenger experience by showing that the vehicle is focused on the regions crucial for driving decisions.
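The combination step at the core of Grad-CAM, which underlies the attention maps discussed here, can be sketched as follows. This is a generic illustration of the published algorithm, not our study's exact code; the array shapes are assumptions.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Combine a conv layer's activations (C, H, W) with the gradients
    of the target score (here, risk probability) w.r.t. those
    activations to produce a saliency map (H, W) normalized to [0, 1].
    """
    # Channel weights: globally average-pooled gradients per channel.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of activation maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU keeps only regions with positive influence on the score.
    cam = np.maximum(cam, 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The resulting low-resolution map is then upsampled to the camera frame and color-coded before being overlaid on the display; regions the model never positively attends to are zeroed out by the ReLU, which is one reason the map can appear sparse or miss objects a passenger expects to see highlighted.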
The current study tested WSD-based explanations, and the potential of additional visualization approaches should be further investigated. For example, the Tesla ADAS features a visualization of surrounding objects, and its effectiveness should be examined. Colley et al. [11] discovered that although AR visualization is preferable to a tablet placed on the center fascia, a tablet-based system similar to that developed by Tesla can potentially enhance the passenger experience [10]. An on-road comparison of visualization methods, such as the ADAS features of the Tesla Autopilot and the visualization methods suggested by the study participants, can inspire the development of explainable algorithms that generate effective explanations.

The Platform Can Be Extended to Test the Interaction Between Passengers and Automated Vehicles with Parallel Autonomy.
The current study investigated the passenger experience with explanations in a highly automated car with a wizard driver. In our experiment, passengers only viewed the Wizard-of-Oz-based automated ride of the vehicle. However, safety and regulatory considerations make shared-autonomy vehicles a more likely and immediate future, as they do not totally eliminate the role of drivers. Collaborative autonomy provides safer driving because humans and artificial intelligence can cross-check and assist one another [59]. Among the different varieties of shared autonomy, cars with parallel autonomy operate as "guardian angels" that avoid potential accidents by adjusting human driving [79]. The platform used in this study can be expanded to test vehicles with parallel autonomy using natural language services or displays that are less distracting, such as HUDs, center fascias, or optical see-through MR applications.
Future studies may include advancing the platform to test the "guardian angel" feature for automated vehicles with parallel autonomy. By integrating a drive-by-wire system, sensors, and algorithms for self-driving, the platform can test various types of feedback and explanation methods for parallel automated vehicles (e.g., a parallel autonomy research platform [66]). More detailed descriptions of the vehicle's state and decisions, with expanded modalities such as verbal or textual explanations during or after a driving adjustment, can be tested to enhance the passenger experience. Such a guardian system can also be applied with implicit interactions [84] to promote safe control by the driver and to minimize discrepancies between the self-driving algorithm and the human driver in an unobtrusive manner.

CONCLUSION
In this study, we examined the impact of the explanation types and timing mechanisms provided in automated vehicles on passenger experience, using a mixed-reality Wizard-of-Oz self-driving simulator. We compared three types of windshield-display explanations: perception, attention, and a combination of perception and attention. Through a human-subject experiment conducted on actual roads, we validated previous indoor study results, confirming that sharing the perception state itself enhanced perceived usability, trust, safety, and situation awareness. In addition, we leveraged the benefits of outdoor experiments, which provide a more realistic sense of risk when testing explanations. Specifically, we utilized Grad-CAM attention to highlight risky regions under naturalistic driving scenarios and provided explanations selectively depending on traffic risk levels. Although attention information alone was not highly favored, the risk-adaptive strategy for explanation delivery was effective in the perception + attention condition, where passengers were provided with extensive information. Our study emphasized the importance of suitable explanations for fostering understandability, safety, and trust, consequently promoting the acceptance of automated vehicles. However, our findings also suggest a nuanced perception among participants regarding the "what" and "when" aspects of explanations, which can be leveraged in tailoring in-vehicle explanations.

Fig. 3. Calibration process for horizontal rotation of the HMD of the user in a moving vehicle.

Fig. 4. Overview of algorithms used in our study to provide an explanation.

Fig. 6. (a) Color-coded Grad-CAM attention map and (b) attention map overlaid on the augmented reality head-up display.

Fig. 8. (a) Driving routes and (b) experimental protocol of our user study experiment.

Fig. 11. Modeling acceptance of automated vehicles mediated by passenger experience with explanation provisioning.

while the car is driving do
    … ← … − …
    … ← … − …
    if … < acc_threshold then
        φ_compensate ← 0
    end if
end while

Table 1. Results of the Granger-causality test for risk probability relative to the EDA signal and pupil dilation.
How Do Explanations Help Acceptance? Explanations Foster Acceptance by Promoting Understandability, Perceived Safety, and Trust (RQ2).