ACAV: A Framework for Automatic Causality Analysis in Autonomous Vehicle Accident Recordings

The rapid progress of autonomous vehicles (AVs) has brought the prospect of a driverless future closer than ever. Recent fatalities, however, have emphasized the importance of safety validation through large-scale testing. Multiple approaches achieve this fully automatically using high-fidelity simulators, i.e., by generating diverse driving scenarios and evaluating autonomous driving systems (ADSs) against different test oracles. While effective at finding violations, these approaches do not identify the decisions and actions that caused them, information that is critical for improving the safety of ADSs. To address this challenge, we propose ACAV, an automated framework designed to conduct causality analyses for AV accident recordings in two stages. First, we apply feature extraction schemas based on the messages exchanged between ADS modules, and use a weighted voting method to discard frames of the recording unrelated to the accident. Second, we use safety specifications to identify safety-critical frames and deduce causal events by applying CAT, our causal analysis tool, to a station-time graph. We evaluated ACAV on the Apollo ADS, finding that it can identify five distinct types of causal events in 93.64% of 110 accident recordings generated by an AV testing engine. We further evaluated ACAV on 1206 accident recordings collected from versions of Apollo injected with specific faults, finding that it can correctly identify causal events in 96.44% of the accidents triggered by prediction errors, and 85.73% of the accidents triggered by planning errors.


INTRODUCTION
Autonomous Vehicles (AVs) are set to bring about a paradigm shift in transportation. AVs operate through the use of advanced Autonomous Driving Systems (ADSs), which eliminate the need for human drivers to control the vehicle's movements. ADSs are considered highly security-critical systems, as malfunctions can result in severe consequences [1,4,22]. For example, a minor error in trajectory prediction can lead to potentially hazardous or even fatal situations for passengers, other road users, and pedestrians. Thus, it is imperative for AV developers to subject ADSs to rigorous testing to ensure their accuracy and reliability. Given that on-road testing suffers from several limitations (such as safety risks and high expenses), simulation-based testing in high-fidelity simulators such as SVL [49] and CARLA [23] has emerged as a popular approach for evaluating AVs.
Many researchers utilize search-based [7-9, 24, 47, 64] and sampling techniques [15,28,33,56,61,62] to generate and execute test cases against a set of testing oracles in simulation environments. This provides a controlled and repeatable means of evaluating AVs without the risks associated with real-world testing. For instance, AV-Fuzzer [38] uses fuzzing to generate scenarios that cause safety violations such as near-collisions and actual collisions. LawBreaker [52], also based on fuzzing, further evaluates AVs against specifications of national traffic laws (e.g., rules for crossing junctions). While these methods are effective at finding different violations, they typically do not provide insight into the specific decisions and actions of the AV that ultimately caused the violations. Such information is critical for engineers to improve the safety and reliability of AVs but is time-consuming and laborious to extract manually, especially in large-scale testing frameworks. This problem has been emphasized in a recent study [39]: given the vast amounts of driving recordings collected during testing, there is an urgent need for automated tools to support ADS engineers, e.g., in tasks such as clipping and interpreting.
Causality analysis has been proposed within the software engineering community as a means to assist developers in deducing the underlying causes of faulty behaviors observed in a failed test case. This technique has shown notable effectiveness in analyzing complex systems [11,18,19,59]. Unfortunately, given an AV accident recording extracted from a simulator, it is non-trivial to apply existing causality analysis techniques due to two main challenges. First, ADSs consist of multiple independent, decoupled modules that communicate via message passing. Thus, minor faults in one module can eventually propagate into serious faults in other modules. For instance, an incorrect trajectory prediction may be used by an AV's planning module in a way that leads to an accident: in this context, the planning module is not solely to blame. Second, the analysis space of a typical accident recording is huge, requiring new approaches for identifying the accident-related segments that should be focused on.
To address these challenges, we present ACAV, a framework for Automatic Causality analysis of AV accident recordings. Our approach consists of two stages: accident recording simplification and causality analysis, summarized in the high-level workflow diagram of Figure 1. In the first stage, we define and apply feature extraction schemas based on the messages exchanged between ADS modules. These schemas are used to vectorize information about the map, as well as the AV's perception, prediction, and planning. We then propose a weighted voting method to integrate the slicing plans generated by these schemas, allowing for segments unrelated to the safety violation to be discarded. In the second stage, we identify safety-critical frames using an a priori method based on safety specifications extracted from the driver's handbook and traffic laws of California. Next, we apply our novel causality analysis tool, CAT, to identify the causal events of an accident by analyzing the Station-Time graph (ST graph). In summary, our framework is designed to identify the safety-critical frames that brought about an accident, and to generate detailed reports enumerating potential causes. This functionality empowers engineers to gain a comprehensive understanding of the accident dynamics without first needing to replay entire recordings.
To evaluate the effectiveness of our framework, we implemented it for Apollo 7.0 [2] and the SVL simulator [49], which are widely used tools in the field of autonomous driving research and development. Using an AV testing engine [52], we collected a total of 110 accident recordings, including accidents involving intersections, merging, and tailgating. We applied ACAV to vectorize and simplify these recordings, finding that ACAV achieved a 62.23% reduction ratio without discarding critical frames, demonstrating its effectiveness in simplifying accident recordings. Upon analyzing the simplified recordings with CAT, our approach identified five distinct types of causal events in 93.64% of the recordings, including incorrect priority prediction (found 26 times), incorrect trajectory prediction (51 times), improper behavioral planning (17 times), unsafe motion planning (67 times), and vehicle out-of-control (103 times). Finally, we further evaluated ACAV on 1206 accident recordings collected from versions of Apollo injected with specific faults, finding that it can correctly identify causal events in 96.44% of the accidents triggered by prediction errors, and 85.73% of the accidents triggered by planning errors.
Our website [6] provides videos of multiple accidents involving the Apollo ADS, together with the complete accident reports generated by ACAV, as well as our source code.
Overall, we make the following contributions:
• Feature extraction schemas for vectorizing map, perception, prediction, and planning information from ADS messages in AV accident recordings.
• A mechanism for identifying and discarding recording segments unrelated to the accident.
• A tool for identifying safety-critical frames from an accident recording by leveraging ST graphs.
• ACAV, which, to the best of our knowledge, is the first modular framework for AV accident analysis and explanation.
• An implementation for Apollo 7.0 and SVL that is able to identify five types of causal faults in AV accident recordings.
The paper is organized as follows. In Section 2, we review some essential background and present a motivational example. Section 3 introduces the design of ACAV, including the detailed algorithms of its two stages. Section 4 evaluates whether ACAV achieves its goal of identifying causal events from AV accident recordings. Finally, Section 5 compares our approach against some related work, before Section 6 concludes.

BACKGROUND AND EXAMPLE

Multi-Module ADSs
The ADSs of AVs are composed of various modules, including perception, localization, prediction, planning, and control. These modules utilize multiple sensors, such as cameras, LiDAR, GNSS, and IMU, that capture raw data (e.g., images, 3D point clouds) about the AV's state as well as the environment it is operating in. To facilitate collaboration among the modules, industrial-level ADSs use a publish-subscribe (i.e., message-based) model for communication. Each module subscribes to one or more channels in the ADS to obtain the required inputs and publishes its output as a message to the corresponding channels.
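To make the publish-subscribe model concrete, the following is a minimal in-process sketch. The `MessageBus` class, the channel names, and the message fields are all hypothetical and chosen for illustration only; they do not reproduce the middleware or message formats of Apollo or any other ADS.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class MessageBus:
    """Toy publish-subscribe bus (hypothetical; not an ADS middleware API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, channel: str, callback: Callable[[Any], None]) -> None:
        # Register a callback that is invoked for every message on `channel`.
        self._subscribers[channel].append(callback)

    def publish(self, channel: str, message: Any) -> None:
        # Deliver the message to every subscriber of `channel`.
        for callback in self._subscribers[channel]:
            callback(message)


bus = MessageBus()
received: List[Any] = []


def on_perception(msg: dict) -> None:
    # A toy "prediction" module: consumes a perceived obstacle and
    # publishes an (empty) predicted trajectory for it.
    bus.publish("/prediction", {"obstacle_id": msg["obstacle_id"], "trajectory": []})


bus.subscribe("/perception", on_perception)
bus.subscribe("/prediction", received.append)  # e.g., a planning module
bus.publish("/perception", {"obstacle_id": 2})
```

Because modules only share channels, a recorder can subscribe to the same channels and log every message, which is what makes message-level accident recordings possible.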
Specifically, the localization module constantly processes data collected from the GPS, IMU, and (sometimes) LiDAR, then publishes messages containing information about the vehicle's position, orientation, and speed. The perception module receives this data, along with additional information from cameras and radars, then publishes data about perceived obstacles in front of the AV. The prediction module receives the messages published by the perception and localization modules to predict the trajectory of the detected obstacles, and publishes the results to the prediction channel. The planning module subscribes to messages from all of the previous modules to make driving decisions, e.g., determining the appropriate speed and acceleration. Finally, the control module converts the trajectory points generated by the planning module into control commands for the chassis, such as steering, throttle, and brake, to ensure the vehicle travels according to the planned trajectory.

Planning module. The ADS's planning module performs three main functions: route planning, behavioral planning, and motion planning [31,45,50]. Given a destination, route planning selects a route by choosing a list of lanes and junctions from the map. This route serves as the reference line for behavioral planning and motion planning. Behavioral planning is responsible for making high-level driving decisions based on the current driving scenario to interact with pedestrians and other vehicles safely. For instance, when the AV detects a construction area ahead of its lane, behavioral planning needs to consider both the dynamic behavior of surrounding traffic participants and the road conditions to decide how to bypass it, e.g., by changing lanes. Lastly, motion planning translates high-level decisions into a series of waypoints as part of an executable trajectory, which can be translated into throttle and steering commands by the control module.
Behavioral and motion planning are critical tasks of the planning module, translating the path obtained from route planning into a series of waypoints by calculating specific speed and acceleration plans. This ensures that the AV interacts safely and comfortably with other traffic participants in the current scenario. Various planning techniques employ distinct approaches to integrate the three essential functions. For example, the lattice planner [57], a graph search-based technique, performs behavioral planning and motion planning implicitly and simultaneously under the guidance of well-designed cost functions. In contrast, the EM planner [26] performs behavioral planning and motion planning explicitly and step-by-step. In addition, the Frenet frame method is a well-known approach for describing the motion and trajectories of vehicles, which decouples vehicles' lateral and longitudinal motion, corresponding to the lateral and longitudinal control. The longitudinal behavioral and motion planning can be visualized effectively in a Station-Time graph (ST graph), where time is the horizontal axis, the planned longitudinal trajectory distance is the vertical axis, and the planned longitudinal trajectory is a curve, as shown in Figure 2. Additionally, the curve's gradient represents the longitudinal speed of the vehicle.
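The relationship between the ST curve and speed can be illustrated with a short sketch. It assumes the planned curve is available as a list of (time, station) samples, which is an illustrative representation rather than the planner's actual data format, and approximates the gradient by finite differences.

```python
from typing import List, Tuple


def longitudinal_speeds(st_points: List[Tuple[float, float]]) -> List[float]:
    """Finite-difference gradient ds/dt between consecutive (t, s) samples.

    `st_points` is a hypothetical sampling of an ST curve: time in seconds,
    station (longitudinal distance along the path) in meters.
    """
    speeds = []
    for (t0, s0), (t1, s1) in zip(st_points, st_points[1:]):
        if t1 <= t0:
            raise ValueError("time stamps must be strictly increasing")
        speeds.append((s1 - s0) / (t1 - t0))
    return speeds


# A decelerating plan: the curve flattens, so the gradient (speed) drops.
curve = [(0.0, 0.0), (1.0, 10.0), (2.0, 18.0), (3.0, 24.0)]
print(longitudinal_speeds(curve))  # [10.0, 8.0, 6.0]
```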

Motivating Example
To introduce the concept of accident causality and demonstrate how our framework works, we elaborate with an accident driving recording collected from version 7.0 of Apollo. As illustrated in Figure 3, we summarize the scenario in an accident driving recording as six critical scenes, the demo video of which can be found on page 3 of 'Video Demos' on [6]. Initially, the AV drove alone without encountering any traffic signals (Scene 3a). However, it later detected traffic signals and non-player characters (NPCs) as it was approaching an intersection (Scene 3b). The AV made an 'overtake' decision with respect to NPC 4 and executed it (Scenes 3c-3d). After overtaking NPC 4, it interacted with NPC 2, making a 'yield' decision, but still collided with NPC 2 (Scenes 3e-3f).

Accident-related recording segment. As we described above, the AV did not detect any NPCs nor was it near any NPCs in Scenes 3a-3b. In contrast to the other four scenes, these first two had no impact on the accident. Furthermore, during simulation tests, the AV persisted in moving forward after a collision, rather than stopping; this behavior, too, had no influence on the accident. Given that our objective is to perform accident analysis, our framework is designed to automatically identify and exclude such segments that are unrelated to the accident. The remaining segments are then fed into the causality analysis stage of our framework. For this recording, ACAV can significantly reduce its length from 17 seconds to 4 seconds, without removing any critical frames. This reduces the workload of ADS engineers, who can then immediately focus on the most important parts of the recording.

Safety-critical frames. As depicted in Scene 3c, the AV made an 'overtake' decision with respect to NPC 4 near the intersection. This decision was unsafe because it violates traffic regulations and increases the risk of accidents. Our framework uses a priori knowledge to label frames containing potential accident risks, such as this one, as safety-critical frames. Moreover, each of the frames marked as safety-critical (e.g., Scenes 3c-3e) will be individually inspected by our framework.

Causality analysis. Our framework can automatically identify causal events in an accident recording, as shown in Table 1. According to the table, the AV chose to overtake NPC 4 at the intersection due to failing to predict NPC 2's trajectory (Scenes 3c-3d). When the AV finally correctly predicted the trajectory of NPC 2 (Scene 3e), it was traveling at a speed of 39km/h (approximately 10.83m/s) and was less than 14 meters away from the NPC. Despite making the appropriate 'yield' decision, there was insufficient time and space to carry it out, resulting in a collision. The causal factors of this accident can be attributed to the incorrect priority and trajectory prediction by the prediction module, the flawed decision made by the planning module, and the vehicle's skidding.
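A rough back-of-the-envelope calculation illustrates why the 'yield' decision in Scene 3e came too late. It uses only the speed and distance reported above, and rests on simplifying assumptions that are not part of the recording analysis: constant speed for the time estimate, constant deceleration for the braking estimate, a stationary conflict point, and no reaction or actuation latency.

```latex
\begin{align*}
v &= \frac{39\ \text{km/h}}{3.6} \approx 10.83\ \text{m/s} \\
t_{\text{reach}} &< \frac{d}{v} = \frac{14\ \text{m}}{10.83\ \text{m/s}} \approx 1.29\ \text{s} \\
a_{\text{stop}} &> \frac{v^{2}}{2d} = \frac{(10.83\ \text{m/s})^{2}}{2 \times 14\ \text{m}} \approx 4.19\ \text{m/s}^{2}
\end{align*}
```

Under these assumptions, the AV had less than about 1.3 seconds before reaching the NPC's position, and a full stop within the remaining distance would have required a sustained deceleration above roughly 4.2 m/s².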

FRAMEWORK DESIGN
As illustrated in Figure 1, our framework consists of two main stages: accident recording simplification and causality analysis. In the following sections, we provide a detailed explanation of the two stages of our framework and present an implementation of it for Baidu Apollo 7.0. Our framework is available online [6].

Stage #1: Accident Recording Simplification
In the first stage, the primary objective is to extract short segments from a long driving recording related to an accident. The idea is to provide coarse-grained filtering based on the scenario information.
To accomplish this, we partition the driving recording into a series of frames and assign three vectors to each frame to capture information on the current scenario, ranging from the environment to the maneuvers and status of the AV and NPCs. We use a scenario-based recording segmentation technique to merge the frames, and subsequently, we use specifications derived from the driver handbook and traffic laws of California to identify segments of the recording that are relevant to accidents.

Data Collection. A prerequisite for our approach is to collect several accident driving recordings, which requires generating numerous testing scenarios for testing ADSs holistically against different safety oracles. To satisfy this prerequisite, we adapted the AVUnit framework [63], which provides domain-specific languages (DSLs) for specifying testing scenarios and oracles, as well as a fuzzing engine for obtaining effective test cases. Our adaptation extends the fuzzing engine by adding a recorder that captures the corresponding driving recording for each test case, i.e., each test case is captured in a single recording file. The set of initial configurations we used in our experiments includes different combinations of starting points, destinations, and NPCs. To handle the varying routes of all the combinations, the duration of each recording file was set at 60 seconds, which can cover all possible durations of a single test case. We ran the fuzzing engine for two days, generating 1260 test cases, including 131 accident test cases. The combined length of all these recordings exceeded 21 hours. After the termination of the fuzzing algorithm, we selected 110 accident driving recordings based on the output of AVUnit and classified them into three categories (intersection, merging, and rear-end accidents), excluding 21 accident test cases in which the car crash occurred after the AV stopped at its destination or the AV got hit from behind by an NPC.

[Table 1: Causal events identified in the motivating example. Recoverable entries: 2.6s, wrong motion planning, AV skidding sometimes (the AV's planning speed is too fast or too slow); 4.3s, accident.]

Message Alignment and Vectorization. In a multi-module ADS, the modules collaborate by asynchronously exchanging and processing messages. The content of each message varies depending on the module that published it. To facilitate causality analysis, we select and align messages from the communication channels of the map, localization, perception, prediction, and planning modules, each of which has a different publishing frequency. We divide the recording into several frames, each of which has a duration of 0.08s (chosen because the localization module has the fastest frequency, publishing messages every 0.08s). If a channel publishes messages within the frame, we hold the messages and align them to the beginning of the frame. If a channel does not publish any messages within the frame, we copy the last message generated before it to the beginning of the frame.
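The alignment rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: each channel is modeled as a time-sorted list of (timestamp, message) pairs, the channel names and payloads are hypothetical, and when a channel publishes several messages within one frame the sketch keeps the most recent one (the text does not specify this case).

```python
from typing import Any, Dict, List, Optional, Tuple

FRAME_DURATION = 0.08  # seconds per frame, as stated in the text
EPS = 1e-9             # tolerance for floating-point time stamps


def align_messages(
    channels: Dict[str, List[Tuple[float, Any]]],
    n_frames: int,
) -> List[Dict[str, Optional[Any]]]:
    """Assign to every frame one message per channel.

    A message published inside a frame is aligned to that frame; if a channel
    is silent during a frame, its last earlier message is carried forward.
    A channel that has not published anything yet maps to None.
    """
    cursor = {name: 0 for name in channels}
    latest: Dict[str, Optional[Any]] = {name: None for name in channels}
    frames = []
    for k in range(n_frames):
        frame_end = (k + 1) * FRAME_DURATION
        for name, msgs in channels.items():
            i = cursor[name]
            # Consume every message published before the end of this frame.
            while i < len(msgs) and msgs[i][0] < frame_end - EPS:
                latest[name] = msgs[i][1]
                i += 1
            cursor[name] = i
        frames.append(dict(latest))
    return frames


recording = {
    "localization": [(0.00, "L0"), (0.08, "L1"), (0.16, "L2")],
    "perception": [(0.10, "O0")],
    "planning": [(0.05, "P0")],
}
frames = align_messages(recording, n_frames=3)
```

In this toy recording, the single planning message is copied into the second and third frames, while the perception channel is empty in the first frame because nothing had been published yet.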
In the vectorization phase, our primary objective is to extract information related to accidents, which can be associated with the map, perception, prediction, and planning modules in the ADS. To comprehensively capture information from across these modules' messages, we have designed three feature extraction schemas: one for the map, one for perception and prediction, and one for planning. Each schema includes factors that impact the AV's planning, or properties that reflect its current planning status.
The map schema contains information on whether the AV is at a junction, crosswalk, or near a stop sign, as well as the color of the perceived traffic signal. The perception and prediction schema includes four lists of NPCs, indicating which NPCs the AV is approaching, which are in close proximity, and which are predicted as ones to take 'caution' of or 'ignore'. The planning schema includes information on the main driving decision the AV currently executes, the operational design domain (ODD), motion, and whether it is safe according to the responsibility-sensitive safety (RSS) rules [51].
It is worth noting that these are fundamental components among industry-level ADSs such as Autoware and Apollo. Specifically, the ODD defines the specific operating conditions and scenarios in which an AV is designed to function safely and effectively. For instance, Autoware's ODDs include 'Lane Following', 'Lane Change', and 'Pull Out', among other scenarios, each suggesting the appropriate scene module in Autoware that should be launched to handle the specific driving situation. Similarly, Apollo's ODDs consist of scenarios such as 'Lane Change', 'Lane Borrow', and 'Path Assess', indicating the corresponding decider/optimizer in Apollo that should be activated to make informed driving decisions. To ensure safety and responsible behavior, the planning schema utilizes RSS rules, which are designed to formalize concepts such as dangerous situations, appropriate responses, and the allocation of blame in a mathematically rigorous manner. Our framework converts each frame into three feature vectors based on these three schemas. Each feature vector contains specific semantic properties, with each dimension representing a particular attribute.
For instance, in the feature vector for the map schema, we have four dimensions indicating whether the AV:
1. is near an intersection (the distance between the AV and the intersection is less than 5 meters);
2. is near a crosswalk (the distance between the AV and the crosswalk is less than 5 meters);
3. is near a stop sign (the distance between the AV and the stop sign is less than 5 meters);
4. has detected traffic signals.
Thus, the vector $\langle 0, 0, 1, 0 \rangle$ indicates that in the current frame, the AV is approaching a stop sign, not in an area near an intersection or a crosswalk, and not encountering any traffic signals. In this way, we transform the driving recording into a list of feature vectors while preserving the abstract semantic information of each frame, facilitating subsequent segmenting and pruning.

Segmenting and Pruning. After the frame vectorization stage, the framework segments the recording by comparing the similarity of consecutive feature vectors. The idea is to group together sequential frames with identical feature vectors into a single segment. For example, if the AV drives on a road segment for 100 uninterrupted frames, then the feature vectors of these 100 frames are the same, and they will be clustered as a single segment based on the static map environment schema. Our framework generates segmentation plans for each of the three types of vector schemas previously described. These segmentation plans are fused using a weighted voting method that determines the optimal clipping point. For each frame, a general voting function can be defined for any weighted combination of feature vectors. Let $w_i$ denote the weight of the $i$-th category of feature vectors, and $v_i$ denote the vote cast by the $i$-th category, where $i$ ranges over the three categories (map, perception and prediction, and planning). The voting function (Equation 1) combines the weighted votes $w_i v_i$ and returns true or false, indicating (respectively) whether the current vector should be deemed as a clipping point (i.e., last frame of the segment) or not. The weight of the vote by each category is discussed in Section 4.2.1.

Numerous AV accident reports indicate that most accidents happen in specific contexts, e.g., at intersections, or when there are multiple traffic participants [20,27,55]. Armed with this knowledge, our approach creates an overapproximation of relevant frames to narrow down our focus to the most crucial situations. To achieve this, we seek out and discard irrelevant frames by analysing static map environments as well as perception and prediction information. To classify a frame as irrelevant, we consider several factors. First, we check that the static environment of the frame includes no junction, crosswalk, or stop sign. Next, we verify that the AV is neither approaching nor near any NPC in the frame. Finally, we ensure that the AV does not predict a 'caution' or 'ignore' priority for any NPC. If all of these conditions are met, we classify the frame as an irrelevant frame. To determine whether to discard a segment $s$, we count the irrelevant frames within it using a function $\mathit{irr}(s)$, and compute the irrelevant frame ratio $r_s = \frac{\mathit{irr}(s)}{\mathit{len}(s)}$, where $\mathit{len}(s)$ is the number of frames in $s$. If $r_s$ is larger than the threshold $th_r$, $s$ will be discarded; otherwise, it will be kept. We discuss the selection of a particular threshold $th_r$ in Section 4.2.1.
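The irrelevant-frame test and the discard rule can be sketched as below. The `Frame` record, its field names, and the threshold value 0.5 are hypothetical and used only for illustration; the sketch also assumes that a frame counts as irrelevant only when its static environment contains no junction, crosswalk, or stop sign.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Frame:
    # (near_intersection, near_crosswalk, near_stop_sign, traffic_signal)
    map_vec: Tuple[int, int, int, int]
    approaching: List[str] = field(default_factory=list)  # NPCs being approached
    nearby: List[str] = field(default_factory=list)       # NPCs in close proximity
    caution: List[str] = field(default_factory=list)      # NPCs predicted 'caution'
    ignore: List[str] = field(default_factory=list)       # NPCs predicted 'ignore'


def is_irrelevant(frame: Frame) -> bool:
    """All three conditions from the text must hold."""
    no_static_context = not any(frame.map_vec[:3])
    no_npc_interaction = not frame.approaching and not frame.nearby
    no_priority_prediction = not frame.caution and not frame.ignore
    return no_static_context and no_npc_interaction and no_priority_prediction


def should_discard(segment: Sequence[Frame], threshold: float) -> bool:
    """Discard the segment if its irrelevant-frame ratio exceeds the threshold."""
    if not segment:
        raise ValueError("segment must contain at least one frame")
    ratio = sum(is_irrelevant(f) for f in segment) / len(segment)
    return ratio > threshold


empty_road = Frame(map_vec=(0, 0, 0, 0))
at_stop_sign = Frame(map_vec=(0, 0, 1, 0), nearby=["npc_2"])
segment = [empty_road, empty_road, empty_road, at_stop_sign]
print(should_discard(segment, threshold=0.5))  # True: ratio is 0.75
```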
Algorithm 1 summarises the steps of our segmenting and pruning method. Specifically, given three categories of feature vectors of an aligned recording, for a feature vector of a frame within it, if the vector is different from its previous one, then we deem that the frame gets one vote by one of the three feature vector categories (Lines 5-10). After collecting votes from all three categories, we perform voting (Line 11) to decide whether to slice in this frame (Lines 12-14), the definition of which is shown in Equation 1, and the weight selection of which is discussed in Section 4.2.1. We prune the accident-related segments by examining the segments (working backwards) at Lines 17-26. The last segment is deemed as a part of the accident-related segment (Line 17). For other segments, we find the irrelevant map or perception vectors and determine whether to discard them. For a non-irrelevant segment, we merge it into the accident-related segment if the accident-related segment directly follows it, as shown in Lines 21-25. We discuss the selection of a threshold value in Section 4.2.1.
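The segmenting step can be sketched as follows. Since the exact form of the voting function is not reproduced here, the sketch makes an explicit assumption: a category votes 1 when its vector differs from the previous frame, and a slice is made when the weighted sum of votes reaches a threshold. The weights, the threshold, and the category names are hypothetical, and the sketch starts a new segment at the changed frame, which is one possible reading of "clipping point".

```python
from typing import Dict, List, Sequence


def segment_frames(
    vectors: Dict[str, Sequence[tuple]],
    weights: Dict[str, float],
    threshold: float,
) -> List[List[int]]:
    """Group frame indices into segments using a weighted vote.

    Assumption (not from the source): slice when sum(w_i * v_i) >= threshold,
    where v_i is 1 if category i's vector changed at this frame, else 0.
    """
    n = len(next(iter(vectors.values())))
    if n == 0:
        return []
    segments = [[0]]
    for t in range(1, n):
        score = sum(
            weights[c] * (vectors[c][t] != vectors[c][t - 1]) for c in vectors
        )
        if score >= threshold:
            segments.append([t])      # start a new segment at the changed frame
        else:
            segments[-1].append(t)    # frame stays in the current segment
    return segments


vectors = {
    "map": [(0, 0, 0, 0)] * 3 + [(1, 0, 0, 0)] * 2,
    "perception": [(), (), ("npc_2",), ("npc_2",), ("npc_2",)],
    "planning": [("cruise",)] * 5,
}
weights = {"map": 0.5, "perception": 0.3, "planning": 0.2}
print(segment_frames(vectors, weights, threshold=0.5))  # [[0, 1, 2], [3, 4]]
```

With these illustrative weights, a change in the perception vector alone (frame 2) does not trigger a slice, whereas a change in the map vector (frame 3) does.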

Stage #2: Causality Analysis
In the second stage, we automatically analyze the accident-related segments that were generated in the first stage to identify potential causes of the accident. We utilize automotive safety specifications from California's driver handbook [16] and traffic laws [17] to identify safety-critical frames that may have contributed to the accident. Next, we implement a causal analysis tool, CAT, that works by examining speed planning. For frames that are identified as suspicious, CAT compares their current speed planning and actual trajectory to deduce the causal events of the accident. This process enables our framework to effectively identify the causes of the accident and provide valuable insights for future improvements.

Potential Safety-Critical Frame Identification. In order to identify safety-critical frames in an accident-related driving recording segment, our framework uses a frame checker that utilizes a priori knowledge, i.e., a list of specifications extracted from background knowledge. In particular, we examine California's driver handbook [16] (published by the Department of Motor Vehicles, DMV) and traffic laws [17], to obtain a list of specifications for each stage. These specifications include identifying critical obstacles, improper priority prediction, and driving decision-making.
In order to ensure compliance with the rules outlined in the driver's handbook [16], it is necessary to have a robust specification language that allows us to precisely describe these rules. To this end, we have adopted a specification language based on propositional logic. The specification language consists of propositions (based on a set of pre-defined variables), as well as the usual logical connectors. Before introducing the specifications, we first introduce the predefined variables, which can be organized into three categories: state variables, deviation variables, and maneuver variables.

[Algorithm 1: Segmenting and Pruning. Input: all three categories of feature vectors of the original aligned recording before the accident, together with its length. Output: the reduced accident-related segment.]

Firstly, the state variables describe the states of vehicles. Table 2 lists a subset of these variables and their usage in describing vehicle properties. For instance, suppose there is an NPC $n$ driving near a junction with a speed of 5m/s to the front-left of the AV; then the speed variable of $n$ is 5m/s, its direction variable is front-left, and its position-related variable contains the junction.

Secondly, Table 3 summarizes deviation variables to specify various deviation calculations. Here, two of the variables represent the upper and lower speed limits of the road on which the AV is traveling. Two functions represent (respectively) the distance between two objects and the error in trajectory prediction. Additionally, we define a filtering function that selects the NPCs that need to be focused on in a given scenario. This function outputs true if and only if the distance between an object $o$ and the AV is less than three times the current speed of the AV.
Finally, the (subset of) maneuver variables presented in Table 4 reflect the prediction and planning status of the AV. These variables are directly extracted from prediction and planning messages. For example, if the AV is closely and cautiously following an NPC $n$, then the 'caution' priority variable for $n$ would be true, and the 'follow' decision variable for $n$ would be true. The remaining maneuver variables would be set to false.
Here, the AV's priority prediction for an NPC can be roughly divided into three types: 'caution' for a critical NPC, 'ignore' for an immaterial NPC, and 'normal' for the rest.The AV's driving decision towards an NPC can be summarized as a list of maneuvers, including 'ignore', 'stop', 'follow', 'yield', 'overtake', 'nudge', etc.
With the defined variables, we can now describe the specifications checked by our framework. Specifically, it assesses the correctness of the AV's prioritization, trajectory prediction, driving decisions related to NPCs, and speed planning. For instance, to identify an improper 'overtake' decision, we define the specification as: $\mathit{ImproperOvertake}(n) := (\mathit{AV.nearJunction} \lor \mathit{AV.onCrosswalk}) \land \mathit{overtake}(n)$, which means that if the AV decides to overtake an NPC while near an intersection or on a crosswalk, the 'overtake' decision is considered improper. In this case, $n$ refers to perceived objects, such as vehicles, bicycles, or pedestrians. It is important to note that if a specification is satisfied, a vulnerability has been identified. The detailed specifications can be found on our website [6].
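As an illustration of how such propositional specifications can be evaluated on a frame, the sketch below encodes the improper-overtake rule and the distance-based filter for NPCs of interest. All names (`AVState`, `is_focused`, `improper_overtake`) are hypothetical, and the sketch assumes distances in meters and speeds in meters per second, so that "three times the current speed" behaves like a three-second headway.

```python
from dataclasses import dataclass


@dataclass
class AVState:
    speed: float          # m/s (assumed unit)
    near_junction: bool   # AV is near an intersection
    on_crosswalk: bool    # AV is on a crosswalk


def is_focused(distance_to_av: float, av: AVState) -> bool:
    """True iff the object is closer than three times the AV's current speed."""
    return distance_to_av < 3.0 * av.speed


def improper_overtake(av: AVState, decision_for_npc: str) -> bool:
    """(near junction OR on crosswalk) AND the decision for the NPC is 'overtake'."""
    return (av.near_junction or av.on_crosswalk) and decision_for_npc == "overtake"


# Values loosely modeled on the motivating example (Scene 3c).
av = AVState(speed=10.83, near_junction=True, on_crosswalk=False)
print(improper_overtake(av, "overtake"))  # True: the specification is satisfied
print(improper_overtake(av, "yield"))     # False
print(is_focused(14.0, av))               # True: 14 m < 3 * 10.83 m
```

A specification evaluating to true marks the frame as safety-critical, in line with the convention that a satisfied specification indicates a vulnerability.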
Causal Events Deduction. To identify the causes of accidents from the simplified accident recordings, we design a tool called the Causality Analysis Tool, or CAT for short. CAT analyzes frames labeled as safety-critical to determine whether the planning trajectory could intersect with other traffic participants in a way that might cause an accident. If CAT identifies a potential accident scenario, it analyzes the events leading up to that moment and identifies the actions or behaviors that contributed to the scenario. It is worth noting that even if the AV changes its planning in response to a potential accident scenario, incorrect behavior at that moment could waste valuable reaction time and increase the risk of an accident.
To achieve this, our tool analyzes ST graphs depicting the AV's planning states to discover potential causal events. Based on the Frenet frame method, the ST graph provides a visual way to describe longitudinal behavioral and motion planning. Besides directly presenting whether the trajectory plan is collision-free, the ST graph also describes aspects of the AV's driving decisions and speed planning. Specifically, in an ST graph, time is the horizontal axis, the planned longitudinal trajectory distance is the vertical axis, and the planned longitudinal trajectory is a curve. Each point on the curve represents a waypoint on the planned trajectory, and the curve's gradient represents the speed. The motion of other traffic participants can be drawn as rectangles that block certain parts of the AV's longitudinal path during a specific time interval. An ideal speed curve intersects with none of these rectangles so that there is no collision between the AV and NPCs. The positional relationship between the speed curve and an obstacle block in the ST graph presents the AV's behavioral planning result for the related traffic participant. If the obstacle block of a traffic participant is above/below the AV's speed curve, the driving decision by the AV is to yield/overtake, as shown in Figure 4. Therefore, for achieving collision-free trajectory planning, it is imperative that the vehicle accurately perceives all surrounding NPCs and predicts their future trajectories with high precision. This ensures that there is no overlap between the AV and NPCs at each time step. Fundamentally, this planning process equates to solving a constraint satisfaction problem, where the constraints are defined by the drivable area. In an ideal scenario, precise outputs from the perception and prediction modules would enable the computation to guarantee a collision-free trajectory.
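The positional relationship between a speed curve and an obstacle block can be checked with a small sketch. It is an illustration under assumptions rather than CAT's implementation: the curve is a piecewise-linear list of (time, station) points with strictly increasing times, obstacle blocks are axis-aligned rectangles, the station is held constant outside the curve's time range, and the names `Block`, `station_at`, and `classify` are hypothetical.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """Obstacle rectangle in the ST graph: occupied stations over a time window."""
    t_start: float
    t_end: float
    s_low: float
    s_high: float


def station_at(curve: List[Tuple[float, float]], t: float) -> float:
    """Linearly interpolate the planned station at time t (clamped at the ends)."""
    times = [p[0] for p in curve]
    if t <= times[0]:
        return curve[0][1]
    if t >= times[-1]:
        return curve[-1][1]
    i = bisect_right(times, t)
    (t0, s0), (t1, s1) = curve[i - 1], curve[i]
    return s0 + (s1 - s0) * (t - t0) / (t1 - t0)


def classify(curve: List[Tuple[float, float]], block: Block) -> str:
    """Return 'yield', 'overtake', or 'collision' for one obstacle block.

    For a piecewise-linear curve it suffices to probe the block's time
    boundaries and the curve's knots that fall inside the time window.
    """
    probes = [block.t_start, block.t_end]
    probes += [t for t, _ in curve if block.t_start < t < block.t_end]
    stations = [station_at(curve, t) for t in probes]
    if all(s < block.s_low for s in stations):
        return "yield"       # block lies above the curve
    if all(s > block.s_high for s in stations):
        return "overtake"    # block lies below the curve
    return "collision"       # curve enters or crosses the block


curve = [(0.0, 0.0), (2.0, 10.0), (4.0, 30.0)]
print(classify(curve, Block(1.0, 3.0, 25.0, 35.0)))  # yield
print(classify(curve, Block(3.0, 4.0, 5.0, 15.0)))   # overtake
print(classify(curve, Block(1.0, 3.0, 8.0, 12.0)))   # collision
```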
Our tool performs a detailed comparison and analysis of the ST graph from the AV perspective against the ground truth, frame by frame. The idea is that for any given frame in the recording, CAT can reconstruct accurate subsequent trajectories of NPCs using data from the future segments of the recording. This reconstructed trajectory is then treated as the ground truth for assessing the effectiveness of the prediction module. Additionally, we examine the output of the planning module to verify whether it accurately calculates the constraints necessary for ensuring collision-free trajectory planning for the respective frame.
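The ground-truth reconstruction described above can be sketched as follows. The frame layout (a dictionary with a timestamp `t` and an `npcs` map), the function names, and the maximum-displacement error measure are hypothetical choices for this example; the recording format and error metric actually used by CAT are not specified here.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Tuple[float, Point]]  # (seconds after the frame, position)


def ground_truth_trajectory(frames: List[Dict], index: int, npc_id: str,
                            horizon: float) -> Trajectory:
    """Collect the NPC's actual future positions from later frames."""
    t0 = frames[index]["t"]
    trajectory = []
    for frame in frames[index:]:
        dt = frame["t"] - t0
        if dt > horizon:
            break
        position = frame["npcs"].get(npc_id)
        if position is not None:
            trajectory.append((dt, position))
    return trajectory


def max_prediction_error(predicted: Trajectory, truth: Trajectory) -> float:
    """Largest Euclidean gap, assuming both lists share the same sampling."""
    return max(math.hypot(p[0] - g[0], p[1] - g[1])
               for (_, p), (_, g) in zip(predicted, truth))


# Toy recording: one NPC moving along x at 4 m/s, one frame every 0.5 s.
frames = [{"t": i * 0.5, "npcs": {"npc_1": (10.0 + 4.0 * (i * 0.5), 0.0)}}
          for i in range(7)]
truth = ground_truth_trajectory(frames, index=2, npc_id="npc_1", horizon=1.0)
predicted = [(0.0, (14.0, 0.0)), (0.5, (16.0, 3.0)), (1.0, (18.0, 4.0))]
error = max_prediction_error(predicted, truth)
print(truth, error)
```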
The analysis process of CAT is shown in Figure 5. CAT first checks the priority prediction of the NPC involved in the accident. If the NPC's priority prediction is 'ignore', it means that the cause of the collision is wrong priority prediction. This is because AVs do not consider an ignored NPC in the subsequent planning. This omission manifests as a lack of black blocks representing calculated constraints in the ST graph for the NPC, with only the blue blocks indicating the ground truth constraints present. If the AV's speed planning curve does not intersect with the obstacle blocks calculated by the AV but intersects with the obstacle block in the ground truth, it means that the cause of the collision is the AV's misunderstanding of the NPC's future action. This situation is characterized by a significant deviation in the ST graph for the NPC, where there is a clear discrepancy between the constraints calculated by the AV and those of the ground truth.
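The prediction checks above, together with the planning and control checks described in the text that follows, can be read as a decision cascade. The sketch below is one possible encoding under simplifying assumptions: each frame is summarized by a handful of precomputed Boolean facts, and the `FrameFacts` fields, the `CausalEvent` names, and the `"normal"` priority value are hypothetical. It is not the CAT implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CausalEvent(Enum):
    WRONG_PRIORITY_PREDICTION = "wrong priority prediction"
    WRONG_TRAJECTORY_PREDICTION = "wrong trajectory prediction"
    IMPROPER_BEHAVIORAL_PLANNING = "improper behavioral planning"
    IMPROPER_MOTION_PLANNING = "improper motion planning"
    VEHICLE_OUT_OF_CONTROL = "vehicle out of control"


@dataclass
class FrameFacts:
    """Precomputed facts about one safety-critical frame (assumed inputs)."""
    npc_priority: str            # "ignore" as in the text; other values hypothetical
    hits_av_blocks: bool         # speed curve intersects blocks computed by the AV
    hits_truth_blocks: bool      # speed curve intersects ground-truth blocks
    risky_decision_nearby: bool  # an improper driving decision was flagged nearby
    trajectory_deviates: bool    # actual trajectory deviates from the planned one


def deduce_causal_event(facts: FrameFacts) -> Optional[CausalEvent]:
    # 1. Prediction: an ignored NPC never enters planning.
    if facts.npc_priority == "ignore":
        return CausalEvent.WRONG_PRIORITY_PREDICTION
    # 2. Prediction: the plan is safe for the predicted blocks but not the real ones.
    if not facts.hits_av_blocks and facts.hits_truth_blocks:
        return CausalEvent.WRONG_TRAJECTORY_PREDICTION
    # 3. Planning: the curve intersects blocks the AV itself computed.
    if facts.hits_av_blocks:
        if facts.risky_decision_nearby:
            return CausalEvent.IMPROPER_BEHAVIORAL_PLANNING
        return CausalEvent.IMPROPER_MOTION_PLANNING
    # 4. Control: the plan is collision-free but was not executed faithfully.
    if facts.trajectory_deviates:
        return CausalEvent.VEHICLE_OUT_OF_CONTROL
    return None


example = FrameFacts("normal", hits_av_blocks=False, hits_truth_blocks=True,
                     risky_decision_nearby=False, trajectory_deviates=False)
print(deduce_causal_event(example))
```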
If the prediction for the NPC contains no error, CAT checks the AV's behavioral planning and then the motion planning. In the potential safety-critical frame identification step, ACAV filters potential improper driving decisions made by the AV. When CAT checks the behavioral planning result, if the speed curve intersects with any obstacle blocks near the risky driving decision, it means that improper behavioral planning is to blame for the accident. For example, the speed curve in the ST graph improperly extends beyond an NPC's block to overtake it. However, in this particular scenario, the AV is unable to find a viable trajectory to avoid a collision with another NPC. If the speed curve still intersects with other obstacle blocks based on reasonable behavioral planning, it means that improper motion planning caused by risky speed limits is to blame for the accident. In this case, the speed curve in the ST graph demonstrates an insufficient margin relative to the NPC's block, indicating a lack of adequate space to safely avoid the NPC. If CAT finds that the AV's planning is collision-free, it compares the actual trajectory with the planned trajectory. If there is a deviation between the two trajectories, we can infer that the AV failed to execute the planning due to being out of control (e.g., due to skidding).
Generalizability. While we have presented ACAV in the context of Apollo, the overall approach can be generalized to other ADSs, given that it operates solely on accident recordings and does not require knowledge of the specific internal designs of the systems involved in generating the recordings. The primary assumption for employing our framework is thus the ability to generate or obtain similar recordings. Fortunately, modern ADSs typically have multi-module architectures similar to that of Apollo.
We illustrate the generalizability of ACAV by applying it to recordings obtained from the Autoware.universe ADS [3] and the CARLA simulator [23]. We systematically examined the semantic structure of message fields required by ACAV from various modules, including localization, perception, and planning modules. In the case of localization messages, there were similarities between the fields in Autoware.universe and Apollo. Meanwhile, the perception module in Autoware.universe contained tasks related to detecting nearby obstacles and predicting their future trajectories, a functionality akin to the combined roles of perception and prediction modules in Apollo. Nonetheless, some disparities arose in the message structure. Notably, Autoware.universe lacked an obstacle priority field within perception messages and a behavioral planning field within planning messages. To mitigate these differences in message format, we populated the missing fields with default values. As a result, ACAV demonstrated the capability to identify causal events such as wrong trajectory prediction, incorrect speed planning, and instances of vehicles going out of control. However, it was unable to identify causal events related to incorrect priority prediction.
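To make the default-value adaptation concrete, the sketch below fills absent fields before a message is analyzed. The field names (`priority`, `behavior_decision`) and the default values are hypothetical; the text does not state which names or defaults ACAV uses. The example also suggests why priority-related causal events cannot be identified on such recordings: a defaulted priority never takes the value 'ignore'.

```python
from typing import Any, Dict

# Hypothetical field names and defaults, chosen only for illustration.
DEFAULT_FIELDS: Dict[str, Any] = {
    "priority": "normal",
    "behavior_decision": "none",
}


def fill_missing_fields(message: Dict[str, Any],
                        defaults: Dict[str, Any] = DEFAULT_FIELDS) -> Dict[str, Any]:
    """Return a copy of `message` in which absent fields take default values.

    Fields already present in the message are never overwritten.
    """
    filled = dict(defaults)
    filled.update(message)
    return filled


# A perception message without a priority field (toy content).
autoware_like_obstacle = {"id": 7, "predicted_path": [(0.0, 0.0), (1.0, 0.5)]}
normalized = fill_missing_fields(autoware_like_obstacle)
print(normalized)
```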

EVALUATION

4.1 Research Questions & Evaluation Metrics
To evaluate the performance of our framework, we conducted experiments to answer the following research questions:
• RQ1: Which combination of weights for feature vector categories and which threshold in the "segmenting and pruning" phase are the most effective?
• RQ2: Does ACAV effectively simplify accident recordings compared to other approaches?
• RQ3: How many different causal events can the causality analysis of ACAV automatically identify?
• RQ4: To what extent can ACAV accurately identify causal events?
For RQ1 and RQ2, we evaluated the performance of the simplification methods used in the first stage based on two metrics: the 'ratio of reduced frames' and the 'recall of critical frames'. The ratio of reduced frames refers to the length of the removed driving recording over the length of the driving recording before the accident, whereas the recall of critical frames is the number of critical frames in the reduced recording segment over that in the entire recording. Since the subsequent causality analysis relies on these critical frames, we aimed to preserve them as much as possible. Therefore, we initially focused on the recall metric of different methods and then considered their ratio of reduced frames. For RQ3, we assessed the effectiveness of ACAV by analyzing the number of different causal events it could automatically identify based on the simplification of accident recordings. For RQ4, we evaluated the accuracy of our framework in identifying causal events resulting from versions of Apollo injected with specific faults.

Figure 6: Ratio (higher is better) and recall (higher is better) of the pruning method under different thresholds

Table 5: Ratio (higher is better) and recall (higher is better) of different combinations of voting methods
Weight Ratio (map:perc:pln) 1:1:0 1:0:1 0:1:1
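The two simplification metrics defined above can be written down directly. The sketch below represents a recording as a set of frame indices, which is an assumption made for this example, and uses toy numbers rather than values from the experiments.

```python
from typing import Set


def ratio_of_reduced_frames(original: Set[int], kept: Set[int]) -> float:
    """Removed frames over all frames recorded before the accident."""
    if not original:
        raise ValueError("the original recording has no frames")
    return len(original - kept) / len(original)


def recall_of_critical_frames(critical: Set[int], kept: Set[int]) -> float:
    """Critical frames that survive pruning over all critical frames."""
    if not critical:
        raise ValueError("no critical frames to recall")
    return len(critical & kept) / len(critical)


# Toy example: 100 frames before the accident, the last 40 are kept,
# and 4 frames are labeled critical.
original = set(range(100))
kept = set(range(60, 100))
critical = {55, 70, 80, 95}

ratio = ratio_of_reduced_frames(original, kept)
recall = recall_of_critical_frames(critical, kept)
print(ratio, recall)
```

Because the two metrics pull in opposite directions (discarding more frames raises the ratio but risks lowering recall), they need to be read together, as the evaluation does.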

4.2 Experiments and Discussion
4.2.1 RQ1. Different segmenting methods lead to different segmentations of the recording, which can affect the efficacy of test reduction and the final analysis. This is due to the varying contributions of features in depicting a driving scenario. Additionally, using the same contribution for all features can result in many short clips and a lower reduction ratio of original recordings. To design a coarse-grained test reduction method, we evaluated the effectiveness of various combinations of weights assigned to categories of feature vectors and the threshold for determining accident recording segments in RQ1. This method aims to identify and remove non-accident segments to reduce the overall size of the recording for analysis. We first evaluated the performance of different settings of the voting method for frame segmenting when the threshold value was set to 0.8 for pruning and present the results in Table 5. We focused on the recall of critical frames, as this factor can significantly impact the causality analysis conducted by our a priori frame checker and CAT. The results indicated that the voting method with a weight ratio of map : perc : pln = 1 : 1 : 2 (i.e., the method adopted by our framework) achieved the best total recall rate of 94.41% across all frame segmenting methods. This method also had a reduced frame ratio of 62.23%, signifying its effectiveness in removing non-accident recording segments from the analysis. It is also worth noting that the voting method with a weight ratio of map : perc : pln = 1 : 1 : 1 achieved a similar recall rate (93.01%) compared to our method (94.41%) while having the lowest ratio of reduced frames among all the weight combinations. However, we observed that the segments generated by this weight combination were fewer in number and larger in length than those created by our segmenting method, leading to fewer segments being discarded in the recording pruning stage. As a result, insufficient recording pruning allowed this method to maintain a promising recall, but this does not necessarily imply that it is an effective segmenting method. The optimal balance between recall and pruning efficiency is crucial for an effective segmenting method, and our method with the weight ratio map : perc : pln = 1 : 1 : 2 has demonstrated better overall performance in capturing critical frames and pruning irrelevant ones.
To determine the optimal threshold for our segment pruning method, we conducted a series of experiments, adjusting the threshold for identifying accident-related segments in increments of 0.2, starting from 0.2. We focused on the same two metrics: recall and ratio. The results presented in Figure 6 reveal that as the threshold value increases, recall progressively improves. When the threshold value exceeds 0.4, recall consistently remains above 80%. Simultaneously, the ratio gradually decreases as the threshold value rises. From a threshold of 0.2 to 0.8, the ratio experiences minimal change and maintains a level above 60%. However, when the threshold increases from 0.8 to 1, the ratio experiences a substantial decrease compared to previous levels. Based on these findings, we concluded that a threshold of 0.8 is optimal, as it strikes a balance between high record reduction performance and the retention of a sufficient number of safety-critical frames.
4.2.2 RQ2. For RQ2, our objective is to compare our accident recording simplification method with a variety of alternative fixed-length recording pruning methods and the STRaP framework [21], an AV recording simplification method. We set the lengths at 4, 8, 12, and 16 seconds before the accident, considering that the remaining segment length of our approach is approximately between 4s and 16s. The results of our experiment are displayed in Table 6. The rows represent the evaluation metrics of different segmenting and pruning methods, while the columns indicate the various accident categories included in the experiments. A comparison with fixed-length segmenting methods reveals that it is not feasible to establish a fixed remaining length that effectively balances a substantial reduction ratio with a high critical frame recall. Upon further examination of the accident-related segment lengths, we believe that the primary reason for this outcome is the variability in the duration of interaction between the AV and the NPC involved in different accidents. This observation also highlights the utility and generalizability of our approach, which can adapt to a wide range of cases.
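The fixed-length baselines compared above keep only the last few seconds before the accident. The following sketch, with an assumed frame rate, an assumed accident time, and toy critical-frame indices, shows how the window length trades reduction ratio against recall; none of the numbers are taken from Table 6.

```python
from typing import Dict, List, Set, Tuple


def keep_last_seconds(timestamps: List[float], accident_time: float,
                      window: float) -> Set[int]:
    """Indices of frames within `window` seconds before the accident."""
    return {i for i, t in enumerate(timestamps)
            if accident_time - window <= t <= accident_time}


# Toy recording: 60 frames at 2 Hz, accident at t = 30 s.
timestamps = [i * 0.5 for i in range(60)]
accident_time = 30.0
critical = {30, 40, 50, 58}  # hypothetical critical frame indices

results: Dict[float, Tuple[float, float]] = {}
kept_sizes: Dict[float, int] = {}
for window in (4.0, 8.0, 12.0, 16.0):
    kept = keep_last_seconds(timestamps, accident_time, window)
    ratio = 1.0 - len(kept) / len(timestamps)
    recall = len(critical & kept) / len(critical)
    results[window] = (ratio, recall)
    kept_sizes[window] = len(kept)
    print(f"window={window:>4}s  ratio={ratio:.2f}  recall={recall:.2f}")
```

In this toy setting a short window discards many frames but loses early critical frames, while a long window keeps them at the cost of a low reduction ratio, which mirrors the trade-off discussed above.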
As our segmenting and pruning method shares similar goals with the concept of test reduction and prioritization, we further compared our accident recording simplification method with STRaP, which scales redundant segments with similar contents down to a given length to reduce the length of a recording. As shown in RQ1, ACAV's ratio of reduced frames is 62.23% on average. Therefore, we restricted the retained recording length in STRaP to 40% of the number of frames in the original segment to ensure a similar reduced frame ratio, i.e., a ratio of about 60%. In our experiment, STRaP achieved a total reduced frame ratio of 39.43% and a recall rate of 30.81%. We attribute this result to the fact that the STRaP framework, while effective for its intended purpose, modifies the content of the original recordings in a way that distorts the temporal relationships between events and their true durations. This alteration of the original recordings makes STRaP unsuitable for causality analysis.
4.2.3 RQ3. In RQ3, our objective is to determine ACAV's performance on the accident recordings collected for the original ADS. To achieve this, we conducted a comprehensive evaluation by applying our framework to a dataset comprising 110 accident recordings, all generated by an AV testing engine [63]. This dataset encompassed a variety of accident scenarios, including 43 intersection accidents, 31 merging accidents, and 36 rear-end accidents. ACAV successfully identified the causal events for 103 of these accident recordings. However, our study found that ACAV was unable to detect any significant causal events in 7 accident recordings. Upon further examination, we discovered that these accidents merely involved minor scratches between the AV and the NPC, without any severe impacts taking place. This issue can be attributed to the limitations of computational precision, which can be perceived as an engineering challenge arising from the complexities of accurately processing distances.
For the remaining 103 accidents, we conducted a manual verification process. This entailed revisiting all the causality analysis reports by replaying the accident recordings and validating the causal events identified by ACAV. In particular, we conducted a systematic examination of all causal events identified by our framework and present the specific numbers for each accident type in Table 7. These results indicate that ACAV can effectively identify multiple causal events in various accidents, utilizing each causal event defined by CAT. Through ACAV, the events of wrong trajectory prediction were primarily found in merging and rear-end type accidents, while wrong speed planning events occurred more frequently in intersection and merging accidents. It is important to note that all the accidents in our dataset occurred in rainy or snowy weather conditions, which explains the "vehicle out-of-control" event appearing in all 103 accidents.
4.2.4 RQ4. In RQ4, we sought to assess the accuracy of our framework in identifying the causes of accidents. To achieve this, we injected eight distinct fault types, as detailed in Table 8, into the ADS. Specifically, F1 can cause an accident due to wrong priority prediction causal events, while F2 causes accidents based on wrong trajectory prediction. Conversely, F3 through F7 are designed to cause accidents due to improper behavioral planning. Finally, F8 is identified as the trigger for a causal event related to improper motion planning. For each fault type, we ran the testing engine [63] for approximately one day and recorded the resulting accidents. It is imperative to highlight our efforts to ensure the complexity of each recorded test case. We accomplished this by implementing varying extended routes and incorporating multiple NPCs of diverse types. Furthermore, we standardized the duration of each recording file to 120 seconds. In total, we amassed a dataset comprising 1206 accident recordings. Subsequently, we applied our framework to analyze these accident recordings, documenting the accident causal events and their respective time frames. In this experiment, if a causal event's duration significantly surpassed those of other causal events, we deemed it to be the 'main' cause of the given accident. For example, in the case of fault F2, if the injected fault takes effect, it should persist for a sufficient duration to accumulate a noticeable trajectory prediction error, which is crucial for causing accidents. Consequently, the associated causal event, namely, 'wrong trajectory prediction', would be identified in the recording files as the main cause of this fault. If our framework correctly identifies the functions in line with the injected faults, we conclude that our framework accurately determines the cause of the accidents.

Table 8 rows (fault | function@file | injected behavior):
F1 | AssignIgnoreLevel()@obstacle_prioritizer.cc | Assign 'ignore' priority to all the detected NPCs by default.
F2 | PredictObstacle()@predictor_manager.cc | Assign improper trajectory prediction models to NPCs to get erroneous trajectory prediction.
F3 | MakeStaticObstacleDecision()@path_decider.cc | Make 'ignore' decisions to all the static NPCs near the AV's planned trajectory.
F4 | MakeObjectDecision()@speed_decider.cc | Make 'follow' decisions to any NPCs in front of the AV which tend to stop, instead of 'stop' decisions or changing lanes.
F5 | MakeObjectDecision()@speed_decider.cc | Make 'ignore' decisions to an NPC ahead of the AV, if the AV is not following or keeping distance from it.
F6 | MakeObjectDecision()@speed_decider.cc | Make 'yield' decisions to a high-speed NPC accelerating ahead of the AV, which leads to AV's low speed in a fast lane.
F7 | MakeObjectDecision()@speed_decider.cc | Make 'overtake' decisions to any NPC if it is near the AV.
F8 | GetSpeedLimits()@speed_limit_decider.cc | Keep a high speed even being close to NPCs.
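The 'main' cause rule used in RQ4 can be sketched as a comparison of causal event durations. The text only states that the main cause's duration must "significantly" surpass the others, so the `dominance` factor of 2 below is an assumption introduced for this example, as are the function name and the toy durations.

```python
from typing import Dict, Optional


def main_cause(durations: Dict[str, float], dominance: float = 2.0) -> Optional[str]:
    """Return the causal event whose duration dominates all others.

    An event is treated as the main cause when its duration is at least
    `dominance` times the second-longest duration (assumed criterion).
    Returns None when no event dominates.
    """
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] <= 0.0:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (top_event, top_duration), (_, second_duration) = ranked[0], ranked[1]
    return top_event if top_duration >= dominance * second_duration else None


# Toy durations in seconds for one recording.
dominated = {"wrong trajectory prediction": 6.0, "vehicle out of control": 1.5}
ambiguous = {"improper behavioral planning": 3.0, "vehicle out of control": 2.5}
print(main_cause(dominated))
print(main_cause(ambiguous))
```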
As shown in Table 9, ACAV performs well, accurately identifying causal events in 1064 out of the 1206 accident recordings, with a precision of 100.00% for both the prediction and planning modules. This indicates that, for a specific type of fault, our framework can both precisely identify the causal events within the recording and distinguish recordings that do not include these causal events. Furthermore, this is complemented by a recall rate of 95.97% and an accuracy of 96.44% in the prediction module, along with a recall rate of 82.56% and an accuracy of 85.73% in the planning module.
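For reference, the three metrics reported above follow the standard definitions over true/false positives and negatives. The sketch below uses toy counts (not the counts behind Table 9) to show how a precision of 100% can coexist with a lower recall: there are no false positives, but some recordings containing the causal event are missed.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported causal events that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual causal events that are reported."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all recordings that are classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Toy confusion counts for one fault type.
tp, fp, fn, tn = 80, 0, 20, 100
metrics = (precision(tp, fp), recall(tp, fn), accuracy(tp, tn, fp, fn))
print(metrics)
```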
Nevertheless, our investigation uncovered a limitation, as ACAV failed to detect the causal events in 142 accident recordings. Upon a more in-depth examination, we discovered that when faults are injected into the planning module, two or three closely interrelated causal events often occur simultaneously. For instance, in 15 accidents linked to fault F7, an additional causal event surfaced: the vehicle going out of control. This event was attributable to the elevated speed requirement of the 'overtake' decision, particularly evident during inclement weather conditions like rain or snow. We observed that ACAV successfully identified interrelated causal events in 105 out of the 142 accidents.
4.2.5 Threats to Validity. We acknowledge certain limitations and threats to the validity of our evaluation. First, while our approach has been implemented for two distinct platforms (Apollo, simulated with the SVL Simulator, and Autoware.universe, simulated with CARLA), our evaluation is exclusively focused on the Apollo ADS. The reason is that there is currently no suitable fuzzing engine implemented for Autoware.universe. This absence presents a challenge in acquiring sufficient accident recordings for a comprehensive evaluation of our approach on the platform. Second, during testing, we observed that the AV primarily considered NPCs in front of it when planning driving behavior. When an NPC hits the AV from behind, ACAV may not yield effective analysis results. This issue could be addressed by incorporating more intelligent NPC behavior configurations in the simulator, which would better emulate interactions between real-world vehicles. Furthermore, it is generally accepted that the rear vehicle should bear more responsibility in a rear-end accident, a principle that is also practiced in many jurisdictions [5]. Third, it is imperative to acknowledge that the faults injected in RQ4 do not reflect the real-world faults in ADSs. However, the resulting accidents from these injected faults are similar to those caused by real-world faults in ADSs, lending credence to our framework's ability to accurately identify causal factors of accidents. Moreover, the inherent complexity of ADSs, attributed to their reliance on logic-based code, external dependency libraries, and machine learning-based models across various modules, makes repair a significant challenge. As reported in a study [30], more than half of the AV faults originate from incorrect algorithmic implementations or configurations, often involving extensive code segments exceeding 20 lines. Consequently, while our framework can interpret accident recordings and pinpoint potential causes, it should not be considered a panacea for repairing the underlying bugs in ADSs.

RELATED WORK
System-level testing for AVs is designed to evaluate the performance of the entire ADS, as opposed to module-level testing, which focuses on individual modules or specific functionalities. This comprehensive evaluation is achieved through the use of scenario-based test cases and test oracles. Current research in system-level testing primarily focuses on generating corner cases and error-prone driving scenarios. There are two main categories of scenario sources: real-world data and testing frameworks.
One category of work generates scenarios derived from scenarios observed in the real world, emphasizing the similarity between the generated scenarios and real-world ones [43,44]. Zhang et al. [58] proposed a method based on 3D scene reconstruction, which uses images collected by the in-vehicle camera to recreate scenarios as test cases. Gambi et al. [25] proposed AC3R, which extracts information from collision reports and constructs new test scenarios using simulation methods. DEEPCRASHTEST [12] recreates accident scenarios based on accident videos. Fremont et al. [29] combined formal verification with clustering algorithms to select usable test scenarios. There is also an approach [48] that evaluates the performance of ADSs by comparing it with that of human drivers according to features extracted from real-world scenarios.
Another category of work generates scenarios by using a (domain-specific) testing framework. Two widely adopted methodologies are search-based and sampling-based methods [7,8,13,24,32,40,47,64]. Search-based methods, or fuzzing, typically search the parameter space for specific parameter values to achieve a certain testing goal. To guarantee the efficiency of the heuristic search method adopted, e.g., genetic algorithms, a well-defined fitness function is required. Althoff et al. [9] defined a calculating metric, the drivable area, to quantify the search of solution space, and combined reachability analysis with optimization techniques to obtain test scenarios. Li et al. [38] proposed AVFuzzer, which uses safety potential, the distance between the ego vehicle and other traffic participants, as the fitness function for a genetic algorithm-based fuzzer to find scenarios that could lead to collisions. Combining program analysis techniques and evolutionary algorithm-based fuzzing, PlanFuzz [54] defines behavioral planning vulnerability distance as the guidance for the generation of test scenarios that would cause the autonomous vehicle to stop under safe conditions. Sun et al. [52] defined a metric for quantifying the degree to which autonomous vehicles violate traffic rules in a driving scenario, guiding their fuzzer to generate test cases that violate traffic regulations. Sampling-based methods sample from a naturalistic scenario distribution to generate test cases. A series of works [33,56,61,62] has studied sampling in different driving scenarios based on importance sampling [53]. Batsch et al. [15] built a Gaussian process classification model to estimate the safety of a scenario probabilistically, with the training data sampled from simulation-generated traffic congestion scenarios. NADE [28] collects driving scenarios from real-world data and samples to generate realistic and safety-critical scenarios.
The aforementioned works primarily concentrate on evaluating the performance of ADSs comprehensively and identifying new vulnerabilities. However, their focus lies in determining whether the ADSs fail to meet the test oracles, rather than understanding the underlying reasons for these failures. Our method is driven by the goal of analyzing the actual cause of safety violations, such as collisions, by concentrating on the testing process itself.
In recent years, causality has become a widely adopted methodology to analyze complex systems. Forney et al. [35] proposed an interactive platform for fault diagnosis and forensic investigation in fields such as airplane accidents. Bareinboim et al. [10] proposed a causal inference-based method to solve data fusion problems in the context of big data. Biebl et al. [14] presented a causal model to predict accident risks in an intersection for drivers with impairments. In addition to works focusing on AI [11,18,19,59], some works have applied causality to the security analysis of CPSs [34,36,37,41,42]. Zhang et al. [60] monitored, inspected, and located anomalies in industrial control systems using a causal model based on maximum information coefficient and transfer entropy. Poskitt et al. [46] proposed a causality-guided fuzzing method that identifies and generalizes the causality of events in testing to find new test cases with different causal relationships. Our method is designed to employ causality analysis on autonomous driving accident records to facilitate deeper fault analysis and uncover the underlying causes of accidents. By examining the causal factors that contribute to accidents, we can better understand the limitations and vulnerabilities of autonomous driving systems. This, in turn, allows engineers to make more targeted improvements, enhance safety, and reduce the likelihood of similar accidents occurring in the future.

CONCLUSION
We presented ACAV, an automated framework for determining the causal events in AV accidents. We implemented it for both Apollo and Autoware.universe and evaluated our framework using 110 accident driving recordings from the Baidu Apollo ADS, successfully identifying causal events in 103 of them. After analyzing 1206 accident recordings collected from versions of Apollo injected with specific faults, we further showed that it identifies causal events with an accuracy of 96.44% for faults in the prediction module and 85.73% for faults in the planning module.
In future work, we are interested in developing automatic program repair techniques for ADSs, leveraging the results of causality analyses from accidents.By incorporating these advancements, we hope to create a comprehensive framework that can contribute significantly to the safety and reliability of AVs in real-world scenarios.

Figure 1: Overview of ACAV: the first stage vectorizes data exchanged between ADS modules and discards recording segments irrelevant to the accident; the second stage performs a causality analysis using the CAT tool

Figure 3: Motivating example: six key scenes from a recording of an AV accident

Figure 4: Speed planning based on an ST graph

Figure 5:

Table 1: Results of a causality analysis for the example
For NPC 4: improper 'overtake' decision.

Table 3: Deviation variables in the specification language
Number: The threshold of the error of trajectory prediction
Number: The maximum speed limit of a road segment
Number: The minimum speed limit of a road segment
Number: The distance between two objects
Number: The error of the trajectory prediction for an NPC
Bool: True if and only if  (,  ) < 3 × .

Table 4: Maneuver variables in the specification language
Bool: True if and only if the EV makes an "overtake" decision on NPC  waypoints in the predicted trajectory of . The other variable values of type  are all .

Table 6: Ratio (higher is better) and recall (higher is better) of different recording segmenting methods

Table 7: The number of causal events over different accident types

Table 8: The eight types of faults injected into the customized ADS

Table 9: Precision (higher is better), Recall (higher is better), and Accuracy (higher is better) of causal events over accidents with different fault injections