Safety of Perception Systems for Automated Driving: A Case Study on Apollo

The automotive industry is known for its software-intensive and safety-critical nature. The industry is on a path to the holy grail of completely automating driving, starting from relatively simple operational areas like highways. One of the most challenging, evolving, and essential parts of automated driving is the software that enables understanding of the surroundings and of the vehicle's own position as well as the relative position of surrounding objects, otherwise known as the perception system. Current-generation perception systems combine traditional software with machine learning-related software. With automated driving systems transitioning from research to production, it is imperative to assess their safety. We assess the safety of Apollo, the most popular open-source automotive software, at the design level for its use on a Dutch highway. We identified 58 safety requirements, 38 of which are found to be fulfilled at the design level. We observe that all requirements relating to traditional software are fulfilled, while most requirements specific to machine learning systems are not. This study unveils issues that need immediate attention and directions for future research to make automated driving safe.


INTRODUCTION
The automotive industry has transitioned from an electro-mechanical to a software-intensive industry. Current and future vehicles are characterized by immense use of software to enable automated (a.k.a. self-) driving. Today, industry players such as Waymo and Baidu have shown the capability of completely automating driving (i.e., without the need for a human driver for emergency takeover) in relatively simple situations such as specific geographic locations and restricted weather and illumination conditions. 1 Likewise, many software companies, including Apple, Sony, and Uber, are reportedly developing their automated driving frameworks. 2 This article focuses on perception systems in automated driving frameworks. Perception refers to sensing surroundings for semantic understanding, such as identifying traffic signs and locating the vehicle's own position and the relative position of objects around it [20]. This information is used to plan and execute the next driving decision. Perception systems are arguably the most evolving and relevant part of any automated driving framework [52].
Software engineering research on perception systems has explored multiple aspects, including their development [106], complexity [13], and use of machine learning (ML) models in perception systems of existing automated driving frameworks [48,77]. With many automated driving frameworks transitioning from research to production, one challenge that the automotive industry, regulatory bodies, and legal authorities experience today is the safety of automotive software, which is imperative for its public acceptance [56]. Relating to safety, recent software engineering literature primarily focuses on validation & verification, with most of the studies on testing [9,30,82] and related aspects [86]. Bridging a gap in the literature, this article presents a case study assessing the safety of perception systems at the design level.
For our case study, we chose Apollo 7.0's [1] perception system software. Existing studies show that Apollo is the most popular open-source automotive repository [52], with its development history on GitHub dating back to 2017. It is currently one of the most advanced automated driving frameworks [77], embraced by many of the world's top automakers, and is used to offer automated driving services in parts of the world. 1 In this article, we answer the following research question: How safe is the design of Apollo's perception system for completely automated driving on Dutch highways?
We study Apollo 7.0's [1] perception system software for its use in a segment of the Dutch highway A270. 3 We answer our research question by first eliciting safety requirements for completely automated driving on the Dutch highway segment and then assessing these requirements on the design of the perception system of Apollo. In the rest of this article, we discuss (1) what kind of safety requirements we elicit and how, and (2) how we assess the fulfillment of the safety requirements on Apollo's perception system. This study makes two contributions: one, eliciting safety requirements, and two, design assessment of the elicited safety requirements. We focus on three aspects of safety requirement elicitation: (a) system or sub-system failures [45]; (b) data corruption [45]; (c) insufficient situational awareness arising from limitations of sub-systems under specific conditions (e.g., due to weather) [46]. There are more dimensions to safety requirement elicitation, such as deficiencies in specified driving behavior [28] and incorrect and inadequate human-machine interface design leading to inappropriate user situational awareness (e.g., confusion, overload, or user inattentiveness) [46,73], which are not considered in this study.

For safety requirement elicitation on failures and data corruption, we use industry standards and traffic authority guidelines [5,15,45,94], based on their high adoption [52], compliance requirements [5], and proven applicability in the automotive domain. 1 To the best of our knowledge, no case study in the scientific literature elicits these three kinds of safety requirements for a real-life highway and a mature software stack from industry.
The resulting requirements can be divided into two categories: (1) requirements that can be assessed in the traditional software and (2) requirements specific to ML systems. An example of a traditional software requirement is "a failure of the camera sensor in the camera-based perception system shall not lead to an incorrect estimation of the state of vehicles or other obstacles." An example of an ML system requirement is "the performance deterioration of a camera-based perception system due to low light in the night shall not lead to an incorrect estimation of the state of vehicles or other obstacles." 4 For traditional software safety requirements, we use existing frameworks to assess Apollo's design [17,54,55]. Since there is no similar framework for assessing safety requirements specific to ML systems, we systematically prepare a curated list of ML-specific design choices relating to safety and use them for design assessment. Our assessment uses publicly available data such as documentation, architecture, code, datasets and related artifacts, and scientific papers linked to the documentation. An overview of the entire elicitation and assessment process is depicted in Figure 1.
In summary, our study contributes the following:
-We present a case study of a mature, automated driving software stack from industry for its real-life highway use, the first in the scientific literature. For transparency and replicability, in addition to the safety requirements, we provide results from all intermediate steps [3].
-We identify 58 safety requirements, specific to a Dutch highway segment of A270, that can enable safe automated driving on highways.
-We present a curated list of 10 ML-specific design choices for assessing the quality attribute safety at the design level.
-Our study shows that there exists design evidence for the fulfillment of 38 out of 58 safety requirements. A detailed description of how and where to find them is available as part of the replication package [3].
Note that our study does not involve the execution of the software, as in testing or dynamic verification. Rather, our study is complementary to these techniques. In our study, we evaluate whether (and how) the requirements to operate safely are considered in the design of Apollo via design evidence. Such evidence is not a guarantee for the satisfaction of safety requirements, but rather a first step and an indication that the requirements are considered in the design. Formal verification (and not testing) is the only method that gives guarantees; however, it does not scale to such complex software. Testing remains a practical validation approach in the automated driving context. Design assessment is the less costly, complementary method to all of the above-mentioned validation and verification methods.
The rest of the article is organized as follows: Section 2 presents an overview of safety assessment along with a brief introduction to the architecture of the Apollo automated driving framework and a description of our operational area, a Dutch highway segment. Section 3 presents the research methodology followed in this work. Sections 4 and 5 describe how we elicit and assess safety requirements, respectively, and our findings. Running examples demonstrating the process followed and intermediate results appear throughout Sections 4 and 5 and are visually demarcated. Section 6 discusses our findings and their implications for research and practice. Threats to validity and related work are presented in Sections 7 and 8, respectively. We present concluding remarks in Section 9.

OVERVIEW & CONTEXT
This section presents an overview of the safety assessment process and outlines the assessment context. The entire process can be divided into three parts, as shown in Figure 1. The first part is systematically identifying the information needed to conduct safety requirements elicitation. This includes Apollo's detailed architecture [1] and a systematic description of the intended operational area [15,94]. Neither a detailed architecture nor an operational area description existed previously. A detailed description of both, and of how we created them, is presented in Sections 2.1 and 2.2.
The second part is deriving safety requirements for Apollo's perception system. In this work, we focus on safety requirements relating to three aspects: (1) failure of a component; (2) data corruption; (3) limitations to the intended functionality (leading to insufficient situational awareness). For the limitations in the functionality of ML components, we concentrate on the different weather and illumination conditions of the operational area that can lead to the limitations. Section 4 presents the method we used for eliciting safety requirements and the resulting requirements.
The third part is assessing the perception system, where we focus on the design decisions and how they (do not) fulfill the safety requirements. The perception system relies on multiple ML models along with traditional software. We assess the requirements related to failure and data corruption in the architecture of the traditional software. ML models have different design choices, since they are fundamentally different from traditional software: for ML models, the logic is automatically derived from the training data, while for traditional software the logic is manually programmed. Requirements on limitations to the intended functionality (specific to ML components) are assessed in the sub-systems that rely on ML components. We explain the method and the results in Section 5.

Apollo: An Open Autonomous Driving Platform
Apollo is an open-source, automated driving platform from the Chinese search engine company Baidu. Our choice of Apollo is motivated by its popularity [52], prominence of usage [52], continuous development since 2017, prior usage in research articles (e.g., Reference [77]), and industry ownership. We use the current version, Apollo 7.0, for this study. The information used in this study is derived from publicly available documents, including Apollo's documentation on GitHub [1].
To create a detailed architecture, we started with the documentation available on GitHub. While an abstract outline is available in the documentation, the detailed architecture as shown in Figure 2 did not exist. To create this architecture, we combined information from the source articles pointed to by the documentation as well as a prior research article [77] that discusses the platform. Each individual module is identified from the documentation and folder structure of the repository. The architecture of the individual modules is identified based on the code, referenced (scientific) articles, and associated documentation. For example, the module for localization and its overall role in perception is found using the overall documentation 5 and the organization of the repository. 5 Then, the next level of detail is identified from the module documentation 6 and code. 7 Further details are derived from the source article [98] (pointed to by the module documentation), which dives deeper into the architecture and implementation of the different localization techniques.
The rest of this section presents an overview of Apollo's architecture with a focus on its perception system.
The components in Apollo's architecture can be grouped into four categories: perception, decision & control, interface to vehicle platform, and safety systems, as shown in Figure 2. The perception system is responsible for understanding the surroundings, identifying obstacles, and giving all information needed for components in decision & control. The decision & control part is formed by the following four sub-parts: (a) prediction, which predicts the trajectory of moving objects surrounding the automated driving vehicle; (b) routing, which identifies a path from source to destination to be followed by the vehicle; (c) planning, which plans the next maneuver of the vehicle based on the inputs from prediction, routing, and perception; and (d) control, which takes its inputs from planning and various sensors to identify the current pose (a combination of position and orientation including yaw, roll, and pitch) 8 of the vehicle and generate messages to the vehicle platform for executing automated driving through the trajectory obtained from planning. The vehicle interface is responsible for two kinds of functions: (a) conveying commands such as steering angle and throttle to execute the desired maneuver of the vehicle (or simulation system) on top of which the Apollo framework operates; and (b) dealing with other parts such as lights and turn signals.
The safety system is responsible for monitoring (primarily) the perception and decision & control parts to identify potential faults and failures and maintain the automated driving system in a safe state during its operation. For instance, in the event of partial failure of the perception system, the safety system is responsible for bringing the vehicle to a safe stop. We exclude some components from the architecture that are not required for the perception system's safety requirement elicitation, e.g., the human-machine interface. In the rest of this article, we focus on the perception system. The perception system in Apollo primarily uses three kinds of sensors to sense the environment: camera, Light Detection And Ranging (LiDAR), and radar. The information from these sensors is augmented with details from a High Definition (HD) map. The data from each of these sensors is processed individually for obstacle classification (camera and LiDAR) and obstacle detection and tracking (camera, LiDAR, and radar), as shown in Figure 2. The camera is also used for traffic light detection, traffic light color recognition, and lane detection and tracking. The information from the individual object perception and detection sub-systems is further fused to obtain an overall view of all the objects surrounding the vehicle and allow their tracking. For self-localization and pose (a combination of the position and orientation of the vehicle) estimation, data from GPS/GNSS, an Inertial Measurement Unit (IMU), LiDAR, and the HD map are used. Localization is performed individually using data from LiDAR and GPS/GNSS. Further, this data is combined with HD map and IMU data to identify the automated driving vehicle's position, velocity, and altitude-related information.
This entire suite of sensors and the software around them is organized as five (kinds of) sensors (camera, radar, LiDAR, GPS/GNSS, and IMU), the HD map, and nine pipelines that use the data from the sensors (highlighted with dotted arrows in Figure 2). Each of these pipelines forms a module or a cluster of modules in Apollo 7.0.
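To make this decomposition concrete, the following sketch (our own illustration in Python; pipeline names and input groupings are paraphrased from Figure 2 and are not Apollo identifiers) encodes the pipeline-to-input mapping that later drives the failure analysis:

```python
# Illustrative sketch (not Apollo source code): the nine perception pipelines
# and the inputs each consumes, paraphrased from Figure 2.
PIPELINES = {
    "traffic_light_detection_and_recognition": {"camera", "hd_map"},
    "lane_detection_and_tracking": {"camera"},
    "camera_obstacle_detection_classification_tracking": {"camera"},
    "lidar_obstacle_detection_classification_tracking": {"lidar", "hd_map"},
    "radar_obstacle_detection_and_tracking": {"radar"},
    "obstacle_fusion_and_tracking": {"camera", "lidar", "radar"},
    "lidar_localization": {"lidar", "hd_map"},
    "gnss_localization": {"gps_gnss"},
    "pose_fusion": {"gps_gnss", "imu", "lidar", "hd_map"},
}

def pipelines_using(sensor: str) -> list[str]:
    """Pipelines whose safety analysis must consider this sensor's failure."""
    return [name for name, inputs in PIPELINES.items() if sensor in inputs]

print(pipelines_using("lidar"))  # all LiDAR-dependent pipelines
```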

Operational Design Domain Description
This study is on complete automated driving with no human supervision 9 in a Dutch highway segment. For this level of automation, the current automotive safety standards [45,46], industry consortiums [15], and regulatory bodies [94] recommend that the operational area of the automated vehicle be taken into account to identify safety requirements. Moreover, specifying an operational area reduces the complexity and the overall set of scenarios for developing and deploying autonomous driving vehicles, rather than considering every possible scenario. Such scoping has been shown to make it feasible to deploy automated driving vehicles without human supervision under current technological limitations [84].
The operational area for this case study is a 3.4-kilometer segment of highway A270 in the Netherlands. 10 We systematically define our operational area based on the best practices outlined by industry consortiums and traffic regulatory bodies [15,94]. The data for the specification of the operational area are extracted from maps, 11 Google Street View, 12 and guides from Dutch authorities [4,5]. Our operational area definition consists of the following six aspects: (1) Physical infrastructure, i.e., characteristics of the traffic infrastructure, including road types, surfaces, markings, and geometry; (2) Operational constraints, which include speed limits and traffic conditions; (3) Objects that can be present on the road, including signage and types of road users; (4) Environmental conditions, which include weather, weather-induced road conditions, particulate matter on the road due to weather, and illumination; (5) Connectivity, including possible (wireless) networking options and data provided via these networks; (6) Zones, covering different traffic-related zone classifications.
A detailed specification of the operational area, individual variables considered in each of the above six categories, and their range of values is provided in the replication package [3].
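As an illustration only (attribute names and values below are our paraphrase; the authoritative specification with all variables and value ranges is in the replication package [3]), such an operational area description can be captured as a machine-readable record:

```python
# Illustrative ODD record for the A270 segment; values are placeholders,
# not the study's authoritative specification (see replication package [3]).
ODD_A270 = {
    "physical_infrastructure": {"road_type": "motorway", "surface": "asphalt",
                                "markings": ["solid", "dashed"]},
    "operational_constraints": {"speed_limit_kmh": "per Dutch regulations",
                                "traffic": "free-flow to dense"},
    "objects": ["passenger car", "truck", "motorcycle", "road signage"],
    "environmental_conditions": {"weather": ["clear", "rain", "fog", "snow"],
                                 "illumination": ["day", "dusk/dawn", "night"]},
    "connectivity": ["cellular network"],
    "zones": ["merging zone"],
}
```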

RESEARCH METHODOLOGY
The research presented in this article follows the principles of the design science research methodology [40,76,102]. Design science, in the context of software engineering, aims to create new knowledge through the design and investigation of artifacts [40,76,102]. The two dimensions of design science, (1) design and (2) investigation, are related to two kinds of research problems: design problems and knowledge questions, respectively [102]. The former calls for a change in the real world with designs as its outcome, while the latter asks for knowledge about an existing artifact (e.g., the safety of an existing architecture), with the answer being a proposition (e.g., which aspects of the architecture are safe, which are not, and why) [102]. In this article, we answer the knowledge question: How safe is the design of Apollo's perception system for completely automated driving on Dutch highways?
This question is derived from our knowledge goal, which is to describe the safety of automated driving software stacks for completely automated driving on Dutch highways.
Our knowledge question is an empirical one, where the answer to the question is derived using data from various sources (e.g., documentation, scientific articles, code) rather than mathematical analysis. To answer our knowledge question, we study the Apollo automated driving software stack without any intervention being performed on the artifact. In design science, the research design for studying an artifact (e.g., the architecture of Apollo) in a context (e.g., a Dutch highway) without any intervention (e.g., without changing the architecture of Apollo) is called an observational case study.
An observational case study's research design, according to Wieringa [102], consists of three parts: (1) case selection, (2) sampling, and (3) measurement design. The case selection process in observational case studies is aimed at selecting cases that can offer meaningful and valuable insights into the research question or problem at hand. Section 2.1 presents our case selection (Apollo's architecture) and describes our case in detail.
Sampling refers to the process of selecting specific cases or instances from a larger population for in-depth study and analysis. The sampling process involves choosing representative cases that can provide insights into the research question or problem under investigation. In our case, the sample that we choose to conduct our study on is a specific part of a Dutch highway. The description of our sample has already been presented in Section 2.2.
In our context, measurement design refers to the process of identifying and designing the metrics (tactics and design decisions) that will be used to analyze the artifact (Apollo's architecture) studied. Sections 4.1 and 5.1 describe how we perform measurements (which in our context are identifying safety requirements and assessing the architecture of the perception system of Apollo with respect to the safety requirements). Note that we use descriptive inference for our research, where the conclusions are based on descriptive information derived from various sources, including the documentation of Apollo, 13 scientific articles referred to in the documentation [25,27,58,65,92,98,99], the code of Apollo and its organization structure, 14 and the documentation of the context (the specific highway segment) from maps, 15 Google Street View, 16 and guidelines from Dutch authorities [4,5].

SAFETY REQUIREMENTS ELICITATION

Method
In the automotive domain, methods for safety requirement (also referred to as functional safety requirement) elicitation of software systems are described in two domain-specific safety standards: ISO 26262 [45] and ISO 21448 [46]. ISO 26262 covers the safety requirements relating to the malfunction of components, while ISO 21448 covers the limitations in achieving the intended functionality of ML-based components. We use the two standards to derive safety requirements for situations when (a) components of the perception system become non-operational; (b) components are operating as intended, but the output is lost or corrupted before reaching the destination; and (c) limitations arise in achieving the intended functionality of ML-based components due to unsuitable weather and illumination conditions.
For the first two cases, safety requirement elicitation methods are described in the ISO 26262 standard [45], and for the last case, in ISO 21448 [46]. A framework combining the two standards for eliciting the safety requirements can be described in three steps: (1) hazard analysis, (2) risk assessment, and (3) safety analysis.
(1) Hazard analysis focuses on identifying situations that are potentially hazardous to traffic participants or infrastructure. The hazard analysis step results in system-wide safety goals to prevent harm in those situations. We use the hazard and operability analysis (HAZOP) [49] technique, which uses systematic brainstorming to identify such situations. HAZOP identifies all possible situations based on the environment, the functions of the automated driving vehicle, and the possible behavior of other traffic participants. Then, the technique associates a situation with possible harm to generate hazardous events using guide words. We used the guide words no, more, less, as well as, part of, reverse, other than, early, late, before, and after, which are widely used in the literature [54,55].
For example, the harm (otherwise known as a hazardous event) "does not avoid collision with a decelerating vehicle in front, in the driving lane" is formed by combining the guide word no with the situation "avoid collision with a decelerating vehicle in front, in the driving lane." The situation is a combination of a function ("avoid collision"), the behavior of a traffic participant ("decelerating vehicle in front"), and the operational area ("in the driving lane"). The latter two parts forming the situation, i.e., (a) all possible situations and (b) behaviors of traffic participants, come directly from the operational area description (detailed in Section 2.2). Next, each hazardous event is converted into a system-wide safety goal to prevent, avoid, or reduce its impact.
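To make the enumeration step mechanical, the sketch below (our illustration in Python; the guide words are from the list above, while the function, behavior, and area values are single examples) generates candidate hazardous events as a cross product; analysts then manually screen each candidate for plausibility:

```python
from itertools import product

# HAZOP sketch: hazardous-event candidates are guide words crossed with
# situations; a situation combines a vehicle function, a traffic-participant
# behavior, and a part of the operational area (values illustrative).
GUIDE_WORDS = ["no", "more", "less", "as well as", "part of", "reverse",
               "other than", "early", "late", "before", "after"]
FUNCTIONS = ["avoid collision"]
BEHAVIORS = ["decelerating vehicle in front"]
AREAS = ["in the driving lane"]

candidates = [
    (gw, f"{fn} with a {beh}, {area}")
    for gw, fn, beh, area in product(GUIDE_WORDS, FUNCTIONS, BEHAVIORS, AREAS)
]
# ('no', 'avoid collision with a decelerating vehicle in front, in the
# driving lane') reads as "does NOT avoid collision ..." -- a hazardous event.
print(len(candidates))  # 11 candidates to screen manually
```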
(2) Risk assessment estimates the risk associated with a safety goal. Since not every situation leads to the same level of harm, we need to prioritize situations based on their potential for harm. The safety standard's [45] framework proposes four risk levels ("A" through "D" in increasing order of importance) for a safety goal, also referred to as Automotive Safety Integrity Levels (ASILs). These ASIL levels are identified based on qualitative levels of three parameters: (a) exposure, relating to the frequency of occurrence of a hazardous situation [45]; (b) controllability, relating to the level of control a vehicle has over the situation [45]; and (c) severity, relating to the severity of the potential harm in a situation [45]. The safety goals without a reasonable risk are classified as a different risk level, QM (quality management), and are removed from further consideration. The assumptions we made to arrive at a specific risk score are described with the results in Section 4.2.
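The ASIL determination itself is a lookup table over the three parameters (ISO 26262-3). A minimal sketch, assuming the widely cited arithmetic shortcut that reproduces that table, is shown below; the normative table in the standard remains authoritative:

```python
def asil(s: int, e: int, c: int) -> str:
    """ASIL from severity S (1-3), exposure E (1-4), controllability C (1-3).

    Arithmetic shortcut reproducing the ISO 26262-3 determination table:
    S3 + E4 + C3 yields ASIL D, and lowering any parameter by one class
    lowers the ASIL by one level, bottoming out at QM.
    """
    return {10: "D", 9: "C", 8: "B", 7: "A"}.get(s + e + c, "QM")

# This study's assumptions: no human controllability (C3) and high severity
# (S3) for highway speeds, so a frequent hazard (E4) is rated ASIL D.
print(asil(3, 4, 3))  # -> 'D'
```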
(3) Safety analysis translates the system-wide goals into requirements on individual components. Broadly, there are two types of approaches for safety analysis: (a) deductive or top-down analysis, where a top-level event (such as a system-wide safety goal) is divided into requirements for lower-level components of the perception system; and (b) inductive or bottom-up analysis, where the analysis starts from bottom-level events to identify their possible impact [66].
We use fault tree analysis [60], a deductive analysis technique, to translate system-wide safety goals into safety goals specific to components (pipelines, sensors, HD map, or the safety system). The choice of fault tree analysis is based on its prominence and use in the literature in similar contexts [66]. We further subdivide each safety goal (specific to a component) into requirements for (a) failure of a component and (b) corruption or loss of messages during communication among components. We need two pieces of information to perform fault tree analysis: (1) the system-wide safety goals and (2) the detailed architecture of the system. The system-wide safety goals result from the first step (hazard analysis), and we created the detailed architecture of Apollo as described in Section 2.1.
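A minimal fault-tree evaluation sketch follows (our own illustration; the gate structure is hypothetical, and the study's actual trees are in the replication package [3]): the top event fires if any child of an OR gate fires, or if all children of an AND gate fire.

```python
from dataclasses import dataclass

@dataclass
class Gate:
    name: str
    kind: str               # "OR", "AND", or "BASIC" (leaf event)
    children: tuple = ()

    def fires(self, active_events: set) -> bool:
        if self.kind == "BASIC":
            return self.name in active_events
        results = [child.fires(active_events) for child in self.children]
        return any(results) if self.kind == "OR" else all(results)

# Hypothetical decomposition of a top event into component-level events.
tree = Gate("incorrect_obstacle_state_estimate", "OR", (
    Gate("lidar_pipeline_component_failure", "BASIC"),
    Gate("lidar_pipeline_output_corrupted_or_lost", "BASIC"),
))
print(tree.fires({"lidar_pipeline_output_corrupted_or_lost"}))  # True
```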
For components that use ML-based systems, we also consider weather and illumination conditions that can limit the individual components in achieving their safety goals. We use inductive analysis, as specified in the ISO 21448 standard [46], for identifying external conditions (or triggering conditions) that can violate safety goals due to the limitations of ML systems in delivering the intended functionality, and for translating them into requirements. To conduct this inductive analysis, in addition to safety goals and the detailed architecture, we need a third kind of information: the possible weather and illumination conditions applicable to the safety goals. This information is derived from the operational area description (see Section 2.2).
The result is a list of safety requirements where each requirement is mapped to a (set of) component(s). Note that the ISO 21448 standard also provides a post-design risk evaluation for the safety requirements specific to ML-based systems' limitations. Doing so is beyond the scope of this work, since this evaluation pertains to the validation and verification (of the measures to make risks due to the limitations of ML-based systems tolerable) and not to design assessment.
The above steps were performed by the first two authors, and the results were compared until agreement was reached. Note that the scope of this study is using existing methods to elicit safety requirements. The state of the art in automotive safety requirement elicitation involves considerable manual effort; automating and reducing this effort is a research topic of its own and out of scope for this study. However, to ensure soundness while using the current method, two authors performed the entire set of steps. For the first step (hazard analysis), the one step open to individual interpretation, we computed an inter-researcher agreement [26]. The result was a kappa score of 1.0 [26], showing ideal agreement, which might be the result of a clear and extensive definition of the operational area and vehicular functions. Going a step further, the entire process was supervised by researchers from both academia and industry (the third and fourth authors); the industry researcher has more than 10 years of experience in the automotive industry and more than 15 years of experience in safety-related and safety-critical systems development and certification according to IEC 61508 [44] and ISO 26262 [45]. For future validation and repeatability, we provide the entire set of intermediate results and sources of information in the replication package [3].

Results
Hazard analysis. We identified a tractable number of scenarios for hazard analysis by combining automated vehicle operations with driving situations. For this study, the operational dimensions of the automated vehicle pertaining to the non-functional property of safety can be divided into two categories: (1) avoid collision with other road users and obstacles and (2) follow traffic rules in the operational area. Likewise, we partitioned the driving situations into the following four categories: (1) driving in the lane, (2) changing lanes, (3) driving through an intersection, and (4) location-specific behavior for a merging point. This partition is based on the permitted operations of a vehicle in the specific stretch of the Dutch highway [4,5]. The scenarios for hazard analysis are then the cross product of the operations and driving situations.
We used guide words to identify hazardous events for the list of scenarios identified for hazard analysis.
An example of a hazardous event is the automated driving vehicle does not avoid collision with a slower-moving vehicle in its driving lane. This way, we identified 69 potential hazardous events. Finally, we translated each potential hazardous event into a system-wide safety goal.
One such safety goal is the automated driving vehicle shall avoid collision with obstacles or vehicles in the driving lane. The data relating to deriving scenarios, hazardous events, and, finally, system-wide safety goals is available as part of the replication package [3].

Risk assessment.
Since not all safety goals are equal, we assign a risk score (or ASIL) to each safety goal. We identified risk scores in terms of controllability, severity, and exposure, as mentioned before and defined in the ISO 26262 standard [45]. To assign a risk score to each safety goal, we make two assumptions: (1) no "controllability" by a human driver in case of hazard, since we focus on fully automated driving; and (2) high severity levels for highway driving speeds [16]. For example, the above-mentioned safety goal was assigned the risk level ASIL D (the highest).
Similar safety goals are aggregated to form one safety goal, which inherits the highest ASIL level of the safety goals combined. We identified 18 distinct safety goals, aggregated from the 69 safety goals identified above. The aggregation of safety goals was performed by combining different future or current operations of the vehicle.
For example, the safety goals "avoid collision in the scenario: decelerating vehicle in front in operational mode: driving in the lane" and "avoid collision in the scenario: decelerating vehicle in front in operational mode: changing lanes" and other similar safety goals are aggregated to "avoid collision with an object (obstacle or vehicle) in the driving lane in all operational modes." Further, we excluded three safety goals (since they have ASIL level QM, 17 as discussed in Section 4.1) and explored the remaining 15 safety goals in the rest of this study. More details on risk assessment are available in the replication package [3].
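A sketch of this aggregation step, assuming a simple text key for "similar" goals (the study's actual grouping was manual), follows:

```python
# Aggregating similar safety goals: the merged goal inherits the highest
# ASIL among the goals combined (the grouping key is illustrative).
ASIL_ORDER = ["QM", "A", "B", "C", "D"]

def aggregate(goals):
    merged = {}
    for text, level in goals:
        current = merged.get(text, "QM")
        merged[text] = max(current, level, key=ASIL_ORDER.index)
    return merged

goals = [
    ("avoid collision with an object in the driving lane", "D"),  # driving in lane
    ("avoid collision with an object in the driving lane", "C"),  # changing lanes
]
print(aggregate(goals))
# {'avoid collision with an object in the driving lane': 'D'}
```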

Safety analysis.
Using fault tree analysis, we translate the 15 system-wide safety goals into safety goals relating to the nine pipelines, the five sensor types in Apollo (camera, LiDAR, radar, GPS/GNSS, IMU), and the HD map (see Figure 2 for details). We map each safety goal to the entire pipeline and the safety system to ensure that the design choices in Apollo that might satisfy our safety goals are covered in the design assessment part described in Section 5.
One such pipeline-specific safety goal is: LiDAR obstacle detection, classification, and tracking shall estimate the correct state of vehicles and other obstacles. Next, we converted each safety goal into requirements relating to failure, data corruption, and ML-based systems' limitations. For example, two requirements derived from the above-mentioned pipeline-specific safety goal are: "if any component in the LiDAR obstacle detection, classification, and tracking pipeline becomes non-operational, then this failure shall not lead to an incorrect estimation of the state of vehicles or other obstacles"; and "if the output of any component in the LiDAR obstacle detection, classification, and tracking pipeline is corrupted or lost, then this corruption or loss shall not lead to an incorrect estimation of the state of vehicles or other obstacles." We identified 30 safety requirements relating to failure and data corruption of the different pipelines, as shown in Table 2. Details of individual requirements are presented in the replication package [3]. To derive requirements on the limitations of ML components, we first identified the pipelines that use ML-based systems. Similar to prior work [77], we noticed that out of the nine pipelines, only four use ML-based solutions (based on analysis of documentation and code). These pipelines are (1) traffic light detection and recognition, (2) lane detection, (3) camera obstacle detection, classification, and tracking, and (4) LiDAR obstacle detection, classification, and tracking (as also shown in Figure 2). These four pipelines use two sensors: camera and LiDAR. Therefore, we use the safety goals specific to the four pipelines to identify the limitations of ML solutions relating to the weather and illumination conditions in the operational area. Mainly, we map the known limitations in our operational area to violations of the safety goals specific to the four pipelines. The limitations of ML solutions, as identified from the literature on camera and LiDAR, are as follows: (1) camera-related limitations: low-illumination conditions [10,19], illumination conditions rarely captured in training datasets such as dusk and dawn [2], and weather conditions, in particular fog [43], rain [90,96], snow [68], and strong sunlight [105]; (2) LiDAR pipelines are not affected by low-illumination conditions, but they are affected by strong sunlight [24] and by conditions leading to light (laser) scattering effects, which include fog [21], rainy conditions [93], and snow [75].
The requirements for each of the above conditions are identified from the respective safety goals and allocated to the corresponding pipelines. An example requirement allocated to the LiDAR obstacle detection, classification, and tracking pipeline is "if the performance of the LiDAR obstacle detection, classification, and tracking pipeline is deteriorated due to moderate inclement levels of fog, then this deterioration in performance shall not lead to an incorrect estimation of the state of vehicles or other obstacles." The result consists of 28 requirements, as shown in Table 2. While one requirement is shown above, the details of all requirements are available in our replication package [3]. A summary of the results of the entire safety requirement elicitation is presented in Table 3. Note that all our requirement phrasings are based on the industry-specific standard guidelines and literature specific to the automotive domain [45,46,101].
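The requirement phrasings above are template-like; a sketch of how they can be instantiated per pipeline and per condition (templates paraphrased from our examples; the full requirement list is in the replication package [3]) is:

```python
# Instantiating requirement templates per pipeline (paraphrased phrasings).
TEMPLATES = {
    "failure": ("if any component in the {p} becomes non-operational, then "
                "this failure shall not lead to {harm}"),
    "corruption": ("if the output of any component in the {p} is corrupted or "
                   "lost, then this corruption or loss shall not lead to {harm}"),
    "ml_limitation": ("if the performance of the {p} is deteriorated due to "
                      "{cond}, then this deterioration shall not lead to {harm}"),
}
pipeline = "LiDAR obstacle detection, classification, and tracking pipeline"
harm = "an incorrect estimation of the state of vehicles or other obstacles"

print(TEMPLATES["failure"].format(p=pipeline, harm=harm))
print(TEMPLATES["ml_limitation"].format(
    p=pipeline, harm=harm, cond="moderate inclement levels of fog"))
```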

DESIGN ASSESSMENT
There are many ways to assess the safety requirements for perception system software, including formal verification. One widely used method for assessing safety requirements at the design level is to assess the software using its underlying architecture. However, this solution alone does not work for a perception system consisting of ML components: describing such a system adequately also requires its datasets and ML models in addition to the architecture. Excluding these artifacts is not an option, since the design decisions for these artifacts can directly impact quality attributes such as safety [70,87,88,100]. Therefore, for the safety assessment of perception system software, we study its (1) software architecture and (2) design choices specific to ML-based systems.

Method
The software that forms the perception system of Apollo can be classified into two categories: (1) traditional software, where humans decide on the logic; and (2) ML software, where the logic is derived from the data. Since the two types of software are developed differently, their design choices and considerations for safety differ in some aspects. We assess the safety requirements of the perception system by identifying the design decisions using its architecture and complementing it with additional artifacts specific to ML software.

Software Architecture Design Choices. To identify the design choices of software, we look at its architecture. We look at the smallest units of architectural design choices, called tactics [17]. Tactics are abstract design decisions without an implementation structure that can influence the behavior of a system [17]. An example of a tactic is diverse redundancy, which is the introduction of redundant systems for detecting or masking failures [79]. Tactics that address the quality attribute safety are called safety tactics. Note that these tactics originate from the architecture assessment method ATAM [17], which has stood the test of time (in use for more than 20 years).
We use all 13 safety tactics (heartbeat, simplicity, substitution, sanity check, comparison, replication redundancy, diverse redundancy, condition monitoring, repair, voting, degradation, override, and barrier [79]) that have been codified and presented as a framework in prior studies [80,103]. Prior studies have shown the use of these safety tactics in the automotive domain for safety assessment [54,55]. For more details on the framework and individual tactics, we point our readers to the study by Preschern et al. [79].
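To make "tactic" concrete, here is a minimal sketch (our illustration, not Apollo code) of two of the thirteen tactics: a sanity check rejecting implausible pipeline outputs, and a heartbeat detecting a silently failed component; the thresholds are placeholders.

```python
import time

def sanity_check(obstacle: dict) -> bool:
    """Sanity check tactic: reject physically implausible obstacle states
    (distance and speed bounds are placeholder values)."""
    return (0.0 <= obstacle["distance_m"] <= 300.0
            and abs(obstacle["speed_mps"]) < 100.0)

class Heartbeat:
    """Heartbeat tactic: a component beats periodically; the safety system
    treats a missing beat as a component failure."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_beat = time.monotonic()

    def beat(self) -> None:            # called by the monitored component
        self.last_beat = time.monotonic()

    def alive(self) -> bool:           # polled by the safety system
        return time.monotonic() - self.last_beat < self.timeout_s
```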

ML Design Choices.
To the best of our knowledge, no framework exists in the literature that codifies ML design choices addressing the quality attribute safety. Therefore, we refer to prior secondary studies that have aggregated (best) ML design choices. These ML design choices target different life-cycle stages and have a demonstrably direct impact on quality attributes [70,87,88,100]. We aggregate the known (best) ML design choices from prior works and curate a list of ML design and related choices for assessing the quality attribute safety.
To identify ML design choices, we follow a two-step process. First, we create an aggregated list of design choices specific to ML software. Then, we select design choices specific to our use case, i.e., relating to safety and applicable in the context of ML software relating to camera, LiDAR, or object/lane/color recognition and tracking.
List of design choices. To create an aggregated list of practices or decisions specific to ML-based components, we rely on secondary studies, which are aggregations of primary studies. To identify secondary studies, we search Google Scholar using the following keywords: "safety" AND "software architecture" AND ("machine learning" OR "artificial intelligence" OR "neural networks") AND ("review" OR "survey"). In this search term, we added "software architecture" to improve the signal-to-noise ratio.
Once we identified a list of secondary studies summarizing ML practices (e.g., References [70,87,88,100]), we shortlisted studies that (a) are the most recent, for the most comprehensive list of practices, and (b) have at least one design choice specific to ML systems and particularly safety, and at least one design choice specific to the limitations relating to weather or illumination conditions. We identified four secondary studies [70,87,88,100] that cumulatively discuss 67 design choices or practices specific to ML-based systems.
We applied another level of inclusion criteria to identify ML design choices relevant to safety assessment. These include:
-Applicable to automated driving systems (Apollo's perception system).
-Identifiable from architecture, model, code, dataset-related artifacts, or documentation. For example, practices such as neuron coverage testing, fuzz testing, and formal verification cannot be identified from the above-mentioned artifacts.
-Applicable to ML software's design stage (like design choices related to the ML model or dataset).
-Related to the quality attribute safety.
-Usable for countering limitations caused by weather or illumination conditions.
Note that these inclusion criteria were defined iteratively. We came up with initial criteria and applied them to a small set of ML design choices. The results were discussed among the authors, followed by correcting existing criteria and adding new ones. This process was repeated several times to reach the criteria listed above. Once we reached maturity (no more changes to the criteria occurred when they were applied to a random small set), the criteria were applied to all the ML design choices identified.
We found 10 ML-specific design decisions, which are listed in Table 1. These design decisions correspond to choices relating to the dataset (like ensuring "input data is complete, balanced and well distributed"), the neural network model, and other design choices (like "monitor data quality issues") relating to the safety of ML systems.

Assessment.
To identify whether the design decisions in the perception system of Apollo fulfill the safety requirements, we use a method previously demonstrated in the automotive domain [17,54,55], which is an extension of the Architecture Tradeoff Analysis Method (ATAM) [17]. This method has two parts. The first is the identification of the applicable design choices for each safety requirement, such that the implementation of a design choice, by itself or in combination with other design choices, can fulfill the safety requirement. Here, we have two categories of design choices (safety tactics and design choices specific to the limitations of ML) and three categories of requirements (failure, data corruption, and limitations of ML). We take the cross product of the requirements with the selected design choices. The requirements concerning failure and data corruption are crossed with the safety tactics used in prior studies [54,55]. The remaining requirements (regarding limitations of ML systems) are crossed with the design choices specific to ML systems selected in the previous step (refer to Section 5.1.2). Each combination in the cross product is checked for validity, and invalid choices are discarded.
For example, consider the traditional software requirement "if any component in the LiDAR obstacle detection, classification, and tracking pipeline becomes non-operational, then this failure shall not lead to an incorrect estimation of the state of vehicles or other obstacles." When crossed with the 13 tactics, this requirement has 13 possible choices. Of these 13, two invalid choices are substitution and repair. The substitution tactic 18 is invalid in this context because LiDAR pipelines in the automotive industry are still in their initial stages and have not yet reached wide adoption; we do not have an alternate, well-proven option to choose from. The repair tactic 19 is invalid since manual intervention is not an option for our use case of fully automated driving, and automatic restore is not applicable in this context. An example of a valid choice is sanity check, 20 since it is possible to continuously monitor the state and output of the pipeline for implausible outputs or states. After this step, we have a list of design decisions associated with each safety requirement. Each design decision in the list, either by itself or in combination with other design decisions (from the list), can fulfill the safety requirement.
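A sketch of this first part, using one requirement and the two pruned tactics from the example above (the full validity analysis is manual and recorded in the replication package [3]):

```python
from itertools import product

TACTICS = ["heartbeat", "simplicity", "substitution", "sanity check",
           "comparison", "replication redundancy", "diverse redundancy",
           "condition monitoring", "repair", "voting", "degradation",
           "override", "barrier"]
requirements = ["lidar-pipeline-failure"]          # illustrative identifier
invalid = {("lidar-pipeline-failure", "substitution"),
           ("lidar-pipeline-failure", "repair")}   # judged invalid manually

candidates = [(r, t) for r, t in product(requirements, TACTICS)
              if (r, t) not in invalid]
print(len(candidates))  # 13 - 2 = 11 design choices remain to check
```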
Next, we look for evidence of whether these safety requirements are fulfilled in the artifacts relating to the perception system software of Apollo. We rely on publicly available artifacts: architecture, code, documentation, dataset descriptions, and scientific papers pointed to by the Apollo documentation. For efficiency, safety tactics related to failure- and data-corruption-related requirements are first checked in the architecture description and documentation. If a safety requirement is not satisfied there, we look at the code of the specific components for coverage of the requirement. To identify the parts of the code that deal with data corruption or failure, we look for error messages and logging statements in the code of the specific components associated with the safety requirement.

Table 1. The 10 ML-specific design choices. Each entry lists the design choice [secondary study], its description, its rationale, example techniques, and the artifacts used to identify its usage.

(1) Input data is complete, balanced, and well distributed [87]. Description: datasets used for training (and testing) the ML models shall be representative of the complete operational area (see Section 2.2), covering the corner cases and situations that may trigger functional insufficiencies [18,23,67,78,85]. Rationale: to ensure that the training (and test) data covers the situations, associated with a given safety requirement, that the ML model will encounter during its real-world usage [18,23,67,78,85]. Example technique: calculate the statistical distribution of the training and testing data across the diverse situations it covers to understand the representation of different situations in the dataset [6]. Artifacts: dataset and its related artifacts.

(2) Design specification [70]. Description: the specification of the non-functional (safety-specific) properties of the ML model considered during the design and development of the ML model [70]. Rationale: to guarantee safety properties in the design of the ML model [70]. Example techniques: formal specification, or breaking down ML components into smaller algorithms that work in hierarchical structures [11,61,89]. Artifacts: ML model-related artifacts, including scientific papers or documentation.

(3) In-distribution error detectors [70]. Description: mechanisms to detect the ML model's incorrect classification of in-domain samples (samples that fall within the distribution boundaries of the training and test data) [70]. Rationale: to detect failures of the ML model on in-domain samples for which the ML model has low confidence [70]. Example techniques: runtime prediction-error detectors or monitors, and predicting for high-confidence samples while withholding results otherwise [33,34,37]. Artifacts: ML model-related artifacts, including scientific papers or documentation, ML (software) architecture, code, or documentation.

(4) Out-of-distribution error detectors [70]. Description: mechanisms to detect the ML model's incorrect classification of outliers or out-of-distribution samples (samples that fall outside the distribution boundaries of the training and test data), for instance, inputs beyond the intended operational area conditions [70]. Rationale: to detect failures of the ML model due to novel inputs [70]. Artifacts: ML model and related artifacts, including scientific papers or documentation, ML (software) architecture, code, and documentation.

(5) Domain generalization [70]. Description: demonstration of the robustness of the ML model to deviations of the input data distribution from the training data [70]. Rationale: to demonstrate the ability of the ML model to give correct output under input situations not covered directly in its training dataset [70]. Artifacts: ML model and related artifacts, including scientific papers or documentation.

(6) Robustness to corruption and perturbations [70]. Description: demonstration of the robustness of the ML model to natural corruption and perturbations (e.g., elastic deformation of images due to different viewing angles, occlusions of objects in sensor data) [70]; for instance, camera lens flares and snowy conditions may hinder the ideal visibility of objects in the image for camera-based object detection. Rationale: to demonstrate the ability of the ML model to give correct output under non-ideal or noisy inputs [70]. Artifacts: ML model and its training-related artifacts, including scientific papers or documentation.

(7) Uncertainty estimation [70]. Description: estimate the ML model's capability boundary in terms of confidence in its prediction (epistemic or model uncertainty) and uncertainty for unknown samples (aleatoric or data uncertainty) [70]. Rationale: to detect failures of the ML model in contexts such as domain shift and out-of-distribution samples [70]. Artifacts: ML model and related artifacts, including scientific papers or documentation, ML (software) architecture, code, or documentation.

(8) Uncertainty monitoring [88]. Description: monitoring the uncertainty of the ML model to predict failures of the ML model during runtime [12,41,72]. Example technique: failure prediction through a secondary model, i.e., training a student model (monitor or failure predictor) to predict failures of the teacher model (the main model) [69]. Artifacts: ML model and related artifacts, including scientific papers or documentation, ML (software) architecture, code, or documentation.

(9) N-versioning [88]. Description: rather than using a single ML model, using ensembles of ML models [88]. Rationale: to improve safety properties of ML models, such as reducing over-fitting, detecting failures, and improving the interpretability of decision-making [12,41,72]. Example technique: using an interpretable or rule-based model as a back-up [88]. Artifacts: software architecture, ML model and related artifacts, including scientific papers or documentation, ML (software) architecture, code, and documentation.

(10) Metric monitoring and alerts to detect failures [88]. Description: use of continuous monitoring [50,53] of metrics and alert systems that can notify a human-in-the-loop of potentially safety-critical incidents (e.g., a decrease in accuracy, an increase in uncertainty) [88]. Rationale: to detect silent failures of the ML model [88]. Example technique: monitor the consistency of perception outputs across modules and predict performance metrics such as per-frame mean average precision [14,81]. Artifacts: software architecture, ML model and related artifacts, including scientific papers or documentation, ML (software) architecture, code, and documentation.
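As a concrete reading of choices (3) and (7) in Table 1, the sketch below (our illustration; Apollo's monitors, if any, are not structured like this) withholds a classifier's output when its softmax confidence falls below a threshold, flagging the frame to a fallback path:

```python
import numpy as np

def monitored_prediction(logits: np.ndarray, threshold: float = 0.9):
    """Return (label, confidence) when confident, else (None, confidence).

    A minimal in-distribution error detector: low softmax confidence is
    treated as a potential misclassification and the result is withheld.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    confidence = float(probs.max())
    if confidence >= threshold:
        return int(probs.argmax()), confidence
    return None, confidence  # withheld: route to fallback / safe state

label, conf = monitored_prediction(np.array([2.0, 0.5, 0.1]))
print(label, round(conf, 2))  # None 0.73 -> withheld at the 0.9 threshold
```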
For requirements relating to the limitations of ML systems, we first look at the dataset, documentation, and the base papers (relating to the ML system or the ML model used in it) for design decisions. If a requirement is not satisfied there, we analyze the specific parts of the code relating to the requirement. The artifacts listed in each entry of Table 1 indicate which subjects (e.g., documentation, source code) are used to identify the usage of each design decision.
For example, consider the safety requirement "if the performance of the LiDAR obstacle detection, classification, and tracking pipeline is deteriorated due to moderate inclement levels of fog, then this deterioration in performance shall not lead to an incorrect estimation of the state of vehicles or other obstacles." We find that all design choices (see Table 1) except design specification (the second entry in Table 1) are potential candidates for fulfilling this requirement. Design specification is not a candidate since a formal specification of the infinitely many possibilities arising from the combination of different types of LiDARs, their hardware variations, different levels of fog, different kinds of objects, their varying reflectivity, and diffused laser beams after reflection might not be feasible with current technology [63]; current studies focus on quantitative and qualitative analysis based on experiments [63]. In the following, we present how we assessed two of the possible design choices.
One way to satisfy this requirement is n-versioning (see the ninth design choice in Table 1 for details), since it can be used for fail safety: employing different LiDAR pipelines with complementary abilities might be able to identify and (ideally) correct for deterioration in one of them due to moderate inclement levels of fog. Identifying whether n-versioning is used starts with identifying the ML models used in the pipeline and then looking at the properties of these models and how they are trained. The model and its properties can be identified from the code, 21 the documentation, 22 and associated research articles [27,99]. According to References [27,99], these models do not use n-versioning.
Another way to satisfy this requirement is to have the training and testing data be complete, balanced, and well distributed (the first design decision in Table 1) with respect to moderate inclement levels of fog. This design choice can ensure that the pipeline is at least trained and tested in situations similar to moderate inclement levels of fog, providing evidence that the requirement is considered and tested. In this context, the architecture and implementation of the LiDAR obstacle detection, classification, and tracking pipeline in Apollo is based on scientific articles [27,99], and the underlying ML models have then been trained with custom data. The training data and its source are not available; thus, we cannot draw any conclusion about the data. In other cases, such as the camera obstacle detection, classification, and tracking pipeline, datasets typically consist of meta-data (and its summary), including the time of day and place of capture. The meta-data and summary can be used to identify the distribution of illumination and weather scenarios (the focus of this study) required to assess requirements relating to ML systems. For example, the dataset used for training the ML models in the camera obstacle detection, classification, and tracking pipeline was captured in Phoenix, San Francisco, and Mountain View. Therefore, the dataset does not contain weather situations such as snow and sleet, while our operational design domain can. Thus, the dataset does not represent our context's weather and illumination conditions. We followed a conservative approach of marking requirements as fulfilled at the design level only if we found conclusive design evidence in the specific components relating to a requirement.
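The meta-data check just described can be sketched as follows (field names such as "weather" are hypothetical; real datasets vary in how they record capture conditions):

```python
from collections import Counter

# Coverage check behind design choice (1) in Table 1: tally per-frame
# metadata and compare against the ODD's weather conditions.
ODD_WEATHER = {"clear", "rain", "fog", "snow"}

def coverage_report(frames):
    seen = Counter(frame["weather"] for frame in frames)
    missing = ODD_WEATHER - set(seen)
    return {"distribution": dict(seen), "missing_odd_conditions": sorted(missing)}

# A dataset captured only in Phoenix / San Francisco / Mountain View would
# report missing conditions such as snow: a coverage gap for our ODD.
frames = [{"weather": "clear"}] * 90 + [{"weather": "rain"}] * 10
print(coverage_report(frames))
```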

Note that design evidence is not a guarantee that a safety requirement is fulfilled in the final product (just as passing the testing phase does not show the absence of bugs). Instead, it indicates that the requirement is considered at the design level. Without such consideration, the requirement will not be fulfilled when the design is implemented. In our context, we are looking at the final product's architecture and, thus, at what exactly is implemented in the final product. Therefore, if the results do not point to any design consideration for a requirement, then this shows with high confidence that the final product does not satisfy the specific safety requirement.
Similar to safety requirement elicitation (detailed in Section 4), the state of practice in design (safety) assessment is still manual-effort heavy. Therefore, two authors performed the above processes under the supervision of researchers from academia and industry, as detailed in Section 4.1.
For selecting the ML-specific design choices, the inclusion criteria and the search terms were defined iteratively. For the selection of the 10 ML design choices and their feasibility for each requirement, inter-researcher agreement was calculated using Cohen's kappa coefficient [26]. We obtained scores of 1.0 and 0.84, respectively, indicating an ideal agreement in the first case and a very good agreement in the second. The ideal agreement in the first case might be due to the clarity and systematic nature of the secondary studies [70,87,88]. The second score (0.84) reflects the difficulty of mapping design choices to ML-specific requirements. Each conflicting case was discussed, and if it was not resolved between the two researchers (the first and second authors), a third researcher (the last author) was involved.

Results
We identified 58 safety requirements for the different subsystems of the perception system, covering the failure of a module or data corruption (30) and the limitations of ML systems in adverse weather and illumination conditions (28). The cross-product of these requirements with 23 safety-related design choices (13 architecture tactics crossed with the 30 requirements pertaining to the failure of a component or data corruption, and 10 ML-specific design choices crossed with the 28 requirements relating to ML-based systems) led to 698 design choices. After removing infeasible design choices, we had 477 design choices. A detailed list of the feasible design choices for each requirement is presented in our replication package [3].
We search for the design choices associated with each requirement in the components related to that requirement (also presented in our replication package). For each safety requirement, we reach one of the following three conclusions: (1) there exists evidence that the requirement is fulfilled; (2) there exists no evidence that the requirement is fulfilled; and (3) unknown. We reach the second conclusion when, despite searching all the associated components, we find no conclusive evidence. We arrive at the third conclusion when a resource is not found or we cannot comprehend the code or its structure. For example, the dataset, or a description of its characteristics, is required to assess choices related to the dataset used for training an ML model. If neither the dataset nor its characteristics are available, then we reach the third conclusion.
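The decision logic can be summarized by the following sketch. It is a simplification of our manual, qualitative process, with an invented encoding of per-component evidence.

```python
# Sketch of the three-valued conclusion reached per safety requirement;
# the evidence encoding ('evidence'/'no_evidence'/'missing') is our own
# simplification of a manual assessment, not a tool we used.
from enum import Enum

class Conclusion(Enum):
    FULFILLED = "evidence that the requirement is fulfilled"
    NOT_FULFILLED = "no evidence that the requirement is fulfilled"
    UNKNOWN = "resource missing or code not comprehensible"

def assess(components):
    """`components` maps each associated component to 'evidence',
    'no_evidence', or 'missing' (e.g., dataset characteristics unavailable)."""
    findings = set(components.values())
    if "evidence" in findings:
        return Conclusion.FULFILLED
    if "missing" in findings:
        return Conclusion.UNKNOWN        # cannot rule the requirement out
    return Conclusion.NOT_FULFILLED      # searched everything, found nothing

print(assess({"lidar_detector": "no_evidence", "training_data": "missing"}))
```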
We found evidence that Apollo's design fulfilled 38 out of 58 safety requirements. We noticed that all the requirements relating to architecture are fulfilled in the design. The status of the remaining 20 requirements was concluded to be unknown; all 20 are specific to ML components and relate explicitly to three pipelines: (1) lane detection; (2) camera obstacle detection, classification, and tracking; and (3) LiDAR obstacle detection, classification, and tracking. The only pipeline that contains ML components and is found to satisfy the ML-specific safety requirements is the traffic light detection and recognition pipeline. We identified 180 design choices for the 20 requirements with unknown status. Out of these 180 choices, 119 were concluded as not used. The other 61 were concluded as unknown, primarily due to the non-availability of the dataset or its characteristics, missing comments, and unclear code structure. An overview of the number of (un-)fulfilled requirements and the related components is shown in Table 2. More details on how Apollo's design decisions do (not) fulfill each requirement are available in the replication package [3]. A summary of the results of the entire design assessment is presented in Table 3.

DISCUSSION
This section presents interpretations of our findings, their potential use and implications, the role of the choice of methods, and the applicability of our findings to other contexts.
Interpretation of our findings. Our study shows that 20 out of 28 safety requirements specific to ML systems are not found to be satisfied in Apollo's design (7.0) [1]. The lack of data relating to ML systems (e.g., specifications of the datasets on which the ML models are trained, documentation of the ML models, and comments in the code of modules containing the ML models) rendered the decision-making inconclusive for several of these 20 safety requirements, making their satisfaction unknown. The low number of satisfied requirements related to ML systems corroborates the literature suggesting that safety is not yet one of the highest-priority quality attributes in developing ML systems [88]. Note that our findings on the (non-)satisfaction of requirements related to ML systems are based on design evidence pointing to the explicit consideration of the requirements. This implies that implicit considerations that are not documented or not directly available from the sources are not reflected in our results.
Understanding unfulfilled safety requirements in design is the first step toward safety. If design issues are not corrected in time, then they transfer to the implementation. Since design deficiencies cannot be fixed in the implementation, they can cause catastrophes, risking the lives of passengers and other traffic participants. For example, the infamous Uber self-driving car accident that led to the death of a pedestrian was caused by the decision system failing to act after the perception system had identified the pedestrian well before the safety margins were violated [22,104]. In addition, fixing a design issue later in the product life cycle is orders of magnitude costlier than fixing it at the design or early prototype stage.
In contrast to the ML-specific requirements, all traditional software requirements (related to failure and data corruption) are found to be satisfied. This might be obvious to some readers, especially since this is the seventh (major) version of the Apollo stack. It points to the maturity of the stack from a traditional software safety standpoint.
To the best of our knowledge, this study is the first in the scientific literature to present a safety design assessment of the perception system of a mature software stack for automated driving in a real-life setting. Based on the insights derived from our study, we suggest that the industry provide the characteristics of their ML models and datasets, such as their representativeness of different situations, the training and test data, and the resulting accuracy. Specifically, our study identifies areas that might need more work, which can inform the planning of tech leads and managers. For example, the LiDAR obstacle detection, classification, and tracking pipeline may require more work than the traffic light detection and recognition pipeline. Further, the industry can use our list of requirements and our curated list of practices to generate documentation relating to ML components.
ML-based design choices: While the ATAM architecture assessment method [17] and the architecture tactics [17] have been tried, tested, and proven over more than two decades of use, the same is not true for the ML-based design choices. Many of the ML techniques used in the Apollo stack [25,27,65,92,98,99] and the ML design choices shown in Table 1 were proposed only in the past few years and lack a proven track record in safety-critical contexts, including autonomous driving. We discuss the different design choices from Table 1 below.
In the context of the design choice "input data is complete, balanced and well distributed" (Table 1, first row), there is little consensus on what constitutes "complete, balanced and well distributed" [6,18,23,67,78,85]. For this study, we checked for the presence of the different weather and illumination situations of our operational area in the dataset (not all of which we found present). Showing the absence of situations is less challenging than validating their presence with respect to "complete, balanced and well distributed." Regarding the design choice "design specification" (Table 1, second row), it is non-trivial to produce and verify specifications for state-of-the-art object detection/recognition ML algorithms for use cases in automated driving [95]. Properties such as the completeness and correctness of a specification, as well as the black-box and probabilistic nature of a significant fraction of the ML models used in practice [64], make every aspect of specification and verification more complex than for traditional software. We did not find this design choice to be a feasible option for satisfying our requirements.
Considering the design choices in-distribution error detectors, out-of-distribution error detectors, and domain generalization (Table 1, third to fifth rows), it is impossible with current technologies to generalize the correctness, reliability, and robustness of the output of the ML models employed in Apollo [25,27,65,92,98,99] to every possible environmental situation. Our scope is their generalizability to the specific situations that might arise in our operational area and operational scenarios. We found some of these techniques to be employed in the Apollo stack. An example is the in-distribution error detector design choice employed in the modules of the traffic light detection and recognition pipeline (see Figure 2) to identify cases where the ML model cannot confidently output the color of the traffic light [25,65]. 23,24 We also found the usage of the out-of-distribution error detector design choice in the same modules. This design choice is used to cover corner cases such as broken and flickering traffic lights [25,65]. 23,24 In the context of the remaining five design choices (Table 1, sixth to tenth rows), corruption and perturbation effects on ML models, as well as effects related to model and data uncertainty, are typically identified only when new research or the investigation of a (possible) catastrophic event uncovers such issues. While this case study finds the usage of some of these design choices, such as "metric monitoring and alerts to detect failure," we did not find (conclusive evidence for) the usage of design choices like "n-versioning" in Apollo's perception system.
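As an illustration of such a confidence-based (in-distribution) error detector, consider the following sketch. The threshold, class list, and "unknown" fallback are illustrative assumptions on our part, not Apollo's actual implementation.

```python
# Sketch of an in-distribution error detector: reject low-confidence
# traffic-light color outputs instead of acting on them. The threshold
# and classes are illustrative, not taken from Apollo.
COLORS = ("red", "yellow", "green", "black")

def classify_with_rejection(softmax_scores, threshold=0.5):
    """Return the predicted color, or 'unknown' when the model cannot
    confidently output a color (the safe fallback downstream must handle)."""
    best = max(range(len(COLORS)), key=lambda i: softmax_scores[i])
    if softmax_scores[best] < threshold:
        return "unknown"
    return COLORS[best]

print(classify_with_rejection([0.30, 0.25, 0.25, 0.20]))  # -> 'unknown'
print(classify_with_rejection([0.90, 0.04, 0.03, 0.03]))  # -> 'red'
```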
Our study points out that the ML-based (especially neural network-based) techniques employed in perception systems need future research on their design choices for use in safety-critical contexts. Note that the scope of our case study is limited to the usage of existing techniques.
Recommendations & Future directions: This study brings into the limelight the high amount of human effort required to elicit safety requirements. In its current form, each automotive stakeholder that plans to sell an automated driving stack, directly or indirectly, to an end-user has to perform requirements elicitation. Then, ideally, safety certification bodies in the respective countries have to examine the entire process. One key takeaway from this case study (which might already be known in the community) is that many steps of requirement elicitation (e.g., hazard analysis, risk assessment) are common irrespective of the underlying stack, given that the end functionality is the same (in our context, automated highway driving). Instead of each entity performing the same steps separately, we recommend that all entities, especially the safety certification bodies in the respective jurisdictions, perform such common steps together and make the results available to all interested parties. Such a practice not only removes unnecessary waste of resources but also keeps those steps consistently updated in the future. Note that such common steps (by their very nature as "common") will not affect the exposure of the intellectual property, or any related advantages, of any of the stakeholders involved. We also think that vehicle users have the right to an unbiased understanding of the safety of their vehicles' software, especially in automated driving settings. An improved understanding of the safety of autonomous vehicles can strengthen the trust and confidence of users and regulators in the underlying technology, easing public acceptance.
Another important issue we noticed is the lack of clarity and the highly distributed nature of the documentation and associated resources. To obtain a detailed architecture, we (like similar prior research [77]) had to reconstruct it from a multitude of resources and code spread across multiple domains (e.g., image recognition, neural networks, localization methods). Yet, many details are missing, in our case leading to the fulfillment status unknown for 20 requirements. This hampers not only safety requirement elicitation and assessment but also the idea of open sourcing, which is to elicit community participation, and the overall understandability, usability, and maintainability. We strongly recommend adding the missing details (see our replication package for details [3]).
This study is primarily qualitative; future studies can analyze the impact of the obtained results quantitatively. Another future direction is safety requirement elicitation on other automated driving stacks and how they compare to Apollo. Techniques to reduce the human effort (and the resulting subjectivity) of requirement elicitation and assessment are another dimension to explore.
Applications. We foresee many applications of our findings for industry, research, education, traffic authorities, and lawmakers. The research community can use our results as a first step to identify weak points in automotive perception systems and to identify directions for future research from a safety perspective. In education, including safety in the curriculum might help avoid catastrophic events 25 that result from treating safety as an afterthought.
For lawmakers and traffic authorities, one major challenge is identifying who is responsible in case of a catastrophic event involving an automated driving vehicle: the automotive company, the software suppliers, the tool vendors, or the users themselves? 26 A publicly available safety analysis can be a first step to ensure that basic steps are taken to avoid such situations.
Method. As the first study assessing safety in the design of Apollo, we chose the de facto method in the automotive domain for safety requirement elicitation and relied on the literature for the design assessment. However, alternative techniques, for instance, failure mode and effects analysis [91] or system-theoretic process analysis [62], can be used instead of fault tree analysis. Future research should validate whether the choice of method can influence the findings and, if so, how.
Generalizability.We discuss two dimensions of generalizability of our study (and our results): (1) to perception systems of other automated driving stacks and (2) to other highways.
Our methodology generalizes to any automated driving stack, since we use the steps from the relevant standards and industry practices. However, the results of this study carry over to other software stacks only for the steps up to the safety goals and associated risk levels from the hazard analysis and risk assessment (see the first part of Section 4.2). The safety goals are agnostic to the underlying software stack and specific to the operational area and operational scenarios. The safety requirements and their allocation to the various software components [51] of an automated driving stack will vary with the architecture of the stack. We expect similar results concerning traditional software for any other mature automotive software stack, since these software architectures and associated tactics have been in use for decades. However, new studies are required to identify whether and how the results on ML-based components vary across software stacks, because many ML models are custom trained, tuned, and paired with traditional software in different ways by different software stack vendors.
This study covers a 3.4 km stretch of a Dutch highway and Apollo's perception system. Since highways in the Netherlands are relatively standardized, with minor variations and similar weather and illumination conditions, 27 we believe that our findings should generalize (with minor variations) to other highways in the Netherlands. We suggest investigating other highway segments before attempting an entire highway-wide safety requirement elicitation. However, a similar generalization may not hold across Europe or beyond, since the traffic environment, traffic rules, weather, and illumination conditions can vary drastically. We also expect similar results if other automated driving frameworks were used instead of Apollo. More research is needed to test these scenarios.
The validity of our work will further improve when independent researchers replicate it. For reproducibility, the data and step-by-step results are publicly available [3].

THREATS TO VALIDITY
Construct validity. Many steps in our case study rely on human judgment, which can introduce researcher bias. For instance, many steps in requirement elicitation require brainstorming and manual inspection. While researcher bias remains a valid threat, we mitigated it by using systematic methods and inter-researcher agreement where possible (e.g., using HAZOP for hazard analysis). Two authors performed each step that required manual analysis under the broad supervision of a subject matter expert from industry. The industry expert (and co-author) has more than 10 years of experience in the automotive industry and more than 15 years of experience in the development and certification of safety-related and safety-critical systems according to IEC 61508 [44] and ISO 26262 [45].
We identified the ML-specific design practices from secondary studies. So, if these studies systematically missed a subset of design practices (e.g., due to their scope), then those practices are missing from our study, too. To minimize this threat, we selected the most recent secondary studies, giving the latest and most comprehensive list of design practices. We also noticed that these studies used systematic and mixed-methods approaches, reinforcing our belief that the aggregated list of practices is comprehensive. The same validity threat applies to the (architecture) tactics used in this study, since we take the tactics directly from the literature.
Note that we followed the current state of practice in requirement elicitation and architecture assessment, which is human-effort intensive. At the same time, we employed systematic methods and qualitative evaluation to improve reproducibility and soundness. Another way to reduce human effort and improve reproducibility might be to automate the entire process; however, that is out of the scope of this work and a research direction on its own.
Internal validity. Many steps in our design assessment make assumptions (e.g., for risk assessment). As long as these assumptions hold, our results are likely accurate. To limit the risk of introducing unjustified assumptions, we only made assumptions that are grounded in the literature.
Our design assessment relies on publicly available documents. While we tried to be as comprehensive as possible, the results in this article are only as sound as the documentation, code structure, error-handling and logging code, and pointers to the base papers.
External validity. Our case study uses Apollo's automated driving framework on a Dutch highway segment. While Apollo is one of the most advanced open-source automated driving frameworks and the highway scenarios we chose are generic, our findings may not generalize. To improve the external validity of our findings, our approach should be applied to other Dutch highway segments, other highways, and other automated driving frameworks.

RELATED WORK
There is an industry-wide consensus on the importance of the safety of automated driving systems, especially after the catastrophic Uber automated driving vehicle crash that led to the death of a pedestrian. 26 Nowadays, every manufacturer and software vendor that tests their vehicles on public roads releases a public safety report [6, 7, 8], further acknowledging the relevance of safety. Unfortunately, these reports disclose neither the safety requirements nor how they are assessed, making it hard to gauge their usefulness. Our study is an attempt to bring safety assessment into the public domain. Making safety assessments publicly available is a first step in showing that the basic steps to avoid potential catastrophic events are taken right from the design stage.
The safety of automated driving systems can be assessed at many stages of product development, including design [54,55], development [74], validation & verification [9,30,82], and deployment [38]. Currently, the vast majority of the literature focuses on safety assessment in the validation & verification stage, including testing [9,30,82], particularly for ML-based systems [86]. While coding standards [74], design patterns [17,80,103], and best practices [32] exist to address safety during the design and development stages of traditional software, a similar set of guidelines for ML-based systems is still in its inception phase. Given that automated driving systems are used in highly dynamic, safety-critical settings, in close proximity to other traffic participants, and without operator (human driver) supervision, their safety assessment at every product life-cycle stage requires immediate attention.
The literature has shown that the cost of fixing an issue in software increases exponentially with each product life-cycle stage [47]. Design is, therefore, likely the best stage at which to start making automated driving safe. While relatively unexplored, studies on design assessment for safety have offered methods to elicit and assess requirements in settings such as connected driving [54,55,83]. To the best of our knowledge, the scientific literature has explored neither the design-level safety assessment of a mature automated driving framework for completely automated driving nor the design assessment of the limitations of ML systems under environmental factors [108]. Note that environmental factors, including adverse weather and illumination conditions, have been shown to cause functional limitations for ML systems that process data from various sensors [108]. Building on these prior works, this study presents a design safety assessment of an automated driving system for its use on a Dutch highway segment.

CONCLUSIONS
This article presents a case study assessing the safety of the Apollo automated driving framework's perception system at the design level. We elicited 58 safety requirements to enable automated driving on a Dutch highway segment. For the assessment of the safety requirements, we used 23 design choices: 13 relating to traditional software and 10 specific to ML-based systems. We found design evidence that 38 out of 58 requirements are met. While all requirements relating to traditional software are satisfied, many requirements specific to ML-based systems are not found to be satisfied. This points to the stack being more mature, from a safety standpoint, in its traditional software than in its ML-based software.
To the best of our knowledge, this study is the first in the scientific literature to present a safety design assessment of the perception system of a mature software stack for automated driving in a real-life setting. Our study opens up a multitude of future research directions, including safety requirement elicitation on other automated driving stacks and their comparison to Apollo, and techniques to reduce the human effort (and the resulting subjectivity) of requirement elicitation and assessment. For practitioners, our contributions include identifying the parts of Apollo that need more work and possible design choices to consider for closing the safety gap. We have shared our data, including results from intermediate steps, for transparency, replicability, and reusability of our work in research and practice.

Fig. 1. Overview of the design assessment process.

Fig. 2. Relevant parts of Apollo's architecture workflow. The dotted rectangles indicate the four functional categories of components. The dotted arrows inside the perception system indicate information processing pipelines. The pipelines and components consisting of ML systems are shown in light green. The following details of the architecture are not shown for simplicity: (1) the information from the HD map is used in (a) the traffic light detection and recognition pipeline, (b) the LiDAR obstacle detection, classification, and tracking pipeline, and (c) the radar obstacle detection and tracking pipeline; (2) the output of the localization fusion pipeline is used in (a) the radar obstacle detection and tracking pipeline and (b) the traffic light detection and recognition pipeline.

Table 1. ML Design Decisions Related to Safety

Table 2. Components and Count of Associated Requirements. Every requirement is assessed in the safety system (refer to Figure 2 for details) in addition to the component itself. Further, the requirements relating to sensors (IMU, LiDAR, Radar, Camera, GPS/GNSS) and the HD map (last six rows in this table) are assessed in all modules that use their output. Example requirements (see Reference [3] for a complete set of all requirements for each of the pipelines and sensors):
- If the performance of the camera obstacle detection, classification, and tracking pipeline deteriorates at night, then this deterioration in performance shall not lead to an incorrect estimation of the state of vehicles or other obstacles.
- If the output of any component in the radar obstacle detection and tracking pipeline is corrupted or lost, then this corruption or loss shall not lead to an incorrect estimation of the state of vehicles or other obstacles.
- If the performance of the LiDAR obstacle detection, classification, and tracking pipeline is deteriorated due to moderate inclement levels of fog, then this deterioration in performance shall not lead to an incorrect estimation of the state of vehicles or other obstacles.
- If the output of any component in the obstacle fusion pipeline is corrupted or lost, then this corruption or loss shall not lead to an incorrect estimation of the state of vehicles or other obstacles.
- If any component in the LiDAR localization pipeline becomes non-operational, then this failure shall not lead to an incorrect estimation of the ego pose.

Table 3. Summary Table