S3C: Spatial Semantic Scene Coverage for Autonomous Vehicles

Autonomous vehicles (AVs) must be able to operate in a wide range of scenarios, including those in the long-tail distribution comprising rare but safety-critical events. The collection of sensor input and expected output datasets from such scenarios is crucial for the development and testing of such systems. Yet, approaches to quantify the extent to which a dataset covers test specifications that capture critical scenarios remain limited in their ability to discriminate between inputs that lead to distinct behaviors, and to render interpretations that are relevant to AV domain experts. To address this challenge, we introduce S3C, a framework that abstracts sensor inputs to coverage domains that account for the spatial semantics of a scene. The approach leverages scene graphs to produce a sensor-independent abstraction of the AV environment that is interpretable and discriminating. We provide an implementation of the approach and a study for camera-based autonomous vehicles operating in simulation. The findings show that S3C outperforms existing techniques in discriminating among classes of inputs that cause failures, and offers spatial interpretations that can explain to what extent a dataset covers a test specification. Further exploration of S3C with open datasets complements the study findings, revealing the potential and shortcomings of deploying the approach in the wild.


INTRODUCTION
Autonomous vehicles (AVs) promise to increase safety through super-human perception and reaction ability, reduce emissions through less crowding, increase independence for those unable to drive, and provide other quality of life improvements [43]. Yet, automating the full scope of the generalized driving task has remained an ongoing challenge, limiting current systems to only partial autonomy under specific environments [1]. To reach viability for the generalized driving task, we must design, build, and test AV systems under the full range of scenarios they may encounter under their operational design domain (ODD) [43], including and especially the long tail of rare but safety-critical events.
The development of AV systems that can operate in such scenarios is predicated in part on the collection of rich datasets consisting of sensor inputs and expected outputs that can support training of learned components and system testing. The open question we tackle, which constitutes the developer's use case for this work, is: to what extent does a driving dataset cover the long tail distribution of relevant scenarios?
Developing a coverage measure that can render such a judgement requires the definition of a coverage domain, which identifies a set of relevant elements to which a test input may be mapped. Useful coverage domains are: (1) interpretable, containing elements that allow engineers to understand the semantics of an input and determine what additional inputs are lacking; (2) discriminating, mapping complex inputs into the same equivalence class when they cause similar system behavior and into different classes otherwise; and (3) automated, allowing them to scale to systems and inputs beyond the reach of manual methods. Coverage domains that satisfy these attributes exist for traditional software, ranging from the coverage of models to the coverage of code constructs.
Developing such coverage domains for the complex and high-dimensional sensor inputs that make up AV datasets is difficult. Simple domains, such as the number of inputs or the "miles travelled" acquiring inputs [4,23,27], while easily automated, are not discriminating or interpretable, as the abstraction fails to reveal information about what occurred during those miles or how much remains unseen. Code-coverage domains, while automated, are limited in their interpretability relative to the semantics of the complex sensor inputs. Alternatively, one could follow a requirements-based testing approach, constructing a dataset that by design provides domain coverage. For example, datasets collected through the exploration of hand-chosen scenarios, including those developed by NHTSA [43], e.g., a lane change maneuver, lead to coverage domains that are interpretable and discriminating by design. Although useful for identifying and testing key scenarios, such manual efforts offer very limited coverage of the long tail of potential scenarios, saturating too quickly to be informative about whole-system safety.
Still, NHTSA [43] and EUCAR [38] offer valuable insights about what type of coverage domain would be most relevant for AVs. They observe that "It will be difficult to achieve significant coverage of the variety and combination of conceivable test conditions, particularly related to ODD and OEDR." In particular, we note that object event detection and response (OEDR), which is a characteristic of level 3 AVs, requires the vehicle to perceive the spatial distribution of entities from the perspective of the ego vehicle (the one perceiving the environment). Evaluating such capabilities over a broad range of conditions is essential for high-confidence AV deployment.
Given an AV sensor input dataset, our goal is then to judge the extent to which the dataset covers a coverage domain with spatial semantics as per the ODD and the OEDR specifications. Consider a narrow test specification, TS1, which states: when at least a car or truck is near and to the left of the ego vehicle in the left lane, then the ego vehicle should not steer left by more than a given threshold. An appropriate coverage domain for the bolded precondition would distinguish inputs that have (1) a car that is near on the left in the left lane, and (2) a truck that is near on the left in the left lane. The input image in Fig. 1 provides coverage for 50% of the precondition as a truck is not present. Now consider a test specification, TS2, which states: when a car is near in the front left, near in the front right, or in the same lane as the ego vehicle, then the ego vehicle should not change steering rapidly unless changing lanes. The coverage domain for the bolded precondition would distinguish inputs that have (1) a car near in the front left, no car near in the front right, and no cars in the same lane, (2) a car near in the front left, no car near in the front right, and a car in the same lane, etc. The image in Fig. 1 covers 14% (1 of 7 scenarios) of the domain. Defining a coverage domain that captures a scene's spatial semantics permits the conditions that give rise to the satisfaction of a test specification precondition to be thoroughly assessed.
To quantify the coverage of such test specifications we propose the use of scene graphs (SGs) [6,22] as the basis for a sensor-independent coverage domain for AVs that enjoys all three benefits outlined above. As illustrated in Fig. 1, an AV camera renders images from which relevant vehicles and lanes are detected, serving as nodes in the SG, and then the identified spatial relationships between those entities define the graph edges. By encoding the semantic spatial relationships between entities in a scene, SGs are interpretable for a broad range of system requirements such as TS1 and TS2. The definition of an SG can be customized, in terms of both its vertex and edge sets, allowing the level of discrimination to be tailored to the system so that inputs are grouped into equivalence classes that render similar behavior. Finally, recent advances in machine learning allow high-quality SGs to be produced automatically from sensor data. This intermediate representation has been recognized in prior AV work for its ability to synthesize relevant information for risk assessment [30,53], mining robotics specifications [31], and image synthesis [41], but has not been explored for test coverage.
The key contributions of this paper are: (1) identification of scene graphs as an appropriate basis for an AV dataset coverage domain; (2) definition of the S3C framework, which abstracts scene graphs to define interpretable and discriminating coverage subdomains for test specifications; (3) a study applying S3C in simulation and on real AV datasets to explore its abilities as a test-adequacy computation technique; and (4) a publicly available open source implementation of S3C.

RELATED WORK AND BACKGROUND
This section reviews coverage families and introduces scene graphs, the basis for the proposed coverage domain.

Coverage
Coverage quantifies the extent to which a test suite exercises a system. We now discuss the kinds of coverage available for the AV domain, focusing on interpretability, discrimination, and automation.
White-box structural code coverage metrics have been widely studied and adopted for their ease of use and basic interpretability for developers [20,50,54]. A plethora of tools to automatically collect code coverage metrics exist, and developers readily interpret them as they correspond to their implementation. Research has shown how sufficiently precise structural coverage metrics, e.g., statement coverage rather than file coverage, can discriminate system behavior for a range of development tasks, such as fault localization [48]. However, these criteria do not provide this same utility when applied to AVs. Many AV system components have a linear control flow, e.g., parameterized control loops or machine-learned components for perception, limiting the ability of structural coverage to discriminate relevant behaviors [18]. Prior work generalized white-box structural coverage to neural networks [35,42]. Such coverage can be computed automatically, and further exploration showed its discrimination for the AV domain [44], but it is not interpretable due to the opaqueness of neural networks [26].
Black-box coverage includes a range of approaches, from those focused on domains derived directly from the requirements [2,32,37] to those on domains defined by system and environmental models [10,46]. They are interpretable and discriminating by construction, but typically involve manual crafting of models and formalization of requirements to derive the inputs. Recent work on black-box coverage of neural networks relies on either (a) manually developed models of the input space [55] or (b) machine-learned models of the input space [11]. Manually developed models can be tailored to a system so that the elements of the coverage domain are interpretable and can discriminate between relevant system behaviors, yet the manual effort involved typically limits automation. For target domains for which automated mechanisms exist, as we propose for AVs given the proper abstractions, this obstacle can be overcome. Machine-learned coverage domains can also be tailored to control the level of discrimination, and they enjoy the advantage of being automated, which allows them to scale to systems beyond the reach of manual methods. Unfortunately, like most ML approaches they lack interpretability [26].
Several coverage metrics have been proposed to target the AV domain. The number of miles driven has been explored due to its ease of automation, but it is not discriminating or interpretable [4,23,27]. Majzik et al. defined scenario coverage [28], where a scenario describes a sequence of scenes, and a scene describes a snapshot of the environment. As a form of requirements testing, such a process guarantees interpretable coverage of safety-critical inputs that can discriminate system behavior, but the substantial manual effort in designing pertinent scenarios prevents automation [38,43]. Hu et al. compute trajectory coverage by examining utilization of road regions within a scenario [19]. It has limited discrimination power as it makes over-simplifying assumptions about the road structure and ignores entities. Closest to meeting our requirements is Hildebrandt et al.'s recent work PhysCov [18], which combines sensor data and AV kinematics to define coverage over the environment and internal system state. While this gray-box approach is automatic and discriminating (we compare against PhysCov as a baseline for discrimination in § 4.2), describing coverage based on LiDAR-sensor values provides limited interpretability, capturing only spatial information and lacking support for entity and relation semantics. No prior approach provides an interpretable, discriminating, and automated coverage metric for the AV domain at the input level.

Scene graphs
A long-studied task in computer vision is that of object detection. Early systems could place bounding boxes around instances of a few different object classes, e.g., cars, people, buses. Modern deep learning approaches, like Detectron2 [49], are pre-trained to identify thousands of object classes and perform panoptic segmentation, which identifies pixel masks for each instance of a class.
Most scene understanding and reasoning tasks require information about the relationships between objects, e.g., one car next to another. SGs represent objects and relationships in the form of a directed graph [22]; vertices define the set of entities in a sensor input, e.g., an image or a LiDAR point cloud, and edges define spatial relations over the entities. SGs can be enriched with vertex or edge attributes that capture additional semantic details, e.g., car model. SGs define an abstraction of the sensor data and can be defined to capture just the information needed in a scene-based specification.
There are a growing number of scene graph generation (SGG) techniques [6]. Most SGGs consist of an object detection stage followed by a pairwise relation recognition stage. Within the context of AVs, specialized SGGs have been proposed that take advantage of domain semantics, e.g., static versus dynamic objects, road types, and vehicle size, to produce more meaningful SGs [25,29,36]. They are also becoming more configurable, allowing definition of object categories, e.g., vehicle types; enhanced attribute capture, e.g., direction of vehicle movement; or refinement of relationships, e.g., what distances between vehicles are considered 'near'. Even more sophisticated SGGs can track objects across video frames [52] and generate relations from natural language captions for images [51], though these are not specialized to the AV domain.
Among SGGs, we highlight RoadScene2Vec, an open-source toolset for generating, encoding, and utilizing road SGs specialized for the AV domain [29,53]. Of particular interest to S3C is RoadScene2Vec's ability to generate SGs from images collected in the CARLA simulator or other image datasets. The generator offers a rich set of configuration parameters that facilitate experimentation with different pixel-to-SG abstractions, which we leverage in our approach. For example, the configuration file includes parameters that define what entities to include (the default list are those included in the CARLA simulator such as car, pedestrian, etc.), what directional relations can be instantiated (pairs of object types for which relationships are generated, such as ego to car, or car to pedestrian, e.g., direct front) along with their thresholds, and what proximity relations must be instantiated (e.g., near collision, super near, very near, near, and visible), also with their thresholds (e.g., [0, 4), [4, 7), [7, 10), [10, 16), and [16, 25) meters).
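To make these configuration parameters concrete, the following sketch mirrors them as a Python dictionary; the keys and structure are illustrative and do not reproduce RoadScene2Vec's actual configuration schema, though the proximity bins match those quoted above.

```python
# Hypothetical SGG configuration illustrating the parameters described
# above; keys are illustrative, NOT RoadScene2Vec's actual schema.
sgg_config = {
    # Entity kinds to include as vertices (CARLA-style classes).
    "entities": ["ego", "car", "truck", "pedestrian", "lane"],
    # Directional relations: (subject, object) pairs that get edges.
    "directional_pairs": [("ego", "car"), ("ego", "truck"),
                          ("car", "pedestrian")],
    # Proximity relations with distance thresholds in meters,
    # matching the bins quoted in the text: [0,4), [4,7), ...
    "proximity_relations": {
        "near_coll":  (0, 4),
        "super_near": (4, 7),
        "very_near":  (7, 10),
        "near":       (10, 16),
        "visible":    (16, 25),
    },
}

def proximity_label(distance_m, config=sgg_config):
    """Map an inter-entity distance to its proximity relation, if any."""
    for label, (lo, hi) in config["proximity_relations"].items():
        if lo <= distance_m < hi:
            return label
    return None  # beyond 25m: no proximity edge is generated
```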

APPROACH
We formalize coverage for the AV domain, describe the architecture to compute S3C, and conclude with a discussion of its applicability.

AV Coverage Domains
Given a set of sensor inputs, $D$, selected to represent a much larger set, $U$, of inputs that could be encountered in deployment, an abstraction, $\alpha : U \to A$, retains features of input data as elements of an abstract domain, $A$. Abstractions allow quantification of the extent to which $D$ is representative of $U$ relative to those features: $(\alpha, A)$ forms an adequacy criterion for $D$ and helps to identify limitations in $D$, i.e., uncovered elements of $A$ that developers would target with data augmentation or additional test inputs.
There are two significant challenges to making $(\alpha, A)$ for the AV domain interpretable, properly discriminating, and automated.
The first challenge is the development of the abstraction $\alpha$. For AVs, we contend that abstractions must lift raw sensor inputs (e.g., camera images, LiDAR point clouds) to semantic spatial distributions relative to the ego vehicle, since those ultimately inform the vehicle's behaviors in the physical world. As such, we propose abstractions that are refinements of SGs defined over entities in the AV domain (e.g., cars, pedestrians) and their relationships (e.g., left, in front). SG abstractions, $sg(x)$, can be computed automatically and provide interpretability, but are just the first step in the input abstraction process, which allows subsequent abstractions, $a$, to be composed, $\alpha = a \circ sg$, that discriminate relevant system behavior.
Second, $A$ aims to define the feasible abstract domain, $\{\alpha(x) : x \in U\}$, but this may be challenging to calculate when a precise definition of $U$ is not available. In traditional notions of structural code coverage it is common to overapproximate the coverage domain, e.g., assuming that all pairs of definitions and uses are feasible. We adopt a similar approach by using the structure of $A$ to compute an overapproximation, $\hat{A}$, which is then used to compute coverage. For example, an identity abstraction, $\alpha(x) = x$, on 100 by 100 pixel images with 256 values per RGB channel could be overapproximated by the full space of pixel combinations, $\hat{A}$. This is a severe overapproximation, since any realistic definition of $U$ will comprise an infinitesimal portion of $\hat{A}$. A coarser abstraction that counts the number of cars in an input would yield an abstract domain that could be overapproximated relatively accurately with reasonable assumptions about the deployment context, e.g., $\hat{A} = [0, 100]$. These extreme abstractions illustrate the concept, but a more customizable framework for abstracting inputs is needed.
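As a minimal illustration of these ideas, the sketch below computes coverage of an overapproximated abstract domain under the car-counting abstraction; `count_cars` stands in for a real perception component and is an assumption, not part of S3C.

```python
# A minimal sketch of the coarse car-counting abstraction and the
# coverage it induces over an overapproximated abstract domain A-hat.

def coverage(dataset, alpha, a_hat):
    """Fraction of the overapproximated domain A-hat exercised by D."""
    covered = {alpha(x) for x in dataset}
    return len(covered & set(a_hat)) / len(a_hat)

# Overapproximation A-hat = [0, 100] cars, as in the example above.
a_hat = range(0, 101)
# Usage (count_cars is a hypothetical perception component):
#   coverage(D, lambda x: count_cars(x), a_hat)  ->  e.g., 0.12
```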

Architecture
The S3C architecture consists of a sequence of four components, shown in Fig. 2, whose functionality is described next.

Scene Graph Generation.
A scene graph generator (SGG) maps a set of sensor inputs, $U$, to a graph representation, $sg : U \to G$, where $G$ denotes the set of scene graphs.
SGGs can be parameterized to define how they interpret sensor data: for example, the set of entity kinds, $K$, represented by graph vertices, the relationships among entities, $R$, that define edges in the graph, and the attribute values, $M$, associated with edges and vertices. More formally, a scene graph, $g = (V, E, ego \in V, kind : V \to K, rel : E \to R, attr : V \cup E \to M)$, is a directed graph with a distinguished ego vertex and functions to access the kind of a vertex, the relation encoded by an edge, and the attribute values of vertices and edges. The map $M$ is used to associate attribute values with each type of attribute.
Example. RoadScene2Vec [29] uses a forward-facing camera image to generate a scene graph. A configuration file enables the user to tailor the SGs to their specific ODD and AV characteristics by specifying the entity and relationship kinds. For example, if the ODD is a rural environment, the entity list may include tractors.
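For concreteness, the following sketch encodes a scene graph in this form using networkx; the vertex names, kinds, and relations are illustrative rather than the output of a real SGG.

```python
import networkx as nx

# A minimal sketch of the scene graph definition above. A MultiDiGraph
# permits several relations between the same vertex pair, e.g., a car
# that is both directly in front of and near the ego.
g = nx.MultiDiGraph()
g.add_node("ego", kind="ego")
g.add_node("car_2", kind="car", id=2)          # attr map on vertices
g.add_node("lane_0", kind="lane", label="Ego Lane")

g.add_edge("car_2", "lane_0", rel="isIn")       # lane containment
g.add_edge("car_2", "ego", rel="inDFrontOf")    # directional relation
g.add_edge("car_2", "ego", rel="near")          # proximity relation

kind = nx.get_node_attributes(g, "kind")        # kind : V -> K
rels = [d["rel"] for _, _, d in g.edges(data=True)]  # rel : E -> R
print(kind["car_2"], sorted(rels))  # car ['inDFrontOf', 'isIn', 'near']
```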

Abstraction.
Abstractions transform SGs to retain salient information for analysis. An SG abstraction, $a : G \to G$, allows the discriminating power to be refined. Since $a \circ sg$ defines an abstraction of the input sensor data, $a$ defines a coverage domain, $\{a(sg(x)) : x \in U\}$, for sensor datasets. Moreover, as we shall see in the study, different abstractions may be defined and composed to provide alternate coverage measures for analyzing a given dataset.
Separating generation from abstraction offers several advantages. First, off-the-shelf SGGs can be reused without modification. Second, different SGGs, perhaps using different sensors, could be used to generate more accurate SGs. Third, abstractions can be defined and reused across systems. Finally, an appropriate composition of abstractions can be selected to suit developer needs.
Example. The SG vertices generated by RoadScene2Vec include an identifier attribute associated with each entity to differentiate among multiple instances of an entity class. Such identifiers are superfluous to the spatial distribution of entities a coverage domain is meant to capture and can be abstracted from the graph. For example, the left of Fig. 1 shows 'car_2' and 'car_3' in the middle lane; the spatial relationships would be identical if they were both relabelled 'car'. An identifier-suppressing abstraction can be defined to remove these identifiers. The space of possible abstractions also includes ones that transform the structure of the graph. For example, the vertices in the abstraction could be restricted to those that lie on paths of length at most $k$ from the ego vehicle, with the edge set restricted to the edges incident on the retained vertices. Both abstractions are sketched below.
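The following sketches both abstractions over the networkx encoding shown earlier; they are illustrative simplifications, not S3C's actual implementation.

```python
import networkx as nx

def suppress_identifiers(g):
    """Drop per-instance identifier attributes so that, e.g., 'car_2'
    and 'car_3' in the same spatial configuration yield isomorphic
    graphs when compared on vertex kinds alone."""
    h = g.copy()
    for _, data in h.nodes(data=True):
        data.pop("id", None)
    return h

def restrict_to_radius(g, k, ego="ego"):
    """Keep only vertices on paths of length at most k from the ego
    vehicle, with edges restricted to those among retained vertices."""
    reachable = nx.single_source_shortest_path_length(
        g.to_undirected(as_view=True), ego, cutoff=k)
    return g.subgraph(reachable.keys()).copy()
```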

Abstraction Clustering.
Given a dataset, $D$, and abstraction, $a$, the collection of abstracted scene graphs, $ASG(D) = \{a(sg(x)) : x \in D\}$, is the multiset of scene graphs computed over $D$. This multiset is partitioned into equivalence classes of isomorphic graphs, $\widehat{ASG}(D)$: each class $c$ is a maximal set such that all graphs in $c$ are isomorphic to each other and not isomorphic to any other graph in $ASG(D) - c$, and $|c|$ gives the number of times the abstract scene graph occurs in the dataset. If no pair of ASGs is isomorphic, then $|ASG(D)| = |\widehat{ASG}(D)|$ and each ASG is in its own class; in most cases, however, clustering reduces the number of graphs to consider, such that $|\widehat{ASG}(D)| < |ASG(D)|$. In general, performing the clustering is computationally expensive because of its reliance on graph isomorphism; there is currently no known polynomial algorithm for determining if two graphs are isomorphic [3,14]. The worst case requires testing isomorphism across all pairs of ASGs and is $O(|ASG(D)|^2)$. However, in practice we can exploit the data's long tail distribution to drastically reduce the number of computations. We first use a hash table to group the ASGs using a hash based on easy-to-compute graph statistics, such as the number of entities (nodes) and relations (edges) or the number of entities by kind, which can be done in $O(|ASG(D)|)$ for the average case. Then, the final equivalence classes are computed by pairwise isomorphism testing within these smaller groups, reducing the number of computations and allowing for parallelization. With additional optimizations, such as special handling for the empty abstraction, computing the clustering can be performed on practical time scales.
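The following sketch illustrates the hash-then-isomorphism clustering, assuming networkx graphs with 'kind' vertex attributes and 'rel' edge attributes as in the earlier sketches; S3C's actual implementation adds further optimizations.

```python
import networkx as nx
from collections import defaultdict

def cluster_asgs(asgs):
    """Group abstracted scene graphs into isomorphism classes."""
    # 1) O(n) pre-grouping on cheap statistics: node/edge counts and
    #    the multiset of vertex kinds.
    buckets = defaultdict(list)
    for g in asgs:
        kinds = tuple(sorted(d.get("kind", "") for _, d in g.nodes(data=True)))
        buckets[(g.number_of_nodes(), g.number_of_edges(), kinds)].append(g)

    # 2) Pairwise isomorphism tests only within each (small) bucket,
    #    matching on vertex kinds and edge relations.
    node_eq = nx.algorithms.isomorphism.categorical_node_match("kind", None)
    edge_eq = nx.algorithms.isomorphism.categorical_multiedge_match("rel", None)
    classes = []
    for group in buckets.values():
        reps = []  # (representative, members) per class in this bucket
        for g in group:
            for rep, members in reps:
                if nx.is_isomorphic(g, rep, node_match=node_eq,
                                    edge_match=edge_eq):
                    members.append(g)
                    break
            else:
                reps.append((g, [g]))
        classes.extend(members for _, members in reps)
    return classes
```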
Example. Fig. 3 visualizes the count and size of the equivalence classes of an abstraction, further discussed in § 4, that retains information about entities, their relationships, and their locations within the road lanes, applied to analyze an image dataset, $D$. Here, $|ASG(D)| = 46{,}006$ and $|\widehat{ASG}(D)| = 9{,}532$, a reduction of almost 80%. This also shows the imbalance in the equivalence classes: the largest class, that of an empty single-lane road, contains 25% of the dataset; in other words, a quarter of the dataset presents no diversity in terms of spatial distribution. The largest 10 classes contain more than half of the dataset, and the largest 6, containing 22,684 images (49.3%), represent different lane configurations with no other entities besides the ego vehicle. Additionally, we can see the effect of the long tail of the distribution: 6,450 dataset images are in singleton classes, meaning more than 2/3 of the classes contain only a single image. Computing this clustering took 1,047 seconds using 30 threads on a 32-logical-core Xeon 4216 with 128GB of RAM; § 6 discusses time performance for additional datasets.

Coverage Computation.
This component refines the notion of interpretability introduced earlier by considering the matching between the abstraction, $a \circ sg$, and the preconditions of test specifications, $\phi_{TS} : U \to \mathbb{B}$ (we refer to $\phi_{TS}$ as $\phi$ when $TS$ is clear from context). Let $A$ be the codomain of $a \circ sg$, the set of possible abstracted scene graphs. We say that $A$ is interpretable relative to $\phi$ when it reflects the semantic distinctions made by $\phi$ in the input domain: $\phi_A(g) \iff \forall x \in U : g = a(sg(x)) \implies \phi(x)$. For an interpretable abstraction, coverage of a test specification precondition, $\phi$, is the portion of the space defined by $\phi$ that is exercised by $ASG(D)$. To determine coverage, $\phi$ is projected onto $A$ to compute a subspace, $A_\phi \subseteq A$, that includes all possible ASGs that satisfy $\phi$. Satisfaction of $\phi$ depends on the subset of the relations, entities, and attributes in $A$ that are referenced in $\phi$; many SGs may satisfy $\phi$ but vary only in features of $A$ independent of $\phi$. We apply a form of model-based slicing [24] to reduce the coverage domain from $A_\phi$ to a single representative of each subset of $A_\phi$ whose members vary only in features on which $\phi$ does not depend. For each clause in the disjunctive normal form encoding of $\phi$, we slice based on the clause and eliminate any portion of the slice that is not reachable from the ego vehicle in the sliced scene graph. The coverage domain, $cov(\phi)$, is the union of these sliced subspaces over the clauses of $\phi$, and coverage is then computed for a test specification as the fraction of $cov(\phi)$ exercised by the sliced abstractions of the inputs in $D$. To establish a finite coverage domain we bound the number of vertices in an SG; in practice, SGGs can only produce bounded SGs. Note that $cov(\phi)$ may overapproximate the coverage domain, but it is a tight approximation when $\phi$ is purely conjunctive or consists of disjunctions of independent clauses. This is because the former reduces to a singleton coverage domain under slicing, i.e., there is a car in front of the ego vehicle, and the latter can be computed as the product of the valuations of each independent clause.
Example. Consider the test specification precondition, $\phi$: the closest lane to the left of the ego contains a car at distance $d_1, \ldots,$ or $d_n$; i.e., for an input, $x \in U$, $\phi(x) \iff x$ contains a car in the closest lane to the left of the ego at distance $d_1, \ldots,$ or $d_n$. Like all test specifications, it is stated relative to the ego vehicle. The precondition establishes bounds on the number of lanes and the number of cars in subgraphs which satisfy it. Specifically, there is 1 lane that is the closest lane to the left of the ego, and it may contain up to $n$ cars, one each at distance $d_i$ from the ego vehicle. There are $2^n$ possible subgraphs that can be constructed from the direction, containment, and distance relations mentioned in this specification. One of those subgraphs has no cars, which falsifies the precondition, leaving $|cov(\phi)| = 2^n - 1$ satisfying subgraph models to cover. In this example, because different cars can appear at different distances independently, the approximation of the coverage domain is tight. We will see specifications in § 4.3.1 for which this approach suffices. We leave to future work the problem of computing tight approximations of the coverage domain for test specifications with dependencies among precondition clauses.
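The sketch below works through this example for the five proximity bins used earlier ($n = 5$), enumerating the satisfying subgraph models and computing coverage against a set of observed models; the encoding of models as boolean tuples is ours.

```python
from itertools import product

# A car may appear independently at each of n distances in the lane
# to the left of the ego, giving 2^n subgraph models; the all-absent
# model falsifies the precondition, leaving 2^n - 1 models to cover.
distances = ("near_coll", "super_near", "very_near", "near", "visible")
models = {m for m in product((False, True), repeat=len(distances)) if any(m)}
assert len(models) == 2 ** len(distances) - 1  # 31

def precondition_coverage(observed):
    """observed: set of tuples recording, per input, which distances
    held a car in the lane left of the ego. Coverage is the fraction
    of satisfying models exercised by the dataset."""
    return len(observed & models) / len(models)
```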

Applicability of Coverage Values
We identify 3 potential failure modes that can limit S3C's applicability.
Abstraction-Sensor Mismatch. The abstraction design must align the coverage domain with the physical capabilities of the system's sensors. That is, the system's sensors must be sufficiently sensitive to capture the dimensions of the abstraction. For example, an abstraction domain that includes vertex attributes based on the color of an entity produces an immeasurable distinction if the system is using a LiDAR sensor, which cannot sense color. Another example is a camera with a shallow depth of field that blurs far-away entities. This could obscure entities and distort distance estimates, limiting the entities and relationships in the abstraction. These limitations can lead to portions of the abstraction domain being impossible to cover, causing coverage to appear artificially low. This lack of sensitivity can also affect the interpretation of the abstract domain. Consider an abstraction intended to count the number of cars. Given a shallow-depth camera, the abstraction more accurately counts the number of cars within a shallow distance that are not occluded from the camera, which alters the interpretation of the abstraction.
Imprecise Abstractions. Designing an abstraction domain requires careful consideration to ensure that the retained information is sufficiently specific to differentiate between inputs leading to distinct system behaviors. For example, the abstraction that counts the number of cars in the scene does not provide sufficient specificity to appropriately differentiate system behavior: a car parked off the road and a car parked in the ego vehicle's lane both yield a count of 1, but should elicit completely different behaviors. This can lead to artificially high coverage, as the coverage domain is not capturing the relevant details of the problem domain. We explore the level of abstraction specificity required for utility in § 4.2.
Perception Failures. The process of automatically mapping sensor inputs to the abstraction domain may introduce inaccuracies where an input is not mapped to the intended SG. An incorrect mapping can occur due to failures in sensing or perception, and can cause the pipeline to incorrectly identify entities, miss entities, or perceive entities that do not exist; § 6 delves into this analysis. These failures can both increase and decrease coverage, as two inputs that should be mapped to the same SG could be separated, or vice versa. Although problematic, these failures can be mitigated, e.g., Detectron2 provides a confidence threshold that can be tuned to reduce the number of inaccuracies, and RoadScene2Vec provides parameters to redefine the relationships. Further, combining multiple different sensors or perception techniques, or sequences of sensor values, can reduce perception failures [13]. The rapid evolution of techniques for SG generation, particularly when specialized for a domain like AVs, the suggested mitigation strategies, and S3C's parameterization offer a rich space of tradeoffs to overcome these challenges.

STUDY
This study explores the following research questions:
• RQ#1: How effective is S3C at discriminating inputs that lead to different behaviors? More precisely, we assess the discriminating capabilities at different levels of abstraction in terms of equivalence classes of scene graphs covered during training versus those failing during testing.
• RQ#2: How effective is S3C in assisting with the interpretation of the differences between datasets and of the coverage achieved for various dataset subsets with regard to test specifications?

Setup
To answer the research questions we designed an experiment. The units of analysis are datasets consisting of sensor inputs captured from an AV operating in simulation. The treatments are the approaches to abstract the inputs from a dataset into classes that correspond to a coverage domain. The dependent variables correspond to the ability of the treatments to discriminate and interpret those sensor inputs. The hypothesis is that S3C will outperform alternative approaches, and that the best instance of S3C's approach will be able to assist in the discrimination of failing inputs not seen in training, in the analysis of those failures, and in the interpretation of the extent to which test specifications are covered by a dataset. We provide details on these aspects next.

Dataset.
A dataset of sensor inputs was collected by launching a pre-programmed AV in a simulated environment. We use the CARLA simulator [12], a state-of-the-art driving simulator that is widely used in AV development and testing. CARLA provides multiple preconstructed environments designed to mimic various driving environments. We utilized 4 environments, called "towns" in CARLA, to cover a mix of suburban, urban, and highway driving and roads with 2, 4, and 8 lanes. To vary the spatial structures observed in the simulations, we ran simulations under two conditions: zero vehicles and max vehicles. In the zero-vehicle scheme, the ego vehicle is the only dynamic vehicle on the roadway. In the max-vehicle scheme, we utilize CARLA's autopilot feature to place and guide the maximum number of vehicles the environment can support, ranging from 101 to 372 vehicles for the towns chosen. The vehicles are generated to be 80% cars and 20% trucks, each randomly sampled from the 25 car and 4 truck models available. Using the CARLA Python API, we recorded at 5Hz: an RGB sensor input from a single camera mounted on top of the ego vehicle, the steering commands supplied by the ego vehicle's autopilot, the orientation and position of all vehicles, and road structure details, e.g., lanes, curvature, etc. For each of the 2 traffic schemes and 4 maps, we executed 15 tests of 5 minutes in duration with different random seeds; a minimal sketch of such a recording loop is shown below.
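The following is a minimal sketch of a 5Hz recording loop against CARLA's Python API; a full harness additionally seeds traffic, varies towns and random seeds, and logs all vehicle poses and road structure, and the camera placement shown here is illustrative.

```python
import carla

# Connect to a running CARLA server and attach a 5Hz RGB camera to an
# existing ego vehicle (assumed to have been spawned already).
client = carla.Client("localhost", 2000)
world = client.get_world()
bp = world.get_blueprint_library().find("sensor.camera.rgb")
bp.set_attribute("sensor_tick", "0.2")  # 0.2s period = 5 Hz

ego = world.get_actors().filter("vehicle.*")[0]
ego.set_autopilot(True)
camera = world.spawn_actor(bp, carla.Transform(carla.Location(z=2.0)),
                           attach_to=ego)

# Record each frame's image alongside the autopilot steering command.
log = []
camera.listen(lambda image: log.append(
    (image.frame, image.raw_data, ego.get_control().steer)))
```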

Steering AV.
Since the study aims to assess the ability of S3C to discriminate passing from failing behaviors with respect to a coverage domain, we built an AV steering model that consumes images to steer the vehicle; this allows us to record which inputs have been covered, i.e., seen by the model during training. We trained an AV using a PyTorch [34] model based on ResNet34 [17] pre-trained on ImageNet [9], with the last layer modified to predict a steering angle, trained to mirror the behavior of the CARLA autopilot. This allows us to use the CARLA autopilot as both the source of training data and the oracle of test correctness for the learned system. As this task does not involve traffic decisions, we remove all observations that involve approaching or transiting intersections, and prevent the ego vehicle from changing lanes. Further, ≈5% of data points were removed due to simulation failures in CARLA (sensor timing mismatch, rendering error, etc.), resulting in a dataset of 46,006 observations partitioned into 80% (36,809) training and 20% (9,197) test inputs. When evaluating model performance, we designate a failure as a steering prediction more than 5° from the ground truth. An error threshold of 5° is sensible because at 25mph, steering 5° off normal results in the vehicle fully crossing into an opposing lane in less than 1 second. This leads to 17 failures among the 36,809 training inputs (0.05%) and 362 failures among the 9,197 test inputs (3.94%).
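The sketch below shows the shape of such a model and oracle, assuming current torchvision weights naming; hyperparameters and the training loop are omitted, and the function names are ours.

```python
import torch
import torchvision

# ResNet34 pretrained on ImageNet with the final fully-connected layer
# replaced by a single-output regression head for the steering angle.
model = torchvision.models.resnet34(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 1)

def is_failure(predicted_deg, autopilot_deg, threshold_deg=5.0):
    """Oracle used in the study: a prediction more than 5 degrees from
    the CARLA autopilot's steering command counts as a failure."""
    return abs(predicted_deg - autopilot_deg) > threshold_deg
```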

Treatments.
As discussed in § 3.3, SGGs are rapidly improving and are being tailored to the AV domain. Still, they are sophisticated systems with their own failure modes and complex configuration spaces. Working in simulation allows us to control for the quality of SGs as a coverage domain. To explore the effect of SG expressiveness, we implemented five SGG-based abstractions including various levels of semantic information. Each abstraction uses ground-truth data from CARLA's internal state and includes all entities within a 50m by 50m square, horizontally centered on the ego vehicle and offset to include the 45m ahead of the ego vehicle.
To evaluate the effect of the SGG-based abstraction considered, we explore five abstractions of varying specificity, as outlined in Table 1. The weakest abstraction uses only the list of entities in the scene without any spatial information, providing a baseline for our experiments that emulates what entity detection tools provide. We additionally explore three abstractions inspired by the abstraction used in RoadScene2Vec [29] that add information about lanes and inter-entity relationships. We utilize the default inter-entity relationship configuration of RoadScene2Vec (introduced in § 2.2) to add spatial relationships to the graph; full details of the configuration space are available in the online appendix. In its original abstraction, RoadScene2Vec assumes that a road can always be characterized into three lanes: 'Left', 'Middle', and 'Right', with the ego vehicle always in the 'Middle' lane. We remove this assumption and strengthen this abstraction, creating separate nodes for each logical lane in the graph; the lane containing the ego vehicle is labeled the 'Ego Lane', with lanes to the right or left traveling with the ego lane labeled 'Right Lane 1', 'Right Lane 2', etc. Additionally, any other lanes present are labeled 'Opposing Lane $i$' with $i$ between 0 and the number of other lanes present. In our experiments, the lane structure is recovered using the ground-truth simulation data, though it could come from high-definition maps in use in modern AVs [47] or from onboard sensing [33]. Finally, we explore an abstraction that models lanes as a sequence of nodes that preserve the road structure in 2.5m increments, e.g., a 10m section of two-lane road becomes a directed graph on 8 nodes, 4 per lane, where each node represents a 2.5m length of a lane and has edges to the lane segment that traffic will flow to and to the segment corresponding to a lane change; each segment also has a curvature attribute (see the sketch below). We add an edge between each entity and the lane segment it occupies; we do not include inter-entity relationships, as many relationships can be approximated from the graph path between entities through the road; future work should explore adding additional relationships. This refines the granularity of the abstraction and thus increases the number of equivalence classes generated.
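The sketch below illustrates the lane-segment construction for a 10m, two-lane stretch; the node naming and attribute choices are ours, not S3C's.

```python
import networkx as nx

# A 10m stretch of two-lane road: 8 nodes (4 per lane, 2.5m each) with
# flow edges along each lane, lane-change edges between adjacent lanes,
# a curvature attribute per segment, and entities attached to the
# segment they occupy.
g = nx.DiGraph()
lanes, segments = 2, 4
for lane in range(lanes):
    for s in range(segments):
        g.add_node((lane, s), kind="lane_segment", curvature=0.0)
        if s > 0:
            g.add_edge((lane, s - 1), (lane, s), rel="flows_to")
for s in range(segments):
    g.add_edge((0, s), (1, s), rel="lane_change")
    g.add_edge((1, s), (0, s), rel="lane_change")

g.add_node("ego", kind="ego")
g.add_edge("ego", (0, 0), rel="occupies")  # ego in lane 0, first segment
```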

RQ #1: On Discrimination
In this section we utilize instances of S3C with multiple abstractions to assess its discriminating capabilities and tradeoffs.
Metrics. For S3C to provide utility it must discriminate among complex inputs that cause distinct system behaviors. We partitioned the dataset so each input is either used for training or testing, and based on the AV model studied each input results in either a pass or fail. To analyze the discriminating ability of S3C as a coverage metric, we measure how well it discriminates between passing training inputs and failing test inputs. We compute the Percentage of testing inputs that cause Novel Failures that reside in an ASG equivalence class that was Not Covered during training under a given abstraction, which we refer to as the PNFNC metric. A novel testing failure is one that resides in an equivalence class with no training failures; given the 17 (0.05%) training failures, there may be test and training failures that are equivalent, and thus the metric should not penalize grouping these failures together. A low PNFNC means the abstraction provides limited discrimination to isolate the failure-inducing inputs from those covered during training, while a high PNFNC indicates high discrimination. However, it is important to note that an abstraction that generates a higher number of equivalence classes is likely to better discriminate between inputs. In the extreme case, a scheme that considers all inputs distinct trivially results in 100% PNFNC, to the detriment of interpretability. Thus, we also examine the total number of equivalence classes needed to partition the dataset; an abstraction aims to maximize PNFNC while minimizing the total number of equivalence classes.
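The sketch below captures one reading of the PNFNC computation consistent with the counts reported in the results; `cls` is an assumed mapping from inputs to equivalence-class labels, not part of the artifact.

```python
def pnfnc(train_inputs, train_failures, test_failures, cls):
    """Percentage of novel test failures in classes not covered in training.
    cls: maps an input to its ASG equivalence-class label."""
    covered = {cls(x) for x in train_inputs}
    train_fail_classes = {cls(x) for x in train_failures}
    # Novel failures: test failures whose class has no training failure.
    novel = [x for x in test_failures if cls(x) not in train_fail_classes]
    if not novel:
        return 0.0
    uncovered = [x for x in novel if cls(x) not in covered]
    return 100.0 * len(uncovered) / len(novel)
```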
Baselines. Besides the five SG-based abstractions described in § 4.1.3, we instantiate 3 baselines. First, the Ψ10 algorithm from PhysCov [18], a LiDAR-based technique for clustering AV inputs. Since its default parameterization produces a relatively small number of classes, we devised a parameterization Ψ*10 with increased precision that generates a number of classes comparable to the SG-based abstractions. Second, we cluster the inputs based on the ground-truth steering output. We refer to this as the 'Codomain Ground Truth' since it groups inputs based on the expected output and thus is not usable in practice; as this relies on ground-truth output information, it can be viewed as a test equivalency oracle from the output perspective. Third, we show the average performance of an algorithm that, given a max number of equivalence classes, uniformly at random chooses a class for each data point (details and implementations of the random baseline and Ψ*10 are available in the online repository).
Results. As seen in Fig. 4, the weakest SG-based abstraction, which captures only the entities present, achieves a PNFNC of 2.9% from 305 classes, performing markedly better than the similarly-sized original PhysCov (Ψ10) configuration's performance of 0.5% from 235 classes. The four abstractions that consider spatial information provide a PNFNC that outperforms the random baseline and is substantially greater still. One of these spatial abstractions yielded 215 novel failures, of which 100 were not covered, for a PNFNC of 47%, while another achieves a PNFNC of 101/263 = 38%. These both outperform PhysCov's Ψ*10 PNFNC of 119/343 = 35%. We note that although the PNFNC decreases across these three techniques, the number of novel failures and those uncovered both increase, indicating that the technique is discriminating behavior less efficiently. This exposes a trade-off between the number of classes, the interpretability of the classes, and the overall performance.
We observe that the most precise SG-based abstraction, ERS, achieved a PNFNC of 63% while using 22,987 classes, compared to the 'Codomain Ground Truth' PNFNC of 65% using 15,532 classes, demonstrating the capability of rich graph abstractions to discriminate system behavior while pointing to potential future refinements to reduce the number of classes used. Further, this increased capability to discriminate behavior comes at the expense of interpretability, as the richer abstraction requires more effort to comprehend. The representative ASG of an ERS equivalence class has on average 131 nodes and 264 edges, compared to 16 nodes and 47 edges for the coarser ELR abstraction; this substantially smaller graph space may be much easier for developers to quickly interpret and utilize.
We observe that S3C performs better than the random baseline across all abstractions, and slightly better than existing baselines, i.e., PhysCov, while adding a new dimension of interpretability. These results demonstrate the ability of S3C, particularly when implemented with semantically rich graph abstractions, to discriminate between inputs that lead to different system behaviors.

RQ #2: On Interpretation
For this research question, we investigate the interpretability of classes generated by S3C. We focus on the ELR abstraction as it provides a balance between PNFNC and the number of classes based on Fig. 4, while providing a sufficiently rich semantic basis for expressing the test specifications studied in Table 2. We analyze the classes from two interpretation perspectives. First, we compare the ASGs of testing failures against the ASGs of the training passes used in RQ#1 to determine what particular spatial elements contribute to test failures. Second, we explore the extent to which the datasets associated with the CARLA towns cover a set of specification preconditions defined in terms of scene spatial structure.

Differentiating Training Images from Failure-causing Testing Images.
We aim to identify the features that differentiate the failure test set from the training pass set. To achieve this, we build and train a decision tree classifier to determine whether a given ASG belongs to the PNFNC failure test set or the training pass set. The tree nodes provide insights into the features that significantly contribute to test failures and distinguish them from non-failing inputs. The tree is fed with a feature vector of size 484, encompassing all possible combinations of the main features extracted from the ASGs, namely, 4 entity kinds, 11 lanes, and 11 relationships. Each element in the vector represents the number of occurrences of a specific combination within that particular ASG. We trained a decision tree classifier that let us identify almost 95% (68 out of 72) of the failure classes with ASGs that were not seen in training.
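A sketch of such a classifier using scikit-learn follows; the ASG-to-vector feature extraction is elided, and the function name is ours, not from the artifact.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Each ASG is encoded as a 484-dimensional count vector over
# (entity kind, lane, relationship) combinations (4 * 11 * 11 = 484).
# Labels: 1 = failing test ASG, 0 = passing training ASG.
def train_failure_classifier(fail_vectors, pass_vectors):
    X = np.vstack([fail_vectors, pass_vectors])
    y = np.array([1] * len(fail_vectors) + [0] * len(pass_vectors))
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    # The learned predicates (cf. Fig. 5) can be read off the tree:
    print(export_text(tree))
    return tree
```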
We find that a few spatial semantics are associated with multiple of the testing failures studied. For example, as shown in Fig. 5 (right box), there are 4 conditions that need to be met to distinguish 8 failures from the passing inputs. More specifically, images where there are fewer than 3 cars on the opposing lane, at least one truck on the immediate right and near, a car on the ego lane to the left of the ego, and a car on Right Lane 2 in direct front of the ego were not observed during training and led to testing failures 8 times. In other words, when there are cars surrounding the ego vehicle in the on-going lanes and the opposing-lane traffic is not high, the system struggled. Fig. 6 (left) shows one of the testing images (annotated to ease interpretation) corresponding to that ASG class.
Identifying the differences between a failure-causing testing image and the training images, however, is often more subtle. On average, test failures required 11 predicates to be differentiated from the training inputs. For example, the predicate shown in Fig. 5 (left box) is the 11th, where the final classification is made. For this predicate, at least one car must be directly in front of the ego vehicle in its lane. Fig. 6 (right) shows an example of such a failure-inducing image, with 2 vehicles in front of the ego car.
These findings indicate not only that there exists a long-tail distribution of ASGs, but also that the difference between passing and failing may appear in just one of the hundreds of spatial features we explored. Thus, future augmentation techniques must strive to equip training datasets with such one-off scenarios.

Coverage of test specifications. The previous findings showed the presence of distinct scene compositions among most failures, pinpointing weaknesses in the training dataset's ability to capture diverse driving scenarios seen in deployment. Achieving coverage across all the possible scene combinations envisioned earlier can be difficult and costly. Instead, a developer may opt to strategically focus on smaller combinatorial spaces relevant to safety-critical scenes. We investigate this strategy by drafting five test specifications that include a precondition about the AV spatial semantics, and then analyze the extent to which a dataset covers those preconditions.

Findings. The specifications' preconditions, their coverage domain size $|cov(\phi)|$, and their corresponding coverage per town and for the union of all towns are shown in Table 2. $\phi_1$ gets single-digit coverage across all towns. This happens because the dataset was collected using the CARLA driver which, with its built-in conservative behavior, attempts to avoid approaching vehicles within 2m, resulting in a dataset without many near_coll and super_near relationships with other vehicles in the Ego Lane. $\phi_2$ gets 0% coverage for Town01 and Town02, which is expected due to the absence of a left lane in these towns. $\phi_3$ gets 0% coverage across all towns due to several factors: the lack of a left lane in Town01 and Town02, the left lane being infrequently empty, and the conservative nature of the CARLA built-in driver. As is, the dataset does not provide enough images to assess how the system would perform under the safety-critical conditions mentioned in $\phi_1$ and $\phi_3$. On the other hand, $\phi_4$ and $\phi_5$ highlight the influence of the road topology. Since Town01 and Town02 only have 2 lanes, they provide a coverage of 18.18% for $\phi_4$, while Town10HD with 5 lanes attains 45.45%, and Town04 with 11 lanes reaches 100% coverage. Similar findings apply to $\phi_5$.
As expected, the general trend indicates that more datasets (towns) lead to greater diversity of scenarios, resulting in increased coverage. However, it is clear that not all towns contribute equally to all preconditions. Furthermore, the coverage remains low for several preconditions, with a fundamental one, having a vehicle close ahead, entirely missing. The lack of coverage for critical preconditions highlights the significance of our approach, as it enables us to identify particular gaps in the datasets, which can then guide augmentation to address their deficiencies.

Threats to validity
The external validity of our findings is affected by our use of simulation to explore the application of S3C, which may not fully generalize to real-world data. The use of simulation allowed us to instantiate S3C with perfect information about the scenario, including zero-noise sensing and the ability to include occluded objects in the SGs. While SGG precision will increase as perception technology advances, no real-world system will be able to fully match the performance of the simulated SGG. Further, although CARLA is widely used in the AV testing and development space, the simulation-reality gap [21] remains a limitation in generalizing results to real-world performance. While we believe the demonstrated discrimination and interpretability results of S3C are encouraging, future work should investigate the ability to abstract and automate S3C using real-world data. We take the first step in that direction in § 6, exploring S3C on existing open datasets.
Further, although the AV steering model used to classify behaviors is similar to and followed similar training techniques as prior work, the model was selected, configured, and trained by the authors as part of the study. AVs whose failure-inducing inputs are less related to spatial semantics may render different results. Also, while the AV steering task has been extensively studied in prior literature [15,35,39,55], it does not capture the full driving task. We focus only on binary pass-fail outcomes based on a threshold steering angle error to control the scope of the analysis; future work should explore a more refined analysis of the continuous error space. Additionally, our model-based study only validates S3C's performance on single-instant inputs, e.g., a single camera image, which may not generalize to multi-frame inputs and system-level failures [16]. We provide an initial exploration into using S3C with multi-frame inputs in § 5.
The internal validity of our findings may be affected by the implementation of S3C, as it involves several complex internal and external components, including the integration with CARLA, the generation of the abstractions, and the clustering of the SGs to perform the tasks outlined in the evaluation; besides carrying out extensive validation, we share the code to mitigate this threat.
In terms of construct validity, although S3C exhibited interpretability through its direct connection to test specifications, we have not validated these interpretations in the hands of domain experts aiming to improve their datasets, refine their specifications, or investigate failures. Similarly, the discrimination findings for failure-inducing inputs are subject to the subtle and often noisy relationship between coverage and fault distribution. Additionally, exhibiting coverage may be valuable even for large coverage domains for which $cov(\phi)$ cannot be accurately estimated.

HANDLING INPUT SEQUENCES
As discussed in § 4.4, our initial exploration has focused only on testing single-instant camera image inputs, which may not generalize to multi-frame inputs; here we provide an initial exploration of extending S3C to that end. As described in § 4, the data used in the study originated from a series of 5-minute tests across the range of treatments. Within each 5-minute test, we obtained all of the relevant inputs at 5Hz. Although the evaluation in § 4 does not consider the ordering of the resulting data, we can leverage this information to provide an initial assessment of how analyzing sequential data affects S3C's performance. To do so, rather than clustering based on the scene graph at a single instant, we cluster based on the history of scene graphs over a desired window, i.e., for a window of 2 frames, we cluster based on the pair of scene graphs corresponding to the current frame and the previous frame. We describe this process in more detail below.

Computing multi-frame equivalence classes
Given a window of $w$ frames, we compute the sequence-based clustering $\widehat{ASG}_w$. Initially, $\widehat{ASG}$ is used to cluster the graphs as described in § 3.2.3; then each equivalence class is assigned a unique label. Next, we iterate over the frames in chronological order and, for each frame, take the most recent $w$ frames and build a tuple of length $w$ containing the equivalence class labels of the corresponding scene graphs. There may not exist $w$ previous frames, e.g., at the beginning of the test or if data points were filtered out in the middle of the test. In these cases, we fill in a label of UNKNOWN for all missing frames and consider all instances of UNKNOWN to be equivalent. Now, each scene graph has a $w$-tuple describing the equivalence classes of the scene graphs over the previous $w$ frames. We then re-cluster the scene graphs through the process described in § 3.2.3 using the $w$-tuple as the basis for equivalence, i.e., two scene graphs are equivalent if they are equivalent and their previous $w - 1$ graphs are also equivalent. Note that for $w = 1$ this is equivalent to the original clustering. The new clustering $\widehat{ASG}_w$ will have at least as many equivalence classes as $\widehat{ASG}$, but may have many more. The new clustering retains the interpretability of the scene graphs used to generate the tuple, though future work should explore the interpretability of these sequences of scene graphs.
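The sketch below captures this re-clustering at the level of class labels; for simplicity it pads missing history only at the start of a test, whereas the approach above also inserts UNKNOWN for frames filtered out mid-test.

```python
# Re-cluster frames on the tuple of the last w single-frame class labels.
UNKNOWN = object()  # all missing-history slots are considered equivalent

def window_labels(frame_labels, w):
    """frame_labels: chronological single-frame class labels for one test.
    Returns the w-tuple key for each frame, padding missing history."""
    keys = []
    for i in range(len(frame_labels)):
        window = tuple(frame_labels[max(0, i - w + 1): i + 1])
        pad = (UNKNOWN,) * (w - len(window))
        keys.append(pad + window)
    return keys

def recluster(frame_labels, w):
    """Map each distinct w-tuple to a new sequence-based class id;
    for w = 1 this reproduces the original clustering."""
    classes = {}
    return [classes.setdefault(k, len(classes))
            for k in window_labels(frame_labels, w)]
```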

Results
In this section, we explore the effect of the sequence-based approach on the equivalence classes generated under two of the SG-based abstractions and Ψ*10, for windows of length 2, 5, and 10 frames, chosen to represent a minimal sequence and sequences of 1 and 2 seconds in duration, respectively. This aims to understand the impact that increasing the amount of data considered in the sequence has on the PNFNC metric and the number of equivalence classes. We denote the technique using abstraction $a$ with a $w$-frame window as $a\langle w \rangle$.
In Fig. 7 we see that, for each of the abstractions, adding more information with the sequence-based approach improves performance. However, the rate of improvement differs between the abstractions: although starting from similar positions, ELR⟨10⟩ achieves a PNFNC of 62.3% in 18,591 equivalence classes compared to the 68.2% of Ψ*10⟨10⟩ in 31,481 equivalence classes. As such, adding timing information is proportionately much more beneficial for the SG-based abstraction than for PhysCov. We also note that the sequence information allows ELR⟨10⟩ to achieve a PNFNC similar to that of ERS with single-frame inputs: ELR⟨10⟩ achieves a PNFNC only 1 percentage point lower (62.3% vs 63.3%) while using 19.1% fewer equivalence classes (18,591 vs 22,987). Additionally, we observe that although the sequence information helps PhysCov increase the PNFNC metric, its performance relative to the random baseline decreases, and it experiences a far greater increase in the number of equivalence classes than either of the scene-graph based approaches. These results highlight the potential for improvement in S3C with additional consideration of sequential information and encourage future research to this end, to further enable S3C's robust application to systems that leverage such information.

EXPLORING S3C ON REAL DATASETS
We now explore the usage of S3C with an off-the-shelf SGG, RoadScene2Vec, that implements its own abstraction (described below) similar to ELR, on real-world datasets from the AV literature, to gauge the potential of the approach and the current challenges for its application in the wild. We selected five open-source datasets from the AV training and testing literature, shown in Table 3, of different sources, sizes, environments, and image resolutions.
As described in § 2, RoadScene2Vec is an SGG that uses a single RGB image to generate an SG that captures entities such as cars, trucks, and people, and spatial relationships such as left, right, near, and far, along with information about which lane an entity occupies. However, we observe that RoadScene2Vec exhibits two of the failure modes described in § 3.3. First, the lane abstraction is imprecise, as it always assumes a three-lane structure of 'Right', 'Middle', and 'Left', which may not match real-world data; as seen in Fig. 1, three lanes are identified for a four-lane road and car_4 is marked as being in the left lane. This limitation lessens the interpretability of the image. Second, RoadScene2Vec encounters perception failures leading to inaccurate SGs. We performed a cursory examination comparing one of the authors' image perception to that of RoadScene2Vec under its default configuration on 15 images. We found that 9/15 (60%) images exactly matched, 1 had one error, and the remaining 5 (33%) contained multiple errors. Through this process we identified that a key challenge in applying S3C on real-world data is reducing the types of failures described in § 3.3 by refining the abstractions deployed and improving the perception capabilities of the underlying components. We note that the reported performance was under RoadScene2Vec's default configuration, yielding a lower bound on performance. Better parameter tuning may eliminate some perception failures, while reducing the effects of the imprecise abstraction requires more fundamental changes.

Notwithstanding the potential weaknesses of RoadScene2Vec, we explore the scene graph based equivalence classes it generated for the datasets. The left portion of Table 3 shows the number of classes per dataset and for their union. We note several interesting trends in this data. First, the number of equivalence classes does not increase proportionally to the number of images. The CommaAi dataset, although more than 10 times larger than any other dataset, only had 3 to 4 times as many equivalence classes. This highlights the diminishing returns of collecting additional data without regard for what data is being collected: without attempting to sample from the long tail distribution, the collection will cause coverage to quickly saturate. On the other extreme, smaller but carefully crafted datasets like NuScenes and Cityscapes have a low ratio of dataset size to number of equivalence classes, meaning they capture distinct portions of the coverage domain more efficiently. Second, the average number of images per equivalence class is negatively correlated with the resolution of the images in the dataset. We hypothesize that lower resolution limits RoadScene2Vec's ability to perceive entities in these images, leading to less rich abstractions.
The right side of Table 3 further reports on the interpretability of the abstraction in terms of the entities per image (a similar analysis could be performed at the relationship level, or at the combined entity and relationship level). We observe that NuScenes and Cityscapes contain many more people and buses than the other datasets. This is expected since they are from urban settings, unlike the others, but we also see that Cityscapes has substantially more people and bicycles than NuScenes, indicating that Cityscapes was sampled from an area with higher pedestrian traffic. This highlights the ability of S3C to automatically provide interpretability in comparing coverage across datasets, independent of an AV.
Last, we discuss the efficiency of the S3C implementation in processing these datasets, reported in Table 4. The SGG generation time is proportional to the number of images and their resolution, while the abstraction and clustering time is proportional to the number of scene graphs generated by the SGG. For the largest dataset, CommaAi, with over half a million low-resolution images, generating the scene graphs took over 4 days; still, scene graph generation, abstraction, and clustering can be performed in under a second per image. For the datasets with the largest resolution, the time per image increases to up to 5.2 seconds, which is only one order of magnitude slower than the time to collect the image and will likely decrease with advances in the state of the art in perception. Overall, the findings indicate that even an unoptimized implementation of S3C can scale to existing datasets.

CONCLUSION
We have introduced S3C, the first specialized framework for the AV domain that takes a sensor input dataset and computes its coverage as per the spatial scene semantics. The framework is unique in how it leverages multiple levels of abstraction to automatically transform raw sensor inputs into scene graphs and then into equivalence classes that are interpretable to AV developers and that can discriminate among distinct behaviors. To assess S3C, we conducted an experiment with an AV in simulation that showed that the equivalence classes generated by the approach can (1) discriminate between input images used in training versus those that result in failures during testing, (2) provide interpretable explanations of the subtle differences between those inputs, and (3) quantify the extent to which a dataset provides specification coverage. We also explored applying S3C to sequences of inputs, finding that it can enable simpler abstractions to match the performance of more sophisticated ones. Finally, instantiating S3C using an off-the-shelf scene graph generator on various open datasets illustrated its potential for use in practice, pinpointed the challenges that may arise, and identified mechanisms to solve them.

Figure 1: An input (left) [45], its bounding boxes (middle-left), a scene graph (middle-right) that captures an abstraction of the scene that is sufficient to evaluate the preconditions of test specifications TS1 and TS2 on their sliced subgraphs (right).

Figure 3: Distribution of images across scene graph equivalence classes for the abstraction studied in § 4.2.

Figure 4: Percentage of novel test failures in an equivalence class not covered in training (PNFNC) versus count of equivalence classes under different abstractions. A high percentage and a low total class count are better.

Figure 5: Decision tree predicates over ASG features (excerpt): Car on Opposing Lane 0 to the right of the Ego vehicle; Truck on Right 1 of the Ego Lane near the Ego vehicle; Car on the Ego Lane to the left of the Ego vehicle; Car on Right 2 of the Ego Lane in direct front of the Ego vehicle.

Figure 7: PNFNC versus count of equivalence classes for the SG-based abstractions and Ψ*10 with 2, 5, and 10 frame windows. A high percentage and a low total class count are better.

Table 1: Scene graph abstractions studied. $|D|$ = 46,006.

Table 2: Testing specifications' preconditions and their coverage. One precondition: there is a truck that is either inDFrontOf or inSFrontOf the ego and at distance near_coll, super_near, very_near, near, or visible. $\phi_3$: the closest lane to the left of the ego is empty and there is a car or truck that is either inDFrontOf or inSFrontOf the ego and at distance near_coll or super_near.

Table 3: Equivalence classes generated using RoadScene2Vec (left) and entity statistics for open AV datasets (right).

Table 4: Computation time for each dataset.