Perception Contracts for Safety of ML-Enabled Systems

We introduce a novel notion of perception contracts to reason about the safety of controllers that interact with an environment using neural perception. Perception contracts capture errors in ground-truth estimations that preserve invariants when systems act upon them. We develop a theory of perception contracts and design symbolic learning algorithms for synthesizing them from a finite set of images. We implement our algorithms and evaluate synthesized perception contracts for two realistic vision-based control systems, a lane tracking system for an electric vehicle and an agricultural robot that follows crop rows. Our evaluation shows that our approach is effective in synthesizing perception contracts and generalizes well when evaluated over test images obtained during runtime monitoring of the systems.


INTRODUCTION
Deep learning has become the de facto method for solving perception tasks such as object detection, object tracking, and semantic segmentation. Deep Neural Network (DNN) functions are poised to be deployed en masse in safety-critical autonomous systems. Vision-based lane tracking in driver assist systems, vision-based landing protocols in urban aerial vehicles, and indoor robotic navigation with LIDAR are prime examples. To make progress towards reasoning about the correctness of the decision-making programs in such systems, one has to gain some level of understanding of the correctness of the DNN-implemented perception. Fragilities of DNNs are well known: they can fail to detect objects (e.g., pedestrians) and fail to recognize salient aspects of the environment (e.g., lanes from lane lines, and hence the position of the ego vehicle in the environment). Since the precise set of raster images that cameras can capture for each groundtruth setting is impossible to characterize formally, formal verification of such systems is hard.
There have been, nevertheless, several novel attempts to characterize desirable characteristics of effective neural perception. The notions of local robustness and global robustness demand that the neural perception on training images (respectively, all images) evaluate to the same or similar perception when images are perturbed mildly with respect to a norm. In recent years, several advances in automatic verification of local robustness for neural networks have emerged [Bonaert et al. 2021; Gehr et al. 2018; Singh et al. 2019; Tjeng et al. 2019], including tool competitions [Bak et al. 2021b], and some approaches for global robustness have also emerged [Katz et al. 2017; Leino et al. 2021]. Robustness guarantees, to some extent, rule out adversarial examples to neural networks. Another approach that has emerged is that of using image generators: given an image generator that can generate images from groundtruth configurations of objects and probability distributions, we can test neural-perception-based controllers (such as autonomous vehicle controllers) against them, and use techniques such as fuzz testing to find scenarios where they fail [Dreossi et al. 2019; Fremont et al. 2022; Katz et al. 2021].
We posit that probing the perceptual role of DNNs in autonomous systems could unlock new opportunities for systematic analysis of the safety of decision modules (called reactive modules throughout the rest of the paper). Neural perception pipelines are complex, but at the end of the day, they give state estimates. They provide estimates of certain environmental variables that are important for the reactive module. In the lane-keeping example, the vision pipeline provides an estimate of the relative position of the car with respect to the lanes and the heading angle. Thus, capturing how far the estimated state output by neural perception deviates from the true state (groundtruth state) serves as a natural specification or contract for the neural perception.
The primary contribution of this paper is the development of a technique based on perception contracts. Perception contracts are formal contracts for the perception module that describe the errors that the perception module can make in estimating groundtruth while keeping the global system safe. Perception contracts hence negotiate the agreement between the learned components for perception, and the programmed components of the system and the environment. This splits the analysis of the system into its modular components. In particular, in order to prove that a programmed system (which we call a reactive module in this paper) with a learned component for neural perception is safe under an environment, we need to ensure (a) that the ML component for perception meets the perception contract, and (b) that, assuming the perception contract is met by the perception module, the system stays safe under its environment. The second task can be performed using traditional techniques in formal verification of cyberphysical systems using automated logic engines. The first task cannot be formally proven, as there is no formal specification of which images correspond to which groundtruth. However, we can use extensive testing, on images gathered in the field as well as synthetically generated images that test for borderline or rare events, to check whether the perception module meets its contract.
The salient feature of perception contracts is that they are closely tied to, and dependent on, the rest of the system and the environment. The key idea is to measure neural perception errors, that is, the deviation of neural perception's estimation of groundtruth from the actual groundtruth. This deviation is measured not in absolute terms but in terms of its effect on preserving invariants that prove the safety of the system. More precisely, we deem a groundtruth estimate by neural perception acceptable if the system's action based on this estimate, though it may differ from groundtruth, preserves the invariant designed for the system. A perception contract is hence a contract on groundtruth and groundtruth estimates that ensures that the system maintains its invariant.
The first contribution of this paper is the theory of perception contracts. We propose building safe systems that interact with neural perception using two distinct system-environment couplings that are tied together using perception contracts. First, we have the reactive module interacting with a model of the environment using perfect perception of groundtruth, which is proven to be safe. Second, we consider the same reactive module working with the real environment (which cannot even be modeled, of course), where sensors compute an impression of the environment state (say camera images). A neural perception module processes this impression to estimate groundtruth, which is then used by the reactive module.
Proc. ACM Program. Lang., Vol. 7, No. OOPSLA2, Article 299. Publication date: October 2023.
A perception contract relates the above two instantiations of the reactive module: one working with the environment model under perfect perception, and the other working with the real environment using sensors and neural perception (see Figure 2). The key property of a perception contract is that it maintains the invariant Inv that was used to prove the first system safe. More precisely, a perception contract is a symbolic contract expressed in logic that captures a set of pairs of groundtruth and groundtruth estimates such that, in any configuration satisfying the invariant Inv, the system acting on the estimate nevertheless preserves the invariant (even though the invariant itself typically refers to the actual groundtruth).
We develop the theory of perception contracts, developing a framework and formally proving that the system working with the real environment through sensors and neural perception is guaranteed to be safe as long as the environment model can simulate the environment and the real environment with sensors meets the perception contract.
Since machine-learned components are not designed by humans, their perception contracts, which specify the errors they make in various regions while maintaining the invariant, are complex to determine. Errors in perception that maintain an invariant also depend on the state of the system. Consider an invariant proving that a vehicle stays within the lane boundaries in a lane tracking system. This invariant naturally allows for more error in perception when the vehicle is in the middle of the lane and its steering angle is aligned with the road, while allowing less error when it is at the edge of the lane or its steering angle is misaligned with the road. This leads us to the second main contribution of this paper: formulating and solving the synthesis problem for perception contracts. We are given a reactive module working with an environment model that has been proved to be safe using an invariant Inv, and we are given a neural perception module. We are also given a finite set of impressions of the sensors (like a set of images) along with the corresponding groundtruth values. We assume that the groundtruth estimates computed by neural perception on these impressions preserve the invariant (in practice, we can in fact verify this). The problem then is to find a perception contract in a given logic L (which, being a perception contract, preserves the invariant of the system) that includes all the groundtruth-estimate pairs computed by the neural perception module. In other words, we want a contract in L that includes the groundtruth and estimation pairs exhibited by the neural perception on the given images, and maintains the invariant.
We develop a technique for synthesizing symbolic perception contracts using counterexample-guided synthesis [Alur et al. 2015]. A decision-tree learner synthesizes quantifier-free logic formulae (allowing Boolean combinations), working with a verification engine that checks whether proposed contracts indeed preserve the invariant and returns counterexamples otherwise.
The synthesis of perception contracts in specific regions of the groundtruth configuration space where the reactive module and the real environment seem to be working safely gives a conjectured formal specification of neural perception that explains why the system stays safe even with neural perception. The adherence of the perception module to this contract cannot be formally proven, of course, as the real environment and sensors cannot be modeled mathematically in an effective way. However, extensive runtime monitoring in specialized settings where groundtruth can be determined (say environments with special sensors to detect the position and heading angle of a vehicle), and checking such images against the contract, gives greater confidence in the safety of the system. The perception contract also gives a formal specification for downstream attack techniques, for example, techniques that may try to generate an image, using an image generator, on which the neural perception violates the perception contract.
One particular utility of perception contracts is in runtime monitoring of a system in an environment where groundtruth can be assessed. Rather than just monitoring the system for safety or even the invariant, we can monitor the perception module against the perception contract. Since a perception contract is a local specification for the perception module, monitoring adherence to it is stronger, as the contract ensures safety of the system even for other valuations of variables in the system. For instance, consider an autonomous vehicle monitored when driving on a dry road; monitoring against a perception contract will check for the safety of the vehicle in other states as well, such as at different speeds and wetness conditions of the road. We articulate this subtle issue in the paper, aided with examples in the evaluation section (Section 3.2 IV and Section 6.4).
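As a rough illustration of monitoring a perception module against a contract, the following minimal sketch checks logged (groundtruth, estimate) pairs against a contract predicate. The one-dimensional lateral offset, the `toy_contract` predicate, and its tolerance formula are hypothetical choices made for this example only; they are not the contracts synthesized in the paper.

```python
def monitor(contract, samples):
    """Return indices of samples whose (groundtruth, estimate) pair violates the contract."""
    return [i for i, (gt, gte) in enumerate(samples) if not contract(gt, gte)]

def toy_contract(gt, gte):
    """Hypothetical contract: the tolerated error shrinks as the vehicle nears the lane edge."""
    tolerance = 0.5 * (1.0 - abs(gt)) + 0.1
    return abs(gte - gt) <= tolerance

# Logged pairs (groundtruth lateral offset, estimated lateral offset); values are made up.
log = [(0.0, 0.4), (0.9, 0.5), (0.5, 0.6)]
violations = monitor(toy_contract, log)   # only the second sample exceeds its tolerance
```

A monitor of this shape flags perception errors directly, even on runs where the closed-loop system never left its safe region.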
The final contribution of this paper is an implementation and evaluation of our techniques for synthesizing perception contracts using two autonomous systems: a GEM vehicle for lane tracking and an agricultural robot that navigates in aisles between crop lines, both using DNN-based perception (LaneNet [Neven et al. 2018] and CropFollow [Sivakumar et al. 2021]). Our evaluations show that our approach is extremely effective in synthesizing perception contracts. We also evaluate how well these synthesized contracts generalize by evaluating them against test sets obtained using runtime monitoring of these vehicles in closed loop, and show that the contracts have high precision in capturing safe perception.

PRELIMINARIES: SAFETY OF SYSTEMS INTERACTING WITH ENVIRONMENTS USING PERFECT PERCEPTION
In this paper, we advocate a design approach that starts by designing a system that satisfies its safety properties when interacting with environments under perfect perception, before analyzing the system with sensors and neural-network-based perception. In this setting, we assume that the system perceives aspects of the environment (called percepts) precisely; we do not yet model sensors, the output of sensors, or perception algorithms based on machine learning and other computation (these are introduced in the next section). We hence start by first modeling reactive modules interacting with environments with perfect perception. In such cyberphysical systems, traditionally, we consider the system interacting with a model of the environment, and prove that the system and the model of the environment with perfect perception satisfy certain desirable safety properties. Such a proof ensures that the system is safe with respect to any environment that can be simulated by the model of the environment. We formalize these notions in this section (see Figure 1, in particular the solid path showing an environment model interacting with the system, called the reactive module, directly, without sensors and machine-learned perception components). The notations that we use in this section and the next section are depicted in a table.
We have two modules: the environment and the reactive module, which models the system. In our driving example, the reactive module is the control software of the car, and the environment includes the car's hardware platform, the road, as well as the surrounding environment that influences the behavior of the car. Let us fix a set of percept variables that captures some attributes of reality in the environment which are used by the reactive module. In an autonomous driving setting, for example, this set of percepts would be variables that give the position of the car in some fixed coordinate system, e.g., the ego vehicle's coordinates, the lay of the road, the position and speed of pedestrians and vehicles nearby, etc.
Let us also fix a set of control or feedback variables that captures the feedback the reactive module gives to effect changes to the environment. Again, in the autonomous vehicle example, this feedback can be variables for control of brakes, acceleration, and steering angle of the vehicle.
The system and environment communicate with each other in discrete time steps or rounds. In each round the environment updates its state based on the current feedback. The percepts are fed to the system, which in turn computes both an update to its state and feedback to the environment. In the autonomous vehicle example, note that vehicle dynamics are part of the environment, and the feedback from the system is used to affect only certain variables, such as an update to the steering angle.
We model both the system and the environment using a set of variables over arbitrary domains, with the state space being the valuations of these variables, a set of initial states, and a transition relation describing (potentially nondeterministic) changes to these variables. For any set of variables X, let Val(X) denote the set of all possible valuations of X (over the respective domains). Also, for any set of variables X, let X′ denote a fresh set of primed variables corresponding to the variables in X, i.e., X′ = { x′ | x ∈ X }.
Let us fix an environment whose set of variables consists of a set of latent variables together with the percept variables. Let us assume an initial state predicate Init over the latent and percept variables, giving a set of initial valuations for them, and a transition relation over the current latent and percept valuations, the feedback, and the next (primed) latent and percept valuations.

The set of states of the environment is the set of valuations of its latent and percept variables, and we will denote particular states as a pair of a latent valuation and a percept valuation. A transition from one such state to another under feedback fb denotes that the environment, when in the first state and reading feedback fb, can transition to the second state and give the system the percept of that second state.
Let the reactive module have a state space defined by a set of state variables and an initial state predicate Init over the state and feedback variables, giving possible initial states and initial feedback. The set of states of the reactive module is the set of valuations of its state variables. Its transition relation relates a current state, a percept, a next state, and a feedback: a transition with feedback fb means that the system, when in the current state and reading the percept, can transition to the next state and give the feedback fb to the environment.
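The round-based interaction described above can be sketched as follows. This is only a toy, deterministic instance under assumed names and dynamics (a one-dimensional lateral offset `d` as the percept, a steering correction `u` as the feedback, and a round counter as the system state); the paper's model allows arbitrary domains and nondeterministic transition relations.

```python
from typing import List, Tuple

def env_step(d: float, u: float) -> float:
    """Environment transition (assumed dynamics): apply feedback u to the offset d."""
    return d + u

def system_step(state: int, d: float) -> Tuple[int, float]:
    """Reactive-module transition: count rounds and steer halfway back to the center."""
    return state + 1, -0.5 * d

def run(d0: float, rounds: int) -> List[float]:
    """Alternate environment and system moves under perfect perception."""
    d, state, u = d0, 0, 0.0
    trace = [d]
    for _ in range(rounds):
        d = env_step(d, u)                 # environment reads the current feedback
        state, u = system_step(state, d)   # system reads the percept, emits feedback
        trace.append(d)
    return trace

trace = run(1.0, 3)   # offsets: 1.0, 1.0, 0.5, 0.25
```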

Safety under Perfect Perception
Real-world environments are incredibly complex to model and reason about. The classical approach of proving safety of systems that interact with the environment proceeds in two steps: (a) develop a simpler model of the environment that simulates all possible behaviors of the real-world environment (and potentially more), and (b) prove the safety of the system with respect to this model of the environment. The completion of this proof shows that the system interacting with any environment that can be simulated by the model of the environment will be safe.
We formally outline the invariant-based technique for proving systems safe against (a model of the) environment.
A model of the environment has precisely the same structure as the environments described above. It has its own set of latent variables (latent variables of the model), an initial state predicate, and a transition relation, as described above. Typically, the set of latent variables and the transitions of the model are much simpler than those of real-world environments, and can be seen as assumptions made about the real world. In the automated vehicle context, for example, vehicle dynamics may model nondeterministic skidding of a vehicle up to some degree (without modeling the wetness of the road precisely), making formal reasoning much easier.
Safety Properties. A safety property is a predicate Safe over the percepts and the state of the system. Note that the safety property is independent of the latent variables of the environment, and consequently, we can determine safety of systems with respect to different environments (e.g., the real-world environment as well as a model of it).
Turning to the autonomous vehicle example, a safety property may demand that the vehicle remains well within the left and right flanks of the road. Another desirable property (related to stability) is to demand that in any two consecutive configurations, if the first is reasonably far from the center of the lane, then the second is closer to the center of the lane than the first. Though this property involves two consecutive configurations, it can be modeled as a safety property. We can have the system store in its state the vehicle's previous configuration (say using variables added to the system's state variables) and have the safety property demand the appropriate relation between the previous configuration and the current configuration.
Verification of Safety against Environment Models. Let us fix a system and an environment model. The set of reachable configurations of the system and environment is defined as usual: it is the least set that includes the initial configurations and is closed with respect to the transitions over configurations (as defined above). Let us denote this set as Reach. The system and environment are said to satisfy the safety property Safe if Safe holds, on the percept and system-state components, for every configuration in Reach.
An invariant is a predicate Inv over global configurations that defines a superset of Reach. Such an invariant is said to prove the safety property Safe if Safe holds for every configuration satisfying Inv.
It is in general hard to check whether a given predicate Inv is an invariant of the system. The typical approach to ensuring that predicates are invariants is to use a stronger notion of invariant that is easier to verify. One approach is to use inductive invariants. An inductive invariant is a predicate Inv that satisfies the following properties: it includes all the initial configurations, and for every configuration c satisfying Inv and every configuration c′ that c can transition to, Inv(c′) also holds. It is easy to see (by induction on the length of executions) that such a predicate is always an invariant.
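The two conditions of an inductive invariant can be checked by brute force when the configuration space is finite. The sketch below does so for a hypothetical toy system (integer lateral offsets that move one unit toward zero per step); the names `universe`, `successors`, `inv`, and `safe` are illustrative assumptions, and real cyberphysical models would need a symbolic verification engine instead of enumeration.

```python
def is_inductive(inv, init_states, successors, universe):
    """Check initiation and consecution by enumerating a finite configuration space."""
    if not all(inv(c) for c in init_states):
        return False                                   # some initial configuration violates inv
    return all(inv(c2) for c in universe if inv(c) for c2 in successors(c))

# Hypothetical toy system: offsets in -5..5, each step moves one unit toward 0.
universe = range(-5, 6)
init_states = [-2, 0, 2]

def successors(c):
    if c > 0:
        return [c - 1]
    if c < 0:
        return [c + 1]
    return [0]

inv = lambda c: abs(c) <= 3          # candidate invariant
safe = lambda c: abs(c) <= 4         # safety property implied by the invariant
not_inductive = lambda c: c in (-2, 0, 2)   # holds initially, but 2 steps to 1

ok = is_inductive(inv, init_states, successors, universe)
proves_safety = all(safe(c) for c in universe if inv(c))
```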
In this paper, we will not focus on methods to synthesize or prove safety for systems that work with perfect perception, as these are well-studied problems in cyberphysical system verification (see, for example, [Alur 2015; Astrom and Murray 2008; Mitra 2021]). However, we assume some invariant Inv that establishes safety (or stability) of the system working with an environment model, i.e., we assume that as long as a system and environment model work together in such a way that the reachable state space stays within the invariant, the safety/stability property is guaranteed to hold.
We assume that for desired safety properties of the system and a model of the environment, interacting with perfect perception, there is an invariant Inv( , , , , ) that establishes safety of the system and environment.
Safety under Environments Simulated by an Environment Model. Now let us turn to proving the safety of systems interacting with an environment using perfect perception, using the fact that the system is safe under an environment model using perfect perception.
Consider an environment, over its latent and percept variables, interacting with a system. Assume that the system interacting with a model of the environment (over the model's latent variables and the same percept variables) satisfies a safety condition Safe, established using an invariant Inv. Now under conditions that relate the environment and the model of the environment, namely a simulation relation, we can argue that the system working with the environment will continue to be safe. Formally, we say that the model of the environment simulates the environment if there is a simulation relation ∼ between the states of the environment and the states of the environment model, i.e., between valuations of the environment's variables and valuations of the model's variables, such that the following hold:
• For every initial state of the environment, there is an initial state of the model of the environment that it is related to, i.e., whenever the initial state predicate of the environment holds of a state, there is some latent valuation of the model that, together with the same percept valuation, satisfies the initial state predicate of the model and is related by ∼ to that state.
The first condition says that related states of the environment and the environment model must share the same perception valuations. The second demands that every initial state of the environment is related to some initial state of the environment model. And the third demands that from any pair of states that are similar, the environment model should be able to simulate every move of the environment, and reach similar states.
Theorem 2.1. Let a reactive module (system) with the environment model satisfy an invariant Inv, which in turn proves a property Safe, under perfect perception. Let the environment be one that is simulated by the environment model. Then the environment working with the system is guaranteed to preserve the property Safe.
Proof. Let ∼ be a simulation relation between the environment and the environment model. We can show by induction on n that for every configuration reached by the system working with the environment in n steps, there is a reachable configuration of the system working with the environment model, with the same percept, feedback, and system state, whose environment-model state is related by ∼ to the environment state. The base case is easy, and the induction step involving a system move as the last move is trivial. For the induction step involving an environment move as the last move, the fact that the simulation relation guarantees a move of the environment model that simulates the move of the environment ensures the property. □
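The three simulation conditions described in words before Theorem 2.1 can be written out as below. The notation is an assumption introduced here for readability and may differ from the paper's own symbols: $E$ is the environment, $M$ the environment model, $l$ and $m$ their latent valuations, $p$ and $q$ percept valuations, $\mathit{fb}$ a feedback valuation, and $T_E$, $T_M$ the transition relations.

```latex
% Assumed notation (not taken from the source text): see the lead-in above.
\begin{align*}
&\text{(1) shared percepts:}
  && (l,p) \sim (m,q) \;\Rightarrow\; p = q \\
&\text{(2) initial states:}
  && \mathit{Init}_E(l,p) \;\Rightarrow\; \exists m.\ \mathit{Init}_M(m,p) \wedge (l,p) \sim (m,p) \\
&\text{(3) transitions:}
  && (l,p) \sim (m,p) \wedge T_E(l,p,\mathit{fb},l',p')
     \;\Rightarrow\; \exists m'.\ T_M(m,p,\mathit{fb},m',p') \wedge (l',p') \sim (m',p')
\end{align*}
```

Condition (3) is the one used in the environment-move case of the induction in the proof of Theorem 2.1.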

SYSTEMS WITH NEURAL PERCEPTION AND PERCEPTION CONTRACTS
We now define systems that interact with environments with "neural perception": perception is no longer precise; instead, approximate perception is realized from impressions of observing the environment, using machine-learned components and possibly other forms of computation.
We introduce two new components, a sensor and a neural perception module (see Figure 2). A sensor computes an impression from environment states. Impressions include many kinds of data produced by sensors: images taken by cameras, sound recordings of the environment, LIDAR readings, sensor readings of vehicles, etc. Formally, we fix a set of impression variables, and sensors are modeled as functions from environment states (valuations of the latent and percept variables) to impressions (valuations of the impression variables).
A neural perception (NP) module is a module that processes impressions and outputs perceptions. Neural perception modules typically consist of machine-learned models (e.g., for vision, processing sound, etc.) as well as programmed components (e.g., geometric algorithms that complete detected line segments to lines, calculate the middle of lanes from flanking lines, etc.). Formally, neural perception modules are modeled as functions from valuations of the impression variables to valuations of the percept variables.
The global behavior of the system with the environment and with neural perception is defined over the same set of configurations as in perfect perception. The transitions are of two kinds: the first kind, for environment moves, is precisely as earlier for perfect perception, while the second, for system moves, utilizes the sensor and neural perception. Note that the system moves on the perceived perception it gets, delivered through the sensor and neural perception module.
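A system move under neural perception can be sketched as the composition of the sensor, the perception function, and the reactive module's transition. Everything below is a hypothetical stand-in: the `glare` latent variable, the dictionary-shaped impression, and the biased `neural_perception` function are assumptions chosen to show how an estimate can differ from the groundtruth percept, not a model of any real DNN pipeline.

```python
def sensor(latent, percept):
    """Hypothetical sensor: maps an environment state to an impression."""
    return {"pixel_offset": percept * 100.0, "glare": latent}

def neural_perception(impression):
    """Stand-in for a perception pipeline: recovers the offset with a glare-induced bias."""
    return impression["pixel_offset"] / 100.0 + 0.25 * impression["glare"]

def system_step(state, percept_estimate):
    """Reactive-module transition: count rounds and steer halfway back to the center."""
    return state + 1, -0.5 * percept_estimate

def system_move_with_np(state, latent, percept):
    """The system never sees the groundtruth percept, only the estimate."""
    estimate = neural_perception(sensor(latent, percept))
    return system_step(state, estimate)

next_state, feedback = system_move_with_np(0, latent=1.0, percept=0.5)
# groundtruth offset is 0.5, the estimate is 0.75, so the feedback is -0.375
```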
Safety with respect to a predicate Safe is defined as usual.
Imperfect Perception but Action That Maintains Invariants. Let us fix a proposed invariant Inv of the system. The key idea of the contracts we define in this paper is to maintain the invariant inductively under imperfect perception: in any configuration satisfying the invariant, the reactive module, though it acts on the estimated perception, gives feedback that results in a configuration satisfying the invariant.
The formalization of the above introduces certain subtleties. First, consider a predicate Inv that has been proved to be an inductive invariant when the system interacts with an environment model. Let a ground-truth estimate be given; this is any valuation of the perception variables, not necessarily the one given by the composition of sensors and neural perception. Let c be any configuration of the system working with the environment model where it is the system's turn to move, and let c satisfy the invariant Inv.
Then we say the ground-truth estimate preserves the invariant at c if every configuration c′ that results when the system takes a transition from c reading the estimate (in place of the groundtruth percept), updating only the system state and the feedback, satisfies Inv. The above definition declares the estimate to preserve the invariant at the configuration c if the system module, reacting to the ground-truth estimate (rather than the groundtruth), results in configurations c′ that respect the invariant. Intuitively, the estimate, despite not being the precise groundtruth, is tolerable by the system as it keeps the invariant.
It is easy to see that the system will preserve the invariant as long as the neural perception gives, at every reachable configuration, a groundtruth estimate that preserves the invariant at that configuration.
Let Preserve_Inv(c) be the set of all ground-truth estimates that preserve the invariant at the configuration c. Our goal now is to find simple perception contracts that imply that groundtruth estimates preserve the invariant.
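To make the notion of an estimate preserving the invariant concrete, here is a minimal sketch on an assumed toy lane model. A configuration is reduced to a groundtruth offset `d` and a pending feedback `u`; the invariant and the controller are hypothetical choices made so that the check is a one-liner. The example also shows the state dependence noted in the introduction: the same estimation error is tolerated near the lane center but not near the edge.

```python
def inv(d, u):
    """Assumed invariant: the offset, and the offset after applying u, stay within the lane."""
    return abs(d) <= 1.0 and abs(d + u) <= 1.0

def system_feedback(d_hat):
    """Assumed controller: steer halfway back based on the estimated offset."""
    return -0.5 * d_hat

def preserves_invariant(d, u, d_hat):
    """True if, at a configuration satisfying inv, acting on d_hat leads to one satisfying inv."""
    assert inv(d, u), "the definition applies only at configurations satisfying the invariant"
    return inv(d, system_feedback(d_hat))

# Same estimation error of -1.4 at the lane center and near the lane edge.
center_ok = preserves_invariant(0.0, 0.0, -1.4)    # new feedback 0.7, offset stays inside
edge_ok = preserves_invariant(0.9, -0.45, -0.5)    # new feedback 0.25 pushes past the edge
```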
Perception Contracts. We are now ready to define the primary contribution of this paper: perception contracts.
Fix a system interacting with a model of the environment, and an invariant Inv for the system interacting with this model under perfect perception. A perception contract with respect to the invariant Inv is a predicate over pairs of a groundtruth and a ground-truth estimate, expressed in a given logic L, such that for every configuration c satisfying Inv and every ground-truth estimate, if the predicate holds of the groundtruth percept of c and the estimate, then the estimate preserves the invariant at c, i.e., it belongs to Preserve_Inv(c). Intuitively, a perception contract is a condition relating groundtruths and ground-truth estimates that ensures the estimate is safe with respect to maintaining the invariant.
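The definition above can be restated compactly as follows. The symbols are assumptions introduced for this restatement and may differ from the paper's notation: $c$ ranges over configurations in which it is the system's turn to move, $p(c)$ is the groundtruth percept component of $c$, $\hat{p}$ is a ground-truth estimate, and $\mathit{PC}$ is the contract predicate.

```latex
% Assumed notation (not taken from the source text): see the lead-in above.
\begin{align*}
\mathit{Preserve}_{\mathit{Inv}}(c)
  &= \{\, \hat{p} \;\mid\; \mathit{Inv}(c') \text{ for every } c' \text{ reached from } c
      \text{ by a system move that reads } \hat{p} \,\} \\[4pt]
\mathit{PC} \text{ is a perception contract w.r.t. } \mathit{Inv}
  &\iff \forall c,\ \hat{p}.\;\;
      \mathit{Inv}(c) \wedge \mathit{PC}(p(c), \hat{p})
      \;\Rightarrow\; \hat{p} \in \mathit{Preserve}_{\mathit{Inv}}(c)
\end{align*}
```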
Consider an environment with a sensor and a neural perception module. We say that the environment, sensor, and neural perception together satisfy a perception contract if, in any reachable configuration, the neural perception working on the output of the sensor reading an environment state computes a ground-truth estimate such that the contract holds of the groundtruth of that state and the estimate. We can now prove the main theorem associated with perception contracts.
Theorem 3.1. Let a system interact with an environment through a sensor and a neural perception module. Let the environment model be a model that simulates the environment. Let Inv be an invariant of the system interacting with the model of the environment with perfect perception that ensures safety with respect to a predicate Safe. Let a perception contract with respect to Inv be given, and assume that the environment, sensor, and neural perception satisfy it. Then the system working with the environment, under the sensor and neural perception, ensures safety with respect to Safe.

Proof (gist). The proof is by induction on the number of steps taken by the environment, showing that for any reachable environment state with a given perception groundtruth, there is a similar state in the environment model with the same perception that satisfies the invariant Inv. Notice that this proves that the system working with the environment through neural perception is safe.
The base case is trivial. The induction step is as follows. Consider a state of the environment and a similar state of the environment model, both with the same perception valuation, where the model state together with the system state satisfies the invariant Inv. Now, since the environment satisfies the perception contract, the groundtruth estimate that the neural perception computes from the signal generated from the environment state must produce a feedback and system state update that satisfy the invariant Inv. Furthermore, the next state reached by the environment will be similar to a successor of the model state (since the model simulates the environment). Consequently, the invariant will be preserved in the next state as well, completing the proof. □

Perception Contracts: Salient Features, Extensions, and the Synthesis Problem
There are several salient features of the notion of perception contracts that we want to highlight.
I. Perception Contracts Relate Different Systems. Perception contracts relate groundtruth perception values and groundtruth estimates given by the neural perception. Intuitively, we expect groundtruth estimates to be close to groundtruth. This error in perception is captured by the perception contract, symbolically. The error that is allowed is not fixed a priori (say using some bounded norm); rather, it is allowed to be any error that the system can tolerate while maintaining its invariant. Note that perception contracts are not like usual contracts, which often relate only pre- and post-states of modules. Rather, a perception contract ties together two distinct systems: (a) the system that interacts with a model of the environment under perfect perception, and (b) the system that works with an environment (that can be simulated by the model) under sensors and neural perception. Note that the neural perception module takes as input the impression from the signal and generates a groundtruth estimate, and has no access to the groundtruth perception that the contract mentions. One can think of the groundtruth perception variables as a variant of ghost variables in the verification literature that help capture the specification for the module.
II. Maximal Set of Groundtruth Estimates That Preserve Invariants. Let Γ be the set of pairs of a groundtruth and a groundtruth estimate such that, for every configuration satisfying the invariant Inv whose groundtruth perception is the first component of the pair, the estimate preserves the invariant at that configuration. Then any perception contract defines a set that is a subset of Γ. However, this maximal set of pairs Γ, being defined using universal quantification over an unbounded set of configurations, and with intricate non-linear semantics of how the system and environment module behave, is a complex set that may not even be expressible in a reasonable symbolic logic. In this paper, we seek perception contracts that assure the invariant but are also expressible in a simple logic L (for example, using Boolean combinations of linear constraints over the groundtruth and groundtruth estimates). See Section 4.2 for a formal description of the logic used in our approach.
III. Perception Contract Synthesis Problem. The above discussion motivates the central algorithmic problem that we tackle in this paper: synthesizing perception contracts.
Let us fix a finite set Σ of sensor outputs (these are, for example, a set of training images obtained by cameras observing a real environment). For each element of Σ, assume that we know the groundtruth perception gt. We can execute the neural perception module on each element of Σ to get the corresponding groundtruth estimate gte. Let us assume that these estimates indeed preserve the invariant of the system (note that we can verify this). We hence have a set of pairs (gt, gte) of groundtruth and groundtruth estimates that are exhibited by the environment and neural perception, and that preserve the invariant. It is natural to expect any perception contract for the network to include these points. Let us also fix a logic L to express perception contracts. The perception contract synthesis problem is the following: given a set of sensor outputs Σ associated with groundtruths, i.e., the set of pairs {(gt, gte)} as above, find a perception contract expressible in L that includes these pairs (and that, by definition, provably maintains the invariant Inv). We propose to solve the perception contract synthesis problem computationally.
Can Humans Write Perception Contracts? Given that a machine learning model (like ResNet) is trained on very large amounts of data, we believe that it would be hard for humans to predict the errors the ML perception module makes in each region in order to write a perception contract for it. Even for a given set of hundreds or thousands of images, we do not believe that humans can inspect the errors made by ML perception and write a perception contract from them. The contracts synthesized by our tool also do not suggest that humans could write them easily (even though they are interpretable by humans). For example, a contract synthesized by our tool for a particular region is the decision tree: If ( + + + >= 0.163) then ( + − >= 0.859 ∧ + + + < 0.415) else + < 0.132, where the gt (groundtruth) variables are the distance to the center line and the heading angle, and the gte (groundtruth estimate) variables are the estimated relative distance and the estimated relative heading angle.
Another possibility is for the user to write or mechanically derive the most general perception contract, independent of the ML module. Note, however, that a perception contract must be expressed in a symbolic logic and must maintain the complex invariant that the definition of a perception contract demands. The most general perception contract may not even be expressible in the logic, let alone be simple enough for humans to construct.
We hence posit that we require automatic computational methods to solve the perception contract synthesis problem.
IV. Utility of Perception Contracts. Assume we have a set Σ of training data along with groundtruth, and we synthesize from it a perception contract that provably maintains the invariant (and hence keeps the system safe). Of course, this does not mean that the system is safe under the environment (that kind of safety is impossible to prove, given that we cannot even formalize the environment and have only sampled some states). However, the contract gives a formal symbolic specification conjectured for the neural perception, and it keeps the system satisfying its safety property whenever the contract is obeyed. More precisely, if all reachable states of the environment satisfy the perception contract, then we are guaranteed that the system will be safe. Formally checking whether the environment satisfies the contract is never possible. Nevertheless, the fact that the training data adheres to the contract, together with further runtime monitoring that gathers data and checks it against the contract, gives confidence that the contract is satisfied and the system is safe. Notice, though, that runtime monitoring needs to take place in special environments where groundtruth can be measured or inferred (for example, cars in environments with enough position sensors to observe the actual groundtruth).
In fact, one utility of perception contracts is that they can be used for runtime monitoring. As opposed to naive runtime monitoring, which monitors whether the system is within the invariant, runtime monitoring of perception against perception contracts gives a stronger guarantee. These checks ensure that the perception is safe not only from the current state, but also from any other state satisfying the invariant, where latent variables can be different. For example, in the setting of autonomous vehicles, a perception that passes the check against a perception contract remains safe at different speeds and under different road conditions as well. We refer the reader to the end of the Evaluation Section 6.4 for a more detailed discussion.
V. Relaxing Invariants. In the above development, we assumed that the inductive invariant Inv is any invariant that proves the safety of the system working with an environment model. However, extremely restrictive invariants may disallow perception contracts despite the system being safe. Intuitively, we want the invariant to be a bit more liberal than the strictest invariants that can prove the system correct under perfect perception, in order to allow errors in perception.
A concrete example occurs in our evaluation. Consider a lane-keeping vehicle where safety is proved using an invariant that demands that the vehicle always move towards the center line. More precisely, we record the position of the vehicle in the previous time step, and require that at the next time step the vehicle is closer to, or at, the center line. With perfect perception, this is easy to achieve, as the system knows where the center line is and can navigate the vehicle towards it. However, in any system with neural perception, this invariant will not hold. In particular, when the vehicle is very close to the center line, even an extremely accurate neural perception may place it on the other side of the line, causing the system to violate the invariant. Notice, however, that this stringent invariant is not required for the safety property of staying within the lane. We can relax it by asking for the invariant property to hold only if the vehicle is far enough away from the center of the lane. This leads to the existence of perception contracts that can prove the system safe, and we show in our evaluation that our mining algorithms are able to synthesize one.

LEARNING ARCHITECTURE FOR SYNTHESIZING PERCEPTION CONTRACTS
In this section we explain our general learning architecture (Figure 3) for synthesizing perception contracts. As defined in the previous section, the synthesis problem for perception contracts is to find a formula in a logic L that captures properties of groundtruth estimates (in relationship to groundtruth perception) that (a) include the groundtruth estimates of the neural perception module working on a finite set of impressions (images), and (b) maintain an invariant Inv.
Let us fix a system, an environment model, a sensor, and a neural perception module (using the notation developed in the previous section). Let us also fix a finite set of impressions (images) Σ, and consider the finite set of groundtruth estimates computed by the neural perception on Σ.

Overview
To synthesize a perception contract, we use a symbolic learning algorithm that learns concepts in the given logic L.
Our learning architecture (depicted in Figure 3) has the learner interact with a teacher, which provides counterexamples to adherence to the invariant. In each round, the learner takes a set of (gt, gte) pairs (of groundtruth and groundtruth estimate), each pair labeled positive or negative. It then synthesizes a classifier in the logic L that classifies these samples exactly, and forwards it to the teacher.
The teacher, on receiving a proposed perception contract, checks whether the pairs of groundtruth and groundtruth estimates defined by the contract maintain the invariant Inv. This requires checking whether, for every pair (gt, gte) that satisfies the contract, and for every configuration with perception component gt that adheres to the invariant, gte preserves the invariant at that configuration. We achieve this using a constraint solver (Gurobi in our evaluation). If the perception contract does not maintain the invariant, we extract a counterexample pair of the form (gt, gte), label it negative, add it to the sample set, and invoke the learner again.
The learner, which in each round receives a set of positive and negative samples, computes for each sample a feature vector that has additional features depending on the logic, and uses decision-tree learning to learn concepts expressed as Boolean combinations of these features. The decision-tree learning algorithm is an exact learning algorithm (it classifies the training set perfectly), and is an adaptation of Quinlan's ID3 algorithm [Quinlan 1986], modified slightly as described below to increase margins.
Note that when the teacher verifies that a perception contract maintains the invariant, the algorithm can return this contract, as it is a valid perception contract that includes the initial set of positive samples.
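The learner and teacher rounds described above can be sketched as a simple loop. The following is only a toy illustration under stated assumptions: the names `synthesize`, `toy_learn`, and `toy_teacher` are hypothetical, a contract is reduced to a single bound on |gt - gte| over scalar values, and the teacher is a stub that knows the tolerable error rather than a constraint solver.

```python
from typing import Callable, Optional, Set, Tuple

Pair = Tuple[float, float]  # one (gt, gte) sample; scalars here for simplicity


def synthesize(positives: Set[Pair],
               learn: Callable[[Set[Pair], Set[Pair]], float],
               teacher: Callable[[float], Optional[Pair]],
               max_rounds: int = 50) -> Optional[float]:
    """Alternate learner and teacher until the teacher accepts a contract."""
    negatives: Set[Pair] = set()
    for _ in range(max_rounds):
        contract = learn(positives, negatives)   # consistent with all samples
        counterexample = teacher(contract)       # None means "invariant kept"
        if counterexample is None:
            return contract
        negatives.add(counterexample)            # label negative and retry
    return None                                  # gave up within the budget


def error(pair: Pair) -> float:
    return abs(pair[0] - pair[1])


def toy_learn(positives: Set[Pair], negatives: Set[Pair]) -> float:
    """Toy learner: a bound separating positive from negative errors."""
    lo = max(error(p) for p in positives)
    if not negatives:
        return lo + 1.0                          # generous first guess
    hi = min(error(n) for n in negatives)
    return (lo + hi) / 2


TOLERABLE_ERROR = 0.5  # assumption: known only to the stub teacher


def toy_teacher(bound: float) -> Optional[Pair]:
    """Stub teacher: return an admitted pair that breaks the invariant."""
    if bound > TOLERABLE_ERROR:
        return (0.0, bound)  # admitted by the contract, yet too large an error
    return None


if __name__ == "__main__":
    samples = {(0.0, 0.1), (1.0, 0.8)}
    print(synthesize(samples, toy_learn, toy_teacher))
```

In this toy setting each counterexample tightens the bound, so the loop stops after a few rounds; nothing here reflects the running time or round counts of the actual tool.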

Realizing the Learning Architecture
We now describe how to realize the learner and teacher in the learning architecture for perception contracts.
Logics for Expressing Invariants. We assume that the logics for invariants are expressed as quantifier-free formulae over the domains of variables appearing in the environment model and the system, the feedback variables, and the perception variables. While the logics can be arbitrary, in order to apply decidable verification techniques such as Gurobi, we restrict these to use a class of nonlinear functions, trigonometric functions, etc., that lead to instances of constraint-solving problems that can be decided by Gurobi.
Logics for Perception Contracts. In fixing a logic to realize our learning, one must strike a balance between expressivity and simplicity. The logic for perception contracts should be powerful enough to capture regions of groundtruth and groundtruth estimate pairs that preserve an invariant. Note that invariant expressions can be complex (expressed using nonlinear arithmetic and trigonometric functions). We need the formulas in the logic describing the perception contracts to be simple in order for verification of invariant preservation to be decidable using tools such as Gurobi.
Let us fix two copies of the percept variables, one for the groundtruth and one for the groundtruth estimate. Let Π be a finite set of predicates over both copies and let Γ be a set of real-valued functions over both copies. Let us introduce a new Boolean variable for each predicate in Π and a new real-valued variable for each function in Γ, standing for the value of that predicate or function. The logics for learning are parameterized over Π and Γ, and are essentially quantifier-free Boolean combinations of predicates in Π and upper/lower bounds, by constants, on the functions in Γ. The logic includes if-then-else terms, which evaluate to their second argument if their condition is true and to their third argument otherwise. In our evaluation, the functions in Γ are typically octagonal expressions (sums and differences of at most two variables, each with a sign) over the real-valued variables. Consequently, perception contracts are Boolean combinations of linear expressions over reals, and hence amenable to verification by tools like Gurobi.
Symbolic Learner. We can learn formulas over the logic for perception contracts from samples using decision-tree learning. For every positive/negative sample (gt, gte), we expand it to a larger feature vector whose new features hold the value of every predicate in Π and every function in Γ. The problem then reduces to learning a Boolean formula over the Boolean features and upper/lower bounds of the numerical features by constants (using < and ≤), and we can employ a standard decision-tree learning algorithm for this purpose.
In our evaluation, we utilize the ID3 algorithm for decision tree learning, modified appropriately to find perfect classifiers [Garg et al. 2014; Mitchell 1997]. The ID3 algorithm builds the tree top down in one pass, choosing predicates to apply at each node. The best attribute at any node of the tree is chosen based on information gain, a statistical measure based on entropy [Mitchell 1997]. While typical decision tree algorithms can stop building the tree when the leaf nodes are mostly pure, we need to continue building the tree until the leaves are entirely pure (i.e., the samples that flow to any leaf must be all positive or all negative).
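To make the "grow until every leaf is pure" requirement concrete, here is a minimal sketch of entropy-based tree building over numeric features with threshold splits. It is not the authors' learner: the function names are hypothetical, thresholds are simply midpoints between adjacent observed values, and no margin-related modification is attempted.

```python
import math


def entropy(labels):
    """Binary entropy of a list of Boolean labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def best_split(X, y):
    """Return (gain, feature index, threshold) with maximal information gain."""
    best = None
    base = entropy(y)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(y)
            gain = base - remainder
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best


def build_exact_tree(X, y):
    """Grow the tree until every leaf is pure (all positive or all negative)."""
    if all(y) or not any(y):
        return bool(y[0])                        # pure leaf
    split = best_split(X, y)
    if split is None:
        raise ValueError("identical feature vectors carry different labels")
    _, f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_exact_tree([X[i] for i in left], [y[i] for i in left]),
            build_exact_tree([X[i] for i in right], [y[i] for i in right]))


def classify(tree, row):
    """Follow threshold tests down to a Boolean leaf."""
    while not isinstance(tree, bool):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree
```

Unlike a pruned tree, the result always reproduces the training labels exactly, even when the first split has zero information gain (as in an XOR-shaped sample set).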
Teacher. Given a proposed perception contract, the teacher needs to check whether all (gt, gte) pairs admitted by the contract preserve the invariant.
Formally, we need to check the validity of Formula (1), which captures the crucial property of when a perception contract preserves an invariant. It says that if the contract holds on a pair (gt, gte), there is a configuration satisfying the invariant with percept variables evaluating to gt, and the system reading the groundtruth estimate gte can move to a new state giving new feedback, then the invariant must hold in the new state, where the perception variables are evaluated according to the groundtruth.
The teacher needs to check the validity of this formula and, if it is not valid, return a valuation of variables where the formula does not hold. Projecting this valuation to the groundtruth and groundtruth-estimate variables gives the counterexample pair that is returned to the learner. In our evaluation, we implement this teacher using the constraint solver Gurobi [Gurobi Optimization 2020].

CASE STUDIES
We study how perception contracts can be used to analyze the safety of two realistic vision-based control systems, namely, a Lane Tracking System (LTS) on a Polaris GEM E2 electric vehicle [Du et al. 2020] and an agricultural robot that follows crop rows [Sivakumar et al. 2021]. The controllers (reactive modules) used in these systems provably satisfy certain safety properties when they work with perfect perception. However, those same properties may be violated under some conditions when the controllers work with neural perception. The perception contracts constructed here capture the positive examples, and as we shall see, they can help discover the conditions under which the overall system can be proven to be safe.

Case Study: Lane Tracking System
For the experiments on the GEM vehicle in this paper, we use the high-fidelity Gazebo simulation of the vehicle model together with the actual controller code. The neural perception module uses LaneNet [Neven et al. 2018] for lane detection. We use the well-studied Stanley controller [Hoffmann et al. 2007] for lane tracking as the reactive module.
Environment Model. Given a global reference coordinate system, the latent variables LM of the environment model are two variables that represent the position of the vehicle in the 2D plane and one variable that represents the orientation of the vehicle with respect to a coordinate axis (see Figure 4). The percept variables P are the cross-track error and the heading error, where the heading error is the difference between the direction of the lane and the heading of the vehicle. When the lane center line is aligned with a coordinate axis of the global reference, the groundtruth cross-track error and heading error can be read off directly from the latent variables. Thus, the valuation Val(LM ∪ P) ⊆ R^5. The feedback variable in FB is the steering input to the vehicle, and the valuation Val(FB) ⊆ R. The transition of the environment model is derived from a relatively simple textbook bicycle model [Hsieh et al. 2022b], with constants v_f, ∆T, and L as in Table 2.
Reactive Module. The reactive module is the Stanley controller of [Hoffmann et al. 2007], with constants v_f, K, and δ_max as in Table 2. The transition of the reactive module relates each pair of percept values to the steering input that the controller computes for it.
Initial Configurations, Safety, and Invariant. The safety property is that the vehicle never leaves the lane boundary.
Recall that a configuration combines the percept variables P, the feedback variables FB, the latent variables LM, and the state of the reactive module RM. Since the reactive module has no state, we can ignore the last component. To show safety, we take Lyapunov stability [Gibson et al. 1961] as the invariant that the system should preserve, and we relax it based on input-to-state stability [Sontag 2008]. That is, we are given a Lyapunov function, which takes the groundtruth values of the percept variables and outputs a non-negative real value representing the (generalized notion of) distance to a desired state. The invariant then requires that, as the system evolves, this distance is always nonincreasing, so that the system stays safe because it always stays within a bounded distance of the desired state. In addition, the nonincreasing requirement is relaxed once the distance has decreased to less than a small threshold parameter ρ. In this case study, the desired state is when the vehicle is aligned with the center line of the lane, i.e., both percept variables are 0; hence the Euclidean norm of the percept variables is given as the Lyapunov function over the domain P. The relaxed invariant Inv, which includes the initial configurations Init, is given by Equation 2, where ρ is the parameter for relaxing the invariant. We will further evaluate over different ρ values in Section 6.
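As an illustration of the reactive module in this case study, the sketch below implements the commonly cited form of the Stanley steering law: heading error plus the arctangent of gain times cross-track error over forward speed, saturated at a maximum steering angle. This is an assumption about the form of the controller, not a transcription of the paper's definition. The names `d`, `psi`, and `stanley_steering` are hypothetical, the sign conventions are assumed, and the default values of `K` and `delta_max` are placeholders rather than the constants of Table 2.

```python
import math


def stanley_steering(d: float, psi: float,
                     K: float = 0.5,          # placeholder gain
                     v_f: float = 2.8,        # forward speed used in the text
                     delta_max: float = 0.6   # placeholder steering limit (rad)
                     ) -> float:
    """Steering command from cross-track error d and heading error psi."""
    raw = psi + math.atan2(K * d, v_f)        # equals atan(K*d/v_f) for v_f > 0
    return max(-delta_max, min(delta_max, raw))  # saturate the command


if __name__ == "__main__":
    for d, psi in [(0.0, 0.0), (0.3, 0.1), (10.0, 1.0)]:
        print(d, psi, stanley_steering(d, psi))
```

Because the module is stateless, its transition relation is just the graph of this function: a pair of percept values is related to exactly one steering input.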

Case Study: Agricultural Robots
The second case study is a visual navigation system, named CropFollow, for under-canopy agricultural robots (AgBot) developed in [Sivakumar et al. 2021]. The system is responsible for avoiding collisions with the row boundaries when the vehicle traverses the space between two rows of crops. As in the Lane Tracking System, the latent variables LM consist of the 2D position and the heading. The sensor captures the image in front of the vehicle with a camera (Figure 5). Likewise, the percept variables P are composed of the heading difference and the cross-track distance to an imaginary center line between the two rows. The valuation of environment model states is thus Val(LM ∪ P) ⊆ R^5. The constants v_f and ∆T of the environment model are given in Table 3.
Reactive Module. The modified Stanley controller uses a feedback variable in FB that is the angular velocity instead of the steering angle. The reactive module is this modified Stanley controller, with constants v_f, K, and ω_max as shown in Table 3. As before, the transition of the reactive module relates each pair of percept values to the feedback value that the controller computes for it.
Initial Configurations, Safety, and Invariant. For the agricultural robots, we wish to avoid two undesirable outcomes: (1) if the magnitude of the cross-track distance exceeds 0.5W = 0.38 meters, the vehicle will hit the corn, and (2) if the heading difference grows too large in magnitude, the camera view will face crops and neural perception will fail. The invariant Inv includes the initial configurations Init and proves the safety property Safe; it is defined using ( , ) := (3 + 0.75 ) 2 + (2 ) 2, a norm function for proving the safety and stability.

EVALUATION
We implemented our technique for synthesizing perception contracts for autonomous systems with ML-based perception in a framework called Perceptor.
Given (a) a system (a reactive module) interacting with a model of the environment, (b) a safety property and an invariant that proves the safety property under perfect perception, and (c) a set of ground truths and their estimates (perception pairs), i.e., {(gt, gte)}, Perceptor synthesizes a perception contract that is guaranteed to maintain the system invariant and includes the given set of perception pairs. The ground truths gt are sampled from a simulated environment. The estimates gte are derived from images and the outputs of ML perception working over them. The synthesized perception contracts are expressed in a logic L, which is the one described in Section 4.2.
To assess the efficacy of our approach, we investigate (1) how effectively Perceptor can synthesize perception contracts that include all positive samples ((gt, gte), +) while preserving the invariant, and (2) how well the perception contracts synthesized by Perceptor generalize.

Implementation
Perceptor is implemented in Python and contains approximately 2536 lines of code. We use Quinlan's C5.0 decision tree algorithm [Mitchell 1997; Quinlan 1986], with some modifications, to build our learner. We ensure our learner can only find exact classifiers for a given set of samples. Additionally, for every individual sample our learner receives, it produces a feature vector that has additional features depending on the logic. Therefore, perception contracts are decision trees (i.e., Boolean combinations) over these new features. We use the optimization solver Gurobi [Gurobi Optimization 2020] as the teacher. Given a candidate contract, Gurobi checks whether the perception contract maintains the invariant, i.e., whether Formula (1) is valid. If not, Gurobi returns a set of counterexamples (negatively labeled ((gt, gte), −) perception pairs) indicating valuations where the formula does not hold. Due to approximations of trigonometric functions, limitations of floating-point arithmetic, etc., Gurobi can return spurious counterexamples, i.e., (gt, gte) pairs that are already excluded by the contract. In these cases, we use Z3 [De Moura and Bjørner 2008] to double-check that the returned pairs are true counterexamples. When Gurobi returns spurious counterexamples and cannot also return at least one true counterexample (i.e., a pair allowed by the contract but not by the invariant after a transition), we deem the contract safe (in these cases, technically, we have not proven the contract to be safe, and using more powerful verification techniques is a potential future direction).
Case Studies, Relaxing Invariants, and Safe Regions. We evaluate our prototype on two case studies of vision-based control systems, a Lane Tracking System (LTS) for an electric vehicle and a navigation system (CropFollow) for a crop row-following agricultural robot, as described in Section 5. There are several aspects of the systems that need to be articulated. First, these systems are formally proven to be safe under perfect perception. Second, these systems, with their current neural network based perception, are evidently not safe. For example, if the vehicle is at the edge of a road and the heading angle and camera are facing away from the center of the lane, then NN perception often fails, and the vehicle is led off the road by the controller. In fact, there are concrete images in our training set that already show this unsafe behavior.
We refer the reader to Figure 6a, which shows the ratio of safe images in the training data. We divide the groundtruth values into regions based on intervals of heading angle and of the distance of the ego vehicle to the center of the lane (the two axes of the figure), and represent the fraction of safe images in the training set in each region.
Notice that there are unsafe regions labeled in white (safe regions are those colored blue).In fact, some of these unsafe regions are where the ego vehicle is further away from the center of the lane and where the vehicle is facing away from it.
Another set of unsafe regions is when the vehicle is close to the center of the lane. However, in these cases, the vehicle is not truly unsafe. The reason the invariant is not maintained is that even a slight error in perception can make the controller veer the vehicle slightly away from the center of the lane. For example, when the vehicle is slightly to the right of the center of the lane, a mild perception error can place it to the left of the center, making the controller steer it to the right and violate the invariant. As described in Section 3.2, the invariant typically needs to be relaxed to accommodate some error in perception. Recall from Equation 2 that the invariant demands that the Lyapunov function never increases, where the function captures both how close the vehicle is to the center of the lane and how well it is aligned to it. We relax this invariant ( ′ , ′ ) ≤ ( , ) to ( ′ , ′ ) ≤ ( ( , ), ). This relation says that the vehicle need not decrease the Lyapunov function once it is under a threshold value (as the vehicle is then already quite close to the center and aligned). Note that this relaxes the invariant only when the vehicle is close to the center of the lane, and hence still assures that the vehicle will not leave the lane.
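One reading of this relaxation, consistent with the description of the threshold parameter ρ in Section 5, is that the value of the Lyapunov function after a transition must be at most the larger of its previous value and ρ. The sketch below checks a single transition under that reading. The names `V`, `preserves_strict`, and `preserves_relaxed` are hypothetical, and the Euclidean norm is used as the Lyapunov function, as in the lane tracking case study.

```python
import math


def V(d: float, psi: float) -> float:
    """Assumed Lyapunov function: Euclidean norm of the percept values."""
    return math.hypot(d, psi)


def preserves_strict(before, after) -> bool:
    """Strict invariant: the function must not increase across a transition."""
    return V(*after) <= V(*before)


def preserves_relaxed(before, after, rho: float) -> bool:
    """Relaxed invariant: increases are tolerated while staying below rho."""
    return V(*after) <= max(V(*before), rho)


if __name__ == "__main__":
    near_center = ((0.01, 0.0), (-0.02, 0.0))   # crosses the center line
    print(preserves_strict(*near_center))
    print(preserves_relaxed(*near_center, rho=0.1))
```

With ρ set to 0 the relaxed check coincides with the strict one, which matches the use of ρ := 0.0 for the strict invariant in the experiments.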
We now refer the reader to Figure 6b, where the training images are now evaluated with respect to the relaxed invariant above.Notice that now there are no unsafe regions when the distance and heading angle are both small.
The goal of our experiments is to synthesize perception contracts for every region using both the original and relaxed invariant.For each safe region, we want to synthesize a perception contract that includes the perception of all training images in that region.And for unsafe regions, we want to synthesize perception contracts that include only the safe training images in the region.

Experiment Setup
Our experiments for the Lane Tracking System (LTS) case study span six experiments, with the same training data across all of them. In each experiment, the data is divided into 40 partitions (as in Figure 6a). However, the experiments are configured differently to study the impact of the feature space and the choice of invariant on the effectiveness of our learning architecture. The feature space is either 2D or 4D. The 2D feature space consists of two base features, which capture the difference between the groundtruth and perceived values for the distance and for the heading angle, respectively. The derived octagonal constraint features are formed using the two base features (to get 8 features). The 4D space has as base features the groundtruth and estimated values of the distance and of the heading angle.
The invariant for the LTS is defined in Equation 2 and is parameterized by ρ. We use two values for ρ, 0.0 and 1.0: when ρ := 1.0 the invariant is relaxed, while when ρ := 0.0 the invariant is strict. We perform four experiments, taking two choices for the feature space and two choices for the value of ρ, at a constant speed of 2.8. We perform an additional two experiments where the speed of the car is assumed to be in the range [2.5, 3.0], for the 2D and 4D feature spaces, with ρ := 1.0.
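The feature expansion for the 2D space can be sketched as follows. The text does not list the eight derived features, so the enumeration below (every signed combination of the two base features) is an assumption, and the function names are hypothetical.

```python
from typing import List, Tuple

Percept = Tuple[float, float]  # (distance, heading angle)


def base_features_2d(gt: Percept, gte: Percept) -> Tuple[float, float]:
    """Differences between groundtruth and estimated distance and heading."""
    return gt[0] - gte[0], gt[1] - gte[1]


def octagonal_features(a: float, b: float) -> List[float]:
    """Assumed enumeration: unary terms and all signed sums of two features."""
    return [a, -a, b, -b, a + b, a - b, -a + b, -a - b]


def features_2d(gt: Percept, gte: Percept) -> List[float]:
    """Feature vector handed to the decision-tree learner for one sample."""
    return octagonal_features(*base_features_2d(gt, gte))


if __name__ == "__main__":
    print(features_2d((3, 1), (1, 2)))
```

A threshold test on any of these features is a linear constraint over the groundtruth and the estimate, so a decision tree over them stays within Boolean combinations of linear expressions.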
The setup for the Agricultural Robots case study is the same except that we do two experiments, both for the relaxed invariant, one for the 2D space and the other for the 4D space.
Training Data Preparation.To prepare the training data for the Lane Tracking System (see Section 5.1), we use the Gazebo model for the GEM vehicle [Du et al. 2020] and generate camera images with their groundtruth percepts gt = ( gt , gt ).Each image is sampled from a uniform distribution over (i) the states of the environment model MEnv = ( , , ) | | | ≤ 0.3W ∧ | | ≤ 12 (ii) three types of roads with two, four, and six lanes, and (iii) two lighting conditions, day and dawn.We process each image using LaneNet [Neven et al. 2018] and obtain gte = ( gte , gte ).In total, we collect 24144 pairs of (gt, gte) as training data.
For each of these images and outputs gte, we check using Gurobi whether the system maintains the invariant from any state satisfying the invariant when the controller is fed gte (note that this labeling is different for the constant speed and variable speed cases). This gives us a positive/negative label for each sample.
The training data for the agricultural robot (AgBot) case study is prepared using the Gazebo model for AgBot in a similar manner with different environments. Images are sampled from a uniform distribution over (i) the states of the environment model MEnv = ( , , ) | | | ≤ 0.3W ∧ | | ≤ 12 (ii) five different plant fields, including three stages of corn (baby, small, and adult). We process each image using CropFollow [Sivakumar et al. 2021]. In total, we collect 7733 pairs of (gt, gte) as the set of samples and label them positive or negative, as described above.

Results: E ectiveness in Learning Perception Contracts
We study the effectiveness of our architecture in learning perception contracts. In this paper, we say an approach is effective if it can learn a perception contract that includes all positive samples and preserves the invariant. We present our results in Table 4 for the Lane Tracking System (LTS) case study and Table 5 for the CropFollow case study. We especially focus on finding perception contracts in safe regions (i.e., regions where all training data maintain the invariant). Nonetheless, the tables show our results for safe as well as unsafe regions. For each configuration, we report whether the 2D or 4D feature space was used, the number of regions where our technique was successful in generating a perception contract (for safe as well as unsafe regions), the average time required for synthesis, the average and maximum number of rounds our tool takes to find a correct contract, and the average and maximum number of paths (leaves) in the decision tree representing the synthesized perception contract.
The results for the LTS case study in Table 4 show that our technique is in general able to synthesize perception contracts (about 85% of the time). In particular, when the invariant is relaxed (ρ = 1.0) and in the 4D case, our tool successfully found perception contracts in all 30 safe regions (30/30, 100%), for both the constant speed case and the variable speed case. Furthermore, for the variable speed case, our tool was also successful in synthesizing contracts in all the unsafe regions (where these contracts include all safe images in the region).
For the CropFollow case study in Table 5, our findings show that the tool is effective in finding perception contracts in most cases as well, and in the 4D feature space it succeeds in finding perception contracts in all 14 safe regions and in all 11 unsafe regions.

Generalizability of Perception Contracts
In this subsection, we want to study how well our synthesized perception contracts generalize.To do this, we want to evaluate on new images processed by ML perception.We check whether images that maintain the invariant satisfy the contract and images that do not maintain the invariant violate the contract.We use runtime monitoring of the two case studies to obtain these new images.
Setup for Obtaining Test Set.We validate the learned perception contracts within the context of the closed-loop system.As test data, we use simulation traces of the system and use conformance to the perception contracts for runtime monitoring.
For the lane tracking system, we use the simulator to collect 800 traces as the testing set for the synthesized contracts.Each trace simulates the lane tracking system for 20 seconds and consists of at least 400 pairs of groundtruth values gt and estimated values gte.We then prepare our test dataset by labeling each pair of gt and gte positive (negative) with respect to preserving the invariant Preserve Inv .
Similarly, we collect 400 traces as the testing set for AgBot.Each trace simulates the AgBot system for 10 seconds and consists of over 200 pairs of images.
Results on Generalizability of Perception Contracts. To calculate the precision of a contract over a set of images, we compute the percentage of correctly classified ground truth and estimate pairs, (gt, gte), over the total number of pairs. Note that the pairs are extracted from images evaluated on ML perception.
Table 6 and Table 7 show results on the precision of perception contracts synthesized by our tool, for the various configurations involving relaxation of the invariant, feature space, and safe and unsafe regions. Our results on synthesized contracts show high precision (> 79%) for both case studies when the invariant is relaxed.
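The precision measure as defined in this paragraph (the share of test pairs on which the contract agrees with the invariant-preservation label) can be computed as in the sketch below. The function name and the toy contract are hypothetical; a contract is modeled as any Boolean function of a (gt, gte) pair.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[float, float]            # (gt, gte), scalars for illustration
Labeled = Tuple[Pair, bool]           # label: True if the pair preserves Inv


def contract_precision(contract: Callable[[float, float], bool],
                       labeled_pairs: Iterable[Labeled]) -> float:
    """Percentage of pairs whose contract verdict matches their label."""
    pairs = list(labeled_pairs)
    if not pairs:
        raise ValueError("no test pairs given")
    correct = sum(1 for (gt, gte), label in pairs
                  if contract(gt, gte) == label)
    return 100.0 * correct / len(pairs)


if __name__ == "__main__":
    toy_contract = lambda gt, gte: abs(gt - gte) <= 0.5
    data = [((0.0, 0.1), True), ((0.0, 1.0), False),
            ((0.0, 0.2), False), ((1.0, 1.0), True)]
    print(contract_precision(toy_contract, data))
```

Both kinds of disagreement count against the score: a safe pair rejected by the contract and an unsafe pair admitted by it.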
Using Perception Contracts as Runtime Monitors.Our results show that our learning algorithm has generalized from training data well, and hence can be used for runtime monitoring.On average, it only takes 0.003 seconds to evaluate perception values over the perception contract in our setting, and hence can be used in an online runtime monitoring setting.
The reader may wonder why we cannot do runtime monitoring simply by checking whether the system is within the invariant. Runtime monitoring using perception contracts does a much better job than simply checking invariants at runtime. When perception on an image during runtime monitoring passes the perception contract, we are guaranteed that in all states of the system satisfying the invariant (not just the current state), the current perception will maintain the invariant. For example, consider a car moving at a particular speed on a dry road, and consider perception on an image that keeps the car within the invariant. While naive runtime monitoring will declare this safe, we require a lot more: we want the perception on the image to be safe in all states, where latent variables can be different, such as when the car is moving at a different speed or the road is wet. Monitoring using perception contracts gives this assurance.
For example, in Figure 7a, the left image shows an ego vehicle in a three-lane road positioned on the rightmost lane but close to the middle lane, with heading angle towards the middle lane. However, the vision component only perceives (seen in Figure 7b) a two-lane road represented by the three red lines in the image on the right. As a result, the ego vehicle considers itself positioned in the perceived left lane instead of the rightmost lane, and maintains its current heading instead of steering to the right. It turns out that when the vehicle is at a slower speed, the invariant is not violated, but the invariant is violated at higher speeds (the invariant is a complex function). Consequently, if we were doing naive runtime monitoring when the car is going slow, we may declare the perception is safe, though it is not, while our perception contract will declare the perception unsafe.
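The difference between the two kinds of monitoring can be sketched on a toy system. Everything in this sketch is an assumption made for illustration: the one-dimensional dynamics in `step`, the invariant bound `D_MAX`, and the explicit enumeration of latent speeds (a synthesized contract is a precomputed formula and does not enumerate states at runtime).

```python
from typing import Sequence

D_MAX = 1.0  # toy invariant: |d| <= D_MAX (assumption)


def step(d: float, d_est: float, speed: float) -> float:
    """Toy dynamics (assumption): the controller steers against the estimated offset."""
    return d - speed * d_est


def naive_monitor(d: float, d_est: float, current_speed: float) -> bool:
    """Checks the invariant only for the state the system is actually in."""
    return abs(step(d, d_est, current_speed)) <= D_MAX


def contract_style_monitor(d: float, d_est: float, latent_speeds: Sequence[float]) -> bool:
    """Checks that acting on this percept preserves the invariant for every latent speed."""
    return all(abs(step(d, d_est, v)) <= D_MAX for v in latent_speeds)


# The percept has the wrong sign, so the controller steers away from the center.
d, d_est = 0.5, -0.3
naive_ok = naive_monitor(d, d_est, current_speed=1.0)
contract_ok = contract_style_monitor(d, d_est, latent_speeds=[1.0, 2.0])
```

At the slow speed the next state stays within the toy invariant, so the naive check passes, while the check over all latent speeds rejects the same percept because it leads outside the invariant at the higher speed.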

Comparison with Approximate Abstractions of Perception
The work in [Hsieh et al. 2022b] creates approximate abstractions of perception (AAPs) that are similar to perception contracts in that they capture errors in perception that preserve the invariant. However, there are many differences. Unlike perception contracts, AAPs do not intend to capture all safe images in regions. AAPs capture errors using simple shapes like spheres (and hence convex regions), while perception contracts use a logic that represents various shapes (which is why they can include all safe images). In short, AAPs are not perception contracts as defined in this paper. We nevertheless compare our learned PCs experimentally with AAPs generated using the tool in [Hsieh et al. 2022b]. In order to run the tool for AAPs following [Hsieh et al. 2022b], we collect the training data by uniformly sampling different positions and orientations of the LTS on the one-road scenario and collect the testing data using the three-road scenario. Further, we need to simplify the system model with constant speed v_f = 2.8 so that we can obtain AAPs.
Table 8 shows the results comparing the effectiveness of the two approaches. It includes the time for synthesis, the percentage of included pairs in training data, and precision with respect to testing data. We first observe that, unsurprisingly, inferring AAPs is much faster than learning PCs. However, AAPs do not provide the strong guarantee that PCs provide; they do not include all positive pairs in training data. In particular, for unsafe regions of states, the inferred AAPs include only half of the pairs on average and none of the pairs in the worst case. On the other hand, our approaches using either 2D or 4D feature space can consistently find PCs that include almost all positive pairs for all regions (the entries that are close to but not 100% are due to numerical errors in Gurobi when determining the true safety of perception on images). Similarly, when the testing data is gathered from a uniform distribution over positions and orientations, PCs achieve higher precision scores than AAPs regardless of the texture and the color of the roads in the three-road scenario.
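To illustrate why a single convex shape may fail to include all positive pairs while a more expressive representation can, consider the following toy sketch. The points, the ball, and the union of boxes are all hypothetical and are not taken from Table 8; the helpers `ball` and `boxes_union` merely stand in for a sphere-shaped AAP and a logic-based contract.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_lo, x_hi, y_lo, y_hi)


def inclusion_rate(member: Callable[[Point], bool], positives: Sequence[Point]) -> float:
    """Percentage of positive points that the candidate set contains."""
    return 100.0 * sum(1 for p in positives if member(p)) / len(positives)


def ball(center: Point, radius: float) -> Callable[[Point], bool]:
    """Membership test for a closed ball (a convex set)."""
    return lambda p: math.dist(p, center) <= radius


def boxes_union(boxes: List[Box]) -> Callable[[Point], bool]:
    """Membership test for a union of axis-aligned boxes (possibly non-convex)."""
    return lambda p: any(x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi
                         for x_lo, x_hi, y_lo, y_hi in boxes)


# Positive points form an L shape; (1, 1) is an unsafe point that must be excluded.
positives = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
unsafe = (1.0, 1.0)

sphere = ball(center=(0.0, 0.0), radius=1.0)  # any larger ball at the origin would admit `unsafe`
l_shape = boxes_union([(0.0, 2.0, 0.0, 0.5), (0.0, 0.5, 0.0, 2.0)])

sphere_rate = inclusion_rate(sphere, positives)
l_shape_rate = inclusion_rate(l_shape, positives)
```

Both candidate sets exclude the unsafe point, but only the non-convex union covers every positive point in this toy data.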

Example Perception Contract
In this section, we show a perception contract synthesized by Perceptor. We further illustrate how to interpret the contract to better understand the constraint it poses on estimated percept values. An example contract synthesized in the LTS case study is the decision tree: corresponding to region ∈ [0.6, 1.2] meters and ∈ [−0.16, −0.10] radians. The region associated with the perception contract indicates that the car would be positioned to the left of the center line within the range of 0.6 to 1.2 meters away from the line. The range of values for the heading angle indicates that the vehicle is oriented toward the center line as depicted in Figure 4. The variables and are estimates of the negative of and , respectively. Hence when there is perfect perception for estimating relative distance and relative heading angles, = − and = − as defined in Section 5.1.
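As a reading aid, the following sketch shows how a decision-tree contract of this kind could be evaluated on estimated percept values. The region bounds are the ones stated above; the tree itself, its thresholds, and all identifiers (`Region`, `toy_contract`, `perfect_percept`, `offset_est`, `heading_est`) are hypothetical placeholders and do not reproduce the contract synthesized by Perceptor.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Region:
    """Ground-truth region associated with the example contract (bounds from the text)."""
    offset_lo: float = 0.6      # meters, left of the center line
    offset_hi: float = 1.2
    heading_lo: float = -0.16   # radians, oriented toward the center line
    heading_hi: float = -0.10

    def contains(self, offset: float, heading: float) -> bool:
        return (self.offset_lo <= offset <= self.offset_hi
                and self.heading_lo <= heading <= self.heading_hi)


def toy_contract(offset_est: float, heading_est: float) -> bool:
    """Hypothetical decision tree; the thresholds are placeholders, not synthesized values."""
    if offset_est <= -0.4:           # percept places the car clearly left of the center line
        return heading_est >= 0.05   # percept must also report a heading toward the center
    return False


def perfect_percept(offset: float, heading: float) -> Tuple[float, float]:
    """Estimates are of the negated ground truth, so perfect perception flips both signs."""
    return (-offset, -heading)


region = Region()
gt = (0.9, -0.13)
in_region = region.contains(*gt)
accepts_perfect = toy_contract(*perfect_percept(*gt))
accepts_unnegated = toy_contract(*gt)  # a percept with the wrong sign convention
```

Under these placeholder thresholds, a perfect percept for a state in the middle of the region satisfies the tree, while a percept with flipped signs does not.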

RELATED WORK
Related work spans the areas of program synthesis, specification mining, and quality assurance techniques for autonomous systems.
Analysis of Closed-Loop Systems with NN Perceptions. The closest related work is the recent paper [Hsieh et al. 2022b] by Hsieh et al. In that work, a notion called approximate abstraction of perception (AAP) was developed to create abstractions of neural perception modules, which were then used for system-level safety analysis. Similar to the perception contracts proposed here, AAPs relate groundtruth values with the groundtruth estimates produced by neural perception. The relation there is defined in terms of piece-wise error bounds that are derived from sampled data as well as the constraints imposed by the system invariant. In [Hsieh et al. 2022a,b], it has been shown how AAPs can be used to establish system-level safety invariant properties for a realistic lane tracking system, a formation control system for drones, and a row-following system for an agricultural robot. The industry paper [Abraham et al. 2022] discusses the relevance of these approaches in the context of engineering autonomous systems. There are several salient differences between these prior works and this paper. (1) This work gives a much more general formalization of perception contracts (PCs), including stateful environments and reactive modules. (2) We formulate the general problem of synthesizing perception contracts (which is independent of the logic used for the contracts). Finally, (3) we propose an iterative learning approach that synthesizes PCs that include all positive samples from the neural perception module and maintain a system invariant.
Other closely related works are VerifAI [Dreossi et al. 2019; Ghosh et al. 2021] by Dreossi and Ghosh et al., [Katz et al. 2021] by Katz et al., and NNLander-VeriF [Santa Cruz and Shoukry 2022]. VerifAI [Dreossi et al. 2019] and related publications [Fremont et al. 2020, 2022] provide a comprehensive framework to falsify a closed loop system with ML-based perception. Further, the Counter Example Guided Inductive Synthesis (CEGIS) based approach in [Ghosh et al. 2021] uses VerifAI to find counterexamples of the closed-loop system, synthesizes a controller, and learns a surrogate model. Their techniques focus on the falsification of the system specification. [Katz et al. 2021] trains generative adversarial networks (GANs) to produce a simpler network for DNN-based perception. NNLander-VeriF [Santa Cruz and Shoukry 2022] verifies NN perception along with NN controllers for an autonomous landing system.
Isolated Neural Network Verification. Recently, there have been many works on verifying an isolated neural network, such as Reluplex [Katz et al. 2017], NNV [Tran et al. 2020], and Verisig [Ivanov et al. 2019]. We refer readers to a summary of the VNN competition [Bak et al. 2021a] for a complete list. Our notion of PCs and our learning approach can decompose the system-level specification and search for a specification for the neural network perception; NN verification tools can then be applied to check the neural network perception with respect to the PC.
Program Synthesis. Program synthesis deals with the problem of synthesizing expressions that satisfy a specification. Counterexample-guided inductive synthesis (CEGIS) [Alur et al. 2015] is one of the most promising approaches in synthesis and resembles online learning. In this setting, the target expression is learned in multiple rounds of interaction between a learner and a verifier. In each round, the learner proposes a candidate expression and the verifier checks whether the expression meets the specification. Our work can be seen as an instance of CEGIS where, along with a verifier, we also rely on a vision component for positive examples via simulation.
Specification Mining. Our work can also be seen as mining specifications [Alur et al. 2005; Ammons et al. 2002; Astorga et al. 2019, 2021; Ernst et al. 1999; Henzinger et al. 2005; Whaley et al. 2002; Xie et al. 2006] for ML-based vision components. In this setting, mining approaches observe program executions to build abstractions of the specification by generalizing from observed runs. Ammons et al. [Ammons et al. 2002] were the first to propose this line of work. Their approach learns Probabilistic Finite State Automata (PFA) to represent valid method call sequences. Various approaches have been proposed since then on automata learning [Alur et al. 2005; Henzinger et al. 2005; Whaley et al. 2002; Xie et al. 2006]. Ernst et al. [Ernst et al. 1999] proposed Daikon for dynamically inferring conjunctive Boolean formulas as likely invariants from black-box executions by collecting program state at method entry and exit. The work by [Jahangirova et al. 2016] also uses mutation testing to infer assertions, and iterates over rounds with the programmer to infer specs.
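The learner and verifier rounds of CEGIS described under Program Synthesis can be sketched on a toy problem. This is a generic illustration and not our synthesis algorithm: the candidate space (a single integer threshold), the finite domain, the `is_safe` oracle, and the function names are all assumptions chosen to keep the loop short.

```python
from typing import Callable, List, Optional, Sequence


def verifier(t: int, domain: Sequence[int], is_safe: Callable[[int], bool]) -> Optional[int]:
    """Return some x accepted by the candidate (x <= t) that is not safe, or None if none exists."""
    for x in domain:
        if x <= t and not is_safe(x):
            return x
    return None


def cegis(domain: Sequence[int], positives: List[int],
          is_safe: Callable[[int], bool]) -> Optional[int]:
    """Learn a threshold t such that every accepted x is safe and every positive is accepted."""
    t = max(domain)  # the learner starts with the most permissive candidate
    while True:
        counterexample = verifier(t, domain, is_safe)
        if counterexample is None:
            # The candidate is sound; accept it only if it still covers all positive examples.
            return t if all(p <= t for p in positives) else None
        t = counterexample - 1  # the learner refines the candidate to exclude the counterexample


# Toy specification (assumption): x is safe iff x * x <= 50; positives come from observed runs.
threshold = cegis(domain=list(range(21)), positives=[1, 5, 7], is_safe=lambda x: x * x <= 50)
```

The loop terminates because each counterexample strictly lowers the threshold; it returns `None` when no threshold is both sound and inclusive of all positive examples.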

CONCLUSION
We have introduced perception contracts that formalize groundtruth estimation deviations by neural perception that preserve invariants of systems. We additionally argue that they can form contracts for neural perception modules for ensuring safety. We have also studied the problem of synthesizing perception contracts given a set of images whose groundtruth estimate by neural perception maintains the invariant. We have evaluated our synthesis algorithm on two realistic vision-based control systems, for efficacy and precision. We also demonstrate the effectiveness of perception contracts as runtime monitors over simulation traces of these vision-based control systems. Several future directions are interesting. First, perception modules in autonomous vehicles, drones, robots, etc. glean groundtruth not just from a single image, but a sequence of frames. Extending the notion of perception contracts to sequences of frames augmented with reasoning mechanisms for perception is an interesting future direction. Second, it would be interesting to perform larger experiments that subject vision-based control systems to continuous runtime monitoring, checking synthesized perception contracts to validate them. Third, we believe that our techniques are orthogonal to other approaches for evaluating perception correctness such as robustness and testing against image generators; exploring combinations of these techniques would be interesting, especially using perception contracts as specifications for these ML components. Finally, it would be interesting to utilize synthesized perception contracts for both runtime monitoring as well as falsification, exploiting the fact that these are local contracts for the perception module. Evaluating the efficacy of perception contract guided monitoring and falsification in comparison with traditional monitoring and falsification techniques would be interesting. We are also intrigued as to how perception contracts, monitoring, and falsification extend to more complex properties than safety, such as temporal properties of systems [Lukina et al. 2021; Mamouras et al. 2021].

Fig. 1. Autonomous system interacting with environment/model of environment with perfect perception.

Fig. 2. Autonomous system interacting with environment/model of environment with neural perception; perception contracts.

Fig. 5. Real and simulated camera images for corn row following for agricultural robots.

Fig. 6. Safe and unsafe regions with learned perception contracts with respect to the original and relaxed invariants in combination with two choices of features.

Fig. 7. Images depicting actual view of road vs vision estimate. (a) The image shows a three-lane road from the perspective of a car positioned in the rightmost lane. (b) Image from LaneNet misperceiving lane boundaries.

Table 1. Notations used in Section 2 and Section 3.

Table 2. Constants in Lane Tracking System.

Table 3. Constants in Corn Row Following System for Agricultural Robots.

Table 6. Precision represents the percentage of correctly classified testing data in the LTS case study.

Table 7. Precision represents the percentage of correctly classified testing data in the AgBot case study.

Table 8. Approximate Abstractions of Perception (AAPs) from [Hsieh et al. 2022b] and Perception Contracts (PCs) for the LTS case study with invariant relaxation = 1.0 and constant speed v_f = 2.8 m/s.