Trustworthy Runtime Verification via Bisimulation (Experience Report)

When runtime verification is used to monitor safety-critical systems, it is essential that monitoring code behaves correctly. The Copilot runtime verification framework pursues this goal by automatically generating C monitor programs from a high-level DSL embedded in Haskell. In safety-critical domains, every piece of deployed code must be accompanied by an assurance argument that is convincing to human auditors. However, it is difficult for auditors to determine with confidence that a compiled monitor cannot crash and implements the behavior required by the Copilot semantics. In this paper we describe CopilotVerifier, which runs alongside the Copilot compiler, generating a proof of correctness for the compiled output. The proof establishes that a given Copilot monitor and its compiled form produce equivalent outputs on equivalent inputs, and that they either crash in identical circumstances or cannot crash. The proof takes the form of a bisimulation broken down into a set of verification conditions. We leverage two pieces of SMT-backed technology: the Crucible symbolic execution library for LLVM and the What4 solver interface library. Our results demonstrate that dramatically increased compiler assurance can be achieved at moderate cost by building on existing tools. This paves the way to our ultimate goal of generating formal assurance arguments that are convincing to human auditors.


INTRODUCTION
Safety-critical cyber-physical systems (CPSs) are subject to strict regulation to ensure public safety. Historically, such systems have been constructed using conservative, requirements-driven practices, yielding predictable systems that are amenable to verification by testing [RTCA 2011; SAE International 2010]. There is increasingly a desire to use off-the-shelf components and to employ techniques like machine learning to build autonomous systems, but these cannot be assured using traditional approaches [Cofer 2021; Cofer et al. 2020; Council 2014; Members 2020].
Runtime verification (RV) [Falcone et al. 2013; Goodloe and Pike 2010] addresses this problem by monitoring a system under observation and responding to property violations during the mission. For example, an RV system might monitor engine heat levels, aircraft location within an authorized airspace, or autopilot changes between flight modes. While not static formal verification, RV provides a significant improvement in assurance for systems over testing alone.
Copilot [Perez et al. 2020; Pike et al. 2010, 2013] is a language and toolchain for writing RV monitors. Monitors are written in a high-level, stream-based domain-specific language (DSL) that is embedded in Haskell. Copilot is equipped with a compiler that generates C code, which can then be linked against other application code for use in production.
When used in safety-critical applications, Copilot monitors also form part of the safety case that justifies the system's mission readiness. As a result, monitors must be trustworthy, and trust must be based on systematic and rigorous evidence that can be audited. Specifically, Copilot monitors must be trustworthy in two regards. First, the monitor must be validated, meaning that the monitor enforces the higher-level properties that were intended. Second, the C implementation of the monitor must be verified, meaning that the compiled code correctly realizes the expected semantics of the monitor. The problem of validation is about establishing properties of the monitor program itself, while the problem of verification is about the correctness of the Copilot compiler. A number of mechanisms can be used to validate and verify the monitors. In recent years, formal methods have been increasingly accepted as a means of evidence in domains such as civil aerospace. For example, DO-333 [RTCA 2011] gives official guidance on the use of formal methods in certification practice. This paper presents CopilotVerifier, which addresses the problem of verification.¹ More precisely, a monitor in the Copilot DSL has an executable semantics and, once compiled, the resulting C program also has an executable semantics. CopilotVerifier proves that the monitor and the C program are extensionally equal: given a stream of inputs, the monitor and its compiled form produce equivalent outputs. We also prove that either (1) both are memory safe, i.e., they cannot crash, or (2) they crash on exactly equivalent inputs (the tool supports both modes).
CopilotVerifier takes a translation validation [Pnueli et al. 1998] approach rather than verifying the compiler for all possible programs. That is, it runs alongside the Copilot compiler and generates a proof for the particular compiled output. The proof establishes a bisimulation between the Copilot program and the compiled result. At each program point, it establishes that the state of the Copilot monitor corresponds to the state of the C program. Our verifier is designed to be push-button: given a Copilot monitor as input, the verifier automates the steps needed to construct a proof.
This problem of compiler correctness is familiar from research projects such as CompCert [Leroy 2009] and CakeML [Kumar et al. 2014], but these have tended to be multi-year, multi-person efforts. This level of effort was not available to us as an industry project operating under budget constraints. Our objective was more pragmatic: to maximize our confidence in the Copilot compiler using off-the-shelf formal methods tools and libraries. We make several design choices that reduce the cost of building CopilotVerifier but, in some cases, reduce the power of our results (see Section 6). We judge these to be low-cost tradeoffs relative to the large increase in confidence we have achieved by successfully completing the CopilotVerifier project in just under one year. Overall, our results have been heartening: through careful design, CopilotVerifier shows that it is possible to apply formal assurance to a DSL compiler with modest levels of effort.
Our ultimate goal, of which this project is the first step, is to automatically generate formal evidence that is convincing to human auditors. By providing a proof of equivalence between the monitor and the specification as supporting evidence, we aim to build a certification argument based in formal methods rather than software engineering practices.

¹ For validation, Copilot supports reasoning about monitor specs using theorem provers [Goodloe 2016; Laurent et al. 2015].
The rest of the paper is structured as follows. In Section 2, we describe the Copilot language and illustrate the problem that CopilotVerifier addresses. In Section 3, we give a high-level description of CopilotVerifier, and we illustrate our work by showing bugs in Copilot that CopilotVerifier helped identify. Section 4 discusses the bisimulation relation between Copilot and C programs. Section 5 presents an example of evidence that the verifier produces for assurance cases. In Section 6, we elaborate on the design trade-offs made in CopilotVerifier. We close the paper with a discussion of related work (Section 7) and conclusions (Section 8).

COPILOT OVERVIEW
Copilot is an embedded domain-specific language (EDSL) implemented as a Haskell library. This section provides a brief overview of the Copilot language and the aspects of its semantics relevant to our verification task. Readers interested in further details should consult [Perez et al. 2020]. Readers familiar with Copilot can skip to Section 2.4 for a discussion of the challenge of verifying Copilot compilation. Note that we generally refer to programs written in the Copilot DSL as monitors, but they are also often called specs or specifications, reflecting their role in runtime verification.

Copilot Streams
The main programming abstraction in Copilot is the stream, which represents a discrete sequence of values over time. Values include integers, floating-point numbers, booleans, and compound types like structs and arrays. Copilot programs are evaluated one step at a time, either at regular intervals or as new data becomes available. The time between each value in a stream is abstract and application-dependent; it can be thought of as a constant corresponding to a unit of real time.
To interact with the system being monitored, Copilot programs may reference external streams, which generally represent values obtained from the system being monitored (e.g., sensor data). From these external streams, other intermediate streams of data are computed, eventually resulting in boolean streams that capture the conditions being monitored. At execution time, an external handler function is called whenever a trigger stream evaluates to true, which allows the execution environment to take action.
Figure 1 shows a complete Copilot program implementing a simple thermostat. The program monitors an external stream (temp) representing data from a temperature probe, firing a trigger whenever the temperature (avgTemp) is sufficiently above or below a fixed setpoint, each captured by a boolean stream in the spec. To provide some robustness against noisy data, the probe input data is smoothed by computing a sliding window average of the last 5 samples.
This program demonstrates how to construct streams of constant values using constant, as well as how to cast from a stream of Word8s to a stream of Floats using unsafeCast.² The thermostat program also demonstrates some operations on streams whose implementations differ from the Haskell functions of the same name. Operations such as multiplication (*), subtraction (-), and comparison (>) work pointwise over the elements of a stream. The (++) operator prepends a list containing a fixed number of samples to the front of a stream. Invoking sum n s will compute a stream whose i-th element is the sum of the elements of s from indices i through i + (n − 1).
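As a rough illustration of the sliding-window semantics (a model we sketch here for exposition, not Copilot's implementation), the i-th output of sum n s can be computed directly from a concrete sample history:

```c
#include <stddef.h>

/* Hypothetical model of Copilot's `sum n s` over a concrete sample history:
   out[i] = s[i] + s[i+1] + ... + s[i+n-1]. The caller must supply at least
   len + n - 1 input samples. */
void sliding_sum(const float *s, size_t n, float *out, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++)
            acc += s[i + j];
        out[i] = acc;
    }
}
```

Dividing each output by n would give the sliding average used by avgTemp in the thermostat example.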
In addition to unfolding helper definitions to produce the intermediate syntax, the reification step checks that the streams are well-formed. Well-formedness is a syntactic check that ensures that no computations violate temporal causality (i.e., they depend only on "past" values) and that they require only a finite history to compute. These properties ensure that monitors can be implemented using a step-by-step strategy that requires only a constant amount of memory.
The reified version of the thermostat example is shown in Figure 2. Here, it becomes clear how the sliding window calculation is unfolded into a stream computation involving the drop operator to sum up the 5 most recent values of the stream named s0. Given a number n, drop n drops the first n elements from a stream, where the stream must have at least n elements of history available. As a result, the s0 stream must have a history of at least 5 values, and we must initialize it with an array of constant values. The cast operator is the reified counterpart to unsafeCast in the stream language.

Compiling to C Code
The CopilotC99 library takes a reified monitor and generates C code. The generated code is not a standalone program and is intended to be used as part of an existing control system. To facilitate integration, the code is free of external dependencies and targets the C99 subset of the C language.
There is a straightforward translation from each possible stream value to a C value. For instance, an Int32 in a stream program would be translated to an int32_t in the generated C code, and similarly for other scalar types. Compound stream values such as structs and arrays are translated to C structs and arrays with corresponding struct field names and array lengths.
While Copilot streams are conceptually infinite sequences, Copilot-generated C programs only use a finite amount of memory. Each stream is translated to a ring buffer, implemented as an array with fixed length equal to the history needed to compute the stream. Each buffer has an associated index that tracks the current position in the array, which is incremented as time advances. Figure 3 shows the generated C code for the thermostat example. The s0 stream is translated to C as a ring buffer with 5 elements, just large enough to store the initial 5 temperature values. This ring buffer will be used to compute the current values for the various streams appearing in the trigger definitions, and it will be updated based on the current value of the external temperature stream on each tick. A more complicated monitor may have additional ring buffers of various lengths, each of which will be updated as necessary.
The step function advances the state of the program by a single tick. The following diagram shows how step computes subsequent temperatures from the previous time step and inserts them into the appropriate position in the buffer, tracked by the index value s0_idx. This diagram assumes that the inputs from the external temperature stream begin [150, 95, ...]. One additional temperature value is computed at each time step, which is marked in the diagram in red italic font.
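The buffer-update discipline can be sketched as follows. This is a simplified stand-in written for illustration, not the actual CopilotC99 output; the real generated step function also recomputes stream expressions and checks triggers on each tick.

```c
#include <stddef.h>

/* Simplified model of the generated ring buffer for a 5-element window.
   The newest sample overwrites the oldest slot, and the index advances
   modulo the buffer length. */
static float s0[5] = {19.5f, 19.5f, 19.5f, 19.5f, 19.5f};
static size_t s0_idx = 0;

/* Returns the (i+1)-th oldest value currently held in the buffer. */
float s0_get(size_t i)
{
    return s0[(s0_idx + i) % 5];
}

/* One tick: consume the next (already converted) temperature sample. */
void step_model(float temp)
{
    s0[s0_idx] = temp;
    s0_idx = (s0_idx + 1) % 5;
}
```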
Aside from updating the ring buffer, the step function is responsible for checking if any of the monitored conditions have been met and, if so, firing the corresponding triggers. The trigger functions heaton and heatoff are only given forward declarations; they are meant to be defined by the application that links against the Copilot monitor.

The Challenge of Verifying Copilot Compilation
We have seen how the thermostat example has been translated from a stream-based program (Figure 1) to a C program (Figure 3), but can we be sure that they actually behave in identical ways? With careful evaluation, one can see that the values in the s0 buffer, as computed by s0_get(0), s0_get(1), ..., and s0_get(4) at a given time step t, will match the t-th, (t + 1)-th, ..., and (t + 4)-th elements of the avgTemp stream. As a result, the heaton and heatoff trigger functions in C will be called if and only if the corresponding trigger streams evaluate to true, and the arguments to the trigger functions will always equal the value of the avgTemp stream at that time step. In this sense, the two programs exhibit the same behavior. Most Copilot monitors in the wild are significantly more involved than this example, however, and establishing a correspondence between the monitor and the generated C code is substantially harder.

VERIFIER OVERVIEW
CopilotVerifier runs alongside the CopilotC99 compiler and formally verifies that the results are correct. It does this by proving that the semantics of the input Copilot monitor correspond to the semantics of the generated C program. In most cases, the verifier requires no input annotations beyond what is supplied to the compiler (we discuss a few exceptions in Section 6.3). The full implementation of the verifier can be found in our supplementary artifact [Scott et al. 2023].
The architecture of CopilotVerifier is depicted in Figure 4. At its heart is a comparison between the input Copilot program, which computes over streams (1), and the compiled C program, which represents streams in real memory (2). To enable this comparison, both the input and compiled programs are represented by collections of formulas in What4, an SMT interface library [Hendrix et al. 2020]. The relationship between the two semantics is then established by the bisimulation prover (3). CopilotVerifier reports back to the user whether it solved all of the generated proof goals successfully. If it succeeds, the verifier can produce an assurance argument that the proof is valid, as explained in Section 5. If not, the verifier identifies which goals were falsified.
Intuitively, programs at both the Copilot and C levels are translated to transition systems corresponding to the program control-flow graph. The formulas generated in What4 encode the effect of a single transition on the state, either streams or memory. The generated transition systems are intentionally very similar in structure, so the main task of the bisimulation proof is to demonstrate that the states stay in correspondence (see Section 4).
The translation from Copilot programs to What4 is performed by CopilotTheorem, which we developed. Because Copilot is a functional language, it is relatively simple to encode the semantics of each program step in an SMT style. For instance, Copilot's integer arithmetic operations involving streams translate straightforwardly to SMT-Lib's fixed-size bitvector operations, so CopilotTheorem would translate the expression x + 42 :: Stream Int32 into an SMT formula resembling (bvadd x (_ bv42 32)). There are similarly direct translations for all other stream operations, with the exception of floating-point operations, a special case that we discuss further in Section 6.3.1.
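Note that bvadd has wraparound (two's-complement) semantics at 32 bits, which matches Int32 addition in Copilot. A small C model (our illustration; the actual translation emits SMT-Lib, not C) makes the correspondence concrete, using unsigned arithmetic to avoid C's undefined behavior on signed overflow:

```c
#include <stdint.h>

/* Model of (bvadd x (_ bv42 32)): 32-bit wraparound addition. The cast
   back to int32_t is implementation-defined in C99 but behaves as the
   expected two's-complement reinterpretation on mainstream platforms. */
int32_t add42_bv(int32_t x)
{
    return (int32_t)((uint32_t)x + 42u);
}
```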
It is much more complex to faithfully capture the semantics of a C program. To do this, we lean on a pre-existing tool, Galois's Crucible symbolic simulation library [Christiansen et al. 2019]. Specifically, CopilotVerifier compiles the C program to LLVM bitcode and uses Crucible's LLVM backend to simulate it. The result is a collection of What4 formulas that precisely represent the LLVM program's semantics. Crucible provides an accurate model of LLVM, intended for industry formal methods applications. For example, Crucible has previously been applied in tools used to verify industry cryptographic libraries [Boston et al. 2021; Chudnov et al. 2018; Dockins et al. 2016].

BISIMULATION PROOF STRUCTURE
The core theorem we wish to prove is extensional equality. That is, the Copilot program and its compiled C representation behave identically in the input-output behavior that can be observed by the execution context. Only trigger functions can be observed in Copilot, which leads us to the following more formal description. A Copilot program P and its compiled form C are extensionally equal if, for any arbitrary input stream, the following holds at every time step:
• The same set of trigger functions is called in P and in C, with the same arguments.
• P has crashed iff C has crashed.
Proving extensional equality between arbitrary programs is difficult, but the Copilot program and the compiled program are intentionally very similar in their structure. Consider again the Copilot thermostat program in Figure 1 and the resulting C code in Figure 3 (we will use this as a running example in this section). Let us assume that the C program's first n inputs correspond to the first n values of the external stream inputs. Intuitively, after n calls to the step function in the C program, the state of the ring buffers should be equal to the values of the corresponding stream expressions at index n. Moreover, the trigger functions in the C program should be called from the step function at the same times when the corresponding stream expressions evaluate to true.
We can view a Copilot stream program and its generated C program as labeled transition systems (LTSes). To prove correctness, CopilotVerifier constructs a bisimulation relation between the two systems. Intuitively, the proof shows that the two systems start in corresponding states, and that every transition in one system is matched by a transition to a corresponding state in the other system. This has the effect of proving extensional equality because it shows that trigger functions are called in corresponding states, and that the two systems transition to crashing states at the same time.
To be precise, CopilotVerifier does not prove the final bisimulation property. Rather, it proves a set of per-transition properties whose conjunction implies a bisimulation. Verifying the bisimulation directly would take us outside the logical fragment that can be easily reasoned about in SMT solvers.

Programs as Transition Systems
Formally, an LTS consists of a set of states, a set of labels, and a set of labeled transitions between pairs of states. Consider again the thermostat example (Figures 1 and 3). Let us assume that the inputs from the external temperature stream begin [150, 95, ...]. In Celsius, these roughly correspond to 38.2° and 5.9°. Note that the triggers heaton and heatoff will fire if the sliding average of the previous five temperatures dips below 18° or if it exceeds 21°, respectively.
Copilot is a stream processing language based on functional programming ideas. As a result, it has no explicit state in its semantics beyond the (immutable) input streams and the current time step. Instead, the values of stream expressions are calculated from the input stream at the current time step and its prefix. A sequence of transitions for the thermostat stream program is pictured in the upper diagram of Figure 5.
In a C program, each state consists of the global memory that contains the ring buffers and their current indices. The transition relation is defined by the generated step function. An LTS for the thermostat C program is pictured in the lower diagram of Figure 5. This program's global memory only tracks one buffer, s0, which holds the five most recent temperatures. The global memory also tracks s0_idx, the current index into the buffer. In each transition, a newly sampled temperature value is placed in the position that s0_idx points to, after which the index is incremented.
The same set of labels is used for both the stream and C LTSes. Each label is associated with the observable events required for the transition to occur. In the stream spec, the events correspond to trigger streams evaluating to true. In C, the events correspond to the step function invoking the corresponding trigger functions.
In Figure 5, each transition records whether the heaton and heatoff triggers were fired. The heatoff trigger fires in transition (B), as adding 38.2° makes the sliding average temperature 23.2°, which exceeds the upper bound that avgTemp is checked against. In transition (C), however, heatoff no longer fires, as adding 5.9° lowers the average temperature to 20.5°.

Correspondence Relation
We now define a correspondence relation between stream and C program states. Copilot is designed so that each stream always has a finite window of past values that can be accessed at any point in the program. Consider a Copilot program containing a stream s with a window of n values. Let buf be the ring buffer in the C program that corresponds to s, and let idx be the current index into buf. A stream program state at time step t is related to a C program state by the correspondence relation if and only if the value of s at index t + i is equal to the value of buf at index (idx + i) mod n, for each i ranging from 0 to n − 1. We lift this to sets of streams in the obvious way.
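A concrete (non-symbolic) rendition of this relation, written here for illustration, checks each window position against the ring buffer under the modular indexing just described:

```c
#include <stdbool.h>
#include <stddef.h>

/* Concrete model of the correspondence relation: at time step t, stream
   element t + i must equal buf[(idx + i) mod n] for each i in [0, n).
   `stream` holds a prefix of the stream's values; `buf` is the C buffer. */
bool in_correspondence(const float *stream, size_t t,
                       const float *buf, size_t idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (stream[t + i] != buf[(idx + i) % n])
            return false;
    }
    return true;
}
```

The verifier establishes this relation symbolically over all inputs rather than testing it pointwise as this sketch does.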
The thermostat example has a single stream definition s0 that retains window = 5 previous values. At time step 0, the s0 stream's first five elements are 19.5 due to the use of replicate window 19.5 in its definition. In C, the values of s0_get(i) (as i ranges from 0 to 4) are also 19.5, as these are the initial values of the s0 buffer before running step. Therefore, the two programs are in correspondence at time step 0. We can also intuit that the two programs correspond at subsequent time steps. For instance, at time step 1, the s0 stream would produce 38.2 as its fifth value, which would match the value of s0_get(4) after a single invocation of the step function on the C side.

Proving the Bisimulation
The goal of CopilotVerifier is to demonstrate that the correspondence relation is a bisimulation. That is, for every pair of states (s, c) in the correspondence relation, if s transitions to s′ in the stream program with label l, then there must exist a C program state c′ such that c transitions to c′ with label l. The converse must also be true: if c transitions to c′, then there must exist a stream program state s′ such that s transitions to s′ with label l. It is straightforward to identify which transitions in each LTS correspond to each other, as the i-th time step in the stream program corresponds to the i-th invocation of step in the C program. The main challenge, then, is to demonstrate that the stream values equal the ring buffer values at each time step. The verifier does this by proving three properties about the correspondence relation:
(1) The initial states of the programs are in the correspondence relation.
(2) For each pair of states (s, c) in the correspondence relation, there exist a stream program state s′, a C program state c′, and a label l such that s transitions to s′ with label l, c transitions to c′ with label l, and (s′, c′) is in the correspondence relation.
(3) For each label l used in the transition relations, the triggers for l fire in corresponding ways in both programs.
Note that the definition of a bisimulation has two directions: one direction in which the stream program state s′ is universally quantified and the C program state c′ is existentially quantified, and another direction in which the order of quantification is reversed. Property (2), on the other hand, checks both of these directions simultaneously. This is done for practical reasons, as it is always clear what the existentially quantified states s′ and c′ will be after each transition. As such, the verifier combines both of these directions into a single step.
CopilotVerifier reduces each of these properties to SMT formulas. The proof principle of bisimulation itself is not amenable to SMT, as it falls outside of the first-order theories that SMT solvers understand. Likewise, the semantics of Copilot and C could potentially be reduced to SMT, but it would be impractical to do so. Instead, we reduce the individual proof obligations listed above into a series of lower-level logical statements that can be represented as SMT queries.

Initial State Correspondence.
The first proof obligation that the verifier must discharge is that the initial states of the two programs correspond. This is tantamount to taking the initial values in each ring buffer and proving that they are equal to the first n values of the corresponding stream at time step 0, where n is the length of the ring buffer. Due to the restrictions that Copilot places on programs, these first n values must be concrete and cannot depend on external inputs. As a result, this step is simple to translate to SMT queries and only requires evaluation of concrete values.
Thermostat example: We demonstrated initial state correspondence for the thermostat example in Section 4.2. The proof is tantamount to showing that the values of the s0 stream and ring buffer all equal 19.5 at time step 0, a total of five proof goals.

Transition Correspondence.
Most of the proof effort consists of demonstrating that the correspondence relation is preserved by transitions. In this phase of the proof, we must begin with completely symbolic program states at an arbitrary time step t. As a result, we must create fresh symbolic values for each stream definition and its corresponding ring buffer.
More precisely, let s range over the stream definitions, where each s is required to retain n previous values. Let buf be the ring buffer in the C program that corresponds to s, and let idx be the current index into buf. For each i ranging from 0 to n − 1, we create a symbolic value and assume that it is equal to both the value of s at index t + i and the value of buf at index (idx + i) mod n. We then advance the symbolic state of the stream and C programs once. Then, for each i ranging from 1 to n, we read the value of s at index t + i and check that it is equal to the value of buf at index (idx + i) mod n under the previous assumptions.
Advancing the symbolic state of the stream program is a matter of evaluating each stream expression at the next time step. For the C program, we advance the symbolic state by invoking the Crucible symbolic simulator to run the step function, which updates the memory used in the program. In addition to generating proof goals about bisimulation equivalence conditions, Crucible can also generate side conditions that relate to the memory safety of the program. For instance, each memory access into a ring buffer could potentially involve out-of-bounds indexing. If CopilotC99 generates C code correctly, all such accesses should be within bounds, but this must be checked as part of the query submitted to the SMT solver. The simulator also generates side conditions related to C operations with undefined behavior; see Section 4.3.4 for details.
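The shape of this transition check can be seen in a concrete miniature (our illustration; the real verifier performs the check symbolically, so that it covers all inputs at once rather than one sample history):

```c
#include <stdbool.h>
#include <stddef.h>

#define N 5 /* window size of the s0 stream in the thermostat example */

/* Stand-in for the C program state: the ring buffer and its index. */
typedef struct { float buf[N]; size_t idx; } CState;

/* Model of the generated step function: overwrite the oldest slot. */
static void c_step(CState *st, float input)
{
    st->buf[st->idx] = input;
    st->idx = (st->idx + 1) % N;
}

/* One transition check: assume correspondence at time t, advance the C
   side by one step, and re-check correspondence at time t + 1. */
bool transition_preserves(const float *stream, size_t t, CState *st)
{
    for (size_t i = 0; i < N; i++)        /* pre-state assumption */
        if (stream[t + i] != st->buf[(st->idx + i) % N]) return false;
    c_step(st, stream[t + N]);            /* advance the C side */
    for (size_t i = 0; i < N; i++)        /* post-state check */
        if (stream[t + 1 + i] != st->buf[(st->idx + i) % N]) return false;
    return true;
}
```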
Thermostat example: Ten goals involve checking symbolic values for equality, with two goals discharged for each element that the s0 stream retains. Thirty goals involve checking that the array indexes in s0_get and step are within bounds. Four goals involve checking that the trigger functions fire in equivalent ways, as described in Section 4.3.3. In total, the verifier discharges 44 goals.

Triggers.
The proof must establish that, for each trigger function in the generated C program, the trigger function is called if and only if the corresponding trigger stream in the stream program evaluates to true. In a real setting, the application linked against a Copilot monitor would implement the trigger functions. However, we are verifying the generated monitor code in isolation, so we do not have implementations of the trigger functions available. Instead, the verifier creates a stub implementation for each trigger function that captures the arguments and the path condition under which it was called. After symbolic simulation finishes, the captured arguments and path condition are asserted to be equivalent to the corresponding trigger stream and arguments from the stream program.
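A stub for the heaton trigger might look roughly as follows. The capture mechanism shown is a concrete simplification of what the verifier does symbolically (the verifier records the path condition under which the call occurs, not a boolean flag):

```c
#include <stdbool.h>

/* Illustrative trigger stub: rather than acting on the event, it records
   that the trigger fired and with what argument, so the call can later be
   compared against the corresponding trigger stream. */
static bool heaton_called = false;
static float heaton_arg = 0.0f;

void heaton(float avg_temp)
{
    heaton_called = true;
    heaton_arg = avg_temp;
}
```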
Thermostat example: Four proof goals arise from checking for equivalent behavior for the heaton and heatoff triggers in the example. Two goals check whether each trigger is fired in both programs during a transition. Two goals check that the arguments passed to each trigger correspond.

Partial Operations.
Another subtlety of the transition relation step of the proof is how to handle partial Copilot operations. These range from division, which can fail if the second argument is zero, to signed integer arithmetic, which can overflow. If a partial operation is used on an input for which it is not defined, it can result in undefined behavior in the generated C code.
One way to handle partial operations is to take an uncompromising approach: if Crucible detects undefined behavior when simulating the generated C code, it aborts and causes the proof to fail. This ensures that Copilot specifications do not have any misbehavior but, if a spec does misbehave, the verifier will simply fail and not reveal anything about the rest of the spec.
Another way to handle partial operations is to prove that a Copilot spec is "crash-equivalent" to its corresponding C program. That is, if the C program invokes a partial operation on undefined inputs, the verifier will check that this corresponds to an invocation of the corresponding operation in the Copilot spec on the same inputs. CopilotVerifier supports both the uncompromising and crash-equivalence approaches as user-configurable options.
To check for crash-equivalence during symbolic simulation, the verifier analyzes every invocation of a potentially partial operation in the stream program and generates a side condition stating that the operation is only invoked on well-defined inputs. During the transition step of the proof, the verifier assumes these side conditions before starting symbolic simulation. Therefore, if the simulator generates any side conditions due to partial operations in the C program, they should be dischargeable using the corresponding side conditions from the stream program.
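For signed 32-bit division, for example, the side condition assumed on the stream side and re-discharged on the C side amounts to the following predicate (an illustrative rendering, not the verifier's internal formula):

```c
#include <stdbool.h>
#include <stdint.h>

/* Side condition under which int32_t division is well defined: the divisor
   must be nonzero, and INT32_MIN / -1 is excluded because its result
   overflows in two's complement. */
bool sdiv_defined(int32_t x, int32_t y)
{
    return y != 0 && !(x == INT32_MIN && y == -1);
}
```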

ASSURANCE CASES
CopilotVerifier is the first step in a longer project: integrating formal methods into safety-critical deployment practices. Our ultimate goal is for Copilot code to be deployed with a safety case largely constructed by automated means. Existing standards such as DO-333 provide guidance on how formal methods evidence should be handled. However, there are few examples of successful formal methods deployments in safety-critical practice. See [Cofer and Miller 2014; Wagner et al. 2017] for a discussion of the issues involved in certification and formal methods.
To use Copilot in safety-critical systems, the evidence provided by our tools must be understood and accepted by human auditors.This is challenging because CopilotVerifier performs most of its reasoning through complicated SMT queries.These queries are difficult to analyze manually without some way of connecting these queries back to higher-level requirements.
We address this challenge by giving CopilotVerifier an optional setting that, when enabled, displays each Crucible proof goal generated during a successful run of the verifier. Each proof goal has an accompanying high-level description of what it demonstrates, such as a symbolic stream value being equal to its corresponding C ring buffer value. Each proof goal also has an associated What4 formula and SMT query representing the goal. With this information, each portion of a proof can be broken down into lower- and lower-level requirements until eventually reaching SMT, with a chain of evidence linking each intermediate step.
Using the evidence that CopilotVerifier produces, one can construct a complete assurance case that is more amenable to certification. Figure 6 shows a sketch of what an assurance case would look like for the thermostat programs from Figures 1 and 3. The evidence is captured using Goal Structuring Notation (GSN) [Kelly and Weaver 2004], which presents each goal in a CopilotVerifier safety case in a way that emphasizes the relationship between high-level parent goals and lower-level child goals. Aside from proof goals, a GSN diagram can also be used to document assumptions that the verifier makes during verification, such as those discussed in Section 6.
Making CopilotVerifier's evidence acceptable to auditors is ongoing work. We plan to work with auditors to iteratively improve formats and explanations, with the goal of providing convincing evidence in a suitable format.

DESIGN TRADEOFFS
When designing CopilotVerifier, our goal was to create a verification tool that achieves high assurance at modest cost. This section explains the tradeoffs involved in achieving this.

Trusted Computing Base (TCB)
We trust the underlying C toolchain. CopilotVerifier relies on Clang to compile C into LLVM bitcode, which becomes the basis for producing the semantics of the C file. Bugs in Clang may affect the soundness of the verifier. We consider this risk mitigated by the fact that Clang is a well-tested compiler and that CopilotC99 targets a well-understood subset of C, reducing the likelihood of triggering compiler bugs. The Copilot developers have experimented with using CompCert to verify the compiled binary code [Goodloe 2016; Leroy 2009], which would remove this portion of the TCB. However, CompCert does not target many of the processors that Copilot users rely on.
We trust that Crucible's LLVM backend faithfully encodes the semantics of LLVM bitcode. Errors in this part of Crucible could affect the soundness of the verifier. To justify this trust, we note that Crux [Tomb 2020], a verification tool also based on Crucible, has been tested on a large number of C verification problems from the SV-COMP verification competition [Scott et al. 2021].
We also trust that the CopilotTheorem library accurately encodes the semantics of Copilot stream programs. At present, we do not have a robust way to test these semantics beyond careful manual engineering and comparison with the Copilot interpreter.
Because of the way we model trigger functions, we make a number of implicit assumptions about how the implementations of those functions must behave. In particular, we assume that trigger functions do not modify any memory under the control of the Copilot program, including its ring buffers and stack. We also assume that the trigger functions are memory-safe and do not perform any undefined behavior. Responsibility for enforcing these assumptions lies with the user who supplies definitions for the trigger functions.
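To make these assumptions concrete, a trigger implementation that satisfies them might look like the following C sketch. The trigger name and its private state are illustrative, not taken from any generated code:

```c
#include <stdint.h>

/* Illustrative trigger implementation that respects the verifier's
 * assumptions: it writes only to state it owns (a private counter and a
 * private copy of its argument), never to the monitor's ring buffers or
 * other memory under the Copilot program's control, and it performs no
 * undefined behavior. */
static int     heatoff_count = 0;
static int32_t heatoff_last  = 0;

void heatoff(int32_t temperature) {
    heatoff_count++;
    heatoff_last = temperature;
}

/* Accessors for the trigger's own state, for use by the application. */
int     heatoff_calls(void)    { return heatoff_count; }
int32_t heatoff_last_arg(void) { return heatoff_last; }
```

A trigger that, say, wrote into the monitor's ring buffers or called into code with undefined behavior would violate these assumptions and invalidate the verification result.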

C Compiler Optimizations
CopilotVerifier assumes that all streams in a Copilot spec have corresponding static, global arrays in the generated LLVM bitcode. This assumption simplifies the work done by CopilotVerifier, as it can associate each stream with a distinct, top-level array.
While this assumption is safe to make when the generated C code is compiled at low optimization settings, it is not safe at higher settings. For instance, consider a stream containing a single value, which the verifier assumes is translated to an array of length 1. At -O1 or higher, Clang replaces length-1 static arrays with scalar values, which breaks the verifier's assumptions.
The verifier mitigates these issues by always invoking Clang with -O0. This makes the generated LLVM bitcode more predictable, at the expense of only verifying code compiled at low optimization levels.
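For illustration, the array layout the verifier expects at -O0 can be sketched as follows. The identifiers are illustrative, not the exact names emitted by the Copilot compiler:

```c
#include <stdint.h>

/* At -O0, a stream holding a single value corresponds to a static,
 * global, length-1 array. At -O1 or higher, Clang may promote such an
 * array to a scalar, breaking the verifier's stream-to-array mapping. */
static int32_t s0[1] = { 70 };

int32_t s0_get(void)        { return s0[0]; }
void    s0_put(int32_t v)   { s0[0] = v; }
```

Once the array is scalarized by the optimizer, there is no top-level array left for the verifier to associate with the stream, which is why verification is restricted to -O0 output.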

Limits of Automation
CopilotVerifier aims to be as push-button as possible. However, in exceptional circumstances, a statement may be true but cannot be automatically verified: that is, the verifier is sound, but not complete. Users may need to alter the original spec in order to make it amenable to verification.

6.3.1 Floating-Point Support. Copilot provides a variety of floating-point operations, including transcendental functions and other primitive functions. SMT solver support for floating-point values is limited, however, and support for robustly handling special functions is scarcer still. Nevertheless, we wish to have some level of support, as many Copilot monitors make essential use of floating-point operations. For instance, detecting whether an unmanned aircraft system is well clear [Upchurch et al. 2014] uses the square-root function to compute the lengths of vectors.
CopilotVerifier treats floating-point operations as uninterpreted functions, leaving the semantics of the operations abstract. This is sound, as the verifier need only demonstrate that a Copilot program applies the same floating-point operations as the corresponding C program, and in the same order. The downside to this approach is that reasoning about floating-point operations is somewhat fragile. The verifier relies on Clang not optimizing the operations to the point where they differ from the stream program's operations. As an example, consider ctemp from Figure 1: the reified Copilot spec contains a division of two streams, which the verifier leaves uninterpreted rather than evaluating it to 0.5882353. The generated C code, on the other hand, contains 150.0f / 255.0f, and C compilers perform constant folding on this expression, even at low optimization levels. As a result, the stream semantics would contain an uninterpreted division function but the C semantics would not, leading an SMT solver to conclude that they are not equivalent.
We take some measures to mitigate this issue, such as invoking C compilers at low optimization settings (see Section 6.2) and avoiding the use of the -ffast-math flag. Under these settings, C compilers, and Clang in particular, are more reluctant to rearrange floating-point code, which results in more programs successfully verifying. Nevertheless, the example above shows that this is not foolproof, and users may need to rearrange code to make the operations align in just the right way.
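The constant-folding pitfall can be observed directly: even though the source contains a division, the compiler is free to evaluate it at compile time, leaving no division operation for the verifier to match against the stream program's uninterpreted division. A minimal C illustration of the expression in question:

```c
/* The constant division from the ctemp example. A C compiler may fold
 * 150.0f / 255.0f to the literal 0.5882353f at compile time, even at
 * low optimization levels, so no division instruction need remain in
 * the emitted code. */
float ctemp_scale(void) {
    return 150.0f / 255.0f;
}
```

Whether or not folding occurs, the computed value is the same; what changes is the shape of the extracted semantics, which is all the uninterpreted-function encoding can see.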
6.3.2 Invariants for Partial Operations. As detailed in Section 4.3.4, the verifier has a mode for aborting the proof early if it finds a partial operation applied to undefined inputs. In this mode, the verifier does not try to infer invariants needed to make operations well defined. For example, consider a spec with a trigger streamAdd that computes stream + 1. When the abort-early mode for partial operations is enabled, this example will fail to verify, as stream + 1 could result in signed integer overflow if a value in stream is equal to the maximum value of an Int32. It is possible, however, that the applications that use this monitor maintain the invariant that the values in stream are always less than the maximum Int32 value. If that is the case, the user must communicate this invariant by declaring a property in the spec:

    spec :: Spec
    spec = do
      prop "notInt32Max" (forall (stream < constI32 maxBound))
      trigger "streamAdd" ...
Here, prop is used to declare a property named notInt32Max, and the forall combinator is used to express that the property should hold for all elements in the stream. The verifier must then be instructed to assume the notInt32Max property during verification so that the addition is well defined.
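In C terms, the effect of assuming the invariant can be sketched as follows; the names are ours, and the assertion stands in for the assumed notInt32Max property:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: under the assumed invariant notInt32Max (x < INT32_MAX), the
 * side condition for the partial operation x + 1 (no signed overflow)
 * holds, so the addition is well defined. */
static int32_t stream_add(int32_t x) {
    assert(x < INT32_MAX);  /* assumed invariant: notInt32Max */
    return x + 1;           /* cannot overflow given the invariant */
}
```

During verification the invariant is an assumption rather than a runtime check, so its truth is the application's responsibility, as with the trigger assumptions in Section 6.1.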

RELATED WORK
Copilot was originally based on Lustre [Caspi et al. 1987] and LOLA [D'Angelo et al. 2005]. Other similar RV frameworks include RMOR [Havelund 2008], which compiles to C, and Proteus [McClelland 2021; McClelland et al. 2021], which compiles to C++. CopilotVerifier shares similarities with other translation-validation-based compilers, such as previous efforts to translate SIGNAL [Pnueli et al. 1998] and Simulink [Ryabtsev and Strichman 2009] to C, as well as the Alive2 bounded translation validation tool for LLVM [Lopes et al. 2021].
Most other RV frameworks do not verify their compiled code against the specifications. A notable exception is Lustre, which boasts a verified compiler named Vélus [Bourke et al. 2017, 2019, 2021]. Vélus is an ambitious project that verifies the Lustre compilation pipeline within the Coq theorem prover by building on CompCert [Leroy 2009]. This is in contrast to CopilotVerifier's translation validation approach, which only allows verifying individually translated programs in isolation.
Vélus first translates Lustre's stream functions into a synchronous transition code (STC) language based on state transitions and values rather than streams. STC is, in turn, translated into an object-oriented language (Obc), with each Lustre node represented as a class with its own memory, possessing a reset method performing initialization and a step method to process the next instant in time. Obc is translated into CompCert's Clight, which is then translated into assembly. Coq is used to manually prove an equivalence at each translation step. The proofs are aided by a memory semantics given in terms of streams of memory trees. The culmination of the effort is a bisimulation theorem relating the behavior of a Lustre node and the generated assembly code.
An approach grounded in manual proof can be more complete than CopilotVerifier's automated approach. For instance, the difficulties that we encountered with floating-point operations (Section 6) could be surmounted using a Coq formalization of floating-point computation [Boldo and Melquiond 2011; Ramananandro et al. 2016]. On the other hand, adapting Vélus to build a verified Copilot compiler would be non-trivial. Although the core stream languages are very similar, Vélus introduces complexities such as the object-oriented intermediate language to accommodate Lustre features not present in Copilot. Alternatively, one could build a verified Copilot compiler from scratch, but in either case, considerable resources would need to be dedicated to the effort. Our approach to Copilot verification was influenced by the need to assure an existing compiler for a DSL that is in use at NASA, and to accomplish our task under significant resource and time constraints, hence demonstrating the applicability of the technique to many industrial projects.
Previous versions of Copilot [Pike et al. 2012, 2013] performed limited verification of C code by compiling with two different backends and verifying the equivalence of the generated programs using CBMC [Clarke et al. 2004]. This approach can potentially catch many of the same issues as CopilotVerifier, but it does not prove that the C programs match the original Copilot monitors.
Current versions of Copilot use property-based testing [Claessen and Hughes 2000; Fink and Bishop 1997] for added assurance. Although techniques such as property-based testing could catch some of the same bugs that CopilotVerifier was able to detect, they would be unlikely to detect some of the memory safety bugs uncovered by CopilotVerifier (some of which went unnoticed until recently [GitHub 2022c]).
There have been efforts in Copilot to generate C code that includes Hoare-logic-style contracts as function comments, where the contracts are derived from the parts of the stream program that the C function should implement. The Frama-C static analyzer [Cuoq et al. 2012] is then used to prove that the C code satisfies the contracts [Goodloe 2016]. This approach is not fully automated: manual edits are sometimes required for the generated contracts to pass the analysis checks.

CONCLUSIONS
We have presented CopilotVerifier, a verifier that proves correspondences between high-level Copilot monitors and the low-level C code generated by the Copilot compiler. Our aim was to increase confidence in the Copilot compiler while working within a realistic engineering budget. Through careful design choices and strategic use of existing tools, we have achieved this goal.
In the future, we plan to use CopilotVerifier to construct assurance cases that are amenable to certification. Specifically, the NASA tool Ogma uses Copilot to produce complete monitoring applications for the NASA Core Flight System, Robot Operating System 2, and FPrime [Perez et al. 2022]. We plan to extend Ogma to leverage CopilotVerifier to produce evidence of assurance for the generated applications.

ACKNOWLEDGMENTS
This manuscript has been authored by Ivan Perez, an employee of KBR under Prime Contract No. 80ARC020D0010 with the NASA Ames Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, either expressed or implied, of any of the funding organizations. The United States Government retains, and by accepting the article for publication, the publisher acknowledges that the United States Government retains, a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for United States Government purposes.
Fig. 3. C code generated for Figure 1. The code has been cleaned up for presentation purposes only.

Fig. 5. Two LTSes. The upper LTS represents the stream program in Figure 1. The lower LTS represents the C program in Figure 3. Transition labels are shared between both LTSes. Transition B fires the heatoff trigger.
Proc. ACM Program. Lang., Vol. 7, No. ICFP, Article 199. Publication date: August 2023.

[Figure 6 content: Goal 1: The generated C program is traceable to low-level requirements. Strategy 1: Establish a bisimulation relation between the stream and C programs. Goal 2: The initial program states are in the relation. ... Goal 3: The relation is preserved across a program transition. Goal 4 (Crucible): Assert that the s0 ring buffer value at index 0 is equal to the corresponding stream value after a transition. Goal 5 (Crucible): Assert that accessing the s0 ring buffer at index 0 is within the bounds of the array.]

Fig. 6. A sketch of a GSN diagram presenting a CopilotVerifier assurance case for the thermostat programs from Figures 1 and 3. For brevity's sake, only a subset of goals is shown.