Hyperblock Scheduling for Verified High-Level Synthesis

High-level synthesis (HLS) is the automatic compilation of software programs into custom hardware designs. With programmable hardware devices (such as FPGAs) now widespread, HLS is increasingly relied upon, but existing HLS tools are too unreliable for safety- and security-critical applications. Herklotz et al. partially addressed this concern by building Vericert, a prototype HLS tool that is proven correct in Coq (à la CompCert), but it cannot compete performance-wise with unverified tools. This paper reports on our efforts to close this performance gap, thus obtaining the first practical verified HLS tool. We achieve this by implementing a flexible operation scheduler based on hyperblocks (basic blocks of predicated instructions) that supports operation chaining (packing dependent operations into a single clock cycle). Correctness is proven via translation validation: each schedule is checked using a Coq-verified validator that uses a SAT solver to reason about predicates. Avoiding exponential blow-up in this validation process is a key challenge, which we address by using final-state predicates and value summaries. Experiments on the PolyBench/C suite indicate that scheduling makes Vericert-generated hardware 2.1× faster, thus bringing Vericert into competition with a state-of-the-art open-source HLS tool when a similar set of optimisations is enabled.


INTRODUCTION
High-level synthesis (HLS) is the automatic compilation of software programs into custom hardware designs. With programmable hardware devices (such as FPGAs) now widespread, HLS tools such as AMD Vitis HLS [4], Intel HLS Compiler [30], LegUp [12], and Bambu [19] are increasingly relied upon. But existing HLS tools are too unreliable for safety- and security-critical applications; indeed, random testing uncovered miscompilations in all four of the above tools [24].
Herklotz et al. [25] have begun to address this shortcoming by building Vericert, a prototype HLS tool that extends the CompCert verified C compiler [33]. Vericert guarantees reliability by being proven correct in Coq, but unfortunately, the hardware designs it generated could not compete performance-wise with those generated by unverified HLS tools. Its key weakness was that it serialised every operation, taking no advantage of the parallelism that custom hardware offers.
This paper reports on our efforts to close this performance gap by extending Vericert with a scheduling pass that collects operations into groups that can be executed in parallel.
Context. Many approaches to scheduling have been proposed over the years. Some, such as list scheduling [5, p. 257], only reorder instructions within a basic block. This means they squander opportunities for performance improvements that could be obtained by reordering instructions across branches. A more powerful alternative is trace scheduling [17, 20], which works by creating paths (or 'traces') through the code, across basic block boundaries, and then reorders the instructions within those paths. In its most general form, trace scheduling is considered infeasible on large programs, but two special cases called superblock scheduling [28] and hyperblock scheduling [34] have been developed, both of which impose restrictions on the form of traces in order to obtain tractable algorithms. Superblocks generalise basic blocks by allowing early exits, while hyperblocks generalise superblocks by additionally allowing each instruction to be predicated. CPUs often lack support for predicated execution, so superblock scheduling is the natural choice in that setting. But in custom hardware, predicated execution can be implemented efficiently, making hyperblock scheduling a natural fit for HLS. Indeed, the use of hyperblock scheduling in HLS was first proposed over two decades ago [9, 10], and is nowadays used by popular HLS tools such as AMD Vitis HLS [3], LegUp [11, p. 60], Google XLS [21, line 112], and Bambu [18, line 304]. Hyperblock scheduling is therefore a natural choice for Vericert to close the gap between verified and unverified HLS tools, and to model what existing tools are already doing. Support for the scheduling of predicated instructions is also important for future HLS-specific optimisations, such as loop scheduling, because these often assume that basic blocks can be merged using if-conversion.

Contributions. In this paper, we present:
• the first verified implementation of hyperblock scheduling, which is more general than Six et al.'s verified superblock scheduler [40] and more computationally tractable than Tristan and Leroy's verified trace scheduler [43],
• the first verified implementation of general if-conversion (a pre-scheduling pass that turns if-statements into hyperblocks), building on CompCert's naïve if-converter that only handles simple cases [1],
• a novel use of a verified SAT solver during translation validation in order to reason about predicates, and
• experiments on the widely used PolyBench/C suite [36] showing that scheduling makes Vericert generate 2.1× faster hardware, thus making it competitive with Bambu [19], a state-of-the-art open-source HLS tool, when a similar set of optimisations is enabled.

OVERVIEW
Our starting point is the first version of Vericert [25], which generates strictly sequential hardware designs: each clock cycle has just a single C operation mapped to it. Making those hardware designs parallel is actually not the challenging part of our work: Verilog synthesis tools do this implicitly. Indeed, it is easy to write Verilog so that an arbitrary sequence of C operations is mapped to a single piece of combinational logic that executes in just one clock cycle. But the problem with such unfettered parallelism is that these large pieces of combinational logic may have long critical paths, and thus lead to hardware that can only run at a low clock frequency. The actual challenge, then, is producing the right amount of parallelism. This is a scheduling problem. Vericert must decide how to schedule each block of instructions across one or more clock cycles so that when the downstream Verilog synthesis tool performs parallelisation, the resulting combinational paths will be short enough to meet the target frequency.
To clarify: the downstream synthesis tool is responsible for the parallelisation itself; Vericert does not perform it, nor does it guarantee its soundness. Rather, it predicts the parallelisation that the synthesis tool will perform, and schedules so that if parallelisation is performed as predicted, the target frequency should be met. What Vericert does guarantee is that if it reorders any instructions as part of the scheduling process, these reorderings preserve the program's (sequential) behaviour.
An overview of our solution is given in fig. 1, which shows the main components of our hyperblock scheduler and how they fit into the wider Vericert and CompCert projects.
Actually, our scheduler produces a list of groups of lists of instructions. The idea behind this three-level representation, which we call RtlPar, is that each inner list is a sequence of instructions that can be chained together, and each group contains instruction chains that we expect the downstream synthesis tool to place in parallel. This way, we support operation chaining, a long-established optimisation in hardware design [35, p. 1101]. The scheduler itself (step 6 in fig. 1) is written in unverified OCaml and works similarly to those in existing HLS tools [12]: it takes a set of scheduling constraints that capture the target clock frequency, available hardware resources, and dependencies between operations, encodes them all as a system of difference constraints (SDC) [15], and hands them off to a linear program solver. The CFG is then encoded (step 7) as a finite state machine (FSM) in a Verilog-like language called Hardware Transfer Language (Htl). This process is largely inherited from Vericert, but where Vericert produces FSMs that perform just one assignment per state, we produce FSM states with a sequence of assignments. A final pass (step 8) performs forward substitution [27, p. 109]. This turns each sequence of Verilog blocking assignments into an equivalent sequence of nonblocking assignments, which makes the downstream logic synthesis tool more likely to perform the parallelisation that our scheduler predicted. For example, a multiplication whose result feeds a dependent addition can be written either as two blocking assignments or, after forward substitution, as two nonblocking assignments. The two versions are semantically equivalent, but we find that the second, in which both right-hand sides must be evaluated before either assignment is performed, makes the downstream logic synthesis tools more likely to produce the hardware we intend (which, in this particular example, involves exploiting a fused multiply-accumulator unit if available). What remains is to ensure the correctness of each schedule produced in step 6. Following previous work on verified scheduling by Tristan and Leroy [43] and Six et al.
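The effect of this forward-substitution step can be sketched with a tiny interpreter (illustrative Python, not Vericert code; the multiply-accumulate program and all names are ours). Blocking assignments execute in order, while nonblocking assignments read the pre-update state, so substituting the multiplier's result into the accumulate keeps the two versions equivalent:

```python
# Illustrative sketch (not Vericert's implementation): Verilog-style
# blocking assignments execute in order, while nonblocking assignments
# evaluate every right-hand side against the old state first.

def run_blocking(state, assignments):
    state = dict(state)
    for lhs, rhs in assignments:       # later RHSs see earlier updates
        state[lhs] = rhs(state)
    return state

def run_nonblocking(state, assignments):
    state = dict(state)
    updates = [(lhs, rhs(state)) for lhs, rhs in assignments]  # old state
    for lhs, v in updates:
        state[lhs] = v
    return state

# Blocking: x = a * b; acc = acc + x
blocking_prog = [("x", lambda s: s["a"] * s["b"]),
                 ("acc", lambda s: s["acc"] + s["x"])]

# After forward substitution, the same computation as nonblocking
# assignments: x <= a * b; acc <= acc + a * b
nonblocking_prog = [("x", lambda s: s["a"] * s["b"]),
                    ("acc", lambda s: s["acc"] + s["a"] * s["b"])]

init = {"a": 2, "b": 3, "x": 0, "acc": 10}
print(run_blocking(init, blocking_prog))       # x = 6, acc = 16
print(run_nonblocking(init, nonblocking_prog)) # x = 6, acc = 16
```

Note that naively switching the original program to nonblocking assignments (without the substitution) would compute `acc` from the stale `x`, which is exactly why the substitution pass is needed.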
[40], we use translation validation, but dealing with hyperblocks brings additional complexity, as explained below.
Tristan and Leroy implement trace scheduling in its full generality. They use a tree to represent all the possible control-flow paths through a block. These trees can be "exponentially larger than the original code" [43, p. 25], which makes it prohibitively expensive to construct and compare the trees before and after scheduling, and thus undermines the usefulness of their scheduler.
Six et al. restrict their scheduler to superblocks. Since a superblock has only a single control-flow path (with early exits), the need for trees is avoided. This allows their validator to be "efficient even for large superblocks" [40, p. 53]. However, superblocks are less general than hyperblocks, and there are code patterns where superblock scheduling can lead to considerable code duplication that hyperblocks would avoid. Moreover, superblock scheduling is reliant on profiling and branch-prediction heuristics to pick a hot path through the program, should such a hot path even exist.
Hyperblocks can branch and merge control flow using predicated execution, and hence a single hyperblock can capture many control-flow paths without the exponential blow-up that Tristan and Leroy encountered. In particular, hyperblocks handle well the case where two branches of a conditional statement are executed equally often, unlike the superblocks that Six et al. use. However, our task of validating the equivalence of two hyperblocks is complicated by having to reason about predicates. Where the prior works only needed to check that the scheduled block contains a dependency-respecting permutation of the original block's instructions, we must additionally account for the fact that predicates may be modified during scheduling. For instance, the sequence p => i; !p => i, which executes i if p holds and then executes i if p does not hold, may be optimised to the single unpredicated instruction i.
The approach we take is to translate both the RtlBlock hyperblock and the RtlPar hyperblock (i.e., before and after scheduling) into their strongest postconditions, starting from the same symbolic initial state, and then comparing these postconditions for equivalence with the help of a SAT solver that we have programmed and verified in Coq. By being able to solve queries like (p ∧ !p) ↔ false, the SAT solver enables reasoning about reordering of instructions in a predicate-aware fashion.
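The kind of query this solver answers can be illustrated with a brute-force truth-table checker (a Python sketch standing in for the verified Coq SAT solver; the formula encoding is ours):

```python
from itertools import product

# Illustrative sketch: a brute-force equivalence checker standing in for
# Vericert's verified SAT solver. Formulas are True/False constants,
# variable names, or nested tuples such as ("and", "p", ("not", "p")).

def evaluate(f, env):
    if isinstance(f, bool):
        return f
    if isinstance(f, str):
        return env[f]
    op, *args = f
    if op == "not":
        return not evaluate(args[0], env)
    if op == "and":
        return evaluate(args[0], env) and evaluate(args[1], env)
    if op == "or":
        return evaluate(args[0], env) or evaluate(args[1], env)
    raise ValueError(op)

def variables(f):
    if isinstance(f, bool):
        return set()
    if isinstance(f, str):
        return {f}
    return set().union(*(variables(a) for a in f[1:]))

def equivalent(f, g):
    # check f <-> g over every assignment to the free variables
    vs = sorted(variables(f) | variables(g))
    return all(
        evaluate(f, env) == evaluate(g, env)
        for bits in product([False, True], repeat=len(vs))
        for env in [dict(zip(vs, bits))])

# (p ∧ ¬p) ↔ false holds: the two predicates are interchangeable
print(equivalent(("and", "p", ("not", "p")), False))
```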

NEW INTERMEDIATE LANGUAGES
Our work introduces two new intermediate languages: RtlBlock and RtlPar, which implement the sequential semantics of hyperblocks. They are based on CompCert's Rtl, but instead of mapping from states to instructions, RtlBlock maps from states to hyperblocks, and RtlPar maps from states to hyperblocks of instructions grouped into specific cycles.
Hyperblocks are made up of instructions as defined in fig. 2, where an arrow over a metavariable denotes a list and a trailing '?' denotes an optional parameter. Most instructions are similar to their Rtl counterparts, except that each instruction is now guarded by an optional predicate. One additional instruction sets a predicate (p) equal to an evaluated condition (r1 op_c r2). The other new instruction is the exit instruction E, which takes a control-flow instruction (i_cf) and allows for early exit from the hyperblock.
These instructions are used in both RtlBlock and RtlPar. The main difference between these two languages is how the instructions are arranged within the hyperblock and the execution semantics of the hyperblock. An RtlBlock hyperblock is a list of instructions, with a straightforward sequential semantics. An RtlPar hyperblock is a list of lists of lists of instructions, with nested blocks corresponding to where instructions should be placed in hardware. Each innermost list contains a chain of instructions that can be executed sequentially within a single clock cycle; each middle list contains a group of chains that can be executed in parallel; and the outermost list contains groups to be executed sequentially (in consecutive clock cycles). The existing CompCert semantics for Rtl is a small-step operational semantics defined on a CFG. At each step, the instruction in the CFG at the current program counter is evaluated in a context Γ. This is a 3-tuple comprising an environment Γ_Env that has global information about the program, a mapping Γ_R from registers to values, and a mapping Γ_M from memory addresses to values.
In order to give a semantics for RtlBlock and RtlPar, we need to handle predicates, so we turn Γ into a 4-tuple (Γ_Env, Γ_R, Γ_P, Γ_M), where the additional component Γ_P maps predicates to Booleans. Moreover, we need to deal with CFGs where each node is not just a single instruction, but a hyperblock. The semantics, which we provide in fig. 3, is denoted Γ ⊢ b ⇓ Γ′, stating that under the context Γ the hyperblock b evaluates to an updated context Γ′. It is a big-step semantics in the sense that it executes an entire hyperblock in a single step. The ExecInstr rule executes an arithmetic instruction if its guard evaluates to true (and there are similar rules for the other guarded instructions). Here we write Γ ⊢ e ↓ v for an evaluation function for operations: given Γ and an operation e, it computes a value v. The ExecInstrFalse rule handles the case where the guard does not hold. ExecExit handles the execution of an exit instruction by recording the control-flow instruction i_cf for leaving the block. The next two rules are for executing an instruction list, with BlockContinue handling the case where the head instruction does not exit the block, and BlockExit handling the case where it does. Finally, ExecRtlBlock and ExecRtlPar provide the semantics of RtlBlock and RtlPar blocks respectively. Both languages use the list execution semantics defined by the above rules, but RtlPar blocks are first flattened into a single list (via concat).
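A minimal interpreter conveys the flavour of these rules (illustrative Python; the rule names in comments follow the text, but the concrete instruction encoding is ours, not Vericert's Coq definition):

```python
# Illustrative sketch of the sequential hyperblock semantics. Each
# instruction carries an optional guard (predicate name, expected value);
# an exit instruction records the control-flow instruction and stops.

def exec_hyperblock(regs, preds, block):
    regs, preds = dict(regs), dict(preds)
    for instr in block:
        kind, guard = instr[0], instr[1]
        if guard is not None:
            name, want = guard
            if preds[name] != want:
                continue                      # ExecInstrFalse: guard fails
        if kind == "op":                      # ExecInstr: arithmetic op
            _, _, dst, fn, srcs = instr
            regs[dst] = fn(*[regs[s] for s in srcs])
        elif kind == "setp":                  # set a predicate from a condition
            _, _, p, fn, srcs = instr
            preds[p] = fn(*[regs[s] for s in srcs])
        elif kind == "exit":                  # ExecExit / BlockExit
            return regs, preds, instr[2]
    return regs, preds, None                  # block fell through

# p := r1 < r2;  p => r3 := r1 + 1;  !p => r3 := r1 - 1;  exit goto 10
block = [
    ("setp", None, "p", lambda a, b: a < b, ["r1", "r2"]),
    ("op", ("p", True), "r3", lambda a: a + 1, ["r1"]),
    ("op", ("p", False), "r3", lambda a: a - 1, ["r1"]),
    ("exit", None, ("goto", 10)),
]
print(exec_hyperblock({"r1": 1, "r2": 5}, {}, block))
```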
Note that these rules define a sequential semantics for RtlPar. That is, although RtlPar blocks contain lists of instruction chains that have been identified by the scheduler as being suitable for parallel execution, our semantics nonetheless executes them sequentially. This is because although the scheduler identifies where parallelism can be profitably extracted, it does not perform any parallelisation itself; this is left for the logic synthesis tool. (A parallel RtlPar semantics may allow more optimisations to be validated, but we save that for future work.) The overall behaviour B of an RtlBlock or RtlPar program is the same as that of existing CompCert intermediate languages. Either the program terminates with a return value and a finite trace of externally visible events, like external I/O events which can be produced by system calls (part of the i_cf instructions), or the program diverges with a possibly infinite trace of external events. Note that the top-level correctness theorem of Vericert does not include a trace of externally visible events, because the only external behaviour of the hardware is the final return value; nevertheless, until the hardware is generated, instructions emitting external behaviours are kept and reasoned about.

VERIFIED IF-CONVERSION
If-conversion introduces the predicated instructions that form hyperblocks. As mentioned in section 2, CompCert does have an if-converter already, but it applies only in a few special cases. We need a more general algorithm that can handle arbitrary branching code. We therefore present the first formally verified implementation of a general if-conversion algorithm, with support for heuristics for branch prediction. To simplify both the implementation of the if-converter and its correctness proof, it is split into three distinct transformations:
(1) Condition Elimination First, every conditional instruction in the block is replaced by two predicated goto instructions.
(2) Block Inlining This is performed by replacing a predicated goto instruction by the list of instructions in the block that it points to, adding the predicate to each of those instructions. This transformation only converts one level of blocks at a time, but it can be repeatedly invoked to create larger blocks, as shown in fig. 4. The pointed-to block is left unchanged in case it is still pointed to by other blocks; as such, this transformation performs tail duplication [13].
(3) Dead Block Elimination Finally, any blocks that are now unreachable from the function's entry point (such as block 3′′ in fig. 4) are removed, to reduce code size.
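The three transformations can be sketched on a toy CFG (illustrative Python; the dict-of-blocks encoding and all function names are ours, not Vericert's):

```python
# Illustrative sketch of the three if-conversion steps on a toy CFG.
# A block is a list of (guard, instruction) pairs; guard None means true.

def conjoin(g1, g2):
    if g1 is None: return g2
    if g2 is None: return g1
    return ("and", g1, g2)

def eliminate_condition(block, p, then_lbl, else_lbl):
    # (1) replace a trailing conditional by two predicated gotos
    return block + [((p, True), ("goto", then_lbl)),
                    ((p, False), ("goto", else_lbl))]

def inline_gotos(cfg, label):
    # (2) replace each predicated goto by the target block's instructions,
    # conjoining the goto's guard onto each of them; the target block is
    # left in place (tail duplication), one level at a time
    out = []
    for guard, instr in cfg[label]:
        if instr[0] == "goto":
            out.extend((conjoin(guard, g2), i2) for g2, i2 in cfg[instr[1]])
        else:
            out.append((guard, instr))
    return out

def dead_block_elimination(cfg, entry):
    # (3) drop blocks unreachable from the entry point
    reachable, stack = set(), [entry]
    while stack:
        lbl = stack.pop()
        if lbl in reachable:
            continue
        reachable.add(lbl)
        stack.extend(i[1] for _, i in cfg[lbl] if i[0] == "goto")
    return {lbl: b for lbl, b in cfg.items() if lbl in reachable}

cfg = {
    1: eliminate_condition([(None, ("add", "r1"))], "p", 2, 3),
    2: [(None, ("mul", "r2")), (None, ("goto", 4))],
    3: [(None, ("sub", "r2")), (None, ("goto", 4))],
    4: [(None, ("ret", "r2"))],
}
cfg2 = {**cfg, 1: inline_gotos(cfg, 1)}
pruned = dead_block_elimination(cfg2, 1)
print(sorted(pruned))   # blocks 2 and 3 became unreachable after inlining
```

Here one round of inlining merges both branches into block 1 as predicated instructions (the inlined gotos to block 4 remain, since only one level is converted at a time), after which blocks 2 and 3 are pruned.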
The decision about which goto instructions should be inlined is offloaded to an external procedure. This separation of concerns means that the correctness of the transformation can be proven once and for all for a single, general if-conversion algorithm, which can then be extended with various heuristics to change the performance of the generated code. In our implementation, we use simple static heuristics to pick these paths, following Ball and Larus [6], such as avoiding inlining loop back-edges, or blocks with an instruction count that exceeds a threshold (currently 50).
To prove the top-level end-to-end correctness theorem of Vericert, the forward simulations proven for each transformation are composed together. The correctness of if-conversion is itself stated as a forward simulation, whose proof we now sketch.
Proof sketch. Each of the three transformations is verified using the simulation-diagram approach [33, p. 379]. These simulations are then composed into an overall simulation for if-conversion. Condition elimination is straightforward because it is a purely local replacement. Dead block elimination is also straightforward, being similar to a CFG-pruning transformation from CompCertSSA [7]. The block inlining pass is a bit more involved. To see why, consider how to prove a forward simulation for the first transformation in fig. 4. The edges 1 → 1′, 2 → 2′, and 4 → 4′ are straightforward, but block 3 is tricky, because 3′ does not straightforwardly simulate 3 (there is no edge from 3′ that can mimic the edge from 3 to 4). To resolve this, we make our simulation relation a little more fine-grained, so that 3 can be mapped to the first 'part' of 3′ and 4 can be mapped to the second. □

IMPLEMENTING HYPERBLOCK SCHEDULING
This section discusses our implementation of hyperblock scheduling in Vericert. The scheduler takes each hyperblock of an if-converted RtlBlock program in turn, and schedules it to form an RtlPar hyperblock. The scheduler is unverified, but it uses a verified translation validation algorithm to prove each output correct, which we describe in section 6.
Our scheduler is written in OCaml, and follows the SDC scheduling approach [15]. SDC scheduling is widely used in HLS tools, and it has also been extended to support modulo scheduling of loops [47], which we hope to incorporate into our future work. The SDC scheduler generates a function that should be minimised plus a set of constraints that must be respected while doing so. In our case, the function we minimise is the overall latency of the block (i.e. the end time of the last operation). The constraints come from three sources. First, the cumulative latency of all the operations in each chain must not exceed a predefined limit; this ensures that operation chaining does not reduce the maximum clock frequency of the resultant hardware. Second, whenever operation o1 has a data dependency on o2, o2's end time must precede o1's start time. Third are resource constraints; for instance, one technical limitation of the Vericert back end is that it produces hardware with only a single RAM controller, which gives rise to the constraint that no two memory operations (loads or stores) may be scheduled for the same cycle.
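The essence of the dependence constraints can be sketched without an LP solver (Vericert itself hands the full SDC, including chaining and resource constraints, to lp_solve): difference constraints of the form start(oj) − start(oi) ≥ δ admit a longest-path solution, which yields an as-soon-as-possible schedule. A minimal Python sketch (encoding ours):

```python
# Illustrative sketch: data dependencies "op j starts at least delta
# cycles after op i" are difference constraints start[j] - start[i] >= delta,
# and a latency-minimising ASAP schedule is a longest-path computation.
# (A full SDC solver would also detect infeasible, cyclic constraints.)

def asap_schedule(n_ops, constraints):
    """constraints: list of (i, j, delta) meaning start[j] >= start[i] + delta."""
    start = [0] * n_ops
    for _ in range(n_ops):              # Bellman-Ford-style relaxation
        changed = False
        for i, j, delta in constraints:
            if start[i] + delta > start[j]:
                start[j] = start[i] + delta
                changed = True
        if not changed:
            break
    return start

# Ops 1 and 2 depend on op 0; op 3 depends on op 2 (one cycle each)
deps = [(0, 1, 1), (0, 2, 1), (2, 3, 1)]
print(asap_schedule(4, deps))   # [0, 1, 1, 2]: ops 1 and 2 share a cycle
```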
These are all passed to a linear program (LP) solver; we use lp_solve [8]. The solver outputs a mapping from instructions to states (clock cycles). We reconstruct from this mapping an RtlPar block. Data-dependent instructions mapped to the same state are placed into the same chain, at the innermost level of the RtlPar block; independent instructions mapped to the same state are placed in different chains in the same parallel group; and instructions mapped to different states are placed in different groups (the outermost list of the RtlPar block). Figure 5 shows an example of our scheduler in action. In fig. 5a, we see the RtlBlock hyperblock to be scheduled. It contains six predicated operations: two additions, three multiplications, and a predicate assignment. The scheduler analyses the hyperblock and constructs a dependency graph (fig. 5b). Each edge of the graph is annotated with the combinational delay of the operation at its head and an estimate for the delay incurred by the path produced by the edge. For example, every edge that leads to operation 8, E(goto 10), is annotated with a delay of 0 because the assignment to the 'state' variable (state) is performed immediately and the path delay is assumed to be negligible. These two delays are combined so that the graph can be reasoned about as a flow-graph to find the longest combinational path between two nodes. The scheduler exploits predicates to eliminate dependencies. For example, operations 3 and 4 appear dependent due to a write-after-write (WAW) conflict on r3, but because their predicates are mutually exclusive, the conflict can be removed from the dependency graph. The scheduler would also remove operations whose predicates are false.
The scheduler transforms the RtlBlock hyperblock into an RtlPar hyperblock (fig. 5c). Even though the addition in operation 2 and the comparison in operation 6 both depend on operation 1, they can still be placed into the same state because the addition has a short enough combinational delay that two additions can be performed in a single clock cycle. The multiplication in operation 3 can also be placed into the same state as it does not have any data dependencies with any of the other instructions. The next state has two independent multiplications, operations 4 and 5, that can be scheduled for the same cycle. Finally, the hyperblock is terminated by a control-flow instruction that jumps to state 10. This operation needs to be scheduled after all the other operations, but because it is performed by simply setting the next state of the state machine, this can be done in parallel with the last operation.
Figure 5d shows the corresponding Htl blocks. Htl maps each state to a sequence of Verilog blocking assignments. The translation from RtlPar to Htl first translates each element of the outer list into a separate Htl state. Then, the sequence of blocking assignments is produced by flattening the remaining inner nested lists. Verilog's blocking assignment has a sequential semantics, meaning it behaves like the RtlPar block. Figure 5e shows how the resultant hardware executes. In particular, the two multiplications 4 and 5 could be allocated to the same resources since they never execute at the same time.

Fig. 6. An example schedule. It is valid to move instruction 3 before instructions 1 and 2 because, despite the appearance of data dependencies on r2, instruction 3 is in fact independent: its guard is mutually exclusive with those of the other two.

Table 1. First attempt: basic symbolic execution (pre-scheduling vs. post-scheduling symbolic state).

VALIDATION OF HYPERBLOCK SCHEDULING
Although the scheduling algorithm itself is complex with many heuristics, it is quite simple to check each specific schedule. To do so, we follow Tristan and Leroy [43] and symbolically execute each block before and after scheduling, then compare the two obtained symbolic states for equivalence. The main difference from Tristan and Leroy's approach is that our paths are represented by predicates instead of by explicit branches. As a result, several non-obvious design decisions need to be made so that the validation process is tractable in the presence of hyperblocks. In what follows, we explain these decisions informally with the aid of the example shown in fig. 6. We then present our validator more precisely in section 6.5.

First Attempt: Basic Symbolic Execution
The most natural way to extend Tristan and Leroy's approach is to treat predicates in the same way as registers. Symbolic execution then yields a symbolic state that assigns to each register and predicate an expression in terms of the initial values of the registers and predicates. Applying this approach to the example in fig. 6 produces the two symbolic states shown in table 1. Note that we write r2_0 for the initial value of r2, and so on.
The pre- and post-scheduling expressions for r2 are syntactically equal, but reasoning about the equivalence of the two expressions for p2 is more involved: our validator needs to understand that (if ¬p1_0 then 1 else r2_0) is equivalent to r2_0 in any context where p1_0 is true. Such reasoning could be performed by an SMT solver, encoding each arithmetic operator as an uninterpreted function, but formalising an SMT solver involves a lot of additional proof, and would be slow at run time.

Second Attempt: Using Value Summaries
Instead, we would prefer to rely on a SAT solver, as it is easier to formalise and verify. A SAT solver can handle Boolean reasoning nicely, but cannot reason about arithmetic. So, to allow the use of a SAT solver, we rewrite each expression into a normal form where all the if-expressions are pulled to the top level, e.g. replacing (if ¬p1_0 then 1 else r2_0) == 0 with if ¬p1_0 then 1 == 0 else r2_0 == 0. On our worked example (fig. 6), this results in the symbolic states shown in table 2.
Note that we treat register expressions and predicate expressions slightly differently. For register expressions, we combine all the if-expressions into a single multi-way conditional, which we write using "cases" notation. We call these expressions value summaries after Sen et al. [38], who used the same data structure for a different purpose (namely, making symbolic execution more efficient).
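How value summaries are built by distributing operators over guarded cases can be sketched as follows (illustrative Python; the tuple encoding of guards and terms is ours):

```python
# Illustrative sketch of value summaries: a symbolic value is a list of
# (guard, term) pairs, and operators distribute over every case, so all
# if-expressions end up hoisted to the top level.

TRUE = ("true",)

def neg(p):
    return ("not", p)

def conj(a, b):
    if a == TRUE:
        return b
    if b == TRUE:
        return a
    return ("and", a, b)

def const(term):
    return [(TRUE, term)]          # unconditional one-case summary

def ite(p, vs_then, vs_else):
    # if p then vs_then else vs_else, flattened into guarded cases
    return ([(conj(p, g), t) for g, t in vs_then] +
            [(conj(neg(p), g), t) for g, t in vs_else])

def lift(op, vs):
    # apply a unary operator underneath every guard
    return [(g, (op, t)) for g, t in vs]

# (if ¬p1_0 then 1 else r2_0) == 0 becomes a two-case summary in which
# the comparison is pushed below the guards:
v = ite(neg("p1_0"), const(("lit", 1)), const(("var", "r2_0")))
print(lift("== 0", v))
```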
For predicate expressions, we do not need value summaries, because all of the if e1 then e2 else e3 operations that appear in the predicate expressions become purely Boolean (rather than a mix of integers and Booleans), and hence can be expanded to (e1 → e2) ∧ (¬e1 → e3), as we have done in table 2. These predicate expressions can then be straightforwardly translated into propositional formulas that can be reasoned about using a SAT solver. For example, the generated query for checking the equivalence between the expressions assigned to p2 is an equivalence between the pre- and post-scheduling formulas, one of which abbreviates to ((p1_0 → ((¬p1_0 → 1==0) ∧ (¬¬p1_0 → r2_0==0))) ∧ (¬p1_0 → p2_0)). In the encoding, each SAT variable encodes the truth value of an expression in the formula. We can also generate SAT queries to check the equivalence of the register expressions. This involves issuing multiple queries to the SAT solver: if an expression appears in both the pre- and post-scheduling value summaries, then we generate a query to check that their guards are equivalent, and if an expression only appears in one of the value summaries, then its guard should be equivalent to false. For instance, to check r1 we generate three such SAT queries. However, in constructing all these SAT queries, we have assumed that a Boolean value can be assigned to each of the atoms in a formula. This might not actually be the case: for instance, r1/r2 is not evaluable if r2 is zero, and r1 == r2 is not evaluable if either r1 or r2 is an invalid pointer. So, to compare the two expressions for p2, we actually need to use 3-valued logic. That means all the SAT variables in the queries above actually need to be pairs of binary variables, and the ∧ and ∨ operations need to be 3-valued analogues of conjunction and disjunction. This is a problem because we have seen already that these formulas, particularly those for comparing register expressions, become quite large even for toy examples. Indeed, when we tried this 3-valued approach on the
test cases in our evaluation (section 8), most ran out of memory during validation.
Hence, in the next subsection we describe how we manage to avoid 3-valued logic where possible.

Third Attempt: Using Value Summaries and Final-State Guards
Although it is not the case in our worked example, it turns out that when comparing the predicate expressions that arise in realistic examples, syntactic equality or near-equality usually suffices. This means that we only need to resort to solving 3-valued SAT queries as an occasional fallback, so its performance impact is limited in practice.
Where syntactic methods usually do not suffice is for comparing the guards of register expressions. However, here we can actually avoid the need for 3-valued logic altogether. We observe that the guards in the r1 expressions are simply copied from the expressions for p2 (which we abbreviated in table 2), so rather than writing out the full expressions in the guards, we can write p2_f as a shorthand (the 'f' clarifies that it refers to the final value of p2).
To make it possible to refer to final values, we need to ensure that once p2 has been assigned or used, it is never overwritten. We can achieve this by enforcing SSA form for predicate assignments. That does not impose any restrictions on the Vericert user because predicates are only introduced by internal compiler transformations. SSA form is not needed for register assignments.
The resultant symbolic states are shown in table 3. It can immediately be seen that the expressions have become much shorter; the three queries for validating r1 shrink correspondingly. What is less obvious is that we no longer need 3-valued logic either. This is because:
• We can assume that all predicate expressions in the pre-scheduling symbolic state are evaluable, because if any were not, the input program would fail at run time and we would not need to prove anything about our scheduler.
• We have already proven, either using syntactic comparison or 3-valued logic, that the predicate expressions in the post-scheduling symbolic state are equivalent to those in the pre-scheduling state, which means that they too must be evaluable.
• The guards in the register expressions only refer to these expressions (they cannot include unsafe expressions like division or pointer comparison), and so they must also be evaluable.
• It therefore suffices to use 2-valued logic to compare the guards.

Handling Overwritten Expressions
There is an additional subtlety that needs to be handled: the possibility that the scheduler introduces undefined behaviour. Consider the following example, due to Tristan and Leroy [43].
Symbolic execution yields identical pre- and post-scheduling results, namely r3 ↦ r2_0 + 4. Despite this, the schedule is invalid because the post-scheduling block only executes correctly when r1 is nonzero. To detect and forbid such cases, we follow Tristan and Leroy and keep track of all the expressions that are evaluated into a register or memory location. We shall call this the encountered expression set. For example, the encountered expression set of the pre-scheduling block above includes only r2_0 + 4, but the post-scheduling block's also includes 5/r1_0. Because the encountered set has grown, we deem the schedule invalid.
In order to use Tristan and Leroy's approach with hyperblocks, it needs extending to handle predicated instructions. The obvious way to do this is to generate the encountered expressions for each predicated instruction in the same way that we perform symbolic execution, which is essentially to treat p => r := e as if it were the non-predicated instruction r := p ? e : r. However, this approach leads to too many unnecessary constraints being imposed on the scheduler, leaving it unable to reorder some instructions that have only benign WAW dependencies. To see this, consider the following example.
When performing symbolic execution on the pre-scheduling block, we encounter the pair of expressions if p1_f then 1 else r1_0 and if ¬p1_f then 2 else if p1_f then 1 else r1_0, but on the post-scheduling block we encounter if ¬p1_f then 2 else r1_0 and if p1_f then 1 else if ¬p1_f then 2 else r1_0; these two pairs are not equivalent, so this (correct) schedule cannot be validated.
Instead, for the purposes of calculating encountered expressions, our approach is to treat p => r := e as the instruction r := p ? e : •, where • is a dummy expression representing the absence of an assignment. The set of encountered expressions is now the same for both pre- and post-scheduling: {if p1_f then 1 else •, if ¬p1_f then 2 else •}, so validation can be completed.
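The effect of the dummy expression is that the encountered set becomes insensitive to the order of predicated writes to the same register. A small Python sketch (hypothetical encoding; predicates and expressions are strings here for brevity):

```python
def encountered_pred(block):
    """Encountered expressions for predicated instructions, treating
    'p => r := e' as evaluating 'if p then e else DUMMY', where DUMMY
    stands for the absence of an assignment (the paper's '•').
    A block is a list of (predicate, dest, expr) triples."""
    return {('if', p, e, 'DUMMY') for p, _dest, e in block}

# Two predicated writes to r1 with a benign WAW dependency,
# reordered by the scheduler:
pre  = [('p1', 'r1', '1'), ('not p1', 'r1', '2')]
post = [('not p1', 'r1', '2'), ('p1', 'r1', '1')]
```

Because each instruction contributes an entry that does not mention the previous value of r1, `encountered_pred(pre)` and `encountered_pred(post)` are equal, and the reordering is accepted.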
Fig. 7. Syntax of symbolic states.

Formalising the Symbolic State and Symbolic Execution
The previous sections gave an informal overview of the structure of the symbolic state and the validation algorithm. This section gives formal definitions of these concepts.
Symbolic states. Figure 7 defines the symbolic states that symbolic execution produces. Several components make use of value summaries (as explained in section 6.2), so we define the value summary S(·) as a set of terms of the given type, each paired with a Boolean guard of type G. Henceforth, we shall sometimes write value summaries explicitly as a set of (guard, value) pairs. A symbolic state is made up of five components, the main three being: a register map R that assigns an arithmetic expression (as a value summary) to each register, a predicate map P that assigns a Boolean expression to each predicate, and an expression M for the contents of memory (again as a value summary). We also need symbolic execution to track how control exits the block (to make sure that it does so in the same way after scheduling), so E stores a value summary that evaluates to the instruction that is executed to exit the block (or to 'None' if the block has not finished yet). Finally, C tracks the set of encountered expressions, as motivated in section 6.4. The symbolic states presented by Tristan and Leroy were structured in a similar manner, requiring R, M and C, but not needing value summaries to represent expressions. They also did not need E because basic blocks could not represent arbitrary exits.
Constructing symbolic states. The expressions are constructed using a function which updates the symbolic expression assigned to each resource. A core function used to update value summaries is the coalescing union operator ⊎_p [38], which conjoins ¬p to each guard in its left operand and p to each guard in its right operand:
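The coalescing union can be sketched concretely as follows; this is an illustrative Python model, not Vericert's definition, with a guard encoded as a frozenset of (predicate, polarity) literals read as a conjunction:

```python
def coalescing_union(p, left, right):
    """Coalescing union under predicate p: guards in the left operand
    are strengthened with the literal (p, False), i.e. ¬p, and guards
    in the right operand with (p, True), i.e. p. Value summaries are
    lists of (guard, value) pairs."""
    return ([(g | {(p, False)}, v) for g, v in left] +
            [(g | {(p, True)}, v) for g, v in right])

# Updating register r1 under predicate p1: the old summary keeps its
# value when p1 is false, the new value applies when p1 is true.
old = [(frozenset(), 'r1_0')]
new = [(frozenset(), '1')]
updated = coalescing_union('p1', old, new)
```

After the update, `updated` contains exactly two entries, one guarded by ¬p1 carrying r1_0 and one guarded by p1 carrying 1, so the guards are exhaustive and mutually exclusive.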

Fig. 8. Symbolic execution of selected instructions.
To turn a value summary back into a Boolean formula, we use an operation that expands guards into predicate expressions. It is also useful to have an applicative interface for value summaries, so that when we have value summaries of functions and of inputs, we can obtain a value summary of outputs. Following Sen et al. [38], we simplify value summaries as they are built up, so as to keep their size from exploding: coalescing two elements (g, v) and (g′, v′) where v = v′ into a single element (g ∨ g′, v), and removing elements (g, v) whenever g ↔ false.
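The applicative interface and the two simplifications can be modelled together in a short Python sketch (an illustrative encoding, not Vericert's; guards are frozensets of (predicate, polarity) literals read as conjunctions, so a guard containing both polarities of a literal is trivially false):

```python
def vs_apply(fs, xs):
    """Applicative '<*>' on value summaries: pair every function in fs
    with every argument in xs, conjoining their guards. Entries whose
    conjoined guard is contradictory are dropped, and entries with
    equal values are coalesced by collecting their guards (read as a
    disjunction of conjunctions)."""
    out = []
    for gf, f in fs:
        for gx, x in xs:
            g = gf | gx
            if any((name, not pol) in g for name, pol in g):
                continue  # guard ≡ false: remove the element
            out.append((g, f(x)))
    merged = {}
    for g, v in out:
        merged.setdefault(v, []).append(g)
    return [(gs, v) for v, gs in merged.items()]

# Apply a summary holding one increment function to a two-way summary:
fs = [(frozenset(), lambda x: x + 1)]
xs = [(frozenset({('p', True)}), 1), (frozenset({('p', False)}), 5)]
```

Here `vs_apply(fs, xs)` yields the values 2 (under p) and 6 (under ¬p); pairing a ¬p-guarded function with a p-guarded argument instead yields the empty summary.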
Symbolic execution. The symbolic execution of an instruction is performed by a function that takes the current symbolic state and produces an updated one. It also takes an 'enabled' predicate, which is conjoined with the current instruction's guard; it ensures that after an exit instruction is taken, any subsequent instructions are nullified. So, whenever an exit instruction is encountered, the enabled predicate is conjoined with the negation of the exit instruction's guard.
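The threading of the enabled predicate can be sketched as follows (an illustrative Python model with predicates built up as strings; Vericert manipulates real predicate expressions):

```python
def exec_hyperblock(instrs):
    """Thread the 'enabled' predicate through a hyperblock: each
    instruction executes under enabled ∧ its own guard, and after an
    exit instruction with guard g, enabled becomes enabled ∧ ¬g,
    nullifying all subsequent instructions along that path.
    Instructions are ('op' | 'exit', guard, text) triples; the result
    is a list of (effective-guard, text) pairs."""
    enabled = 'true'
    trace = []
    for kind, g, text in instrs:
        trace.append((f'({enabled}) & ({g})', text))
        if kind == 'exit':
            enabled = f'({enabled}) & !({g})'
    return trace

# An exit guarded by p followed by an unconditional operation:
trace = exec_hyperblock([('exit', 'p', 'goto L1'),
                         ('op', 'true', 'r1 := 1')])
```

The second instruction ends up guarded by ¬p as well as its own guard, reflecting that it only runs when the exit was not taken.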
In fig. 8, we show three important cases: symbolically executing an arithmetic operation, an exit instruction, and a predicate assignment. To symbolically execute a whole hyperblock (denoted •#), we run this function on each instruction in turn, threading the symbolic state through, starting from the empty symbolic state (denoted ∅).
Comparing symbolic states. After symbolically executing the RtlBlock and RtlPar blocks, we obtain two symbolic states, σ and σ′. We wish to show that σ′ is a symbolic refinement of σ (written σ ≳ σ′), and we do so by component-wise comparison, as shown in eq. (8) and explained below.
The core comparison operation that we rely upon is between two value summaries, written ≈. Whenever a value appears in both value summaries, we check that its guards are equivalent (via a SAT query), and whenever a value appears in just one value summary, we check that its guard is equivalent to false (again via a SAT query). This approach suffices for the register maps (σ_R ≈ σ′_R), the memory maps (σ_M ≈ σ′_M), and the exit expressions (σ_E ≈ σ′_E). For the predicate maps, we first attempt to show syntactic equality (σ_P = σ′_P). If this fails, we fall back to using a slow but reliable equivalence check with a 3-valued solver (σ_P ≈3v σ′_P). Finally, for the encountered expression sets, we write σ_C ≳ σ′_C to mean that every expression in σ′_C has an equivalent in σ_C.
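The value-summary comparison ≈ can be modelled as follows; this Python sketch uses brute-force enumeration where Vericert issues SAT queries, and encodes each summary as a map from value to a list of guard conjunctions (all names are hypothetical):

```python
from itertools import product

def guards_equiv(gs1, gs2, names):
    """Decide whether two disjunctions of guard conjunctions are
    equivalent, by enumeration (standing in for the SAT query). A
    guard is a frozenset of (predicate, polarity) literals; an empty
    list of guards is identically false."""
    def holds(gs, env):
        return any(all(env[n] == pol for n, pol in g) for g in gs)
    return all(
        holds(gs1, dict(zip(names, bs))) == holds(gs2, dict(zip(names, bs)))
        for bs in product([False, True], repeat=len(names)))

def vs_equiv(s1, s2, names):
    """The '≈' check: for a value in both summaries, its guards must
    be equivalent; for a value in only one, its guard must be
    equivalent to false (the .get default of [] encodes false)."""
    return all(guards_equiv(s1.get(v, []), s2.get(v, []), names)
               for v in set(s1) | set(s2))
```

For example, two summaries assigning 1 under p and 2 under ¬p are equivalent regardless of entry order, but a summary assigning 1 unconditionally is not equivalent to either.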

Defining a Verified Scheduler
We can now define a verified scheduler using the standard translation validation approach:

PROVING THE VALIDATOR CORRECT
In order to prove our validator correct, we need to prove that whenever our validator deems b#_par to be a symbolic refinement of b#, there is indeed a forward simulation from b to b_par (defined more precisely in definition 7.1). The natural way to prove this is to follow Tristan and Leroy [43] by constructing the chain b ≽ b# ≽ b#_par ≽ b_par. In order to do this, we need to be able to talk about forward simulations that involve not just programs (b and b_par) but also symbolic states (b# and b#_par). Thus we need a semantics not just for blocks (cf. fig. 3) but also for symbolic states.

A Semantics for Symbolic States
We need a function that takes a symbolic state σ and applies it to an initial concrete state Γ. The output is the concrete state Γ′, together with the control-flow instruction cf that is executed to exit the block. The function is written Γ ⊢ σ ⇓ (Γ′, cf), and is defined in fig. 9. It works as follows:
• The entry point is the SemState rule. This rule has six antecedents. The first constructs a Boolean value for each predicate in the final state. The second constructs a value for each register in the final state, consulting Γ′_P to get the final values of predicates when evaluating value summaries (cf. section 6.3). The third constructs the final contents of memory. The fourth determines the control-flow instruction for exiting the block, and the fifth expands Γ. The sixth does not calculate a component of the final state; instead, its purpose is to prevent the final state being calculated at all if any of the encountered expressions (cf. section 6.4) are unevaluable.
• The ⇓_A rules (RegBase, Load, and Op) map register expressions to concrete values (integer, float, pointer, or 'undefined'). In the Op rule, we write ↓ to indicate the existing CompCert evaluation semantics for the arithmetic operation op_a, which may need to consult Γ_Env to handle operands that are relative to the stack pointer or a global variable.
• The ⇓_M rules map memory expressions to concrete values.
• The ArithEmpty and MemEmpty rules map the dummy expression • to an arbitrary arithmetic value (1), or to an arbitrary memory (the initial memory). This is so we can check that all encountered expressions are evaluable; their actual values are immaterial.
Fig. 9. Semantics of symbolic states.
• The ⇓_B rules map predicate expressions to Boolean values (true and false). In the Pred rule, evaluating e1 op_c e2 returns an option type because e1 or e2 might be an invalid pointer.
• The rules for ∧ and ∨ are designed to produce a lazy semantics. This is necessitated by the fact that they originate from if-statements in the source program, which must be evaluated lazily. In particular, if e1 evaluates to false and e2 is unevaluable, then we need e1 ∧ e2 to evaluate to false, not to be unevaluable.
• The PredExpr rule is for evaluating a value summary of type S(·). It finds an entry (g, v) for which the guard g evaluates to true in the final state, and then evaluates v at the appropriate type. Value summaries are constructed so that the guards are exhaustive and mutually exclusive, so there will always be exactly one such entry.
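The selection performed by the PredExpr rule can be sketched in a few lines of Python (an illustrative model with guards as frozensets of (predicate, polarity) literals; the names are hypothetical):

```python
def eval_summary(summary, final_preds):
    """PredExpr rule, sketched: evaluate a value summary by selecting
    the unique entry whose guard holds under the final predicate
    values. 'summary' is a list of (guard, value) pairs and
    'final_preds' maps predicate names to Booleans."""
    hits = [v for g, v in summary
            if all(final_preds[n] == pol for n, pol in g)]
    # Guards are exhaustive and mutually exclusive by construction,
    # so exactly one entry fires.
    assert len(hits) == 1
    return hits[0]

# A two-way summary: 1 under p1, 2 under ¬p1.
summary = [(frozenset({('p1', True)}), 1),
           (frozenset({('p1', False)}), 2)]
```

Evaluating `summary` with `{'p1': False}` selects 2, and with `{'p1': True}` selects 1.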

Establishing the Chain of Simulations
Now that we have a semantics for symbolic states, we can define the required forward simulation relation, x ≽ y, where x and y are both blocks, or both symbolic states, or one of each.
Definition 7.1 (Forward lock-step simulation diagram). x ≽ y holds when, for every execution of x, there exists an execution of y that, when starting from a matching initial state, results in a matching final state.
It remains to construct the chain b ≽ b# ≽ b#_par ≽ b_par. The three steps of that chain are captured by the following three lemmas.
Lemma 7.2 (Soundness of symbolic execution). For every execution of a block b, there exists an equivalent execution of its symbolic state b#. That is: for all b, we have b ≽ b#.
Lemma 7.3 (Symbolic refinement implies behavioural refinement). For all σ, σ′, we have σ ≳ σ′ =⇒ σ ≽ σ′.
Proof Sketch. For expressions this is just syntactic equality, while for value summaries this comes down to proving the correctness of the SAT solver. □
Lemma 7.4 (Completeness of symbolic execution). For every execution of the symbolic state b#_par, there exists an equivalent execution of the block b_par. That is: for all b_par, we have b#_par ≽ b_par.
Proof Sketch. Completeness requires a bit more work, because we are given the final symbolic state and need to show that the whole block that generated it produces the same result. First we show that every instruction in the original block necessarily produces a value, which follows from the semantics of encountered expressions. From this, we can show that the execution of b_par in the current context must produce a state. Next, we use lemma 7.2 to show that b#_par must produce a state that is equivalent to that of b_par. Finally, because our semantics of symbolic states is deterministic, we can show that this state must be unique, and therefore the two must be equivalent. □
Finally, we can show the correctness of the verified scheduling implementation.
Theorem 7.5 (Scheduler Correctness). Whenever b#_par is a symbolic refinement of b#, there is a forward simulation from b to b_par. That is: for all b, b_par, we have b# ≳ b#_par =⇒ b ≽ b_par.
Proof. This follows from lemmas 7.2, 7.3, and 7.4, and the transitivity of ≽. □

Managing Complexity in the Proof
The proof of theorem 7.5, together with the necessary additions to the Vericert back end, is as large as Vericert's original correctness proof. If one then adds the proofs of the if-conversion pass and the changes that had to be made to existing passes, Vericert with hyperblock scheduling has 16681 sloc of Coq definitions and 17426 sloc of Coq proofs according to coqwc, making it 3× larger than the original Vericert implementation. (We note, however, that our hyperblock-scheduling extension has not enlarged Vericert's trusted computing base at all.) It was therefore particularly important to take steps to manage the proof's complexity, primarily by breaking it up into reusable lemmas.
A substantial portion of the proof involves reasoning about the function for symbolically executing instructions. As shown in fig. 8, the definition of this function is naturally broken down into smaller state-updates using the applicative interface for value summaries, <*> (from eq. (6)). Accordingly, it is desirable to formulate lemmas that follow the same structure. For instance, if we want to show that some property holds for f <*> x, we would like a lemma that breaks this down into related properties holding for f and x separately. However, it is not possible to reason about the behaviour of value summaries like {(true, (+))} in isolation, because our current semantics of symbolic states (fig. 9) gives no meaning to functions such as (+). Our solution is to extend the semantics with a rule that can handle any value summary.
This rule achieves this by making no attempt to evaluate the expression that it selects from the value summary, instead simply returning it. (In contrast, PredExpr demands that the selected expression can be evaluated to a value.) Hence we call this the identity semantics for the value summary. Using identity semantics, it becomes possible to formulate lemmas that capture the behaviour of <*>. We found that only once we were able to formulate lemmas like these did the proof become feasible. Without them, it involved a number of special cases that was simply unworkable.

EVALUATION
Our evaluation aims to answer the following research questions:
• RQ1: Does adding scheduling to Vericert lead to a significant improvement in the quality of the generated hardware (in terms of area and delay)?
• RQ2: Is hyperblock scheduling better than naïve list scheduling?
• RQ3: Does adding scheduling make Vericert competitive with unverified HLS tools?
• RQ4: Did our design decisions (e.g. section 6.3) lead to an acceptable compilation time?
Experimental setup. Following Herklotz et al. [25] and Six et al. [40], we evaluate our work using PolyBench/C [36]. For each benchmark, the resulting Verilog hardware design was simulated using Verilator to obtain the total cycle count. Each design was then synthesised, placed, and routed onto a Xilinx series 7 FPGA (part number: xc7z020clg484-1) using Vivado to obtain its total area and its estimated maximum frequency. We then calculated: total execution time = total clock cycles / maximum frequency. This is a minimum execution time, because in practice all designs run at only 100 MHz.
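The execution-time metric is a simple quotient; the following sketch makes the unit handling explicit (illustrative helper, not part of the evaluation scripts):

```python
def execution_time_us(total_cycles, fmax_mhz):
    """Total execution time in microseconds: cycle count divided by
    the estimated maximum frequency. A frequency of f MHz is f cycles
    per microsecond, so the units cancel directly. This is a lower
    bound, since deployed designs run at a fixed 100 MHz."""
    return total_cycles / fmax_mhz

# e.g. a design taking 1,000,000 cycles at an estimated 125 MHz:
t = execution_time_us(1_000_000, 125)  # 8000.0 µs
```

A design with the same cycle count but a lower maximum frequency, say 100 MHz, would take proportionally longer (10000 µs), which is why both cycle count and critical path delay matter in the comparisons below.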
Figure 10 visualises the results of simulating and synthesising the PolyBench/C benchmarks. We use default Vericert (Vericert-original), Vericert with list scheduling (Vericert-list), and Vericert with hyperblock scheduling (Vericert-hyperblock). We also use the state-of-the-art open-source HLS tool Bambu [19] in two modes: one where all default optimisations are enabled (Bambu-default), and one where as many optimisations as possible are disabled (Bambu-no-opt). Several optimisations are built into Bambu and cannot be disabled, however, such as list scheduling and loop flattening.
Answering RQ1. To assess whether adding scheduling to Vericert leads to better hardware designs, we compare the hardware produced by Vericert-original with that produced by Vericert-hyperblock. We see that, on average, hyperblock scheduling leads to hardware that requires only 0.48× the time to execute a benchmark (fig. 10a). This can be attributed to the scheduled hardware requiring only 0.46× the number of cycles (fig. 10b), with the minimum clock period being nearly identical between the scheduled and unscheduled designs (fig. 10c). This improvement is unsurprising given that Vericert-original only executes a single instruction per clock cycle. In terms of area (fig. 10d), hyperblock scheduling leads, on average, to a slight increase of 23.6%. This can mostly be attributed to the additional circuitry needed to check the value of predicates and gate instructions. However, in some cases, like floyd-warshall, the area increase is quite egregious, which can be attributed to some back-end optimisations, like operation fusing, not triggering because the Verilog code does not match a specific pattern the synthesis tool recognises. Tweaking the syntax of the generated code should allow us to produce designs with similar area to those of Vericert-original.
Fig. 10. Results of simulating and synthesising the PolyBench/C benchmark suite using a range of HLS tool configurations. All measurements are relative to Bambu-default except the greatest critical path delay, which is given absolutely. The dashed red line in that graph corresponds to the target clock with a period of 10 ns, or a frequency of 100 MHz.
Answering RQ2. Hyperblock scheduling is considerably more complicated to implement and verify than list scheduling, because it requires if-conversion to combine basic blocks into hyperblocks, as well as predicate-aware scheduling. If we omit if-conversion entirely (hence avoiding predication too), we obtain list scheduling as a special case. Does hyperblock scheduling yield enough of a performance improvement over list scheduling to justify its additional complexity?
To answer this, fig. 10 measures the hardware produced by Vericert-list. On average, list scheduling leads to hardware requiring 0.52× the total time compared to Vericert-original, and 1.08× the total time compared to Vericert-hyperblock. We expect hyperblock scheduling to extend its small lead over list scheduling once the if-conversion heuristics are improved. In particular, our predictions of predicated instruction latency are currently quite conservative to ensure that timing constraints are met; improving these estimates is an active research area [37,41,44,45,48]. The latency of instructions without predicates, on the other hand, is much more predictable, and its estimation is less conservative, leading to better overall performance. We believe list scheduling is already close to its best, whereas hyperblock scheduling could be greatly improved with better predictions.
In terms of area, we see that Vericert-list leads to the smallest hardware designs, on average 0.69× the size of Vericert-hyperblock designs. This can be attributed to the downstream logic synthesis tool being able to save area by optimising chained operations, such as multiply-accumulate, while not having to handle the predicates that are introduced by Vericert-hyperblock.
Answering RQ3. To assess how Vericert-hyperblock fares against unverified HLS tools, we compare it against Bambu. We see that although Vericert-hyperblock is well behind Bambu-default (its designs require 3.6× the execution time), it performs only slightly worse than Bambu-no-opt (1.57× the execution time). This is encouraging, because the main reason for the slow-down in execution time is the higher critical path delay: Vericert-hyperblock performs comparably with Bambu-no-opt in terms of cycle count, with its designs needing only 1.04× the cycles. Cycle count is under the direct control of the scheduler, and therefore shows the effect of the scheduler most clearly, whereas the critical path delay is sensitive to which optimisations the downstream synthesis tool decides to perform. As mentioned previously, estimating the delay of operations and predicting when downstream optimisations will fire is an active research area, and our implementation is currently conservative. For example, even Bambu-no-opt failed to predict the delay of the critical path correctly for the fdtd-2d benchmark, and failed timing for it, as did Vericert-list on the cholesky benchmark. In terms of area, Vericert-hyperblock designs are 1.7× larger than Bambu-default and 1.2× larger than Bambu-no-opt; however, this is tightly linked to the fact that some chaining optimisations are currently missed by the downstream synthesis tool. Tweaking the representation of predicated instructions could improve the area significantly.
Answering RQ4. To assess whether Vericert-hyperblock has acceptable compilation times, we again compare it against Bambu. Compilation times barely varied for Bambu, all being around 3 s, mainly due to long startup costs. Vericert-hyperblock compiled each benchmark in 0.9 s, also without much variation, showing that verification was not overly costly. As for whether our design decisions led to these compilation times: we remark that if we disable the 'final-state predicates' innovation introduced in section 6.3, none of the benchmarks compile within a few minutes, and eventually the machine runs out of memory.

RELATED WORK
The most closely related works to ours are those by Tristan and Leroy [43] and by Six et al. [40], so we begin this section by recapping our main points of similarity and difference.
Tristan and Leroy [43] were the first to propose adding scheduling to a verified compiler, and we adopt their method for validating schedules: running symbolic execution before and after scheduling and comparing the resultant symbolic states. Their scheduler only reorders instructions, so syntactic equality suffices for comparing symbolic states, whereas our scheduler can also modify instructions (by manipulating predicates). This means that our state comparisons are more involved, and we turn to a SAT solver to help resolve them (section 6.2). Tristan and Leroy also devised the use of constraints to prevent the scheduler from introducing undefined behaviour; we adopt this technique too, taking care to extend it to handle predicates in such a way that valid schedules can still be validated (section 6.4). We remark that a direct empirical comparison with Tristan and Leroy's work is not feasible, because their method was implemented for an old version of CompCert and was never incorporated into its main branch.
Six et al. [40] formalise superblock scheduling, which is a form of trace scheduling that is well-suited to VLIW processors. Hyperblock scheduling is more general than superblock scheduling and is well-suited to our application domain, HLS. Six et al.'s scheduler reorders instructions, and also combines them where it is advantageous to do so, reasoning only about registers that are live at the end of the block; comparing symbolic states is therefore more involved than it was for Tristan and Leroy, but it still does not require a SAT solver because there are no predicates to reason about. A direct empirical comparison between our scheduler and Six et al.'s is difficult, because the two have different targets and Six et al.'s is based on an incompatible fork of CompCert called CompCertKVX.
Outside the realm of mechanised proof, most HLS tools implement some form of static scheduling. For instance, AMD Vitis HLS [3], LegUp [11, p. 60], Google XLS [21, line 112] and Bambu [18, line 35] all employ SDC-based hyperblock scheduling. Other HLS tools, such as Dynamatic [31], defer scheduling decisions until runtime. It is notable that these unverified tools tend towards fewer, larger passes, in which several optimisations are packed into 'scheduling'. This minimises the number of times the solver needs to be invoked, and gives it the best chance of finding a global optimum. Verified tools such as CompCert and Vericert, on the other hand, tend towards more, smaller passes that are individually feasible to prove correct.
Several HLS tools are associated with translation validators, either for an individual scheduling pass [14,32,46] or for the HLS tool as a whole [39,42]. These typically work by establishing equivalence at the level of state machines, whereas we take the approach of comparing symbolic states, because Tristan and Leroy [43] showed it to be viable in the context of a verified compiler. Unlike our work, none of these translation validators provides mechanised proofs of equivalence, so they might produce false positives. Google XLS [22] performs scheduling on a dataflow language, which may also be a useful intermediate representation on which to formalise a scheduler directly (as opposed to validating each schedule produced). Formalisation of dataflow languages within CompCert is being actively worked on [16,23], which may make this feasible.

Fig. 1. New passes and intermediate languages introduced in this work.

Fig. 4. An example showing two iterations of the block inlining pass.

Theorem 4.1 (Forward simulation of if-conversion). If a program P is safe (free from undefined behaviour) and has behaviour B, then ifconvert(P) also has behaviour B.

Table 2. Second attempt: using value summaries.

Table 3. Third attempt: using value summaries and final values in guards.