Khronos: Fusing Memory Access for Improved Hardware RTL Simulation

CCS CONCEPTS: • Hardware → Simulation and emulation.


ABSTRACT
The use of register transfer level (RTL) simulation is critical for hardware design in various aspects including verification, debugging, and design space exploration. Among various RTL simulation techniques, cycle-accurate software RTL simulation is the most prevalent approach due to its easy accessibility and high flexibility. The current state-of-the-art cycle-accurate simulators mainly use full-cycle RTL simulation that models RTL as a directed acyclic computational graph and traverses the graph in each simulation cycle. However, the adoption of full-cycle simulation makes them mainly focus on optimizing the logic evaluation within one simulation cycle, neglecting temporal optimization opportunities.
In this work, we propose Khronos, a cycle-accurate software RTL simulation tool that optimizes memory accesses to improve simulation speed. RTL simulation often involves a large number of register buffers, making memory access one of the performance bottlenecks. The key insight of Khronos is that a large number of memory accesses across consecutive clock cycles exhibit temporal locality; by fusing those accesses, we can reduce the memory traffic and thus improve the overall performance. To do this, we first propose a queue-connected operation graph to capture temporal data dependencies. We then reschedule the operations and fuse state accesses across cycles, reducing the pressure on the host memory hierarchy. To minimize the number of memory accesses, we formulate a linear-constraint non-linear objective integer programming problem and solve it by iteratively linearizing it to a minimum-cost flow problem. Experiments show that Khronos can save up to 88% of cache accesses and achieve an average acceleration of 2.0x (up to 4.3x) for various hardware designs compared to state-of-the-art simulators.

INTRODUCTION
Register transfer level (RTL) simulation is a critical tool in the hardware design flow, with cycle-accurate software RTL simulation being the most common solution due to its accuracy, flexibility, accessibility, and debug capabilities. Upstream tasks of hardware design such as verification, debugging, and design space exploration heavily rely on multiple or extensive runs of RTL simulation [18,24,28,43]. However, due to its relatively low-level representation, RTL simulation is a time-consuming process, which is considered one of the bottlenecks in the IC design flow [34]. The state-of-the-art software cycle-accurate RTL simulators mainly employ full-cycle simulation techniques. Fig. 1a illustrates the simulation process of the full-cycle simulator. In each simulation cycle, the simulator reads the state, traverses the computational graph to obtain the next state, and writes back the state to continue the next simulation cycle. In general, this process needs to be repeated millions of times, making the acceleration of RTL simulation an important problem.
To improve the simulation speed, previous works have tried various optimization techniques. Verilator [37] assigns RTL statements to multiple threads for parallel simulation. ESSENT [10] partitions the graph into blocks to exploit low activity factors. RepCut [42] replicates certain nodes in the graph to reduce thread synchronization overhead. RTLFlow [27] compiles RTL design into CUDA code to exploit stimulus parallelism on powerful GPGPUs. All these techniques primarily focus on improving the evaluation latency within one simulation cycle by graph partitioning and parallelization. While these techniques have achieved great simulation acceleration, they do not exploit the temporal optimization opportunity across consecutive cycles. We find that memory access is a major performance-limiting factor for software RTL simulation. For example, hardware accelerators often employ pipelined designs, and the register buffers between stages can result in a significant amount of state access, leading to slow simulation speed. This is because state-of-the-art simulators all employ full-cycle simulation, which needs to write the states at the end of the current cycle and read them again at the beginning of the next cycle, as shown in Fig. 1a. Our performance analysis finds that, for pipelined designs, at least 25% of the instructions generated by Verilator [37] are memory access instructions. Moreover, for highly pipelined deep learning accelerators, 65% of the generated instructions are memory access instructions. Our insight is that this memory bottleneck can be eliminated if we reschedule the operations and fuse the memory accesses with temporal locality. An optimized example is shown in Fig. 1b. Khronos reorders the operations between cycles and fuses the redundant state accesses. Then, in each simulation cycle, the number of host memory accesses is reduced from 4 to 2, and the simulation speed is improved.
In this work, we present Khronos, a cycle-accurate software RTL simulation tool that exploits the temporal data (hardware state) locality between consecutive cycles. Instead of modeling RTL spatially as a computational graph, we model it as a parameterized queue-connected operation graph. The queue graph captures temporal data dependencies and enables temporal optimization between cycles. We model the simulation states as queues and the combinational operations as the nodes in the graph. The operations can be reordered by modifying the capacities of the queues. Khronos then fuses state writes and reads with temporal locality, effectively reducing the pressure on the host cache and memory.
The fusion problem in Khronos is formulated as a complex linear-constraint non-linear objective integer programming problem. However, the complexity and scale of RTL designs make it challenging to find the optimal solution. Instead of finding the best solution, we design a gradient-based optimization algorithm to find a good one. This algorithm iteratively linearizes the problem to a minimum-cost flow problem, resulting in almost linear time cost and making the compilation speed of Khronos fast. Khronos is developed on top of the CIRCT [1] and MLIR [26] compiler infrastructures and evaluated using a comprehensive set of open-source benchmarks from [7,8,14,17,32,36,38,39,45].
The contributions of this paper are summarized as follows. We propose Khronos, which is open-sourced 1 . Experiments demonstrate that our simulator reduces cache accesses by up to 88% and achieves an average acceleration of 2.0x (up to 4.3x) for various hardware architectures compared to the state-of-the-art simulator, Verilator.

BACKGROUND AND MOTIVATION

2.1 RTL Simulation
RTL simulation is widely used to evaluate the behavior of digital designs. There are two main methods of RTL simulation: event-driven and full-cycle [43]. Event-driven simulators propagate signal changes as events. This introduces a large scheduling overhead but offers high wave accuracy. On the other hand, full-cycle simulators remove this overhead by pre-computing a schedule for the hardware design and simulating it as a static program. The wave accuracy of full-cycle simulators is limited to clock edges, but such accuracy is generally enough for functional verification. The comparison of full-cycle simulators and event-driven simulators is listed in Table 1. The state-of-the-art software cycle-accurate RTL simulators, such as Verilator [37], mainly employ full-cycle simulation techniques. The widely used commercial simulator VCS [3] employs event-driven simulation techniques. In full-cycle simulation, the input RTL is represented as a directed acyclic graph, called an RTL graph.
In each cycle, the simulator consumes an input and traverses the RTL graph to produce the output values. In general, such graph traversal has to be repeated thousands or millions of times.
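The per-cycle read/evaluate/write-back loop described above can be sketched in a few lines. This is a minimal illustration only; the two-register toy design and the names `r1`, `r2` are hypothetical, not from the paper.

```python
# Minimal sketch of full-cycle simulation: each cycle reads the state,
# evaluates the combinational logic in a fixed (pre-scheduled) order, and
# writes the next state back for the following cycle.

def make_full_cycle_sim():
    state = {"r1": 0, "r2": 0}          # register file (simulation state)

    def step(inp):
        # 1. read the current state
        r1, r2 = state["r1"], state["r2"]
        # 2. traverse the combinational graph (here: a toy 2-stage pipeline)
        n1 = inp + 1                     # stage 1 logic
        n2 = r1 * 2                      # stage 2 logic uses last cycle's r1
        out = r2                         # output comes from the oldest register
        # 3. write the next state back
        state["r1"], state["r2"] = n1, n2
        return out

    return step

step = make_full_cycle_sim()
outputs = [step(x) for x in range(4)]   # simulate 4 cycles
```

Note that steps 1 and 3 are pure memory traffic; they are exactly the accesses Khronos targets for fusion.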

Motivation
The latency of a single simulation cycle can be modeled as Eqn. 1, CycleTime = ComputationTime + MemoryTime, where ComputationTime represents the time to evaluate all logic operations and MemoryTime represents the time to access memory. Prior techniques mainly focus on improving the first term in Eqn. 1 by reducing the number of operations per cycle or improving the evaluation speed.
For example, to reduce the ComputationTime, ESSENT [10] evaluates the design lazily to reduce the operations to evaluate, while Verilator [37], RTLFlow [27], and RepCut [42] utilize thread parallelism. However, we find that MemoryTime is the performance bottleneck in many cases. Fig. 2 shows the ratio of data memory access instructions to total dynamic instructions of Verilator on different benchmarks. As shown, Verilator spends at least about 25% of its instructions accessing the simulation state. For highly pipelined architectures such as systolic arrays, 65% of the instructions are used to read or write memory.
Therefore, memory optimization is important, especially for pipelined architectures. We find that a large portion of the memory accesses is not necessary. To illustrate this, Fig. 3 presents the structure of FanNetwork, the core component of the SIGMA flexible inner-product accelerator [32]. FanNetwork divides the addition operation into many stages to form an addition tree. Each stage contains only a few FanNodes, which compute the addition or forward the input data. Similar to other pipelined designs, FanNetwork has the following characteristics: a) the complex functionality is divided into multiple stages; b) the operation in each stage is as simple as possible to increase the clock frequency; c) registers are inserted between stages for synchronization. The register buffers inserted by FanNetwork cause a lot of memory reads and writes in software simulation. However, all of these memory accesses are redundant and can be fused as in Fig. 1b, leading to a large reduction in memory access.

OVERVIEW OF KHRONOS
Fig. 4 presents the overview of Khronos, which consists of five components: frontend, modeling, formulation, optimization, and backend. Khronos uses the frontends and IRs in CIRCT [1] to support different RTL languages and outputs the LLVM dialect of MLIR [26] to reuse the code generation functionality of MLIR and the compilation backend of LLVM.
The modeling of RTL is based on the queue-connected operation graph (QGraph). QGraph can effectively model RTL operations temporally as transforms of sequences, which consume a sequence of input signals and generate a sequence of output signals. Such temporal modeling allows us to capture the temporal locality of memory accesses globally and easily adjust the simulation behavior at the sequence level rather than the element level. The edges in QGraph represent delay transforms. Khronos changes the order of simulation by modifying the delay of the edges to shift the whole I/O sequence. Moreover, such modification must satisfy certain constraints to ensure the correctness of the simulation. The details of modeling and formulation are discussed in Section 4.
We model memory (circuit state) fusion as an objective function subject to the constraints and formulate the state fusion as a complex integer programming problem. The integer programming problem is NP-hard. To solve it efficiently, we propose an iterative linearization algorithm in Section 5. The algorithm exploits the characteristics of the constraints and converts the problem to a linear programming problem iteratively. By further dualizing the problem as a minimum-cost flow problem, our algorithm can obtain a good solution in almost linear time.

RTL MODELING AND FORMULATION
In Khronos, the RTL design is modeled as a queue-connected operation graph (QGraph). QGraph models RTL operations as transforms of a sequence. The nodes in QGraph are element-wise sequence transforms, and the edges are queues that represent a special delay transform. The temporal modeling enables easy manipulation of an entire I/O sequence. Khronos can adjust the order of simulation by modifying the capacities of the queues in the QGraph. Moreover, the capacities must satisfy certain constraints to ensure the adjusted QGraph has the same simulation behavior as the original RTL design. The optimization objective is formulated analytically to minimize the number of state accesses.

Modeling RTL using QGraph
Instead of modeling RTL spatially as a computational graph, Khronos models RTL operations temporally as transforms of a sequence. As shown in Eqn. 2, the special queue operation Q is defined as a transform that delays its input sequence by a fixed number of cycles.
QGraph is a directed graph where each node represents an element-wise transform and each edge represents a queue. The transforms in the QGraph are simulated iteratively in each cycle. Combinational logic operations are always completed in a single cycle and can be directly translated into QGraph nodes. The sequential logic operations, requiring multiple cycles to complete, are translated to a combination of queues and corresponding "combinational" operations. Each edge in QGraph is annotated with a few attributes, including capacity and width. The capacity is the delay of the queue transform, and the width represents the bit width of the corresponding signal.
The translation rules between RTL design and QGraph are defined in Table 2. ① A combinational wire does not delay the input sequence and is represented as a queue of capacity 0. ② A register delays the input by 1 cycle and is represented as a queue of capacity 1. ③ A group of shift registers of size n can be seen as a queue of capacity n. ④ A combinational logic operation can be directly translated into QGraph nodes. At the beginning of the simulation, the elements of the queues are undefined, similar to uninitialized registers, and are represented by "x" in ③. These undefined signals do not affect the circuit's functionality. Other operations can be translated as combinations of the operations in Table 2. For example, a pipelined adder with 3-cycle latency can be translated to a node that performs the addition and an edge with capacity 3 that represents the delay.
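The queue transform above is easy to make concrete: a capacity-c queue emits "x" for its first c outputs and then replays its input delayed by c cycles. The following sketch uses our own API names, not those of Khronos.

```python
from collections import deque

class Queue:
    """A capacity-c queue modeling the delay transform: y[t] = x[t - c].
    The first c outputs are undefined ("x"), mirroring uninitialized
    registers; a capacity-0 queue behaves like a combinational wire."""
    def __init__(self, capacity, undef="x"):
        self.buf = deque([undef] * capacity)

    def push_pop(self, x):
        if not self.buf:          # capacity 0: pass-through wire
            return x
        self.buf.append(x)        # enqueue this cycle's input
        return self.buf.popleft() # dequeue the value from c cycles ago

q = Queue(capacity=2)
out = [q.push_pop(v) for v in [10, 20, 30, 40]]
# out == ["x", "x", 10, 20]
```

With this view, a register is `Queue(1)` and a depth-n shift register is `Queue(n)`, matching rules ② and ③ in Table 2.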
The correspondence between RTL and QGraph is one-to-one, as demonstrated in the example translation presented in Fig. 5. As shown in the figure, each module in the circuit is translated into a node, and the connections between modules are translated into edges. The edge capacity of a combinational connection is 0, while sequential modules, such as f4, f2, f1 and the registers, generate edges with capacity equal to their latency. The input RTL design is first translated into QGraph for formulation and memory access fusion. After fusion, Khronos modifies the queue capacities of the QGraph and translates it back to RTL for code generation.

Formulation of Memory Access
To capture temporal locality, we expand the QGraph to obtain the cross-cycle data dependency graph. Fig. 6 ② shows an example of the cross-cycle data dependency graph, where each column represents an iteration and each row represents the output sequence of a node. If the data generated by node u at the i-th iteration is used by node v at the j-th iteration, we add an edge connecting node u of the i-th column to node v of the j-th column to represent the data dependency. The cross-cycle data dependency graph can be obtained by unrolling the QGraph on the x-axis and connecting nodes according to queue capacity. We use "iteration" instead of "cycle" to refer to the simulation iteration for clarity, because Khronos may fuse operations from multiple cycles into a single iteration.
Khronos can adjust the simulation order by assigning a schedule vector s to the nodes, where s_v denotes how many cycles node v is evaluated ahead. Thus, if node v is associated with s_v, the input and output sequences of v are shifted left by s_v cycles. All the data dependency edges connecting to node v are shifted accordingly to ensure that the data dependencies are still satisfied. If both endpoints of an edge fall in the same iteration, this data dependency is fused and does not need to be written to the state.
Fig. 6 presents four examples of schedule vectors and the corresponding data dependency graphs. In case ①, the schedule vector is 0 and no order adjustment is made; the simulation order is the same as that of full-cycle simulators. In ③, we increase the schedule values of f1 and f2 by 1, moving the evaluation of f1 and f2 ahead by 1 cycle. The adjusted data dependency graph is shown in ④, where f1 and f2 are shifted left by 1 iteration. After re-scheduling, the edges (f4, f1) and (f4, f2) fall into the same iteration and are fused. In ⑤, another schedule vector is assigned and 3 edges are fused. An invalid schedule is shown in ⑦⑧, where f2 depends on a future value of f4. Since the simulation is strict in iteration order, a dependency on a future value cannot be resolved correctly, leading to an invalid schedule.
If a wrong schedule vector is given, the re-scheduled QGraph may be invalid, as some nodes may depend on a result of a future iteration. To model this iteration dependency constraint, we first compute the re-scheduled queue capacity c'_{u,v} in Eqn. 3, c'_{u,v} = (j - s_v) - (i - s_u) = c_{u,v} + s_u - s_v, where i and j indicate the iteration numbers of the two endpoints, and s_u and s_v indicate the schedule vector values of u and v. The capacity must be non-negative, so that node v does not depend on a future value of node u. In Fig. 6, the re-scheduled queue capacities are annotated on the edges. As shown in ⑦, the edge violating the iteration dependency constraint has a negative queue capacity.
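The re-scheduling check is purely mechanical. The sketch below applies the capacity rule c' = c + s_u - s_v (our reading of Eqn. 3) to a small graph; the node names and edge topology are a hypothetical fragment loosely modeled on the Fig. 6 example, not the paper's exact figure.

```python
def reschedule(edges, s):
    """edges: (u, v, capacity) triples; s: schedule vector {node: cycles ahead}.
    Returns (new_capacities, fused_edges, valid)."""
    new_caps, fused, valid = {}, [], True
    for u, v, c in edges:
        c_new = c + s[u] - s[v]       # re-scheduled queue capacity
        new_caps[(u, v)] = c_new
        if c_new == 0:
            fused.append((u, v))      # producer and consumer share an iteration
        elif c_new < 0:
            valid = False             # v would need a future value of u
    return new_caps, fused, valid

# Hypothetical fragment: f4 feeds f1 and f2 through registers (capacity 1);
# f1 and f2 feed f3 combinationally (capacity 0). Shifting f1 and f2 ahead
# by one cycle fuses both register edges.
edges = [("f4", "f1", 1), ("f4", "f2", 1), ("f1", "f3", 0), ("f2", "f3", 0)]
caps, fused, valid = reschedule(edges, {"f1": 1, "f2": 1, "f3": 0, "f4": 0})
```

Note that fusing the two register edges raises the capacity of the formerly combinational edges into f3 from 0 to 1; deciding whether that trade is worthwhile is exactly what the objective function weighs.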
To formalize the optimization of the QGraph, we represent it as a linear-constraint integer programming problem in Eqn. 4: minimize F(s) subject to the constraint that every re-scheduled queue capacity c'_{u,v} must be equal to or greater than 0.
To accurately model the memory access, we introduce a cost function C(w, c) to represent the computational or memory cost of a queue with a bit-width of w and a capacity of c. The objective function for minimizing memory access is shown in Eqn. 5, where we sum the cost of each queue to obtain the total cost.
An edge (u, v) is fused when c'_{u,v} equals 0. However, such a 0/1 cost is very difficult to optimize. Thus, we smooth the cost function into a logarithmic function in Eqn. 6.
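The shape of this objective can be sketched with a stand-in cost function. The logarithmic form below is illustrative only; the paper's exact Eqn. 6 is not reproduced here, and all we rely on is that a fused edge (capacity 0) costs nothing and cost grows slowly with capacity, scaled by the signal width.

```python
import math

def edge_cost(width, capacity):
    """Illustrative smoothed cost of one queue (a stand-in for Eqn. 6).
    Zero when the edge is fused; grows logarithmically with capacity."""
    return width * math.log1p(capacity)

def objective(edges, widths, s):
    """Total cost F(s) in the shape of Eqn. 5: sum the cost of every
    re-scheduled queue, using c' = c + s_u - s_v."""
    return sum(edge_cost(widths[(u, v)], c + s[u] - s[v]) for u, v, c in edges)

# A single 8-bit register edge: scheduling b one cycle ahead of a fuses it.
edges, widths = [("a", "b", 1)], {("a", "b"): 8}
fused_cost = objective(edges, widths, {"a": 0, "b": 1})
unfused_cost = objective(edges, widths, {"a": 0, "b": 0})
```

The smoothing matters because a flat 0/1 cost gives the gradient-based solver in Section 5 nothing to descend along.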

Modeling Special RTL Components
We have added special models for broadcast signals and clock gates, which are critical in RTL simulation.

Modeling Broadcast Signal.
Broadcast signals are signals with very large fan-out that occur frequently in RTL designs. For example, a reset signal connects to almost all registers and is a typical broadcast signal. Due to the large fan-out, a small change in the schedule value at a broadcast signal's endpoint can cause a huge change in the objective function. We find that the queues next to a broadcast signal can be merged into a single queue, leading to a better-optimized memory footprint. Fig. 7 gives a motivational example of optimization for broadcast signals. As shown in Fig. 7a, following the QGraph modeling, we need to insert three queues with capacities of 1, 3, and 2, respectively. However, these three queues can be combined into a single queue of capacity 3 with multiple reads, as shown in Fig. 7b. The optimized model of broadcast signals is given in Eqn. 7: for a broadcast signal starting from node u to nodes v1, v2, ..., vn, the queue capacity is modeled as the maximum of the original queue capacities.

Modeling Clock Gate.
A clock-gated register is shown in Fig. 8a, where the registers are updated only when the clock gate is high. Khronos can improve the memory access of clock-gated registers. Fig. 8b shows the full-cycle simulation model and Fig. 8c shows the fused simulation model. In the fused model in Fig. 8c, the write operation immediately follows the read (through the MUX), so we treat the fused model as only one memory access. In Fig. 8b, however, the read operation near R can be far away from the register update operations, so we treat this as two separate memory accesses. Therefore, we model the cost of clock-gated registers in Eqn. 8, where we consider the fused version of a register with a clock gate to have half the cost of a normal register.
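The broadcast merge rule (shared queue capacity equal to the maximum of the original capacities) is easy to state in code. The read-offset bookkeeping below is our own implementation assumption about how the shorter consumers would tap the shared queue; the paper only specifies the max rule.

```python
def merge_broadcast(caps):
    """Replace the per-fanout queues after a broadcast signal (capacities
    c1..cn) with one shared queue of capacity max(ci). Each consumer i then
    reads the element max(ci) - ci positions from the shared queue's tail
    (the offset scheme is an assumption, not from the paper)."""
    shared = max(caps)
    return shared, [shared - c for c in caps]

# Fig. 7 example: queues of capacity 1, 3, and 2 collapse into one of capacity 3.
shared, offsets = merge_broadcast([1, 3, 2])
```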

OPTIMIZATION ALGORITHM
The optimization of the QGraph is an NP-hard problem, as the vector s can only take integer values. Furthermore, the optimization objective function of a QGraph can be complex and non-convex. The scale of the RTL graph is usually large, on the order of 10^4 ∼ 10^5, which makes the optimization even more challenging. To overcome these challenges, we propose an efficient gradient-based optimization algorithm that is a special case of the Frank-Wolfe algorithm [16]. Our algorithm linearizes the non-linear objective into a linear approximation to exploit the totally unimodular characteristic of the constraints in Eqn. 4. The totally unimodular characteristic ensures that the linear programming solution of Eqn. 4 is an integer solution when the objective function F(s) is linear [13]. This allows our algorithm to solve the problem by only running linear programming multiple times, without the need for time-consuming search algorithms such as branch-and-bound. Thus, our algorithm can obtain a good solution in almost linear time.
Our optimization algorithm commences with an initial solution s^(0) = 0. In each iteration, we obtain the subsequent solution s^(k+1) from s^(k). The objective function is approximated by a first-order Taylor expansion of F at s^(k), given by Eqn. 9. The optimization procedure is shown in Algorithm 1. The algorithm starts from the initial solution 0, which is the schedule of the full-cycle simulator. At each iteration, the gradient of the objective function is calculated and used as the objective vector of a linear programming problem, whose solution becomes the next iterate. The constraints of the linear programming problem are met throughout all iterations.
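The iterate-linearize-solve loop can be demonstrated end to end on a toy graph. This sketch makes two simplifications that the real system does not: the LP step is solved by brute-force enumeration (Khronos dualizes it to a min-cost flow problem), and the gradient is a forward finite difference of an assumed log-smoothed cost. Node names and the 4-stage chain are hypothetical.

```python
import itertools
import math

def feasible(s, edges):
    # Eqn. 4 constraints: every re-scheduled capacity c + s_u - s_v >= 0.
    return all(c + s[u] - s[v] >= 0 for u, v, c in edges)

def solve_lp(nodes, edges, grad, bound):
    """Stand-in LP step: enumerate integer schedules in [0, bound] and return
    the feasible one minimizing grad . s. Total unimodularity is what lets
    the real algorithm get integer optima from an LP instead."""
    best, best_val = None, math.inf
    for vals in itertools.product(range(bound + 1), repeat=len(nodes)):
        s = dict(zip(nodes, vals))
        if feasible(s, edges):
            val = sum(grad[n] * s[n] for n in nodes)
            if val < best_val:
                best, best_val = s, val
    return best

def optimize(nodes, edges, widths, iters=5, bound=3):
    """Iterative-linearization sketch of Algorithm 1."""
    def F(s):  # assumed log-smoothed objective (Eqn. 5/6 shape)
        return sum(widths[(u, v)] * math.log1p(c + s[u] - s[v])
                   for u, v, c in edges)
    s = {n: 0 for n in nodes}              # start from the full-cycle schedule
    for _ in range(iters):
        grad = {}                          # forward-difference "gradient"
        for n in nodes:
            s_up = dict(s); s_up[n] += 1
            grad[n] = (F(s_up) - F(s)) if feasible(s_up, edges) else 0.0
        nxt = solve_lp(nodes, edges, grad, bound)
        if nxt is None or F(nxt) >= F(s):
            break                          # no improving LP vertex: stop
        s = nxt
    return s

# A 4-stage register chain: the optimum shifts each stage one cycle further
# ahead, fusing every register and driving the total cost to 0.
nodes = ["f1", "f2", "f3", "f4"]
edges = [("f1", "f2", 1), ("f2", "f3", 1), ("f3", "f4", 1)]
sched = optimize(nodes, edges, {(u, v): 1 for u, v, c in edges})
```

On this chain the loop converges to the schedule (0, 1, 2, 3) in a few iterations, with every edge fused.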

Algorithm 1: Iterative optimization algorithm
Input: a queue-connected graph G
Input: the objective function F
Input: max iteration K
Output: the optimized schedule vector schvec

To enhance the performance of solving the linear programming problem, the algorithm dualizes the original linear program into a minimum cost flow problem in each iteration. The relationship between the primal and dual programs is shown in Table 3, and Eqn. 11 shows the formulation of the dual problem. The dual problem shares the same solution as the primal, and we utilize the NetworkSimplex [30] algorithm to efficiently solve the cost flow problem.

In real RTL designs, the optimization algorithm easily converges to a saddle point. Regardless of the objective function, the gradient vanishes at the saddle point and the solution cannot be further improved. An example of a saddle point is shown in Fig. 9, where s = (0, 0, 0, 0) and the gradient of the objective function with respect to s is (1, 1, 1, 1). When the gradient propagates through the graph, each component is added exactly once and subtracted exactly once, so the contributions mutually cancel out, the aggregated gradient of s becomes 0, and the optimization stops. However, the optimal solution is to set s = (0, 1, 2, 3). To escape from such a point, we add small random perturbations to the gradient so that the components cannot completely cancel each other, and the algorithm continues to improve the solution.
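The perturbation mechanism itself is tiny. The sketch below shows the idea only; the jitter magnitude, seeding, and component names are our assumptions, not values from the paper.

```python
import random

def perturbed_gradient(grad, eps=1e-3, seed=0):
    """Add a small random jitter to each gradient component so that
    contributions which would cancel exactly along the graph no longer sum
    to zero, letting the LP step move off the saddle point."""
    rng = random.Random(seed)
    return {n: g + rng.uniform(-eps, eps) for n, g in grad.items()}

# At the saddle point the aggregated gradient is all zeros; after
# perturbation the components are tiny but generically nonzero.
flat = {"s1": 0.0, "s2": 0.0, "s3": 0.0, "s4": 0.0}
jittered = perturbed_gradient(flat)
```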

IMPLEMENTATION DETAILS
We implement Khronos based on MLIR [26] and CIRCT [1] frameworks.In the following, we provide implementation details for the intermediate representations (IRs), front-end, optimization, and back-end of Khronos.
IR Design. Khronos designs 3 sets of IRs internally: HighIR, QueueIR, and LowIR. HighIR provides dedicated operations to model special RTL modules such as memories, clock gates, and plus arguments. QueueIR provides queue operations to capture the temporal data dependencies.
Frontend Pass. This pass uses the pattern matching and rewriting infrastructure of MLIR to find the special RTL objects in Section 4.3 and rewrite them to the dedicated operations in HighIR. After converting CoreIR to HighIR, we flatten the circuit into a single module. The flattened module has a graph-like form and can be accepted by Khronos.
Optimization Pass. This pass translates the HighIR to QGraph using the techniques in Section 4.1 and runs the optimization algorithm in Section 5. We use the LEMON graph library [23] as the minimum-cost flow solver. After optimization, HighIR is translated to QueueIR, where each register is attached with an attribute indicating whether it is fused, and the queue operations are inserted into the dataflow. QueueIR is then ready for the subsequent lowering and code generation passes.
The implementation of queues has a critical impact on the performance of the simulator. Khronos implements various queue types, as shown in Table 4. The simplest shift queue, NaiveQ, has a small overhead and is suitable for queues with small capacities. ShiftQ packs the elements of a queue into a bit vector and uses bit-shift operations to simulate the queue; it is well suited for queues with many small elements, such as the queue of a control signal. PtrQ is a pointer-based circular queue, which guarantees strictly O(1) complexity but has a larger overhead, making it suitable for large capacities. PtrQ is effective for deep pipeline and systolic architectures, which may produce very deep queues after memory fusion.
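The pointer-based variant is the interesting one, since shifting a deep queue every cycle would be linear in its depth. The sketch below shows the standard circular-buffer trick; the class name mirrors PtrQ, but the implementation details are ours, not Khronos's.

```python
class PtrQ:
    """Pointer-based circular queue with strict O(1) work per cycle,
    suited to the deep queues produced after memory fusion. A single
    index serves as both read and write pointer, because exactly one
    element is pushed and one popped per simulated cycle."""
    def __init__(self, capacity, undef=0):
        self.buf = [undef] * capacity
        self.head = 0

    def push_pop(self, x):
        out = self.buf[self.head]               # value from `capacity` cycles ago
        self.buf[self.head] = x                 # overwrite with this cycle's input
        self.head = (self.head + 1) % len(self.buf)
        return out

q = PtrQ(3)
out = [q.push_pop(v) for v in [1, 2, 3, 4, 5]]
# out == [0, 0, 0, 1, 2]
```

A NaiveQ would instead shift all elements each cycle (O(capacity) work), which is why it only pays off for very short queues.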
Backend. The backend first lowers QueueIR to LowIR and then translates it to LLVM IR. Khronos generates the LLVM dialect of MLIR and reuses the IR translation of MLIR to emit LLVM IR, which is then compiled into a linkable object file by the LLVM compiler infrastructure [25]. When the testbench file is changed, the object file only needs to be relinked instead of recompiled, saving a lot of compilation time.

EVALUATION

7.1 Evaluation Setup
Table 5 presents the benchmarks for evaluation, which have diverse behaviors, including cryptography, encryption, deep learning, and general-purpose processors [7,8,17,32,36,38,39,45]. All of these benchmarks are written in Chisel [9], and each benchmark generates a single CHIRRTL file as output, which serves as the input for the FIRRTL compiler or the CIRCT framework. The IR Nodes and IR Edges in Table 5 refer to the number of MLIR operations and the number of MLIR operands in the single flattened top module, respectively. Additionally, system calls such as printf have been removed from the designs to ensure a fair comparison, as these instructions are implemented differently in various simulators and can affect the statistical analysis.
The simulators under evaluation are shown in Table 6. There are two ways to generate Verilog for Verilator: one with CIRCT and the other with FIRRTL, named circt-verilator and verilator, respectively. ESSENT [10] and RepCut [42] accept FIRRTL IR directly, while Khronos accepts CIRCT IR directly. We use the latest Verilator, version 5.007. Among the simulators, Khronos and ESSENT support a single thread only, while verilator and RepCut support both single-thread and multi-thread settings. The suffix number {1,2,4} indicates the number of threads used for simulation. We compare using both single-thread (Section 7.3) and multi-thread (Section 7.4) settings.
We perform simulations on various benchmarks using different performance metrics. To ensure consistency in our results, we bind each simulator to specific cores to prevent core switching during the simulation. We repeat each run 10 times, with the first 3 runs treated as warm-up and excluded from the reported results.

Memory Access Comparison
Fig. 11a shows the memory access optimization profiled by the all-data-cache-accesses event, and Table 8 shows the fused register state size. We divide the benchmarks into 4 groups in Table 8: shallow pipeline indicates the design has only 1 or 2 pipeline stages; highly pipelined indicates the design has a deeper pipeline; systolic indicates the design is a systolic array; and partly pipelined indicates that only part of the design is pipelined. Overall, Khronos fuses 42% of the register state and reduces memory accesses by 38% on the geometric average across all benchmarks.
For highly pipelined or systolic designs, Khronos can reduce the number of memory accesses significantly. As shown in Table 8, Khronos fuses 40∼70% of the register state and reduces the memory accesses by 70∼95% in Fig. 11a for these designs. One of the most notable optimizations is Gemmini, where Khronos reduces the memory accesses by 93.8%. This is because Gemmini is fully pipelined for both control signals and data paths: to ensure a feasible hardware layout and routing, it inserts a large number of register buffers. These shift registers are fused by Khronos, leading to a large memory access reduction. For accelerators with shallow pipelines, such as SHA256 and StreamComp, Khronos fuses 20∼30% of the register state, leading to 60∼70% memory access reduction. For partly pipelined general-purpose processors, Khronos can optimize the pipelined data paths but not the finite state machines (FSMs). Khronos fuses 5∼20% of the register state and achieves 40∼50% memory reduction for these benchmarks.
In practice, the memory reduction achieved in Fig. 11a is much larger than the state fused by Khronos in Table 8. This is because state fusion creates more cross-cycle optimization opportunities that can be exploited by the back end. The back-end compiler can perform code combination, elimination, and other optimizations on logic originally in different cycles to achieve better performance. Fig. 10 presents one example, clock-gated pipe optimization, where Fig. 10a shows the RTL model of a clock-gated register pipe and Fig. 10b shows the fused model. As shown in Fig. 10b, if the valid signal is 1, the output data is the same as the input data; if the valid signal is 0, only d1 is output. LLVM finds the register value of d0 to be useless and eliminates it, because it is always the same as d1. The optimized simulation model is shown in Fig. 10c, where d1 is removed. In fact, for an arbitrarily long pipe, only the last node needs to be kept after fusion. This brings considerable optimization for designs that rely on pipes, such as Gemmini and SIGMA. Khronos and circt-verilator use the same frontend; therefore, the comparison can reflect the effect of our optimization clearly. On average, verilator-1, Khronos, ESSENT, RepCut-1 and vcs achieve 1.05x, 2.03x, 0.51x, 0.72x, and 0.50x speedup, respectively. For Gemmini, Khronos achieves a speedup of up to 4.3x.

Single-thread Speedup Comparison
FMUL, FPU, and SIGMA are pipelined designs, and Gemmini, GEMM, and Conv2D are systolic accelerators that can be seen as two-dimensional pipelines. These architectures divide the computation into multiple stages and connect the stages with register buffers. Each stage contains only limited combinational logic so that the hardware can run at a high frequency, which causes many memory accesses in RTL simulation. Khronos fuses all the register buffers, so input data is immediately propagated to the output; memory accesses are reduced and more signal accesses are saved, leading to better performance. Khronos achieves 2.5∼4.3x speedups for these designs.
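As a rough illustration of why fusing register buffers cuts memory traffic (a hand-written sketch, not code generated by Khronos; the `Pipe` type and its methods are hypothetical): simulating a register pipe one cycle at a time stores every pipe register per cycle, while evaluating two cycles together halves the stores per simulated cycle and keeps the in-flight intermediate value out of memory entirely.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical depth-4 register pipe, simulated two ways.
constexpr int DEPTH = 4;

struct Pipe {
    std::array<uint32_t, DEPTH> r{};  // register state held in memory

    // Baseline full-cycle model: one call per cycle, DEPTH stores per cycle.
    uint32_t step(uint32_t in) {
        uint32_t out = r[DEPTH - 1];
        for (int i = DEPTH - 1; i > 0; --i) r[i] = r[i - 1];
        r[0] = in;
        return out;
    }

    // Two cycles fused: each register is stored once for two simulated
    // cycles, so stores per cycle are halved; the value that would pass
    // through an intermediate register exists only as a local temporary.
    std::pair<uint32_t, uint32_t> step2(uint32_t in0, uint32_t in1) {
        uint32_t out0 = r[DEPTH - 1];
        uint32_t out1 = r[DEPTH - 2];
        for (int i = DEPTH - 1; i >= 2; --i) r[i] = r[i - 2];
        r[1] = in0;
        r[0] = in1;
        return {out0, out1};
    }
};
```

The two models are cycle-equivalent; the fused one simply batches state updates, which is the effect a backend compiler can then exploit further.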
The main architecture of SIGMA is shown in Fig. 3. SIGMA is highly pipelined to fully utilize all the hardware computation resources. As shown, ESSENT fails to optimize SIGMA due to its high logic activity. RepCut also fails on SIGMA because each layer has very simple combinational logic, leaving little room for replication; the complex data paths between stages force RepCut to spend a lot of time transferring data between threads, resulting in extremely poor performance. Khronos, however, can fuse 70% of the state and reduce memory accesses by 84%, yielding a 1.77x speedup.
RocketCore and SodorCore are both components of Chipyard. They have relatively simple implementations: the ALU computes addition in a single cycle and the multiplier is not pipelined. Although they are pipelined architectures, the pipeline depth is very limited, only 3 to 5 stages, and they contain a complex FSM used as a controller, which cannot be fused by Khronos. In the end, Khronos fuses about 10% of the state and reduces only about 40% of the memory accesses, speeding up these designs by 1.2∼1.5x.
Comparing the memory accesses in Fig. 11a with the speedups in Fig. 11b, we find that the number of memory accesses correlates well with the simulation speedup.

Performance Breakdowns.
To analyze the speedup of Khronos, we profile performance counters including instructions per cycle (IPC) and instruction count (OPS). The results are shown in Tables 9 and 10. In the tables, C-Ver means circt-verilator; the other simulators are abbreviated by their first three letters. Khronos successfully fuses memory accesses, leading to a smaller OPS and a better IPC. On average (geometric mean), Khronos achieves 1.31x better IPC with only 68% of the OPS of circt-verilator.
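For reference, the averages above are geometric means of per-benchmark ratios. A minimal sketch (the function name is ours) of how such a mean is computed:

```python
import math

def geomean(ratios):
    """Geometric mean of per-benchmark ratios (e.g. IPC or OPS vs. a baseline)."""
    assert ratios, "need at least one ratio"
    # exp(mean(log r)) is numerically safer than multiplying then rooting
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

Unlike an arithmetic mean, a 2x gain on one benchmark and a 0.5x loss on another average out to exactly 1.0x, which is why ratio summaries use the geometric mean.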
The design architecture has a large impact on the speedup of Khronos. In general, Khronos performs well for domain-specific accelerators, because they usually employ deep pipelines to provide high-throughput computation. For general-purpose processors, although they are designed with pipeline-like structures, their stall/flush logic complicates the cross-cycle data dependencies and makes them difficult to analyze and optimize.
For domain-specific accelerators, the speedup of Khronos varies with the internal pipeline structure. Complex designs such as FPU, Gemmini, and SIGMA have large structures inside the pipeline, such as multi-stage multipliers, adders, or crossbars. These structures incur a lot of communication between stages, and such communication can be fused by Khronos, reducing OPS by 40∼70% for these designs. For simple pipeline designs such as GEMM and Conv2D, the stage logic is simple and cache locality becomes the bottleneck. For example, GEMM is a systolic array with only a few 8-bit integers transferred between processing elements (PEs).
Since this data transfer is only a small portion of the design, Khronos does not reduce OPS much but achieves better cache locality, with a 1.5∼1.8x IPC improvement.

Multi-threading Speedup Comparison
We also compare Khronos with the multi-threaded settings of Verilator and RepCut, as shown in Fig. 12. We observe that multi-threading does not always guarantee better performance. For module-level designs, such as SodorCore and RocketCore, multi-threading incurs more synchronization cost than benefit, leading to worse performance. For large, SoC-level designs such as Gemmini and RocketChip, multi-threading is effective, as the design can be partitioned into small pieces and simulated in multiple threads. We leave extending Khronos with multi-threading support as future work.

Performance on Different Platforms
We also evaluate on multiple platforms. The platforms selected in Table 7 have similar CPU frequencies but vary greatly in cache size, which affects memory access locality. The experimental results are shown in Fig. 13. On plat-1, the average speedups of verilator-1, Khronos, ESSENT, and RepCut-1 are 1.08x, 2.13x, 0.57x, and 0.77x, respectively. On plat-2, the average speedups are 1.05x, 2.05x, 0.51x, and 0.72x. All the simulators behave similarly to the base platform. plat-1 uses a CPU from Intel, while the other two are from AMD; their cache sizes and management policies also differ, and such architectural differences can have a subtle influence on the results. For StreamComp, Khronos achieves a 1.2x larger speedup on plat-1 than on base. This is because StreamComp checks input data at different locations to run the compression algorithm, which generates a large number of random memory accesses; such irregular memory access is very sensitive to cache size.
The CPUs in plat-2 and base were released in the same generation; they have similar architectures and differ mainly in cache size. Almost all simulators perform slightly better on plat-2 than on base due to its larger cache, but the relative speedups are almost unchanged.

RELATED WORKS
Queue-Based RTL Modeling. The approach of modeling RTL as a queue-connected graph is used in several tools, such as Golden Gate, MIDAS, and FireSim (MIDAS II) [21, 22, 29]. MIDAS and FireSim run RTL simulations on cloud Field Programmable Gate Array (FPGA) clusters and can achieve simulation performance similar to FPGA prototyping. Golden Gate is the core compiler of FireSim and MIDAS. It compiles the synchronous sequential circuits (SSMs) in RTL designs into latency-insensitive bounded dataflow networks (LI-BDNs). An LI-BDN [40] is a special class of BDN where nodes are connected by queues and run their computation asynchronously. It decouples the timing of the FPGA host platform from that of the target RTL design, enabling the independent and automatic simulation of different RTL components. The major difference between our model and LI-BDNs is that ours is latency-sensitive: the queue capacity affects correctness in our latency-sensitive model, but not in LI-BDNs.
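The latency-sensitivity distinction can be made concrete with a toy model (ours, not Golden Gate's or Khronos's actual queue implementation): a queue of capacity D that is popped and pushed exactly once per cycle behaves identically to a D-deep register pipe, so the capacity *is* the observable latency. Changing it changes the simulated design, whereas an LI-BDN tolerates any sufficient capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy latency-sensitive queue: capacity == pipeline depth == latency.
// One pop + one push per cycle; a ring index avoids shifting, so the
// state needs only a single store per cycle regardless of depth.
struct FixedQueue {
    std::vector<uint32_t> buf;
    std::size_t head = 0;
    explicit FixedQueue(std::size_t depth) : buf(depth, 0) {}
    uint32_t cycle(uint32_t in) {
        uint32_t out = buf[head];  // value enqueued `depth` cycles ago
        buf[head] = in;
        head = (head + 1) % buf.size();
        return out;
    }
};
```

A depth-3 instance outputs each value exactly three cycles after it was pushed; resizing the buffer would change that latency and hence the simulated circuit, which is precisely why capacity matters in a latency-sensitive model.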
Open-Source Software RTL Simulators. Verilator [37] is a widely used open-source full-cycle simulator known for its speed and robustness. It translates Verilog and SystemVerilog sources to C++ code via AST transformations. To improve performance, Verilator uses a partitioning algorithm to group adjacent nodes into a task graph and runs it in a multi-threaded environment with a static scheduling algorithm. Yosys [44] is an RTL synthesis tool that also provides the functionality to translate synthesizable RTL into a C++ simulator. LLHD [35] and ARC are two RTL simulation tools integrated into the CIRCT [1] framework: LLHD offers a SystemVerilog [6] frontend and an event-driven simulator, while ARC is a full-cycle RTL simulator currently under development in the CIRCT repo. ESSENT [10] and RepCut [42] are two simulators based on the FIRRTL [15] intermediate representation.

Commercial Software RTL Simulators. Almost all commercial hardware design platforms come equipped with a corresponding software RTL simulator. One such simulator is ModelSim [2], developed by Intel and Siemens, which can provide highly detailed simulation waveforms. The Vivado [4] platform, developed by AMD Xilinx, integrates the XSIM simulator. Other vendors, such as Synopsys and Cadence, also offer software simulation solutions: VCS [3] from Synopsys is a widely used commercial simulator known for its performance, and Xcelium [5] is another powerful simulator from Cadence.
High-Level RTL Simulators. The level of abstraction of RTL is relatively low, which can negatively impact simulation performance. To address this, ArchHDL [33] provides a custom hardware description language and parallel simulation annotations that let users add hints to the hardware design to improve simulation performance. Mamba utilizes specialized Just-In-Time (JIT) optimizations to accelerate simulation on the PyMTL hardware construction platform [19, 20]. FLASH [12] extracts scheduling information from the HLS tool and uses it to automatically construct an equivalent cycle-accurate simulation model that preserves RTL semantics.
GPU-Based RTL Simulators. RTLFlow [27] extends Verilator to support multi-stimulus RTL simulation on GPGPUs. It translates RTL into CUDA kernels and leverages CUDA Graph for efficient runtime execution; the generated code is optimized for memory layout and task scheduling to achieve high-performance multi-stimulus simulation. GPU architectures require highly regular computation, which makes accelerating single-stimulus RTL simulation on GPUs very difficult. SAGA [41] separates the RTL graph into subsets based on sink nodes and assigns each block to one CUDA thread. Chatterjee et al. [11] merge small RTL operations into large macro-gates to create a better-balanced partition. Qian et al. [31] utilize the CMB algorithm to enhance the parallelism of simulation on GPUs. These techniques require multiple stimuli or very large or regular designs to fully utilize the parallelism of GPGPUs.

CONCLUSION
In this work, we introduce Khronos, a cycle-accurate software RTL simulation tool that optimizes cross-cycle memory accesses to achieve better memory locality. Our analysis and experiments reveal that memory access is a bottleneck of RTL simulation, and the key insight of Khronos is that a large number of memory accesses across consecutive clock cycles are redundant. To fuse these redundant memory accesses, Khronos uses a queue-connected operation graph to capture cross-cycle data dependencies and formulates the fusion problem as a complex integer programming problem. We propose a gradient-based optimization algorithm that solves this problem by iteratively linearizing it to a minimum-cost flow problem.

Figure 2 :
Figure 2: The proportion of data memory access instructions to the total dynamic instructions of Verilator for different benchmarks.

Figure 5 :
Figure 5: Example RTL and its translation to a QGraph.

Figure 6 :
Figure 7 :
Figure 6: The schedule vector and the adjusted simulation order of Khronos.

Figure 8 :
Figure 8: Cross-cycle optimization for the clock-gated pipe.

Figure 9 :
Figure 9: An example of a saddle point.

Figure 10 :
Figure 10: Cross-cycle optimization for the clock-gated pipe.

7.3.1
Speedup Analysis. The speedup of different simulators on various benchmarks is presented in Fig. 11b. We normalize the speedup to the performance of circt-verilator. The geometric average of the memory access reduction is calculated as $1 - \sqrt[n]{\prod_i (1 - r_i)}$, where $r_i$ is the reduction on the $i$-th design.
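The aggregation in the footnote can be sketched as follows (function and variable names are ours): per-design reductions $r_i$ are combined via the geometric mean of the surviving fractions $1 - r_i$, then complemented.

```python
def mean_reduction(reductions):
    """1 - nth root of prod(1 - r_i): geometric-mean aggregate of reductions."""
    n = len(reductions)
    surviving = 1.0
    for r in reductions:
        surviving *= (1.0 - r)   # fraction of accesses remaining on design i
    return 1.0 - surviving ** (1.0 / n)
```

This keeps the aggregate consistent with multiplicative composition: two designs each retaining half their accesses yield an average reduction of exactly 50%.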
Both are built on the FIRRTL intermediate representation and exploit advanced graph partition techniques on RTL designs. ESSENT mixes event-driven and full-cycle simulation by partitioning the RTL graph into many small DAG blocks and tracking signal changes dynamically; during simulation, ESSENT skips inactive blocks whose input signals are unchanged from the previous iteration. RepCut is based on ESSENT and addresses the thread synchronization problem in multi-threaded RTL simulation by replicating some signals shared between threads to reduce synchronization overhead.

Figure 13 :
Figure 13: Performance on different platforms.

Our algorithm linearizes the problem to a minimum-cost flow problem and costs almost linear time to obtain a good solution. By fusing successive memory accesses across simulation cycles, Khronos reduces cache accesses by 70∼95% on pipelined architectures and achieves 2.0x (up to 4.3x) acceleration. Our memory access reduction results demonstrate the large potential of temporal optimization opportunities in full-cycle RTL simulation.
• We propose Khronos, a cycle-accurate software RTL simulation tool. Khronos can fuse state accesses with temporal locality, reducing memory pressure to accelerate simulation.
• We formulate an optimization problem to minimize the number of memory accesses and propose an efficient algorithm for it. Our algorithm exploits the gradient method to solve the complex problem in a constant number of iterations and costs almost linear time.
• We implement Khronos on multiple levels of IRs, leveraging the modular and extensible MLIR framework.

Table 1 :
Comparison of full-cycle simulation and event-driven simulation.

Table 2 :
Translation rules for RTL operation and QGraph.
4.3.2 Modeling Clock Gate. Registers with clock gates are quite common in RTL designs. In RTL simulation, the clock gate is modeled as a mux. The simulation model of the clock gate is shown in Fig.

Table 3 :
The primitive programming and dual problem.

Table 4 :
The implementation of queues in Khronos.

Table 7 :
Evaluation Setting

Table 8 :
The init state size and fused state size

All simulators are evaluated with the same compilation optimization level, -O2. While the other simulators generate C++ code, Khronos generates LLVM IR directly. The evaluation setting is listed in Table 7. As shown, we test on different platforms and use the latest versions of verilator and FIRRTL. VCD dumping is disabled in all evaluation runs. vcs is evaluated on the base platform.
being used as a warm-up simulation to account for the overhead associated with cold starts. circt-verilator, Khronos, and VCS compile all benchmarks without issues. verilator-4 fails on StreamComp with an out-of-memory error (96 GB memory consumed). ESSENT generates invalid C++ code on StreamComp. RepCut overflows the stack on FMUL and times out on StreamComp (1 hour consumed). The optimization level of ESSENT is -O3.

Table 9 :
Instructions Per Cycle