Scheduling and Physical Design

In a typical integrated circuit electronic design automation (EDA) flow, scheduling is a key step in high-level synthesis, the first stage of the EDA flow, which synthesizes a cycle-accurate register transfer level (RTL) design from a given behavior description, while physical design is the last stage of the EDA flow, which generates the final geometric layout of the transistors and wires for fabrication. As a result, scheduling and physical design are usually carried out independently. In this paper, I discuss multiple research projects that I have been involved with, where the interaction between scheduling and physical design is shown to be highly beneficial. I shall start with my very first paper in EDA, on multi-layer channel routing, which benefited from an unexpected connection to the optimal two-processor scheduling algorithm, a joint work with Prof. Martin Wong, who is being honored at this conference with the 2024 ISPD Lifetime Achievement Award. Then, I shall further demonstrate how scheduling can help to overcome the interconnect bottleneck, enable parallel placement and routing, and, finally, play a key role in layout synthesis for quantum computing.


INTRODUCTION
A typical integrated circuit electronic design automation (EDA) flow goes through four stages (Figure 1). The first stage is high-level synthesis (HLS) [22,11], which takes in a behavior description, often in C, C++, or SystemC, and generates a cycle-accurate register transfer level (RTL) circuit expressed in a hardware description language, such as Verilog or VHDL. Sometimes, this stage is done manually by designers instead of using HLS tools. The second stage is logic synthesis, which takes the RTL circuit as the input and performs sequential and combinational logic optimization to minimize the number of literals and the depth of the resulting Boolean networks [32]. The third stage is technology mapping (e.g., [27,9]), which takes a generic Boolean network as input and maps it to a given gate library used for chip implementation. For example, if an FPGA is used for circuit implementation, the logic gates in the library will be K-input lookup tables. The fourth and last stage is physical design. This stage includes placement, which determines the positions of logic gates under various objective functions (such as minimizing the total estimated wirelength and the longest delay), and routing, which generates wires to complete the interconnections of the placed logic gates.
Scheduling and resource binding are key steps in HLS. In particular, the scheduling algorithms take the behavior description as input, generate a control and data flow graph (CDFG), and decide at which clock cycle each computation node in the CDFG will be performed. Since scheduling is carried out in the first stage of the EDA flow, while physical design is done in the last stage, they usually take place independently. However, in this paper, I shall discuss multiple research projects that I have been involved in, where the interaction between scheduling and physical design is shown to be highly beneficial.
This paper is organized as follows. Section 2 presents an early work using the optimal two-processor scheduling algorithm to help multi-layer channel routing. Section 3 discusses how to use scheduling to cope with the interconnect bottleneck to support multi-cycle on-chip communication and enable the physical design tools to achieve high clock frequencies. Section 4 demonstrates how proper scheduling with global interconnect pipelining can enable effective parallelization of placement and routing tools. Section 5 describes the integrated role of scheduling, placement, and routing in layout synthesis for quantum computing. Finally, Section 6 concludes the paper with a recommendation for the physical design community to broaden the optimization scope from the spatial domain to include the time domain as well. This paper is part of the special session celebrating Prof. Martin Wong's ISPD Lifetime Achievement Award.
SCHEDULING AND MULTI-LAYER CHANNEL ROUTING
The goal of channel routing is to connect two rows of terminals using as few horizontal routing tracks as possible, so as to minimize the routing area. A two-layer channel routing example is shown in Fig. 2(a). At that time, we expected that the VLSI technology would advance so that more metal layers would become available. Thus, my first assignment was to develop an efficient three-layer channel routing algorithm.
After some literature study, I was impressed by the quality of several two-layer channel routing algorithms (e.g., [38,53,40]), as they could produce optimal solutions to a number of complex channel routing problems, including Deutsch's difficult example [18], despite the fact that the channel routing problem is NP-hard in general. So, I decided to take a somewhat unusual approach: transform a good two-layer channel routing solution into a three-layer solution (different from other approaches at that time, which tried to construct a three-layer solution directly). Since the goal is to minimize the number of horizontal tracks, it was natural to assume that one would use both layers 1 and 3 for horizontal wires and layer 2 for vertical wires. Such a transformation is fairly straightforward, except that one has to avoid the overlap of the two vertical wires going to the lower and upper terminal rows at each column. For the example illustrated in Fig. 2(a), if tracks 1 and 2 are distributed to the first track on layers 1 and 3, respectively, there would be a conflict in column 3, because the two vertical wires going to the upper and lower terminal rows would touch. In this case, tracks 1 and 2 would each have to pair with an empty track, as shown in Fig. 2(b). However, if we permute tracks 2 and 3, then the first four tracks can be distributed to two tracks on layers 1 and 3 without causing any conflict, while tracks 5 and 6 still have to be paired with an empty track each, due to the conflict at column 6, as shown in Fig. 2(c).
In order to determine the best track permutation without causing any vertical wire overlap, I introduced the concept of the track ordering graph (TOG), where each node represents a track in the given two-layer channel routing solution. There is a directed edge (i, j) if there is a via in track i above a via in track j. The TOG of the example in Fig. 2 is shown in Fig. 3. Under this formulation, any topological order of the TOG gives a valid track permutation of the input two-layer solution. The question is how to find the best topological order of the TOG so that the resulting three-layer solution has the fewest number of tracks.
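As a concrete sketch, the TOG can be constructed directly from the via positions of the two-layer solution. The representation below (tracks listed top to bottom, each with its set of via columns) and the assumption that an edge arises exactly when an upper track shares a via column with a lower track are mine for illustration; the precise construction is in [14,15].

```python
def build_tog(track_vias):
    """Construct the track ordering graph (TOG) of a two-layer routing
    solution.  track_vias[t] is the set of columns where track t has a
    via, with track 0 the topmost.  An edge (i, j) is added when an
    upper track i shares a via column with a lower track j, i.e., a via
    of i sits above a via of j, so i must precede j in any valid track
    permutation."""
    edges = []
    for i in range(len(track_vias)):
        for j in range(i + 1, len(track_vias)):
            if track_vias[i] & track_vias[j]:   # vias in the same column
                edges.append((i, j))
    return edges
```

Any topological order of the resulting edge set is then a valid track permutation.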
After some discussions, Martin Wong made an insightful observation: this problem is equivalent to the two-processor unit-task scheduling problem, where each track can be viewed as a unit task and the TOG can be considered as the task precedence constraint graph. When two tasks can be scheduled and executed in the same time step, the corresponding tracks can be distributed to the same track on the two horizontal layers in the resulting three-layer solution. Otherwise, when an idling slot is introduced in the two-processor schedule, it implies that an empty track needs to be inserted. Although the general multi-processor scheduling problem is NP-hard, the two-processor unit-task scheduling problem can be solved optimally in linear time [21].
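To make the correspondence concrete, here is a minimal Python sketch that schedules tracks on two "processors" (the two horizontal layers) subject to the TOG precedence constraints. This is a simple greedy list-scheduling heuristic for illustration only; obtaining the optimal pairing requires the linear-time two-processor scheduling algorithm of [21]. The track ids and edges are illustrative.

```python
from collections import defaultdict, deque

def pair_tracks(tracks, tog_edges):
    """Greedily schedule tracks on two 'processors' (the two horizontal
    layers) subject to the TOG precedence edges.  Each step becomes one
    track position of the three-layer solution; a step holding a single
    track corresponds to pairing it with an inserted empty track."""
    indeg = {t: 0 for t in tracks}
    succ = defaultdict(list)
    for u, v in tog_edges:               # edge (u, v): u must precede v
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(t for t in tracks if indeg[t] == 0)
    steps = []
    while ready:
        pair = [ready.popleft()]         # take up to two ready tracks
        if ready:
            pair.append(ready.popleft())
        steps.append(tuple(pair))
        for t in pair:                   # release the successors
            for s in succ[t]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
    return steps
```

Two tracks placed in the same step are guaranteed to be incomparable in the TOG, so they can safely share one track position on layers 1 and 3.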
Combining the optimal track permutation solution with a few other techniques, such as local re-routing and singular-track shifting (formulated as a shortest path problem), we were able to achieve the first optimal three-layer channel routing solution to Deutsch's difficult example and outperformed other constructive approaches to the three-layer channel routing problem on multiple benchmark examples. This work became my first publication in EDA [14,15]. Indeed, two years after the paper was published, Intel introduced three-layer metal routing technology in its 80486 microprocessor designs [intel-process]. I was grateful to have Martin Wong as a collaborator for my first EDA project. This work left me with a deep impression that scheduling might play an unexpected role in some physical design problems.
This transformation-based approach can be extended to four-layer channel routing as well [15]. However, VLSI technology has made significant advances since we published the multi-layer channel routing paper in 1987. The recent Intel 4 technology detailed in 2022 [42] employs 16 metal layers for routing. It is unlikely that the transformation-based approach for multi-layer routing can still be applied effectively. Nevertheless, my lab used a similar approach to construct 3D-IC placements. Given the significant progress in 2D circuit placement in the decade of the 2000s [13,34], we chose to generate a good 3D-IC placement solution from a high-quality 2D-IC placement solution, as shown in [12]. That work explored three possible folding schemes, local-stacking, fold-2, and fold-4, as illustrated in Fig. 4, for constructing a 3D-IC placement solution. By properly combining these methods, guided by efficient thermal estimation models and optimization procedures, a range of 3D-IC placement solutions can be obtained with different wirelength and through-silicon-via usage trade-offs under a given thermal constraint. This work was published at ASP-DAC 2007 and received the ASP-DAC 2017 Ten-Year Retrospective Most Influential Paper Award. Since the number of logic layers in a 3D-IC is usually between two and four in the current technology, such a transformation-based 3D-IC placement method can still be applied today.
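To illustrate the flavor of these folding schemes, below is a minimal Python sketch of a fold-2-style transformation. The coordinate conventions (folding along the vertical center line and mirroring the right half onto layer 1) are my simplifying assumptions for illustration, not the exact procedure of [12].

```python
def fold2(cells, width):
    """Fold a 2D placement in half along the vertical center line into a
    two-layer 3D placement: cells on the left half stay on layer 0, and
    cells on the right half are mirrored onto layer 1, so that cells
    near the fold line remain close to each other (now stacked, and
    connectable by a through-silicon via)."""
    placed = {}
    for name, (x, y) in cells.items():
        if x < width / 2:
            placed[name] = (x, y, 0)          # left half: keep on layer 0
        else:
            placed[name] = (width - x, y, 1)  # right half: mirror onto layer 1
    return placed
```

A fold-4 scheme would fold the result once more along the horizontal center line, yielding four layers.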

SCHEDULING AND INTERCONNECT OPTIMIZATION
As transistors scaled down to submicron dimensions in the early 1990s, interconnect delays began to overshadow logic delays and became the dominating factor in determining the clock frequency. Extensive research on interconnect optimization took place in the decade of the 1990s, with many advances in interconnect topology optimization, optimal wire-sizing, simultaneous routing and repeater insertion, and timing-driven partitioning and placement for generating the physical hierarchy to guide interconnect planning. A good portion of these research results was summarized in the paper titled "An Interconnect-Centric Design Flow for Nanometer Technologies" [7]. However, despite these great efforts, it was clear by the early 2000s that multiple clock cycles were needed to cross the chip [26,10] for high-performance designs. This presented a need and an opportunity to consider computation and communication scheduling together with physical design to deal with multi-cycle on-chip communication.
Our initial effort was to develop a regular distributed register (RDR) microarchitecture that divides the entire chip into an array of islands. The registers and functional units are distributed to each island, and all local communication within an island can be completed in a single clock cycle. The local registers in each island are divided into (up to) k banks (where k is the maximum number of cycles needed to communicate across the chip), so that the registers in bank i hold their results for i cycles when communicating with another island that is i cycles away.
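A small sketch of the bank-selection arithmetic may help. The assumptions below (Manhattan distance between island coordinates, and a fixed number of islands crossable per cycle) are mine for illustration, not the exact delay model of the RDR work.

```python
import math

def comm_cycles(src, dst, islands_per_cycle=1):
    """Clock cycles needed to communicate between islands src and dst
    (given as (row, col) coordinates) in an RDR-style island array.
    Assumes Manhattan distance and that a signal can cross
    `islands_per_cycle` islands per cycle -- both simplifying
    assumptions.  A result destined for an island i cycles away is held
    in register bank i at the source island."""
    dist = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return max(1, math.ceil(dist / islands_per_cycle))
```

For example, a transfer to an island three hops away takes three cycles, so the result is kept in bank 3 until it is consumed.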
Associated with this RDR architecture, we developed the MCAS (multi-cycle architectural synthesis) system [10]. It takes a C program as the input specification, derives the corresponding CDFG, and performs resource allocation and binding to generate an interconnected component graph (ICG), which consists of a set of components (i.e., functional units) that the operation nodes are bound to; the components are interconnected by a set of wires for data transfers. At its core, MCAS performs scheduling-driven placement, which takes the ICG as input, places the components in the island structure of the RDR microarchitecture, and returns the island index of each component. After the scheduling-driven placement, both the CDFG schedule and the layout information are produced. To further minimize the schedule latency, MCAS performs placement-driven scheduling together with resource re-binding to further optimize the interconnects, based on the force-directed list-scheduling framework. The results were promising: compared to the traditional four-stage design flow outlined in Section 1, MCAS reported a 44% improvement on average in terms of the clock period and a 37% improvement on average in terms of the final latency for data flow examples, and a 28% clock-period reduction and a 23% latency reduction on average for designs with control flow. Such significant improvement is difficult to achieve by optimizing in the physical design space alone.
However, the MCAS flow was not widely adopted, for multiple reasons: (i) the industry was not ready to adopt the HLS technology in the early 2000s; in fact, Synopsys ended the Behavioral Compiler effort (its HLS effort at that time) in 2004 [46]; (ii) the RDR architecture had some limitations (e.g., it did not support interconnect pipelining); and (iii) the HLS engine used in MCAS was not robust enough. As a result, the community had to wait another 15 years for more mature tools that could effectively use scheduling to help physical design.
Seeing the benefit of HLS in global interconnect optimization, as well as its potential in supporting software-hardware co-design for system-on-chip platforms and reducing the verification cycle by raising the level of abstraction, my lab continued to invest in HLS throughout the 2000s, despite the field being at its lowest point. We were the first group to adopt the LLVM compilation framework for HLS, now a widely accepted practice in the industry. We also made a number of algorithmic advancements, including platform-aware HLS, scheduling using systems of difference constraints, and automatic memory partitioning, which were summarized in [55,11]. These developments led to a spin-off company, named AutoESL Design Technologies, Inc., which was acquired by Xilinx in 2011 (Xilinx is now part of AMD).
Built upon AutoESL's HLS tool AutoPilot, Xilinx started to offer robust HLS solutions, named Vivado HLS and, more recently, Vitis HLS, from C/C++ specification to RTL circuits for FPGA designs. Vivado HLS and Vitis HLS received wide adoption from both academia and industry in the decade of the 2010s. In the meantime, the on-chip global interconnect limitation was getting worse. High-end FPGAs adopted the silicon-interposer technology, which integrates multiple FPGA dies on a common silicon substrate to form a large FPGA. For example, the AMD/Xilinx XCU250 FPGA shown in Fig. 5 consists of four dies. In this case, the global interconnect delay increases further when a wire crosses the die boundary, presenting more significant challenges to both HLS (due to the lack of knowledge of which connections will become global interconnects) and physical design (due to the difficulty in achieving timing closure in the presence of many global interconnects).
To address these challenges, in the early 2020s we developed an efficient framework named TAPA [25] that interleaves HLS (especially scheduling) with physical design for a large class of HLS applications that consist of a set of concurrent tasks interacting via FIFOs. Many applications, such as systolic arrays [52], stencil computation [4], and various graph algorithms [6,5], can be implemented in this style. Such dataflow designs have the potential to be latency-insensitive, as advocated in [2], although extra care is required to balance communications due to reconvergent paths, which will be discussed below.
TAPA provides simple APIs to describe the communication channels between the concurrent tasks written in C/C++. TAPA first extracts the concurrent tasks and synthesizes each task using the Vivado/Vitis HLS tools to get its RTL representation and obtain an estimated area for each task. Then it performs coarse-grain placement of the tasks into user-defined islands (similar to the concept of islands in RDR, each of which can be covered in a single clock cycle) utilizing the AutoBridge method [23] via partitioning-based
placement (other placement methods can also be used). Based on the island-level placement, TAPA optimally computes the pipeline stages between each pair of tasks to pipeline the global interconnects between tasks, so as to meet the clock frequency requirement while minimizing the total number of pipelining registers needed. This formulation also imposes a constraint to equalize the pipeline delays of different paths at each reconvergent node. This can be achieved by solving an integer linear programming problem, which is in the form of a system of difference constraints [16] and, thus, solvable in polynomial time. Finally, TAPA generates the RTL circuits of the FIFO and pipelining logic that connect the tasks, and combines them with the RTL circuits generated by the Vivado/Vitis HLS tools for each task. Together with all the RTL designs, TAPA also produces a layout constraint file based on the coarse-grain placement result and passes it, along with the combined RTL circuits, to the downstream physical design tools. The entire TAPA flow is illustrated in Fig. 6.
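As background on why such formulations are easy to solve: a system of difference constraints reduces to a shortest-path computation. Below is a minimal Python sketch of this classic reduction (cf. [16]). It is the general feasibility procedure, not TAPA's actual pipelining engine, and the example variables and constraints are illustrative.

```python
def solve_difference_constraints(n, constraints):
    """Find integers x_0..x_{n-1} satisfying every constraint
    x_j - x_i <= c, given as (i, j, c) triples, via Bellman-Ford
    relaxation from an implicit zero-distance source.  Returns a
    feasible assignment, or None if the system is infeasible (i.e.,
    the constraint graph has a negative cycle).  An equality, such as
    balancing the pipeline delays of two reconvergent paths, is
    written as two opposite inequalities."""
    dist = [0] * n                       # implicit source: distance 0 to all
    for _ in range(n + 1):               # n+1 rounds; the last detects cycles
        changed = False
        for i, j, c in constraints:      # constraint x_j - x_i <= c
            if dist[i] + c < dist[j]:    # relax edge i -> j of weight c
                dist[j] = dist[i] + c
                changed = True
        if not changed:
            return dist
    return None                          # still relaxing: infeasible
```

Each variable can be read as the cumulative number of pipeline stages accumulated at a task; a constraint bounds the stages inserted on one channel.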
The results of TAPA are impressive. It was tested on 43 designs, improving the average frequency from 147 MHz to 297 MHz with a negligible area overhead. For example, Figure 7 shows a CNN accelerator implemented on the AMD/Xilinx U250 FPGA. It interacts with three DDR controllers, marked as the grey, pink, and yellow blocks in the figure. Without using TAPA, the Xilinx physical design tool tries to pack the whole design into die 2 and die 3 to reduce the amount of cross-die interconnections. In contrast, TAPA is able to distribute the logic over the four dies and avoid overlapping the user logic with the DDR controllers. Additionally, TAPA pipelines the FIFO channels that cross the die boundaries. Since TAPA and AutoBridge were introduced in the early 2020s, they have been used in a number of FPGA accelerator designs from multiple universities around the world, such as matrix multiplication and CNN acceleration in systolic arrays [52], shortest-path acceleration [5,6], sparse linear algebra solvers [43,44,45], and stencil computation [51].

SCHEDULING AND PARALLEL PHYSICAL DESIGN
It turns out that the way TAPA interleaves HLS with physical design also enables scalable parallel placement and routing for FPGA implementation, a very difficult problem for the physical design community. (The methodology can be easily generalized to ASIC designs as well.) The source of difficulty is the need for timing optimization. In theory, one may partition a large design into smaller sub-circuits (with cut-size minimization), assign them to different regions on the chip, and perform placement and routing on each sub-circuit independently. In this case, it is not difficult to achieve timing closure for each sub-circuit within its assigned region. But there could be many signal paths going through multiple sub-circuits, and it is very difficult to guarantee the timing closure of these global paths when we perform placement and routing for each sub-circuit independently. Due to this limitation, commercial placement and routing tools suffer from a low degree of parallelism. For example, the study in [24] profiled the CPU utilization of a 14-hour compilation of a convolutional neural network (CNN) design on an FPGA by the commercial Xilinx Vivado tool suite. Vivado used only 2.1 cores on average during the placement and routing steps with timing optimization. When one increases the number of CPU threads from one to eight, the runtime reduction is less than 2×, underscoring the challenge of parallelization.
The TAPA approach greatly simplifies this problem, as the signals crossing any two regions are registered. As a result, any cross-region global interconnect is reduced to a sequence of local interconnects, properly pipelined at the region boundaries. Thus, the global timing closure problem is eliminated. We implemented a parallel FPGA physical design system, named RapidStream, based on the TAPA design flow presented in the preceding section. It consists of three phases. In Phase 1, RapidStream employs TAPA to map an HLS dataflow design to the disjoint islands and ensure that every inter-island connection is pipelined with an anchor register at the region boundaries. This provides timing isolation that is crucial to the subsequent parallel placement and routing.

Figure 9: CPU and memory usage of the RapidStream run on the CNN design. Reproduced from [24].
During Phase 2, RapidStream carries out parallel placement and routing of the disjoint islands and inserts the anchor registers. In the placement step, RapidStream iteratively co-optimizes the placement of the anchors and the circuit inside each island, since they are inter-dependent. Also, before stitching together all the global signals, RapidStream pays special attention to clock routing to ensure that the clock skews are well managed, with a fixed clock delay and entry point for each island, so that the islands can be stitched together later on to achieve a zero-skew clock tree. Without this step, one may run into hold violations when stitching the solutions from different islands.
In Phase 3, RapidStream implements a stitcher based on the RapidWright framework [28] to stitch together the physical netlists from the post-routing solutions of all the islands, while eliminating the possible routing conflicts due to the inter-island anchor nets (resulting from their accesses to the shared switchboxes between the islands) via partial rip-up and reroute. Fig. 8 illustrates the three-phase process of RapidStream. The details of each phase can be found in [24].
The speedup by RapidStream is significant. While Vivado uses 2.1 cores on average and runs for about 14 hours on the CNN design shown in Fig. 9, RapidStream uses 26 cores on average and runs for about 2 hours. RapidStream was further evaluated on two large stencil computation designs generated by the SODA compiler [4]. Compared with the baseline Vivado flow from Xilinx, RapidStream 2.0 achieves 5.1× (from 7.6 hours down to 1.5 hours) and 7.5× (from 15 hours down to 2 hours) shorter runtimes, respectively, and a 10-20% improvement in clock frequencies. Clearly, this high degree of parallelism in physical design was made possible by the effective pipelining of all global interconnects to decouple the interaction of different islands during timing optimization, yet another accomplishment realized by integrating scheduling and physical design. Both the TAPA and RapidStream systems are open-source projects [50,36]. They are also being extended and commercialized by RapidStream Design Automation, Inc. [37], a recent spin-off from UCLA.

LAYOUT SYNTHESIS FOR QUANTUM COMPUTING
When we come to quantum computing, it turns out that scheduling and physical design are tightly integrated. The advances in quantum computing (QC) platforms in the past decade are exciting. For example, superconducting qubit systems of over a hundred qubits have been developed and are now commercially available [1,39]. Trapped-ion systems in long chains or quantum charge-coupled device (QCCD) [35] architectures and neutral atoms trapped in arrays of optical tweezers [41,19] have also been developed, reaching very low gate errors [17,33,20]. IBM recently announced the plan for building a 100,000-qubit quantum processing unit (QPU) [3]. Thus, there is a strong interest and need for efficient and scalable design automation tools, or quantum compilation tools, to effectively map various quantum applications to different QPUs. A quantum application is usually expressed by a program with a sequence of quantum operations (which are unitary operations) [31]. Fig. 10 shows an example of a quantum circuit consisting of four qubits, nine single-qubit gates, and ten two-qubit gates. Quantum compilation for a given quantum program and QPU typically goes through two stages: (i) logic synthesis, which transforms a quantum program into a quantum circuit using only the native gate set of a given quantum computer (similar to the logic synthesis and technology mapping stages in integrated circuit designs), and (ii) layout synthesis, which performs gate scheduling, placement, and routing, together with the possible insertion of SWAP gates [48] or atom movements [47], to conform to the connectivity constraint of the chosen QPU. In other words, one needs to decide both the space and time coordinates of each quantum gate execution during the computation, which truly requires simultaneous consideration of scheduling and physical design.
For example, Fig. 12 depicts a compilation result of the circuit shown in Fig. 10 for the five-qubit quantum device shown in Fig. 11. This compilation result uses nineteen time steps and two SWAP gates, indicated by the dotted boxes. The mapping from the program qubits to the physical qubits is specified next to each program qubit line, and the scheduling of gate execution is described by the time row. A SWAP gate is used to modify the qubit mapping solution. For example, initially, q0 is mapped to p2, but after time step eight, q0 is mapped to p3 instead because of the effect of the SWAP gate.
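The effect of a SWAP gate on the qubit mapping can be stated in a few lines of Python. The sketch below is a generic illustration of this bookkeeping (the qubit names are illustrative), not code from any particular compiler.

```python
def apply_swap(mapping, p_a, p_b):
    """Return the updated program-to-physical qubit mapping after a
    SWAP gate on physical qubits p_a and p_b: any program qubit
    residing on one of the two physical qubits moves to the other."""
    return {q: (p_b if p == p_a else p_a if p == p_b else p)
            for q, p in mapping.items()}
```

A layout synthesizer inserts such SWAPs so that, at the time each two-qubit gate executes, its two program qubits are mapped to adjacent physical qubits.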
Figure 10: A circuit implementing the Toffoli gate with one ancilla qubit. Reproduced from [30].
Figure 12: Layout synthesis result of TB-OLSQ2 for the Toffoli circuit on IBM QX2. Reproduced from [30].
In the noisy intermediate-scale quantum (NISQ) era, the compiler optimization objective is to minimize the resulting circuit depth or the total number of SWAP gates, as they both affect the circuit fidelity.
Quantum layout synthesis continues to be an active area of research, despite the multi-decade-long research efforts on this topic. A study in 2020 [49] revealed surprisingly large optimality gaps (5-45× larger depths, even for near-term feasible circuit sizes with about 50 qubits) in the leading layout synthesis tools from industry and academia at that time, employing a set of carefully constructed quantum circuit examples with known optimal solutions (QUEKO circuits). A more recent study in 2023 [30] showed that even a well-engineered leading layout synthesis tool, such as SABRE [29], which has been incorporated into the IBM Qiskit compiler, can still consume 6-7× more SWAP gates than the optimal solutions when evaluated on a large collection of QUEKO circuits and practical quantum circuits.
In order to achieve an optimal solution to the layout synthesis problem, various efforts have been made. The most successful one is based on satisfiability modulo theories (SMT) solvers, as first demonstrated in the OLSQ compiler [48]. It introduces space and time variables for each gate, imposes Boolean or arithmetic constraints to model a valid layout synthesis solution for a given QPU, and then invokes a modern SMT solver iteratively to optimize the depth, SWAP gate count, or circuit fidelity of the solution. This method was further improved in OLSQ2 [30], which combines bit-vector-based encoding for the variables, CNF-based encoding for the cardinality constraints, and transition-based coarsening for the time variables, achieving a speedup of close to 7,000× over OLSQ. Yet even with this impressive speedup, OLSQ2 can only solve circuits of up to medium scale, e.g., circuits with 20-40 qubits.
Non-SMT-based approaches have been explored as well for optimal quantum layout synthesis. In particular, an A*-search-based algorithm has been developed [54]. It reported good runtimes for depth minimization, but cannot be used for SWAP gate count minimization (which often has a bigger impact on the circuit fidelity). Therefore, there is a great need for scalable, flexible, and optimal algorithms for quantum layout synthesis [8]. This is where I believe one's ability to integrate scheduling and physical design will play a key role in the solution quality.

CONCLUDING REMARKS
While most physical design researchers focus on problems in the spatial domain, such as determining the optimal positions of logic gates and wires in a design, I would like to suggest, through the multiple examples illustrated in this paper, that there is much value in considering the freedom in the time domain: different and better scheduling can enable a better, and sometimes simpler, solution in the spatial domain. Ultimately, an EDA solution needs to decide the space and time coordinates of every computation to be performed, whether in VLSI circuits or in quantum computers. I hope that this article can inspire more innovative solutions that couple scheduling with physical design effectively.

Figure 1 :
Figure 1: A typical electronic design automation (EDA) flow for integrated circuits.

Figure 2 :
Figure 2: Example for channel routing, reproduced from [15]. Dashed lines are the inserted empty tracks. (a) A two-layer channel routing example. (b) A track pairing example. (c) A perfect track pairing obtained by permuting tracks 2 and 3.

Figure 6 :
Figure 6: Generating the floorplan for a target 2 × 4 grid. Based on the floorplan, all the cross-slot connections will be accordingly pipelined (marked in red) for high frequency. Reproduced from [23].

Figure 7 :
Figure 7: Implementation results of a CNN accelerator on the Xilinx U250 FPGA. Spreading the tasks across the device helps reduce local congestion, while the die-crossing wires are additionally pipelined. Reproduced from [23].

Figure 8 :
Figure 8: An overview of our TAPA framework. The input is a task-parallel dataflow program written in C/C++ with the TAPA APIs. We first invoke the TAPA compiler to extract the parallel tasks and synthesize each task using Vitis HLS to get its RTL representation and obtain an estimated area. Then the AutoBridge [23] module of TAPA floorplans the program and determines a target region for each task. Based on the floorplan, we intelligently compute the pipeline stages of the communication logic between tasks and ensure that throughput will not degrade. TAPA generates the actual RTL of the pipeline logic that composes the tasks. A constraint file is also produced to pass the floorplan information to the downstream physical design tools. Reproduced from [24].