GPU/ML-Enhanced Large Scale Global Routing Contest

Modern VLSI design flows demand scalable global routing techniques applicable across diverse design stages. In response, the ISPD 2024 contest pioneers the first GPU/ML-enhanced global routing competition, leveraging advancements in GPU-accelerated computing platforms and machine learning techniques to address scalability challenges. Large-scale benchmarks containing up to 50 million cells offer test cases for assessing global routers' runtime and memory scalability. The contest provides simplified input/output formats and performance metrics, framing global routing challenges as mathematical optimization problems and encouraging diverse participation. Two sets of evaluation metrics are introduced: the primary set concentrates on global routing applications that guide post-placement optimization and detailed routing, focusing on congestion resolution and runtime scalability. A special honor is awarded based on the second set of metrics, which places additional emphasis on runtime efficiency and aims at guiding early-stage planning.


INTRODUCTION
Global routing (GR) techniques, which establish coarse-grain routing paths for signal nets throughout a VLSI circuit, have many applications across various stages of the modern VLSI design flow, as outlined in Table 1. Global routing techniques employed during early design stages, such as logic synthesis [7] and physical planning [3], prioritize scalability and runtime for routing-aware early-stage planning. In contrast, global routing methods utilized in later stages, such as post-placement optimization [16] and routing stages [8, 11, 14, 15], concentrate more on congestion resolution, guiding routability-aware optimizations and detailed routing.
Despite extensive research on global routing in the literature, there is a pressing need to enhance the runtime scalability of global routing techniques across different stages. In guiding early-stage planning, GPU-accelerated placement algorithms have enabled the rapid placement of circuits with millions of cells and hundreds of macros in minutes [2, 13]. However, the considerable latency of current global routing algorithms, approximately ten times that of placement algorithms, hinders the utilization of global routing in early-stage planning. In the context of guiding detailed routing, the global optimization inherent in the global routing problem, in contrast to the local optimization nature of detailed routing, prevents global routing algorithms from fully leveraging advancements in parallel computing platforms. Over the past decade, there has been a notable reduction in the runtime ratio between detailed and global routing, shrinking from approximately 20:1 to 5:1. In summary, the runtime scalability of global routing algorithms at various stages is a critical concern in modern VLSI design flows.
Over the past decade, GPU-accelerated computing platforms have been evolving into highly versatile and programmable systems capable of delivering immense parallel computing power. Recent studies have successfully leveraged GPUs to achieve over a 10× acceleration in global routing without compromising performance [12]. Furthermore, machine learning (ML) techniques have been integrated into the global routing process, leading to enhanced routing solution quality [17].
The ISPD 2024 contest aims to promote academic research addressing the scalability challenges of global routing algorithms by leveraging GPU and ML technologies, ultimately advancing the frontier of global routing techniques in modern VLSI design flows. Specifically, the competition primarily focuses on global routing applications guiding post-placement optimization and detailed routing, extending the capacity from a few million cells to around 50 million. Due to the limitations of current routers, hierarchical and partitioning-based methods are commonly employed to manage large circuits, albeit at the risk of sacrificing a certain degree of optimality. A scalable global router capable of handling circuits with tens of millions of cells has the potential to yield improved routing solutions with a global perspective. Our main contributions are summarized as follows:
• This contest presents the first GPU-accelerated global routing competition. The utilization of GPUs also facilitates and promotes the application of ML techniques to global routing.
The remaining sections are organized as follows. Section 2 describes the problem for the ISPD 2024 contest and input/output formats. Section 3 elaborates on the evaluation environment and metrics. Section 4 introduces the benchmark information. Finally, Section 5 presents the acknowledgements.

PROBLEM DESCRIPTION
In global routing, a 3D routing space is defined using global routing cells (GCells), created by a regular grid of horizontal and vertical lines, as illustrated in Figure 1 (a). This configuration results in the formation of a grid graph G(V, E), where each GCell is treated as a vertex (v ∈ V) and edges (e ∈ E) connect adjacent GCells within the same layer (GCell edges) or between GCells in neighboring layers (via edges), as depicted in Figure 1 (b). It is important to note that each layer has a preferred routing direction, meaning GCell edges can be horizontal or vertical. Global routing aims to establish a concrete path for each net within the grid graph. This process ensures the interconnection of all pins without overflow while minimizing the total wire length and the number of vias.
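As a concrete illustration, the grid graph above can be sketched in a few lines of Python. The grid dimensions and the choice of which layers are horizontal are illustrative assumptions here, not contest data.

```python
# Minimal sketch of the GCell grid graph G(V, E). A vertex is a GCell,
# identified as (layer, x, y); edges are GCell edges (within a layer,
# following its preferred direction) and via edges (between layers).

def build_gcell_graph(num_layers, xs, ys, horizontal_layers):
    """Return adjacency as {vertex: set of neighbors}."""
    adj = {}

    def add_edge(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for z in range(num_layers):
        for x in range(xs):
            for y in range(ys):
                # GCell edges follow the layer's preferred direction.
                if z in horizontal_layers and x + 1 < xs:
                    add_edge((z, x, y), (z, x + 1, y))
                elif z not in horizontal_layers and y + 1 < ys:
                    add_edge((z, x, y), (z, x, y + 1))
                # Via edges connect the same (x, y) on adjacent layers.
                if z + 1 < num_layers:
                    add_edge((z, x, y), (z + 1, x, y))
    return adj
```

A router would then search for net paths on this graph while tracking per-edge capacity and demand.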
For each circuit, we provide two essential input files: a routing resource file (with a .cap extension) and a net information file (with a .net extension). They contain all the necessary input information for global routing. For reference only, we also release the LEF/DEF files of the circuits. The routing resource file offers a detailed representation of the GCell grid graph and its available routing resources. Meanwhile, the net information file provides the access points for all the pins within each net. For detailed formatting specifications of the routing resource file, please refer to Appendix A. Likewise, for the format specifications of the net information file, please consult Appendix B.
The global routing solution is described in the GCell coordinate system, and it is defined on metal (routing) layers. To enhance routability and ensure pin accessibility during the subsequent detailed routing process, we operate under the following assumptions:
• Metal 1 (the 0-th layer) is not employed for net routing. To reach pins on Metal 1, vias must be utilized to establish connections from Metal 2.
• We distinguish between the effects of stacked vias [9] and unstacked vias on the utilization of routing resources. Unstacked vias are defined as vias that establish connections with one or more horizontal and/or vertical wires. In contrast, stacked vias connect to two other vias, as depicted in Figure 1 (c). Given that stacked vias impede the utilization of routing resources on the associated GCell edges, we increase the routing demand of the pertinent GCell edges by half of a routing track. We exclude the consideration of routing demand overhead attributed to unstacked vias, since the routing demand on the associated GCell edges is already factored in when managing the horizontal and/or vertical wires connected by the unstacked vias.
Here is an illustrative example of a global routing solution for a net, as depicted in Figure 2, where each row (x_l y_l z_l x_h y_h z_h) describes a line/rectangle in the 3D GCell graph, spanning from (x_l, y_l, z_l) to (x_h, y_h, z_h).
In the above example, three wires are defined for Net0, each covering one or multiple contiguous GCells. For example, "0 2 2 3 2 2" represents a wire covering GCells (0,2,2), (1,2,2), (2,2,2), and (3,2,2). The total wirelength of this routing solution, denoted as TotalWL, is calculated by summing up the wirelength on all metal layers: TotalWL = Σ_k WL(M_k). To be considered valid, a global routing solution for a net must ensure that its wires cover all pins of the net and that the wires collectively form a connected graph. In this graph representation, each wire corresponds to a vertex. An edge exists between two vertices (wires) if they satisfy one of the following conditions: (i) they touch each other on the same metal layer, or (ii) vias connect them. The resulting graph must be a connected structure. For an overall global routing solution to be deemed valid, it must satisfy the validity criteria for all nets in the circuit. In addition, Metal 1 should not be used for net routing.
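The validity rules above (pin coverage plus a connected wire graph) can be sketched as follows. This is a simplified model, assuming each solution row is a tuple (x_l, y_l, z_l, x_h, y_h, z_h) as in the example, and treating two wires as connected whenever their covered GCell sets share a cell, which subsumes both same-layer touching and via connections.

```python
# Hedged sketch: validity and wirelength check for one net's routing rows.

def covered_gcells(row):
    """Expand a row (x_l, y_l, z_l, x_h, y_h, z_h) to its covered GCells."""
    xl, yl, zl, xh, yh, zh = row
    return {(x, y, z)
            for x in range(xl, xh + 1)
            for y in range(yl, yh + 1)
            for z in range(zl, zh + 1)}

def net_is_connected(rows, pins):
    """True if the wires cover every pin and form one connected component."""
    cells = [covered_gcells(r) for r in rows]
    if not all(any(p in c for c in cells) for p in pins):
        return False          # some pin is not covered by any wire
    # BFS over the wire graph: wires are vertices, shared GCells are edges.
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(cells)):
            if j not in seen and cells[i] & cells[j]:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(cells)

def total_wirelength(rows):
    # Vias (z_l != z_h) add no wirelength; wires add their span length.
    return sum((xh - xl) + (yh - yl) for xl, yl, zl, xh, yh, zh in rows)
```

For instance, the wire "0 2 2 3 2 2" contributes 3 units of wirelength, and a via row with z_l ≠ z_h contributes none.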

EVALUATION AND RANKING
Evaluation Environment
Submitted global routers will run on a computing platform equipped with four A100 GPUs, each with a memory capacity of 80GB. Additionally, up to 8 CPU threads will be supported. We supply a Docker image preconfigured with CUDA Toolkit 11.7 and some popular machine-learning libraries to facilitate a standardized development environment. We encourage participating teams to capitalize on the potential of GPU acceleration and explore the integration of machine-learning techniques. However, the usage of the GPU is optional, and teams are free to choose their preferred approach, whether it involves leveraging the GPUs or not.

Evaluation Metrics
The evaluation of a global routing solution encompasses two key aspects: the quality of the global routing solution and the runtime required to execute the global routing process. A solution with a smaller scaled score is considered a better solution in this contest:

ScaledScore = OriginalScore × RuntimeFactor × (1 + p),

where p represents the nondeterministic penalty. The original score is measured by the weighted sum of the following metrics: total wirelength, via utilization, and routing overflow:

OriginalScore = w1 · TotalWL + w2 · ViaCount + OverflowScore,

where TotalWL and ViaCount denote the sum of the wirelength for all signal nets and the total number of vias, respectively. w1 and w2 correspond to UnitLengthWireCost and UnitViaCost, respectively, defined in the .cap file. In our evaluation, w1 and w2 are set to 0.5/M2 pitch and 2, respectively. The overflow cost for a GCell edge with routing capacity c and routing demand d at the k-th layer is calculated as follows:

OverflowCost = OFWeight[k] · exp(s · (d − c)),

where s is a pre-defined scaling factor. We assign a large s to GCell edges with zero capacity to markedly penalize the utilization of obstructed GCell edges. OFWeight[k] is the overflow weight for GCell edges at the k-th layer, which is defined in the .cap file. The OverflowScore is the summation of the overflow cost over all GCell edges. The efficiency of global routing is also of great importance. Therefore, a runtime factor applies to the global router, computed from the ratio of the router's wall time to the median wall time. The median wall time is the median runtime of all submitted global routers from contestants for the benchmark. The runtime factor is limited to a fixed range. The nondeterministic penalty p accounts for the complexities of debugging and maintaining a router with nondeterministic behavior. We execute the router multiple times during evaluation and select the median scaled score as the final score. If we encounter nondeterministic results, a minor penalty of p = 0.01 is imposed; otherwise, the penalty remains at 0.
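The score composition above can be sketched as follows. The exact overflow and runtime-factor formulas are defined by the contest evaluator, so OverflowScore and the runtime factor are taken here as precomputed inputs; treating the nondeterministic penalty as a multiplicative (1 + p) term is an assumption for illustration.

```python
# Hedged sketch of the score composition. Default weights follow the
# contest settings quoted above: w1 = 0.5 (per M2 pitch), w2 = 2.

def original_score(total_wl, via_count, overflow_score,
                   unit_wire_cost=0.5, unit_via_cost=2.0):
    # w1 * TotalWL + w2 * ViaCount + OverflowScore
    return (unit_wire_cost * total_wl
            + unit_via_cost * via_count
            + overflow_score)

def scaled_score(orig, runtime_factor, nondeterministic=False):
    # A minor 0.01 penalty applies only when repeated runs disagree.
    penalty = 0.01 if nondeterministic else 0.0
    return orig * runtime_factor * (1.0 + penalty)
```

A smaller scaled score wins; the runtime factor rewards routers faster than the median submission.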
The scaled score is considered infinite if one of the following conditions is satisfied: (1) the proposed router encounters a segmentation fault or crashes; (2) the proposed router changes the placement, netlist, or any design information in the solution file; (3) the runtime of the proposed router exceeds the given runtime limit (details about the runtime limit will be released later); (4) the memory usage of the proposed router exceeds 200GB of CPU memory or the GPU memory limit; (5) the routing solution file generated by the proposed router is invalid; (6) there is an unconnected net.

Ranking
Our ranking process follows these steps: (1) Rank each team for each benchmark; the team with a smaller scaled score gets a smaller ranking number, which means a better ranking. (2) Weight the ranking of each benchmark by the cube root of its number of nets after normalization (ensuring the sum of weights equals 1). In other words, the weights for the designs Ariane, MemPool-Tile, NVDLA, BlackParrot, MemPool-Group, MemPool-Cluster, and TeraPool-Cluster are 0.05, 0.05, 0.05, 0.09, 0.15, 0.22, and 0.39, respectively. Then, calculate the weighted average ranking for each team. The team with the smallest average ranking number wins the contest.
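The weighting rule above can be sketched as follows; the net counts used here are made-up placeholders, not the actual benchmark statistics.

```python
# Sketch of the benchmark weighting rule: each design's weight is the
# cube root of its net count, normalized so the weights sum to 1.

def benchmark_weights(net_counts):
    """net_counts: {design_name: number_of_nets} -> normalized weights."""
    roots = {name: n ** (1.0 / 3.0) for name, n in net_counts.items()}
    total = sum(roots.values())
    return {name: r / total for name, r in roots.items()}

def weighted_average_rank(ranks, weights):
    """Weighted average of per-benchmark ranking numbers for one team."""
    return sum(weights[name] * ranks[name] for name in ranks)
```

The cube root dampens the influence of the largest designs while still giving them the biggest weights, matching the 0.05-to-0.39 spread reported above.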

Special Honor
In addition to the evaluation metrics presented in Section 3.2, we propose an additional set of evaluation metrics that prioritize runtime speedup, aiming to encourage the development of global routing techniques tailored for routability-aware early-stage planning.

BENCHMARKS
The benchmark suite, comprising seven designs, is derived from the open-source TILOS macro placement benchmarks [5]. The global routing information for each circuit is converted with a custom script to the .cap/.net format presented above. The designs are synthesized and implemented using the open-source academic NanGate 45nm PDK [10]. Cadence Genus and Innovus are used to perform logic synthesis, floorplanning, and placement on the designs. These designs are relevant to the contest because they exhibit the representative characteristics of modern, macro-heavy SoCs, including GPU and RISC-V CPU designs. They are challenging for routing due to the heavy presence of macros (with obstructions spanning four routing layers) and relatively high cell densities. Moreover, the scale of the largest test cases is much higher than in previous contests. Using the flexible MemPool architecture [4] detailed in Figure 3 allows the generation of huge and realistic test cases. Previous contests' largest designs include up to one or two million cells, which commercial PnR tools can easily handle. Here, we include designs with ten and fifty million cells, which commercial tools struggle to place and route in a reasonable amount of time.
Table 2 details the statistics of the test cases. We provide two versions of each test case, one publicly available for testing and another unpublished for final blind evaluation. Notably, the unpublished blind benchmark suite exhibits slight netlist variations (e.g., taken at different stages of the physical design flow, such as post-place vs. pre-CTS vs. post-CTS), highlighting the global routers' different use-case application requirements. The blind test cases might also differ in the implementation settings, like the floorplan core density or macro placement differences. Finally, the following simplification steps are done on the designs:
• The power routing is not considered;
• The clock tree routing is not considered;
• Ten routing layers are available for all designs;
• The GCells have the same size across the designs.

Figure 2: Example of GCell definition and global routing solution. (a) shows an example of GCell definition; (b) depicts a global routing solution (adapted from [6]).

Figure 3: Flexible hierarchical architecture of the MemPool test case [4] (figure taken from [1]). The number of cores per tile and of groups can be increased to reach 50 million cells in the cluster (=TeraPool).

• This contest releases a set of large-scale benchmarks with a scale of up to 50 million cells. These benchmarks exhibit industrial-level congestion, posing challenging routing scenarios. The goal is to drive global routing research to address the scalability challenges regarding runtime and memory usage.
• This contest offers simplified input/output formats and evaluation metrics, framing the global routing challenges as mathematical optimization problems. This simplified view will encourage participants from diverse backgrounds to engage in the competition.
• This contest introduces two sets of evaluation metrics to advance the frontier of global routing techniques applicable across various stages in VLSI design flows. The primary set of metrics concentrates on global routing applications to guide post-placement optimization and detailed routing, focusing on congestion resolution and runtime scalability. A special honor is given based on the second set of metrics, emphasizing runtime efficiency and aiming at global routing applications guiding early-stage planning.

Table 1: Application of global routing across various VLSI design stages.

where WL(M_k) represents the wirelength on Metal k. The wirelength on Metal 1 (the 0-th layer) is always zero, as Metal 1 is not utilized for net routing. The wirelength on Metal 2 (the 1st layer) is determined by the total length of vertical GCell edges utilized on Metal 2. Likewise, the wirelength on Metal 3 (the 2nd layer) equals the total length of horizontal GCell edges utilized on Metal 3. This routing solution contains two unstacked vias transitioning from Metal 1 to Metal 2 and two unstacked vias from Metal 2 to Metal 3.

Table 2: Design statistics of the ISPD 2024 benchmark suite. The technology is the NanGate 45nm process. Approximate numbers are reported; we provide variations in terms of netlist/placement/floorplan for public testing vs. blind evaluation/ranking. The number of routing layers is 10. The GCell is typically a square between 16 and 32 standard rows in size.