ExHiPR: Extended High-Level Partial Reconfiguration for Fast Incremental FPGA Compilation

Partial Reconfiguration (PR) is a key technique in the application design on modern FPGAs. However, current PR tools heavily rely on the developer to manually conduct PR module definition, floorplanning, and flow control at a low level. The existing PR tools do not consider High-Level-Synthesis languages either, which are of great interest to software developers. We propose HiPR, an open-source framework, to bridge the gap between HLS and PR. HiPR allows the developer to define partially reconfigurable C/C++ functions, instead of Verilog modules, to accelerate the FPGA incremental compilation and automate the flow from C/C++ to bitstreams. We use a lightweight Simulated Annealing floorplanner and show that it can produce high-quality PR floorplans an order of magnitude faster than analytic methods. By mapping Rosetta HLS benchmarks, we demonstrate that the incremental compilation can be accelerated by 3–10× compared with state-of-the-art Xilinx Vitis flow without performance loss, at the cost of 15–67% one-time overlay set-up time.


INTRODUCTION
Over the past decades, Field-Programmable Gate Arrays (FPGAs) have been widely used to accelerate diverse applications in machine learning [11,16], data analysis [7,11], image processing [10,25], and other domains. Their hardware-programmable fabric allows developers to customize application instances with great flexibility. However, the coding effort and long compilation time hinder the wide deployment of FPGAs. Vendors have been developing versatile tools, such as Vitis [56], SDSoC [55], and OpenCL [23], to reduce coding difficulty by supporting high-level languages (C/C++). While these solutions improve coding productivity, the source code must still go through place-and-route, which is the most time-consuming part. In fact, incremental compilation is poorly supported for this most time-consuming place-and-route step. Figure 1 profiles the compilation time breakdown for implementing the Rosetta benchmarks [61] on a data center FPGA card (Alveo U50) [57]. Synthesis usually takes more time for the initial compile (green blocks), as some peripheral modules are compiled once and reused in later incremental compiles. However, when we change only one source file, we see only a 21–36% reduction in the incremental compile times; placement, routing, and bitstream generation take almost the same time as before. In contrast, software applications can be compiled differently: only the modified source files need to be recompiled. This can save considerable time during incremental development, where it is common to change only a few functions at a time. We raise the key question here: can we compile HLS source code incrementally, like software, such that we only need to perform placement and routing on the portions of the design that change? Several novel proposals [19,20,38,46,51,52] for parallel FPGA compilation flows have been brought forward in recent years. Guo et al.
[20] propose to partition the HLS code and perform split compilation using RapidWright [29], which can accelerate the compilation by 5–7× while increasing the frequency by 1.3×. However, a global stitching step is still needed, which restricts the maximum compilation speedup. Our previous work PLD [51,52] uses the Partial Reconfiguration (PR) technique to compile separate C functions in parallel. The incremental compilation time can be decreased, as only the modified functions need to be recompiled. However, the incremental compiles are based on a pre-compiled, fixed overlay, and applications cannot be mapped until the C/C++ functions have been manually decomposed to match the fixed PR block sizes. In this work, we propose to customize the PR block sizes by defining the partially reconfigurable functions in a high-level language (C/C++) and automating the complete design flow to generate and exploit PR regions with no manual intervention.
In this paper, we propose a framework called HiPR (High-level Partial Reconfiguration) that is fully compatible with the newest Xilinx Vitis tool flow. Taking as input applications based on the Kahn Processing Networks (KPN) model [24], where operators are connected through stream links, HiPR allows users to define partially reconfigurable functions (operators in KPN) at the C level. We use pragmas to identify a function as being under development and to signal that it should be given its own PR region for fast recompilation. When compiling the application for the first time, HiPR compiles each operator function in parallel from C to a post-RTL-synthesis netlist. Using resource requirements from RTL synthesis, HiPR automatically generates a design-specific overlay with a static region and custom target PR regions defined in Figure 2(a) Line 3. Next, when the user modifies only the target function(s), HiPR recompiles only the modified function(s). If the user needs to change the interconnection between different operators or add more PR-target functions, HiPR will automatically redefine the floorplan for the static and PR regions. Based on the context above, we summarize our contributions as follows:
• We demonstrate that automatically floorplanned, partial-reconfiguration-decomposed designs can support incremental compilation to reduce compile times, by evaluating HiPR on the full set of Rosetta benchmarks on the Alveo U50 card to reduce compilation time by 3–10×.
The remaining paper is structured as follows. Recent FPGA compilation techniques are discussed in Section 2. The proposed model and the HiPR toolflow are presented in Section 3, followed by the lightweight floorplanner in Section 4. We compare our lightweight floorplanner with other classic methods in Section 5. Section 6 discusses the experimental results, and Section 7 concludes the paper.

BACKGROUND

FPGA Compilation and PR Technique
Different from the incremental compilation strategy in software, FPGA compilation can take hours to days, as the EDA tools need to place and route fine-grained (bit-wise) netlists. Global optimization, placement, and routing across different modules are performed in a monolithic manner. This cross-module optimization can generate the best area-performance solutions, but the heuristic algorithms usually adopted to solve these NP-hard problems [5,27,47] result in long compile times. Moreover, even tiny modifications can trigger complete recompilation, which lengthens the edit-compile-debug loop and reduces development efficiency at the initial tuning and verification stage.
Partial Reconfiguration (PR) techniques are widely supported by modern FPGAs [22], where only a portion of the FPGA chip is reconfigured while other modules keep running. Xilinx recently released the Dynamic Function eXchange (DFX) technique [58], with which the user can redefine PR regions into sub-PR regions without recompiling the static logic. Additionally, Xilinx's abstract shell [58] can better isolate different PR regions by including only the related wires, which offers potential compile-time savings, as the CAD tools do not need to load the whole chip database. However, Xilinx leaves all these low-level, detailed PR definitions to the designers, which makes DFX nearly inaccessible to the vast majority of HLS users.

Compilation Acceleration
Various approaches in the literature propose to divide FPGAs into separate physical blocks and conduct independent logic mapping [4,8,9,26,28,30,31,39,43,59,60]. However, these approaches do not support high-level compilation from C or address compile-time reduction. Grigore et al. [18] proposed a toolflow to automate the generation of partially reconfigurable modules from the MaxJ language down to bitstreams. However, the toolflow heavily relies on GoAhead [12] and Xilinx ISE, which are not compatible with modern FPGA vendor tools, and compilation time is not considered. RapidStream [20] can shorten the compile time by leveraging RapidWright [29] to perform parallel compilation from HLS code to bitstreams. Unfortunately, global routing is still needed to stitch the separate blocks together. In our previous work [38,51,52], we propose to use the PR technique to accelerate compilation. Separate PR regions connected by a pre-compiled Network-on-Chip (NoC) can reduce the compile time. Nevertheless, users have to decompose their applications into small modules to fit the fixed PR pages, and designs may suffer high fragmentation. Also, the uniform bandwidth of the NoC can degrade performance when some links between pages need high bandwidth (more than 0.8 GB/s). Park et al. enhanced our previous work by introducing nested DFX overlays [37], which can merge several pages to map large operators, partially solving the fixed-page issue. It also provides the potential to address bandwidth limitations by merging small pages with heavy cross-page communication into one page. Nonetheless, the bandwidth limitation is not fully addressed when the two pages are too large to merge. Moreover, it still suffers from fragmentation issues, as it is not specifically customized for different applications.
The approach we propose in this paper differs from the methods above: HiPR can automatically generate the PR overlay according to application requirements, while [38,50–52] rely on fixed, pre-compiled overlays; the compilation isolation enables parallel compilation on the cloud, while the global stitching for final bitstream generation cannot be parallelized in RapidStream [20]; and our flow is fully compatible with modern FPGA vendor tools, unlike [18].

Floorplan for Partial Reconfiguration
The floorplan is the key to filling the gap between RTL synthesis (generated by HLS) and place-and-route implementation, and there is a significant body of literature on PR floorplanning [2,15,32,33,45]. Taking into account both the heterogeneous resource distributions and the PR constraints of modern FPGAs, many floorplanners use heuristic methods [3,40,48]. Bolchini et al. [3] adopt the Simulated Annealing (SA) algorithm to explore a reduced search space represented by sequence pairs [34]. A greedy floorplan method (Columnar Kernel Tessellation) is proposed in [48] to reduce resource wastage. A Genetic Algorithm (GA) is adopted in [40] to explore a wider range of feasible solutions. Analytic methods, such as Mixed-Integer Linear Programming (MILP) and Nonlinear Integer Programming (NLP), have recently been brought forward to generate globally optimal solutions [35,41,42,44]. The MILP-based floorplanner [41,42] can find the global optimum, and users can also change the objective functions with different weights on total wire length, aspect ratio, and resource wastage. FLORA [44] is another MILP-based floorplan tool, which takes into account more realistic PR constraints and adopts a fine-grained model for modern FPGAs.
While the analytic (MILP) method can outperform heuristic methods through its ability to find the globally optimal solution, it suffers from long execution times and poor scaling with problem size (detailed in Section 6.1). Hence, HiPR adopts an SA-based floorplanning algorithm to accelerate the compile time, extending SA to consider modern hierarchical DFX constraints (detailed in Section 4.1).

PROBLEM MODEL AND PROPOSED FRAMEWORK

Compute Model
The dataflow computational graph model [8,13,24,51] has proven effective in isolating kernels for separate compilation. In Kahn Processing Networks (KPN) [24], each kernel, called an operator, is described by a C function in HiPR. An operator receives input tokens and sends output tokens through latency-insensitive streams [6], which can be mapped to FIFOs or handshake relay stations [6,50]. The FIFO links isolate the timing constraints between operators, allowing the operators to be compiled independently and enabling parallel compilation.
The dataflow graph in our model is illustrated in Figure 2(b): 1) the design consists of a cluster of operators; 2) different operators are connected by stream links.
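To make the model concrete, here is a hypothetical two-operator sketch in Python: each operator only reads tokens from its input FIFO and writes tokens to its output FIFO, so operators interact solely through the streams. The operator bodies (`scale`, `offset`), queue depths, and end-of-stream convention are all inventions for illustration, not HiPR code.

```python
from queue import Queue
from threading import Thread

def scale(inp: Queue, out: Queue):
    """Operator: multiply each token by 2; None marks end-of-stream."""
    while (tok := inp.get()) is not None:
        out.put(tok * 2)
    out.put(None)

def offset(inp: Queue, out: Queue):
    """Operator: add 1 to each token."""
    while (tok := inp.get()) is not None:
        out.put(tok + 1)
    out.put(None)

# Dataflow graph: source -> scale -> offset -> sink, linked by bounded FIFOs.
a, b, c = Queue(maxsize=4), Queue(maxsize=4), Queue(maxsize=4)
threads = [Thread(target=scale, args=(a, b)), Thread(target=offset, args=(b, c))]
for t in threads:
    t.start()
for tok in [1, 2, 3, None]:
    a.put(tok)
results = []
while (tok := c.get()) is not None:
    results.append(tok)
for t in threads:
    t.join()
print(results)  # [3, 5, 7]
```

Because each operator touches only its own streams, either function body could be swapped out without affecting the other — the isolation property HiPR exploits at the hardware level.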

Strategy
With the application written in the manner of the previous section, developers can identify the PR functions by using PR pragmas. HiPR assigns specific PR regions to fit the PR functions' resource requirements, compiles the interconnect and non-changing functions into the static region, and compiles each function into its own PR region.
As the interconnect wires can be customized to fit the bandwidth requirement, performance is not degraded as in other frameworks [51,52].

Fragmentation
By creating custom pages for each operator, HiPR can avoid some of the fragmentation inherent in the one-size-fits-all, fixed-size pages of prior work [20,50–52]. Pages can be sized to include only the resources needed, avoiding the fragmentation that comes from trying to allocate adequate resources to handle a wide variety of operators. For example, operators that do not need DSPs can be given pages with no DSP columns, and pages can be customized for operators that need a large number of DSPs. Furthermore, there is no need to allocate regions exclusively to NoCs, which would make some of the LUTs, BRAMs, and DSPs inaccessible to compute pages. However, the HiPR strategy of allocating a PR region that satisfies Xilinx's partial reconfiguration constraints exclusively for each operator can still lead to fragmentation. Since vertically stacked PR regions within one clock region are not allowed, the minimum granularity of resource allocation for the floorplan is one column wide and one clock region high (hereafter referred to as a tile). Consequently, BRAMs must be allocated in blocks of 24 BRAM18s on the UltraScale+ architecture, leading to some fragmentation when the number of BRAM18s needed is not a multiple of 24. For example, an operator that needs 25 BRAMs can see a fragmentation of 48% (23/(2×24)). Similar granularity issues impact LUTs (60-CLB clock-height column quanta) and DSPs (24-DSP clock-height column quanta). Furthermore, since we allocate PR regions as a contiguous set of columns to satisfy the constraints on LUTs, DSPs, and BRAMs for an operator, when the operator's mix of resources does not match the FPGA's local mix of resources, extra columns of the non-limiting resources may be included to obtain enough of the limiting resource.
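The arithmetic behind these fragmentation figures can be sketched as below (the 24-BRAM18 tile quantum is from the text; the helper function is ours):

```python
import math

BRAM18_PER_TILE = 24  # one tile = one column x one clock region (UltraScale+)

def bram_fragmentation(needed: int) -> float:
    """Fraction of allocated BRAM18s wasted by rounding up to whole tiles."""
    allocated = math.ceil(needed / BRAM18_PER_TILE) * BRAM18_PER_TILE
    return (allocated - needed) / allocated

# 25 BRAM18s force two 24-BRAM tiles: 23 of the 48 allocated are wasted.
print(round(bram_fragmentation(25), 2))  # 0.48
print(bram_fragmentation(48))            # 0.0 (exact multiple, no waste)
```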

HiPR Framework
The HiPR flow extends the state-of-the-art Xilinx Vitis flow and is illustrated in Figure 3. HiPR parses the pragmas in the header file of Figure 2(a) to detect whether a function/operator should be partially reconfigurable. All the parsed information above is included in a spec.xml file. At the same time, HiPR calls Vitis_HLS and Vivado to compile the separate operators in parallel. As overlay generation is needed for the initial compile, the post-synthesis information is delivered to HiPlanner. Simulated Annealing floorplanning (Section 4) is conducted to generate PR.xdc, which is fed into Vivado to generate a partially reconfigurable overlay. When the initial compile completes, an overlay.xclbin is generated, which corresponds to the post-routed device layout in Figure 4(a). For the traditional PR flow without the abstract shell [58], a giant overlay (Figure 4(b)), which contains the definitions of all PR regions, is generated. It must be entirely loaded whenever any PR region needs re-implementation, which can take 10–20 minutes for Alveo data-center FPGAs. With the abstract shell technique, independent DCP files are generated for the PR functions to perform in-context place-and-route. In this example, 4 abstract shell DCP files are generated for the 4 PR functions (a, b, c, d). Figure 4(c) shows the abstract shell for PR function a. Only the partition pins and wires (yellow blocks) related to that PR region are preserved. The post-synthesis netlists for the PR functions can be placed and routed within the PR regions defined by their abstract shells in parallel. As we use the same Vitis development platform (hw_bb_locked.dcp) released by Xilinx [53], the final xclbin files can be executed by the Xilinx Runtime [54] by loading the overlay.xclbin first and then the xclbin files for the 4 PR functions.
The header file is used to signify whether a function/operator is partially reconfigurable. The user can also specify resource ratio parameters. Our floorplanner can reserve more space for a PR function according to the elastic ratio pragma in Figure 2(a) Line 3. In that case, the tools reserve a PR region that contains 4 times more LUTs, 2.4 times more BRAMs, and 8 times more DSPs than the initial resource requirements. We offer this as a user directive, since only the user knows which functions are currently under development and need room for refinement. This is important to reserve enough space in the floorplanned PR block to accommodate design growth, as the developer may change functionality, add code to fix bugs, and increase parallelism. An application-specific overlay is finally generated.
For incremental compilation, the developer can modify the PR functions and perform a quick compilation, as shown in the orange dashed block in Figure 3(b). For instance, when function a is modified, only this function is recompiled by Vitis_HLS and Vivado. The post-synthesis design netlist (a.dcp) is placed and routed within its own PR region, without touching other parts of the chip, as shown in Figure 4(d). Based on these dependencies, we use servers with the parallel task manager Oracle Grid Engine [36] installed to schedule the compilation tasks. HiPR generates the proper scripts with correct dependencies and submits the compilation jobs to the servers. HiPR also supports local-machine compilation by using a makefile [17]; the parallelism then depends on the local core count and memory size. If the existing PR regions cannot fit a growing operator, HiPR re-generates the overlay by re-launching the initial compile. Changing the stream links between the operators also leads to re-launching the initial compile, as it affects the interconnect wires in the static region.
In summary, HiPR launches the initial compile to generate an overlay with several partially reconfigurable regions according to the pragmas in the C++ header files. Thereafter, developers can tune the PR functions by launching quick incremental compiles within the individual PR regions.

HIPLANNER
Our floorplanner, HiPlanner, is the key step in bridging HLS and the physical PR implementation. Various approaches have been proposed for floorplanning. We adopt classic Simulated Annealing (SA) as our main floorplanning engine, since it is faster than analytical methods [41,44]. We also implemented an MILP-based floorplanner according to [44] for detailed comparisons in Section 6.1.

Problem Formulation
Modern data-center FPGAs can be represented with Cartesian integer coordinates as shown in Figure 5. In addition to the heterogeneous resources (e.g., CLBs, DSPs, BRAMs) with a non-uniform distribution, vendors also define a certain static region to pre-implement some firmware circuits and define a Level-1 DFX region for the users (e.g., Amazon EC2 F1 instances [1] and Alveo data center accelerator cards [53]). Therefore, we perform the floorplan for the application pages within this Level-1 DFX region. As noted in Section 3.3, the basic floorplanning unit is a tile. HiPlanner takes in the resource requirements from RTL synthesis and a device description file and produces a set of PR constraints that are fed to Vivado along with the logic netlists to generate an overlay. We assume modern FPGAs obey a columnar resource distribution: on-chip resources are distributed heterogeneously in units of columns spanning from top to bottom over the entire chip; different rows have the same resource distribution; and the programmable column resources are replaced in some rows by other functions (e.g., hard core integration, static region implementations), making them unavailable for normal programmable logic use. Based on the columnar style of modern FPGAs, we model the FPGA device as a 2-dimensional matrix, which contains columns of resources (CLBs, Block RAMs, DSPs, and IOBs). A W-element vector <CLB, CLB, BRAM, BRAM, ...,
CLB, CLB> is used to represent the resource distribution over one row. Since certain on-chip IPs (e.g., PCIe) are not considered in this work, we mark them as forbidden areas. We define the variables of our model as below:

W := width of the device in units of tiles;
H := height of the device in units of tiles;
T := set of tile types considered (CLB, BRAM, URAM, DSP);
F := set of forbidden areas;
O := set of PR functions;
L := set of all links between two PR functions;
x := leftmost tile column coordinate of a PR region;
y := lowermost tile row coordinate of a PR region;
w := width of a PR region in units of tiles (to ensure routability, w must be set to a minimum of 4);
h := height of a PR region in units of tiles;
a := an area represented by a 4-element vector <x, y, w, h>, where x and y are the lower-left coordinates of the region and w and h are its width and height (e.g., <5, 4, 4, 2> for a Level-2 DFX region in Figure 5);
f := an area that cannot be used by PR regions (f ∈ F), such as <10, 2, 3, 1> and <10, 5, 3, 1> in Figure 5;
n_t := number of resources of tile type t (t ∈ T);
link_{i,j} := number of interconnect wires between PR regions i and j (o_i, o_j ∈ O);
dma := number of wires connected to the DMA (Direct Memory Access); we assume only one module drives the DMA input and the DMA output drives only one module;
GAP := number of columns kept between two PR regions when both occupy the same row.

The goal of HiPlanner is to find a set of non-overlapping areas {a_i : <x_i, y_i, w_i, h_i> | i ∈ {0, ..., |O| − 1}} that maps all the PR functions {o_i ∈ O | i ∈ {0, ..., |O| − 1}}.
With the variables specified above, we compute the centroid coordinates of an area a_i:

cx_i = x_i + w_i / 2,  (1)
cy_i = y_i + h_i / 2.  (2)

We use the Manhattan distance to represent the wire length between two areas:

dist(a_i, a_j) = |cx_i − cx_j| + |cy_i − cy_j|.  (3)
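In code, the centroid and distance computations are a few lines (a minimal sketch; encoding an area as a tuple (x, y, w, h) is our own choice):

```python
def centroid(area):
    """Centroid of an area <x, y, w, h> in tile coordinates."""
    x, y, w, h = area
    return (x + w / 2, y + h / 2)

def manhattan(a_i, a_j):
    """Manhattan distance between the centroids of two areas."""
    (cx_i, cy_i), (cx_j, cy_j) = centroid(a_i), centroid(a_j)
    return abs(cx_i - cx_j) + abs(cy_i - cy_j)

# Distance between the example Level-2 DFX region and a forbidden area
# from Figure 5: centroids (7.0, 5.0) and (11.5, 2.5).
print(manhattan((5, 4, 4, 2), (10, 2, 3, 1)))  # 7.0
```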

Objective Function
The main factors we consider in the optimization objective function are the total wire length, the wasted area, and PR-region overlaps:

cost = α · WL + β · WS + γ · OL,  (4)

where α and β are the weights for the total wire length and the resource wastage, respectively, and γ is the weight for overlapping PR regions; the sum of α and β is 1; WL is the normalized wire length; WS is the normalized resource wastage; and OL is the normalized overlapping area in units of tiles.
The absolute total wire length, WL_abs, is computed as:

WL_abs = Σ_{(o_i, o_j) ∈ L} link_{i,j} · dist(a(o_i), a(o_j)) + Σ_{o ∈ {o_in, o_out}} dma · dist(a(o), a_DMA),  (5)

where o_i and o_j are two different PR functions and a(o_i) denotes the area assigned to PR function o_i. The first term counts the wires over all the links between PR regions, and the second term counts the wires between PR regions and the static DMA regions.
The normalized total wire length is calculated as:

WL = WL_abs / ((|L| + 2) · max{link_{i,j}, dma | o_i, o_j ∈ O} · (W + H)),  (6)

where |L| + 2 represents the total number of links plus one DMA input and one DMA output; max{link_{i,j}, dma} represents the maximum width of all the links; and W + H bounds the Manhattan distance between two PR regions or between one PR region and the DMA location. The normalized total wire length is therefore less than 1.
The normalized resource wastage is computed as:

WS = (Σ_{o ∈ O} Σ_{t ∈ T} (n_{t,a(o)} − n_{t,o})) / (W · H · |T| · |O|),  (7)

where n_{t,a(o)} represents the amount of resource type t in the area a(o) assigned to PR function o, and n_{t,o} represents the amount of resource type t required by PR function o. The numerator is the extra resource the PR regions provide beyond what the PR functions really need. We divide it by W, H, |T|, and |O|, covering the total resources of the chip, to guarantee that the normalized resource wastage is also less than 1.
The normalized overlapping area is calculated as:

OL = 1 + (Σ_{(x,y) : n_{x,y} > 1} (n_{x,y} − 1)) / (W · H) if any tile overlaps, and OL = 0 otherwise,  (8)

where n_{x,y} is the number of PR functions that use tile (x, y). By Equation 8, the overlapping term is normalized to (1, 2] whenever any overlap exists.
The sum of α and β is 1, and γ is no less than 1. A floorplan is only legal when the cost function is less than 1, as any overlapping area pushes the overlapping term above 1. A higher γ encourages our method to reach a legal floorplan more quickly. We use a greedy method to reshape each region to cover the required resources. For each PR region a : <x, y, w, h>, once x and y are determined, we greedily include more columns to the right by increasing w until the resource requirements are met, assuming h = 1 initially. When x + w reaches W or the aspect ratio w/h exceeds 80, we increase h by 1 and restart the greedy step. If y + h then reaches H, we set x and y both to 1 and restart the greedy step once more, which provides access to the whole chip's resources.
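A minimal sketch of this cost evaluation in Python, assuming the normalizations described above (the DMA terms are omitted for brevity, and the data layout — areas, availability, and requirements keyed by PR-function index — is our own choice, not HiPlanner's):

```python
def manhattan(a, b):
    # Manhattan distance between area centroids; an area is (x, y, w, h) in tiles.
    (ax, ay, aw, ah), (bx, by, bw, bh) = a, b
    return abs((ax + aw / 2) - (bx + bw / 2)) + abs((ay + ah / 2) - (by + bh / 2))

def floorplan_cost(areas, avail, req, links, W, H, types,
                   alpha=0.5, beta=0.5, gamma=10.0):
    """areas[i] = (x, y, w, h); avail[i][t] / req[i][t] = resources of type t
    provided / required for PR function i; links = [(i, j, wires), ...]."""
    # Normalized total wire length (< 1 by construction).
    wl = sum(wires * manhattan(areas[i], areas[j]) for i, j, wires in links)
    max_wires = max(wires for _, _, wires in links)
    wl_n = wl / (len(links) * max_wires * (W + H))
    # Normalized resource wastage: surplus relative to the whole-chip total.
    waste = sum(avail[i][t] - req[i][t] for i in areas for t in types)
    ws_n = waste / (W * H * len(types) * len(areas))
    # Overlap term: tiles claimed by more than one region; the offset of 1
    # makes any overlap push the total cost above 1 (an illegal floorplan).
    claimed = {}
    for x, y, w, h in areas.values():
        for cx in range(x, x + w):
            for cy in range(y, y + h):
                claimed[(cx, cy)] = claimed.get((cx, cy), 0) + 1
    over = sum(n - 1 for n in claimed.values() if n > 1)
    ol_n = 1 + over / (W * H) if over else 0.0
    return alpha * wl_n + beta * ws_n + gamma * ol_n

avail = {0: {"CLB": 240}, 1: {"CLB": 240}}
req = {0: {"CLB": 200}, 1: {"CLB": 200}}
legal = floorplan_cost({0: (0, 0, 4, 1), 1: (6, 0, 4, 1)}, avail, req,
                       [(0, 1, 8)], W=20, H=4, types=["CLB"])
illegal = floorplan_cost({0: (0, 0, 4, 1), 1: (2, 0, 4, 1)}, avail, req,
                         [(0, 1, 8)], W=20, H=4, types=["CLB"])
print(legal, illegal > 1)  # 0.375 True
```

The disjoint placement yields a cost below 1 (legal), while the overlapping one is pushed above 1 by the γ-weighted overlap term.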

Greedy PR Shape Generation
Since the FPGA fabric is non-homogeneous, when we move a region from one location to another, the existing x, y, w, and h may no longer provide the needed resources. For example, in Figure 6, assume we need 4 tiles of CLBs, 2 tiles of DSPs, and 4 tiles of BRAMs. If our lower-left tile is <3, 1>, the final shape will be <3, 1, 7, 2>. If the start tile is <4, 3>, intuitively the shape will be <4, 3, 6, 4>. Compared with the previous shape, we waste 8 tiles of CLBs and 2 tiles of DSPs, adding to the fragmentation. To avoid extra resource wastage at the left boundary of the region, after the resource requirements are met, we increase the x coordinate of the shape as long as the resource requirements remain met. By applying this greedy reshaping strategy, the final shape is the shaded area <6, 3, 4, 4> in Figure 6.
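The greedy reshaping can be sketched as follows on a toy columnar device model (the column layout and resource mix are invented, and the aspect-ratio and GAP constraints are omitted for brevity):

```python
# One row of a toy columnar device: the resource type of each tile column.
ROW = ["CLB", "CLB", "DSP", "CLB", "CLB", "BRAM", "BRAM", "CLB", "CLB",
       "DSP", "CLB", "BRAM", "BRAM", "CLB", "CLB", "DSP", "CLB", "CLB"]
W, H = len(ROW), 6  # every row shares the same column layout

def tiles(x, w, h):
    """Count resource tiles in a region spanning columns [x, x+w) over h rows."""
    counts = {}
    for col in ROW[x:x + w]:
        counts[col] = counts.get(col, 0) + h
    return counts

def meets(have, need):
    return all(have.get(t, 0) >= n for t, n in need.items())

def greedy_shape(x, y, need, min_w=4):
    """Grow a region rightward (then upward) from (x, y) until the required
    tiles are covered, then slide its left edge rightward, keeping the right
    edge fixed, to shed surplus boundary columns."""
    for h in range(1, H - y + 1):
        for w in range(min_w, W - x + 1):
            if meets(tiles(x, w, h), need):
                right = x + w
                while w > min_w and meets(tiles(right - (w - 1), w - 1, h), need):
                    w -= 1
                return (right - w, y, w, h)
    return None  # cannot fit above (x, y)

print(greedy_shape(0, 0, {"CLB": 4, "DSP": 2, "BRAM": 4}))  # (2, 0, 11, 1)
```

Starting from column 0, the region must grow to column 12 to capture 4 BRAM tiles; the trailing slide then drops the two leftmost CLB columns, which were only dead weight.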
Note that we must obey additional practical constraints as well. For example, the minimum width of a PR region is 4 tiles. Also, for each PR region, we intentionally include 3 (GAP) more columns on the right and leave these columns in the static region. This extra space is reserved to route the wires between different PR regions.
With this greedy reshaping method, we only need to use Simulated Annealing to determine the lower-left coordinates for each operator.

XY Simulated Annealing (XYSA)
As the main goal of our Simulated Annealing algorithm is to generate proper x and y coordinates for all the operators, we call our algorithm XYSA. As shown in Algorithm 1, the inputs to HiPlanner are the device parameters and the resource requirements of all the operators. For the initial point, we randomly generate the x and y coordinates of all the operators and apply the greedy reshaping method to generate the PR shapes (Section 4.3). We then evaluate the cost function and use this initial cost as the current cost. In each subsequent simulated annealing step, we randomly select one PR region, randomly generate new x and y coordinates for it, and refine the PR region using the greedy reshaping method above. In fact, the PR regions can be represented as a : <x, y, w(x, y, o), h(x, y, o)>, as w and h are determined by x, y, and the operator o. After the shape of the operator is determined, we update the cost function with the new set of PR regions. We accept the move if the cost function improves or if the calculated acceptance probability is greater than a random number. For our implementation, the x and y coordinates are the simulated annealing targets, since we believe this representation is simple and fast to run. By using the variables defined in Section 4.1, our implementation can easily be extended to support sequence pairs [34,48], another traditional representation in floorplanning and placement. We compare our results with the sequence-pair simulated annealing algorithm (SQSA) and a Mixed-Integer Linear Program (MILP) in Section 5.3 and Section 5.4, respectively.
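A condensed sketch of the XYSA loop (the cooling-schedule values, the cost callback, and the toy corner-seeking objective are placeholders; the greedy reshaping is assumed to be folded into the cost function):

```python
import math, random

def xysa(num_ops, W, H, cost_fn, t0=100.0, t_min=1e-3, cool=0.95, trials=200):
    """XYSA sketch: anneal only the lower-left <x, y> of each PR region.
    cost_fn maps the coordinate list to a scalar cost."""
    random.seed(0)  # deterministic for the sake of the example
    coords = [(random.randrange(W), random.randrange(H)) for _ in range(num_ops)]
    cur = cost_fn(coords)
    t = t0
    while t > t_min:
        for _ in range(trials):
            i = random.randrange(num_ops)
            old = coords[i]
            coords[i] = (random.randrange(W), random.randrange(H))
            new = cost_fn(coords)
            # Accept improvements, or uphill moves with probability e^(-delta/T).
            if new <= cur or random.random() < math.exp((cur - new) / t):
                cur = new
            else:
                coords[i] = old  # reject: restore the previous placement
        t *= cool
    return coords, cur

# Toy objective: pull two regions toward opposite corners of a 20x10 device.
targets = [(0, 0), (19, 9)]
coords, final_cost = xysa(2, 20, 10,
                          lambda cs: sum(abs(x - tx) + abs(y - ty)
                                         for (x, y), (tx, ty) in zip(cs, targets)))
print(coords, final_cost)
```

At high temperatures, uphill moves are frequently accepted to escape local optima; as the temperature falls, the loop degenerates into greedy improvement and the two regions settle at (or very near) their targets.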

DESIGN METRICS
In this section, we use realistic benchmarks to profile our floorplanning algorithm (XYSA) and compare it with other classic floorplanning methods (SQSA and MILP). Since not all the implementations are open-source, we implement our own versions of these algorithms in C++ with variable definitions similar to Section 4.1.

Benchmark Preparation
To characterize the XYSA algorithm, we use digit recognition (varying BRAM utilization) from the Rosetta benchmark suite [61], as it is easy to tune this application's resource usage up and down.
Digit recognition is based on the K-Nearest-Neighbors (KNN) algorithm. A subset of the MNIST database [14] was downsampled to N training samples (N = 18,000 in the original Rosetta benchmark) and 2,000 test samples, stored as one 196-bit unsigned integer per image. The 196-bit images are stored in the on-chip BRAMs. For each test image, a Hamming distance is calculated against each training sample, and the K training samples with the smallest Hamming distances vote to decide the final result. For our case, we use the full database [14] of 60,000 196-bit training images. We split the training set across a systolic/cascaded chain of operators. The first operator calculates the best K candidates for the input test sample and passes them, along with the test sample, to its downstream operator. Each operator in the middle votes to choose the K best candidates from its local candidates and the upstream K best candidates, and passes the winners along with the test sample to the next stage. The final operator selects the best candidate. All the operators run in a task-level pipelining, or dataflow, manner, as shown in Figure 7. By changing the parameters below, we generate different benchmark versions with the various BRAM utilization ratios of the Alveo U50 in Table 1. In fact, the full database [14] provides only N = 60,000 training samples; we can still set N = 83,200 in Table 1, since we can instantiate more BRAMs for when more training samples become available later. Since we can only assign BRAMs to a PR function in units of tiles (24 BRAM18s per column), we can barely increase the utilization above 80%, especially when the page count is high. This is because the data width of the training set is 196 bits, so the granularity of BRAM increments for each page is 11 BRAM18s (⌈196/18⌉). If the total BRAM count for a page is not divisible by 24, some fragmentation shows up.
P := the number of decomposed operators/pages; PAR := the partition number of the training set within an operator/page; N := the total number of training samples.
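The cascaded KNN computation can be sketched as below; the tiny 4-bit "images" and shard contents are invented stand-ins for the 196-bit MNIST data:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two images packed as integers."""
    return bin(a ^ b).count("1")

def knn_stage(local_shard, test, upstream_best, k=3):
    """One operator in the systolic chain: merge the upstream k best
    candidates with the k best from this operator's shard of the training
    set, passing (distance, label) pairs downstream."""
    local = sorted((hamming(img, test), lbl) for img, lbl in local_shard)[:k]
    return sorted(upstream_best + local)[:k]

# Tiny 4-bit stand-ins for the 196-bit images, as (image, label) shards.
shard1 = [(0b1111, 1), (0b0001, 0), (0b1000, 0)]
shard2 = [(0b0011, 1), (0b1110, 1), (0b0000, 0)]
test = 0b0001
best = []
for shard in (shard1, shard2):  # the cascade: one operator after another
    best = knn_stage(shard, test, best, k=3)
print(best)  # [(0, 0), (1, 0), (1, 1)] -> majority label is 0
```

Each stage needs only its own shard plus the k-entry token from upstream, which is why the chain maps naturally onto independent PR pages connected by narrow streams.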

XYSA Characterization
Below we list the parameters of the XY Simulated Annealing (XYSA) algorithm. In this section, we use digit recognition case 5 in Table 1 to show how these parameters affect the quality of results (QoR) of XYSA. Initial temperature – Figure 8 shows the cost function for different initial temperatures of XYSA. Figure 8(a) shows that the cost functions for all temperatures converge after running for 15 seconds. However, different initial temperatures yield different final cost values. From Figure 8(b), we can see that the cost functions converge slowly over the temperature range [1,000, 1e-3] but converge quickly below 1e-3. This means a lower temperature can accelerate the convergence of the XYSA algorithm. Nevertheless, a higher initial temperature is useful to avoid being trapped in a local optimum. From Figure 9, we can see that when the initial temperature is below 1, XYSA may fail to find legal floorplan results. Therefore, to guarantee that we can finally generate legal floorplan results, we set the initial temperature to 100 for the following experiments.
TRIAL_NUM – We sweep TRIAL_NUM, the number of trials before the temperature is decreased by 10× (per log-scale unit). From Figure 10(b), we see the cost function decreases faster with temperature when we run more trials in each temperature range. When TRIAL_NUM is 1e+7, the cost function converges most quickly. However, when we look at cost function versus runtime in Figure 10(a), we see the cost function converges more slowly when TRIAL_NUM is higher, because more time is spent at each temperature. It is worth noting that too low a TRIAL_NUM (< 1e+4) may make the floorplanner fail to find a legal solution, as shown in Figure 10(b). By plotting the final cost against TRIAL_NUM in Figure 11, we see that a higher TRIAL_NUM can slightly improve the final cost function, but we see no significant improvement beyond that. (Figure 11 settings: BRAM utilization = 61%, N = 44,800, P = 20, PAR = 2, TRIAL_NUM = 100,000, T_MIN = 1e-10, cooling rate = 0.9997.)

Sequence-Pair Simulated Annealing
Sequence pair is a classic representation for floorplanning [34]: a positive sequence (Γ+) and a negative sequence (Γ−) are adopted to represent the relative locations of each pair of blocks. If block a is before block b in both Γ+ and Γ−, block a must be left of block b, as shown in Equation 9. If block a is before block b in Γ+ and after block b in Γ−, a must be above b, as shown in Equation 10. Based on the relative locations of all the blocks, directed, vertex-weighted graphs called the horizontal-constraint graph (G_h) and the vertical-constraint graph (G_v) can be constructed. The vertex weights of G_h and G_v can be viewed as the width and height of a block, respectively. By using the well-known longest-path algorithm for vertex-weighted directed acyclic graphs, the absolute location coordinates can be calculated for all the blocks. Readers can refer to [34] for a detailed explanation.
(Γ+ : <..., a, ..., b, ...>, Γ− : <..., a, ..., b, ...>) ⇒ a is left of b (9)
(Γ+ : <..., a, ..., b, ...>, Γ− : <..., b, ..., a, ...>) ⇒ a is above b (10)
We implement the Sequence-Pair Simulated Annealing (SQSA) from [3] but extend it to support PR constraints and forbidden areas for modern FPGAs. We first use the longest path algorithm to determine the weights of G_h. For the left-most block, we use a greedy reshaping strategy similar to Section 4.3 to determine the width and height of the block. The width (w) and height (h) of the following blocks are determined by the widths and heights of the blocks on their left. As the heights of all the blocks are determined after updating G_h, the weights of G_v are determined, by which the coordinates of all the blocks can be easily calculated. Intuitively, these relative-location representations can generate a more compact outlined floorplan, as the right and upper blocks are always adjacent (which is not necessarily true with XYSA) to and determined by the left and lower blocks. However, this strategy might be inadequate to represent the rich space in optimizing inter-page link length. In Figure 13, we can see different implementations represented by the same sequence pair. We assume d and e have heavy linking wires. If we use the longest path algorithm, we will get Figure 13(a), where block d is adjacent to block b. However, if we move d to the right by a certain number of tiles, d and e are a shorter distance apart. This move cannot be represented by a sequence pair but can be represented by XYSA.
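To make the decoding concrete, the sketch below evaluates a sequence pair into block coordinates with the longest-path rule of Equations (9) and (10). It is our illustrative reconstruction, not HiPR's SQSA code: variable names are ours, and the greedy reshaping and PR/forbidden-area extensions from the text are omitted.

```cpp
#include <algorithm>
#include <vector>

// Decode a sequence pair (Gamma+, Gamma-) into block coordinates:
// block a is left of block b when a precedes b in both sequences (Eq. 9);
// a is above b when a precedes b in Gamma+ but follows it in Gamma- (Eq. 10).
// Coordinates are the longest paths in the implied constraint graphs.
struct Placement {
  std::vector<int> x, y;  // bottom-left corner of each block
};

Placement evalSequencePair(const std::vector<int>& gp,  // Gamma+ (block ids)
                           const std::vector<int>& gn,  // Gamma- (block ids)
                           const std::vector<int>& w,   // block widths
                           const std::vector<int>& h) { // block heights
  const int n = (int)w.size();
  std::vector<int> pp(n), pn(n);  // position of each block in Gamma+/Gamma-
  for (int i = 0; i < n; ++i) { pp[gp[i]] = i; pn[gn[i]] = i; }
  Placement pl{std::vector<int>(n, 0), std::vector<int>(n, 0)};
  // x: process in Gamma+ order so every left-neighbour is already placed.
  for (int b : gp)
    for (int a = 0; a < n; ++a)
      if (a != b && pp[a] < pp[b] && pn[a] < pn[b])  // a left of b
        pl.x[b] = std::max(pl.x[b], pl.x[a] + w[a]);
  // y: process in Gamma- order; a is below b exactly when b is above a.
  for (int b : gn)
    for (int a = 0; a < n; ++a)
      if (a != b && pp[a] > pp[b] && pn[a] < pn[b])  // a below b
        pl.y[b] = std::max(pl.y[b], pl.y[a] + h[a]);
  return pl;
}
```

This O(n²) pairwise scan is equivalent to running the longest path algorithm on G_h and G_v; production floorplanners use faster longest-common-subsequence evaluation, but the result is the same.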
Figure 12 shows the difference between the XYSA and SQSA algorithms. We see SQSA converges more quickly than XYSA initially, as SQSA tends to generate compact floorplans. Since SQSA has a smaller design space (e.g., SQSA cannot represent the floorplan in Figure 13(b)), XYSA slightly outperforms SQSA when the temperature is lower than 10e-5, as shown in Figure 12. In terms of runtime, since it takes more time for SQSA to update the floorplan and the cost function, we can see XYSA converges faster than SQSA in Figure 12(a).
XYSA can generate cost functions with a negligible difference in a much shorter time than SQSA. In Section 5.5, we will show that XYSA and SQSA generate a similar quality of results.

Mixed-Integer Linear Programming
Mixed-Integer Linear Programming (MILP) is another classic optimization tool used for floorplanning. We implement the MILP model from FLORA [44] in a C++ prototype and use the academic version of Gurobi 9.5.1 [21] as our solver. We extend the original MILP to support an extra horizontal gap between different PR regions when both occupy resources in the same row. We define the extra variables and constraints below. Readers can refer to [44] for the other constraints. Since it usually takes significant time for MILP to reach an optimal or even a feasible result, we set the maximum runtime for the Gurobi solver to 24 hours and keep the other parameters at their defaults. We use a dashed horizontal line to extend the line after XYSA converges in Figure 14 to compare the cost function between XYSA and MILP. We can see XYSA converges an order of magnitude faster than MILP. However, MILP will eventually outperform XYSA by 1% at the cost of hours of runtime. We will show this improvement has a negligible impact on the quality of results in Section 5.5.
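One possible big-M formulation of such a pairwise gap constraint is sketched below. The notation is ours, not FLORA's: $x_i$ and $w_i$ are the horizontal position and width of PR region $i$, $g$ the required gap, $s_{ij}$ a binary indicating that regions $i$ and $j$ occupy resources in the same row, $\ell_{ij}$ a binary choosing which region sits on the left, and $M$ a sufficiently large constant.

```latex
\begin{align*}
x_i + w_i + g &\le x_j + M\,(1 - \ell_{ij}) + M\,(1 - s_{ij}) \\
x_j + w_j + g &\le x_i + M\,\ell_{ij} + M\,(1 - s_{ij}) \\
\ell_{ij} &\in \{0, 1\}
\end{align*}
```

When $s_{ij}=1$, exactly one of the two inequalities becomes binding (selected by $\ell_{ij}$), forcing a horizontal gap of at least $g$ between the regions; when $s_{ij}=0$, the big-M terms relax both constraints.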

Quality of Results
BRAM Utilization vs. Compile Time – Figures 15(a), (c), and (e) show how BRAM utilization affects the HiPR compile time with different floorplan algorithms. For the compile time, we use the average compile time over all the partially reconfigurable functions. We see the compile time is mainly driven by the clock frequencies, since it takes more time to meet stricter timing constraints. The average compile time with the different floorplanners is similar. This means that XYSA and MILP can both generate legal overlays with similar page compile time. As a result, we can exploit the fact that XYSA has an order of magnitude shorter runtime than MILP during overlay generation without increasing page compile time. We evaluate the compile time acceleration of our framework by implementing the realistic Rosetta HLS Benchmarks [61] on the Alveo U50 Data Center card [57] with a Virtex UltraScale+ XCU50 FPGA and 8 GB HBM. Excluding the pre-implemented firmware from Xilinx, a large PR region is available to users (705,520 LUTs, 2,232 18Kb BRAMs, and 4,920 DSPs). HiPR uses Xilinx Vitis 2021.1, including the associated Vivado, Vitis_HLS, and XRT, as the backend. We perform the compilation on a cluster of 8 servers. Each server is equipped with two 2.7 GHz Intel E5-2680 CPUs and 128 GB of RAM (a total of 8×2×8=128 cores).

Floorplanner
Table 3 shows the comparison between HiPlanner (XYSA) and the state-of-the-art floorplanner (MILP). The proposed SA-based floorplanner is implemented in a C++ prototype (Section 4) and is compared to our implementation (Section 5.4) of the MILP floorplanner [44], which already showed better results than [41] and [40]. However, since [44] only targets the Virtex-7 series and does not consider the hierarchical DFX features, we enhance it to support the features mentioned in Section 4.1.
From Table 3, we can see it takes less than 15 seconds (column 6) for XYSA to converge to good results (column 9, XYSA Cost Function). For the MILP method, the runtime in column 7 (MILP Milestone) is the time for MILP to reach the same results as XYSA. We can see it takes more than 2 hours for MILP to generate results similar to XYSA when the number of pages (partially reconfigurable functions) is 15–22. Column 8 lists the runtime when MILP reaches the optimum or the maximum runtime we set (30 hours). Only 3d-rendering and Optical flow reach optimality, in 110 and 109,601 seconds respectively. Column 10 lists the best results the MILP method can get within 30 hours. Figure 16 shows how the cost function changes with runtime for the XYSA and MILP methods. Column 11 lists the predicted best bound MILP can possibly reach (not achieved except by Optical flow). It takes much more time for MILP to generate results similar to XYSA as the page numbers increase. Column 12 lists the improvements by MILP over XYSA (11–17%).
Initial-Compile: During the initial compile, HiPR generates the overlay with the PR modules defined. In Figure 17, we can see HiPR takes more time to generate the overlay for the recognition benchmarks, as the operators have to be placed and routed along with overlay generation. In Figure 17, we see the abstract shell generation for all the PR functions should be performed after placement and routing, since each PR region's abstract shell is related to the logic and wires in the static region. However, the overlay bitstream can be generated simultaneously with abstract shell generation, since both are based only on the post-place-and-route netlist. In Table 4, column 9, we choose the maximum value between overlay bitstream generation and abstract shell generation. It takes at most 55% overhead in compile time to set up the overlay. However, this process is usually performed once, and users benefit from incremental compilations afterward.

Incremental-Compile:
The main goal of HiPR is to accelerate incremental compilation, since only the modified functions need to be recompiled. Figure 18 shows the compilation time distribution for different operators over the full benchmark sets. The operators can be incrementally recompiled in 7–20 minutes. For all the benchmarks, the median values are near 11 minutes. This means that in most cases, users can benefit from short incremental compilations to tune their target functions more efficiently. We can see that incremental compilation can be improved by a factor of 3–10 (Figure 18(b)). Figure 19 shows the compilation time breakdown for the digit recognition benchmark. We can see the place-and-route time is accelerated most. Table 5 shows the detailed compilation times. For HiPR, we choose the maximum compile time over all functions for each benchmark as the final compilation time. Even in the worst case, HiPR still outperforms Vitis by 3.4–5.6×.
Performance Comparison: Table 6 summarizes the performance of Vitis and HiPR. As we rewrite the original code in a latency-insensitive manner (Section 3), the throughput is slightly different from the Vitis implementation performance. However, HiPR achieves the same frequency as and better performance than the original Vitis flow. We also implement the floorplan generated by MILP and get similar compile time and timing slack as XYSA.
Several related works propose methods similar to, but different from, our HiPR (Section 2.2). Compared with the PRflow [52] and PLD [51] flows, HiPR can customize the overlay according to the application. This addresses two limitations of the PLD [51] flow – the fixed-size issue and the fixed-bandwidth issue. As HiPR can assign variously sized PR blocks to PR operators/functions according to their resource requirements, there is no need for developers to decompose their design to fit the fixed page size of the PLD [51] flow. Also, HiPR allocates dedicated links between operators and uses relay stations [6,8] to meet the inter-page bandwidth, instead of being potentially limited by the uniform NoC bandwidth. Therefore, HiPR does not degrade communication bandwidth, while delivering a similarly short incremental compile time as PLD. Since it takes more time to map larger operators, developers have the freedom to decompose their designs to accelerate compilation. While this work targets the KPN model (streaming model), several recent works can convert different compute models into streaming models (AutoSA [49]).

CONCLUSIONS
In this paper, we propose HiPR, a framework that allows users to define partially reconfigurable C functions instead of Verilog modules. This can greatly benefit incremental FPGA development, as only the modified functions are recompiled (placed-and-routed) without waiting a longer time for a full recompilation. The experiments implementing the Rosetta Benchmarks show that HiPR can decrease the incremental-compilation time by a factor of 3–10× without performance loss or the need to target fixed PR region sizes.

Fig. 4 .
Fig. 4. Initial-compile vs. Incremental-compile (a). Taking all the C/C++ files as inputs, Vitis_HLS is called to generate the app.xo file. Compiling app.xo to the FPGA-loadable file app.xclbin by executing the linkage command (v++ -link) is the most time-consuming step. As this linkage step is not open to normal commercial users, it is hard to perform incremental compilation with the off-the-shelf PR technique. HiPR takes the same input source as Xilinx Vitis: each operator is described by a C++ function; the function can be defined as partially reconfigurable (Figure 2(a), Line 3). From the text (#pragma HLS PR clb=4 bram=2.4 dsp=8), the keyword PR means that the function is a partially reconfigurable function. In this example, we define operators a, b, c, and d as Partially Reconfigurable functions (PR-functions) and operator e as a Non-Partially Reconfigurable function (NPR-function). We classify the development compilation into 2 types: initial-compile and incremental-compile. For the initial-compile, shown in the blue dashed block in Figure 3(b), the HiParser parses the top.cpp file and extracts the interconnections between the different operators. The HiParser also needs to parse the header files of all the operators, shown in Figure
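The annotation style described in the caption can be illustrated as below. This is a sketch, not HiPR's exact grammar: the operator bodies and scalar signatures are placeholders (real operators would use HLS stream interfaces), the resource numbers are copied from the caption, and a standard C++ compiler simply ignores the unknown pragma.

```cpp
// Operator "a" is marked partially reconfigurable with the resource budget
// quoted in the caption; operator "e" carries no pragma and stays in the
// static region. Bodies are placeholders standing in for real computation.
int operator_a(int v) {
#pragma HLS PR clb=4 bram=2.4 dsp=8
  return v * 2;  // PR-function: recompiled incrementally when modified
}

int operator_e(int v) {  // NPR-function: compiled once into the overlay
  return v + 1;          // placeholder body
}
```

Because the pragma is purely declarative, the same source compiles under both the vanilla Vitis flow and HiPR; only HiPR's parser acts on it.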

T_0: the initial temperature for Simulated Annealing;
TRIAL_NUM: the number of trials before the temperature is decreased by 10×;

Fig. 9. Fig. 10.
Fig. 9. Final Cost Function with Different Initial Temperatures – when the initial temperature is less than 1, XYSA may fail to find a feasible solution

Fig.
Fig. Final Cost Function with Different TRIAL_NUM – regions with different colors represent different runtimes; e.g., it takes 1–15 seconds when TRIAL_NUM is less than 1e+5

Table 1.
Digit Recognition Resource Utilization

Table 2.
Compile Time and Timing Slack with BRAM Utilization. Figures 15(b), (d), and (f) show how BRAM utilization affects the HiPR timing slack. We see the timing slack is related more to the frequency constraints than to the BRAM utilization when the utilization is under 80%. When BRAM utilization is 80%, HiPR fails to generate overlays. We see XYSA, SQSA, and MILP have similar maximum clock frequencies in Table 2. Based on the facts above, the maximum frequency HiPR with XYSA can achieve is around 200 MHz to 250 MHz.

Table 4.
Rosetta Benchmarks Initial-Compile Times (seconds). Columns: Vitis Flow with 30 Threads (hls, syn, p&r, bitgen, total); HiPR with 8 Threads for each Operator (syn, p&r, abs_gen, max op†, total, Overhead§). † Maximum compile time over all the operators. § The overhead is calculated by dividing the total time difference between HiPR and Vitis by the Vitis time.