Abstract
Post-Moore’s law area-constrained systems rely on accelerators to deliver performance enhancements. Coarse-grained accelerators can offer substantial domain acceleration, but manual, ad hoc identification of code to accelerate is prohibitively expensive. Because cycle-accurate simulators and high-level synthesis (HLS) flows are so time-consuming, the manual creation of high-utilization accelerators that exploit control and data flow patterns at optimal granularities is rarely successful. To address these challenges, we present AccelMerger, the first automated methodology to create coarse-grained, control- and data-flow-rich merged accelerators. AccelMerger uses sequence alignment matching to recognize similar function call-graphs and loops, and neural networks to quickly evaluate their post-HLS characteristics. It accurately identifies which functions to accelerate, and it merges accelerators to respect an area budget and to accommodate system communication characteristics like latency and bandwidth. Merging two accelerators can save as much as 99% of the area of one. The space saved is used by a globally optimal integer linear program to allocate more accelerators for increased performance. We demonstrate AccelMerger’s effectiveness using HLS flows without any manual effort to fine-tune the resulting designs. On FPGA-based systems, AccelMerger yields application performance improvements of up to 16.7× over software implementations, and 1.91× on average with respect to state-of-the-art early-stage design space exploration tools.
1 INTRODUCTION
As CMOS scaling slows the growth of micro-circuitry per unit area, deciding which code regions to accelerate becomes harder. Meanwhile, the demand for improved performance keeps growing, driven by novel domains such as AR/VR, Robotics, and Machine Learning that require more agile and efficient design methodologies. The traditional approach to accelerating novel applications is to profile them manually and decide in an ad hoc manner which code regions should be mapped onto hardware. These manual approaches are ad hoc because they either directly map specific functions or loops onto hardware at a fixed granularity (e.g., [32]) or informally identify a simple pattern occurring across different functions and create an accelerator for it (e.g., [5, 17]). Code patterns can be as simple as repeated sequences of multiply-add operations or as complex as control flow graphs (CFGs) and data flow graphs (DFGs) representing a function (CDFG) or a function call graph (nested-CDFG). In manual accelerator design, patterns are exploited only at fine basic-block/DFG-level granularities, based on the architect’s domain intuition.
The lack of systematic granularity detection and hardware pattern reuse motivated the creation of Early-Stage Design Space Exploration (Early-DSE) methodologies. Early-DSE [13, 42, 43] has emerged as a profiling methodology able to discover new SoC architectures tailored to the application’s characteristics. Furthermore, Early-DSE informs the SoC designer without requiring the most executed application functions to be manually transformed into a synthesizable format, i.e., one acceptable by HLS tools [3, 4, 24, 38], an extremely error-prone and time-consuming process. However, Early-DSE is not equipped to exploit variable CDFG granularities, much less to reason about architectures that reuse CDFG logic.
Merging small/fine-grained DFGs [6, 21, 33, 36] has been implemented using a graph-isomorphism approach, at HLS time, therefore missing an opportunity to think about the whole system in terms of reusing CDFG patterns. These approaches have received limited adoption in RTL synthesis tools [4, 24] most likely because graph-isomorphism is NP-Hard and not very scalable. Additionally, at fine granularities, the benefits of calling an accelerator are greatly diminished by the cost of updating and reading from a register file (RF). Similarly, fine-grained merged accelerators tend to be tightly coupled with the general purpose core, for example, via the cache memory system. In the HLS context, RF-less, coarse-grained accelerators have had the most success since rich CDFGs are mapped to hardware via datapaths handled via control logic. There is no current technology able to merge these coarse-grained accelerators.
In this article, we present AccelMerger.
Fig. 1. AccelMerger overview.
The four main steps performed in AccelMerger are the following:
Accelerator Modeling: We achieve quick and automated accelerator modeling via Machine Learning Models, such as the Multilayer Perceptron (MLP). Our modeling relies both on the static information available in the application source code and on dynamic profiling information for each function indicated in Figure 1 with the “dyn” label. This allows us to predict post-HLS accelerator resource consumption with less than 15% error, which is very close to the error produced by late-DSE tools when applied to accelerator merging.
Merging Codegen: Being able to quickly model accelerators goes hand in hand with code-generating merged functions and realizing them in hardware only if they present predicted opportunities for area savings. We ensure that the introduced multiplexing latency overheads do not cancel the benefit of allocating newly merged and non-merged accelerators using the saved area resources. We describe our scalable sequence alignment approach to achieve merging at coarse granularities.
Late DSE and HLS: We use cycle-accurate simulation and HLS not only to create a dataset to predict accelerator area and latency but also to validate accelerators when the original functions are synthesizable.
MIP Early-DSE: When area wins are measured on the corresponding merged accelerators, we forward them to the selection stage, where mixed-integer linear programming (MIP) determines the final list of merged and non-merged accelerators in the system. For a merged accelerator, both input functions must be synthesizable in order to run the final validation step via HLS and cycle-accurate simulation. In Figure 1, functions such as \(f_2\) are selected for acceleration and RTL can be generated for them, as indicated by the “rtl” label attached to the function. The synthesizability requirement holds for the merge of \(f_3\) and \(f_4\), but not for \(f_i\) and \(f_N\), since \(f_N\) is non-synthesizable.
In experiments carried out on the SPEC CPU2006 benchmark suite [10] and the H.264 video decoder [18], targeting Programmable Systems on Chip (PSoCs) Artix Z7007S and Artix Z-7012S [39], we observe performance improvements of \(1.91\times\) on average over state-of-the-art (SOA) Early-DSE, while remaining compatible with well-known, mature HLS tool flows.
This work makes the following contributions:
Merging of candidates for acceleration. To the best of our knowledge, we merge coarse-grained accelerators for the first time. We describe the machine learning models, the code generation techniques, the metrics, the dynamic profiling, and the optimization techniques necessary to identify and merge accelerators effectively.
Automated design space exploration for large-area designs. We provide an automated tool flow from high-level C/C++ to RTL that generates merged accelerators when the application is synthesizable. When the application is only available in non-synthesizable format,
AccelMerger can still provide insights about the code most amenable to acceleration and merging. By using Neural Networks, AccelMerger can analyze \(4.5 \times 10^6\) merged accelerators in less than 10 seconds. In contrast, pure Late-DSE techniques struggle to analyze 800 merged accelerators in less than 22 hours.
Merged and non-merged accelerator selection.
AccelMerger can select the optimal mix of merged, non-merged, and software versions of the functions in the original program. We contribute a Mixed Integer Programming model that can operate on nested CDFGs while deciding what to merge based on an area budget.
To motivate our approach to DSE automation, we categorize in Section 2 existing design tools and techniques for SoC accelerators. That leads to a statement, in Section 3, of the problem AccelMerger solves: efficient selection of application regions of varying granularity for hardware implementation optimized for area constraints and chip characteristics.
2 RELATED WORK
Table 1 presents a taxonomy of related work. The rows represent desirable features for accelerator merging/DSE tools, and the columns represent bodies of related research. For each column, we look at the following features:
| Feature | Fine-grained Merging | Manual DFG Merging | Early DSE | AccelMerger |
|---|---|---|---|---|
| Application Time | ✗ | ✓ | ✓ | ✓ |
| Communication | ✗ | ✓ | ✓ | ✓ |
| Fine-grained Merging | ✓ | ✓ | ✗ | ✓ |
| Coarse-grained Merging | ✗ | ✗ | ✗ | ✓ |
| Early HW/SW Part. | ✗ | ✗ | ✓ | ✓ |
| Flexible Granularity | ✗ | ✗ | ✗ | ✓ |
| Automated | ✓ | ✗ | ✓ | ✓ |
We highlight the dimensions along which AccelMerger provides most contributions in bold.
Table 1. Taxonomy Table
Application Time. It indicates whether the set of techniques evaluates applications end to end instead of focusing on datasets of basic blocks, as occurs in Fine-grained DFG Merging.
Communication. It indicates whether latency and bandwidth for the communication between the accelerators and the CPU are considered. The manual design of accelerators takes this into account using back-of-the-envelope calculations and, eventually, slow cycle-accurate Late-DSE [28, 29, 30], which still requires manual application transformations, to validate a limited number of designs. Among prior Early-DSE techniques, only AccelSeeker [42] supports latency estimation. For synthesizable applications, in which the amount of data produced and consumed is easier to analyze statically, AccelMerger takes both latency and bandwidth into account.
Fine-grained Merging. In manual accelerator design, the architects might develop Processing-Element-based accelerators that capture common DFGs relevant across an application. The fine-grained DFG literature has automated this step on small basic-block datasets.
Early HW/SW Partitioning. Unlike the Early-DSE literature, manual accelerator design and fine-grained merging do not approach accelerator design systematically by tying resource consumption together with the performance maximization problem.
Coarse-grained Merging and Flexible Granularity. These properties refer to being able to merge typical HLS-generated coarse-grained accelerators and to navigate a rich accelerator and SoC design space at different granularities.
We next discuss the columns in Table 1:
Early-DSE: The only other Early-DSE-related work to use Machine Learning as part of its pipeline is Peruse [13]. Peruse, however, does not tie resource area constraints and system performance maximization together as a single optimization problem. RIP [44], Accel/RegionSeeker [42, 43], and AutoDSE [31] formulated the problem as a single optimization procedure using mixed-integer programming and ad hoc algorithms. RIP is evaluated on only two end-to-end applications, and similarly, AccelSeeker is demonstrated on only one application. Furthermore, neither RIP nor AccelSeeker uses machine learning to estimate key accelerator statistics. This is necessary for Early-DSE in general, since the application CDFG goes through many complex transformations during HLS, and estimating accelerator statistics with high precision directly on the original application CDFG is challenging. An ML approach is even more critical for AccelMerger, which must evaluate a set of candidate merged accelerators that grows quadratically with the number of functions and loops.
Manual DFG Merging: Noteworthy implemented accelerators that benefit from manually merging computational patterns include the PuDianNao and DianNao accelerators [5, 17], which are built specifically for machine learning workloads and reuse common computational patterns such as activation functions and typical linear algebra operators.
Fine-grained Merging: This represents the classical resource sharing body of work. This body of work is primarily targeted at extending HLS flows to reuse expensive functional units and small data-flow graphs contained within the boundaries of a basic block. These works were built on top of fundamental technology mapping and optimization research. For example, in [20], the authors co-optimize memory, functional unit, and wiring mapping from data-flow graphs. QsCores [36] is an infrastructure that generates accelerators tightly coupled to the CPU via the L1 cache and programmable through a specific interface that allows arbitrary control transitions, a model that is incompatible with many HLS and cycle-accurate simulation flows. QsCores supports fine-grained merging as part of an in-house HLS tool called Conservation Cores [35]. QsCores builds on Conservation Cores by enhancing system-level considerations when generating fine-grained merged RTL. Similarly, Stitch [33] is focused on generating such tightly coupled fine-grained accelerators for wearable applications. While these fine-grained merging approaches, together with older contributions, focus on merging dataflow graphs available in basic blocks [2, 15, 21, 33, 36], AccelMerger merges coarse-grained accelerators that capture rich control and data flow across whole functions and loops.
Late-DSE contributions relevant for AccelMerger include the cycle-accurate simulators gem5-aladdin [30] and gem5-salam [28], discussed further in Section 3.
Function Merging. FMSA and SalSSa [26, 27] were recently introduced as LLVM [16] compilation transformations that target code reduction for embedded devices. Work prior to FMSA is only able to merge identical functions, whereas FMSA is able to merge functions with different argument lists, returned values, and references, as well as different control-flow graphs. FMSA does not take into account the dynamic behavior of the application, nor does it avoid hurting application performance by limiting the multiplexing introduced in the application hot spots. Finally, neither FMSA nor other function merging approaches have been used for accelerator merging. Both the function merging and the Early-DSE-related work [13, 26, 27] demonstrate their results on the SPEC CPU2006 suite, and we include these applications in this article for comparison.
ML for Early-DSE. Recently, research on applying ML to build accelerator cost models has been limited to the CPU–GPU code-mapping classification problem, i.e., determining whether a CPU or a GPU is more appropriate for running a piece of code. [7] and [40] have used Recurrent Neural Networks and Graph Neural Networks for this classification problem for GPUs and CPUs. However, there has been only limited work on accurately predicting FPGA and ASIC accelerator statistics, and it has not integrated these predictions into an Early-DSE infrastructure [13]. Such work is not Early-DSE per se but, as indicated in this article, can constitute a very helpful component.
3 PROBLEM DEFINITION
Vanilla HW/SW Partitioning. We first describe the vanilla HW/SW partitioning problem as addressed in the related work and then proceed to the larger problem present in systems with limited area resources. Given a set of functions \(P=\lbrace f_1, \ldots , f_n\rbrace\), each best represented by a CDFG, a list of arguments, and a returned type, vanilla HW/SW partitioning generates two sets of functions \(A = \lbrace f_{i_1}, \ldots , f_{i_p}\rbrace\) and \(B = \lbrace f_{j_1}, \ldots , f_{j_q}\rbrace\) to execute on a general-purpose core and on specialized hardware, respectively. The two sets represent a partitioning of \(P\) since \(A \cap B = \emptyset\) and \(A \cup B = P\). This partitioning is done by maximizing program performance subject to the total resource requirements of the accelerators and the general-purpose CPU, along with a description of how these components communicate. The related work in early-stage DSE [42, 43] operates on parameters \(sw_i\), \(hw_i\), and \(area_i\), which are automatically determined for each function. \(sw_i\) and \(hw_i\) represent the overall time spent in seconds executing a function in software, using a general-purpose core, and in hardware, using coarse-grained accelerators, respectively. \(area_i\) is the area requirement per component measured in Lookup Tables (LUTs). Some features usually modeled only in late-stage DSE tools, such as gem5-aladdin [30] and gem5-salam [28], are the interconnect \(latency\) and \(bandwidth\), which provide system-level insight, and we take these into account when determining the optimal partitioning. In contrast with late-stage DSE, which is focused on evaluating point-solution SoC designs with a predefined set of functions to accelerate and run in software, we need an agile way of determining which loops and functions to accelerate and merge.
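As a concrete illustration, the vanilla partitioning can be read as a knapsack-style selection over per-function \(sw_i\), \(hw_i\), and \(area_i\) values. Below is a minimal brute-force sketch; the profile numbers and function names are invented for illustration, and the actual selection (Section 4.3) uses mixed-integer programming and also models communication:

```python
from itertools import combinations

# Hypothetical per-function profile: name -> (sw_i seconds, hw_i seconds, area_i LUTs)
FUNCS = {"f1": (5.0, 1.0, 300), "f2": (3.0, 0.5, 500), "f3": (0.2, 0.1, 400)}

def vanilla_partition(funcs, area_budget):
    """Exhaustively choose the hardware set B maximizing the time saved
    (sw_i - hw_i), subject to total accelerator area <= area_budget.
    Returns (A, B): functions kept in software and mapped to hardware."""
    names = list(funcs)
    best_saving, best_hw = 0.0, frozenset()
    for r in range(len(names) + 1):
        for hw in combinations(names, r):
            if sum(funcs[f][2] for f in hw) > area_budget:
                continue  # accelerators do not fit the fabric
            saving = sum(funcs[f][0] - funcs[f][1] for f in hw)
            if saving > best_saving:
                best_saving, best_hw = saving, frozenset(hw)
    return set(names) - set(best_hw), set(best_hw)
```

With an 800-LUT budget, the sketch picks \(B=\lbrace f_1, f_2\rbrace\); the exhaustive search is exponential, which is why a MIP formulation is used in practice.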
Moreover, for Post-Moore’s law accelerators with limited compute fabric, we need to build an extended set of functions that result from merging the most similar functions in \(P\). These would enable hardware realization with lower area costs. The creation of merged accelerators needs to overcome three technical issues, described in the following paragraphs.
Function Merging at Arbitrary Granularities. We need to expand the set \(P\) with merged functions \(M=\lbrace f_{k_1}, \ldots , f_{k_l}\rbrace\) that exploit flexible computational granularities and CDFG patterns to produce a new set \(P^{\prime }=P \cup M\), containing new merged functions and functions from the original set \(P\).
In this work, we refer to merging CDFG patterns in the sense of merging functions by sequence alignment [27], which we will describe in Section 4.2. In this paradigm, functions are represented in LLVM [16] and are composed of a list of parameters, a control-flow graph (CFG [1]), and a return type. The nodes in this graph are basic blocks, i.e., sequences of instructions in SSA format [8] with single entry and exit points. Languages such as C/C++ are fully supported in LLVM. In this work, we fully support LLVM, and our approach is not limited to a subset of the C or C++ language. In this context, loops are cycles in the CFG graph. Many loops can be analyzed statically and are safe to extract into functions, for example, to detect more function-level merging opportunities. Some loops that are not safe to extract are those that might exit the function via a system call. We rely on the existing LLVM loop extraction pass for this to ensure safe loop extraction.
When merging two “parent” functions, \(f_{i_1} \in P\) and \(f_{i_2} \in P\), the desired effect is that the result, say \(f_{k_1}\), satisfies \(area_{f_{k_1}} \lt area_{f_{i_1}} + area_{f_{i_2}}\). The exact code transformation to create these new candidates, namely Function Merging, should do more than just concatenate the CDFGs of the parent functions. Instead, it should reuse nodes corresponding to instructions in the high-level functions, and to RTL functional units at the lower level if these functions are synthesizable. We can express the Function Merging transformation more concisely as \(\begin{align*} \textrm {fm}:P \times P &\rightarrow M \\ (f_{i} ,\; f_{j}) &\mapsto f_k. \end{align*}\) The size of \(M\), denoted \(|M|\), is at most \(n^2\), depending on how many similar functions there are in \(P\), and therefore \(n + 1 \le k \le n + n^2\). We use subscripts starting from \(n+1\) for \(k\) to avoid confusing the \(f_k\) functions in \(M\) with the functions in \(P\) with subscripts \(1..n\). We address the CDFG reuse problem in Section 4.2.
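To make the quadratic candidate set concrete, the following sketch enumerates candidate pairs with a cheap opcode-histogram similarity pre-filter. The similarity measure and threshold are illustrative assumptions, not AccelMerger's actual heuristic (described in Section 4.2):

```python
from collections import Counter

def fingerprint_similarity(ops_a, ops_b):
    """Cheap similarity between two functions' opcode histograms,
    usable as a pre-filter before attempting an expensive merge."""
    ca, cb = Counter(ops_a), Counter(ops_b)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 0.0

def merge_candidates(funcs, threshold=0.5):
    """funcs: name -> list of opcodes. Returns the unordered candidate
    pairs to hand to fm; never more than the n**2 bound from the text."""
    names = list(funcs)
    return [(a, b)
            for i, a in enumerate(names) for b in names[i + 1:]
            if fingerprint_similarity(funcs[a], funcs[b]) >= threshold]
```

On three toy functions, only the two that share most opcodes survive the filter, keeping the expensive alignment step off dissimilar pairs.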
Accelerator Modeling. Since the set of possible merged functions \(M\) grows quadratically with the number of functions and loops in the original program \(P\), Early-DSE needs to model the \(area_i\) quantities with high accuracy, since area reduction is the main criterion for choosing to merge two functions. For hardware and software latency, the related work has indicated that simple linear models can provide estimates for \(sw_i\) and \(hw_i\) that result in overall high-quality feedback about the most desirable functions to accelerate [42]. Section 4.1 describes how AccelMerger models accelerator area and latency.
HW/SW Partitioning. In the context of the new program \(P^{\prime }\), HW/SW partitioning is significantly complicated because \(P^{\prime }\) contains functions that have overlapping functionalities, and because RTL flows create hardware for a function \(f\) in a hierarchical manner, including the CDFGs of \(f\)’s callees. These interactions include the fact that merged functions still need to be considered in the context of \(P\)’s call graph. For example, if a merged function is realized in hardware, its callees need to be realized in hardware as well. Solving the HW/SW Partitioning problem with merged accelerators will result in an output as illustrated in Figure 2. The HW/SW partitioning step needs to produce three sets \(A^{\prime }= \lbrace f_{i_1}^{\prime }, \ldots , f_{i_p}^{\prime }\rbrace\), \(B^{\prime }=\lbrace f_{j_1}^{\prime }, \ldots , f_{j_q}^{\prime }\rbrace\), and \(C = \lbrace f_{k_1}, \ldots , f_{k_r}\rbrace\) such that \(A^{\prime } \subseteq P\) and \(B^{\prime } \subseteq P\) and \(C \subseteq M\). \(A^{\prime }\), \(B^{\prime }\), and \(C\) cannot be considered a partition in the mathematical sense since there can be more functions in \(P^{\prime }\) than in the union \(A^{\prime } \cup B^{\prime } \cup C\), meaning that before the HW/SW partitioning step the function merging step might generate merged functions \(f_k\) that are not selected in the final SoC when the parent functions \(\textrm {fm}^{-1}(f_k) = (f_i, f_j)\) are more profitable for a specific area budget and SoC communication characteristics.
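The call-graph hierarchy constraint stated above (a hardware function's callees must also be in hardware) can be checked mechanically. A sketch, assuming the call graph is given as an adjacency map with hypothetical names:

```python
def hw_closure_ok(hw_set, callees):
    """Check the hierarchy constraint: if a function is realized in
    hardware, its transitive callees must be in hardware too.
    callees: name -> iterable of directly called function names."""
    for f in hw_set:
        seen, stack = set(), list(callees.get(f, ()))
        while stack:
            g = stack.pop()
            if g in seen:
                continue  # already visited (handles recursive call graphs)
            seen.add(g)
            if g not in hw_set:
                return False  # a callee would be left in software
            stack.extend(callees.get(g, ()))
    return True
```

A MIP formulation would encode the same requirement as implication constraints between selection variables.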
Fig. 2. AccelMerger’s SoC model and feedback template for the system architect. This is a lower-level view of the Merged and Non-merged accelerator selection output in Figure 1.
We use \(f_{i_1}^{\prime }\) in contrast to \(f_{i_1}\) to denote that, for a given area budget, the functions we decide to run in software and hardware will differ from those selected and generated by vanilla Early-DSE. The subset \(C\) represents merged functions mapped onto hardware but never onto the general-purpose software processor, since merged functions reduce accelerator area consumption, a benefit that only the accelerators can leverage. In Figure 2, we observe that the interconnect we model between the CPU and the accelerators has latency and bandwidth properties, which the architect can provide in a configuration file as indicated in Figure 1. Each call to a function mapped onto hardware and called from software takes “latency” cycles to initiate the computation plus the time required to move data from the CPU or another accelerator’s local memory according to the available bandwidth. We indicate how AccelMerger accounts for these communication costs in Section 4.3.
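Under this interconnect model, the cost of one hardware invocation can be sketched as follows. This is a simplification that ignores overlap between transfer and compute, and all names and numbers are illustrative:

```python
def call_cost_cycles(init_latency, bytes_moved, bytes_per_cycle, hw_cycles):
    """One accelerator invocation: a fixed initiation latency, plus moving
    the inputs/outputs at the available bandwidth, plus the accelerator's
    own compute time."""
    return init_latency + bytes_moved / bytes_per_cycle + hw_cycles

def worth_offloading(calls, sw_cycles_per_call, **cost):
    """Offloading pays off only when the per-call hardware cost,
    including communication, beats the software time."""
    return calls * call_cost_cycles(**cost) < calls * sw_cycles_per_call
```

For example, with 100 cycles of initiation latency, 4 KiB moved at 8 bytes/cycle, and 500 compute cycles, one call costs 1112 cycles: profitable against a 2000-cycle software version, but not against a 1000-cycle one.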
4 ACCELMERGER
AccelMerger is composed of three major pieces: Ⓐ Frontend, Ⓑ Transformation Pass: Function Merging, and Ⓒ HW/SW Partitioning, as indicated in Figure 3.
Fig. 3. AccelMerger’s workflow. (A) Accelerator Modeling and Loop Extraction. (B) Function Merging Pass. (C) HW/SW Partitioning.
Ⓐ Frontend: Accelerator Modeling and Loop Extraction are described in Section 4.1. The input to this stage is the application source code, and the main result is a model for coarse-grained accelerator area prediction. The application is also transformed to enable loop-level merging. Ⓑ Function Merging, described in Section 4.2, is a transformation pass used to merge functions taking into account their characteristics when mapped onto hardware, using the model from step Ⓐ and latency prediction models from the related work. In order to determine whether the merged functions make sense in the context of the system specification, both the new merged functions and the original functions are forwarded to the final step. Ⓒ HW/SW Partitioning is the step where we determine the final layout of the application in terms of functions to be executed in software, in merged hardware, and in non-merged hardware, depending on the available area budget and interconnect characteristics. This step is described in Section 4.3. Finally, Section 5 analyzes AccelMerger experimentally.
4.1 Frontend
As detailed in Section 4.2, merging needs to quickly evaluate the feasibility of a merge operation as the number of possible merges increases quadratically.1
Hot Code: AccelMerger dynamically profiles the application at the basic-block level to identify the most frequently executed functions and loops, which become the candidates for acceleration and merging.
Area and Latency Target Variables: For each candidate function, we predict the post-HLS accelerator area, measured in LUTs, and the accelerator latency.
Features Used to Predict Area and Latency: We model the area required for each static instruction using multiple techniques able to process tabular data, including the Multilayer Perceptron, Random Forests, and the LASSO linear model. The features we use are the numbers of LLVM operations of each type; the ground-truth HW resource numbers for area are obtained with Bambu HLS. When modeling the area for a function \(f\), we count the LLVM instruction features hierarchically, taking into account the static number of instructions in \(f\)’s callees. We obtain the models by training on both merged and non-merged synthesizable functions in H.264 and MachSuite [25]. We configure Bambu HLS to allocate logic using LUTs exclusively to simplify area accounting and speed up comparisons across different approaches. To estimate the accelerator latency, we use Aladdin’s latency model on the hierarchical counts of dynamic instructions for each instruction type.
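The hierarchical feature counting can be sketched as a walk over the call graph that sums per-function opcode histograms. This is an illustrative simplification: each callee body is counted once, regardless of how many call sites reference it.

```python
from collections import Counter

def hierarchical_features(func, opcodes, callees):
    """Static LLVM opcode counts for `func`, including its transitive
    callees, mirroring the hierarchical counting described above.
    opcodes: name -> Counter(opcode -> static count)
    callees: name -> iterable of called function names."""
    total, seen, stack = Counter(), set(), [func]
    while stack:
        f = stack.pop()
        if f in seen:
            continue  # visit each function once, even in cyclic call graphs
        seen.add(f)
        total += opcodes.get(f, Counter())
        stack.extend(callees.get(f, ()))
    return total
```

The resulting histogram is the feature vector handed to the regression models.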
Model Hyperparameters: Table 2 shows the result of predicting the area wins for merged accelerators using different machine learning models. We will explain in Section 4.2 the exact mechanism that allows us to produce merged accelerators; in this section, we focus exclusively on how we measure their area wins. The accuracy numbers shown in the table are generated for the best model across an exhaustive (grid) search of model hyperparameters, using three-fold cross-validation. LASSO is the simplest model that we experiment with here, and we tune two hyperparameters, including the model complexity penalization parameter alpha, with a total of six combinations. For Random Forests, we perform a grid search over trees with a specific maximum depth, a maximum number of estimators, and a maximum number of features; the total number of hyperparameter configurations we explore for Random Forests is 360. For the MLP neural network, we tune the number of hidden layers, the number of neurons per layer, the activation functions, the complexity penalization parameter alpha, and the random seed. The total number of combinations explored is also 360, and the best model selected had 6 hidden layers with 40 neurons each, the ReLU activation function, and \(\alpha = 0\). This architecture was better than larger ones, which over-fitted the data and resulted in lower accuracies.
LASSO stands for the “Least Absolute Shrinkage and Selection Operator” linear prediction model and MLP stands for the Multilayer Perceptron, a neural network model.
Table 2. Model Selection and Accuracy Table for Area Wins Prediction
Accuracy Metrics: Table 2 shows the model accuracy both when training each model on fewer data points (functions), 200 in the case of the <model>–200 rows, and on more functions, 600 in the case of the <model>–600 rows. In the columns, we display two common regression accuracy metrics. First, \(r^2 = 1 - \frac{SS_{res}}{SS_{tot}}\), which is interpreted as the percentage of variance in the true target variable \(y\) explained by the model \(f\), with \(SS_{res} = \sum _{i=1}^n (y_i - f_i)^2\) and \(SS_{tot} = \sum _{i=1}^n (y_i - average(y))^2\). We see that the LASSO linear model can generalize very well, provided that enough data is available. With only 200 data points, however, its \(r^2\) is only 67%, whereas the non-linear models can quickly model the patterns resulting from the multiple transformations HLS tools perform to turn high-level code into RTL, even with fewer data points. \(r^2\) is a loose accuracy metric, even though it is popular in the related work.
Second, we also measure the Mean Relative Error \(MRE = \frac{1}{n}\sum _{i=1}^n \frac{|y_i - f_i|}{|y_i|}\). This is the metric that related work in Late-Stage DSE, such as Aladdin [29], uses to measure accuracy for area and latency. It is a much stronger metric, and late-stage DSE is able to minimize this error by iterating over representative applications with dozens of optimizations over the dynamic data-dependence graph (DDDG) representative of HLS flows that generate hardware. In a sense, the related work for Late-DSE reports the equivalent of train-time MRE error, since models are created on the same data that is later reported as indicative of in-production behavior. Remarkably, with the MLP-600 model we achieve roughly the same train-time MRE as Aladdin, while also reporting the test-time MRE, which is 21%. The MLP produces this accuracy starting only from information available in the high-level LLVM code (the instruction opcodes), and thus we are able to use this model on both synthesizable and non-synthesizable applications.
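For reference, both metrics are straightforward to compute from predictions; a minimal sketch matching the formulas above:

```python
def r2_score(y, f):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def mean_relative_error(y, f):
    """MRE: the average of per-sample relative errors."""
    return sum(abs(yi - fi) / abs(yi) for yi, fi in zip(y, f)) / len(y)
```

Note how forgiving \(r^2\) is compared to MRE: predictions off by 10 and 5 LUTs on targets of 100 and 200 still yield an \(r^2\) above 0.99, while the MRE already reports a 5% error.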
We observe in Table 2 that the training error for MLP-600 is slightly larger than that of MLP-200, while the test error decreases from the latter to the former. Neural networks are well known to quickly overfit the training data, missing the real patterns of the larger dataset (the test data is the same across models trained with less or more data). Nevertheless, with enough training data the neural network manages to generalize better, driving down the test error while slightly increasing the training error. This trend of lowering test error while increasing training error with larger datasets is well known in the Machine Learning community and is illustrated in the second figure of Section 30 in [23].
Choosing the MLP Model: Figure 4 shows these results via the relationship between the predicted area savings, using the MLP-600 model, measured in LUTs, and the real area savings observed by synthesizing the merged accelerators. The figure demonstrates that we effectively filter out merges that are not profitable and, more importantly, that we do not omit profitable merges. The figure also shows that, in the absence of an accurate area-prediction model like MLP-600, it would be very hard to filter out merged accelerators with area losses.
Fig. 4. Multilayer Perceptron (MLP-600) area model for LUT wins. Synthesized/Real area vs. Predicted Area. TP = True Positive, FP = False Positive, TN = True Negative, FN = worst-case scenario, accelerators not produced that would have brought area benefits.
Accelerable Language Constructs: One key limitation of related work in fine-grained merging is that only a small subset of the high-level C/C++ or intermediate LLVM language constructs is supported. Those subsets are typically smaller than the language subsets supported by large HLS projects such as Vitis HLS, Catapult HLS, or Bambu. Our data collection, ML models, and the function merging code generation described in Section 4.2 fully support the LLVM semantics. Moreover, when a function is not synthesizable, AccelMerger can still model it and provide insights about its suitability for acceleration and merging.
Loop Extraction: So far we have described how the Ⓐ Frontend extracts static and dynamic information from the original functions and then applies the appropriate model to predict the quantities of interest \(hw_i\), \(sw_i\), and \(area_i\). The process of collecting information like this directly from the original functions and then partitioning them into SW and HW in step Ⓒ HW/SW Partitioning is denoted Function Extraction (FE), the terminology used later in the Experimental Setup section. When further enhancing the original functions with merged functions before step Ⓒ, we call this configuration Merging+FE.
In order to increase the independence with respect to the level of granularity at which the programmer has originally implemented the functions, we extract top-level loops from the original functions. We use LLVM’s loop extractor to perform this transformation automatically. This results in new \(hw_i\), \(sw_i\), and \(area_i\) estimates since the set of functions is changed. These new estimates are used both at merge time in step Ⓑ Function Merging and in Ⓒ HW/SW Partitioning. When the estimates are used directly by Ⓒ without any coarse-grained merging, we call this configuration function-based Loop Extraction (FLE). In this configuration loops are transformed into functions since steps Ⓑ and Ⓒ operate at the function abstraction-level. The result is that now we can merge loops with loops and loops with functions and then create an HW/SW partitioning that contains original functions, merged loops, merged functions, and functions merged with loops realized in hardware. In applications with hard-to-analyze loops (e.g.,
Hot code profiling works seamlessly with the LLVM loop extraction pass since it is performed at the level of the basic blocks and loops are represented by basic-block-level graphs (CFGs) that contain cycles.
FE has the disadvantage of fewer merging opportunities, but it avoids the extra calling and parameter passing overhead in FLE. The pros and cons of FLE are discussed extensively in Section 5 and in [13].
4.2 Function Merging
4.2.1 Function Merging Example.
Figure 5 illustrates two similar simple functions that we can transform into a new function that is semantically equivalent to both input functions by using a flag \(f\_sel\) that multiplexes non-aligned instructions and operands. In this section, we only discuss how function merging is realized, provided that two functions look similar enough according to a simple heuristic discussed later. Once the merge is performed, we can apply the models described in Section 4.1 to the input functions and the merged function to determine with high precision whether the merge is profitable. In this example, merging is performed following the methodology of [27], with some differences that make Function Merging amenable to accelerator merging and system design rather than code compression, which is the focus of the Function-Merging-related work [26, 27].
Fig. 5. Simple function merging example expressed in high-level C-like pseudo-code.
Parameter Merging. Figure 5 shows two input functions whose input parameters are partially similar. Function Merging starts by merging parameters based on their types, not their names. A direct approach is taken in [27]: for each parameter of the first function, in declaration order, a parameter of the same type is searched for in the second function, also in declaration order. If a match is found, we proceed to match the next parameters; if not, the currently analyzed parameter of the first function is left unmerged. An unmerged parameter, whenever it is encountered as an operand of a merged instruction, is multiplexed using a \(select\) instruction and the \(f\_sel\) flag. For the second parameter in f1, we proceed with the remaining parameters in f2. [27] has shown this parameter merging approach to be very effective, even compared to variants that take the function body into account, and we observe similar area benefits from parameter merging of about \(7\%\).
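The greedy, declaration-order, type-based matching described above can be sketched as follows. This is a hypothetical illustration, not AccelMerger's actual code; parameters are modeled as `(name, type)` pairs, and the names are illustrative.

```python
def merge_parameters(params1, params2):
    """Match each parameter of f1 with the next unmatched parameter of f2
    that has the same type, scanning both lists in declaration order.
    Unmatched parameters are kept as-is and would later be multiplexed
    with the f_sel flag at their use sites."""
    merged, matched2 = [], set()
    j = 0
    for p1 in params1:
        match = None
        # scan f2's parameters in declaration order, resuming after
        # the last match so the relative order is preserved
        for k in range(j, len(params2)):
            if k not in matched2 and params2[k][1] == p1[1]:
                match = k
                break
        if match is not None:
            matched2.add(match)
            j = match + 1
            merged.append((p1[0] + "_" + params2[match][0], p1[1]))
        else:
            merged.append(p1)  # unmerged: selected via select/f_sel
    # append f2's parameters that found no partner in f1
    merged += [p for k, p in enumerate(params2) if k not in matched2]
    return merged
```

For example, merging `f1(int a, float b)` with `f2(int x, int y)` pairs `a` with `x` (same type), keeps `b` unmerged, and appends `y`.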
Merging Function Bodies. We see in Figure 5 that f1 and f2 have important similarities. The aligned instructions are determined by running the Needleman-Wunsch (NW) sequence alignment algorithm [22] on the opcodes of the instructions. Since NW operates on two strings and aligns their characters by finding the longest common subsequence (LCS),2 a linearization preprocessing step is necessary for both f1 and f2, because at the IR level code is represented as a CDFG. NW indicates which instructions can be reused and which cannot. In this simple example, only slight discrepancies must be settled: the differing condition of the \(if\) statement at line 24, the discrepancy in the first operand of the addition at line 26, and the fact that f1 has a call, at lines 30 and 31, that is not present in f2. These discrepancies are resolved via \(select\) instructions that select the right operands, and via \(if\) statements for discrepancies that concern not operand selection but whole instructions or groups of instructions that were not aligned across the two functions. For example, the \(call\) of f3 at the end of f1 is not present in f2 and is therefore only executed when f12 is called from f1’s call sites.
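The alignment step can be illustrated with a small sketch. The code below runs an NW-style dynamic program (match = +1, mismatch/gap = 0, i.e., LCS) over two linearized opcode sequences and recovers the aligned instruction pairs; it is a simplified stand-in for the actual merging engine.

```python
def align_opcodes(seq1, seq2):
    """Align two linearized opcode sequences a la Needleman-Wunsch with
    LCS scoring. Returns pairs of aligned indices; unaligned
    instructions would later be guarded by the f_sel flag."""
    n, m = len(seq1), len(seq2)
    # DP table of LCS lengths over prefixes
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if seq1[i] == seq2[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    # traceback to recover the aligned opcode pairs
    aligned, i, j = [], n, m
    while i > 0 and j > 0:
        if seq1[i - 1] == seq2[j - 1]:
            aligned.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return aligned[::-1]
```

On `["load","mul","add","store","call"]` and `["load","sub","add","store"]`, the `load`/`add`/`store` instructions align and can be shared, while `mul`, `sub`, and `call` must be multiplexed or guarded.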
4.2.2 Accelerator-driven Function Merging.
We now describe how we systematically approach the merging problem in step Ⓑ box 1 of Figure 3, “Transformation Pass: Function Merging.”
(1) Candidate Ranking. We start the function merging process by ranking function pairs according to simple fingerprints that indicate how many instructions two functions share. This ranking enables merging to scale to large codebases. Then the most similar candidates are linearized, i.e., we convert the graph structure into a sequential string of instructions. In
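The fingerprint-based ranking can be sketched as follows: fingerprints are opcode histograms, and the similarity of a pair is the size of the multiset intersection of their histograms. This is an illustrative sketch, not AccelMerger's implementation.

```python
from collections import Counter
from itertools import combinations

def fingerprint(opcodes):
    """Cheap fingerprint: the opcode histogram of a function."""
    return Counter(opcodes)

def rank_pairs(functions):
    """Rank function pairs by shared-instruction count (multiset
    intersection of their fingerprints), most similar first.
    functions: name -> list of opcodes."""
    fps = {name: fingerprint(ops) for name, ops in functions.items()}
    scored = []
    for f1, f2 in combinations(fps, 2):
        shared = sum((fps[f1] & fps[f2]).values())
        scored.append((shared, f1, f2))
    scored.sort(reverse=True)
    return [(f1, f2) for _, f1, f2 in scored]
```

Only the top-ranked pairs then pay the cost of linearization and full sequence alignment, which is what lets the approach scale quadratically in fingerprint comparisons rather than in alignments.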
(2) Merging Engine. For step Ⓑ box 2 in
(3) Using Area/Latency Model on Opcodes. Next, in step Ⓑ box 3, we use the area-predicting MLP-600 model presented in Section 4.1 to predict the LUT consumption with high accuracy, together with the Aladdin per-instruction latency model. MLP-600 predicts the resource consumption from the opcode counts of the merging input functions and the opcode counts of the merging result.
(4) Profitability Metrics. In step Ⓑ box 4 we filter out the merged functions that do not yield area wins, as well as those with unacceptable latency overhead. In this step we check whether the area of the merged function \(f_{k_1}\) is smaller than the combined area of the input functions \(f_{i_1}\) and \(f_{i_2}\) (i.e., \(area_{f_{k_1}} \lt area_{f_{i_1}} + area_{f_{i_2}}\)). If a merge passes this test, we check whether the resulting accelerator is acceptable latency-wise.
For each merging input function we denote the corresponding hardware latencies \(hw_1\) and \(hw_2\), software latencies \(sw_1\) and \(sw_2\), and area consumption \(area_1\) and \(area_2\). The resulting merged function is denoted with specifications \(sw_{12}\), \(hw_{12}\), and \(area_{12}\). For the merged accelerator to provide large benefits, it must, whenever there is enough area for the merged accelerator but not for the two input accelerators, bring larger improvements than the hardware realization of either input function. Equation (1) describes the maximum Estimated Profitability (EP): (1) \(\begin{equation} EP = \frac{sw_1 + sw_2 - hw_{12} - max(sw_1 - hw_1, sw_2 - hw_2)}{Total\_Application\_Exec\_Time}. \end{equation}\)
A profitable EP score is positive. The key insight that makes this equation area-agnostic and useful is that, provided there is enough area to realize a merged accelerator in hardware, there must also be enough area to realize the more profitable of the two parent accelerators. The maximum time-savings benefit corresponding to the input functions is \(max(sw_1 - hw_1, sw_2 - hw_2)\), and the benefit corresponding to the time saved by the merged function is \(B = sw_1 + sw_2 - hw_{12}\). While we could have used \(C = sw_{12} - hw_{12}\) instead of \(B\), using \(B\) allows us to filter out more non-profitable merged accelerators, since the merged accelerator needs to improve latency-wise with respect to the software execution times of the original functions, which carry no merging-derived latency overheads. We use \(B\) in the \(EP\) computation since the real merging benefit \(B\) is always smaller than or equal to \(C\).
We only consider functions with positive EP as the second filter after the area reduction check, to further narrow down the candidates for acceleration.
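The two filters above (area win, then positive EP) can be expressed in a few lines. The sketch below is illustrative; the inputs would come from the MLP-600 and latency models, and the numbers in the usage note are made up.

```python
def is_profitable_merge(sw1, hw1, sw2, hw2, area1, area2,
                        hw12, area12, total_exec_time):
    """Apply the two merge filters: (i) the merged accelerator must be
    smaller than its parents combined, and (ii) the Estimated
    Profitability of Equation (1) must be positive.
    Returns (accepted, EP)."""
    if area12 >= area1 + area2:  # no area win: reject immediately
        return False, 0.0
    # B = sw1 + sw2 - hw12 minus the best single-parent saving,
    # normalized by total application execution time (Equation (1))
    ep = (sw1 + sw2 - hw12 - max(sw1 - hw1, sw2 - hw2)) / total_exec_time
    return ep > 0, ep
```

For instance, with \(sw_1=1000\), \(hw_1=200\), \(sw_2=800\), \(hw_2=300\), \(hw_{12}=600\), and a total execution time of 10,000 cycles, the best single-parent saving is 800, \(B = 1200\), and EP = 400/10000 = 4%.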
In Appendix B we show the different EP metrics and the area savings for all the merges in the
| Application | Function Format | Number of Merges | Application | Function Format | Number of Merges |
|---|---|---|---|---|---|
| h264 | Merging+FLE | 34 | h264 | Merging+FE | 12 |
| sjeng | Merging+FLE | 76 | sjeng | Merging+FE | 19 |
| lbm | Merging+FLE | 3 | lbm | Merging+FE | 3 |
| bzip | Merging+FLE | 68 | bzip | Merging+FE | 10 |
| milc | Merging+FLE | 65 | milc | Merging+FE | 24 |
| libquantum | Merging+FLE | 25 | libquantum | Merging+FE | 14 |
| hmmer | Merging+FLE | 26 | hmmer | Merging+FE | 6 |
| sphinx3 | Merging+FLE | 109 | sphinx3 | Merging+FE | 29 |
| h264ref | Merging+FLE | 190 | h264ref | Merging+FE | 66 |
| perlbench | Merging+FLE | 128 | perlbench | Merging+FE | 76 |
Table 3. Number of Merges in All the Applications Used in This Work
Due to the automated nature of this process, which relies on control- and data-flow similarities rather than on the high-level description of each function, describing the semantics of each pair of merged functions is beyond the scope of this article. However, as mentioned, Appendix B includes the area savings and the EP metric of all the merging candidates of a specific application.
In the next section, we describe the final selection of merged, non-merged, and software-executed functions.
4.3 HW/SW Partitioning
For the hardware/software partitioning formulation, we use MIP. We stay consistent with our previous notation \(area_i\), \(hw_i\), and \(sw_i\). These numerical constants result from the ML-based modeling phase: \(area_i\) and \(hw_i\) indicate the area and hardware execution latency a function/loop would have if realized in hardware, and \(sw_i\) is the latency of executing the function/loop on the CPU.
Software and Hardware Selection. The MIP solver determines whether to realize a function in software or hardware by using the binary variables \(hwv_i\) and \(swv_i\). These variables are mutually exclusive. The objective function in Equation (2) includes these two variables and their associated constant costs in latency, as well as a communication minimization term. It accounts for the latency and data transfer cost for transitioning from a software execution to a hardware one. The \(\mathit {frontier}_{ij}\) variable will be 1 only when \(f_i\) is realized in software and \(f_j\), a callee of \(f_i\), is realized in hardware according to Equation (6). We cap the amount of area for hardware acceleration with Equation (3).
Handling Merge Graph. When two high-similarity functions \(f_i\) and \(f_j\) are merged, a new function \(f_k\) is produced, denoted the \(child\) of the \(parents\) \(f_i\) and \(f_j\). Since children can themselves be merged with other functions, a descendant is a function obtained by repeatedly proceeding from parent to child. For each function of an application, we define a set \(Descend_i\) that contains the children and other recursive descendants of function \(f_i\). For a given \(f_i\), \(swv_i\) and \(hwv_i\) can both be 0, meaning that \(f_i\) is implemented neither in software nor in hardware, since one of its descendants \(f_j\) might have \(hwv_j = 1\). Similarly, a function \(f_k\) might not be realized in either software or hardware if one of its ancestors is selected for the final SoC, in software or in hardware. We call functions without parents Root nodes.
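The \(Roots\) and \(Descend_i\) sets can be derived from the merge records with a simple traversal over the merge DAG. The sketch below is a hypothetical helper, not AccelMerger's actual code; merges are modeled as a map from each child to its two parents.

```python
def build_merge_sets(merges, functions):
    """merges: child -> (parent1, parent2) for every performed merge.
    functions: all function names (originals and merged children).
    Returns Roots (functions that are nobody's child) and Descend[f],
    the set of all recursive merge descendants of f."""
    children = {}  # parent -> set of direct children
    for child, (p1, p2) in merges.items():
        children.setdefault(p1, set()).add(child)
        children.setdefault(p2, set()).add(child)
    roots = [f for f in functions if f not in merges]
    descend = {}

    def collect(f):
        # memoized DFS over the (acyclic) merge graph
        if f in descend:
            return descend[f]
        out = set()
        for c in children.get(f, ()):
            out.add(c)
            out |= collect(c)
        descend[f] = out
        return out

    for f in functions:
        collect(f)
    return roots, descend
```

With merges `f12 = merge(f1, f2)` and `f123 = merge(f12, f3)`, the roots are `f1`, `f2`, `f3`, and `Descend[f1] = {f12, f123}`, which is exactly the set summed over in Equation (4).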
Equation (4) requires each Root to be realized in either hardware or software, or to have its functionality covered by exactly one of its descendants. This enforces that every part of the application is realized either in software or in hardware, taking into account that merged functions cover more functionality than their parents.
Handling Call Graph. In the selection problem, we operate both on the functions in the original program and on the tree of merged functions. To model the hierarchical aspect of coarse-grained accelerator generation in HLS tools, we take into account the program call graph to include all the functions called directly and indirectly by each function. We consider a set \(C_i\) of direct and indirect callees for each function. For each caller-callee pair of functions \(f_i\) and \(f_j\) there is an associated constant number of dynamic calls from \(f_i\) to \(f_j\) “\(calls_{ij}\)” determined with dynamic instrumentation. We use \(calls_{ij}\) as part of the objective function to penalize calls from a software function to a hardware function.
In Equation (5), we handle the hardware realization constraint of the call graph. If we were to disregard the \(Roots\) set and the \(DC=\sum _{k \in Descend_j} hwv_k\) term, this equation would handle the call graph in the no-merging acceleration scenario. Such a hypothetical simplified Equation (5) would read \(hwv_i + 1 - hwv_j \le 1, \forall i \in \lbrace 1..|P^{\prime }|\rbrace , \forall j \in C_i\). This formulation constrains the values of \(hwv_i\) and \(hwv_j\) for all pairs of callers \(f_i\) and callees \(f_j\) to only allow \(hwv_i = 1\) if \(hwv_j = 1\). We add the \(Roots\) set and the descendants of the callees in \(Descend_j\) to co-handle the call graph and the merge graph.
Handling the Interaction between the Call and the Merge Graph. The \(DC\) term in the previous paragraph allows the ILP solver to consider designs for which the caller is realized in hardware (\(hwv_i=1\)) even though the callee might not be (\(hwv_j=0\)). The ILP solver will automatically set \(hwv_j = 0\) if one of the merging descendants of the callee is realized in hardware (\(hwv_k=1\)), since the objective function minimizes the execution time across all the functions. The fact that we only consider callees that are roots in the term \(C_i \cap Roots\) is an optimization to reduce the number of redundant constraints. We can safely perform this optimization since we specify all the descendants of the callee \(f_j\) as part of the term \(DC\).
Using MIP for HW/SW Partitioning. In this article, we use the Python-MIP [34] mixed-integer programming package to find solutions to the HW/SW partitioning problem. Using MIP solvers for Early-DSE has the advantage of finding globally optimal solutions with the aid of high-performance libraries that exploit modern architectures. Therefore, we use the Python-MIP package built on some of the fastest open-source solvers, specifically the COIN-OR Branch-and-Cut solvers (CLP-CBC) [19].
Early DSE for Accelerator Merging Is Faster Than Using Late-DSE Exclusively. Figure 6 shows how the number of merging candidates to evaluate latency-wise grows quadratically with a brute-force merged-accelerator search. The figure motivates a smart merged-accelerator selection mechanism, e.g., the ILP formulation in this work. Based on predictions of the software and hardware latencies using, e.g., the MLP-600 model, selecting the accelerators via the ILP formulation and then synthesizing just the selected accelerators takes less than 10 seconds in our experiments. With as few as 40 input functions and loops, Aladdin takes over 15 hours of analysis and HLS takes about 5 hours to synthesize all the possible merged accelerators. Moreover, cycle-accurate simulation and high-level synthesis cannot even be applied to non-synthesizable applications like the SPEC CPU2006 benchmarks. This limitation is due to the simulation and HLS tools’ expectation of a subset of the C language with explicit, unambiguous memory accesses, and their lack of support for sophisticated computations with side effects and recursion.
Fig. 6. Hours spent in Aladdin to evaluate all possible merging candidates to determine their latency and Bambu HLS to assess their area. We represent the number of input functions for merging in the x-axis of the figure. For 40 Functions to Merge, there are approximately \(40^2/2 = 800\) merged accelerators to evaluate for all the possible pairwise merges. Late-DSE techniques struggle to analyze the possible merges of 40 accelerators in less than 22 hours.
Fig. 7. Seconds spent in AccelMerger performing DSE for an extensive range of functions. For 3,000 functions to merge, there are approximately \(4.5*10^6\) possible merged accelerators to evaluate for all the possible pairwise merges. Using accurate accelerator modeling with MIP-based DSE dramatically reduces analysis time. The applications used in this experiment are MachSuite, synthesizable H.264, and SPEC’06.
Objective (2) \(\begin{equation} \begin{split} \quad \min _{ {{{\scriptstyle \begin{matrix}hwv_{i},\\ swv_{i},\\ frontier_{ij} \in \lbrace 0, 1\rbrace \end{matrix}}}}} \Bigg (\sum _{1 \le i \le |P^{\prime }|} \Big (hwv_{i} \cdot hw_i + swv_i \cdot sw_i + \sum _{j \in C_i} calls_{ij} \cdot frontier_{ij} \cdot latency\Big)\Bigg) \end{split} \end{equation}\)
Constraints (3) \(\begin{equation} \sum _{1 \le i \le |P^{\prime }|} hwv_{i} \cdot area_{i} \le area\_budget \end{equation}\) (4) \(\begin{equation} swv_i + hwv_i + \sum _{j \in Descend_i} (swv_j + hwv_j) = 1 \text{, } \forall i \in Roots \end{equation}\) (5) \(\begin{equation} \begin{split} hwv_i + 1 - hwv_j - \sum _{k \in Descend_j} hwv_k \le 1 \text{, } \forall i \in \lbrace 1..|P^{\prime }|\rbrace \text{, } j \in C_i \cap Roots \end{split} \end{equation}\) (6) \(\begin{equation} swv_i + hwv_j - frontier_{ij} \le 1 \: \text{, } \forall i \in \lbrace 1..|P^{\prime }|\rbrace \text{, } j \in C_i \end{equation}\)
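To make the formulation concrete, the sketch below brute-forces the binary assignment on a toy instance under simplifying assumptions (no merge graph, so every function is a Root with an empty \(Descend\) set and \(swv_i = 1 - hwv_i\)); the article itself solves the full MIP with Python-MIP/CBC. All names and numbers are illustrative.

```python
from itertools import product

def partition(funcs, callees, calls, area_budget, latency):
    """Brute-force the HW/SW assignment for a toy, merge-free instance
    of the formulation above.
    funcs: name -> (sw, hw, area); callees: name -> list of callees;
    calls: (caller, callee) -> dynamic call count."""
    names = list(funcs)
    best, best_cost = None, float("inf")
    for hwv in product([0, 1], repeat=len(names)):
        hw = dict(zip(names, hwv))
        # Equation (3): total accelerator area within the budget
        if sum(funcs[f][2] for f in names if hw[f]) > area_budget:
            continue
        # simplified Equation (5): a hardware caller needs hardware callees
        if any(hw[f] and not hw[c] for f in names for c in callees.get(f, [])):
            continue
        # Objective (2): execution latency plus SW->HW crossing cost,
        # where frontier_ij = 1 iff f_i runs in SW and callee f_j in HW
        cost = sum(funcs[f][1] if hw[f] else funcs[f][0] for f in names)
        cost += sum(calls.get((f, c), 0) * latency
                    for f in names for c in callees.get(f, [])
                    if not hw[f] and hw[c])
        if cost < best_cost:
            best, best_cost = hw, cost
    return best, best_cost
```

On a two-function instance where `main` calls `kern`, a generous budget puts both in hardware; shrinking the budget forces `main` back to software and charges the frontier term for each of its calls into the `kern` accelerator.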
5 EXPERIMENTAL RESULTS
5.1 Experimental Setup
Within the scope of our experimental setup we use four configurations, FE, FLE, Merging+FE, and Merging+FLE. Function Extraction (FE) corresponds to using the Frontend and the HW/SW partitioning step in Figure 3 to select functions for hardware realization. This configuration performs SOA early-stage accelerator selection [42].
The following three configurations exploit variable granularity and accelerator merging and represent
All the speedups reported in the following results are computed with respect to a serial execution that benefits from the instruction-level parallelism available within a basic block but uses no accelerators. We call this baseline the software execution (sw) of an application. All the other configurations benefit from increasing area budgets by realizing in hardware an optimal subset of functions (optimal with respect to the information available in the early design stages).
5.2 Overall Performance Improvement
In Figure 8 we showcase the benefits of accelerator merging when placing both the merged accelerators and the non-merged ones in the context of the whole application.
A variety of applications yield different behavior as a result of varied granularities produced via merging and loop extraction. For example, performing loop extraction for the applications
The applications that benefit from breaking down functions into loops are
In other applications such as
In
For applications like
In Figure 9 we represent the normalized time spent in software execution, non-merged hardware, merged hardware, and communication. We show the allocation of original application cycles both for the FLE configuration and for the Merging+FLE, optimizable by merged and non-merged hardware. As expected, applications belonging to the “Coarse-Grained Profitable” and “Medium Grained Profitable” categories in Figure 8 have a larger “Merged HW” fraction. We see that some applications (
Fig. 9. Overall execution time cost breakdown into software cycles, non-merged hardware, and merged hardware. We represent the percentage of cycles optimizable via merged and non-merged hardware at the specific area budget of the Artix Z-7007S, as in related early-DSE work [42]. An interconnect latency of 25 cycles is used for this breakdown. The left set of barplots represents applications in the FLE configuration and the right set represents applications executed with the “FLE+Merging” configuration.
However in applications like
5.3 Accelerator Invocation Latency Impact
Several factors can hurt the performance attainable via acceleration. Applications with high call counts for their most significant functions tend to suffer from communication costs: for each invocation of an accelerator, the host needs to set up memory-mapped model-specific registers (MSRs), and the accelerator notifies the host when the accelerated function and all its callees have finished. Beyond the large gap between memory access time and compute [37], in system configurations with off-chip FPGAs, long cache flushes are required before offloading computation onto accelerators [30].
Any of these scenarios can lead to high latency in initiating communication with the accelerator. In Figure 10, we show the maximum overhead represented by communication across different merging and granularity configurations in such a high-latency scenario (500 cycles).
Fig. 10. The percentage of application execution time spent initiating and terminating accelerator computations in a large-latency situation under different merging and granularity scenarios. Note the discontinuity in the y-axis used to represent the largest values, in the range \(25\%\) to \(50\%\). In this range, the y-axis is more compressed than in the lower part of the figure.
In general, we observe a trend for the FLE+Merging configuration to have a slightly smaller overhead compared to the other configurations. For many applications, the HW/SW partitioning algorithm is able to pick acceleration candidates with low overall latency, but for
We see that for most area budgets and all latencies, merging has a beneficial impact on mitigating latency. The merged accelerators are able to cover more functions from the original application and therefore usually there is less switching between CPU computation and accelerator computation.
5.4 Exploring Other Area Budgets
In Figure 11 we show the speedups that can be accomplished with different area budgets using the SOA in hardware-software partitioning as well as the three new techniques introduced in this article. The graph represents the average over all the applications considered in this work. For very small area budgets most of the applications run in software, and thus the speedups are very close to 1. Most applications also have a transition region where more and more functionality can be realized in hardware, up to an area budget at which the whole application can be executed in hardware. Most applications enter this transition region around 1,000 LUTs, and all of them leave it by 1M LUTs. When the area-latency curve has converged for a given application, that is equivalent to a monolithic accelerator scenario.
Fig. 11. Speedup Geomean w.r.t. software execution across all applications for the FE, Merging+FE, FLE, and Merging+FLE configuration under a wide range of area budgets. Both the x-axis and the y-axis are in logarithmic scale.
Figure 11 contains a superset of the results shown in Figure 8 for the Geomean bars. The average speedups for Merging+FE and Merging+FLE are driven by the applications
5.5 Bandwidth Use-case for H.264
For the synthesizable implementation of the
Fig. 12. Bandwidth scenarios for varying granularities and merging configurations with the synthesizable version of H.264 [18].
5.6 Energy Consumption of Merged Accelerators
One of the key considerations in accelerator design is energy consumption. In this subsection, we discuss an experiment that purely indicates the improvement in energy consumption of the merged accelerators. We leave to future work the reformulation of our objective function to cover the quadratic complexity introduced by energy consumption. In this experiment, we only demonstrate that, in general, merged accelerators exhibit good energy efficiency properties. We show the energy efficiency of fine-grained accelerators produced both by implementing a concatenation function at the LLVM level and feeding the concatenated functions to the HLS tool, and by enabling fine-grained merging in the HLS tool.
In Figure 13 we highlight the average energy efficiency over the
Fig. 13. Average energy efficiency improvements of the Merging+FLE and Merging+FE configurations over the MachSuite and H.264 synth applications.
The power savings derived from the area savings of the SOA HLS tool Bambu [24] are 9.73% with respect to the power of the input accelerators. Since we produce the accelerators with the tightest latency constraint allowed by the Vivado physical design tool, fine-grained merging is performed with practically 0% latency overhead. The power savings therefore translate into exactly the same energy savings.
With Merging+FE and Merging+FLE, the power savings are considerably higher than what fine-grained merging alone delivers in the FE and FLE configurations, reaching 15.6% with respect to no merging, a 5-point difference with respect to FE and FLE. The multiplexing overheads of handling more complicated control and data flow result in an average 4.49% merged-accelerator latency overhead. The resulting energy savings are \(13.3\%\), which represents a 3.57-point improvement over fine-grained merging in FE and FLE. We observe that
6 ACCELMERGER LIMITATIONS/FUTURE WORK
We plan to extend
7 CONCLUSIONS
Early-stage accelerator design through function merging, based on optimized selection of merged and non-merged, hardware-realized functions and loops, opens an exciting research area that promises to benefit performance and area/latency tradeoffs.
Appendices
A EXTRA DATA TO IDENTIFY THE NEED TO RETRAIN
In our training we have included the synthesizable applications in the MachSuite benchmarks. Based on the results in the MachSuite paper [25], we estimate that our training set covers a wide variety of both memory- and compute-intensive accelerators. The memory-intensive applications in MachSuite [20], in the sense of low temporal locality (\(L_{temporal}\)), are SPMV CRS (\(L_{temporal}=12.05\%\)), SPMV ELLPACK (\(L_{temporal}=24.91\%\)), BFS BULK (\(L_{temporal}=31.92\%\)), BFS QUEUE (\(L_{temporal}=32.74\%\)), and MD KNN (\(L_{temporal}=36.64\%\)); these applications belong to the memory-intensive application domains of sparse linear algebra, graph processing, and molecular dynamics.
We also offer a more quantitative approach to identify whether the models presented in this article will perform with high accuracy. We recommend that the number of static operations per function stay between 0 and the maxima reported in Table 4. Similarly, Table 5 presents the maximum number of dynamic operations for the functions used for training. If the functions used are larger than those presented in these tables, we recommend retraining with a larger set of functions. One way to achieve coarser accelerators is also via merging, which is provided as part of
| StaticOp | OpsMax | StaticOp | OpsMax | StaticOp | OpsMax | StaticOp | OpsMax |
|---|---|---|---|---|---|---|---|
| add | 7,263 | fpext | 184 | or | 520 | store | 12,628 |
| alloca | 3,238 | fptosi | 80 | phi | 1,109 | sub | 1,680 |
| and | 1,909 | fptoui | 7 | ptrtoint | 728 | switch | 183 |
| ashr | 833 | fptrunc | 56 | ret | 561 | trunc | 722 |
| bitcast | 3,386 | fsub | 130 | sdiv | 652 | udiv | 22 |
| br | 18,375 | getelementptr | 26,168 | select | 205 | uitofp | 22 |
| call | 4,950 | icmp | 9,690 | sext | 10,918 | unreachable | 15 |
| fadd | 190 | inttoptr | 11 | shl | 430 | urem | 16 |
| fcmp | 90 | load | 52,291 | sitofp | 251 | va_arg | 0 |
| fdiv | 100 | lshr | 131 | srem | 229 | xor | 165 |
| fmul | 275 | mul | 1,414 | store | 12,628 | zext | 1,637 |
For maximum confidence in the AccelMerger approach, if larger values per function are being used, we recommend retraining. The LLVM opcodes not included in this table are not used in our training set; if they were needed, retraining is recommended as well.
Table 4. Maximum Number of Operations per Function Used in the Training Set
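The range check recommended above can be sketched in a few lines. The maxima below are a small illustrative subset of Table 4; a full implementation would load the entire table.

```python
# Illustrative subset of the static per-function opcode maxima (Table 4)
TRAIN_MAX = {"add": 7263, "load": 52291, "store": 12628, "br": 18375}

def needs_retraining(op_counts):
    """Return the opcodes whose static per-function counts fall outside
    the training range; a non-empty result suggests retraining the
    models on a larger set of functions. Opcodes absent from the table
    were never seen in training and also trigger retraining."""
    return [op for op, n in op_counts.items()
            if op not in TRAIN_MAX or n > TRAIN_MAX[op]]
```

For example, a function with 10,000 static `add` instructions exceeds the trained maximum of 7,263 and would be flagged.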
| Dynamic Op | OpsMax | Dynamic Op | OpsMax | Dynamic Op | OpsMax | Dynamic Op | OpsMax |
|---|---|---|---|---|---|---|---|
| add | 8.87E+10 | fmul | 3.79E+06 | mul | 4.47E+08 | srem | 2.22E+08 |
| alloca | 4.15E+09 | fpext | 4.92E+02 | or | 3.78E+08 | store | 2.60E+11 |
| and | 1.83E+08 | fptosi | 3.77E+06 | phi | 1.62E+08 | sub | 8.25E+10 |
| ashr | 6.82E+08 | fptoui | 2.00E+02 | ptrtoint | 5.73E+06 | switch | 2.87E+07 |
| bitcast | 6.45E+07 | fptrunc | 3.85E+02 | ret | 8.30E+08 | trunc | 6.47E+08 |
| br | 6.47E+09 | fsub | 4.30E+06 | sdiv | 6.43E+08 | udiv | 1.96E+07 |
| call | 5.16E+09 | getelementptr | 2.49E+11 | select | 1.86E+07 | uitofp | 3.85E+02 |
| fadd | 3.79E+06 | icmp | 6.29E+09 | sext | 8.43E+10 | urem | 1.96E+07 |
| fcmp | 5.30E+06 | load | 6.13E+11 | shl | 6.06E+08 | xor | 2.31E+08 |
| fdiv | 5.94E+03 | lshr | 1.83E+08 | sitofp | 7.59E+06 | zext | 1.65E+11 |
For maximum confidence in the AccelMerger approach, if larger values per function are being used, we recommend retraining. The va_arg and unreachable instructions are never executed and are therefore not included in this table.
Table 5. Maximum Number of Dynamic Operations per Function Used in the Training Set
B MERGING EXAMPLE
In Table 6 we present all the merges for the
| ChildId | Parent1 | Parent2 | Area Savings (LUTs) | EP |
|---|---|---|---|---|
| 0 | unmake | make | 2,262 | 15.01% |
| 1 | setup_attackers_L2 | setup_attackers_L0 | 278 | 1.8% |
| 2 | merged1 | search_root | 12 | 0.56% |
| 3 | ProbeTT | StoreTT | 1,419 | 0.4% |
| 4 | is_draw | Queen | 66 | 0.04% |
| 5 | QProbeTT | QStoreTT | 1,345 | 0.4% |
| 6 | checkECache | bishop_mobility | 45 | 0.08% |
| 7 | Pawn | Rook | 444 | 1.48% |
| 8 | see_L0 | std_eval_L1 | 83 | 0.64% |
| 9 | King | Bishop | 114 | 0.12% |
| 10 | setup_attackers | add_capture | 33 | 0.12% |
| 11 | is_attacked_L0 | is_attacked_L5 | 233 | 0.56% |
| 12 | merged6 | is_attacked_L3 | 176 | 0.64% |
| 13 | Knight | rook_mobility | 66 | 0.52% |
| 14 | is_attacked_L8 | setup_attackers_L4 | 95 | 0.32% |
| 15 | push_king | push_knighT | 114 | 0.28% |
| 16 | merged13 | comp_to_san_L0 | 36 | 0.04% |
| 17 | f_in_check_L33 | f_in_check_L8 | 95 | 0.04% |
| 18 | merged20 | is_attacked_L7 | 62 | 0.08% |
| 19 | merged17 | remove_one_L0 | 50 | 0.52% |
| 20 | is_attacked_L2 | push_slidE_do.body_L0 | 33 | 0.32% |
| 21 | push_king_castle | add_move | 33 | 0.04% |
The first column shows the merge/child id. The second and third columns show the partial name of each input/parent function. The last two columns show the profitability metrics for the merge: the area savings and the EP metric. When merged_id appears in either parent column, the corresponding child is a second-generation descendant. For 458.sjeng there are only first- and second-generation descendants.
Table 6. All the Profitable Merges in the 458.sjeng for the Merging + FLE Configuration
Footnotes
- [1] 1970. Control flow analysis. ACM SIGPLAN Notices 5, 7 (1970), 1–19.
- [2] 2004. Area-efficient instruction set synthesis for reconfigurable system-on-chip designs. In The Design Automation Conference (DAC’04).
- [3] 2016. Stratus High-Level Synthesis. https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/synthesis/stratus-high-level-synthesis.html.
- [4] 2013. LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems. ACM Transactions on Embedded Computing Systems (TECS) 13, 2 (2013), 1–27.
- [5] 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Architectural Support for Programming Languages and Operating Systems (ASPLOS’14).
- [6] 2008. Pattern-based behavior synthesis for FPGA resource reduction. In International Symposium on Field-Programmable Gate Arrays (FPGA’08).
- [7] 2017. End-to-end deep learning of optimization heuristics. In 26th International Conference on Parallel Architectures and Compilation Techniques (PACT’17). IEEE, 219–232.
- [8] 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems (TOPLAS) 13, 4 (1991), 451–490.
- [9] 2020. Type-directed scheduling of streaming accelerators. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’20). 408–422.
- [10] 2006. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News (2006).
- [11] 2018. Spatial: A language and compiler for application accelerators. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’18). 296–311.
- [12] 2018. HPVM: Heterogeneous parallel virtual machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 68–80.
- [13] 2016. Peruse and profit: Estimating the accelerability of loops. In Proceedings of the 2016 International Conference on Supercomputing (ICS’16). 1–13.
- [14] 2019. HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’19). 242–251.
- [15] 2009. Rapid design of area-efficient custom instructions for reconfigurable embedded processing. Journal of Systems Architecture (2009), 75–86.
- [16] 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’04). IEEE Computer Society, 75.
- [17] 2015. PuDianNao: A polyvalent machine learning accelerator. ACM SIGARCH Computer Architecture News 43, 1 (2015), 369–381.
- [18] 2016. High-level synthesis of complex applications: An H.264 video decoder. In 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’16).
- [19] 2022. http://plato.asu.edu/ftp/lpsimp.html.
- [20] 1995. Symbolic modeling and evaluation of data paths. In 32nd Design Automation Conference. IEEE, 389–394.
- [21] 2005. Efficient datapath merging for partially reconfigurable architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) (2005).
- [22] 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 3 (1970), 443–453.
- [23] 2017. Machine Learning Yearning. http://www.mlyearning.org/.
- [24] 2012. Bambu: A free framework for the high-level synthesis of complex applications. University Booth of DATE (2012).
- [25] 2014. MachSuite: Benchmarks for accelerator design and customized architectures. In 2014 IEEE International Symposium on Workload Characterization (IISWC’14). IEEE, 110–119.
- [26] 2020. Effective function merging in the SSA form. In Programming Language Design and Implementation (PLDI’20).
- [27] 2019. Function merging by sequence alignment. In 2019 International Symposium on Code Generation and Optimization (CGO’19).
- [28] 2020. gem5-SALAM: A system architecture for LLVM-based accelerator modeling. In 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’20). IEEE, 471–482.
- [29] 2014. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. In 2014 ACM/IEEE Annual International Symposium on Computer Architecture (ISCA’14).
- [30] 2016. Co-designing accelerators and SoC interfaces using gem5-Aladdin. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, 1–12.
- [31] 2022. AutoDSE: Enabling software programmers to design efficient FPGA accelerators. ACM Transactions on Design Automation of Electronic Systems (TODAES) 27, 4 (2022), 1–27.
- [32] 2019. Navion: A 2-mW fully integrated real-time visual-inertial odometry accelerator for autonomous navigation of nano drones. IEEE Journal of Solid-State Circuits (JSSC) 54, 4 (2019), 1106–1119.
- [33] 2018. Stitch: Fusible heterogeneous accelerators enmeshed with many-core architecture for wearables. In 2018 ACM/IEEE Annual International Symposium on Computer Architecture (ISCA’18).
- [34] 2022. MIP. https://python-mip.readthedocs.io/en/latest/intro.html.
- [35] 2010. Conservation cores: Reducing the energy of mature computations. In Architectural Support for Programming Languages and Operating Systems (ASPLOS’10).
- [36] 2011. QsCores: Trading dark silicon for scalable energy efficiency with quasi-specific cores. In IEEE/ACM International Symposium on Microarchitecture (MICRO’11).
- [37] 1995. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News 23, 1 (1995), 20–24.
- Getting Started with Vitis HLS. https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Getting-Started-with-Vitis-HLS.
- [38] 2021. Vivado High-Level Synthesis. www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.
- [39] 2021. Xilinx All Programmable SoC Portfolio. https://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf.
- [40] 2020. Deep program structure modeling through multi-relational graph-based learning. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT’20). 111–123.
- [41] 2022. Trireme: Exploring hierarchical multi-level parallelism for domain specific hardware acceleration. arXiv preprint arXiv:2201.08603 (2022).
- [42] 2019. Compiler-assisted selection of hardware acceleration candidates from application source code. In IEEE International Conference on Computer Design (ICCD’19).
- [43] 2018. RegionSeeker: Automatically identifying and selecting accelerators from application source code. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) (2018), 1–6.
- [44] 2017. Accurate high-level modeling and automated hardware/software co-design for effective SoC design space exploration. In The Design Automation Conference (DAC’17).