Abstract
The design of time-critical embedded systems often requires static models of computation such as cyclo-static dataflow. These models enable performance guarantees, execution correctness, and optimized memory usage. Nonetheless, determining optimal buffer sizing of dataflow applications remains difficult: existing methods offer either approximate solutions or fail to provide solutions for complex instances. We propose a throughput-buffering trade-off exploration that uses K-periodic scheduling to direct a design-space exploration—providing optimal solutions while significantly reducing the search space compared to existing methodologies. We compare this strategy against previous approaches and demonstrate search-space reductions over two benchmark suites, resulting in significant improvements in computation times while retaining optimal results.
1 INTRODUCTION
Synchronous Dataflow (SDF) [20] and Cyclo-Static Dataflow (CSDF) [7] are models of computation within the dataflow paradigm of programming. SDF models applications as a set of actors communicating through buffers with single constant production and consumption rates. CSDF is a generalization of SDF that admits cyclically changing rates instead of single constant rates. They are commonly used to describe Digital Signal Processing (DSP) applications (e.g., LTE encoders [23], Deep Neural Networks [30]) or to express parallelism for implementation on specialized hardware (e.g., GPU [17], FPGA [26]).
A key characteristic of SDF and CSDF models is that their scheduling can be determined at compile-time—this allows the throughput and memory allocation of modeled applications to be statically evaluated, thus providing guarantees on performance [12]. The main questions explored within the field thus involve (1) computing feasible and efficient schedules and (2) determining the optimal memory allocations required for a graph to execute at a certain throughput [25].
Throughput-buffering trade-off exploration addresses the latter question of memory optimisation in dataflow models. The goal of this process is to determine all optimal pairs of memory allocations based on their corresponding throughputs for a given Synchronous Dataflow Graph (SDFG) or Cyclo-Static Dataflow Graph (CSDFG). Optimality, in this case, refers to the smallest memory allocation required to achieve a particular throughput.
The importance of guaranteeing a throughput is particularly relevant in real-time applications. The consequences of failing to meet a throughput constraint range from trivial to lethal: from a missed frame in video processing to a potential loss of life in an automatic emergency braking system in a vehicle. Memory optimization, meanwhile, is critical in the context of embedded systems: on-chip memory tends to be scarce, and accessing off-chip memory often incurs significant overheads. Furthermore, excessive memory requirements can result in added costs—in commercially available digital signal processors, for example, the size of on-chip memory determines its cost [24]. Thus, apart from prior works in throughput-buffering trade-off exploration [27, 28], there have also been numerous works motivated by the goal of minimizing memory requirements when using dataflow graphs in embedded system design [1, 5, 6].
The exploration process involves modeling a graph with various configurations of buffer size allocations while computing its resulting throughput. This charts the design space of the graph, which consists of various pairs of memory allocations and corresponding throughputs. Figure 1(b), for example, shows a plot of all pairs identified by running a design-space exploration (DSE) algorithm [28] on the benchmark application shown in Figure 1(a) (Fig8 [27]). The plot of Figure 1(b) shows how the design space charted out by the exploration algorithm allows the optimal memory allocations to be identified. The horizontal alignment of explored points (denoted by circles) indicates that there are multiple memory allocations that can achieve the same throughput. For example, although there are larger memory allocations that can achieve a throughput of 0.0185, the annotated point “(42, 0.0185)” in Figure 1(b) is the smallest memory allocation able to achieve it. The vertical alignment of explored points, however, indicates that different configurations of the same overall memory allocation can have different resulting throughputs. Looking at the same point “(42, 0.0185)” in Figure 1(b), during the DSE, we found a storage distribution of 42 (where the buffer sizes, in order of the channel IDs, are “7,2,7,6,2,2,2,7,7”) that can achieve a throughput of 0.0185. However, we also found memory allocations of size 42 where the application was not functional (by convention, we then set the throughput to null). In this manner, by exploring this space, Pareto points indicating optimal memory allocations (denoted by crosses) can be identified as the smallest memory allocations that can achieve a particular throughput. This process enables developers to compute the various optimum throughput-buffering pairs for an application modeled by the graph.
Fig. 1. Example of a throughput-buffering trade-off exploration.
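The filtering of Pareto points from the explored design space can be sketched as follows. This is a minimal illustration, not the DSE algorithm itself; the explored points are hypothetical, except for the pair (42, 0.0185) taken from Figure 1(b), and deadlocked configurations are represented with a `None` throughput as per the convention above.

```python
def pareto_points(explored):
    """Filter explored (distribution size, throughput) pairs down to the
    Pareto set: a point is kept only if no other point achieves at least
    the same throughput with a strictly smaller distribution size."""
    pts = [(s, t) for s, t in explored if t is not None]  # drop deadlocked configs
    pareto = []
    for s, t in pts:
        dominated = any(
            (s2 <= s and t2 >= t) and (s2 < s or t2 > t)
            for s2, t2 in pts
        )
        if not dominated:
            pareto.append((s, t))
    return sorted(set(pareto))

explored = [(42, 0.0185), (42, None), (44, 0.0185), (40, 0.0123), (46, 0.02)]
pareto_points(explored)  # -> [(40, 0.0123), (42, 0.0185), (46, 0.02)]
```

Note how the non-functional size-42 configuration and the dominated size-44 configuration are discarded, while the smallest allocation for each achievable throughput survives.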
A challenge of throughput-buffering trade-off exploration is minimizing the amount of time taken to complete the exploration process. Looking again at Figure 1(b), we see that there are many more explored points than there are Pareto points. Although the counts of explored points to Pareto points differ by two orders of magnitude in this example, we have observed differences of up to four orders of magnitude in the benchmark applications that we have tested. The size of the design space can slow down the exploration process, as all viable points need to be explored before the full set of optimal solutions can be identified. For complex graphs, exact methods to find Pareto points can therefore have prohibitively long computation times [28], but alternate approaches using approximate methods are not always able to compute an optimal solution [8]. The following section will cover the existing works that have attempted to address this problem.
1.1 Related Work
Stuijk et al. [28] introduced a technique involving the use of a DSE algorithm to determine all Pareto points between the throughput and buffering requirements of a CSDFG. Due to the methods used for throughput computation and for identifying new memory allocations, however, this technique can take an inordinate amount of time to complete, preventing it from computing the complete set of Pareto points for some of the more complex benchmark applications. Our work builds upon their DSE algorithm by proposing the use of an alternate throughput computation method and a new technique for identifying exploration points based on the K-periodic scheduling method introduced by Bodin et al. [9]. We compare the solutions in more detail in Sections 1.2 and 1.3.
Ara et al. [2] extended the work of Stuijk et al. [28] to Scenario-Aware Dataflow Graphs. Interestingly, they use an expansion-based method [25], similar to the critical cycle technique we propose, to identify additional configurations of buffer memory allocations to explore. Despite considerable differences between the two models, their solution would be applicable to CSDF. Nonetheless, their method differs from ours in two ways. First, they only apply their expansion-based method of identifying memory allocations to explore when the currently explored configuration is deadlock-free (i.e., when the allocated buffer sizes are sufficient for infinite execution of the dataflow graph). Second, their expansion-based method is an additional step in the algorithm that operates on a more complex representation of the graph; in our approach, new exploration points are identified directly from our method of schedule computation. These differences mean that the reductions in computation time and design space from our approach do not necessarily carry over to theirs. We would thus expect the benefits of the two approaches to diverge significantly despite their similarities.
More recently, Hendriks et al. [18] proposed the use of monotonic optimizations to quickly identify the optimal memory allocations for a target throughput. Their work provides insights into how the design space can be efficiently reduced to quickly find a solution. Nonetheless, their technique aims to identify optimal solutions for a given throughput rather than identifying Pareto points for every possible throughput attainable by the dataflow graph—this distinguishes their work from that of Stuijk et al. [28] as well as the solution we are proposing in this article.
Although our work assumes that buffers have independent memory allocations, it is possible to optimize memory allocations by sharing memory between buffers. Desnos et al. [13] propose a method to minimize memory allocations of DSP applications specified by SDFGs by identifying buffers that can be merged to reduce the overall memory required by the application. Their technique requires additional inputs to the SDF model to enable shared memory between buffers. Nonetheless, it highlights that there are alternate approaches toward memory optimization of SDFs.
Finally, it should be noted that faster strategies are available when approximate solutions are sufficient. An alternate DSE algorithm is proposed by Stuijk et al. [28] to provide an approximate Pareto set when analyzing complex graphs.
It is also possible to use throughput-based buffer sizing techniques [8, 10, 29] to perform a DSE. We define a throughput-based buffer sizing method as one that optimizes buffer sizes for a given throughput constraint; the work of Hendriks et al. [18] is one example. To perform a throughput-buffering trade-off exploration, we can iterate over different throughput constraints and optimize buffer sizes for each, which directly produces Pareto points. One caveat of this approach is that when multiple storage distributions reach identical throughput, only one of them is found. In contrast, we perform our exploration by fixing buffer sizes and identifying the maximal reachable throughput for each configuration, which makes the exploration exhaustive.
Bamakhrama and Stefanov [3] showed that it is possible to apply hard-real-time scheduling techniques to construct periodic schedules that achieve maximum throughput for applications modeled as CSDFGs. Nonetheless, their work applies only to a specific class of acyclic CSDFGs, which they define as matched input/output rates graphs.
More generally, approximate and throughput-based buffer sizing approaches usually provide a non-optimal solution. Furthermore, these techniques do not account for multiple minimal storage distributions for a single throughput.
1.2 Self-Timed Scheduling and Storage Dependencies Based DSE
Existing approaches to compute the Pareto set of an SDFG or CSDFG, as proposed by Stuijk et al. [28], involve a throughput-buffering trade-off exploration that calculates the throughput of a graph using a self-timed schedule—this is constructed by executing actors in the graph as soon as possible (i.e., once the amount of data necessary for the actor to be executed is available) until a periodic execution pattern is detected. Throughput can then be calculated as the average number of actor executions per time unit.
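To make the self-timed scheduling idea concrete, the following is a minimal, hypothetical discrete-time simulation for the SDF case only (single-phase actors, integer durations of at least 1, no auto-concurrency). The graph encoding and the tail-of-run throughput estimate are our own simplifications for illustration, not the detection of a periodic execution pattern used by Stuijk et al. [28].

```python
def self_timed_throughput(buffers, durations, ref, firings_wanted=200):
    """Sketch of self-timed (ASAP) execution of an SDFG.
    `buffers` maps (src, dst) -> [tokens, prod_rate, cons_rate]; an actor
    fires as soon as every input buffer holds enough tokens, consuming at
    the start of the firing and producing `durations[actor]` units later."""
    actors = {a for key in buffers for a in key}
    busy_until = {a: 0 for a in actors}
    pending = []          # (completion_time, buffer_key, amount)
    fire_times = []       # start times of the reference actor's firings
    t = 0
    while len(fire_times) < firings_wanted and t < 100_000:
        # deliver tokens from firings completing now
        for _, key, amt in [p for p in pending if p[0] == t]:
            buffers[key][0] += amt
        pending = [p for p in pending if p[0] != t]
        # ASAP policy: keep firing until no actor is ready at time t
        progress = True
        while progress:
            progress = False
            for a in actors:
                if busy_until[a] > t:
                    continue
                ins = [k for k in buffers if k[1] == a]
                if all(buffers[k][0] >= buffers[k][2] for k in ins):
                    for k in ins:
                        buffers[k][0] -= buffers[k][2]
                    for k in buffers:
                        if k[0] == a:
                            pending.append((t + durations[a], k, buffers[k][1]))
                    busy_until[a] = t + durations[a]
                    if a == ref:
                        fire_times.append(t)
                    progress = True
        t += 1
    # estimate throughput from the spacing of firings in the steady tail
    n0, n1 = len(fire_times) // 2, len(fire_times) - 1
    if n1 <= n0 or fire_times[n1] == fire_times[n0]:
        return 0.0
    return (n1 - n0) / (fire_times[n1] - fire_times[n0])
```

For a two-actor chain where A produces 1 token per firing and B consumes 2, both with unit durations, B settles into firing every other time unit, giving a throughput of 0.5 firings per time unit.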
Identifying the memory allocations to evaluate over the course of the exploration, however, is based on storage dependencies. Storage dependencies are derived from buffers that prevent an actor from firing due to an insufficient memory allocation.
There are limitations to this approach: computing throughput with a self-timed schedule has, in the worst case, exponential complexity [28]. Furthermore, increasing the memory allocations of buffers that have storage dependencies does not necessarily increase a graph’s throughput, so using storage dependencies to determine new exploration points can lead to an unnecessarily large design space. As noted by Stuijk et al. [28], the computation duration of the resulting DSE algorithm can thus be too long to be of practical use. Approximate methods such as increasing the coarseness of the exploration [28] and 1-periodic scheduling [8] can return overestimated results—moreover, in the former case, the approximated algorithm retains exponential worst-case complexity. To address these shortcomings, we propose a new throughput-buffering trade-off exploration based on K-periodic scheduling [9] that returns an optimal Pareto set.
1.3 Our Proposed Solution: K-Periodic Scheduling and Critical Cycles
The process of K-periodic scheduling identifies cycles of actor executions that define the throughput of the graph—we refer to such a cycle as a critical cycle. The buffers connecting actors in the critical cycle are therefore those constraining the throughput of the graph itself. In contrast to storage dependencies, which identify buffers that constrain the firing of individual actors, critical cycles give us a more efficient exploration point selection process. By using critical cycles to determine which buffer sizes to consider in the exploration process, we are able to reduce the design space considered in our DSE compared to the current state of the art. With these changes, we benefit from the reduced computation times of K-periodic scheduling [9] while significantly reducing the search space of the DSE in comparison to Stuijk et al. [28]. The result is a throughput-buffering trade-off exploration that completes earlier than the DSE algorithm implemented by Stuijk et al. [28] while returning the complete Pareto set of throughput-buffering trade-offs.
1.4 Contributions
The contributions of this work are as follows:
(1) We present a throughput-buffering trade-off exploration technique with an exploration point selection process that results in a reduced design space compared to the current state of the art. We combine this improvement with K-periodic throughput evaluation and achieve reduced computation times.
(2) Consequently, we are able to present a comparative study between this new methodology and several existing methods, with results on the Pareto fronts of some benchmarks that, to our knowledge, have not been successfully computed before.
(3) Furthermore, through our DSE, we were also able to identify new Pareto points in some benchmarks that were not identified by the current state-of-the-art DSE. We discuss the discrepancy and provide a correction to the existing implementation.
It is important to note that we assume no resource constraints: the exploration is performed on a CSDFG for which auto-concurrency is allowed and buffer allocations are independent.
This article is organized as follows. Section 2 introduces the models, concepts, and the syntax we will be using. In Section 3, we present our proposed approach. We explain how new exploration points are identified using K-periodic scheduling and how this ensures that we are able to identify all optimal throughput-buffering pairs of a graph. A comparative study in performance is presented in Section 4, with an additional comparison of the sizes of the design spaces between our approach and that of Stuijk et al. [28]. Finally, we present our conclusion in Section 5.
2 SYNTAX AND MODEL DEFINITIONS
In this section, we introduce the CSDF model and related key concepts used throughout this article. In particular, we define the notions of throughput and buffer sizes of CSDFGs.
2.1 Cyclo-Static Dataflow Graphs
CSDFGs [7] are directed graphs where each node represents an actor and each edge represents a buffer between two actors. A CSDFG is denoted by \( \mathcal {G} = (\mathcal {A}, \mathcal {B}), \) where \( \mathcal {A} \) is the set of nodes and \( \mathcal {B} \) is the set of edges. CSDFGs can be either simple graphs (no more than one directed edge from one node to another) or multi-graphs (more than one directed edge can be connected from one node to another). Furthermore, even though we will generally limit our study to weakly connected graphs, graphs made of non-connected components can be handled component by component.
Actors. An actor consumes data from incoming buffers and produces data onto outgoing buffers. It represents a function within the larger application modeled by the graph \( \mathcal {G} \), where the consumption and production of data model the inputs and outputs of its computations. Each actor \( a \in \mathcal {A} \) consists of \( \varphi (a) \in \mathbb {N} - \lbrace 0\rbrace \) phases of execution. These phases denote cyclically varying amounts of data consumed and produced by the actor upon each execution. For every \( k \in \lbrace 1,\ldots ,\varphi (a)\rbrace \), phase k of an actor a’s execution is denoted by \( a_k \). Each actor’s execution phase \( a_k \) has a constant duration \( d(a_k) \in \mathbb {R} \). The n-th execution of phase k of actor a’s execution is denoted by \( \langle {a_k, n}\rangle \), where \( n \in \mathbb {N} - \lbrace 0\rbrace \). Finally, \( Pr\langle {a_k, n}\rangle \) denotes the last execution of a before \( \langle {a_k, n}\rangle \)—that is, \( Pr\langle {a_k, n}\rangle = \langle {a_{k-1}, n}\rangle \) if \( k \gt 1 \), and \( Pr\langle {a_k, n}\rangle = \langle {a_{\varphi (a)}, n - 1}\rangle \) otherwise. By definition, \( Pr\langle {a_k, n}\rangle \) can only occur before or at the same time as \( \langle {a_k, n}\rangle \). Note that, to simplify our definition of \( Pr\langle {a_k, n}\rangle \), we assume the existence of a fictitious execution \( \langle {a_{\varphi (a)}, 0}\rangle \).
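The predecessor relation \( Pr\langle {a_k, n}\rangle \) reduces to a small case distinction, sketched here with executions represented as hypothetical (phase, iteration) pairs:

```python
def pr(phase_count, k, n):
    """Predecessor execution Pr<a_k, n>: the last execution of actor a
    before <a_k, n>, for an actor with `phase_count` phases."""
    if k > 1:
        return (k - 1, n)            # previous phase, same iteration
    return (phase_count, n - 1)      # last phase of the previous iteration

# For an actor with 3 phases:
pr(3, 2, 5)  # -> (1, 5)
pr(3, 1, 5)  # -> (3, 4)
pr(3, 1, 1)  # -> (3, 0), the fictitious execution <a_phi(a), 0>
```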
SDFGs are a special case of CSDFGs where each actor \( a \in \mathcal {A} \) has only one phase such that \( \forall a \in \mathcal {A} \), \( \varphi (a) = 1 \).
Buffers. A buffer stores data that is produced and consumed by the actors connected to each end of it. It models data dependencies between actors as well as the means by which data is passed between different actors. \( b = (a, a^{\prime }) \in \mathcal {B} \) thus refers to a buffer b between two actors a and \( a^{\prime } \), into which a produces data that \( a^{\prime } \) requires to execute. The number of tokens of data (“tokens” in short) produced onto b at the end of the execution of \( a_k \) and the number of tokens consumed from b at the start of the execution of \( a^{\prime }_k \) are denoted by \( in_b(k) \) and \( out_b(k), \) respectively. For the sake of clarity, we use \( I_b\langle {a_k, n}\rangle \) to denote the total amount of data produced in the buffer b over time from the first execution of a until after the execution of \( \langle {a_k, n}\rangle \), and \( O_b\langle {{a^{\prime }}_{k^{\prime }}, {n^{\prime }}}\rangle \) to denote the total amount of data consumed from the buffer b from the first execution of \( a^{\prime } \) until the start of the execution of \( \langle {{a^{\prime }}_{k^{\prime }}, {n^{\prime }}}\rangle \). We note that \( \begin{equation*} I_{b}\langle {a_k, n}\rangle = I_{b}Pr\langle {a_k, n}\rangle + in_{b}(k) \end{equation*} \) and \( \begin{equation*} O_{b}\langle {a^{\prime }_{k^{\prime }}, n^{\prime }}\rangle = O_{b}Pr\langle {a^{\prime }_{k^{\prime }}, n^{\prime }}\rangle + out_{b}(k^{\prime }). \end{equation*} \)
The initial number of tokens in buffer b is denoted by \( M_0(b) \in \mathbb {N} \). The number of tokens in a buffer must remain non-negative—that is, \( \langle {a^{\prime }_{k^{\prime }}, n^{\prime }}\rangle \) can only be executed after \( \langle {a_{k}, n}\rangle \) if \( M_0(b) + I_b\langle {a_k, n}\rangle - O_b\langle {a^{\prime }_{k^{\prime }}, n^{\prime }}\rangle \ge 0 \)—this expresses a precedence relation between the input and output actors of a buffer.
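The cumulative quantities \( I_b \) and \( O_b \) and the precedence relation above can be computed directly from the per-phase rates. The sketch below uses hypothetical rate lists (rates[0] holds the phase-1 rate); the example rates extend the phase-1 values of Figure 2(a).

```python
def cumulative(rates, k, n):
    """Tokens moved through phases 1..k of iteration n, plus all earlier
    full iterations, for a per-phase rate list `rates`."""
    return (n - 1) * sum(rates) + sum(rates[:k])

def can_follow(m0, in_rates, out_rates, prod, cons):
    """Precedence relation of the text: <a'_k', n'> may only execute after
    <a_k, n> if M0(b) + I_b<a_k, n> - O_b<a'_k', n'> >= 0.
    `prod` = (k, n) of the producer, `cons` = (k', n') of the consumer."""
    i_b = cumulative(in_rates, *prod)
    o_b = cumulative(out_rates, *cons)
    return m0 + i_b - o_b >= 0

# Hypothetical rates: A produces (3, 1); B consumes (1, 1, 1); M0(b) = 0.
can_follow(0, [3, 1], [1, 1, 1], (1, 1), (1, 1))  # -> True
# Against the fictitious execution <a_phi(a), 0>, i.e., before A ever fires:
can_follow(0, [3, 1], [1, 1, 1], (2, 0), (1, 1))  # -> False
```

Note that `cumulative(rates, phase_count, 0)` evaluates to zero, which is exactly why the fictitious execution \( \langle {a_{\varphi (a)}, 0}\rangle \) makes the definition uniform.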
Figure 2(a) illustrates a simple example of a CSDFG. We see that it consists of two actors, A and B, where \( \varphi (A) = 2 \) and \( \varphi (B) = 3 \). There is one edge connecting the actors, \( b = (A, B) \), with \( M_0(b) = 0 \). In phase \( k = 1 \), \( in_b(k) = 3 \) and \( out_b(k) = 1 \).
Fig. 2. Example of self-loops to limit auto-concurrency and feedback buffer use to model bounded buffer capacity.
Finally, while we will often refer to an abstract notion of “tokens” of data when discussing the data transmitted through buffers, it is important to note that these tokens can occupy different amounts of physical memory depending on the data type they represent—for example, a token representing data of type double would be twice the size of a token representing data of type float. Nonetheless, this does not impact the methods that we describe in this article, as we can simply assign weights to these buffers based on the data type they transmit; multiplying the token quantity by this weight would allow us to account for differences in sizes between data types.
Auto-Concurrency. The execution of multiple phases of the same task can overlap; this is known as auto-concurrency or reentrancy [25]. This is an important feature for expressing parallel executions of a single task. However, it is not always safe, and to avoid this behavior, it is possible to add self-loop buffers to any task that requires it. An example of such self-loops can be seen in Figure 2(b).
2.2 Bounded Buffer Capacity
On their own, buffers are unbounded: they do not specify a capacity limiting the number of tokens they can store. This is not a realistic assumption when modeling real-life applications running on embedded systems, where these capacities need to be specified. We therefore adopt the method described in the work of Stuijk et al. [28], where a feedback buffer is added to each buffer in the original graph to model bounded buffer capacities, including blocking-write behavior on full buffers. Figure 2 illustrates how a feedback buffer can be added to a CSDFG to model bounded buffer capacities. Looking at Figure 2(c), the feedback buffer of \( b = (A, B) \) would thus be \( b_f = (B, A) \), with \( in_{b_f}(k^{\prime }) = out_b(k^{\prime }) \) and \( out_{b_f}(k) = in_b(k) \) for all \( k \in \lbrace 1,\ldots ,\varphi (A)\rbrace \) and \( {k^{\prime }} \in \lbrace 1,\ldots ,\varphi (B)\rbrace \). This is denoted by the blue dotted edge from B to A. \( M_0(b) \) continues to denote the initial number of tokens in b, while the value of \( M_0(b_f) \) is varied to alter the modeled buffer size. The storage capacity of b is thus equal to the sum of \( M_0(b) \) and \( M_0(b_f) \).
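The feedback-buffer construction can be sketched as a small graph transformation. The dictionary encoding `(src, dst) -> (M0, in_rates, out_rates)` and the example rates are our own hypothetical representation, assuming a simple graph (one buffer per actor pair):

```python
def add_feedback_buffers(buffers, capacities):
    """For each buffer b = (a, a'), add a feedback buffer b_f = (a', a)
    with swapped rate lists (in_bf = out_b, out_bf = in_b) and initial
    marking M0(b_f) = capacity(b) - M0(b), so that the modeled storage
    capacity of b is M0(b) + M0(b_f)."""
    bounded = dict(buffers)
    for (a, a2), (m0, in_rates, out_rates) in buffers.items():
        cap = capacities[(a, a2)]
        bounded[(a2, a)] = (cap - m0, out_rates, in_rates)
    return bounded

# Hypothetical buffer b = (A, B) with M0(b) = 0 and a capacity of 5:
graph = {("A", "B"): (0, [3, 1], [1, 1, 1])}
add_feedback_buffers(graph, {("A", "B"): 5})
# -> adds ("B", "A"): (5, [1, 1, 1], [3, 1])
```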
The terms buffer capacity and buffer size will be used interchangeably to denote the maximum number of tokens that a buffer can store as modeled using this method.
Storage distribution [28]. A storage distribution maps a specified buffer capacity \( \delta (b) \) for each buffer \( b \in \mathcal {B} \) of a CSDFG \( \mathcal {G} = (\mathcal {A}, \mathcal {B}) \).
Distribution size [28]. Distribution size refers to the sum of all buffer capacities given by a particular storage distribution. We denote the distribution size of a storage distribution \( \delta \) as \( |\delta | \). A storage distribution refers to a particular configuration of buffer capacities, whereas a distribution size refers to the resulting overall memory allocated. A graph can therefore have several unique storage distributions that share the same distribution size.
CSDFG with bounded buffers. Given a CSDFG \( \mathcal {G} = (\mathcal {A}, \mathcal {B}) \) and a particular storage distribution \( \delta \), we denote by \( \mathcal {B}_f \) the set of feedback buffers such that, for each \( b \in \mathcal {B} \), there exists \( b_f \in \mathcal {B}_f \) with \( \delta (b) = M_0(b) + M_0(b_f) \). We thus define an equivalent CSDFG with modeled bounded buffers as \( \mathcal {G}_{\delta } = (\mathcal {A}, \mathcal {B}_{\delta }), \) where \( \mathcal {B}_{\delta } = \mathcal {B} \cup \mathcal {B}_f \).
2.3 Consistency
Consistency is a structural property of a CSDFG implying that, when we disregard its initial marking, it is possible to construct a periodic sequence of actor firings for this graph where we can guarantee that tokens will not infinitely accumulate or diminish in its buffers; this is an especially important property in the context of ensuring bounded buffer capacities.
Methods to test that SDFGs are consistent have been demonstrated in the work of Lee and Messerschmitt [20] and extended to CSDFGs in the work of Bilsen et al. [7]. Both cases rely on the construction of a topology matrix that expresses the net number of tokens produced on each buffer by actor firings. A graph is classified as consistent if there exists a vector that, when multiplied with its topology matrix, yields the zero vector. This vector, known as a repetition vector, defines the number of times each actor in the graph needs to complete a full iteration of its phases of execution to ensure that, for each buffer, the number of tokens after these executions equals the number of tokens before them. The existence of a repetition vector thus proves the given graph is consistent. The repetition vector is denoted by q, with \( q_a \) denoting the repetition factor of an actor \( a \in \mathcal {A} \).
The CSDFG in Figure 2(c), for example, has a repetition vector of \( q = [3, 4] \). After all phases of A are executed \( q_A = 3 \) times each and B phases are executed \( q_B = 4 \) times each, the number of tokens in b is equal to 0, which is the same number of tokens as the initial state, \( M_0(b) \), before any executions of A and B. More formally, for any buffer \( b = (a,a^{\prime }) \), \( \begin{equation*} q_{a} \times i_b = q_{a^{\prime }} \times o_b, \end{equation*} \) where \( i_b = \sum _{k=1}^{\varphi (a)} in_b(k) \) and \( o_b = \sum _{k=1}^{\varphi (a^{\prime })} out_b(k) \).
Expansion factor. The expansion factor of a graph is defined as the sum, over all actors, of each element of its repetition vector multiplied by the number of execution phases of the corresponding actor, \( \sum _{a\in \mathcal {A}} q_a \times \varphi (a) \). It is thus the cumulative number of task executions required for a dataflow graph to return each buffer to the same number of tokens as before the executions started. The complexity of most evaluation methods is directly related to the repetition vector of a graph. As such, the expansion factor is a metric commonly used to estimate the difficulty of an instance. Unfortunately, this value is also known to be potentially exponential in the size of the graph.
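For connected consistent graphs, the repetition vector and expansion factor can be sketched by propagating the balance equations \( q_{a} \times i_b = q_{a^{\prime }} \times o_b \) from an arbitrary root actor. The encoding and the rates \( i_b = 4 \), \( o_b = 3 \) below are hypothetical values chosen to be consistent with the repetition vector \( q = [3, 4] \) of Figure 2(c); the sketch does not verify consistency on graphs with cycles.

```python
from fractions import Fraction
from math import gcd, lcm

def repetition_vector(buffers, phases):
    """Compute the repetition vector q of a connected, consistent CSDFG by
    propagating rate ratios from a root actor, then scaling to the smallest
    positive integer solution. `buffers` maps (a, a') -> (i_b, o_b), the
    total tokens produced/consumed per full iteration of each actor."""
    adj = {}
    for (a, a2), (i_b, o_b) in buffers.items():
        adj.setdefault(a, []).append((a2, Fraction(i_b, o_b)))
        adj.setdefault(a2, []).append((a, Fraction(o_b, i_b)))
    root = next(iter(adj))
    q = {root: Fraction(1)}
    stack = [root]
    while stack:  # spanning-tree traversal fixing each q_a relative to the root
        a = stack.pop()
        for a2, ratio in adj[a]:
            if a2 not in q:
                q[a2] = q[a] * ratio   # q_a' = q_a * i_b / o_b
                stack.append(a2)
    scale = lcm(*(f.denominator for f in q.values()))
    q_int = {a: int(f * scale) for a, f in q.items()}
    g = gcd(*q_int.values())
    q_int = {a: v // g for a, v in q_int.items()}
    # expansion factor: cumulative task executions per graph iteration
    expansion = sum(q_int[a] * phases[a] for a in q_int)
    return q_int, expansion

# Hypothetical rates for b = (A, B): i_b = 4, o_b = 3; phi(A) = 2, phi(B) = 3.
repetition_vector({("A", "B"): (4, 3)}, {"A": 2, "B": 3})
# -> ({'A': 3, 'B': 4}, 18)
```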
2.4 Scheduling and Liveness
A schedule refers to an ordered sequence of actor executions. We denote a schedule, S, as a function that associates any actor a’s nth execution of phase k, denoted by \( \langle {a_k, n}\rangle \), with a starting execution time \( S\langle {a_k,n}\rangle \in \mathbb {R} \). Given a CSDFG \( \mathcal {G} = (\mathcal {A}, \mathcal {B}) \), a schedule S is valid if actors are only scheduled for execution when a sufficient number of tokens is available in the corresponding input buffers, such that the number of tokens in every buffer \( b \in \mathcal {B} \) remains non-negative after the scheduled actor executions. Note that, as explained in Section 2.2, buffers are unbounded, as bounded buffer capacities are modeled with feedback buffers. Consequently, schedule validity considers only the availability of tokens on input buffers, as the modeled buffer capacities of output buffers are represented by the availability of tokens on their respective feedback buffers.
To admit a valid schedule, a graph has to be live. A CSDFG is live if all its actors can be fired infinitely often. Thus, when a graph is live, a schedule can be constructed where all actors fire infinitely often. Liveness can be tested either by performing a symbolic execution of the graph [15] or by transforming the given CSDFG into a special case of an SDFG where all actors produce and consume one token of data when executed—this is known as a homogeneous SDFG. The corresponding CSDFG is live if every circuit in this homogeneous SDFG has at least one token [7].
In this work, we will consider only CSDFGs that are consistent and live—graphs where a schedule can be constructed such that all actors can be executed infinitely often with bounded buffers [25].
2.5 Throughput
The throughput of an actor \( a \in \mathcal {A} \) for a particular schedule S is defined as \( \begin{equation*} \displaystyle Th_S(a) = \lim _{n\rightarrow \infty } \frac{n}{{S}\langle {a_1, n}\rangle }. \end{equation*} \) Note that in the case of auto-concurrent tasks or null task durations, it is possible for the throughput of the actor to be infinite, in which case this limit will not exist.
Additionally, assuming strongly connected graphs, we can define the throughput of a graph \( \mathcal {G} \) as \( \begin{equation*} \displaystyle Th_S(\mathcal {G}) = \frac{Th_S(a)}{q_a}, \forall a \in \mathcal {A}. \end{equation*} \)
Conversely, we denote the period of a graph as \( \begin{equation*} \displaystyle \Omega _S(\mathcal {G}) = \frac{1}{Th_S(\mathcal {G})}. \end{equation*} \)
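As a small numeric illustration of these definitions, consider a hypothetical strictly periodic (1-periodic) schedule \( S\langle {a_1, n}\rangle = 2 + 5(n - 1) \), i.e., offset 2 and period 5. The limit defining the actor's throughput then tends to 1/5:

```python
def actor_throughput(schedule, n):
    """Finite-n approximation of Th_S(a) = lim n / S<a_1, n>."""
    return n / schedule(n)

# Hypothetical 1-periodic schedule: S<a_1, n> = 2 + 5 * (n - 1).
sched = lambda n: 2 + 5 * (n - 1)
rate = actor_throughput(sched, 10**6)   # approaches 1/5 = 0.2
# With a repetition factor q_a = 2, the graph throughput Th(a)/q_a
# would be 0.1, and the graph period 1/0.1 = 10 time units.
```

The initial offset contributes nothing in the limit, which is why throughput depends only on the long-run firing rate.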
Although the maximal throughput evaluation of a CSDFG is an important task to perform, it is also a difficult problem—no consensus exists on the most effective method of calculation [12]. One possible technique to compute the maximum throughput of a CSDFG, as explained in Section 1.2, would be to perform a self-timed execution of the graph until it reaches a steady state (i.e., a previously known state of the graph) [16]. Nonetheless, as stated before, the complexity of this method is known to have an exponential worst-case complexity [28].
Alternative approaches consider more restrictive scheduling representations known as periodic or K-periodic schedules where each actor has a limited set of initial starting times which are then periodically repeated. Strictly periodic solutions, where only one start time is defined per actor, trade a reduction in complexity for computing an approximate, rather than optimal, throughput [4, 22]. However, K-periodic schedules, which allow for more than one starting time per actor, have been shown to reach optimal throughput [9, 12, 20] while simultaneously benefiting from a similar reduction in complexity over self-timed scheduling methods. Determining the exact complexity of verifying the liveness (i.e., whether a graph admits a valid schedule) of a weighted Petri net—the equivalent of an SDFG—remains an open question; the problem is known to be EXPSPACE-hard, as shown in other works [14, 19]. This complexity analysis therefore naturally applies to the problem we are trying to solve, as it consists of scheduling SDFGs and CSDFGs. Any existing algorithm that verifies the liveness of an SDFG/CSDFG or computes its maximal throughput is, in the worst case, exponential in the size of the graph. This includes the K-periodic method we rely on [9]. Despite the time and space reduction demonstrated in our experiments, the worst-case complexity of our technique remains, similarly to previous methods, exponential in the size of the graph.
Using the terminology introduced in this section, we can characterize the goal of throughput-buffering trade-off explorations, which forms the focus of this article: to efficiently compute the set of all Pareto optimal pairs of storage distributions and throughputs for a given SDFG or CSDFG.
3 METHODOLOGY
In this section, we explain the methodology of our throughput-buffering exploration. An overview of the DSE algorithm, introduced in the work of Stuijk et al. [28], is presented in Section 3.1 and is expressed in pseudo-code in Algorithm 1. Our algorithm differs from existing methods, such as the one implemented in SDF3 [28], in two aspects: (1) we use K-periodic rather than self-timed scheduling to compute throughput and (2) new storage distributions to explore are identified using the critical cycle that we derive from the K-periodic throughput evaluation method, rather than by identifying channels with storage dependencies. These two points form the focus of Sections 3.2 and 3.4. We also provide a formal proof of the completeness of our approach in Section 3.4.

3.1 Overview of the KDSE Algorithm
As seen in Algorithm 1, the KDSE algorithm consists of an initialization phase (lines 1–6), followed by an execution phase (lines 7–18).
3.1.1 Initialization Phase (Lines 1–6).
Maximal/Target throughput (line 1). First, the schedule of the given CSDFG \( \mathcal {G} \) is computed. It is assumed that all buffers in \( \mathcal {G} \) are unbounded.2 This therefore gives us the maximum throughput attainable by \( \mathcal {G} \). The maximum throughput provides us with the stopping point of the exploration. Note that it is also possible to explicitly set a target throughput for the DSE algorithm. The specified target would have to be less than the maximal throughput attainable, and would result in skipping this step in the initialization phase. This is particularly helpful with graphs that could admit an infinite throughput such as those with auto-concurrent tasks.
Initial storage distribution (lines 2–5). Then, the lower bound of buffer sizes to avoid deadlock locally [1], as well as the step sizes of each buffer, are calculated. SDF3 extended the method presented in the work of Ade et al. [1] from SDFGs to CSDFGs. Although other methods for computing these bounds exist, to maintain a fair comparison with existing approaches by ensuring the same starting point of the DSE, we use the exact same method implemented in SDF3 [28] in our experiments. For each buffer \( b = (a, a^{\prime }) \in \mathcal {B} \), the lower bound \( lb(b) \) is computed as \( \begin{align*} lb(b) &= \min _{i=0}^{\operatorname{lcm}(\varphi (a), \varphi (a^{\prime }))}(P_i + Q_i - \gcd (P_i, Q_i))\\ \text{where } P_i &= in_b(i \bmod \varphi (a))\\ Q_i &= out_b(i \bmod \varphi (a^{\prime })). \end{align*} \)
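As a concrete illustration, the lower-bound formula above reduces to a few lines of Python. The sketch below mirrors the formula only (function and variable names are ours); it is not the SDF3 implementation.

```python
from math import gcd, lcm

def lower_bound(in_rates, out_rates):
    """Deadlock-avoidance lower bound lb(b) for a buffer b = (a, a').

    in_rates[j] is the production rate in_b(j) of phase j of producer a;
    out_rates[j] is the consumption rate out_b(j) of consumer a'.
    """
    n = lcm(len(in_rates), len(out_rates))
    candidates = []
    # P_i + Q_i - gcd(P_i, Q_i) for i = 0 .. lcm(phi(a), phi(a'))
    for i in range(n + 1):
        p = in_rates[i % len(in_rates)]
        q = out_rates[i % len(out_rates)]
        candidates.append(p + q - gcd(p, q))
    return min(candidates)
```

For a plain SDF buffer with single-phase rates 3 and 2, this yields 3 + 2 - gcd(3, 2) = 4 tokens.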
The lower bounds of buffer sizes provide us with an initial non-zero storage distribution on which to begin our exploration, as seen in line 4. Identifying this initial storage distribution thus presents an initial pruning of the design space by removing all smaller storage distributions from consideration.
The step size of a buffer is then defined by computing the \( \gcd \) of all rates of token production and consumption of the given buffer, thus defining the smallest granularity of data in each buffer [28]. Formally, the step size of a buffer \( b = (a,a^{\prime }) \in {\mathcal {B}} \) is \( step_b = \gcd (\lbrace in_b(k) : k \in \lbrace 1, \ldots ,\varphi (a)\rbrace \rbrace \cup \lbrace out_b(k^{\prime }) : k^{\prime } \in \lbrace 1, \ldots ,\varphi (a^{\prime })\rbrace \rbrace) \).
The step size determines the amount by which the buffer size of a given buffer is incremented during subsequent iterations of the DSE—this is reflected in line 14. As explained in the work of Stuijk et al. [28], the step size and initial storage distribution account for all possible numbers of tokens that may appear in the buffer by using the \( \gcd \) of the buffer’s rates of production and consumption.
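The step-size computation is a single gcd fold over all phase rates of the buffer; a sketch under our own naming:

```python
from functools import reduce
from math import gcd

def step_size(in_rates, out_rates):
    """step_b: gcd of every per-phase production and consumption rate of b."""
    return reduce(gcd, list(in_rates) + list(out_rates))
```

A buffer produced with cyclic rates (4, 6) and consumed with rate 2 thus grows and shrinks in multiples of step_b = 2.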
The final steps of the initialization phase consist of initializing a list of storage distributions to be evaluated (lines 4 and 5); this is a sorted list whose elements are stored in ascending order, functioning as a “checklist” for the DSE. An empty set that will hold the resulting throughput-storage distribution pairs computed over the course of the DSE (“result”) is also initialized.
3.1.2 Execution Phase (Lines 7–18).
The execution phase of the DSE consists of repeatedly evaluating the next storage distribution stored in the checklist by modeling bounded buffer capacities accordingly before computing the graph’s resulting throughput.
Bounded buffer modeling (line 9). Evaluating a storage distribution begins by modeling \( \mathcal {G} \) with the bounded buffer sizes denoted by the given storage distribution using the method described in Section 2.2—the resulting graph is denoted as \( \mathcal {G}_{\delta } \).
Throughput evaluation of current storage distribution (line 10). A schedule is constructed for \( \mathcal {G}_{\delta } \). This provides us with the maximum throughput of \( \mathcal {G}_{\delta } \). The set of buffers that could be constraining the throughput of \( \mathcal {G}_{\delta } \) is also determined; the method through which these buffers are identified differs across approaches. Toward this goal, the K-periodic scheduling method we propose for throughput evaluation simultaneously provides us with a critical cycle—denoted by the variable CC seen on line 10. The given storage distribution and the computed throughput are stored as a pair in the set of results—this is reflected in line 11.
In the case where no valid schedule can be constructed for \( \mathcal {G}_{\delta } \), the throughput will be taken as 0, as this would mean that the graph deadlocks. Nonetheless, a set of buffers whose buffer sizes could be causing the deadlock will still need to be computed. In our case, the K-periodic scheduling method provides a critical cycle. In addition, for practical reasons, in the case where no valid schedule can be found, the storage distribution where all buffer sizes are 0 will be added to the set of results instead. This is because the minimal storage distribution for any CSDFG to have 0 throughput is one whose distribution size is 0.
Expanding the design space (lines 12–16). Given the critical cycle, CC, identified in the throughput evaluation process, new storage distributions—where the buffer size of each buffer associated with an edge in the critical cycle is increased by its corresponding step size—are added to the checklist, thus expanding the search space. Note that the checklist is stored in ascending order, and thus the storage distribution with the lowest distribution size will be checked first at each iteration. Also note that there is an implicit check made to prevent any duplicate storage distributions from being added to the list.
Minimization (line 17). Finally, all storage distributions whose throughputs are equal to, or lower than, those of storage distributions with a smaller distribution size are removed from the set of results. This leaves us with the set of Pareto optimal pairs of throughputs and storage distributions up to that point in the DSE. This step is executed at every iteration rather than at the end of the entire process for performance reasons: it prevents the result set from growing as large as the entire search space.
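The minimization step amounts to a one-dimensional Pareto filter over (distribution size, throughput) pairs. The sketch below uses our own naming; the actual KDSE and SDF3 implementations differ in bookkeeping.

```python
def minimize(results):
    """Keep only Pareto optimal (size, throughput) pairs: a pair survives
    iff every smaller (or equal-sized) distribution has strictly lower
    throughput."""
    pareto = []
    # ascending size; for equal sizes, consider higher throughput first
    for size, thr in sorted(results, key=lambda p: (p[0], -p[1])):
        if all(thr > kept_thr for _, kept_thr in pareto):
            pareto.append((size, thr))
    return pareto
```

For instance, (5, 1.0) is dropped when (3, 1.0) is already known: a larger distribution with no throughput gain is never Pareto optimal.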
Stopping condition (line 18). This process is repeated until the target throughput is attained by an evaluated storage distribution and there are no more storage distributions of equal or smaller distribution size left to evaluate in the checklist, leaving us with a set containing all Pareto optimal pairs of throughputs and storage distributions of \( \mathcal {G} \).
The method of throughput computation and the method used to decide on which storage distributions to consider are key aspects of the DSE algorithm that are repeated until the stopping condition is met.
As described earlier, to calculate the maximal throughput of the CSDFG, our proposed algorithm uses a K-periodic scheduling method [9]. Crucially, computing an optimal schedule using K-periodic scheduling consists of identifying the critical cycle of the graph. The buffers associated with this critical cycle are used to identify the additional storage distributions that make up our design space. This allows us to compute the throughput of the CSDFG and identify the buffers needed to expand the design space in a single step. In the next two sections, we provide more details on the K-periodic throughput evaluation technique and the properties behind critical cycles.
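Putting these pieces together, the control flow of Algorithm 1 can be sketched as follows. The `evaluate` callback stands in for the K-periodic throughput evaluation of Section 3.2 and is assumed to return a (throughput, critical cycle) pair; all identifiers are illustrative, not the authors' C++ implementation.

```python
import heapq

def kdse(buffers, lb, step, evaluate, target_thr):
    """High-level sketch of the KDSE loop.

    buffers: buffer names; lb/step: lower bound and step size per buffer;
    evaluate(dist) -> (throughput, critical_cycle) where critical_cycle is
    a set of buffer names limiting the throughput.
    """
    delta0 = {b: lb[b] for b in buffers}            # initial storage distribution
    checklist = [(sum(delta0.values()), tuple(sorted(delta0.items())))]
    seen = {checklist[0][1]}
    results = []
    while checklist:
        size, items = heapq.heappop(checklist)       # smallest distribution first
        dist = dict(items)
        thr, cc = evaluate(dist)
        results.append((size, thr))
        if thr >= target_thr:
            # drain remaining equal-sized candidates, then stop
            if not checklist or checklist[0][0] > size:
                break
            continue
        for b in cc:                                 # expand along the critical cycle
            new = dict(dist)
            new[b] += step[b]
            key = tuple(sorted(new.items()))
            if key not in seen:                      # no duplicate distributions
                seen.add(key)
                heapq.heappush(checklist, (sum(new.values()), key))
    # final minimization: keep only Pareto optimal pairs
    pareto = []
    for s, t in sorted(results, key=lambda p: (p[0], -p[1])):
        if all(t > pt for _, pt in pareto):
            pareto.append((s, t))
    return pareto
```

The priority queue realizes the ascending-order checklist, and the `seen` set realizes the implicit duplicate check mentioned above.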
3.2 Throughput Evaluation
The throughput of a CSDFG can be calculated by generating a self-timed schedule, where actors are executed as soon as possible; however, it has an exponential worst-case complexity. An alternate method by Bodin et al. [9] shows that considering only K-periodic schedules can lead to a substantial reduction of throughput computation times for large graphs while remaining optimal.
Definition 3.1 (K-periodic schedule).
A K-periodic schedule S for a CSDFG \( \mathcal {G} = (\mathcal {A}, \mathcal {B}) \) is defined by a period \( \Omega _S(\mathcal {G}) \) and a periodicity vector K. In such a schedule, for each actor \( a \in \mathcal {A} \), for any integers \( k \in \lbrace 1, \ldots , \varphi (a)\rbrace \) and \( p \in \lbrace 1, \ldots , K_a\rbrace \) any execution start time \( {S}\langle {a_k, p}\rangle \) is fixed.
Subsequent start times \( S\langle {a_k, n}\rangle \) are then defined by \( \begin{equation*} S\langle {a_k, n}\rangle = {S}\langle {a_k, p}\rangle + x\mu _a^S, \end{equation*} \) where \( \mu _a^S = \frac{\Omega _S(\mathcal {G})}{q_a} \) and \( n = x \times K_a + p \), \( x \in \mathbb {N} \).
A valid K-periodic schedule is a schedule that follows both the definition of a K-periodic schedule and of a valid schedule (as defined in Section 2.4). Figure 3(b), for example, shows a valid K-periodic schedule for the CSDFG from Figure 3(a). For each actor, A, B, C, and D, initial starting times are defined—these are denoted by the darker blue squares in the figure. Actor executions are then repeated following a particular period. For example, actor A has two initial starting times that occur at times 0 and 3, and a period of 6. The timing of its n-th sequence of executions can thus be computed by \( n \times 6 + 0 \) and \( n \times 6 + 3 \).
Fig. 3. Example of K-periodic scheduling of a CSDFG.
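The start-time recurrence above is easy to mechanize. Below, a small sketch (our naming) that reproduces the start times of actor A in Figure 3(b), assuming, as in the example, initial start times 0 and 3 and \( \mu _a^S = 6 \):

```python
def start_time(initial_times, mu, n):
    """n-th start time (n >= 1) of one phase a_k of an actor under a
    K-periodic schedule: S<a_k, n> = S<a_k, p> + x * mu, with
    n = x * K_a + p and p in {1, ..., K_a}. initial_times[p-1] = S<a_k, p>."""
    x, p0 = divmod(n - 1, len(initial_times))  # p0 = p - 1
    return initial_times[p0] + x * mu

# Actor A from Figure 3(b): executions start at 0, 3, 6, 9, ...
```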
3.3 K-periodic Throughput Evaluation Implementation
Given any CSDFG \( \mathcal {G}=(\mathcal {A}, \mathcal {B}) \), the K-periodic throughput evaluation method defined by Bodin et al. [9] returns what is known as a partial expansion graph [31]. The max cost-to-time ratio of this partial expansion graph is equal to the minimal period of \( \mathcal {G} \)—the reciprocal of which is the maximal frequency of \( \mathcal {G} \). Although an in-depth explanation of the method is beyond the scope of this article, in this section we provide a brief explanation of its implementation. In particular, we define the partial expansion graph and the cost-to-time ratio [11].
3.3.1 Partial Expansion Graph.
Given a CSDFG \( \mathcal {G} = (\mathcal {A}, \mathcal {B}) \) and a periodicity vector K, its associated partial expansion graph is a bi-valued graph \( J_{\mathcal {G}} = (N, E) \) for which the set of nodes N corresponds to any of the \( K_a \times \varphi (a) \) first start times of any actor a defined by the K-periodic schedule and the edges \( (\langle {a_k, p}\rangle , \langle {{a^{\prime }}_{k^{\prime }}, p^{\prime }}\rangle) \in E \) model any existing precedence relation between two executions \( \langle {a_k, n}\rangle \) and \( \langle {{a^{\prime }}_{k^{\prime }}, n^{\prime }}\rangle \), where \( n = x \times K_a + p \) and \( n^{\prime } = x^{\prime } \times K_{a^{\prime }} + p^{\prime } \), \( x, x^{\prime } \in \mathbb {N} - \lbrace 0\rbrace \).
There is a precedence relation between two executions \( \langle {a_k, n}\rangle \) and \( \langle {a^{\prime }_{k^{\prime }}, n^{\prime }}\rangle \) if \( \langle {a^{\prime }_{k^{\prime }}, n^{\prime }}\rangle \) cannot start before the completion of \( \langle {a_k, n}\rangle \) because it consumes at least one token that \( \langle {a_k, n}\rangle \) produces; the precise characterization over the states of the buffer is given in the work of Bodin et al. [9].
More formally, we define the bi-valued directed graph \( J_{\mathcal {G}} = (N, E) \) as \( \begin{align*} N &= \lbrace \langle {a_k, p}\rangle , a \in \mathcal {A}, k \in \lbrace 1,\dots ,\varphi (a)\rbrace , p \in \lbrace 1,\dots ,K_a\rbrace \rbrace \\ E &= \lbrace (\langle {a_k, p}\rangle , \langle {{a^{\prime }}_{k^{\prime }}, p^{\prime }}\rangle), b = (a, a^{\prime }) \in \mathcal {B}, (\langle {a_k, p}\rangle , \langle {{a^{\prime }}_{k^{\prime }}, p^{\prime }}\rangle) \in \mathcal {Y}(b)\rbrace . \end{align*} \) N is the set of nodes and E is the set of edges, and \( \mathcal {Y}(b) \) is the set of tuples \( (\langle {a_k, p}\rangle , \langle {{a^{\prime }}_{k^{\prime }}, p^{\prime }}\rangle) \) such that there exists a precedence relation between these two actor executions.
In this graph, any edge \( e = (\langle {a_k, p}\rangle , \langle {{a^{\prime }}_{k^{\prime }}, p^{\prime }}\rangle) \in E \) represents a precedence relation induced by a buffer b—thereby accounting for the preceding actor’s execution duration as well as the characterization of the precedence constraint. It is thus bi-valued by (1) \( \begin{equation} (H(e), L(e)) = \left(\frac{-\left\lfloor {O_{b}\langle {{a^{\prime }}_{k^{\prime }},p^{\prime }}\rangle - I_b\langle {{a}_{k},p}\rangle - M_0(b) + in_{b}(k) - 1} \right\rfloor ^{gcd_{b}}}{i_{b} \times q_{a}}, d(a_k)\right), \end{equation} \) where \( \lfloor {x\rfloor ^{y}}=y \times \lfloor \frac{x}{y}\rfloor \) and \( \gcd _b = \gcd (i_b,o_b) \).
\( L(e) \) refers to the duration of the execution of \( a_k \). \( H(e) \), however, expresses the normalized token delay such that if there exists some cycle in \( J_{\mathcal {G}} \) in which the sum of \( H(e) \) is less than or equal to 0, the graph would deadlock.
For example, let us consider \( \mathcal {G}=(\mathcal {A}, \mathcal {B}) \), the CSDFG from Figure 3(a), and a simple unitary periodicity vector K with \( K_a = 1 \forall a \in \mathcal {A} \). The resulting partial expansion graph is depicted in Figure 4. In this figure, we can see there is one node per phase of each task, as well as 17 edges modeling precedence constraints. Weights are computed following Equation (1)—for example, the weights for \( e = (\langle {D_1,1}\rangle , \langle {C_1,1}\rangle) \) between \( \langle {D_1,1}\rangle \) and \( \langle {C_1,1}\rangle \) are \( H(e) = \frac{-\left\lfloor {6 - 36 - 6 + 36 - 1} \right\rfloor ^{6}}{36 \times 1} = \frac{1}{6} \) and \( L(e) = d(D_1) = 1 \). Conversely, there is no edge between \( \langle {A_1,1}\rangle \) and \( \langle {B_2,1}\rangle \), meaning there is no strict precedence constraint between the two.
Fig. 4. Partial expansion graph of CSDFG in Figure 3(a) given a periodicity vector K where \( K_a = 1 \forall a \in \mathcal {A} \) .
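The weight computation of Equation (1) can be checked numerically. The sketch below (our variable names) uses exact rationals and reproduces the worked example for the edge between \( \langle {D_1,1}\rangle \) and \( \langle {C_1,1}\rangle \):

```python
from fractions import Fraction

def edge_weights(O, I, M0, in_k, gcd_b, i_b, q_a, d_ak):
    """(H(e), L(e)) for a precedence edge, following Equation (1);
    the rounding operator is floor_y(x) = y * floor(x / y)."""
    x = O - I - M0 + in_k - 1
    floored = gcd_b * (x // gcd_b)  # Python // floors toward -inf, as required
    return Fraction(-floored, i_b * q_a), d_ak

H, L = edge_weights(O=6, I=36, M0=6, in_k=36, gcd_b=6, i_b=36, q_a=1, d_ak=1)
# H == 1/6 and L == 1, matching the worked example above
```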
3.3.2 Cost-to-Time Ratio.
As shown in the work of Bodin et al. [9], the optimization of the minimal period of a K-periodic schedule for the original CSDFG is equivalent to solving the MCRP [11] for its partial expansion graph.
The cost-to-time ratio of a cycle in a graph is defined as follows:
Definition 3.2 (Cost-to-time ratio).
Given a bi-valued directed graph \( J_{\mathcal {G}} = (N, E), \) the cost-to-time ratio of a cycle \( c(e_1, e_2, \dots , e_p) \in \mathcal {C}(J_{\mathcal {G}}), e \in E \) is given by (2) \( \begin{equation} R(c) = \frac{\sum _{i = 1}^{p}L(e_i)}{\sum _{i = 1}^{p}H(e_i)}. \end{equation} \)
The max cost-to-time ratio is then given by the cycle \( c^{\star } \) that maximizes \( R(c^{\star }) = \max \nolimits _{c\in \mathcal {C}(J_{\mathcal {G}})}R(c) \).
On the partial expansion graph pictured in Figure 4, we can see the cycle \( c^{\star } \) composed of the bold edges going through the tasks \( \langle {A_1,1}\rangle \), \( \langle {D_1,1}\rangle \), and \( \langle {C_1,1}\rangle \). This is the critical cycle for this K-periodic schedule. This cycle maximizes the cost-to-time ratio \( R (c^{\star }) = \frac{1 + 1 + 1}{\frac{1}{6} + \frac{1}{3} + \frac{-1}{3}} = 18, \) whereas, for example, the cycle c going through \( \langle {A_1,1}\rangle \), \( \langle {B_1,1}\rangle \), and \( \langle {C_1,1}\rangle \) has a cost-to-time ratio \( R (c) = \frac{ 1 + 1 + 1}{0 + \frac{-1}{12} + \frac{1}{3}} = 12 \). Importantly, the combination of buffers and tasks modeled by the nodes and edges of the critical cycle in the partial expansion graph is the limiting factor of the schedule throughput. To reach a higher throughput, we must increase the periodicity factor of one of these tasks.
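The ratio computation of Equation (2) for the two cycles just discussed can be verified with exact arithmetic (a sketch; an edge is encoded here simply as an (H, L) pair):

```python
from fractions import Fraction

def ratio(cycle):
    """Cost-to-time ratio R(c) = sum of L(e) over sum of H(e) for a cycle."""
    return Fraction(sum(l for _, l in cycle), 1) / sum(h for h, _ in cycle)

# The two cycles discussed above, with unit execution times L(e) = 1:
c_star = [(Fraction(1, 6), 1), (Fraction(1, 3), 1), (Fraction(-1, 3), 1)]
c = [(Fraction(0), 1), (Fraction(-1, 12), 1), (Fraction(1, 3), 1)]
# ratio(c_star) == 18 and ratio(c) == 12, so c_star is the critical cycle
```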
3.3.3 K-Periodic Throughput Evaluation Algorithm.
The algorithm iteratively constructs partial expansion graphs for increasing periodicity vectors K until the obtained schedule reaches the maximal throughput. When it reaches the maximal throughput for a graph \( \mathcal {G} \), the partial expansion graph returned is the last one constructed. The max cost-to-time ratio of this returned graph is the minimum period \( \Omega _\mathcal {G} \) of the CSDFG \( \mathcal {G} \).
3.4 The KDSE Algorithm
As explained in Section 3.3, by solving the MCRP for a partial expansion graph, we determine the minimal period—and thus the maximum throughput—of a CSDFG. An important finding is that the cycle that maximizes the cost-to-time ratio can be associated with a list of buffers that limit the throughput of the CSDFG; we call this cycle the critical cycle.
Definition 3.3 (Critical cycle).
Given a bi-valued directed graph \( J_{\mathcal {G}} = (N, E) \) associated with a CSDFG \( \mathcal {G} = (\mathcal {A}, \mathcal {B}) \) and a periodicity vector K where the set of cycles are denoted \( \mathcal {C}(J_{\mathcal {G}}) \), there exists a cycle of maximum cost-to-time ratio \( c^{\star }(e_1, e_2, \dots , e_p) \in \mathcal {C}(J_{\mathcal {G}}) \) such that \( R(c^{\star }) = \max \nolimits _{c\in \mathcal {C}(J_{\mathcal {G}})}R(c) \). We define the critical cycle associated with a K-periodic schedule for a graph \( \mathcal {G} \) as the list of all buffers from \( \mathcal {G} \) associated with any of the edges from \( c^{\star }(e_1, e_2, \dots , e_p) \in \mathcal {C}(J_{\mathcal {G}}) \).
For the sake of clarity, we denote a buffer \( b \in \mathcal {B} \) from \( \mathcal {G} \) being associated with the critical cycle \( c^{\star } \) by \( b \in c^{\star } \). Note that although there can be multiple critical cycles (whose cost-to-time ratios are equal), our implementation in KDSE considers only a single critical cycle at each iteration. If, in subsequent iterations of the DSE, a cycle of equal cost-to-time ratio exists, it will be the next critical cycle identified.
In the following, we prove that it is sufficient to consider only buffers from critical cycles to fully explore the optimal space for throughput-buffering trade-off exploration.
Lemma 3.4. Consider a CSDFG \( \mathcal {G} = (\mathcal {A}, \mathcal {B}) \), two storage distributions \( \delta \) and \( \delta ^{\prime } \), and a buffer \( b^+ \in \mathcal {B} \) such that \( \begin{align*} \delta ^{\prime }(b^+) &\gt \delta (b^+) \text{ and}\\ \delta ^{\prime }(b) &= \delta (b) \text{ for all other } b \in \mathcal {B}. \end{align*} \) Given that \( Th(\mathcal {G}_{\delta }) \) denotes the maximal throughput of \( \mathcal {G}_{\delta } \), the CSDFG with modeled buffer sizes defined by \( \delta \), and \( c^{\star } \) denotes the critical cycle of \( \mathcal {G}_{\delta } \), if \( b_f^+ \in c^{\star } \) then \( Th(\mathcal {G}_{\delta ^{\prime }}) \ge Th(\mathcal {G}_{\delta }) \); otherwise, \( Th(\mathcal {G}_{\delta ^{\prime }}) = Th(\mathcal {G}_{\delta }) \).
Proof. Given a CSDFG \( \mathcal {G} = (\mathcal {A}, \mathcal {B}) \) and a storage distribution \( \delta \), let us define another storage distribution \( \delta ^{\prime } \) by increasing the buffer size of a buffer \( b^+ \in \mathcal {B} \) such that \( \begin{align*} \delta ^{\prime }(b^+) &\gt \delta (b^+) \text{ and}\\ \delta ^{\prime }(b) &= \delta (b) \text{ for all other} b \in \mathcal {B}. \end{align*} \)
Considering the CSDFG with modeled buffer sizes \( \mathcal {G}_{\delta } \), the buffer size of any \( b \in \mathcal {B} \) is modeled by \( \delta (b) = M_0(b) + M_0(b_f) \), where \( M_0(b) \) denotes the initial number of tokens in b. Increasing the buffer size of \( b^+ \) therefore involves increasing the value of \( M_0(b^+_f) \).
To compute the K-periodic schedule, we produce a partial expansion graph \( J_{\mathcal {G}_\delta } = (N, E) \). Note that we use \( J_{\mathcal {G}_\delta } \) here rather than \( J_{\mathcal {G}} \) to denote the partial expansion graph produced by the CSDFG G with modeled bounded buffers.
Given the definition of a partial expansion graph \( J_{\mathcal {G}_\delta } = (N, E) \) – specifically, the definition of \( H(e) \) for any edge \( e \in E \) from Equation (1) – increasing \( M_0(b^+_f) \) would mean that the value \( H(e) \) of associated edges would decrease.
However, from Equation (2), the values of all \( R(c) \) where \( b^+_f \not\in c \) would remain the same.
Recall from Definition 3.2 that the maximum throughput of \( \mathcal {G} \) given storage distribution \( \delta \), \( Th(\mathcal {G}_\delta) \), is equal to the reciprocal of \( R(c^{\star }) \). Therefore, if \( b^+_f \in c^{\star } \), then increasing \( M_0(b^+_f) \) might reduce \( R(c^{\star }) \). However, if \( b^+_f \not\in c^{\star } \), then the value of \( R(c^{\star }) \) will not change, as reducing the value of any \( R(c) \) that is not the max cost-to-time ratio will never change the cycle that defines \( R(c^{\star }) \). Therefore, we establish that, given the storage distributions \( \delta \) and \( \delta ^{\prime } \), \( \begin{align*} Th(\mathcal {G}_{\delta ^{\prime }}) &\ge Th(\mathcal {G}_{\delta }) \text{ if $b^+_f \in c^{\star }$; otherwise,}\\ Th(\mathcal {G}_{\delta ^{\prime }}) &= Th(\mathcal {G}_{\delta }). \end{align*} \)□
Lemma 3.4 states that the throughput of a given CSDFG can only be increased by increasing the buffer sizes of buffers associated with the critical cycle. In Theorem 3.6, we use this property to prove that Algorithm 1 will return the set of all Pareto optimal pairs of storage distributions to throughput.
Definition 3.5 (Reachability between storage distributions).
Given a CSDFG \( \mathcal {G} = (\mathcal {A}, \mathcal {B}) \) and storage distributions \( \delta \) and \( \delta ^{\prime } \), we say that \( \delta ^{\prime } \) is reachable by \( \delta \) if and only if \( \begin{equation*} \forall b \in \mathcal {B}, \delta ^{\prime }(b) \ge \delta (b). \end{equation*} \) We denote this relation as \( \delta ^{\prime } \ge \delta \).
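Reachability is a simple pointwise comparison of buffer sizes; as a sketch (our naming):

```python
def reachable(delta, delta_prime):
    """True iff delta_prime >= delta, i.e., delta_prime can be reached from
    delta by only ever enlarging buffer sizes."""
    return all(delta_prime[b] >= delta[b] for b in delta)
```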
In Algorithm 1, we see that there is a checklist set of storage distributions that have yet to be explored in the DSE as well as a result set, which consists of pairs of storage distributions and the maximum throughput achievable given the storage distribution. These sets define the storage distributions that will be next explored, and the storage distributions already explored in the DSE, respectively—for the sake of brevity in the coming proof, let us denote these sets as NE and AE.
Theorem 3.6. For a given CSDFG, Algorithm 1 returns precisely the set of all Pareto optimal pairs of storage distributions and throughputs.
Proof. To prove that Algorithm 1 returns the set of all Pareto optimal pairs of storage distributions and throughputs, we first establish the following invariant: given a CSDFG \( \mathcal {G} = (\mathcal {A}, \mathcal {B}) \) and the set PO containing all Pareto optimal storage distributions for \( \mathcal {G} \), at any point in Algorithm 1, if \( \exists \delta \in PO \) where \( \delta \not\in AE \), then \( \exists \delta ^{\prime } \in NE \) where \( \delta \ge \delta ^{\prime } \). We then need to show that Algorithm 1 terminates at its stopping condition.
Base case. In the first iteration of Algorithm 1, \( AE = \lbrace \rbrace \) and \( NE = \lbrace \delta _{min}\rbrace \), where \( \delta _{min} \) refers to the initial storage distribution identified in the initialization phase of the DSE algorithm (lines 1–4). The maximum throughput achievable by \( \delta _{min} \), \( Th(\delta _{min}) \) is computed and is moved from NE to AE (lines 9–10). At this point, there are two possible directions depending on the value of \( Th(\delta _{min}) \).
If \( Th(\delta _{min}) = 0 \), which means that no valid schedule was found for \( \delta _{min} \), it is replaced by a storage distribution \( \delta _0 \) where \( \forall b \in \mathcal {B}, \delta _0(b) = 0 \). Then \( AE = \lbrace \delta _0\rbrace \), and \( \forall \delta \in PO, \delta \ge \delta _0 \).
Alternatively, when \( Th(\delta _{min}) \gt 0 \), \( \mathcal {G}_{\delta _{min}} \) does not deadlock and \( AE = \lbrace \delta _{min}\rbrace \). Since \( \delta _{min} \) is the smallest possible storage distribution that avoids deadlock locally, all subsequent \( \delta \in PO \) are reachable from \( \delta _{min} \) by definition.
Induction step. Let us assume that at the end of the n-th iteration of the repeat loop of Algorithm 1 (lines 6–17), it is true that if \( \exists \delta \in PO \) where \( \delta \not\in AE \), then \( \exists \delta ^{\prime } \in NE \) where \( \delta \ge \delta ^{\prime } \). At the beginning of the \( (n + 1) \)-th iteration, the storage distribution with the lowest distribution size in NE, \( \delta _{checked} \), will be removed from NE (line 7) and the maximal \( Th(\mathcal {G}_{\delta _{checked}}) \) will be computed (lines 8–9). For each \( b \in c^{\star } \), a new storage distribution \( \delta _{new} \) is added to NE, defined as \( \begin{equation*} \delta _{new}(b) = \delta _{checked}(b) + stepSz[b] \text{ (lines 12--14),} \end{equation*} \) where \( stepSz[b] \) denotes the step size of buffer b and all other buffer sizes are left unchanged. As \( \forall b \in \mathcal {B}, stepSz[b] \gt 0 \), every \( \delta _{new} \) is strictly larger than \( \delta _{checked} \)—that is, \( \delta _{new} \gt \delta _{checked} \). At this point, if there existed some \( \delta \in PO, \delta \not\in AE \) that was not reachable by any \( \delta _{new} \in NE \), it would mean that \( Th(\mathcal {G}_\delta) \gt Th(\mathcal {G}_{\delta _{new}}) \), whereby for some \( b^+ \in \mathcal {B} \) with \( b_f^+ \not\in c^{\star } \), \( \begin{align*} \delta (b^+) &\gt \delta _{new}(b^+) \text{ and}\\ \delta (b) &= \delta _{new}(b) \text{ for all other $b \in \mathcal {B}$}. \end{align*} \)
Conclusion. As shown in Lemma 3.4, this is not possible; therefore, any \( \delta \in PO, \delta \not\in AE \) that was reachable by \( \delta _{checked} \) will continue to be reachable by some \( \delta _{new} \in NE \). Thus, by induction, we have shown that at any point in Algorithm 1, if \( \exists \delta \in PO \) where \( \delta \not\in AE \), then \( \exists \delta ^{\prime } \in NE \) where \( \delta \ge \delta ^{\prime } \).
Termination of the algorithm. Finally, assume that at some iteration a storage distribution \( \delta _{max} \) is added to AE, whereby \( Th(\mathcal {G}_{\delta _{max}}) \) is found to be the maximum throughput achievable by \( \mathcal {G} \). As shown before, every \( \delta _{new} \in NE \) added after removing \( \delta _{max} \) will be strictly larger than \( \delta _{max} \), while \( Th(\mathcal {G}_{\delta _{new}}) = Th(\mathcal {G}_{\delta _{max}}) \), as the throughput cannot be increased any further. Therefore, none of the storage distributions \( \delta _{new} \in NE \) added after \( \delta _{max} \) can be Pareto optimal. Algorithm 1 will thus terminate once all \( \delta \in NE \) with a distribution size smaller than or equal to that of \( \delta _{max} \) have been checked and added to AE (line 17). Therefore, at the end of Algorithm 1, we have the set of precisely all Pareto optimal pairs of storage distributions and throughputs.□
Having shown the theoretical methodologies of our K-periodic-based DSE, in Section 4 we cover experiments that compare the performance of our DSE algorithm to the existing DSE algorithm from Stuijk et al. [28].
4 EXPERIMENTS
To evaluate the performance of the K-periodic-based DSE algorithm described in Section 3 against existing DSE algorithms, we performed throughput-buffering trade-off analyses on two existing benchmark suites3 from SDF3 [27] and K-Iter [9]. We compared the results from the K-periodic-based DSE algorithm (Algorithm 1; called KDSE for short) to the results from the DSE algorithm implemented in SDF3.
To supplement our comparison of the optimal DSEs, we also evaluated the performance of two different approximate buffer sizing algorithms (called 1DSE and KDSE-C/SDF3-C) on the same benchmark applications. Our experiments investigate the trade-off in accuracy for reduced computation times that is inherent in these approximate methods in comparison to optimal methods. 1DSE uses a linear programming solver to identify optimal storage distributions for a specified throughput based on a 1-periodic schedule [8]. A 1-periodic schedule is equivalent to a K-periodic schedule with a periodicity of 1 for every task (\( k_t = 1, \forall k_t \in K \)). KDSE-C and SDF3-C, however, simply increase the coarseness of the exploration by increasing the step sizes used when growing storage distribution buffer sizes. The results from these algorithms were also measured and considered in our DSE algorithm evaluation.
In Section 4.1, we provide details on the benchmarks used as well as the specifications of the hardware on which we ran experiments. Section 4.2 compares the computation timings and DSE results between KDSE and SDF3 from running on the benchmarks listed in Table 1. In Section 4.3, we discuss the results obtained by the approximate DSE algorithms and provide some suggestions as to how these algorithms could be utilized in future works for an improved DSE methodology. Finally, in Section 4.4, we detail the differences between the search paths explored by KDSE and SDF3 by comparing the configurations explored by the two DSEs on a sample CSDFG.
Table 1. Summary of Benchmark Applications
4.1 Setup
The algorithm is open source4 and implemented as a C++ application. We compared these results with the SDF3 algorithm (using the
4.2 Comparing DSE Algorithms
4.2.1 Comparing Performance.
The results of the experiments using each DSE algorithm can be seen in Tables 2 and 3. The total duration and the number of throughput evaluations performed to complete the throughput-buffering trade-off analysis for each benchmark are listed under the rows corresponding to each DSE technique. In instances in which the DSE algorithm was able to complete its execution, the number of Pareto optimal storage distributions identified and the distribution size of the largest Pareto optimal storage distribution are listed—this serves as a means of verifying the completeness of the DSE algorithm as well as a means to compare the Pareto spaces identified by an optimal solution to those identified by an approximate solution such as 1DSE. In instances where a DSE algorithm was unable to complete its execution, the total time is specified as “>72 h.” Although an incomplete DSE run cannot report how many Pareto points it has identified, we include the number of storage distributions checked as a point of comparison between KDSE/KDSE-C and SDF3/SDF3-C. The maximum throughput achieved by each algorithm for an application is highlighted in bold if it corresponds to the maximum throughput of the graph.
| Application | Method | Max Thr. (Hz) | Max Dist. Size | #SDs Checked | #Pareto Points | Total Time (sec) | Overest. (Avg. %) |
|---|---|---|---|---|---|---|---|
| Bipartite | 1DSE | 3.97E-03 | 40 | 104 | 10 | 0.2 | 13.2 |
| | KDSE-C | 3.97E-03 | 36 | 15 | 6 | 0.007 | 0.8 |
| | SDF3-C | 3.97E-03 | 36 | 17 | 6 | 0.002 | 0.8 |
| | KDSE | 3.97E-03 | 35 | 51 | 8 | 0.02 | 0.0 |
| | SDF3 | 3.97E-03 | 35 | 62 | 8 | 0.005 | 0.0 |
| Fig8 | 1DSE | 3.57E-02 | 53 | 36 | 3 | 0.06 | 10.1 |
| | KDSE-C | 3.57E-02 | 57 | 325 | 21 | 0.06 | 4.0 |
| | SDF3-C | 3.57E-02 | 57 | 1,275 | 15 | 0.06 | 4.0 |
| | KDSE | 3.57E-02 | 53 | 1,885 | 42 | 0.4 | 0.0 |
| | SDF3 | 3.57E-02 | 53 | 13,145 | 39 | 0.4 | 0.0 |
| H263 Decoder | 1DSE | 1.00E-04 | 8,081 | 6,871 | 3,046 | 11 | 25.8 |
| | KDSE-C | 1.00E-04 | 5,941 | 177,310 | 1,676 | 33,130 | 0.005 |
| | SDF3-C | 8.46E-05 | 7,093 | 45,334,366 | – | >72 h | – |
| | KDSE | 1.00E-04 | 5,941 | 707,455 | 5,654 | 135,703 | 0.0 |
| | SDF3 | 7.69E-05 | 6,505 | 35,334,870 | – | >72 h | – |
| Modem | 1DSE | 6.25E-02 | 40 | 26 | 2 | 0.07 | 1.0 |
| | KDSE-C | 6.25E-02 | 42 | 3 | 3 | 0.0004 | 1.6 |
| | SDF3-C | 6.25E-02 | 42 | 4 | 3 | 0.0006 | 1.6 |
| | KDSE | 6.25E-02 | 40 | 4 | 3 | 0.0003 | 0.0 |
| | SDF3 | 6.25E-02 | 40 | 5 | 3 | 0.0006 | 0.0 |
| Samplerate | 1DSE | 1.04E-03 | 36 | 57 | 5 | 0.1 | 2.4 |
| | KDSE-C | 1.04E-03 | 36 | 3 | 3 | 0.002 | 0.6 |
| | SDF3-C | 1.04E-03 | 36 | 14 | 3 | 0.01 | 0.6 |
| | KDSE | 1.04E-03 | 34 | 3 | 3 | 0.001 | 0.0 |
| | SDF3 | 1.04E-03 | 34 | 16 | 3 | 0.01 | 0.0 |
| Satellite | 1DSE | 9.47E-04 | 1,546 | 36 | 3 | 0.1 | 0.1 |
| | KDSE-C | 9.47E-04 | 1,546 | 3 | 2 | 0.001 | 0.03 |
| | SDF3-C | 9.47E-04 | 1,546 | 38 | 2 | 0.09 | 0.03 |
| | KDSE | 9.47E-04 | 1,544 | 3 | 2 | 0.001 | 0.0 |
| | SDF3 | 9.47E-04 | 1,544 | 38 | 2 | 0.08 | 0.0 |
“#SDs checked” is the total number of storage distributions checked, and “#Pareto points” is the total number of Pareto solutions. A dash “–” indicates that the DSE did not finish.
Table 2. Execution Time of Different DSE Strategies over the SDFGs from the SDF3 Benchmarks
| Application | Method | Max Thr. (Hz) | Max Dist. Size | #SDs Checked | #Pareto Points | Total Time (sec) | Overest. (Avg. %) |
|---|---|---|---|---|---|---|---|
| Black-Scholes | 1DSE | 2.38E-08 | 22,491 | 109 | 13 | 1 | 0.9 |
| | KDSE-C | 2.38E-08 | 28,743 | 24 | 12 | 0.01 | 6.6 |
| | SDF3-C | 2.38E-08 | 28,743 | 46,093 | 12 | 3,543 | 6.6 |
| | KDSE | 2.38E-08 | 22,490 | 24 | 12 | 0.01 | 0.0 |
| | SDF3 | 2.38E-08 | 22,490 | 192,013 | 12 | 10,534 | 0.0 |
| Echo | 1DSE | 1.96E-10 | 30,223 | 173 | 16 | 0.8 | 0.1 |
| | KDSE-C | 1.96E-10 | 30,230 | 7,262 | 14 | 59,740 | 7.7 |
| | SDF3-C | 1.96E-10 | 30,230 | 91,544 | – | >72 h | – |
| | KDSE | 1.96E-10 | 28,037 | 17,984 | 19 | 178,590 | 0.0 |
| | SDF3 | 1.96E-10 | 28,037 | 450,152 | – | >72 h | – |
| H264 Encoder | 1DSE | 4.20E-06 | 1,368,097 | 145 | 23 | 37 | – |
| | KDSE-C | – | – | 23,813 | – | >72 h | – |
| | SDF3-C | – | – | 14,377 | – | >72 h | – |
| | KDSE | – | – | 54,368 | – | >72 h | – |
| | SDF3 | – | – | 14,682 | – | >72 h | – |
| JPEG2000 | 1DSE | 4.11E-07 | 3,854,807 | 3,717 | 1,059 | 248 | – |
| | KDSE-C | – | – | 209,993 | – | >72 h | – |
| | SDF3-C | – | – | 23,314 | – | >72 h | – |
| | KDSE | – | – | 1,024,592 | – | >72 h | – |
| | SDF3 | – | – | 21,011 | – | >72 h | – |
| PDetect | 1DSE | 4.92E-07 | 5,407,410 | 1,267 | 205 | 11,644 | 3.6 |
| | KDSE-C | 4.92E-07 | 5,414,965 | 3,572 | 15 | 52 | 4.5 |
| | SDF3-C | 4.07E-07 | 4,620,315 | 1,053,524 | – | >72 h | – |
| | KDSE | 4.92E-07 | 4,686,640 | 4,772 | 23 | 77 | 0.0 |
| | SDF3 | 4.07E-07 | 4,289,255 | 1,016,425 | – | >72 h | – |
“#SDs checked” is the total number of storage distributions checked, and “#Pareto points” is the total number of Pareto solutions. A dash “–” indicates that the DSE did not finish.
Table 3. Execution Time of Different DSE Strategies over the SDFGs from the K-Iter Benchmarks
A notable result is that KDSE was able to compute the Pareto set of some benchmarks (Echo, PDetect, and H263 Decoder) for which SDF3 was unable to complete its exploration within 72 hours. Furthermore, across all comparable results, the DSE algorithm implemented in KDSE required fewer storage distribution explorations to complete its throughput-buffering trade-off analysis. For benchmarks that returned results from both DSEs (Bipartite, Fig8, Modem, Samplerate, Satellite, and Black-Scholes), we observed a median \( 84\% \) decrease in total computation duration, as well as a median \( 83\% \) decrease in the total number of computations. The difference in storage distribution explorations was particularly significant for Fig8 and Black-Scholes; it becomes even more pronounced when we consider the benchmark applications for which SDF3 was unable to complete after 72 hours. Although this did not always translate to shorter total computation times in benchmarks with smaller search spaces, it is indicative of how critical cycles provide a more efficient search heuristic for the DSE algorithm.
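The median search-space reduction can be recomputed directly from the “#SDs Checked” columns of Tables 2 and 3. The short Python sketch below does exactly that for the benchmarks where both KDSE and SDF3 finished (Echo is excluded here because SDF3 exceeded the 72-hour budget for it, so its count is not comparable):

```python
from statistics import median

# (KDSE, SDF3) storage distributions checked, from Tables 2 and 3,
# for benchmarks where both DSEs completed within 72 hours.
sds_checked = {
    "Bipartite":     (51, 62),
    "Fig8":          (1_885, 13_145),
    "Modem":         (4, 5),
    "Samplerate":    (3, 16),
    "Satellite":     (3, 38),
    "Black-Scholes": (24, 192_013),
}

# Percentage reduction in explored storage distributions per benchmark.
reductions = [100.0 * (1.0 - kdse / sdf3)
              for kdse, sdf3 in sds_checked.values()]

print(round(median(reductions)))  # 83
```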
Finally, the identified Pareto optimal pairs of throughput and storage distributions for the KDSE algorithm can be seen in Figure 6 by the red points with solid lines connecting adjacent Pareto optimal storage distributions. As expected, once completed, we found that both KDSE and SDF3 produced identical Pareto fronts—Fig8 presents an exception to this, however, with SDF3 identifying fewer Pareto optimal storage distributions than KDSE; this inconsistency will be covered in Section 4.2.3. The results from our experiments therefore show that, with KDSE, we were able to explore a reduced search space while continuing to produce optimal solutions for each benchmark application. In Section 4.2.2, we go into further detail on the search paths of KDSE in comparison to SDF3.
4.2.2 Comparing Design Spaces.
Figure 5 shows the computed throughput of explored storage distributions against the cumulative duration of the two DSEs for each benchmark where both techniques finish. These plots allow us to visualize the progress of the two DSEs as the number of storage distributions explored increases over the course of the experiments, giving us insight into the efficiency of each DSE algorithm's search path at identifying Pareto points.
Fig. 5. Comparison of DSE between KDSE (black) and SDF3 (red). Each point on a line marks the moment in time when a Pareto optimal point with that throughput is found. Only applications where both KDSE and SDF3 finished their DSEs are shown.
There are two possible reasons KDSE often had a shorter overall computation duration than SDF3. The first is the use of periodic-style scheduling instead of self-timed scheduling to compute throughput. As shown in the work of Bodin et al. [9], with a few exceptions, the use of periodic-style scheduling results in shorter computation times for throughput evaluation. Nonetheless, as discrepancies in computation times between periodic and self-timed schedules have already been extensively discussed elsewhere [8, 9], we focus here on the second cause: the difference in the size of the design spaces. For example, KDSE takes only 24 steps to explore the Black-Scholes application, whereas SDF3 requires 192,013 steps. The impact of the size of the design space is also notable in the cases where KDSE was able to complete its DSE but SDF3 was not. In these comparisons, the lower growth rate of KDSE's design space is a significant factor in explaining why it completed its DSE for some benchmarks that SDF3 could not.
It is important to note, however, that no DSE algorithm was able to return the Pareto set for all of the benchmarks tested. H264 and JPEG2000 are the two benchmark applications for which neither KDSE nor SDF3 was able to return a complete Pareto set after 72 hours. That said, KDSE was able to complete its search for all benchmark applications that SDF3 was able to, and it additionally returned the Pareto sets of H263 Decoder, PDetect, and Echo.
The results from our experiments have thus shown that critical cycles provide a more accurate heuristic to determine which storage distributions to explore in the DSE than the one used in SDF3. This leads to a more efficient pruning of the design space and is one of the main factors contributing to KDSE’s reduced computation timings for the majority of the benchmark applications.
4.2.3 Comparing Pareto Frontiers.
The Pareto frontiers produced by the two algorithms should be identical, as both rely on exact methods, and this is generally the case in our experiments, with one exception: we noted an inconsistency in the Pareto set computed for the benchmark application Fig8. SDF3 identified 39 storage distributions in its Pareto set, whereas KDSE identified three additional storage distributions with a distribution size equal to the maximum distribution size (53). By manually modeling Fig8 with the buffer sizes specified by each of these three storage distributions (e.g., with buffer sizes “7,5,8,6,3,3,5,8,8”), we verified that they are all optimal solutions, each achieving the maximum throughput of Fig8. We found that the cause of this discrepancy is an implementation error in SDF3, specifically in the simple cycle detection algorithm used in its storage dependency identification. In SDF3, storage dependencies define which buffer sizes to increment in further iterations of the DSE; the error therefore led to some buffers being incorrectly left out as storage dependencies. With a corrected version of the algorithm, we would expect the results for SDF3 to differ: namely, its exploration space and execution time would increase for some applications.
4.3 Comparing Approximate Methods
As mentioned in Section 1.1, there exist methods that return approximated DSE results; these methods trade off accuracy for speed. In our experiments, we consider two types of approximate DSE algorithms: (1) 1-periodic, denoted as 1DSE, as described in the introduction to Section 4, and (2) “coarse” versions of KDSE and SDF3, respectively denoted as KDSE-C and SDF3-C, where the step sizes computed in the DSE (as described in Section 3.1) are multiplied by a fixed factor. For our experiments, we chose a value of 2 for this fixed factor; it can be increased to raise the “coarseness” of the approximation. The resulting Pareto points identified by the three approximate DSE methods are plotted in Figure 6. The results of KDSE are included as a means of visualizing the distance of the approximate results from an optimal solution.
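To make the coarse variants concrete, the following sketch shows how multiplying each step by a coarseness factor thins out the lattice of candidate storage distributions. The helper `next_buffer_sizes` is hypothetical, and the unit step sizes are placeholders; in the actual algorithms the steps come from the computation described in Section 3.1:

```python
def next_buffer_sizes(sizes, step_sizes, coarseness=1):
    """Illustrative sketch (not the KDSE/SDF3 implementation): for each
    buffer flagged for growth, produce a successor storage distribution
    whose size is incremented by step * coarseness."""
    successors = []
    for i, step in enumerate(step_sizes):
        if step == 0:  # buffer not selected for growth in this iteration
            continue
        grown = list(sizes)
        grown[i] += step * coarseness
        successors.append(grown)
    return successors

# With coarseness=2 every step is doubled, so the DSE skips every other
# point along each growth direction, halving the exploration density.
print(next_buffer_sizes([3, 6, 4], [1, 1, 0], coarseness=2))
# [[5, 6, 4], [3, 8, 4]]
```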
Fig. 6. Comparison of Pareto points identified by KDSE (red), KDSE coarse (purple), SDF3 coarse (black), and 1DSE (green).
4.3.1 Comparing Performance.
As shown in Tables 2 and 3, 1DSE was able to complete its DSE for all benchmarks, identifying storage distributions that allow the benchmark application to achieve its maximum throughput. In particular, it identified solutions for H264 and JPEG2000, the two benchmark applications for which neither KDSE nor SDF3 could identify solutions after 72 hours. As expected, however, the results show that 1DSE tends to overestimate the buffer sizes required to achieve a particular throughput. This is reflected in Figure 6, where the green lines, representing the Pareto front generated from the Pareto optimal storage distributions identified by 1DSE, are consistently to the right of the Pareto fronts generated from the results of KDSE (red). Furthermore, 1DSE identifies several Pareto points with large periods that deviate significantly from the optimal solutions; the plotting areas in Figure 6 exclude these points to keep the relevant Pareto fronts visible. It can also be seen from the results that there are discrepancies in the number of Pareto optimal storage distributions identified by 1DSE in comparison to those found by KDSE and SDF3.
As expected, KDSE-C and SDF3-C produced solutions with identical Pareto fronts. This can be seen in Figure 6 from the overlapping purple and black lines (representing the Pareto fronts of KDSE-C and SDF3-C, respectively). Just as in the optimal versions, “coarse” KDSE consistently completed its searches with fewer storage distributions explored than “coarse” SDF3. Although some optimal Pareto points were found using this method, we still observed a deviation from the Pareto front produced by the optimal solutions.
It can also be seen from Table 3 that, in instances where both KDSE-C and KDSE were unable to complete their respective DSEs after 72 hours (H264 Encoder and JPEG2000), KDSE-C consistently performed fewer storage distribution explorations than its optimal counterpart. This is an unusual finding, as one would expect KDSE-C to perform approximately as many storage distribution explorations given that it executes its DSE for an equal amount of time as KDSE.
Although further experiments are required to verify this hypothesis, a possible explanation for this discrepancy is that the larger storage distributions explored by KDSE-C also pose more complex problems for the K-Iter algorithm, resulting in longer computation times for each throughput evaluation. For example, Figure 7, which shows the execution time of the throughput evaluation per storage distribution explored for JPEG2000, indicates that the median execution time per storage distribution is longer for KDSE-C than for KDSE.
Fig. 7. Histogram of execution time of the throughput evaluation per storage distribution explored for JPEG2000.
In contrast, as visible on the left side of Figure 7, the median execution times of the throughput evaluation for SDF3 and SDF3-C are approximately equal and extremely short compared to KDSE and KDSE-C. Interestingly, despite these shorter execution times, both SDF3 and SDF3-C perform far fewer explorations than KDSE and KDSE-C for JPEG2000. This leads us to another possible explanation: due to the increasing size of the design space, we suspect that there is a significant overhead in the SDF3/SDF3-C algorithms while populating the checklist with additional storage distributions and minimizing the set of results. Overheads from similar sources might also disproportionately affect the overall computation times for KDSE and KDSE-C, leading to discrepancies in the number of explorations performed in the same amount of time. Again, further experiments would be required to verify these claims.
4.3.2 Overestimation of Approximate Methods.
Figure 8 plots the percentage overestimation of distribution sizes against the percentage of maximum throughput for two different benchmarks. This was computed by comparing the distribution sizes along the Pareto set of KDSE to those of the approximate methods for the various achievable throughputs. For each throughput, the ratio of the distribution sizes along the two Pareto fronts was used to compute the percentage overestimation of distribution size incurred by the approximate method. We noted a maximum of around 20% to 25% overestimation among the benchmark applications tested, as well as a rather large variance in the amount of overestimation among the approximate methods, as seen in Figure 8. From these plots, we can compute the average percentage overestimation of distribution size. For Fig8 (Figure 8(a)), it is \( 10.1\% \) for 1DSE and \( 4.0\% \) for KDSE-C/SDF3-C. For Black-Scholes (Figure 8(b)), it is \( 0.9\% \) for 1DSE and \( 6.6\% \) for KDSE-C/SDF3-C. Tables 2 and 3 list the average overestimated distribution size of each DSE method under the columns labeled “Overest. (Avg. %).” As these numbers show, although the approximate DSE methods reduced overall computation times for certain benchmarks, they trade off accuracy, as evidenced by the variability and magnitude of the overestimation in the resulting Pareto points.
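A minimal sketch of this overestimation metric, assuming each Pareto front is given as a mapping from throughput to distribution size and using toy values rather than the benchmark data:

```python
def avg_overestimation(exact_front, approx_front):
    """Average percentage by which an approximate Pareto front
    overestimates the distribution size at each throughput level.
    Fronts are {throughput: distribution_size} dicts; only throughput
    levels present in both fronts are compared. Illustrative sketch."""
    ratios = [
        100.0 * (approx_front[thr] / exact_front[thr] - 1.0)
        for thr in exact_front
        if thr in approx_front
    ]
    return sum(ratios) / len(ratios)

# Toy fronts (not taken from the benchmarks): the approximate method
# needs 10% and 20% more storage at the two shared throughput levels.
exact = {0.5: 10, 1.0: 20}
approx = {0.5: 11, 1.0: 24}
print(round(avg_overestimation(exact, approx), 2))  # 15.0
```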
Fig. 8. Comparing optimality of the Pareto front of approximate methods compared to KDSE.
Thus, for benchmark applications where KDSE took a long time to complete (H263 Decoder, Echo), 1DSE completed its DSE within a much shorter time frame; however, it should only be employed in instances where identifying an optimal solution is not critical. Alternatively, one possible direction for future studies could be to use 1DSE as a pre-processing step, in which the Pareto optimal storage distributions it identifies serve as a starting point for a DSE algorithm that then identifies the optimal solutions.
4.4 Detailed Example
As a concrete example of the methodologies, let us consider the CSDFG shown in Figure 3(a). By performing the initialization step of the algorithm (the way it is performed by SDF3), we find a minimal storage distribution \( [3,6,4,25,36] \) respectively corresponding to the buffers (A, B), (B, C), (C, A), (A, D), (D, C). We note the buffer size selected between A and B is already insufficient given the production and consumption rates of A and B, respectively. Nonetheless, to maintain a fair comparison and to focus on the DSE exploration, we will leave this initialization method as is.
From there, the SDF3 approach looks for dependency cycles during execution and identifies the cycle containing actors A, B, and C first. This implies the three next configurations to explore will be \( [{\bf 4},6,4,25,36] \), \( [3,{\bf 7},4,25,36] \), and \( [3,6,{\bf 5},25,36] \). However, KDSE identifies the insufficient buffer size between A and B as more critical (considering the loop between the buffer and its corresponding feedback buffer) and does not consider the configurations \( [3,7,4,25,36] \) and \( [3,6,5,25,36] \).
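The SDF3-style expansion step in this example can be sketched as follows; unit increments are used for brevity (the actual step sizes depend on the token rates of the edges), and `successors_on_cycle` is an illustrative helper, not the SDF3 implementation:

```python
def successors_on_cycle(distribution, cycle_buffers):
    """Enumerate the next storage distributions explored by an
    SDF3-style DSE step: increment, one at a time, each buffer that
    lies on the detected dependency cycle."""
    successors = []
    for b in cycle_buffers:
        nxt = list(distribution)
        nxt[b] += 1
        successors.append(nxt)
    return successors

# Buffers, in order: (A,B), (B,C), (C,A), (A,D), (D,C).
# SDF3 grows all three buffers on the A-B-C cycle ...
print(successors_on_cycle([3, 6, 4, 25, 36], [0, 1, 2]))
# [[4, 6, 4, 25, 36], [3, 7, 4, 25, 36], [3, 6, 5, 25, 36]]
# ... whereas KDSE's critical-cycle heuristic keeps only the (A,B) buffer.
print(successors_on_cycle([3, 6, 4, 25, 36], [0]))
# [[4, 6, 4, 25, 36]]
```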
Figure 9(c) shows this DSE in detail as performed by KDSE and SDF3. The gray region contains all configurations explored by KDSE, whereas the additional configurations explored by SDF3 appear outside of it. At the top of the graph, we can see the initial configuration followed by three new configurations to explore; only one of them is considered by KDSE.
Fig. 9. Comparison of DSE using KDSE and SDF3 for the CSDFG in Figure 3(a).
Figures 9(a) and (b) visualize the final DSEs performed on the CSDFG shown in Figure 3(a); KDSE completes its DSE in 32 computations, whereas SDF3 takes 66—approximately two times as many computations to find the same Pareto set.
5 CONCLUSION AND FUTURE WORK
In this work, we designed a novel technique for throughput-buffering trade-off exploration based on K-periodic scheduling. Instead of relying on storage dependencies identified through tracking constraints between actor firings to define the next storage distribution to evaluate, we utilize critical cycles derived from the computation of the K-periodic scheduling component itself. This method reduces the size of the explored space, thereby reducing computation time while continuing to provide the full set of optimal solutions. Our results show a clear improvement over existing methods, especially with regard to reducing the search space; for some benchmarks with particularly high complexities, this enabled our approach to identify optimal solutions where existing methods could not.
We also evaluated the performance of an approximate DSE algorithm and found that it performed particularly well in instances where optimal solutions would take a much longer time. Within the study of space exploration, the three methods of shaping a design space (KDSE, SDF3, and 1DSE) resulted in distinct performances depending on the benchmark application on which they were run. In future work, rather than concurrently running these techniques in competition, it would be interesting to attempt to profile the benchmarks to identify which DSE algorithm would be best suited for the use case. Another direction for future work is to explore how additional constraints on memory optimization would impact the search space. For instance, as addressed in the work of Lesparre et al. [21], in the context of a many-core processor with a network-on-chip, dataflow tasks can be mapped on different cores; a buffer between these two tasks would model communication between the two cores. Nonetheless, the two cores may not access the same memory space. In this case, the buffer will generally need to be duplicated at least partially between local memories of the two different cores. Finally, we foresee that a deeper study into the differences in DSE search paths might lead to a means of developing a hybrid methodology taking advantage of the strengths of multiple approaches.
ACKNOWLEDGMENTS
We sincerely thank the anonymous reviewers for their contribution that greatly helped to improve and clarify this article.
Footnotes
1 https://www.es.ele.tue.nl/sdf3/download/files/benchmarks/sdfg_buffersizing.zip.
2 The case where the given CSDFG \( \mathcal {G} \) already models bounded buffers using feedback edges can easily be supported by fixing the sizes of these buffers.
3 The SDFG and CSDFG benchmarks used can be found at https://www.es.ele.tue.nl/sdf3/download/benchmarks.php and https://github.com/bbodin/kiter, respectively.
4 https://github.com/bbodin/kiter.
References
- [1] 1997. Data memory minimisation for synchronous data flow graphs emulated on DSP-FPGA targets. In Proceedings of the Design Automation Conference. ACM, New York, NY, 64–69.
- [2] 2018. Throughput-buffering trade-off analysis for scenario-aware dataflow models. In ACM International Conference Proceeding Series. ACM, New York, NY, 265–275.
- [3] 2013. On the hard-real-time scheduling of embedded streaming applications. Design Automation for Embedded Systems 17, 2 (2013), 221–249.
- [4] 2012. Periodic schedules for bounded timed weighted event graphs. IEEE Transactions on Automatic Control 57, 5 (2012), 1222–1232.
- [5] 2010. A new approach for minimizing buffer capacities with throughput constraint for embedded system design. In Proceedings of the 2010 ACS/IEEE International Conference on Computer Systems and Applications (AICCSA’10). IEEE, Los Alamitos, CA.
- [6] 1999. Synthesis of embedded software from synchronous dataflow specifications. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology 21, 2 (1999), 151–166.
- [7] 1995. Cyclo-static data flow. IEEE Transactions on Signal Processing 5 (1995), 3255–3258. https://ieeexplore.ieee.org/abstract/document/485935/.
- [8] 2013. Periodic schedules for cyclo-static dataflow. In Proceedings of the 11th IEEE Symposium on Embedded Systems for Real-Time Multimedia. 105–114.
- [9] 2016. Optimal and fast throughput evaluation of CSDF. In Proceedings of the Design Automation Conference.
- [10] 2012. Affine data-flow graphs for the synthesis of hard real-time applications. In Proceedings of the International Conference on Application of Concurrency to System Design (ACSD’12). 183–192.
- [11] 1999. Efficient algorithms for optimum cycle mean and optimum cost to time ratio problems. In Proceedings of the Design Automation Conference. 37–42.
- [12] 2018. Throughput analysis of dataflow graphs. In Handbook of Signal Processing Systems. Vol. 66. Springer, 751–786.
- [13] 2015. Buffer merging technique for minimizing memory footprints of synchronous dataflow specifications. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’15). 1111–1115.
- [14] 1994. Decidability issues for Petri nets. Petri Nets Newsletter 94 (1994), 5–23. https://pdfs.semanticscholar.org/11ee/c1de65956f3a65d8d124386f4b8fbd80cc38.pdf.
- [15] 2006. Liveness and boundedness of synchronous data flow graphs. In Proceedings of Formal Methods in Computer Aided Design (FMCAD’06). 68–75.
- [16] 2006. Throughput analysis of synchronous data flow graphs. In Proceedings of the 6th International Conference on Application of Concurrency to System Design (ACSD’06). 25–36.
- [17] 2011. Automated architecture-aware mapping of streaming applications onto GPUs. In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium. 467–478.
- [18] 2019. Monotonic optimization of dataflow buffer sizes. Journal of Signal Processing Systems 91, 1 (2019), 21–32.
- [19] 2018. On deadlockability, liveness and reversibility in subclasses of weighted Petri nets. Fundamenta Informaticae 161, 4 (2018), 383–421.
- [20] 1987. Synchronous data flow. Proceedings of the IEEE 75, 9 (1987), 1235–1245.
- [21] 2016. Evaluation of synchronous dataflow graph mappings onto distributed memory architectures. In Proceedings of the 2016 Euromicro Conference on Digital System Design (DSD’16). 146–153.
- [22] 2019. Hard real-time scheduling of streaming applications modeled as cyclic CSDF graphs. In Proceedings of the 2019 Design, Automation, and Test in Europe Conference and Exhibition (DATE’19). 1549–1554.
- [23] 2013. Physical Layer Multi-Core Prototyping—A Dataflow-Based Approach for LTE eNodeB. Lecture Notes in Electrical Engineering, Vol. 171. Springer.
- [24] 1995. Scheduling for optimum data memory compaction in block diagram oriented software synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’95). IEEE, Los Alamitos, CA, 2651–2654.
- [25] 2009. Embedded Multiprocessors: Scheduling and Synchronization (2nd ed.). CRC Press, Boca Raton, FL.
- [26] 2019. Verifying parallel dataflow transformations with model checking and its application to FPGAs. Journal of Systems Architecture 101 (2019), 101657.
- [27] 2006. Exploring trade-offs in buffer requirements and throughput constraints for synchronous dataflow graphs. In Proceedings of the 43rd Annual Design Automation Conference (DAC’06). ACM, New York, NY, 899–904.
- [28] 2008. Throughput-buffering trade-off exploration for cyclo-static and synchronous dataflow graphs. IEEE Transactions on Computers 57, 10 (2008), 1331–1345.
- [29] 2007. Efficient computation of buffer capacities for cyclo-static dataflow graphs. In Proceedings of the 44th Annual Conference on Design Automation (DAC’07). 658.
- [30] 2016. Resource-constrained implementation and optimization of a deep neural network for vehicle classification. In Proceedings of the European Signal Processing Conference.
- [31] 2012. Partial expansion graphs: Exposing parallelism and dynamic scheduling opportunities for DSP applications. In Proceedings of the International Conference on Application-Specific Systems, Architectures, and Processors. 86–93.
K-Periodic Scheduling for Throughput-Buffering Trade-Off Exploration of CSDF