Towards Efficient Control Flow Handling in Spatial Architecture via Architecting the Control Flow Plane

Spatial architecture is a high-performance architecture that uses control flow graphs and data flow graphs as the computational model and producer/consumer models as the execution models. However, existing spatial architectures suffer from control flow handling challenges. Upon categorizing their PE execution models, we find that they lack autonomous, peer-to-peer, and temporally loosely-coupled control flow handling capability. This leads to limited performance in intensive control programs. A spatial architecture, Marionette, is proposed, with an explicit-designed control flow plane. The Control Flow Plane enables autonomous, peer-to-peer and temporally loosely-coupled control flow handling. The Proactive PE Configuration ensures timely and computation-overlapped configuration to improve handling Branch Divergence. The Agile PE Assignment enhance the pipeline performance of Imperfect Loops. We develop full stack of Marionette (ISA, compiler, simulator, RTL) and demonstrate that in a variety of challenging intensive control programs, compared to state-of-the-art spatial architectures, Marionette outperforms Softbrain, TIA, REVEL, and RipTide by geomean 2.88x, 3.38x, 1.55x, and 2.66x.

Spatial architectures represent a category of accelerators that harness the immense potential of high computational parallelism through direct intercommunication among an array of processing engines (PEs).The software programs are transformed into control flow graphs (CFGs) and data flow graphs (DFGs) to facilitate their compilation.The CFG captures the program's control dependencies, including loops and conditions.The widely prevalent execution model for spatial architectures is the producer/consumer pipeline model [15].The CFGs and DFGs are assigned to PEs and interconnect networks.This mapping relationship is expressed through carefully distributed instructions (configurations) among the PEs and interconnect networks.Spatial architecture can mitigate data transfers from memory, thus bypassing the storage bottleneck and facilitating the relentless advancement of computing power.Furthermore, its producer/consumer pipeline model excels in accelerating kernels with conventional data-level parallelism, bringing noteworthy performance benefits in many domains such as artificial intelligence (AI), mobile communication, and image processing [5,23,25,45,51,65].
However, contemporary applications impose heightened demands on the spatial architecture's capacity to adeptly handle control flow.On the one hand, modern applications across pivotal domains, such as mobile communication, computer vision, bioinformatics, and general-purpose kernels, exhibit intricate control flow patterns, involving branches or nested loops, as shown in Table 1.On the other hand, AI also has higher requirements for control flow processing capabilities.Tenstorrent has introduced a cutting-edge framework for big AI models, encompassing Dynamic Sparsity, Conditional Execution, and Dynamic Routing [63].FlexMoE [47] has further proposed an innovative scheduling module that enables dynamic mapping of the model-to-hardware allocation based on real-time dataflow over the existing DNN runtime.
Unfortunately, spatial architectures exhibit limitations in effectively handling control flows.To tackle this challenge, this work strives to comprehensively pinpoint the fundamental architectural

Mobile Communication
CRC [32] Innermost Imperfect nested Serial Loops ADPCM [32] Serial branches N/A SC Decode [4] Innermost Imperfect nested Serial Loops LDPC Decode [54] Nested branches Innermost Imperfect nested Serial Loops limitations of the hardware execution model of spatial architecture, at both array-level and PE-level.We survey the representative spatial architectures in the past decade, categorizing their PE execution models into two distinct paradigms, namely von Neumann PE and dataflow PE.Furthermore, we conduct an in-depth analysis of these two models in the context of handling two typical control flow scenarios, namely Branch Divergence and Imperfect Loop.
Our observations indicate that conventional PE execution models cause significant PE idleness while handling above typical control flows.This limitation stems from the fact that: (1) von Neumann PEs lack the capability to autonomously initiate configuration changes in other PEs, and that there exists no direct channel for control information transfer between PEs; (2) The distinctive token utilized by dataflow PEs results in a temporally and spatially close-knit coupling of control flow and data flow, consequently constraining control flow transfer.We advocate that an ideal PE for control flow handling is expected to (1) autonomously change the configurations of other PEs, (2) incorporate a peer-to-peer control flow path to enable agile control information transmission, and (3) be temporally loosely-coupled with dataflow path.
To achieve the above objectives, our insight is that the control flow handling in spatial architectures should stand out from the current hybrid design that mixes control flow handling and data flow handling.This calls for a new architectural scheme, which decouples the control flow handling and dataflow handling.We introduce Marionette, a spatial architecture design with decoupled control flow plan and data flow plane.Specifically, we architect a control flow plane for existing spatial architecture, which includes redefined PE architecture and features.We believe a decoupled control flow plane is naturally beneficial for autonomous, peer-to-peer, and temporally loosely-coupled control flow handling capability.
Control Flow Plane: The control and configuration components of Marionette have been consolidated into a control flow plane, which completely decouples control flow handling from data flow handling.To enable autonomous, peer-to-peer and temporally loosely-coupled control flow handling, the control flow plane incorporates three distinct PE micro-architectures and a specialized control network.

Basic Block 1
For: i < size Proactive PE Configuration: To computation-overlapped and timely configuration, we devise the Control Flow Sender, which timely transfer control flow to hasten the configuration of subsequent PEs.This allows for executing current-stage computation and next-stage configuration in the same stage.Consequently, Marionette achieves a high PE utilization rate in Branch Divergence.

Basic
Agile PE Assignment: Given the solid foundation for control flow handling capabilities provided by the Marionette control flow plane, we optimize the Marionette scheduling strategy and develope a Control Flow Scheduler.This enhancement renders Marionette highly flexible in constructing basic block pipelines, leading to a significant improvement in PE utilization when processing Imperfect Loops.
The contributions of this work are as follows: (1) We present a taxonomy that categorizes PE execution models of spatial architectures based on prior research.(2) We synopsize the control flow predicaments encountered by extant spatial architectures when confronted with demanding control flow applications, specifically, the paucity of autonomous, peer-to-peer, and temporally loosely-coupled control flow handling capability.

BACKGROUND
This paper presents a challenge with the intensive control flow algorithm employed in Spatial Architectures (SAs).To provide a comprehensive understanding of this problem, we first provide an overview of the SA's computational model and execution model.

SA Computational Model
Upper part of Figure 1 shows the computational model of SA.A SA normally uses a Control Data Flow Graph (CDFG) as its computational model, where a program running on the SA is represented as CDFGs.The CDFG consists of a control flow graph (CFG) and data flow graphs (DFGs).A DFG [20] is a graph that depicts operations as nodes and data dependencies as edges.As there is no control dependencies in DFG, it is usually embedded in a basic block (BB).The BB has a single entry and a single exit.The CFG [3] is a graph whose nodes are BBs, and the edges represent the control dependencies between the BBs.

SA Execution Model
We briefly introduce the essential architecture of SA and then elaborate the hardware execution model at array-level and PE-level.We believe that understanding the root cause of control flow handling problem is worth another re-examination of the hardware execution model.A core processor that generates the overall configuration signal.DRP [59] Switching all PE configurations via a finite state machine.DySER [30] Configuration update via external processor signal.FPCA [16] External processor assignments.DORA [69] A counter determines the end and update of the configurations.Plasticine [52] A counter controls the distribution and execution of configurations.Softbrain [48] Processor fetches instruction from memory.SPU [18] Processor fetches instruction from memory.MP-CGRA [19] Distributed instruction counters.DRIPS [62] The centralized controller dynamically changes the map table.RipTide [28] Processor fetches instruction.

dataflow PE
TRIPS [56] An instruction window to determine instruction execution.Wavescalar [60] According to the data, configurations are fetched to execute.TIA [49] Scheduler selects intructions based on the input data.T3 [55] An instruction window to determine instruction execution.SGMF [66] The corresponding thread is executed when the token arrives.dMT-CGRA [67] An instruction window to determine instruction execution.
SA Hardware Architecture: As shown in the lower left of Figure 1, a SA consists of a group of processing elements (PEs) interconnected by an on-chip network.A PE typically includes a set of functional units, such as adders, multipliers, and shifters.These PEs are designed to support higher-level operations such as multiplication.Moreover, the PEs can be reconfigured to perform different tasks, and the interconnect network can be programmed to support various data flows and communication patterns, allowing the PEs to be connected in different ways to form various computational structures.SA Array-level Execution Model: SA needs programming to run applications.A typical way is to map the DFGs of the program onto the PE array, along with a set of hardware resources that will be used to execute the tasks.The hardware resources can include functional units, interconnects, and memory blocks.A mapping algorithm is then used to determine the optimal placement of tasks on the PEs and to configure the interconnect network to support the required data flows.Such configurations of PE and networks are achieved through instructions.
The producer/consumer pipeline model is a crucial characteristic of SA array-level execution model that enables efficient data transfer and computation.Each PE is assigned a specific operator in the DFG, and multiple PEs are spatially interconnected to form a pipeline.Figure 1 lower right part depicts the producer/consumer pipeline, wherein the pipeline initiation interval (II) equals 1, which means that for each cycle, a new loop iteration can begin.As a result, the producer/consumer pipeline model affords two crucial benefits: high parallelism and effective hardware resource utilization.

PE Execution Model
In this work, our goal is to comprehensively examine the existing SA execution model and pinpoint the root cause of inefficient control flow handling both at the array-level and PE-level.To achieve this goal, we first conduct a comprehensive survey of SAs in the past decade, as shown in Table 2.We then categorize them into von Neumann architecture-derived [68] and dataflow architecturederived [17] PE according to their control flow handling schemes, as shown in Figure 2 1 .We assume that configuring a PE takes one cycle, and executing an instruction takes two cycles (This does not

Issues: PE idles caused by control delay or computation-unoverlapped configuration.
Architectural Inefficiencies: A token is used as input, consisting of a data and a tag.The tag activates the configuration, while the data performs the operation.This means that the dataflow PE's configuration can be altered by other PEs while in use.However, the token links control flow and data flow, leading to some limitations.Each PE execution requires configuration before execution, causing latency overhead.Moreover, this coupling limits the effect of an instruction to the current data, hampering pipeline control.

CHALLENGES AND MOTIVATIONS
In this section, we will provide a detailed explanation of how von Neumann PE and dataflow PE handle control flow when executing control flow-intensive kernels.We will use two representative control flows, namely Branch Divergence and Imperfect Loop, which are commonly found in various algorithms.These control flows account for over 40% of the average percentage in popular benchmark suites such as MachSuite [53], Rodinia [13] and PolyBench [1].
Our goal is to show the root cause of existing spatial architectures' awkwardness in handling the control flow, which is due to the lacking mechanism for autonomous, peer-to-peer, and temporally loosely-coupled control flow information transfer among PEs in SA.This observation motivates us to propose a decoupled control flow plane for SA, which includes redefined PE architecture and features.

Two Typical Control Flow Forms
Branch Divergence is a prevalent issue in CFGs, where the program's control flow divides into different execution paths due to the presence of conditional branches.This occurs when the program encounters a decision point, where it must choose between multiple execution paths based on specific conditions.Branch Divergence is common in various kernels such as Sort and Merge, in the database and sparse computing.As an example, Figure 3 (a) shows a code snippet in Merge, where the data flow within the conditional branch dynamically forks into two paths (true and false), or BBs.This results in divergent execution paths and can lead to poor PE utilization in both existing PE execution models.
Imperfect Loop is another typical control flow form which can be characterized as nested loops, with computations present in the outer loop bodies.It is a common feature in scientific computing and finite element analysis, particularly in applications such as blocked matrix multiplication (GEMM) and computational fluid dynamics (CFD).Figure 3 (b) shows a code snippet of Sparse Vec/Mat.Mult.(SPMV), where the inner loop (in green) executes every block size times while the outer loop (in yellow) executes only once.Different nested loops have varying BB execution frequencies, which can cause an imbalanced pipeline and poor PE utilization in existing PE execution models.

Control Flow Handling Inefficiency in von
Neumann PE Branch Divergence: Switch Configuration The first method (shown in the left half of Figure 3 (c)) is to implement the branch in the fashion of switching PE configurations in the time dimension.Specifically, the branch target PEs need to wait for the control flow information and switch the configurations.Due to the inability of the branch PE to autonomously modify the control information of the target PEs, and the lack of a direct channel for control flow information transmission between them, the branch PE is constrained to indirectly convey control information through an external Centralized Control Unit (CCU).This approach is clumsy.As the branch PE needs to transfer the control flow result to the CCU.Then the CCU replies to the branch target PE through the configuration network while the whole array is left idle.
Branch Divergence: Predication The second approach, referred to as "Predication" (shown in the right half of Figure 3 (c)), is more prevalent.It involves converting branches into distinct data paths in spatial dimension by consuming additional PEs.The configurations for both branch targets are pre-configured in two target PE lanes, respectively.Subsequently, the correct PE lane is selected from these two paths for the following BB according to the branch result.However, the not taken PEs will be left idle.It would be great if this idle resource could be used as other kernels.
Imperfect Loop: The CFG of Imperfect Loop can be seen as a variation of Branch Divergence.BB 4 is responsible for making a branch decision, with one branch leading to BB 5 and the other to BB 2 and BB 3. As a result, the von Neumann PE typically employs the Predication method in Branch Divergence.Based on this, there is a key point in the SPMV algorithm: the outcome of BB 3 determines the loop boundary for BB 5. Thus, it bears resemblance to the Switch Configuration employed in Branch Divergence.In this case, BB 3 transmits the control flow to the Centralized Control Unit, which subsequently configures the loop generator of the inner loop.As shown in Figure 3 (d), the low utilization rate of Von Neumann PE in various stages is quite significant, and it can be attributed to a reason similar to the idle state under Branch Divergence.

Control Flow Handling Inefficiency in Data flow PE
The execution model of the dataflow PE demonstrates a restriction in managing control flow.The constraint stems from the binding of control and data flow within tokens.This close-knit coupling in both temporal and spatial dimensions restricts the control flow transferring.Branch Divergence: We expose the challenge caused by the temporal tight coupling between control flow and data flow when dataflow PEs perform Branch Divergence, as shown in Figure 3 (e).The concurrent arrival of both tag (control flow information) and data flow via the same channel at the same time necessitates  Imperfect Loop: When executing an Imperfect Loop, dataflow PEs can accentuate the challenges arising from the close coupling of control flow and data flow, manifesting in both temporal and spatial domains, as shown in Figure 3 (f).On the one hand, the synchronization of control flow and data flow naturally leads to a longer pipeline II, as elaborated in the preceding paragraph.On the other hand, the close spatial couping of control flow and data flow implies that control information transferring reliant on the data path is frequently inflexible.For instance, in the absence of a direct control flow pathway between the second PE and the loop generator, control flow information must traverse the red data path, leading to pipeline idleness.Consequently, even though the dataflow PE has some autonomy, the inherent drawback results in a substantially reduced utilization rate.

Insights and Motivations
We can observe that both von Neumann PE and dataflow PE will cause significant PE idleness, as shown in Figure 3 (i)(j).This is mainly due to (1) PEs cannot autonomously change the configuration of other PEs; (2) Control flow transmission between PEs is restricted, manifested in both spatial and temporal dimensions: 1. Spatially, control flow transmission is not direct, but instead realized by switching configurations through centralized control units or coupled into the data flow (i.e.tag or predication).2. Temporally, control flow transfer is also constrained.The configuration stage is hard to overlap with computation since the configuration function and computation function are temporally tightly-coupled.
Facing the intensive control flows, an ideal PE is expected to (1) be able to autonomously change the control information of other PEs, (2) incorporate a peer-to-peer control flow path to enable agile control information transmission, and (3) feature an temporally loosely-coupled control flow path to facilitate the concurrent execution of current-stage computation and next-stage configuration within the same stage, as shown in Fig. 3 (g)(h).To achieve this objective, a critical requirement is to decouple the control flow handling and dataflow handling, which calls for a new architectural scheme.Our architecture is predicated on this assumption, and the superior performance of our design relative to other state-of-the-art architectures is demonstrated in Table 3.

DESIGN
The current PE execution models expose a deficiency in control flow handling ability.It becomes imperative to deploy an autonomous, peer-to-peer and temporally loosely-coupled control channel.We propose that in spatial architectures, control flow handling should be separated from the current hybrid designs that combine control flow handling with data flow handling.This introduces a new layer of abstraction, called the control flow plane of the spatial architecture, which facilitates the separate design and optimization of control flow handling for each PE.This approach represents the only viable means to swiftly reconfigure a set of PEs and achieve control coordination within the spatial architecture.We introduce Marionette, a spatial architecture design with decoupled control flow plan and data flow plane.Specifically, we architect a control flow plane for existing spatial architecture, which incorporates three innovative features.This section presents our proposal for Marionette and organizes the discussion to describe its key features.4 (a), demonstrates the decoupling of the control flow part and data flow part.We design the micro-architecture of three control flow components: the Control Flow Scheduler, Control Flow Trigger, and Control Flow Sender, along with a corresponding ISA that enables independent control flow handling and ports.This permits the free transmission and receipt of control flow, unrestricted by the data flow within the PE.
Control Flow Trigger, shown in Figure 5, is the pivotal configuration unit of the Marionette PE framework.It is composed of two phases, namely the check phase and the configuration phase.The Control Flow Trigger is designed to sustain the configuration determined in the configuration phase until a fresh control input is detected during the check phase.This contrasts with data flow PE instructions, which are solely responsible for a single calculation.By virtue of the majority of PEs executing within the confines of the same basic block's producer-consumer pipeline, the Control Flow Trigger obviates the overhead of switching instructions.Autonomous control flow handling: As previously noted, the autonomy of the Marionette PE stems from its ability to generate instruction addresses for subsequent PEs.The configuration of control flow part provides a range of instruction addresses.Temporally loosely-coupled control flow handling: Given that the Control Flow Scheduler, Control Flow Trigger, and Control   A peer-to-peer Control Network: To enable timely and flexible peer-to-peer control flow transfer with minimal area overhead, we design a control network based on the Benes network.This wellknown rearrangeable non-blocking network [11] has a butterflyshaped interconnection structure, and a much smaller number of node switches than the Crossbar network, serving as our design starting point due to its low area overhead and high flexibility (Figure 6 (a)).However, it lacks broadcasting capabilities.To address this, we incorporate the Consecutive Spreading (CS) network [38], which performs broadcast and has a smaller area overhead than cascading multiple same-sized networks.We present a CS-Benes network that connects PEs, control FIFOs, and the controller, providing configurable network output with a fixed connection and no arbitration during control transfers.Each path in the network contributes one element of throughput every cycle.We reserve many extensible interfaces.Figure 6 (c) displays the specific interface design of our CS-Benes network, and in Section 7.2, we evaluate the control network's scalability.

Proactive PE Configuration
In the Marionette control flow plane design, the control flow and data flow are loosely-coupled in the time domain, which allows for the execution of current-stage computation and next-stage configuration within the same stage.

Timeextend
Marionette Scheduling:  The Marionette PE's data flow part executes non-branch calculation operators in the DFG operator mode.This mode indicates that the current and subsequent PEs are in the same BB and share the same control flow.To expedite control flow transmission, the current PE's configuration is sent to the subsequent PE in advance once the configuration is completed.Consequently, when the current PE sends the data flow's dataflow result to the subsequent PE, the configuration of the latter has already been completed.In contrast, the branch operator mode executes the branch (not loop) operator, indicating that the current and subsequent PEs are in different basic blocks with a control jump in between.Therefore, the control flow of the PE must wait for the data flow calculation result to determine the control information of the subsequent PE, and no proactive control flow is transmitted.Lastly, in the loop operator mode, the data flow part executes the loop operator, and to ensure continuous pipeline generation, the loop configuration is maintained in advance.
In Figure 4 (b), we have explained the workings of the control flow plane and data flow plane within a single PE.Furthermore, we demonstrates the execution of three Marionette PEs under Branch Divergence in Figure 7 (b).The scenario includes BB 1, which consists of a single branch operation, and BB 2 and BB 3, which are two branch paths with a length of two.To handle this, we assign the branch operation to PE 1 and set it to Branch Operator mode.Additionally, the two branch paths are merged and assigned to PE Figure 7 (c), it illustrates the PE's functionality as a loop generator.When the PE is configured in Loop Operator mode, the control flow plane continuously maintains the current loop operation.It does not accept new configuration signals until the data flow plane determines that it should exit the loop.This capability allows for the continuous generation of the loop pipeline.(This illustrates the method for minimizing Pipeline II.In reality, Pipeline II is configurable.)

Agile PE Assignment
Leveraging the Marionette control flow plane architecture, the PE exhibits autonomous, peer-to-peer and temporally loosely-coupled control flow handling capabilities, aided by a dedicated loop operator that regulates the pipeline II.This serves as the foundation for realigning the flawed loop pipeline, and the Agile PE Assignment feature we proposed enhances PE utilization in Imperfect Loops.The feature encompasses a refined Marionette scheduling strategy and a Control Flow scheduler.
The Marionette scheduling algorithm is shown in the upper left of Figure 8.The frequency of BB execution varies among nested loop levels.Thus, we establish the mapping at each loop level and construct the pipeline with BB granularity.Once all BBs in the current loop level have been scheduled and unassigned PEs remain, we reshape (time-extend) the mapping of both the current layer BB and the inner layer BB, as they satisfy control dependencies in the current state.Time Extended mapping is a widely adopted technique in compilation [7], which entails the folding of the initial spatial domain mapping into the temporal domain, thereby reducing PE resources while also increasing the II.However, reshaping may result in idle PEs due to the diverse DFG shapes of BBs.To address this issue, we select a mapping scheme that minimizes PE waste and expand the original mapping to generate the mapping result of the current loop level.The reshaping scheme, PE waste, and scheduling results of the three-layer nested loop algorithm are illustrated in the upper right of Figure 8.
The lower part of Figure 8 showcases the Agile PE Assignment through Marionette timeline diagrams.The Marionette control flow plane allows for a highly flexible construction of the BB pipeline, encompassing adaptable PE resources, startup time, and pipeline II.This results in significantly enhanced PE utilization.To collect the control information generated by the outer BB pipeline, we design the Control Flow Scheduler and Control FIFOs.When the inner loop BB completes a round of loop iterations, it utilizes the pre-collected outer loop BB's control information to determine whether to initiate the next loop, thereby avoiding frequent configuration switch to the outer loop BB.Furthermore, the Control Flow Arbiter inside the Control Flow Scheduler possesses the capability to arbitrate between the current execution configuration and input control, enabling dynamic adjustment of the execution priority among BBs with varying levels of nested loops.

Programming and Compilation
The process of programming Marionette entails several tasks: (1) Annotating the branches and loop statements of the algorithm with #pragma tags, (2) Extracting and analyzing the control data flow graph (CDFG), and (3) Mapping the control flow and data flow portions to the corresponding planes of Marionette, while imposing constraints on memory access and communication.To elucidate the mapping of the program onto Marionette, we employ an example from Figure 9. Figure 9 (a) depicts the original kernel's C code, while Figure 9 (b) illustrates the extracted CDFG.Subsequently, Figure 9 (c) showcases the specific mapping onto Marionette's control flow and data flow planes, using the Marionette scheduling algorithm explained in Figure 8.The CDFG for Branch Divergence may yield a tree-like structure, and the same level BBs are mapped to the same PE lane to the extent possible.Furthermore, Imperfect Loops are partitioned into distinct mappings.

IMPLEMENTATION
Here we discuss the implementation of the hardware, software stack and simulator used in the evaluation.Figure 10 shows an overview.
Hardware: Our Marionette design is parameterizable (e.g., PE array size, FU type, port widths, memory size, etc.) and yields an architectural description shared with the software stack and simulator.Table 4 shows the hardware parameters.We synthesized a prototype of the Marionette at 500MHz using the 28nm technology library.
Software Stack: We use the annotated source code to generate an LLVM intermediate representation (IR), which represents lowlevel operations on data flow and control flow.An automatic tool examines LLVM's IR and generates several DFGs and one CFG

EVALUATION METHODOLOGY 6.1 Comparison Methodology
We built a cycle-level accurate simulator with the option to implement innovation points to compare the performance gains obtained by each innovation point separately.The simulator optimistically offers high memory access flexibility.
First, we evaluate the performance of Marionette PE (including Proactive PE Configuration) compared to von Neumann PE and dataflow PE.In order to verify the optimization effect on Branch Divergence, for a fair comparison, we do not consider the dedicated control network and Agile PE Assignment.And we unify the data network.
Second, we evaluate the peer-to-peer control network.Besides, we conduct a DC synthesis experiment on the control network delay under different frequencies and network stages.Moreover, we compare the network normalized area with the state-of-the-art architecture.
In addition, we evaluate the Agile PE Assignment to verify the optimization effect on Imperfect Loop.Furthermore, we compute the utilization improvement of the PE that originally executed the outer BB and pipeline utilization, and analyze the relationship between the result of the peer-to-peer control network and the Agile PE Assignment.
Finally, we built the performance models of Softbrain [48], TIA [49], REVEL [70] (15 systolic PEs, 1 tagged-dataflow PE), Riptide [28] (16 fully functional PEs and 25 control flow operators inside network) and Marionette with the simulator and normalized the computing fabric to the same size to compare the performance.

Benchmark
We selected a wide range of 13 benchmarks to evaluate our work.FFT, NW, Viterbi (VI), Merge-Sort (MS) and GEMM are from Machsuite [53].ADPCM and CRC are from Mibench [32].Hough Transform (HT) is from HosNa Suite [10].We also selected LDPC Decode (LDPC) [54] and SC Decode (SCD) [4].Some of the control flow characteristics of these benchmarks have been qualitatively described in Table 1.Conv-1d (CO), Sigmoid (SI) and Gray Processing (GP) are simple single-layer loop applications, which are prepared as a fair comparison.Table 5 shows the data sizes.

EVALUATION
We evaluate the performance improvement of features in Marionette.In addition, we conducted scalability experiments for the control network and compared the network with other work.Finally, we compare Marionette with state-of-the-art architectures and show that Marionette performs better on intensive control flow applications.Figure 13: The relationship among network stages, network delay and critical path delay.Control network provides optimal scalability at high frequencies.

Advantages of Control Network
As shown in Figure 12, the peer-to-peer control network leads to a shorter transfer delay of the control flow.It outperforms on average 1.14× and up to 1.36× (CRC).CRC, ADPCM, and Merge Sort are only partially pipelined.Hence, the overhead of the control flow transfer is high, and the speedup is apparent.
After adding the control network, we compare the network area overhead with the state-of-the-art architectures.As shown in Table 6, considering a fair comparison, we normalize the computing fabric (Plasticine uses a 3-lane 4-stage PCU and a 4-stage SRAM-free PMU), respectively measure the network (including data network, memory network and control network) area and the ratio of the network to computing fabric.Our network area is only 0.0118 2 , which is 11.5% of the computing fabric.While each architecture may have a unique functional design with varying PE and network functions and sizes, we can infer from our experiment results that a peerto-peer control flow network can alleviate the burden of utilizing other network structures for transmitting control flow.This, in turn, reduces the interconnection complexity of the original network and minimizes overhead.
We also evaluate the scalability of the control network by synthesizing different stages of the control network under various time constraints.Figure 13 shows the result.Higher frequency and largescale fabric will increase network latency.However, we believe the low increase in network latency is acceptable because the data flow has more severe constraints than the control flow.

Advantages of Agile PE Assignment
As shown in Figure 14, Agile PE Assignment significantly improves the performance, which achieves an average speedup of 2.03× and up to 5.99×.We also further analyze the improvement of Agile PE Assignment, respectively from outer-BB PE utilization and pipeline utilization in a fine-grained manner, in Figure15.We only selected the multi-layer nested loop benchmark where the innermost loop can form a pipeline.
"Outer-BB PE utilization" pertains to those PEs initially assigned to solely execute the outer loop BB.By dynamically assignment, they can either join the construction of the outer loop pipeline or reconfigure them as inner loop pipelines, leading to an average improvement of 21.57×.Among them, GEMM forms a dense spatial pipeline structure and obtains a utilization rate of 134×.
The measure of pipeline utilization is determined by the proportion of pipeline initiations to the overall number of executions.This ratio provides an indication of the pipeline's level of idleness.Overall, Agile PE Assignment has achieved an average of 1.54× optimization in the pipeline utilization in different benchmarks.Hough Transform, NW, SC Decode and GEMM are suitable because outer BBs can generate more control flow.FFT and Viterbi have a data-dependent pipeline II and limit the practical pipeline to 33% (II=2).In general, the improvement is limited by the loop structure and the limitations of data dependencies between loops (LDPC).
Speedup comparison between control network and Agile PE Assignment: An exciting balance of speedup between the control network and Agile PE Assignment is shown in Figure 16.CRC, ADPCM, Merge Sort, and LDPC cannot be well pipelined.Therefore, Agile PE Assignment cannot create a significant acceleration, but the acceleration of the control network is noticeable.For Viterbi, Hough Transform, SC Decode and GEMM, the control flow is comparatively regular, so the Agile PE Assignment of the control flow is evident.While the proportion of control flow in the critical path is diminished, and the acceleration effect of the control network declines.

Marionette Outperforms State-of-the-art architectures
We compare Marionette with other architectures.The results are shown in Figure 17.For non-intensive control flow benchmarks, all architectures have similar performance except for TIA which has a longer pipeline II (dataflow PE).Single BB inside loop structure is particularly suitable for constructing balanced pipelines.The innovative features of the Marionette do not deteriorate performance for non-intensive control flow applications.For intensive control flow benchmarks, TIA and Softbrain have similar performance.On average, the Marionette speedup is 2.88× that of Softbrain, 3.38× that of TIA, 1.55× that of REVEL, and 2.66× that of RipTide.In some benchmarks such as Viterbi, Hough Transform, SC Decode and GEMM, the REVEL execution model is comparable to the Agile PE Assignment, so the speedup is better.However, Agile PE Assignment has apparent advantages because it is flexible enough.

RELATED WORK
Spatial architecture taxonomy by execution model: We divide PEs of SA and reconfigurable spatial architectures into von Neumann PE and dataflow PE according to the execution mode.Table 2 lists some of them.Von Neumann PE only passively executes the configuration according to the instruction sequence.Some SAs [2, 12, 16, 18, 19, 21, 22, 24, 26, 28-31, 33, 36, 39-41, 46, 48, 50, 62] are configured by a unified controller (main processor or co-processor), and some use counters [52,69] or finite state machines [14,59] to control the order of instructions.They both have controllers that construct sequential instruction flows.To satisfy some dynamic properties, there are some von Neumann PEs request configuration from the processor [37,42,55,58].Dataflow PE means the reconfigurable PE determines the execution state according to the input data to select the configuration execution [9,49,56,57,60,61,66,67,71].Its essence is the out-of-order execution of instructions.The marionette PE is innovative from the von Neumann PE and the data flow PE, and decouples the configuration process through the control plane to achieve timely and Proactive Configuration, which does not exist before.Dedicated Control Network Design: Some SAs add control bits to tag data with additional functions, such as SPU [18], etc.The control signal is coupled with the data network, which cannot satisfy control flow flexibility.The DRIPS control network [62] is essentially a config network-on-chip (NoC), not a control network for our control flow characteristics.The RipTide [28] moves most control operations to the network, but the transferring is slow and inflexible, and the data and control information in the network are still coupled.Overall, Marionette has the first independent control network designed for control flow from the control plane.
Spatial pipelines on multiple BBs: Most SAs support spatial pipelines of innermost loops.Some SAs support spatial pipelines of different BBs, but limit their execution resources.FIFER [44] restricts different BBs to pipeline on different computing fabrics.REVEL [70] restricts the innermost loop to pipeline on systolic computing fabric and the outer loop BBs to pipeline on only a few dataflow PEs.The mismatch between the number of pipeline operators and the fixed execution resources can lead to PE underutilization and performance bottlenecks.In our execution model, the BB pipelines are dynamically balanced by the control flow.The mechanism of the DRIPS [62] balancing of pipelines is passive by setting a fixed time window by the controller to obtain the current pipeline state and resend the configuration by the config NoC.The passive centralized balancing approach will lead to pipeline pauses.Our dynamic balancing spatial pipeline is active and decentralized, thus having better dynamic balancing pipeline results.

CONCLUSION
This work describes Marionette, a spatial architecture with a decoupled, explicit-designed control flow plane and three corresponding innovative features.We developed full stack of Marionette (ISA, compiler, simulator, RTL) and demonstrate that in a variety of challenging control-intensive applications, compared to state-of-theart spatial architectures, Marionette outperforms Softbrain, TIA, REVEL, and RipTide by geomean 2.88×, 3.38×, 1.55×, and 2.66×.Tsinghua University-China Mobile Communications Group Co.,Ltd.Joint Institute.

Figure 1 :
Figure 1: Spatial Architecture Computational Model and Execution Model.

1 . 1 1.
Control flow can only transfer through Data path.2. The handing of control flow and data flow is constrained to the same token processing.DataToken The PE autonomously propels the switching configurations of other PEs.2.A peer to peer control path between PEs. 3. The control and data paths of the PEs are temporally loosely-coupled.flexible pipeline rebuildOnly Data path can be selected.

Figure 3 :
Figure 3: Von Neumann PEs and dataflow PEs face control flow handling challenges when dealing with two typical control flow forms: Branch Divergence and Imperfect Loop.necessarily indicate an accurate timing for all architectures, but only show a relative time cost all over this paper).We will delve into the distinct ways they employ to handle control flow and identify their respective limitations.Control Flow Handling of von Neumann PE: A typical execution model of von Neumann PE is shown in Figure 2 (a).Von Neumann PE gets its configurations from the instruction buffer to configure the interconnect input/output, local register, and function unit.In the traditional von Neumann architecture, the execution sequence of instructions is controlled by the program counter (PC) pointer.However, in the evolved von Neumann PE, the PC pointer is often replaced by a finite state machine or a control core.This enables more flexible control flow and quicker reconfiguration of the processor.The logic of switching configurations is pre-set at compile time and can be quickly reconfigured.During execution, each von Neumann PE has distributed and isolated configuration logic, which cannot be directly changed by other PEs.To adapt to the pipeline of the spatial architecture, each instruction may take effect on the data input for a period of time, enabling producer-consumer parallelism.Control Flow Handling of dataflow PE: Figure 2 (b) displays a standard model for executing dataflow PEs.A token is used as input, consisting of a data and a tag.The tag activates the configuration, while the data performs the operation.This means that the dataflow PE's configuration can be altered by other PEs while in

Figure 4 :
Figure 4: Marionette PE execution model, timing diagrams, PE micro-architecture, and overall architecture.a PE configuration stage as a consequent operation of data entry, leading to an explicit overhead for PE configuration.Unfortunately, this explicit overhead results in a suboptimal utilization of the PE.Imperfect Loop: When executing an Imperfect Loop, dataflow PEs can accentuate the challenges arising from the close coupling of control flow and data flow, manifesting in both temporal and spatial domains, as shown in Figure3 (f).On the one hand, the synchronization of control flow and data flow naturally leads to a longer pipeline II, as elaborated in the preceding paragraph.On the other hand, the close spatial couping of control flow and data flow implies that control information transferring reliant on the data path is frequently inflexible.For instance, in the absence of a direct control flow pathway between the second PE and the loop generator, control flow information must traverse the red data path, leading to pipeline idleness.Consequently, even though the dataflow PE has some autonomy, the inherent drawback results in a substantially reduced utilization rate.

( 1 )
What is the control flow plane of Marionette, and how to realize a autonomous, peer-to-peer and temporally looselycoupled control flow handling capability by establishing a control flow plane?(2) How is the Proactive PE Configuration used to achieve timely and computation-overlapped configuration through the Control Flow Sender, and what solutions does it offer for Branch Divergence and pipeline initiation?(3) How does the Agile PE Assignment enhance the pipeline performance of Imperfect Loops through a refined Marionette scheduling strategy and Control Flow scheduler?4.1 Control Flow Plane Control flow plane abstraction in Marionette: As shown in Figure 4 (d), to attain a fully decoupled control flow plane, we have encapsulated all control flow-related components, including the Controller, Control Network, Control FIFO, and the control flow part within PEs, within Marionette's control flow plane.The primary objective of the control flow plane is to establish a correlation between CFG and the hardware implementation.In this process,

Figure 6 :
Figure 6: A dedicated control network enables peer-to-peer transfer of control flow between PEs.

Figure 7 :
Figure 7: Control Flow Sender and Proactive PE Configuration feature for Branch Divergence and Pipeline Initiation.Flow Sender on the control flow plane possess distinct instruction sets and execution procedures, the control flow and data flow are inherently decoupled.Consequently, it allows for overlapping of the next configuration phase with the current computation phase through Proactive PE Configuration, as detailed in Section 4.2.This approach facilitates the transfer of control and data flow between PEs, as demonstrated in Figure4(b).A peer-to-peer Control Network: To enable timely and flexible peer-to-peer control flow transfer with minimal area overhead, we design a control network based on the Benes network.This wellknown rearrangeable non-blocking network[11] has a butterflyshaped interconnection structure, and a much smaller number of node switches than the Crossbar network, serving as our design starting point due to its low area overhead and high flexibility (Figure6 (a)).However, it lacks broadcasting capabilities.To address this, we incorporate the Consecutive Spreading (CS) network[38], which performs broadcast and has a smaller area overhead than cascading multiple same-sized networks.We present a CS-Benes network that connects PEs, control FIFOs, and the controller, providing configurable network output with a fixed connection and no arbitration during control transfers.Each path in the network contributes one element of throughput every cycle.We reserve many extensible interfaces.Figure6(c) displays the specific interface design of our CS-Benes network, and in Section 7.2, we evaluate the control network's scalability.

Figure 8 :
Figure 8: Agile PE Assignment, including Marionette scheduling algorithm and an example of processing Imperfect Loops.implements three distinct operating modes for the control flow transmitter: DFG operator mode, branch operator mode, and loop operator mode, as shown in Figure 7 (a).Additionally, the Control Flow Sender features Proactive PE Configuration in the DFG operator mode and loop operator mode that hastens the transmission of control flow.The Marionette PE's data flow part executes non-branch calculation operators in the DFG operator mode.This mode indicates that the current and subsequent PEs are in the same BB and share the same control flow.To expedite control flow transmission, the current PE's configuration is sent to the subsequent PE in advance once the configuration is completed.Consequently, when the current PE sends the data flow's dataflow result to the subsequent PE, the configuration of the latter has already been completed.In contrast, the branch operator mode executes the branch (not loop) operator, indicating that the current and subsequent PEs are in different basic blocks with a control jump in between.Therefore, the control flow of the PE must wait for the data flow calculation result to determine the control information of the subsequent PE, and no proactive control flow is transmitted.Lastly, in the loop operator mode, the data flow part executes the loop operator, and to ensure continuous pipeline generation, the loop configuration is maintained in advance.In Figure4(b), we have explained the workings of the control flow plane and data flow plane within a single PE.Furthermore, we demonstrates the execution of three Marionette PEs under Branch Divergence in Figure7(b).The scenario includes BB 1, which consists of a single branch operation, and BB 2 and BB 3, which are two branch paths with a length of two.To handle this, we assign the branch operation to PE 1 and set it to Branch Operator mode.Additionally, the two branch paths are merged and assigned to PE

2 and PE 3 ,
which are in DFG Operator mode.In each cycle, PE 1 executes the branch instruction to select the correct branch path.It then transmits the configuration signals of the branch path to the control flow plane of the downstream PE.The data flow plane of the downstream PE executes the correct branch path, meanwhile the control flow plane accepts the upstream configuration signals in advance to configure for the next cycle.This effectively hides the configuration time overhead.The Proactive PE Configuration feature allows Marionette PE to achieve a branch target PE utilization rate that is similar to that of an ideal PE.

Figure 12 :
Figure 12: Normalized speedup by Control Network.A peerto-peer control network contributes a 1.14× performance improvement.

Figure 14 :
Figure 14: Normalized speedup by Agile PE Assignment.Agile PE Assignment contributes a 2.03x performance improvement.

Figure 16 :
Figure 16: Speedup comparison between control network and Agile PE Assignment.The proportion of benchmarks' control flow that can be hidden distinguishes the acceleration effect of the two features.

Table 1 :
Control flow forms across modern applications.

connect Output Local Register Input Data Execution Result Output Data Other PEs cannot directly modify instruction sequences.
the state-of-the-art architectures, Marionette outperforms Softbrain, TIA, REVEL, and RipTide by geomean 2.88×, 3.38×, 1.55×, and 2.66× across a variety of intensive control flow applications.

Table 2 :
SA taxonomy by PE execution model.

Table 3 :
Comparison of Marionette's ability to handle control flow with other state-of-the-art architectures

suppose 2×4 PE array Loop j BB 1 Loop k BB 2 1. First innermost Loop k, generate Mapping 1. BB 2 Mapping 2. Then middle Loop j, generate Mapping 2. BB 2 PE_waste=large PE_waste=small 3. Last outermost Loop i, generate Mapping 3.
To accelerate the configuration process of subsequent PEs, we innovate Proactive PE Configuration and develop Control Flow Sender that send control information at the earliest possible time.The data flow part of the PE currently

Table 4 :
Area and power breakdown (28nm) We have developed a cycle-level accurate simulator.It uses the binary configuration file output by the compiler to verify the functional correctness of the Marionette and to evaluate the performance.

Table 6 :
Figure 11 shows the speedup of our Marionette PE with Proactive PE Configuration compared to von Neumann PE and dataflow PE.The results show that the Marionette PE outperforms von Neumann PE by geomean 1.18× and up to 1.45× (Merge Sort).Moreover, it outperforms dataflow PE by geomean 1.33× and up to 1.76× (GEMM).Further, we separately count the proportion of operators under the branch.The ratio can expose the utilization waste caused by the static mapping of von Neumann execution model.Merge Sort has the highest branch subsequent PE ratio due to the Proactive Configuration saving the most PE resources.Due to the pipeline II, the data flow PE still has poor performance even if it has some flexibility.Comparison of network area ( 2 ) in state-of-theart architectures (normalized to 28nm, 32-bit, 4 × 4 PE array.)