Portable Implementations of Work Stealing

Work stealing is a well-known technique for dynamic load balancing; however, manually writing work-stealing protocols is error-prone. We can use the Tascell parallel programming language for the correct and portable implementation of work stealing; the implementation combines polling and adequate mutual exclusion. In Tascell, we can express on-demand concurrency for backtracking-based load balancing where a worker performs a sequential computation with its own execution stack unless it is requested to spawn a task. To spawn a larger task by temporarily backtracking, nested functions can be used for legitimate execution stack access. As nested functions for extended C languages, we can use GCC’s heavyweight implementation with runtime code generation or lightweight implementations by enhancing GCC; however, compiler-based implementations are poor in portability. In this study, we implement and evaluate more portable Tascell frameworks called “Tascell/SC” by using transformation-based portable implementations of nested functions. In addition, we propose Tascell-inspired portable frameworks only in C++ called “Tascell++” by using lambda expressions in C++11 for legitimate execution stack access.


INTRODUCTION
Work stealing is a useful dynamic load balancing technique for many parallel applications. Unlike other coarse-grain task-based load balancing techniques such as work sharing and master-worker, work stealing supports irregular applications written in a divide-and-conquer style. An idle worker (thief) steals a task (a piece of work) from another loaded worker (victim), which is selected randomly or by a more sophisticated policy. (Nested) FORALL-style algorithms can easily be converted into divide-and-conquer (recursive) algorithms, affording better cache locality in many cases. Better data locality generally reduces cache misses, TLB misses, and page faults.
Users should use good work-stealing frameworks rather than manually writing their own work-stealing protocols, because work stealing involves intentional subtle races. By "subtle," we mean that things tend to be incorrect without careful handling by experts. For example, some techniques require an elaborate protocol with store-load memory barrier (fence) instructions on modern parallel architectures to ensure that each potential task is extracted exactly once by the owner or a thief.
Hiraishi et al. proposed a "logical thread"-free parallel programming/execution framework called Tascell [7] as an efficient work-stealing framework. As will be shown in Section 2, Tascell implements backtracking-based load balancing with on-demand concurrency. A worker performs a computation sequentially unless it receives a task request with polling. When requested, the worker spawns a "real" task by temporarily "backtracking" and restoring its oldest task-spawnable state. Unlike LTC [14] and Cilk [6], the thief steals the newly spawned task (not the oldest thread). Since no logical threads are created as potential tasks, the cost of managing a queue for them can be eliminated. Tascell also promotes the long-term (re)use of workspaces (such as arrays and other mutable data structures) and improves the locality of references because it does not have to prepare a workspace for each concurrently runnable logical thread.
For the correct and portable implementation of work stealing, every Tascell worker combines polling and adequate mutual exclusion (which uses adequate memory barrier instructions internally). Every Tascell worker can perform a sequential computation with its own execution stack, while nested functions can be used for legitimate execution stack access (called LESA for short). With LESA mechanisms, a running worker can legitimately access data deep in execution stacks (C stacks consisting of stack frames). Every use of Tascell's parallel construct owns a nested function as a task request handler. By invoking a series of lexical closures of nested functions with their environments (i.e., stack frames), the worker spawns a "real" task by temporarily "backtracking" and restoring its oldest task-spawnable state.
GCC supports nested functions as an extension to C [2]. To make them compatible with ordinary top-level functions, GCC dynamically generates an instruction sequence on its execution stack, where the instruction sequence (called a trampoline) sets up a static link and jumps. Note that this technique often incurs considerable creation costs for flushing the data/instruction cache.
We can use lightweight implementations of nested functions (as LESA mechanisms) by enhancing GCC [21,24,25]; however, compiler-based implementations are poor in portability.
In this study, we implement and evaluate more portable Tascell frameworks called "Tascell/SC" by using transformation-based portable implementations of nested functions such as CL-SC2 [26] as portable and efficient LESA mechanisms. In [26], CL-SC2 proved effective for implementing garbage collection and first-class continuations for a Scheme interpreter.
In addition, we propose new Tascell-inspired portable frameworks written only in C++, called "Tascell++", by using lambda expressions in C++11 as LESA mechanisms. Note that the std::function class template for accepting various closures of lambda expressions is standard but incurs significant overhead. To avoid this overhead, we developed a custom virtual function with a custom derived class template.
The main contribution of this paper is three-fold.
• We implement and evaluate Tascell/SC by using CL-SC2 as a portable and efficient LESA mechanism for work stealing.
• We propose new Tascell-inspired portable frameworks, Tascell++, written only in C++ by using lambda expressions in C++11 as LESA mechanisms with an efficient type-erasure technique.
• We confirm good portability and performance of these implementations of work stealing on top of Xeon Phi, AMD EPYC, and FX700 (an ARM-based parallel computer).
The remainder of this paper is organized as follows. Section 2 introduces backgrounds, such as implementations of work stealing (related work), the Tascell framework and its implementation [7] using nested functions as LESA mechanisms, compiler-based implementations of nested functions, and transformation-based implementations of nested functions. Section 3 describes a variety of evaluation environments and methods to evaluate portability and performance. Section 4 examines Tascell/SC in terms of its implementations with LW-SC [8] (and LW-SC2 [21]) and CL-SC2, overheads, and parallel efficiency. Section 5 proposes new Tascell-inspired frameworks called Tascell++ only in C++ by using lambda expressions in C++11 as LESA mechanisms; we examine Tascell++ in terms of its implementations with std::function and a custom integrated type using a type-erasure technique, overheads, and parallel efficiency. Section 6 presents further discussions of portability and performance of Tascell/SC and Tascell++. Finally, we conclude this paper in Section 7.
Almost all frameworks except StackThreads/MP [20] and Tascell [7] employ deques. The owner and thieves share a deque of (stack) frames for work stealing. Deques can be protected with mutual exclusion or other elaborate shared-memory protocols on tails and heads with adequate memory-barrier instructions; however, memory-barrier instructions are not very portable and also incur substantial overhead.
Memory-barrier instructions can be avoided if the worker has its own private deque.The message passing implementation of LTC [4] uses a polling technique for accepting steal requests and making deques private.
Some multithreaded languages [19,22] use a polling technique for delaying (and avoiding unless requested) creation of stealable continuation frames, providing good performance even with public deques.
StackThreads/MP [20] uses a polling technique while managing stack frames so that thread migration is permitted without moving stack frames.

Tascell
The Tascell framework [7] is a load-balancing framework that consists of a compiler for the Tascell language and a runtime system. Although this framework can run on both distributed and shared memory environments, we consider only shared memory environments in this paper. As in [7], Tascell exhibits good performance since it employs a polling technique and no logical threads are created as potential tasks unless requested.
A Tascell worker spawns a task by temporarily backtracking and restoring its oldest task-spawnable state. That is, when a worker receives a task request, (1) it temporarily backtracks (goes back to the past), (2) spawns a task (and changes the execution path to receive the result of the task), (3) returns from the backtracking, and (4) resumes its own task.
The Tascell worker always chooses not to spawn a task at first and performs sequential computation (as an extreme form of the work-first principle). However, when the worker receives a task request, it spawns a task as if it changed the past choice.
For performing a temporary backtracking, we can employ LESA mechanisms.
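The four steps of a task request can be illustrated with a small C++ sketch. It is a single-threaded model written for this explanation, not the code that the Tascell compiler generates: Worker, Handler, and the spawned list are hypothetical names, the thief is simulated by running the spawned task inline, and std::function is used only for brevity.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical worker state for a single-threaded illustration.
struct Worker {
    bool requested = false;      // set when an idle worker asks for a task
    std::vector<int> spawned;    // argument n of every task spawned on demand
};

// A handler tries to spawn a task and reports whether it succeeded.
using Handler = std::function<bool()>;

long fib(Worker& w, const Handler* older, int n) {
    assert(n >= 1);
    // Polling: on a request, (1) backtrack through the chain of older handlers.
    if (w.requested && older && (*older)()) {
        w.requested = false;     // (3) return from the backtracking
    }
    if (n <= 2) return 1;

    bool spawned = false;
    Handler bk = [&]() -> bool {
        if (older && (*older)()) return true;  // an older frame spawned a larger task
        if (spawned) return false;             // this frame already gave its task away
        spawned = true;                        // (2) spawn fib(n - 2), change the path
        w.spawned.push_back(n - 2);
        return true;
    };

    long s1 = fib(w, &bk, n - 1);              // (4) the worker keeps running its own task
    long s2;
    if (spawned) {
        Worker thief;                          // stand-in for the thief, run inline here
        s2 = fib(thief, nullptr, n - 2);
    } else {
        s2 = fib(w, older, n - 2);             // nobody asked: stay sequential
    }
    return s1 + s2;
}
```

In this model, a request that is pending when fib(w, nullptr, 10) starts is served by the outermost frame, so the spawned task is fib(8), the largest one available; without a request, no task is created at all.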
Tascell outperformed Cilk-5 and Cilk Plus, as shown in [7] and [27], respectively. Moreover, we can parallelize some "highly serial" applications [16,18] in a straightforward manner, in which a worker continuously and serially updates a single workspace; this is because Tascell exhibits the following characteristics:
• While a Tascell worker performs a sequential computation, it can reuse a single workspace, whereas a logical thread typically requires its own workspace.
• When a new task is spawned, the victim's workspace can be copied for the thief. Because a task is spawned only when it is requested by idle workers, workspace copying can occur only when it is actually required.
As real-world applications, itemset-sharing subgraph extraction [17] and hierarchical matrix construction [1] have been implemented in Tascell.

The Implementation of Tascell Using Nested Functions as LESA Mechanisms
When GCC's nested functions are employed, the Tascell compiler translates a Tascell program in Fig. 1 into an extended C program in Fig. 2. In Fig. 1, the do-two construct takes two statements (= s1 (fib (- n 1))) and (= s2 (fib (- n 2))) with a handler for fib.
Unless requested, a worker executes two statements sequentially.
During the execution of (= s1 (fib (- n 1))), a task request handler do_two_bk as a nested function in Fig. 2 is available. The task request handler is invoked with mutual exclusion when a task request is detected with polling as in lines 20-21 in Fig. 2. The task request handler do_two_bk is also passed as a parameter as in "s1 = fib(do_two_bk, _thr, n-1)" so that do_two_bk can handle a task request as an older handler.
After continuing backtracking to older handlers via _bk(), do_two_bk handles a request by allocating a task object as pthis, executing the :put block (e.g., "(*pthis).n = n-2" for (= this.n (- n 2))), executing "spawned = 1" to change the execution path to receive the result of the task, and calling make_and_send_task. After the change, the victim does not execute (= s2 (fib (- n 2))); instead, it waits for the result of the stolen task and executes the :get block ("s2 = (*pthis).r" for (= s2 this.r)). Similarly to Leapfrogging [23], when a victim needs the result of a stolen task but the result is not available yet, the victim tries to steal back a new smaller task from the thief of the stolen task.

Compiler-based Implementations of Nested Functions
GCC is widely used and is considered very reliable (stable). GCC supports nested functions as an extension to C [2]. GCC dynamically generates a trampoline on its execution stack; the trampoline sets up a static link and jumps. As a lightweight alternative, Yasugi et al. developed an extended C compiler [24] by enhancing GCC 3.4.6 at that time. After GCC 4, the internals of the GCC compiler changed significantly, which made it difficult to enhance GCC for L-closures.
As implementations of L-closures based on translators from an extended SC language with nested functions (called SC-NF) into standard C, Hiraishi et al. developed LW-SC [8], published in 2006, which reduces the creation/maintenance costs of L-closures by using the "execution stack reconstruction technique," where each worker (thread) lazily maintains an explicit stack separate from the usual C execution stack. Tazuke et al. developed another implementation, LW-SC2 [21], published in 2013, which reduces the invocation costs of L-closures as well as their creation/maintenance costs by using the "frame-by-frame restoration technique." CL-SC2 [26] is a transformation-based implementation of M-closures, a LESA mechanism; it is realized as a translator into standard C that follows the main idea of the compilation techniques for M-closures used in enhancing GCC 4.6.3 [21].
In the evaluation with a Scheme interpreter [26], CL-SC2 achieves better performance than LW-SC and LW-SC2.In this paper, we mainly examine CL-SC2 as a transformation-based implementation of a LESA mechanism.

EVALUATION ENVIRONMENTS AND METHODS
We used three evaluation environments: Xeon Phi, AMD EPYC, and FX700. The details of these environments are summarized in Table 1. These environments vary in manufacturing date, software environment, and instruction set (processor architecture) so that portability can be examined. For Tascell/SC, we used Common Lisp for translating Tascell programs into C. We used the following programs:
• Fib(n): recursively computes the n-th Fibonacci number.
• AreaSum(n): approximately computes the area of a quadrant by recursively dividing a square containing it into rectangles and summing up their contributions until the rectangles do not cross the boundary or their areas are less than (10.0)^(-n).
• Histogram(n): performs a simple irregular backtrack search for non-one GCDs among k integers (k = 7) in [2, 2 + n), generating a histogram.
• MatMul(n): computes the multiplication of n × n matrices using a cache-oblivious recursive algorithm.
We intentionally use small problem sizes (Fib(41), AreaSum(14), Histogram(25), and MatMul(1000)) to make parallelization harder. We also intentionally use no threshold (such as "n < 20" for Fib(n)) for switching to pure serial versions in the Fib (as in Figures 1, 2, 4, and 6 to 8) and AreaSum programs, in order to see the "exposed" overhead. With 1-worker executions, we evaluate the performance using execution time. Later in Tables 2, 3, 4, 6, 7, 8, and 9, we present the numbers in fixed-point notation with four significant figures so that their magnitudes are visible textually.
With multiple workers, we evaluate the performance mainly using parallel efficiency and execution time. Parallel efficiency is defined as S/n_w, where S is the speedup relative to a serial C program (baseline) and n_w is the number of workers (i.e., S/n_w = 1 means an ideal speedup). Since we intentionally use small problem sizes, obtaining good scalability is challenging, and the results reflect the overheads of backtracking-based work stealing in Tascell. Although this study's goals are the efficiency and portability of work-stealing frameworks, we will compare them with OpenMP's task-based work sharing later in Fig. 5. Work sharing is discussed in Section 6.1.
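As a small worked illustration of this metric, the following C++ sketch computes the speedup relative to the serial baseline and divides it by the number of workers. The function and parameter names are chosen here for illustration and do not come from the evaluation scripts; any timings passed in are the caller's own measurements.

```cpp
#include <cassert>

// Speedup of a parallel run relative to the serial C baseline.
double speedup(double t_serial, double t_parallel) {
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

// Parallel efficiency: speedup divided by the number of workers.
// A value of 1.0 corresponds to an ideal speedup.
double parallel_efficiency(double t_serial, double t_parallel, int n_workers) {
    assert(n_workers > 0);
    return speedup(t_serial, t_parallel) / n_workers;
}
```

For example, a hypothetical run that takes 8 s serially and 2 s with 8 workers has a speedup of 4 and a parallel efficiency of 0.5.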

TASCELL/SC
We call the Tascell implementations using transformation-based implementations of nested functions Tascell/SC.

Tascell/LW-SC
Prior work [7,15] used LW-SC [8]; however, CL-SC2 achieves better performance than LW-SC (and LW-SC2) in a Scheme interpreter [26]. LW-SC (and LW-SC2) can delay the initialization of closures and the maintenance of coherence between private and shared locations, but this laziness incurs costs for judging whether to delay. Although the Scheme interpreter makes heavy use of small functions, the comparison suggests that CL-SC2 also performs better in Tascell/SC.

Tascell/CL-SC2
CL-SC2 translates an extended SC (SC-NF) program similar to Fig. 2 into a standard C program in Fig. 3 and Fig. 4. The nested function becomes a top-level function in Fig. 3. The fib function in Fig. 4 shares common fields of a structure (via efp of struct fib_frame *) with the top-level function in Fig. 3 (via xfp of struct fib_frame *). The common structure serves as an environment representation of M-closures.
A main difference between CL-SC2 and LW-SC (and LW-SC2) is that CL-SC2 is not based on laziness, so it incurs no delay judgment costs.
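The general idea of such a transformation (a nested function lifted to top level, sharing a frame structure with its enclosing function) can be sketched as follows. This is a simplified illustration in C-style C++, not the output of CL-SC2: demo_frame, closure_t, handler_in_demo, and call_closure are hypothetical names, and only efp and xfp echo the variable names mentioned above.

```cpp
#include <cassert>

// Variables shared between the enclosing function and its former nested
// function live in an explicit frame structure (hypothetical layout).
struct demo_frame {
    int n;
    int spawned;
};

// A closure is a pair: a code pointer and a pointer to the shared frame.
struct closure_t {
    void (*fun)(void* env, int* out);
    void* fr;
};

// The nested function after lifting; xfp plays the role of the static link.
static void handler_in_demo(void* env, int* out) {
    demo_frame* xfp = static_cast<demo_frame*>(env);
    xfp->spawned = 1;          // visible to the enclosing function through efp
    *out = xfp->n - 2;         // argument of the task being given away
}

// Invoking a closure passes its environment as an ordinary argument.
static void call_closure(const closure_t* c, int* out) { c->fun(c->fr, out); }

int run_with_handler(int n) {
    demo_frame frame{n, 0};
    demo_frame* efp = &frame;
    closure_t bk{handler_in_demo, efp};   // closure creation is just writing a pair
    int task_arg = 0;
    call_closure(&bk, &task_arg);
    assert(efp->spawned == 1);            // the handler updated the shared frame
    return task_arg;
}
```

Because creating the closure only stores two pointers, no code is generated at run time and no cache flush is needed, which is the contrast with trampolines that the overhead results examine.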

Overheads
Tables 2, 3, and 4 show the performance of serial C programs. The serial C program (with gcc in the case of Table 4) is almost always the fastest and is used as the baseline before parallelization. All transformed work-stealing programs based on CL-SC2 ran in all three evaluation environments, which confirms high portability.
Tables 2 and 3 also show 1-worker executions of Tascell programs using nested functions of LW-SC and LW-SC2 on Xeon Phi and AMD EPYC. We can see that LW-SC and LW-SC2 perform only marginally better than the trampoline-based implementation, or even worse.
Tables 2, 3, and 4 also show 1-worker executions of Tascell programs using nested functions based on the GCC extension (trampoline) and based on CL-SC2. We can see that CL-SC2 shows better performance than trampoline for Fib, AreaSum, and Histogram.
Table 4 shows considerable closure creation cost with trampoline for flushing the data and instruction caches. On FX700, caches should be flushed by using costly instructions, such as the "dc cvau" instruction for cleaning the data cache, the "dsb ish" instruction for synchronization, and the "ic ivau" instruction for invalidating the instruction cache, whereas CL-SC2 simply writes a pair, as in lines 9 and 10 in Fig. 4, for the closure invocation at line 15 in Fig. 3. From Table 4, CL-SC2 shows up to ×5.14 better performance than GCC's trampoline-based implementation.

Parallel Efficiency
Fig. 5 shows the parallel efficiency and execution times of Tascell programs using nested functions of the GCC extension and CL-SC2. Later in Section 5.4, two Tascell++ (C++) programs (CLR and ptr) using lambda expressions as LESA mechanisms will be discussed. OpenMP programs will be discussed in Section 6.1. The performance and parallel efficiency of CL-SC2 are better than those of trampoline for Fib(41) (gcc, fcc), AreaSum(14) (gcc, fcc), and Histogram(25) (gcc). Using CL-SC2 is effective for applications that frequently create closures with frequent recursive calls.

TASCELL++

New Frameworks Based on C++
In this study, we propose new Tascell-inspired portable frameworks in C++ called "Tascell++" by using lambda expressions in C++11 as LESA mechanisms. We developed Tascell++ classes, such as ParEnv, WorkerEnv, Task0, and TaskReq, in C++.

Lambda Expressions in C++ as LESA Mechanisms
We directly utilize and evaluate lambda expressions in C++11 as a portable and efficient LESA mechanism for Tascell++. Based on Fig. 2, we can write a C++ program as in Fig. 6. However, although simple use of the standard std::function class template for accepting closures of lambda expressions is general, it incurs anomalous overheads.
In order to pass closures of lambda expressions to another function, the std::function class template is commonly used. A template class based on std::function can be used as an integrated type for passing closures as arguments to other functions and receiving them as parameters. We developed several integrated types. Table 5 shows the name and features of each type: (1) the integrated type of variables like bk for passing/receiving closures of various lambda expressions, (2) how to reserve a location to hold the entity of a lambda-expression-specific type, (3) how to type-cast into the integrated type, and (4) how to invoke a received closure of a lambda expression via variable bk of the integrated type.
Instead of simple use (fun), we can use a const lvalue reference (CLR) of std::function template type, namely const std::function<void(TaskReq*)>&, as an integrated type.
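The difference between passing a std::function by value and by const lvalue reference can be observed with a copy-counting callable. The sketch below is only an illustration of standard std::function behavior: CountingHandler, pass_by_value, pass_by_clr, and copies_when_passing are hypothetical names, and TaskReq is an empty placeholder rather than the Tascell++ class.

```cpp
#include <cassert>
#include <functional>

struct TaskReq {};   // placeholder for the request type

// A callable that counts how often it is copy-constructed.
struct CountingHandler {
    static int copies;
    CountingHandler() = default;
    CountingHandler(const CountingHandler&) { ++copies; }
    CountingHandler(CountingHandler&&) noexcept {}
    void operator()(TaskReq*) const {}
};
int CountingHandler::copies = 0;

using Fun = std::function<void(TaskReq*)>;

// Passing by value copies the std::function and therefore its stored closure.
void pass_by_value(Fun bk, TaskReq* r) { bk(r); }

// Passing by const lvalue reference only binds a reference.
void pass_by_clr(const Fun& bk, TaskReq* r) { bk(r); }

// Returns the number of closure copies caused by one call.
int copies_when_passing(bool by_value) {
    Fun bk = CountingHandler{};
    CountingHandler::copies = 0;   // ignore copies made while building bk
    TaskReq req;
    if (by_value) {
        pass_by_value(bk, &req);
    } else {
        pass_by_clr(bk, &req);
    }
    return CountingHandler::copies;
}
```

In a recursive program such as Fib, the handler is passed at every call, so a per-call copy of the closure is paid at every level of the recursion, whereas the reference version pays none.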
Based on a type-erasure technique, we developed a custom virtual function void invoke(TaskReq* x) in a class (structure) struct treq2v and a custom derived class template, template <class F> struct treq2v_imp : treq2v, for forwarded invocation of lambda closures, as in Fig. 7.
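A minimal sketch of this kind of type erasure is shown below. Only the names treq2v, treq2v_imp, invoke, and TaskReq come from the text; the member layout, the helper make_treq2v, the deliver function, and the granted field are assumptions made for this illustration and may differ from Fig. 7.

```cpp
#include <cassert>
#include <utility>

struct TaskReq { int granted = 0; };   // placeholder with a hypothetical field

// Base class: the integrated type through which any closure can be invoked.
struct treq2v {
    virtual void invoke(TaskReq* x) = 0;
protected:
    ~treq2v() = default;   // objects are never destroyed through the base
};

// Derived class template: holds one closure of lambda-specific type F
// and forwards invoke() to it.
template <class F>
struct treq2v_imp final : treq2v {
    F f;
    explicit treq2v_imp(F&& g) : f(std::move(g)) {}
    void invoke(TaskReq* x) override { f(x); }
};

// Hypothetical helper that deduces F from the lambda expression.
template <class F>
treq2v_imp<F> make_treq2v(F f) { return treq2v_imp<F>(std::move(f)); }

// Receivers see only a pointer to the base type.
void deliver(treq2v* bk, TaskReq* r) {
    if (bk) bk->invoke(r);
}

int demo_type_erasure(int n) {
    int spawned = 0;
    // The holder lives in this stack frame; no heap allocation is involved.
    auto holder = make_treq2v([&](TaskReq* r) {
        spawned = 1;
        r->granted = n - 2;
    });
    TaskReq req;
    deliver(&holder, &req);
    assert(spawned == 1);
    return req.granted;
}
```

The design choice illustrated here is that the closure stays in the creating function's stack frame and only a base-class pointer travels down the call chain, so passing the handler costs one pointer copy and invoking it costs one virtual call.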

Overheads
Tables 6, 7, 8, and 9 show 1-worker executions of five Tascell++ (C++) programs (with g++ or FCC) using lambda expressions in C++ as in Table 5. As explained in Section 3, magnitudes of execution time are expressed textually with four significant digits in fixed-point notation. fun and funR perform worse than the other C++ implementations. Unlike CLR, fun and funR require a deep copy and a shallow copy, respectively, when passing bks.
The proposed customized types such as ptr and ptrR significantly reduce the overheads of CLR. From Table 7, ptr runs up to ×6.45 faster than CLR. From Table 8, ptr runs up to ×5.79 faster than CLR.
From Table 9, the programs compiled with FCC did not perform well except for MatMul, but the results show evidence of portability.

Parallel Efficiency
Fig. 5 shows the parallel efficiency and execution times of two Tascell++ (C++) programs (CLR and ptr) using lambda expressions as LESA mechanisms. Tascell/SC programs using nested functions of the GCC extension and CL-SC2 were discussed in Section 4.4. OpenMP programs will be discussed in Section 6.1. We will compare Tascell/SC and Tascell++ in Section 6.2.

Comparison of Tascell/SC and Tascell++
Fig. 5 shows that which of Tascell/SC using CL-SC2 and Tascell++ using ptr performs better depends on the evaluation environment. Since Tascell++ is new, it has room for further improvement, such as more sophisticated selection of victims and appropriate regulation of task requests (steal attempts).

Distributed Memory Environments
Tascell/SC can run in distributed memory environments by allowing work stealing among different computing nodes. In the early implementation of Tascell, TCP/IP was used for internode communication [7]. Later, an implementation using the MPI library was proposed [15]. Support for distributed memory environments in Tascell++ is currently under development. It should be possible to implement it using the same approach as Tascell/SC.

With the doTwo function, the complex 25 lines 33-57 in Fig. 7 can be expressed as the simple 7 lines 49-55 in Fig. 8. Note that the third lambda expression (corresponding to the :put block in Tascell) needs to be expanded into a LESA lambda expression in doTwo(...) for backtracking. The latest versions of g++ are likely to succeed in such inline expansion, but further research will be required for performance portability.

CONCLUSION AND FUTURE WORK
In this paper, we examine a transformation-based implementation CL-SC2 and C++ lambda expressions as LESA (legitimate execution stack access) mechanisms for the Tascell work-stealing framework or the Tascell++ framework (Tascell-inspired framework in C++).
In 1-worker executions, CL-SC2 shows up to ×5.14 better performance than GCC's trampoline-based implementation on FX700. To avoid the anomalous overheads of naive use of std::function for C++ lambda expressions, we can use a const lvalue reference (CLR) of std::function template type as an integrated type. Furthermore, based on a type-erasure technique, we developed a custom virtual function with a custom derived class template. This type-erasure approach runs up to ×5.79 faster than a CLR of std::function template type on FX700.
Future work includes the support of Tascell's advanced parallel for construct and dynamic-wind construct for backtracking in Tascell++, where these constructs may be efficiently implemented as higher-order functions if C++ compilers expand all of them.Future work also includes implementations and evaluations in distributed memory environments.

Figure 2: Translation result from the worker function fib in Fig. 1, including translation of a do-two statement.

Figure 3: Translation result from the nested function in Fig. 2

Figure 8: C++ program with the doTwo function and its use similar to Tascell's do-two construct so that plain C++ compilers can be used instead of the Tascell compiler.

doTwo Functions in C++ (more sophisticated Tascell++)
We are examining a more sophisticated Tascell++ (C++) program as in Fig. 8. This program employs an application-independent function doTwo(...) similar to Tascell's do-two construct. The higher-order function doTwo(...) can take four lambda expressions as in the fib function in Fig. 8. The usage of doTwo(...) is similar to that of Tascell's do-two; that is, four lambda expressions for doTwo(...) correspond to four statements/blocks for do-two. If a C++ compiler successfully expands all of the doTwo(...) function and four lambda expressions inline (without the use of the existing Tascell compiler), the result of the compile-time inline expansion corresponds to Fig. 7.
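To make the idea of a higher-order do-two concrete, the following sketch shows one possible shape of such a function taking four lambda expressions (first statement, second statement, :put block, :get block). It is a single-threaded model under stated assumptions, not the program of Fig. 8: the signature of doTwo, the types Worker, Handler, and FibTask, and the inline execution of the stolen task are all hypothetical.

```cpp
#include <cassert>
#include <functional>

struct Worker { bool requested = false; int spawned = 0; };
using Handler = std::function<bool()>;   // tries to spawn a task

// Hypothetical higher-order do-two: s1 and s2 are the two statements,
// put fills a task object on demand, get receives the task's result.
template <class TaskT, class S1, class S2, class Put, class Get>
void doTwo(Worker& w, const Handler* older, S1 s1, S2 s2, Put put, Get get) {
    TaskT task;
    bool spawned = false;
    Handler bk = [&]() -> bool {
        if (older && (*older)()) return true;   // older frames are asked first
        if (spawned) return false;
        put(task);                              // :put block
        spawned = true;
        ++w.spawned;
        return true;
    };
    s1(&bk);                  // first statement runs with bk as its handler
    if (!spawned) {
        s2(older);            // no request arrived: run sequentially
        return;
    }
    task.run();               // stand-in for waiting on the thief
    get(task);                // :get block
}

long fib(Worker& w, const Handler* older, int n);

struct FibTask {
    int n = 0;
    long r = 0;
    void run() { Worker thief; r = fib(thief, nullptr, n); }
};

long fib(Worker& w, const Handler* older, int n) {
    assert(n >= 1);
    if (w.requested && older && (*older)()) w.requested = false;   // polling
    if (n <= 2) return 1;
    long s1 = 0, s2 = 0;
    doTwo<FibTask>(w, older,
        [&](const Handler* bk) { s1 = fib(w, bk, n - 1); },
        [&](const Handler* bk) { s2 = fib(w, bk, n - 2); },
        [&](FibTask& t) { t.n = n - 2; },
        [&](FibTask& t) { s2 = t.r; });
    return s1 + s2;
}
```

In this model the application code contains no backtracking logic of its own; whether a compiler can inline doTwo and the four closures well enough to match hand-written code is exactly the open question raised above.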

Table 1: Specifications of three evaluation environments

Hiraishi et al. and Tazuke et al. have implemented L-closures by developing translators from an extended S-expression-based C language into the standard C language [8, 21]. For transforming programs written in (extended) S-expression-based C languages (called SC languages), we can employ the SC language system [9]. Transformation-based implementations of LESA mechanisms are preferable to enhanced-C-compiler-based implementations in terms of portability and development costs. Yasugi et al. developed the extended C compiler [24].

Table 2: The execution time (in seconds) of 1-worker executions of serial C and Tascell/SC (Xeon Phi)

Table 3: The execution time (in seconds) of 1-worker executions of serial C and Tascell/SC (EPYC)

Table 4: The execution time (in seconds) of 1-worker executions of serial C and Tascell/SC with gcc and fcc (FX700)

Table 5: List of names and how to employ and handle (a variable bk of) an integrated type for various lambda expressions.

Table 6: The execution time (in seconds) of 1-worker executions of serial C and Tascell++ (Xeon Phi)

Table 7: The execution time (in seconds) of 1-worker executions of serial C and Tascell++ (EPYC)

Table 8: The execution time (in seconds) of 1-worker executions of serial C and Tascell++ with g++ (FX700)

Table 9: The execution time (in seconds) of 1-worker executions of serial C and Tascell++ with FCC (FX700)