Abstract
In this work, we present libHetMP, a new OpenMP runtime that transparently distributes threads and parallel work across cache-incoherent heterogeneous-ISA CPUs.
1 INTRODUCTION
In recent years there has been a shift toward increasingly heterogeneous platforms to cope with the slowdown of Moore’s Law [54]. As chip designers have faced resistance in scaling single-core [53] and multicore [20] performance due to physical limitations, they have responded by incorporating more specialized processors into systems [46, 47]. These emerging heterogeneous systems are increasingly necessary to deal with future challenges, e.g., Amazon has begun offering cloud instances with different types of CPUs to match analytics workloads [9], and the Summit supercomputer combines CPUs and GPUs for enhanced performance and power efficiency [2]. The path forward for tackling these challenges is through increasing architectural diversity.
Chip manufacturers have begun diversifying server-grade CPU designs to strike different balances among single-threaded performance, parallelism, and energy efficiency. For example, Intel Xeon [32] CPUs package tens of cores with high single-threaded performance, whereas Cavium ThunderX2 [45] CPUs instead package a large number of lower-performance cores. At the same time, chip designers have begun tightly coupling heterogeneous compute elements for power and performance benefits—for example, Intel’s Agilex platform combines Xeon CPUs with FPGAs and ARM processors into a single physical processor package [5]. These trends suggest future platforms may provide greater architectural diversity by integrating asymmetric general-purpose server-grade CPUs into a single motherboard or package.
We envision a system with heterogeneous CPUs, each of which has its own physical memory, connected by either a point-to-point connection such as PCIe [1] or a fast memory bus such as AMD’s Infinity Fabric [33]. These CPUs will also likely use heterogeneous instruction set architectures (ISAs), which have been shown to provide better performance and energy efficiency than asymmetric single-ISA CPUs [19, 55]. For developers, the ability to optimize performance and energy efficiency of applications on such systems will be essential, but leveraging diverse architectures together is a daunting task. While cores within a homogeneous set of CPUs, termed a node, provide cache coherence, there is no coherence between heterogeneous CPUs. This necessitates software memory consistency and data movement between discrete memory regions. Additionally, because these CPUs use different ISAs, data must be marshaled to be shared. Data must either be laid out in a common format by the compiler or dynamically transformed when transferred at runtime. Traditional programming models like MPI [23] are a poor fit for programming these tightly coupled heterogeneous servers as they require developers to manually manage memory consistency and work distribution.
Recent works such as K2 [34] and Popcorn Linux [7, 8, 30] instead use distributed shared memory (DSM) on tightly coupled heterogeneous CPU systems for better programmability, transparency, and flexibility. In addition to supporting cross-ISA execution migration [7, 19, 55], these systems provide transparent and on-demand data marshaling between nodes. Because of this transparency, multiple discrete memory regions appear as shared memory to applications. Distributing parallel computation across heterogeneous-ISA CPUs becomes simpler as parallel runtimes can assign work items to CPUs and let the DSM transparently marshal data. On-demand marshaling is expensive, however, and can have a significant performance impact in cross-node execution.
For tightly coupled heterogeneous CPU platforms, the challenge is how to optimally distribute parallel work to balance per-CPU performance against DSM communication overheads. For example, Figure 1 shows the execution time of three OpenMP HPC benchmarks when run on an Intel Xeon E5-2620v4, a Cavium ThunderX, and when utilizing both simultaneously with libHetMP.
Fig. 1. Execution time of OpenMP benchmarks with work-sharing regions executed entirely on an x86 Intel Xeon, entirely on an ARM Cavium ThunderX, and when leveraging both with libHetMP. How can the runtime automatically determine the best workload distribution configuration across heterogeneous CPUs to optimize performance?
We present libHetMP, an OpenMP runtime that transparently distributes threads and parallel work across cache-incoherent heterogeneous-ISA CPUs, and the HetProbe loop iteration scheduler, which automatically determines the best workload distribution by monitoring a small number of probe iterations.
The HetProbe scheduler is designed for work sharing regions with loops where each loop iteration performs a constant and equal amount of work. This assumption allows the HetProbe scheduler to make workload distribution decisions by monitoring the behavior of a small number of probe iterations. For irregular applications where the behavior of the benchmark varies over time, e.g., changing memory access patterns or types of computation, predicting execution behavior (including cross-node DSM traffic) becomes significantly more difficult.
To address these irregular workloads, we extend HetProbe with HetProbe-I, a scheduler that periodically re-probes during execution and redistributes the remaining loop iterations as application behavior changes.
In this work we make the following contributions:
- The design and implementation of libHetMP, a new OpenMP runtime that distributes threads and parallel work across cache-incoherent heterogeneous CPU systems without programmer intervention;
- Extensions to shared memory OpenMP synchronization primitives and loop iteration schedulers to adapt execution to heterogeneous CPUs and minimize DSM overheads;
- Measurement tools built into the runtime that monitor metrics such as data transfer costs and hardware performance counters to make workload distribution decisions;
- The HetProbe loop iteration scheduler, which uses these metrics to automatically determine where to place computation to minimize DSM overheads and leverage architectural diversity to achieve the best performance;
- An evaluation of libHetMP using 10 benchmarks from three benchmark suites that shows up to a 4.7× speedup when work sharing across a Xeon and ThunderX versus homogeneous execution on the Xeon. We also show the HetProbe scheduler makes the right workload distribution choice in all benchmarks, including evaluating decisions on two interconnects;
- The HetProbe-I loop iteration scheduler, which periodically generates performance profiles and redistributes loop iterations into global and local work pools for the nodes. The performance evaluation of this novel scheduler shows a speedup of up to 24% for a class of irregular workloads.
2 BACKGROUND
OpenMP specifies a set of directives that developers add to applications to parallelize execution. The compiler is responsible for converting OpenMP directives into calls into the OpenMP runtime, which spawns teams of threads, partitions parallel work between threads, and provides synchronization capabilities. For loop work sharing regions as shown in Listing 1 (e.g., loops annotated with #pragma omp parallel for), the runtime divides the loop's iterations among the threads in the team.
OpenMP assumes architectural uniformity and current implementations do not target heterogeneous-ISA CPUs. To support execution across such CPUs, the system software (compiler, OS, runtime) must provide a shared-memory abstraction. Even if this abstraction exists, optimizing OpenMP for heterogeneous CPU systems requires re-designing how parallel work is assigned to CPUs in consideration of system and interconnect performance characteristics. Before describing libHetMP's design, we first describe the system software infrastructure on which it builds.
Heterogeneous-ISA Execution. Unlike ARM’s big.LITTLE architecture [21], which provides cache-coherence across same-ISA heterogeneous cores, there exist no server-grade cache-coherent heterogeneous CPUs. Past systems that couple together overlapping-ISA architectures (e.g., Xeon/Xeon Phi) are defunct; system designers wishing to couple together asymmetric processors today must integrate CPUs of different ISAs. Thus the system software (compiler, operating system, runtime) must handle both ISA heterogeneity and memory consistency. Previous works [7, 19, 55] describe system software for migrating compiled shared-memory applications between heterogeneous-ISA CPUs at runtime. While these works describe similar designs, we leverage Popcorn Linux [7] due to its availability. Other research on ISA-heterogeneous performance focused on GPU/CPU interaction, but in this work we concentrate on general-purpose computing applications instead of SIMD workloads.
Multi-ISA Binaries. Similarly to past works [19, 55], Popcorn Linux’s compiler builds multi-ISA binaries that are capable of cross-ISA execution. Multi-ISA binaries consist of one aligned data section and multiple per-ISA code sections, one for each target ISA in the system. To enable cross-ISA execution, the compiler arranges the application’s global address space to be aligned across ISAs so that pointers to globally visible data and functions refer to the same addresses on all nodes. Additionally, the compiler generates metadata describing function stack layouts at equivalence points [56]. These metadata describe the locations and type information (e.g., pointer-type) of live values so that stack frames can be reconstructed for the destination ISA.
Thread Migration. Threads migrate between nodes at migration points, a subset of equivalence points chosen by the compiler or user.
Page-level Distributed Shared Memory. Once threads have migrated to new nodes, they must be able to access application data. OS-level DSMs such as those proposed by Kerrighed [38], K2 [34] and Popcorn Linux observe remote memory accesses inside the page fault handler and migrate data pages similarly to a cache coherence protocol. By carefully manipulating page permissions, the OS forces the application to fault when accessing remote data. When a fault occurs, the kernel on the source (i.e., faulting) node requests the page from the remote node that currently owns the page. The page is transferred from the remote to the source and mapped into the application’s address space. The memory access is restarted and application threads are unaware that data was fetched over the interconnect. In this way, data are marshaled between nodes transparently and on demand. Note that software memory consistency would be required even for heterogeneous CPUs with (cache-incoherent) shared memory to prevent lost or reordered writes due to differing memory consistency models.
Many DSM systems use a multiple-reader, single-writer protocol as shown in Figure 2. In addition to data, nodes request access rights based on the type of memory access. If multiple nodes read data from the same page, then the protocol replicates the page with read-only permissions and all nodes can read the data in parallel. If a thread writes to a page, then the node first invalidates all other copies of the page from other nodes and then acquires exclusive write access. Any subsequent attempts to read or write the page on other nodes will cause a fault and access rights must be re-acquired.
Fig. 2. DSM protocol. Pages are migrated on-demand by observing memory accesses through the page fault handler. Pages read by threads on multiple nodes are replicated with read-only protections while only one node may have exclusive write permissions for a page.
Cross-node Execution Challenges. Unlike traditional shared memory multiprocessor systems that share data at a cache-line granularity, DSM systems share data at a page granularity due to observing memory accesses via page faults. Additionally, the cost of bringing data over the interconnect and managing access permissions is significantly higher than a traditional memory access—rather than taking tens to hundreds of nanoseconds, page migration takes tens of microseconds (see Section 3). These two characteristics mean that in order for a parallel computation to benefit from leveraging multiple heterogeneous CPUs simultaneously, data accessed by threads on different nodes must partition cleanly between pages, and there must be enough computation to amortize DSM costs. Otherwise, the application should only execute parallel work on a single node.
Irregular workloads. Naturally, not all loops exhibit the same behavior throughout all iterations and may vary in many ways: different numbers of DSM-induced page faults, changing memory access patterns, stressing different functional units (like floating points, integers or SIMD), and so on. Listing 2 illustrates a naive example of such a case, with a loop whose first half of iterations are memory-intensive and second half of iterations are instead compute-intensive. Irrespective of the source of irregularity, we refer to loops with a non-negligible degree of behavioral changes as irregular workloads.
Irregular workloads can be problematic for loop schedulers that statically decide how to distribute work. These workloads can be especially challenging for HetProbe, because their initial profiling portrays a distorted image of the overall workload requirements. Because of their changing behavior, these workloads can benefit from re-assessing workload distribution decisions throughout execution of the work sharing region.
3 DESIGN
3.1 OpenMP Across Heterogeneous-ISA CPUs
To support existing OpenMP applications, libHetMP preserves the standard OpenMP programming interface while transparently handling thread placement, synchronization, and workload distribution across heterogeneous-ISA CPUs.
Cross-Node Execution. When starting a parallel region, libHetMP spawns a team of threads spanning all nodes and organizes them into the hierarchy shown in Figure 3.
Synchronization. To avoid costly cross-node communication, libHetMP uses hierarchical synchronization primitives: threads on each node elect a leader that participates in global (cross-node) synchronization, while the remaining threads synchronize locally with their leader.
Fig. 3. libHetMP’s thread hierarchy. In this setup, libHetMP has placed three threads (numbered 1–12) on each node. For synchronization, threads on a node elect a leader (green) to represent the node at the global level. Non-leader threads (red) wait for the leader using local synchronization to avoid cross-node data accesses.
Workload Distribution. OpenMP defines several loop iteration schedulers that affect how iterations of a work-shared parallel loop are mapped to threads. The default loop iteration schedulers (static and dynamic) assume homogeneous CPUs and cache coherence, so libHetMP adapts both for cross-node execution.
Cross-node static scheduler. OpenMP's static scheduler evenly divides loop iterations among threads up front. For heterogeneous CPUs an even split wastes the faster cores, so the cross-node static scheduler skews the distribution using per-node core speed ratios (CSRs) that describe each CPU's relative performance; users must profile each work sharing region to supply good CSRs.
Cross-node dynamic scheduler. With OpenMP's dynamic scheduler, threads repeatedly grab batches of iterations from a shared work pool, naturally balancing load across cores of differing speeds. The cross-node dynamic scheduler adapts this to the thread hierarchy: node leaders fetch large batches from a global pool, and threads on a node grab smaller batches from their node-local pool.
While not traditionally meant for load balancing on heterogeneous systems, the dynamic scheduler can balance work distribution based on the compute capacity of the CPUs in the system. However, continuous synchronization at both the local and global level to grab batches of work can negatively impact performance, especially with small batch sizes. Users must again profile to determine the ideal per-region and per-hardware batch size. Non-deterministic mapping of loop iterations to threads can also cause “churn” in the DSM layer for applications that execute the same work sharing region multiple times. With deterministic mapping of iterations to threads, data may instead settle after the first invocation as nodes acquire the appropriate pages and permissions.
The main problem with the default schedulers is that users must extensively profile to find the best workload distribution configuration in a large state space, i.e., determine CSRs or batch sizes for each individual work sharing region on every new heterogeneous platform. Additionally, if cross-node execution is not beneficial for a work sharing region due to large DSM overheads, then users must profile to determine the best CPU for single-node execution and manually reconfigure the thread team (including the thread hierarchy) to only execute work-sharing regions on the selected CPU.
3.2 The HetProbe Scheduler
To avoid the tuning complexity of the default schedulers, the HetProbe scheduler begins each work sharing region with a probing period: it distributes a small number of probe iterations across all cores, measures execution time, page faults, and performance counters, and uses the results to decide where to run the remaining iterations and how to distribute them.
The HetProbe scheduler must be precise when distributing iterations for the probing period to accurately evaluate system performance. First, the scheduler issues a constant number of loop iterations to each thread, regardless of node, to compare the execution time of equal amounts of work on each CPU. Second, the scheduler must deterministically issue iterations, so that threads executing a work sharing region multiple times receive the same batch of iterations across invocations to account for the aforementioned data settling effect. If the HetProbe scheduler non-deterministically distributes probe iterations, then data might unintentionally churn and cause falsely higher DSM overheads.
3.3 Extension of HetProbe for Irregular Workloads
Some workloads exhibit an initial behavior that is not representative of their entire execution. For instance, an application could initially be memory-intensive while setting up data structures, then spend the rest of its time using that memory for compute-intensive operations. Based on the initial probing period, the HetProbe scheduler will distribute iterations favoring nodes with better memory performance. For these irregular workloads (previously described in detail in Section 2) it can be advantageous for the HetProbe scheduler to periodically re-probe so that the work distribution can be adapted to the changing behavior.
Figure 4 illustrates how HetProbe-I schedules a work sharing region with 100 iterations (excluding the initial probing period). Part 1 shows the global work queue, divided using the metrics collected from the initial probing period. In this scenario, HetProbe-I has assigned the first 60 iterations to node A and the remaining 40 to node B. Then in Part 2, after each node has completed a certain number of iterations, HetProbe-I triggers a reprobing period to re-evaluate the distribution. Once the reprobe phase concludes, Node A has finished iterations 0 to 29 and Node B iterations 60 to 80. Finally, in Part 3, the remaining iterations not executed by any node are joined to generate a new queue (iterations 30 to 59 and 81 to 100) called a jump. Note that HetProbe-I needs to make sure that all iterations in between (60 to 80) are not executed again.
Fig. 4. Process of regeneration of the global work queue, which requires registering iteration jumps. Phases (1), (2), and (3) occur sequentially and correspond to the loop start, reprobing period, and resulting workload, respectively.
Hence, the two main challenges of this new scheduler are as follows:
(1) Determine when to trigger a reprobe (Figure 4, part 2). More frequent re-probing means more fine-grained and balanced work distribution. However, reprobes have a non-negligible performance impact, especially if the redistribution of iterations for the probing period is worse than the currently running configuration (recall that probing requires executing a small number of iterations across all nodes). The clearest example of this tradeoff is triggering a reprobing period when only one node is running, as using the other nodes likely requires additional synchronization and DSM traffic.
(2) Should HetProbe-I decide to reprobe, the scheduler must redistribute disjoint groups of loop iterations, taking into account iterations that have already been completed (Figure 4, part 3).
To address the first challenge, HetProbe-I logically breaks the work sharing region into multiple smaller work-sharing regions, each with its own probing and workload distribution decisions. In HetProbe, the scheduler calculates a CSR and distributes all remaining loop iterations to nodes. HetProbe-I instead only distributes a fraction of iterations, which forces threads to re-enter the OpenMP runtime and allows HetProbe-I to check whether to reprobe. Currently, HetProbe-I triggers a reprobe after executing a user-defined percentage of iterations.
At some point, HetProbe-I determines it needs to prepare and carry out the reprobing. For this second task, HetProbe-I regenerates the global work queue from the iteration ranges each node has not yet executed, recording jumps so that completed iterations are never re-executed.
3.4 Workload Distribution Decisions
The HetProbe schedulers (both HetProbe and HetProbe-I) use the execution time, page faults and performance counters measured during the probing period to determine where to execute parallel work. Specifically, the HetProbe schedulers answer three questions:
1. Should the runtime leverage multiple nodes for parallel execution? While coupling together multiple CPUs provides more theoretical computational power, not all applications benefit from cross-node execution. As mentioned in Section 2 there is a significant cost for on-demand data marshaling and page coherency across nodes. To understand DSM overheads, we ran a microbenchmark that varies the number of compute operations executed per byte of data transferred over the interconnect. Because there are no server-grade heterogeneous-ISA CPUs integrated by point-to-point interconnects, we approximate a system using the experimental setup shown in Table 1 and evaluated the DSM layer using two protocols, TCP/IP and RDMA.
The microbenchmark spawns one thread for every core in every node in the system. It then runs a control loop that stresses each node (i.e., each \( (architecture, interconnect) \) pair) connected to the source node, i.e., the Xeon, because it runs the single-threaded portion of applications. At the start of the control loop, the source node threads initialize memory by touching all data pages to force the DSM protocol to bring all pages back to the source node’s memory. The control loop then releases the other node’s threads (ThunderX) to begin timed execution. Each ThunderX thread touches non-overlapping sets of pages to force the DSM protocol to transfer them to ThunderX memory. Finally, the ThunderX threads perform varying amounts of compute operations per page transferred. The microbenchmark calculates operations/second (incorporating the DSM costs) by timing how long it takes to execute the loop to determine the break-even point where cross-node execution is beneficial.
Figure 5(a) shows the compute throughput in millions of floating point operations per second when varying the number of compute operations per byte of data transferred over the interconnect. Figure 5(b) shows the average page fault period, i.e., elapsed time between subsequent page faults. Intuitively, as threads perform more computation per byte transferred, the computation is able to amortize the DSM costs and reach peak throughput. As shown in Figure 5(b), there are significant latency differences between RDMA and TCP/IP. Page faults using RDMA cost around 30 microseconds, whereas they cost 90 and 120 microseconds for the Xeon and Cavium servers, respectively, with TCP/IP. Thus, the amount of computation needed to amortize DSM costs when using TCP/IP is significantly higher than RDMA.
Fig. 5. Performance metrics observed when varying the number of compute operations per byte of data transferred over the interconnect. For example, a 16 on the x-axis means 16 math operations were executed per transferred byte or 65,536 operations per page.
To determine if cross-node execution is beneficial, the HetProbe schedulers calculate the page fault period by measuring execution times and number of faults. The break-even point when cross-node execution becomes beneficial can be seen in Figure 5(a) when the microbenchmark is close to maximum throughput: above 512 operations/byte for RDMA and 32,768 operations/byte for TCP/IP. Correlating these values to Figure 5(b), the runtime uses a threshold of 100 \( \mu \)s/fault for RDMA and 7,600 \( \mu \)s/fault for TCP/IP to determine whether there is enough computation to amortize DSM costs and benefit from executing across multiple CPUs. As faulting latency drops (e.g., if CPUs share physical memory), fewer compute operations are needed to amortize cross-node memory access latencies. When the interconnect between CPUs changes, this microbenchmark can be re-used as a tool to automatically determine the threshold value of when cross-node execution becomes beneficial.
2. If utilizing cross-node execution, then how much work should be distributed to each node? As mentioned previously, during the probe period the runtime measures the execution time of a constant number of iterations on each core in the system. The HetProbe schedulers use this information to directly calculate the core speed ratios of each node and skew the distribution of the remaining loop iterations.
3. If not utilizing cross-node execution, then on which node should the work be run? Determining on which node an application executes best involves understanding how the application stresses the architectural properties of each CPU. Performance counters provide insights into how applications execute and what parts of the architecture bottleneck performance. For our setup, the ThunderX has a much higher degree of parallelism versus the Xeon, meaning it has a much higher theoretical throughput for parallel computation. However, the biggest challenge in utilizing all 96 cores is being able to supply data from the memory hierarchy. Although the ThunderX uses quad-channel RAM (with twice the bandwidth of the Xeon), it only has a simple two level cache hierarchy versus the Xeon’s much more advanced (and larger per-core) three level hierarchy. If an application exhibits many cache misses, then it is unlikely to fully utilize the 96 available cores and would be better run on the Xeon. The HetProbe schedulers measure cache misses per thousand instructions during the probing period to determine how much the work-sharing region stresses the cache hierarchy (users can specify any performance counters prudent for their hardware). We experimentally determined a threshold value of three cache misses per thousand instructions—below the threshold and the application can take advantage of the ThunderX’s parallelism, but above the threshold the ThunderX’s CPUs will continuously stall waiting on the cache hierarchy. Note that the HetProbe schedulers must use performance counters and cannot simply use execution times from the probing period to decide on a node; the probing period measures execution times with DSM overheads that are not present when executing only on a single node.
Once a node has been chosen, the HetProbe schedulers fall back to existing OpenMP schedulers for single-node work distribution. Currently they default to the static scheduler, but this is configurable by the user. Additionally, threads on the node not chosen are joined to avoid unnecessary synchronization.
Figure 6 shows an example of a work sharing region with 20,000 loop iterations executing using a HetProbe scheduler. The first 2,000 iterations are used for the probing period and each of the 20 cores across both nodes is given an equal share of 100 iterations. Importantly, the probing period is performing useful work, albeit in a potentially unbalanced way. After the probing period, the runtime uses the measured results to either compute a core speed ratio and distribute the remaining 18,000 iterations across both nodes accordingly, or run them all on the single best node.
Fig. 6. HetProbe scheduler. A small number of probe iterations are distributed at the beginning of the work-sharing region to determine core speed ratios of nodes in the system. Using the results, the runtime decides either to run all iterations on one of the nodes or distribute work across nodes according to the calculated core speed ratio (shown here).
4 IMPLEMENTATION
4.1 Implementation of HetProbe-I
The starting point of HetProbe-I is the HetProbe scheduler, from which we leverage the mechanisms to determine the core speed ratio and abstractions such as the hierarchical barriers that guarantee node synchronization. We apply the same primitives and abstractions as for HetProbe for setting the scheduler at runtime. HetProbe-I differs from previous schedulers mainly in the way it dispatches work between threads, since it is there that HetProbe-I monitors the re-probing conditions and carries out the reprobing as needed.
In the HetProbe scheduler, threads call into the runtime's iteration dispatch function twice: once to receive their probe iterations, and a second time to receive their share of the remaining iterations once the workload distribution decision has been made. HetProbe-I keeps this structure but has threads return to the dispatch function repeatedly, since it hands out only a fraction of the remaining iterations at a time.
In this version of HetProbe-I we trigger a reprobing every time the sum of iterations finished by all nodes exceeds a percentage (OMP_HET_PTG) of the total number of the loop. Because this condition is reviewed whenever a thread accesses the dispatching function, HetProbe-I could be facing three potential situations when a reprobing is required:
Only one node is executing work. The thread that triggered the reprobing period belongs to the only node to which HetProbe assigned iterations—cross-node execution was deemed not beneficial according to the metrics gathered during probing. HetProbe-I modifies the previous stage (the second dispatch call) to cover this case. The leader thread of a node with no assigned work is stopped, effectively blocking all threads on that node as they stall in a hierarchical barrier. Hence, whenever a reprobing is triggered under these circumstances, HetProbe-I only needs to resume execution on the inactive node and assign it probing iterations. Having the stopped leader thread spin on a global variable is not costly, since that node is not performing any other work.
Work is running on all nodes. Because work is executing across multiple nodes, the thread that triggered the reprobing belongs to a node holding a fraction of the iterations. The leader thread of this node must wait for the other node(s) before work is replenished. Note that no particular architecture is guaranteed to finish its share first.
One node does not have iterations left. HetProbe-I must also account for the less common case in which, even though the reprobing has not been triggered, the node runs out of work. Consequently, HetProbe-I must also stop the threads that ran out of work in case a reprobing is triggered by another node. This will not have a negative impact on the total execution time even if the reprobing never occurs, since those threads will have completed their work.
In any case, HetProbe-I must restart performance statistics and generate new global work queues. The latter has to be done with special care to prevent the repeated execution of finished work, since HetProbe-I must provide threads with a consecutive range of iterations. When HetProbe-I generates a new work queue combining iterations of different nodes, HetProbe-I needs to label and assign work to threads in consideration of jumps.
When the work assigned to a thread in the dispatch function contains a jump, the thread is given the first half of the range and labeled so that it receives the second half on its next call to the function. That second part may itself contain jumps, so this process can repeat several times. Hence, HetProbe-I must keep track of the real end of the range, the end assigned so far, and the beginning of each jump. In the worst case, HetProbe-I must manage as many jumps as triggered reprobings.
5 EVALUATION
When evaluating libHetMP, we sought to answer two questions: does simultaneously leveraging heterogeneous-ISA CPUs provide performance benefits over single-node execution, and can the HetProbe schedulers automatically determine the best workload distribution without developer profiling?
5.1 Experimental Setup and Benchmarks
We evaluated libHetMP using the experimental setup shown in Table 1: a server containing an Intel Xeon E5-2620v4 and a server containing a Cavium ThunderX, connected via both TCP/IP and RDMA interconnects.
We selected 10 benchmarks from three popular benchmarking suites: the Seoul National University [51] C/OpenMP versions of the NAS Parallel Benchmarks [6], PARSEC [10], and Rodinia [14]. These benchmarks represent HPC and data mining use cases and exhibit a wide variety of computational patterns on which to evaluate libHetMP.
5.2 Work Distribution Configurations
We evaluated running benchmarks using several workload configurations. Xeon represents running the benchmark entirely on the Xeon—serial phases run on a single Xeon core and work-sharing regions use the Xeon’s 16 threads. ThunderX is similar—serial phases run on a single ThunderX core and work-sharing regions use the ThunderX’s 96 cores. Ideal CSR executes across both CPUs—serial phases run on a Xeon core and work-sharing regions always split loop iterations across the Xeon and ThunderX (112 total threads) using the static scheduler. The scheduler skews distribution using the CSRs in Table 2. The CSRs were gathered from runs with the HetProbe scheduler and manually supplied via environment variables. Cross-Node Dynamic is identical except it uses the hierarchy-based dynamic scheduler described in Section 3.1. We experimentally determined the best chunk size for each benchmark; most benchmarks performed better with smaller sizes, i.e., finer-grained load balancing. HetProbe is again identical except it uses the HetProbe scheduler. HetProbe uses both CPUs during the probing period and then decides whether cross-node execution is beneficial. If so, then it uses measured execution time to calculate CSRs (Table 2) to skew loop iteration distribution for the remaining iterations. If not, then it selects the best CPU and falls back to OpenMP’s original static scheduler on a single node; threads on the not-selected node are joined to avoid unnecessary synchronization. The probe period was configured to use 10% of available loop iterations. For benchmarks where cross-node execution was beneficial, probing overhead was determined by comparing the difference in performance between Ideal CSR and HetProbe. For benchmarks where it was not beneficial, probing overhead was determined by comparing the delta between the best single-node performance (either Xeon or ThunderX) and HetProbe. 
The HetProbe scheduler probed for up to 10 invocations of a given work-sharing region (using an exponential weighted moving average to smooth out measurements), after which it re-used existing measurements from the probe cache. For several benchmarks, the HetProbe scheduler chose single-node execution on the ThunderX. As a comparison point, “HetProbe (force Xeon)” shows the same results except forcing the HetProbe scheduler to use single-node execution on the Xeon; these results are explained below.
| Benchmark | Core speed ratio – Xeon : ThunderX |
|---|---|
| blackscholes | 3 : 1 |
| EP-C | 2.5 : 1 |
| kmeans | 1 : 1 |
| lavaMD | 3.666 : 1 |
Table 2. Core Speed Ratios Calculated by the HetProbe Scheduler. Used by the Ideal CSR and HetProbe configurations. Without the HetProbe scheduler, developers would have to determine these values manually via extensive profiling.
5.3 Results
Table 3 shows the total benchmark execution times, including both serial and parallel phases, on the Xeon. Figure 7 shows the speedup normalized to homogeneous Xeon execution for each of the aforementioned configurations. The benchmarks can broadly be classified into two categories: those that benefit from cross-node execution and those that do not. blackscholes, EP-C, kmeans and lavaMD fall into the former category, whereas the others fall into the latter. Across the benchmarks that benefit from multi-node execution, all but blackscholes achieve the highest speedup under Cross-Node Dynamic. This is because, with a small chunk size, work is distributed across nodes in almost perfect balance. Additionally, the thread hierarchy significantly reduces global synchronization, as threads grab work from a local work pool the majority of the time. Across these four benchmarks, Cross-Node Dynamic yields a geometric mean speedup of 2.68×. Ideal CSR is 12.5% faster for blackscholes and close behind Cross-Node Dynamic for the other three, achieving a geometric mean speedup of 2.55×. Finally, HetProbe is slightly slower than the other two cross-node configurations, achieving a geometric mean speedup of 2.4×. This is because the probe period runs a constant number of iterations on all cores, leading to an initial workload imbalance. Additionally, measurement machinery (timestamps, parsing the
Fig. 7. Speedup of benchmarks versus running homogeneously on Xeon (values less than one indicate slowdowns). Asterisks mark the best workload distribution configuration for each benchmark. “Cross-Node Dynamic” provides the best performance across applications that benefit from leveraging both CPUs (blackscholes, EP-C, kmeans, lavaMD), but causes significant slowdowns for those that do not. “HetProbe” achieves similar performance to Ideal CSR and Cross-Node Dynamic for these four applications but falls back to a single CPU for applications with significant DSM communication and hence worse cross-node performance. For geometric mean, “Oracle” is the average of the configurations marked by asterisks, i.e., what a developer who had explored all such possible workload distribution configurations through extensive profiling would choose.
For benchmarks that do not scale across nodes, however, the Ideal CSR and Cross-Node Dynamic configurations significantly degrade performance, with geometric mean slowdowns of 3.63× and 5.89×, respectively. This is due to DSM: threads spend significant time waiting for pages from other nodes, which also forces application threads on other nodes to be time-multiplexed with DSM workers. There is not enough computation to amortize the cost of DSM page faults over the network. The Cross-Node Dynamic scheduler is strictly worse than the Ideal CSR scheduler due to the additional work distribution synchronization caused by threads repeatedly grabbing batches of iterations. The HetProbe scheduler, however, successfully avoids cross-node execution for these benchmarks by measuring the page fault period and determining that cross-node execution is not beneficial (geometric mean slowdown of 39%, or 2.4% without cfd). Figure 8 shows the measured page fault periods for each application; applications with a period below 100 \( \mu \)s were considered not profitable for cross-node execution.
Fig. 8. Page fault periods determining whether cross-node execution is beneficial. Red bars (cross-node not profitable) are below the RDMA threshold indicated in Section 3, blue are above.
For applications deemed not beneficial to execute across nodes due to high DSM overheads, the HetProbe scheduler utilized cache misses per 1,000 instructions to determine whether to execute work-sharing regions on the Xeon or ThunderX. As shown in Figure 9, there is a clear separation between applications that benefit from the ThunderX’s high parallelism (BT-C, cfd, lud) and those that are bottlenecked by memory accesses (CG-C, SP-C, streamcluster). When selecting a node, the HetProbe scheduler used a threshold value of three misses per thousand instructions, placing BT-C, cfd and lud on the ThunderX and the others on Xeon (cfd has special behavior, see below). For the three benchmarks placed on Xeon, probing overhead is equivalent to the difference between Xeon and HetProbe, since HetProbe degrades to Xeon after probing. The probing period adds 4.8%, 6.6%, and 7.1% for CG-C, SP-C, and streamcluster, respectively, for a geometric mean overhead of 6.1%. This shows performance close to single-node execution on the Xeon, meaning the probing period has minimal impact on performance.
Fig. 9. Cache misses for applications not executed across nodes. Green bars (including lud) indicate the application was run on the ThunderX, blue were run on the Xeon.
Looking closer at cfd and CG-C, these applications have roughly the same performance on the Xeon and ThunderX but vastly different cache miss behavior. More interestingly, the HetProbe scheduler places cfd on the ThunderX although the optimal choice is the Xeon. Although cfd's parallel region runs faster on the ThunderX (74.58 seconds on the Xeon, 66.79 seconds on the ThunderX), it has a long serial file I/O phase that runs significantly faster on the Xeon (1.83 seconds versus 13.72 seconds), making the benchmark's overall execution time shorter on the Xeon. This file I/O phase also explains the disparity in cache misses between the two benchmarks: cfd's parallel region has few cache misses but its execution time is heavily impacted by file operations, whereas CG-C performs no file I/O and has a large number of cache misses.
Interestingly, for BT-C, cfd and lud, executing parallel regions on the ThunderX achieved worse than expected performance due to OS limitations. Popcorn Linux's kernel currently only supports spawning threads on the node on which the application started, meaning one thread must remain on the Xeon even when work-sharing regions execute on the ThunderX. Each of these benchmarks executes hundreds to thousands of work-sharing regions (and their associated implicit barriers), causing significant cross-node synchronization. As a comparison point for BT-C and cfd, we ran an additional experiment forcing the HetProbe scheduler to select the Xeon for single-node execution; it added 3.2% and 4.2% probing overhead, respectively. lud is an interesting case: the HetProbe scheduler decides cross-node execution is not profitable and runs work-sharing regions on the ThunderX. The aforementioned OS limitation impacts HetProbe's performance enough that Ideal CSR actually achieves 20% better performance than HetProbe (although still worse than running solely on the ThunderX). We expect that when Popcorn Linux allows spawning threads on remote nodes,
It is important to note that none of Xeon, ThunderX, Ideal CSR or Cross-Node Dynamic perform best in all situations, clearly illustrating the need for HetProbe. As shown in Figure 7, HetProbe provides the best performance out of all evaluated configurations across all benchmarks with a geometric mean performance improvement of 41% (ThunderX provides an 11% improvement). In contrast, Ideal CSR causes a slowdown of 49% and Cross-Node Dynamic causes a 96% slowdown, highlighting the importance of communication traffic when distributing computation. As a comparison point, “Oracle” shows that developers could obtain a geometric mean speedup of 60% if they had extensively profiled all configurations and selected the best for all benchmarks. As Popcorn Linux matures, HetProbe will be able to more closely match the Oracle, as the aforementioned limitation has a significant impact on HetProbe’s performance.
5.4 What Types of Applications Benefit from Cross-Node Execution?
The four applications that benefit from cross-node execution have a high enough ratio of computation to cross-node communication to leverage the compute resources of multiple CPUs. blackscholes has an initial data transfer period but repeats computation on the same data, allowing data to settle on each node (blackscholes also has a lengthy file I/O phase that benefits from the Xeon's strong single-threaded performance). EP-C performs completely local computation (including heavy use of thread-local storage) with a single final reduction stage. lavaMD computes particle potentials through interactions of neighbors within a radius, meaning multiple threads re-use the same data brought across the interconnect. Similarly, kmeans alternately updates cluster centers and cluster members: all threads on a node alternate between scanning the cluster member and cluster center arrays, re-using pages brought over the interconnect.
Benchmarks that do not benefit cannot amortize data transfer costs. For example, BT-C and SP-C access multidimensional arrays along different dimensions in consecutive work sharing regions, causing the DSM to shuffle large amounts of data between nodes. Other benchmarks have little data locality: CG-C and streamcluster calculate a set of results and then access them in irregular patterns using an indirection array. This behavior causes extensive latencies for local cache hierarchies, let alone DSM. lud’s work-sharing region sequentially accesses an array, but does not perform enough computation per byte to amortize DSM costs. Additionally, there is a large amount of “false sharing” where threads on different nodes write to independent parts of the same page. False sharing can be avoided by the use of a multiple-writer protocol such as lazy-release consistency [4].
An interesting observation that arises from measuring the page fault period is that while the metric provides a sound threshold for determining whether cross-node execution will be beneficial, it is not a good indicator of overall performance gains. For example, while kmeans’ page fault period is slightly over the threshold (130-\( \mu \)s period), it benefits the most from cross-node execution out of all benchmarks. This is because it has a high level of inter-thread data reuse. As mentioned previously, all threads scan the same array in a superstep of the algorithm, meaning all threads reuse data brought over from a page fault (in addition to being extremely efficient on the cache-starved ThunderX). This is in contrast to lavaMD where only a subset of threads working on adjacent regions reuse migrated pages.
5.5 What Applications Benefit from Ideal CSR versus Cross-Node Dynamic?
Three of the four benchmarks that benefit from cross-node execution achieve the best performance with Cross-Node Dynamic due to fine-grained load balancing. For blackscholes, however, Ideal CSR achieves better performance. This is due to pages settling into a steady state after an initial page shuffle. Threads receiving the same loop iterations across multiple invocations of the work sharing region access the same data, thus all data pages required by threads are already mapped to the appropriate node. With Cross-Node Dynamic, however, threads receive different loop iterations across separate executions, meaning pages containing results must be continually shuffled across nodes. This settling behavior is why the HetProbe scheduler deterministically distributes iterations for the probing period.
5.6 Case Study: TCP/IP
To evaluate the effectiveness of the HetProbe scheduler on different types of interconnects, we ran blackscholes with varying numbers of iterations (more iterations mean more compute operations per byte, since blackscholes' data settles after the first iteration) using the TCP/IP protocol described in Section 3.4. Figure 10 shows the execution time when running homogeneously on the Xeon versus cross-node execution (lines) and the page fault period of each cross-node run (bars). We use a page fault period of 7,600 \( \mu \)s to determine whether cross-node execution will be beneficial when using TCP/IP. The results are somewhat noisy (TCP/IP tends to have more variable latencies) but consistent with expectations: only after the page fault period climbs above 8,000 \( \mu \)s does cross-node execution pay off. We conclude that using page fault periods as the determining factor for cross-node execution is applicable across different types of interconnects.
Fig. 10. Execution time (lines, left axis) and page fault period (bars, right axis) for blackscholes. “Homogeneous” refers to Xeon configuration, “TCP/IP” refers to using HetProbe over TCP/IP.
5.7 Evaluation of HetProbe-I
We evaluated the HetProbe-I loop scheduler, paying special attention to workload irregularity, the central aspect of its design. We use the same experimental setup as in previous sections (see Table 1), with an Intel Xeon 2620v4 at 2.1 GHz and a Cavium ThunderX at 2 GHz, relying on the RDMA protocol for inter-node communication. We developed a new microbenchmark specifically designed to evaluate the performance of heterogeneous-ISA OpenMP schedulers under workloads with irregular degrees of architectural affinity. This microbenchmark leverages the work-sharing regions of streamcluster and BT-C, two benchmarks that achieved their best performance running solely on the Xeon and the ThunderX cores, respectively (see Figure 7). We also ran several benchmarks to compare HetProbe-I against HetProbe, using streamcluster and BT-C as well as blackscholes (which showed the best speedup under the Ideal CSR scheduler) and lavaMD (which benefited most from the hierarchy-based Cross-Node Dynamic scheduler). To make our evaluation more comprehensive, we also included benchmarks from SPEC OMP 2012 [39]. We rearranged some portions of code to address limitations of the Popcorn compiler without altering the benchmarks' execution. Since the Popcorn compiler does not currently support C++, we considered all SPEC benchmarks written in C, except for 367.imagick, which triggers a bug in the Popcorn compiler. All benchmarks are compiled with the same flags as in the previous evaluation of HetProbe. We still use 10% of iterations for the initial probing period, as with HetProbe, since our experiments did not show a significant difference in benchmark performance for either HetProbe or HetProbe-I when using a different percentage.
To further test HetProbe-I, we built a microbenchmark that receives the degree of irregularity of the workload as a parameter. It combines the source code of streamcluster and BT-C, since the evaluation of HetProbe demonstrated that they perform best running only on the Xeon (the former) or the ThunderX (the latter). The microbenchmark mixes invocations of these two benchmarks' work-sharing regions in different proportions to generate the requested degree of irregularity, producing workloads best suited to the combined effort of the Xeon and ThunderX architectures. For example, with a 60% degree of irregularity, six out of 10 iterations invoke the BT-C loop and four the streamcluster loop.
Figure 11 compares the performance of HetProbe and HetProbe-I at different degrees of irregularity using the microbenchmark. Because of how the microbenchmark divides iterations, it is no surprise that at 0% irregularity the speedup closely resembles that of streamcluster, a regular workload, whereas at 100% irregularity the overall performance is very similar to that of BT-C, an irregular workload. As expected, HetProbe-I progressively improves over HetProbe as the percentage of iterations from BT-C increases.
Fig. 11. Speedup of the microbenchmark executing using HetProbe-I normalized to execution with HetProbe. The microbenchmark is composed of work sharing regions from streamcluster and BT-C.
Figure 12 shows the speedup comparison between the two loop schedulers. The benchmark with the greatest improvement (BT-C, 24%) combines two features that are hard for HetProbe to deal with: it has one of the highest cross-node synchronization requirements among the benchmarks, but is actually very well suited to execution on the ThunderX (as seen in Figure 7). Hence, with HetProbe-I the initial page faults, which reduce the ThunderX's share of work in the first probing period, end up paying for themselves in the long run, something HetProbe-I detects in its reprobing periods. Notably, the Cross-Node Dynamic scheduler is also designed to deal with irregular workloads by having threads rate-limit themselves: threads assigned computationally demanding iterations take more time, whereas threads assigned lighter work can churn through more iterations. Nonetheless, threads replenish iterations from local per-node queues generated after the initial probing period, so it is also interesting to compare this scheduler with HetProbe-I, which regenerates the global queues, and then these local queues, once reprobing completes and a different CSR is obtained.
Fig. 12. Speedup of benchmarks using the HetProbe-I scheduler normalized to the HetProbe scheduler.
Compared with the Cross-Node Dynamic scheduler, HetProbe-I performs over three times better on BT-C, a benchmark on which HetProbe was already significantly better than the Cross-Node Dynamic scheduler (see Figure 7). The slight improvement of blackscholes (5%) over Ideal CSR (previously the best-performing scheduler) can be attributed to the high number of reprobing periods, which even out workload imbalances caused by measurement error. For blackscholes, HetProbe-I also averages a 13% speedup improvement over the Cross-Node Dynamic scheduler. This is a case in which reprobing does not cause overhead but is instead beneficial due to blackscholes' irregularity. Conversely, streamcluster is better suited to run on the Xeon (we learned from the microbenchmark evaluation that it has a low degree of irregularity), so periodic reprobing that sends iterations to the ThunderX is always detrimental.
In conclusion, HetProbe-I can achieve significant performance benefits over HetProbe. HetProbe-I excels in work sharing regions that contain a high degree of irregularity, for example a high amount of page faults registered on the first probing period that are not representative of the overall execution.
5.8 Limitations
There are a number of ways in which
An alternative for HetProbe-I would be to continuously monitor page faults during the work sharing region and fall back to single node execution if the number of page faults begins to rise. Conversely, if the number of page faults begins to drop, then the HetProbe-I scheduler could dynamically bring extra threads on another node online. In general,
In terms of potential performance degradation for HetProbe-I, reprobing can be costly and may negatively impact other areas of the system. It is worth noting that under HetProbe-I, overall CPU utilization will increase under certain circumstances: if a reprobing period triggers a redistribution of work onto a node that was not previously in use, then the CPU resources on that node will begin executing loop iterations. Furthermore, we observed an increase in the number of cache misses on all the benchmarks evaluated. This increase can be attributed to several factors, including that (1) a node that was idle in the previous work distribution must perform the initial fetching of memory (page faults and compulsory cache misses) and (2) the spatial locality of iterations may not be exploited by the node in the next work distribution. As an example, picture Node A running iterations 10 to 1,000 and Node B running iterations 1,000 to 1,100. If Node B is asked to execute iterations 300 to 1,100 after a reprobing period (skipping work that was already completed), the spatial locality of adjacent iterations is lost.
Figure 13 shows the percentage increase in cache misses for each benchmark in comparison with HetProbe. The largest increase comes from streamcluster (55% more cache misses), the benchmark on which HetProbe-I obtained its worst result, a 35% slowdown compared to HetProbe. Although a relative increase in cache misses is undesirable, the total number of cache misses for streamcluster is an order of magnitude smaller than for lavaMD or BT-C, making the increase less significant in absolute terms. Still, there are scenarios in which this loop scheduler not only degrades performance but could also harm other applications on the system that access shared levels of the cache. From this we learn that refining the loop scheduler's management of memory could mitigate the overhead produced by HetProbe-I.
Fig. 13. Percentage increase in cache misses for the evaluated benchmarks scheduled with HetProbe-I relative to the HetProbe loop scheduler as a baseline.
6 RELATED WORK
Parallel Programming Models and Frameworks. Shared-memory parallel programming models like OpenMP [43] and Cilk [11] provide source code annotations to automate parallel computation, but do not support execution across cache-incoherent, heterogeneous-ISA CPUs. MPI [23] gives developers low-level primitives to distribute execution, manage separate physical memories and marshal memory between heterogeneous-ISA CPUs. However, for asymmetric CPUs, developers must manually assign parallel work and transfer required data to maximize performance, leading to complex and verbose applications with static, non-portable workload distribution decisions. Cluster OpenMP [25] is a now-defunct commercial product attempting to replace hierarchical MPI + OpenMP parallelism by providing shared memory semantics using DSM on networked homogeneous machines. PGAS frameworks like UPC [15], X10 [13], and Grappa [40] support cross-node execution and memory accesses, but do not support sharing data across ISAs and changing workload distribution decisions in light of system characteristics is cumbersome (data are not migrated between nodes for locality). Charm++ [27] is an object-oriented approach to sharing data between (potentially distributed) processes, but does not support load balancing across heterogeneous-ISA CPUs. Cluster frameworks like SnuCL-D [29] and OmpSs [12] provide coarse-grained work distribution by assigning multiple independent parallel computations to individual heterogeneous processors. They do not consider fine-grained work-sharing of a single parallel computation and require developers to specify data movement.
CPU/GPU Work Partitioning. Several works explore work distribution in CPU/GPU systems. Qilin [35] is a compiler and runtime that enables CPU/GPU workload partitioning but requires developers to rewrite computation using a new API. Unlike
Single-ISA Scheduling. There are a number of schedulers designed to improve task-parallel workloads (as opposed to data-parallel workloads targeted by
7 CONCLUSION
In this work we presented
As heterogeneous-ISA architectures become more ubiquitous, it is important that new system software like
Our complete implementation is available as open source as part of the Popcorn Linux project at http://popcornlinux.org/.
ACKNOWLEDGMENT
An earlier version of this article was presented at the 21st International Middleware Conference [37].
REFERENCES
- [1] 2017. PCI Express Base Specification Revision 4.0, Version 1.0. Retrieved from https://pcisig.com/specifications/pciexpress/.
- [2] 2018. Summit: A Supercomputer Suited for AI. Retrieved from https://www.olcf.ornl.gov/wp-content/uploads/2018/06/NODE_infographic_FIN.pdf.
- [3] 2020. AMD Infinity Architecture Technology. Retrieved from https://www.amd.com/en/technologies/infinity-architecture.
- [4] 1996. TreadMarks: Shared memory computing on networks of workstations. Computer 29, 2 (February 1996), 18–28.
- [5] 2019. Intel Agilex: 10nm FPGAs with PCIe 5.0, DDR5, and CXL. Retrieved from https://www.anandtech.com/show/14149/intel-agilex-10nm-fpgas-with-pcie-50-ddr5-and-cxl.
- [6] 1991. The NAS parallel benchmarks. Int. J. Supercomput. Appl. 5, 3 (1991), 63–73.
- [7] 2017. Breaking the boundaries in heterogeneous-ISA datacenters. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’17). ACM, New York, NY, 645–659.
- [8] 2015. Popcorn: Bridging the programmability gap in heterogeneous-ISA platforms. In Proceedings of the 10th European Conference on Computer Systems (EuroSys’15). ACM, New York, NY, Article 29, 16 pages.
- [9] 2018. New—EC2 Instances (A1) Powered by Arm-Based AWS Graviton Processors. Retrieved from https://aws.amazon.com/blogs/aws/new-ec2-instances-a1-powered-by-arm-based-aws-graviton-processors/.
- [10] 2011. Benchmarking Modern Multiprocessors. Ph.D. Dissertation. Princeton University.
- [11] 1995. Cilk: An efficient multithreaded runtime system. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP’95). ACM, New York, NY, 207–216.
- [12] 2012. Productive programming of GPU clusters with OmpSs. In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium. 557–568.
- [13] 2005. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA’05). ACM, New York, NY, 519–538.
- [14] 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’09).
- [15] 2005. An evaluation of global address space languages: Co-array Fortran and Unified Parallel C. In Proceedings of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’05). ACM, New York, NY, 36–47.
- [16] 2017. Cache Coherent Interconnect for Accelerators (CCIX). Retrieved from http://www.ccixconsortium.com.
- [17] 2020. Compute Express Link. Retrieved from https://www.computeexpresslink.org/.
- [18] 2018. Parallel computation of voxelised protein surfaces with OpenMP. In Proceedings of the 6th International Workshop on Parallelism in Bioinformatics (PBio’18). Association for Computing Machinery, New York, NY, 19–29.
- [19] 2012. Execution migration in a heterogeneous-ISA chip multiprocessor. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII). ACM, New York, NY, 261–272.
- [20] 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA’11). ACM, New York, NY, 365–376.
- [21] 2011. big.LITTLE processing with ARM Cortex-A15 & Cortex-A7. ARM White Paper 17 (2011).
- [22] 2011. A static task partitioning approach for heterogeneous systems using OpenCL. In Compiler Construction. Springer, Berlin, 286–305.
- [23] 1999. Using MPI: Portable Parallel Programming with the Message-passing Interface. Vol. 1. MIT Press.
- [24] 2018. Dynamic data race detection for OpenMP programs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’18). IEEE Press, Article 61, 12 pages.
- [25] 2006. Extending OpenMP to clusters. White Paper, Intel Corporation.
- [26] 2016. Portable performance on asymmetric multicore processors. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’16). ACM, New York, NY, 24–35.
- [27] 1993. CHARM++: A Portable Concurrent Object Oriented System Based on C++. Vol. 28. Citeseer.
- [28] 2018. The OpenCL Specification. Technical Report. https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_API.pdf.
- [29] 2016. A distributed OpenCL framework using redundant computation and data replication. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’16). ACM, New York, NY, 553–569.
- [30] 2017. Popcorn Linux: Compiler, Operating System and Virtualization Support for Application/Thread Migration in Heterogeneous ISA Environments. Retrieved from http://www.linuxplumbersconf.org/2017/ocw/proposals/4719.html.
- [31] 2013. An automatic input-sensitive approach for heterogeneous task partitioning. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS’13). ACM, New York, NY, 149–160.
- [32] 2017. The New Intel Xeon Processor Scalable Family (Formerly Skylake-SP). Retrieved from https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.90-Server-Pub/HC29.22.930-Xeon-Skylake-sp-Kumar-Intel.pdf.
- [33] 2017. The next generation AMD enterprise server product architecture. IEEE Hot Chips 29 (2017).
- [34] 2014. K2: A mobile operating system for heterogeneous coherence domains. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’14). ACM, New York, NY, 285–300.
- [35] 2009. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). ACM, New York, NY, 45–55.
- [36] 2019. libMPNode: An OpenMP runtime for parallel processing across incoherent domains. In Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM’19).
- [37] 2020. An OpenMP runtime for transparent work sharing across cache-incoherent heterogeneous nodes. In Proceedings of the 21st International Middleware Conference (Middleware’20). Association for Computing Machinery, New York, NY, 415–429.
- [38] 2004. Kerrighed and data parallelism: Cluster computing on single system image operating systems. In Proceedings of the IEEE International Conference on Cluster Computing. 277–286.
- [39] 2012. SPEC OMP2012—An application benchmark suite for parallel systems using OpenMP. In OpenMP in a Heterogeneous World. Springer, Berlin, 223–236.
- [40] 2015. Latency-tolerant software distributed shared memory. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’15). USENIX Association, Berkeley, CA, 291–305. https://www.usenix.org/conference/atc15/technical-session/presentation/nelson.
- [41] 2020. NVLink. Retrieved from https://www.nvidia.com/en-us/data-center/nvlink/.
- [42] 2020. OpenCAPI Consortium. Retrieved from https://opencapi.org/.
- [43] 2018. OpenMP Application Program Interface v5.0. Technical Report. OpenMP Architecture Review Board. Retrieved from https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf.
- [44] 2012. Lucky scheduling for energy-efficient heterogeneous multi-core systems. In Proceedings of the USENIX Conference on Power-Aware Computing and Systems (HotPower’12). USENIX Association, Berkeley, CA, 7–7. http://dl.acm.org/citation.cfm?id=2387869.2387876.
- [45] 2018. Next-Generation ThunderX2 ARM Targets Skylake Xeons. Retrieved from https://www.nextplatform.com/2016/06/03/next-generation-thunderx2-arm-targets-skylake-xeons/.
- [46] 2016. A reconfigurable fabric for accelerating large-scale datacenter services. Commun. ACM 59, 11 (October 2016), 114–122.
- [47] 2019. Qualcomm Snapdragon 855 Mobile Platform. Retrieved from https://www.qualcomm.com/media/documents/files/snapdragon-855-mobile-platform-product-brief.pdf.
- [48] 2010. Thread-management techniques to maximize efficiency in multicore and simultaneous multithreaded microprocessors. ACM Trans. Archit. Code Optim. 7, 2, Article 9 (Oct. 2010), 25 pages.
- [49] 2017. Parallel processing design of latent semantic analysis based essay grading system with OpenMP. In Proceedings of the 2017 International Conference on Computer Science and Artificial Intelligence (CSAI’17). Association for Computing Machinery, New York, NY, 119–124.
Digital Library
- [50] . 2015. CoreTSAR: Core task-size adapting runtime. IEEE Trans. Parallel Distrib. Syst. 26, 11 (
November 2015), 2970–2983. Google ScholarDigital Library
- [51] . 2011. Performance characterization of the NAS Parallel Benchmarks in OpenCL. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’11). 137–148. Google Scholar
Digital Library
- [52] . 2017. Energy-Efficient compilation of irregular task-parallel loops. ACM Trans. Archit. Code Optim. 14, 4, Article
35 (Nov. 2017), 29 pages. Google ScholarDigital Library
- [53] . 2005. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb’s J. 30, 3 (2005), 202–210.Google Scholar
- [54] . 2012. Welcome to the Jungle. Retrieved from https://herbsutter.com/welcome-to-the-jungle/.Google Scholar
- [55] . 2014. Harnessing ISA diversity: Design of a heterogeneous-ISA chip multiprocessor. In Proceeding of the 41st Annual International Symposium on Computer Architecuture (ISCA’14). IEEE Press, Piscataway, NJ, 121–132. http://dl.acm.org/citation.cfm?id=2665671.2665692.Google Scholar
Digital Library
- [56] . 1994. A unified model of pointwise equivalence of procedural computations. ACM Trans. Program. Lang. Syst. 16, 6 (
Nov. 1994), 1842–1874. Google ScholarDigital Library
An OpenMP Runtime for Transparent Work Sharing across Cache-Incoherent Heterogeneous Nodes