Zero-Overhead Parallel Scans for Multi-Core CPUs

We present three novel parallel scan algorithms for multi-core CPUs which do not need to fix the number of available cores at the start, and have zero overhead compared to sequential scans when executed on a single core. These two properties are in contrast with most existing parallel scan algorithms, which are asymptotically optimal, but have a constant factor overhead compared to sequential scans when executed on a single core. We achieve these properties by adapting the classic three-phase scan algorithms. The resulting algorithms also exhibit better performance than the original ones on multiple cores. Furthermore, we adapt the chained scan with decoupled look-back algorithm to also have these two properties. While this algorithm was originally designed for GPUs, we show it is also suitable for multi-core CPUs, outperforming the classic three-phase scans in our benchmarks, by better using the caches of the processor at the cost of more synchronisation. In general our adaptive chained scan is the fastest parallel scan, but in specific situations our assisted reduce-then-scan is better.


Introduction
Scans, also known as prefix-sums, are an important primitive in parallel algorithms. They have many applications, including as targets for the flattening of nested data-parallel computations [9], for compacting or filtering data [1], for implementing sparse matrix multiplication [1], and for computing summed-area tables in computer graphics [15].
Given an array, a scan computes for each element the combined value of all prior elements (including or excluding its own value, for respectively an inclusive or exclusive scan). The scan uses a binary operator ⊕ to combine two elements, and can be implemented sequentially as follows:

    function scan_seq(T* input, T* output, size, T initial)
        accum ← initial
        for i in 0 .. size do
            accum ← accum ⊕ input[i]
            output[i] ← accum
        return accum

If the operator ⊕ is associative, the scan can be executed in parallel. In this paper, we introduce scan algorithms for thread-level parallelism on CPUs which are faster than the state-of-the-art algorithms. We address the cost of performing scans in parallel on multi-core CPUs. Most existing parallel scan algorithms are asymptotically optimal. However, they typically have a constant-factor overhead over sequential scans, as they traverse the data multiple times. This is also the case for Blelloch's classical two-phase scan [2] and its in-place variant, presented by Gu et al. [12]. Most parallel scans executed on a single thread perform at around 50% to 75% of the speed of a sequential scan in our benchmarks. Whether a parallel scan is worthwhile over a sequential scan depends (among other factors) on the number of available threads. In many cases this number is not known in advance, for example in a task-parallel context, when other tasks are executed at the same time. This number may also change during the execution of the scan: after the scan starts, other tasks may finish and their threads may then join the work for the scan. We introduce scan algorithms that are robust to changes in the number of threads.
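As a concrete illustration, scan_seq can be rendered in Rust (the language of our implementation); this is a sketch, not the exact code of our implementation, and the generic closure `op` stands for the operator ⊕:

```rust
// Sequential inclusive scan: a Rust sketch of scan_seq above.
// `op` is the binary operator ⊕ and `initial` the starting accumulator.
fn scan_seq<T: Copy>(input: &[T], output: &mut [T], initial: T, op: impl Fn(T, T) -> T) -> T {
    let mut accum = initial;
    for i in 0..input.len() {
        accum = op(accum, input[i]);
        output[i] = accum; // inclusive: the element's own value is included
    }
    accum // the aggregate of the whole input
}

fn main() {
    let input = [1u64, 2, 3, 4];
    let mut output = [0u64; 4];
    let total = scan_seq(&input, &mut output, 0, |a, b| a + b);
    assert_eq!(output, [1, 3, 6, 10]);
    assert_eq!(total, 10);
}
```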

The Main Ideas
Our scan algorithms are a mix of parallel and sequential scans. We modify the classic three-phase scans (scan-then-propagate [23] and reduce-then-scan [8,13]) using a two-sided parallel loop. One thread performs a sequential scan from one side of the array, while other threads work from the other side, until they meet somewhere in the middle. The other threads assist the first thread, hence we name these assisted scan-then-propagate and assisted reduce-then-scan.
We port chained scans [19,26], the state-of-the-art scan algorithm for GPUs, to multi-core CPUs and modify it for zero single-threaded overhead. Our adaptive chained scan starts in a sequential mode, and adapts to a parallel mode when another thread joins the computation.
We assume that the operator ⊕ is associative with an identity element. We use the following terminology: a block is a contiguous part of the input, the aggregate of a block is the combined value (with ⊕) of all its values, and the prefix of a value or block is the combined value of all values before it. An exclusive prefix does not include the value or block itself, whereas an inclusive prefix does. We present algorithms for left-to-right inclusive scans (computing inclusive prefixes), but they also apply to right-to-left or exclusive scans.
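For example, with ⊕ = + on a small block (the values here are arbitrary, chosen only for illustration), the terminology works out as follows:

```rust
fn main() {
    let block = [3u64, 1, 4, 1, 5];

    // Aggregate: all values of the block combined with ⊕.
    let aggregate: u64 = block.iter().sum();
    assert_eq!(aggregate, 14);

    // Inclusive prefixes: each value combined with all values before it,
    // including itself.
    let mut acc = 0u64;
    let inclusive: Vec<u64> = block.iter().map(|&x| { acc += x; acc }).collect();
    assert_eq!(inclusive, [3, 4, 8, 9, 14]);

    // Exclusive prefixes: all values strictly before each value.
    let mut acc = 0u64;
    let exclusive: Vec<u64> = block.iter().map(|&x| { let p = acc; acc += x; p }).collect();
    assert_eq!(exclusive, [0, 3, 4, 8, 9]);
}
```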

Contributions
We introduce three new parallel scan algorithms, that:
• have almost no overhead when executed on a single core;
• are at least as fast as the algorithms they are based on, on any number of threads; and
• do not need to fix the number of threads working on the scan at the start.
Similar to most existing scan algorithms, our three-phase scans can work both out-of-place and in-place, i.e. by either having separate arrays for the input and output or by reusing the input array for the output. Chained scans, including our adaptation, need auxiliary storage, but it is around a thousand times smaller than the input array.
We are not aware of other work that evaluates the performance of the chained scan with decoupled look-back [19] on multi-core CPUs. We show that the chained scan is also suitable for CPUs. Our adaptive chained scan is in general the fastest parallel scan algorithm in our benchmarks, outperforming the classic three-phase scans and the scan implementations in oneTBB and ParlayLib [3]. In some specific situations, however, our assisted reduce-then-scan may be faster. We have an implementation available at https://github.com/ivogabe/zero-overhead-parallel-scans.
The remainder of this paper is organized as follows. We explain existing three-phase scan algorithms and our variants in Section 2. We explain and adapt the single-pass chained scan in Section 3 and evaluate the performance of all these scans in Section 4.

Parallel three-phase scans
Parallel scans are commonly implemented with the reduce-then-scan or scan-then-propagate (also known as scan-then-map) algorithm, which are both three-phase scans. We first introduce the existing algorithms, and then modify them for zero overhead on a single thread.
Figure 1a and 1c contain visualisations of the three-phase scan algorithms. The input is split in a fixed number of blocks. In Phase 1, the aggregate (combined value) of each block is computed in parallel. In the figure, we have eight blocks, and the colours denote the activities of the different threads. In Phase 2, a scan is performed over these aggregates, to compute the prefix value of each block. This may either happen recursively with a parallel scan, or with a sequential scan; we assume the latter. In the figure, the yellow thread performs the sequential scan computing the prefixes of the blocks. The data is traversed again in parallel in Phase 3 to incorporate the prefix of each block.
Algorithm scan-then-propagate [23] (or scan-then-map) performs a scan per block in Phase 1 and writes the results to memory. During Phase 3 it combines each element with the prefix of that block, as visualised in Figure 1a. Algorithm reduce-then-scan [8,13] (Figure 1c) only performs a reduction during Phase 1 and doesn't store the prefix values per element. It performs a scan per block in Phase 3 with the prefix value of that block. It requires approximately 3n memory operations (reads and writes, where n is the input size), whereas scan-then-propagate performs 4n memory operations. Having fewer memory operations is usually beneficial as the performance is often limited by memory bandwidth. In comparison, a sequential scan performs 2n memory operations.
The input should be split in a certain number of blocks, which is not necessarily the same as the number of threads. To ensure that Phase 2 has constant time complexity, we use a fixed number of blocks independent of the input size. This number should be a factor larger than the maximum number of threads, to ensure that the work can be properly balanced over the available threads, especially when only some of the threads can assist on this scan. We used a factor of 16 in our experiments. In our algorithm, the number of blocks should be at most 2^16 − thread_count to ensure that no overflows happen (in work_index as later shown in the pseudocode, although one could make work_index a 64-bit integer to increase this limit).
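To make the data flow of reduce-then-scan concrete, the following Rust sketch runs the three phases single-threadedly for ⊕ = +. In the real algorithm, Phases 1 and 3 are executed in parallel over the blocks, and the block count is fixed rather than derived from a block size; the names here are illustrative:

```rust
// Single-threaded walk-through of reduce-then-scan with ⊕ = +.
const BLOCK_SIZE: usize = 4;

fn reduce_then_scan(input: &[u64], output: &mut [u64]) {
    let blocks: Vec<&[u64]> = input.chunks(BLOCK_SIZE).collect();

    // Phase 1: reduce every block to its aggregate (parallel in practice).
    let aggregates: Vec<u64> = blocks.iter().map(|b| b.iter().sum()).collect();

    // Phase 2: a sequential exclusive scan over the aggregates yields the
    // prefix of each block.
    let mut acc = 0u64;
    let prefixes: Vec<u64> = aggregates
        .iter()
        .map(|&a| { let p = acc; acc += a; p })
        .collect();

    // Phase 3: scan each block, seeded with its prefix (parallel in practice).
    for (b, block) in blocks.iter().enumerate() {
        let mut acc = prefixes[b];
        for (j, &x) in block.iter().enumerate() {
            acc += x;
            output[b * BLOCK_SIZE + j] = acc;
        }
    }
}

fn main() {
    let input: Vec<u64> = (1..=10).collect();
    let mut output = vec![0u64; input.len()];
    reduce_then_scan(&input, &mut output);
    // Matches a plain sequential inclusive prefix sum.
    assert_eq!(output, [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]);
}
```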
Zero-overhead three-phase scans
Our adaptations of these scan algorithms, called assisted scan-then-propagate and assisted reduce-then-scan respectively, use a hybrid approach where the scan is partly performed as a sequential scan during Phase 1. This way, we achieve parallel three-phase scans which have zero overhead compared to a sequential scan. The first thread (coloured orange in Figure 1b and 1d) executes a sequential scan, and is assisted by other threads performing a parallel scan. We call the first thread the sequential thread and the others the parallel threads. For clarity, we assume that the scan goes from left to right, but the same approach works for right-to-left scans. The sequential thread works through the array left-to-right, and can compute the output for its part in Phase 1. The parallel threads claim blocks of the array from the right and behave like a normal parallel scan (either scan-then-propagate or reduce-then-scan). During Phase 2 and Phase 3 we continue with the normal parallel scan, but only over the part of the array handled by the parallel threads. Phase 2 performs a sequential scan over the aggregates of the parallel blocks, starting with the final aggregate of the sequential thread. This phase may be performed by any thread, not necessarily the sequential thread from Phase 1. If there were no parallel threads, there is no work to be done in Phase 2 and Phase 3, and if there were parallel threads then the workload in these phases will be lower than in the original algorithms. The workload of Phase 3 can be handled by all threads, including the sequential thread of Phase 1.

Two-sided parallel loops
In Phase 1 of three-phase scans, the blocks of the array are claimed from two sides. The sequential thread claims blocks from the left and the parallel threads claim them from the right. We thus need to keep track of two indices: the number of blocks claimed from the left and from the right, respectively. We pack them into a single 32-bit integer called work_index, by using the 16 least significant bits for the left index and the 16 most significant bits for the right index. The choice of which thread is the sequential thread may depend on which scheduling algorithm is used for task parallelism. We assume that all threads call the same function and we make the decision when the first block is claimed. When a thread starts working on the scan, it checks whether no blocks are claimed yet, i.e. whether work_index is zero. If that is the case, then it directly claims the first block and acts as the sequential thread. Since multiple threads may claim blocks at the same time, we use atomic instructions to update work_index. We use atomic compare-and-swap to claim the first block and decide which thread is the sequential thread. This instruction reads a value from memory and replaces it with, in our case, 1 if it is currently equal to 0. It returns the value before the change and whether the operation was successful, i.e. whether the value was 0. This happens atomically: when two or more threads perform this at the same time, at most one can succeed. We use atomic fetch-and-add to claim later blocks. It atomically reads the value from memory at the address of the first argument, increases it by the second argument and returns the value before the change.
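The claiming logic can be sketched in Rust as follows. The types WorkIndex and Claim and the method names are ours for illustration only, and undoing an over-claim when all blocks are taken is elided:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Two-sided block claiming: the left index lives in the 16 least
// significant bits of work_index, the right index in the 16 most
// significant bits.
#[derive(Debug, PartialEq)]
enum Claim { Sequential(u32), Parallel(u32), Done }

struct WorkIndex(AtomicU32);

impl WorkIndex {
    fn new() -> Self { WorkIndex(AtomicU32::new(0)) }

    // Called when a thread wants a block. Compare-and-swap claims the very
    // first block and makes the caller the sequential thread; otherwise
    // fetch-and-add on the high bits claims a block from the right.
    fn claim(&self, block_count: u32) -> Claim {
        if self.0.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire).is_ok() {
            return Claim::Sequential(0); // first block, claimed from the left
        }
        let old = self.0.fetch_add(1 << 16, Ordering::AcqRel);
        let (left, right) = (old & 0xFFFF, old >> 16);
        if left + right < block_count {
            Claim::Parallel(block_count - 1 - right) // claimed from the right
        } else {
            Claim::Done // all blocks claimed
        }
    }

    // The sequential thread claims its next block from the left.
    fn claim_left(&self, block_count: u32) -> Claim {
        let old = self.0.fetch_add(1, Ordering::AcqRel);
        let (left, right) = (old & 0xFFFF, old >> 16);
        if left + right < block_count { Claim::Sequential(left) } else { Claim::Done }
    }
}

fn main() {
    let w = WorkIndex::new();
    assert_eq!(w.claim(3), Claim::Sequential(0)); // becomes the sequential thread
    assert_eq!(w.claim(3), Claim::Parallel(2));   // another thread takes the rightmost block
    assert_eq!(w.claim_left(3), Claim::Sequential(1));
    assert_eq!(w.claim_left(3), Claim::Done);     // all blocks claimed
}
```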
We present the implementation of these two-sided parallel loops in Algorithm 1 as a higher-order function. It takes a function seq to do the sequential work for a block, a function par for the parallel work for a block, and a function conclude to be called at the end to register the split between sequential and parallel blocks. We use this function to implement assisted scan-then-propagate and assisted reduce-then-scan. Because of their similarities, we only show the pseudocode of assisted reduce-then-scan, in Algorithm 2. Both algorithms can run in-place, by letting input and output contain the same pointer. We use sequential folds and scans, denoted by reduce_seq(input, size) and scan_seq(input, output, size, initial) respectively.

Parallel chained scans
In contrast to three-phase scans, which require at least 3n global memory operations, parallel chained scans [26] only require 2n. They split the data in blocks of a fixed size. Threads then handle these blocks in parallel, claiming them from left to right. To perform the scan in a single traversal, serial dependencies are imposed between the blocks. A thread first reduces the elements in its block to an aggregate, then waits on the prefix of the previous block to become available, combines that prefix with its aggregate and shares it as the prefix of its block. Finally, it performs a scan over this block with the prefix of the previous block.
The chained scan with decoupled look-back [19] reduces the impact of these serial dependencies. After reducing a block to an aggregate, that aggregate is directly shared. It then performs the look-back to compute the prefix. When subsequent blocks need the prefix of that block, they can take the aggregate if the prefix is not known yet and calculate the prefix themselves by continuing the look-back to the predecessor of that block. The look-back continues until the thread finds a block with a prefix, thereby reducing the latency of propagating the prefix values.
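The look-back can be illustrated with a simplified, single-threaded Rust sketch for ⊕ = +. In the real algorithm the descriptors are updated concurrently and a thread spins while it observes the X flag; here we assume every visited descriptor has already published at least its aggregate:

```rust
// Per-block descriptor status, with the published value inlined.
#[derive(Clone, Copy)]
enum Status {
    X,       // initialised: nothing published yet
    A(u64),  // aggregate published, prefix still unknown
    P(u64),  // inclusive prefix of the block published
}

// Compute the exclusive prefix of block `block_idx` by walking backwards,
// summing aggregates until a block with a published prefix is found.
fn look_back(descriptors: &[Status], block_idx: usize) -> u64 {
    let mut acc = 0u64;
    for i in (0..block_idx).rev() {
        match descriptors[i] {
            Status::P(prefix) => return prefix + acc, // look-back finishes
            Status::A(aggregate) => acc += aggregate, // continue with the predecessor
            Status::X => unreachable!("the real algorithm spins until the flag changes"),
        }
    }
    acc // reached the start of the array: the exclusive prefix is complete
}

fn main() {
    // Block 0 already knows its prefix; blocks 1 and 2 only their aggregates.
    let descriptors = [Status::P(10), Status::A(5), Status::A(7)];
    assert_eq!(look_back(&descriptors, 3), 22); // 10 + 5 + 7
    assert_eq!(look_back(&descriptors, 1), 10); // stops immediately at block 0
}
```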
For the synchronisation between threads, a descriptor object is stored per block. It contains the following values:
• The status flag, signalling the progress on this block. It can be X (initialised - no progress has been made), A (aggregate available) or P (prefix available).
• The aggregate (if available).
• The prefix (if available).
The look-back for block i starts at block i − 1. It will loop until the flag is not X. If the flag is A, then we take the aggregate of that block and continue the look-back with the preceding block. If the flag is P, then the look-back finishes.

Chained scans on CPUs
Chained scans were designed for massively parallel GPUs. A block is handled by a thread block, which is a group of threads that can communicate with low overhead. Since all threads in the thread block have their own registers, we can store all elements of a block of the array in registers. The elements only have to be read from global memory before doing the reduction and don't need to be loaded again during the scan. The large number of registers makes it possible to have a sufficiently large block size, which is important to keep the synchronisation overhead low. When implementing chained scans on CPUs, we cannot store all elements in registers, as that would lead to a very small block size. Instead we have to use the cache to have a large enough block size. One block should fit in the L1 cache; the maximal block size thus depends on the processor and the size of an element of the array. In our experiments, we set the block size to 4096. We are not aware of other work evaluating chained scans on CPUs. Our benchmarks show that this chained scan outperforms the parallel three-phase scans on CPUs as well.

Adaptive chained scans
Now that we have transformed chained scans [19,26] with decoupled look-back to support CPUs, let us look into how we can ensure that these parallel scans have zero overhead compared to sequential scans executed on one core. Chained scans cannot be executed with two-sided parallel loops, like we did with three-phase scans. The blocks must be claimed from left to right, as each block can only finish after the previous block has an aggregate. To achieve zero overhead, we introduce adaptive chained scans, which start in a sequential mode and can switch (adapt) to a parallel mode at any time during the execution. We still let threads repeatedly claim fixed-size blocks of the array. After a new block is claimed, a thread directly checks whether the previous block's prefix is available. If that is the case, then we perform a scan over this block with that prefix value. Otherwise, we perform the usual reduce, look-back and scan phases of the chained scan. The latter never happens when only a single thread works on the scan, as the prefix of the previous block will always already be available. Figure 3 shows a number of possible executions of our chained scan: one thread handling each block sequentially; one thread handling the first two blocks sequentially before being assisted by another thread and switching to the parallel mode; and three threads working on the scan from the start, with only the first block handled sequentially. The method to claim work is the same as in the original algorithm. New threads may join the computation of the scan during execution, and there is no additional synchronisation overhead when changing from the sequential to the parallel mode. We present the pseudocode for adaptive chained scans in Algorithm 3.
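The block loop of the adaptive chained scan can be sketched single-threadedly in Rust for ⊕ = +. With one thread, the previous block's prefix is always available already, so only the sequential fast path is taken; the other branch is where a thread would fall back to the reduce, look-back and scan phases. The names are illustrative, not taken from Algorithm 3:

```rust
// Descriptor of a block: no progress yet, or inclusive prefix published.
#[derive(Clone, Copy)]
enum Desc { X, P(u64) }

const BLOCK_SIZE: usize = 4;

fn adaptive_chained_scan(input: &[u64], output: &mut [u64]) {
    let block_count = (input.len() + BLOCK_SIZE - 1) / BLOCK_SIZE;
    let mut descriptors = vec![Desc::X; block_count];
    for b in 0..block_count {
        let start = b * BLOCK_SIZE;
        let end = (start + BLOCK_SIZE).min(input.len());
        // Directly after claiming a block: is the previous prefix available?
        let prev = if b == 0 {
            Some(0) // the first block starts from the identity element
        } else if let Desc::P(p) = descriptors[b - 1] { Some(p) } else { None };
        match prev {
            Some(mut acc) => {
                // Sequential mode: scan the block with the known prefix.
                for i in start..end {
                    acc += input[i];
                    output[i] = acc;
                }
                descriptors[b] = Desc::P(acc); // publish the inclusive prefix
            }
            None => {
                // Parallel mode: reduce, publish aggregate, look back, scan.
                unreachable!("never reached with a single thread");
            }
        }
    }
}

fn main() {
    let input: Vec<u64> = (1..=10).collect();
    let mut output = vec![0u64; 10];
    adaptive_chained_scan(&input, &mut output);
    assert_eq!(output, [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]);
}
```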

Benchmarks
To evaluate the performance of the different parallel scans, we benchmark prefix-sum and array compaction (also known as filter) operations. We performed the prefix-sum benchmarks over an array of n = 2^26 64-bit integers, both out-of-place and in-place. Compaction writes the elements of an input array which satisfy some predicate to an output array. The benchmark runs compaction on n = 2^28 64-bit integers. It preserves a fraction f ∈ {1/2, 1/8} of the numbers, i.e. the predicate returns true for one out of two or one out of eight values. A scan determines the indices where elements should be written to: it evaluates the predicate, converts the resulting booleans to integers and then performs a scan over those integers to compute the target indices [1].
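This construction of compaction from a scan can be sketched in Rust; the helper name `compact` is ours, and for brevity the scan here is sequential rather than one of the parallel algorithms:

```rust
// Compaction via a scan, following Blelloch's construction: the predicate's
// booleans become 0/1 integers, and the exclusive prefix sum of those
// integers gives each preserved element its destination index.
fn compact(input: &[u64], keep: impl Fn(u64) -> bool) -> Vec<u64> {
    // 1. Evaluate the predicate, converting booleans to integers.
    let flags: Vec<u64> = input.iter().map(|&x| keep(x) as u64).collect();

    // 2. Exclusive scan over the flags: the prefix is the target index.
    let mut acc = 0u64;
    let indices: Vec<u64> = flags.iter().map(|&f| { let p = acc; acc += f; p }).collect();

    // 3. Scatter every preserved element to its target index.
    let mut output = vec![0u64; acc as usize]; // acc = number of preserved elements
    for (i, &x) in input.iter().enumerate() {
        if flags[i] == 1 {
            output[indices[i] as usize] = x;
        }
    }
    output
}

fn main() {
    let input = [5u64, 2, 8, 1, 9, 4];
    // Preserve the even values (roughly the f = 1/2 case of the benchmark).
    assert_eq!(compact(&input, |x| x % 2 == 0), [2, 8, 4]);
}
```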

Experiment setup
We implemented algorithms scan-then-propagate, reduce-then-scan, a chained scan and our variants of those. These algorithms and a sequential baseline are implemented in Rust.
For comparison we include a sequential implementation in C++, a parallel implementation using oneAPI Threading Building Blocks (oneTBB) and a parallel implementation with ParlayLib [3]. The oneTBB library provides a parallel scan implementation based on reduce-then-scan, with a similar optimisation for single-threaded performance as our assisted reduce-then-scan. We elaborate more on this approach in Section 5.1. ParlayLib also has a reduce-then-scan algorithm, but has no optimisation for single-threaded performance. We performed the benchmarks on an Intel 12900 processor, which has eight performance and eight efficiency cores. Measurements up to eight threads use only performance cores, on nine to 16 threads we use both performance and efficiency cores, and on 17 to 24 threads we also use Simultaneous Multi-Threading (SMT) on the performance cores. We report the performance of prefix sums for up to 16 threads, as the performance does not change significantly for more (neither positively nor negatively). All these threads are directly spawned at the start of the computation. After a cold run, we measure the average execution time of 50 runs. The implementation of the benchmarks can be found at https://github.com/ivogabe/zero-overhead-parallel-scans.

Prefix-sum
Figure 4 presents the performance of the prefix-sum algorithms, both out-of-place and in-place, on 1 to 16 threads as the speedup over a sequential implementation written in Rust on an Intel 12900. Note that the chained scan requires additional storage per block and is thus not truly in-place. This is, however, smaller than the input size, and the output will directly be written to the input array of the scan. The ParlayLib implementation of out-of-place scans has a disadvantage as the allocation for the output happens within the measured code; the library provides no way to write the results to an already allocated array, whereas the other implementations can allocate the output array beforehand. We summarise the important conclusions here.
The classic parallel scans have a single-threaded overhead, running at around 50% to 76% of the speed of the sequential scan. They need two, three or four threads to outperform the sequential scan. Our variants and the oneTBB implementation eliminate the single-threaded overhead and run as fast as the sequential scan on a single thread.
Chained scans outperform classic three-phase scans on CPUs. Chained scans were designed for GPUs, and we are not aware of other work that evaluated chained scans on CPUs. While the advantage of chained scans might seem smaller for CPUs, as a block cannot be stored in registers, they also improve performance on CPUs compared to three-phase scans. They still require 3n data movements, but n of these will now be from the L1 cache, which is an order of magnitude faster than access to main memory.
Assisted scans have higher multi-threaded performance than their classic base algorithm. Our assisted three-phase scans do not only improve the performance on a single thread, or on few threads, but also exhibit a small speedup on more threads. On 16 threads, assisted scans are around 5% faster than their base algorithms. A small part of the array is handled in a single phase, instead of three, which still gives a small benefit on a larger number of threads.
The adaptive chained scan does not have higher multi-threaded performance than the standard chained scan. The adaptive chained scan improves performance over the chained scan only for few threads. This makes sense: it only operates in the sequential mode as long as no other thread has joined the computation.
The oneTBB scan performs similarly to the standard reduce-then-scan on multiple threads. The oneTBB scan is based on reduce-then-scan and uses work stealing to distribute the work over the cores of the processor. It has a similar optimisation for zero single-threaded overhead as our assisted reduce-then-scan. However, the advantage of this optimisation on two to four threads is smaller than in our assisted reduce-then-scan, and on six or more threads oneTBB performs similarly to the standard reduce-then-scan. As we describe in Section 4.4, on a higher number of threads oneTBB handles a slightly smaller part of the array in the sequential mode than our assisted scan, which partly explains why our assisted scan is faster.
Parallel scans cap out at around six or eight threads. After eight threads, adding more threads does not increase performance. The algorithm is then limited by the memory bandwidth, as we will show in the next section. In the out-of-place benchmark, the performance of chained scans also deteriorates eventually, which may be caused by the heterogeneous cores. The processor has performance and efficiency cores, and the latter may handle blocks so slowly that the performance cores are waiting on them during the look-back.

Memory bandwidth analysis
The performance of the parallel scans is limited by memory bandwidth. This causes them to cap out at around six or eight threads. The Intel 12900 processor has a maximal memory bandwidth of 76.8 GB/s. A scan performs at least 2n memory operations (reads and writes) of, in our benchmark, 8-byte (64-bit) values. With n = 2^26, it thus takes at least 14 ms based on the memory bandwidth. The chained scans come close to this limit, with 15 ms on 16 threads in the in-place benchmark. Scan-then-propagate requires 4n memory operations, and thus takes at least 28 ms, which is slightly less than the observed 31 ms in our benchmark. Reduce-then-scan, with 3n memory operations, takes at least 21 ms. The reduce-then-scan implementations (including oneTBB and ParlayLib) are close to this limit, with 22 ms or 23 ms.
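These lower bounds follow from a simple calculation: k·n memory operations of 8 bytes each, divided by the 76.8 GB/s bandwidth. A small sketch reproducing the numbers:

```rust
// Lower bound on the execution time in milliseconds of a scan performing
// `ops_per_element` · n memory operations of 8 bytes at 76.8 GB/s.
fn lower_bound_ms(ops_per_element: u64, n: u64) -> f64 {
    let bytes = (ops_per_element * n * 8) as f64;
    bytes / 76.8e9 * 1e3
}

fn main() {
    let n = 1u64 << 26;
    // Sequential or chained scan (2n), reduce-then-scan (3n),
    // scan-then-propagate (4n):
    assert_eq!(lower_bound_ms(2, n).round(), 14.0);
    assert_eq!(lower_bound_ms(3, n).round(), 21.0);
    assert_eq!(lower_bound_ms(4, n).round(), 28.0);
}
```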

Ratio between sequential and parallel modes
To further investigate the multi-threaded performance of assisted scans, we measured the fraction of elements handled by the sequential mode in our assisted scans and the oneTBB scan. We report these fractions on 1-16 threads in Figure 6, on an input of 2^26 elements, as the average of 50 runs. If the workload is split uniformly, 1/(thread count) of the elements would be handled by the sequential scan. The assisted scan-then-propagate is close to the uniform distribution, which is expected as the sequential thread and parallel threads have a similar workload. In the assisted reduce-then-scan and the oneTBB scan, the sequential thread is performing a scan while the parallel threads are performing a reduction. The sequential thread thus needs to perform more work per element than a parallel thread. Hence the sequential thread may handle a smaller fraction than the uniform distribution. On eight or more threads, the oneTBB scan handles a smaller part of the array in the sequential mode than assisted reduce-then-scan. It thus has a higher workload than our algorithm, as more elements are also handled in the third phase. We discuss that further in Section 5.1.

Compaction
Since compaction is a common use case for scans, we also benchmarked the scan algorithms modified to perform an array compaction (also known as array filter). In this algorithm, the scan is used to compute the destination indices of the preserved elements, as described by Blelloch [1]. Instead of writing the index to the output array, as in the previous benchmark, we use the indices to write each preserved element to the correct position in the output. The scan-then-propagate implementation uses an intermediate array to store the destination indices (relative to the block) after Phase 1, and reduce-then-scan uses an intermediate array to store the number of preserved elements per block. Figure 5 presents the results of these benchmarks. Note that ParlayLib is at a disadvantage here as it stores the results of the predicate in a separate array before doing the scan, whereas the other implementations do not store those values in memory. We summarise the conclusions here.
Our variants have no single-threaded overhead. Similar to the prefix-sum benchmarks, the compaction benchmarks show that our variants have no overhead in single-threaded execution. The standard three-phase scans and chained scans perform at around 50% of the speed of a sequential implementation.
Our variants have a small benefit over their base algorithms on multiple threads. The improvement is similar to the improvement in the prefix-sum benchmarks.
The chained scan outperforms assisted reduce-then-scan on six or more threads. Moreover, in this benchmark, adaptive chained scans are faster than the three-phase scans on six or more threads. However, on two threads for f = 1/2, or at most four threads for f = 1/8, assisted reduce-then-scan is faster. Our optimisation for single-threaded performance for reduce-then-scan also improves performance on few threads significantly, giving assisted reduce-then-scan the advantage over chained scans on a low number of threads.

Related Work
Scans have been extensively studied for parallel execution in various contexts. With sufficient cores (or lanes, in the case of vector computations), tree-based scans can operate in logarithmic time [1,2,5,12,16,18,23]. We focus on scan algorithms with significantly fewer threads than the input size [10,17]. Most developments in this setting have been made in the context of GPUs, where the number of thread blocks is significantly less than the input size. The interaction between thread blocks on GPUs translates to the interaction between threads on CPUs. The tree-based Brent-Kung scan [5] forms the base for the three-phase scan-then-propagate algorithm [22,23]. Reduce-then-scan has lower memory cost, making it better suited for the GPU architecture [8,13,20]. Chained scans reduce memory utilisation further, at the cost of increasing sequential dependencies between threads [19,26].
Copik et al. [7] discuss a scan implementation with work stealing for load balancing. They need load balancing because the computational cost of their operator (⊕) is not constant, whereas we use load balancing as we do not assume that the number of threads working on the scan is fixed at the start.
When computing the prefix sum of floating-point values, one needs to consider both performance and accuracy. Fraser et al. [11] introduce scan algorithms to reduce rounding error and discuss the trade-off between performance and accuracy.
Scans are an important building block for parallel algorithms [2] and are hence included in languages for data parallelism [6,14,25]. Whereas these contain scans as a primitive, Pizzuti et al. [21] and Šinkarovs et al. [24] allow scans to be constructed in terms of smaller primitives.

Scan in oneTBB
A similar approach to our assisted reduce-then-scan can be found in oneAPI Threading Building Blocks (oneTBB, https://oneapi-src.github.io/oneTBB/). It distributes the parallel workload of the scan over the cores of the processor with work stealing [4], by recursively splitting the task (which corresponds to a slice of the input array) into smaller tasks. It has a similar optimisation to perform the work for the first thread in a sequential mode. A task can be executed in the sequential mode if neither that task nor a parent task was stolen.
Our benchmarks show that this eliminates the sequential overhead, similar to our scan algorithms. It does not, however, reach the same multi-threaded performance as our assisted reduce-then-scan. This may be caused by the overhead of work stealing and/or recursive splitting. Our algorithm distributes work with an atomic increment. This may have lower overhead, and has more flexibility: the boundary between the sequential and parallel parts can be anywhere in the array and is not tied to the boundaries following from recursive splitting.

Conclusion
We introduced parallel scan algorithms with a mix of sequential and parallel modes. These parallel scan algorithms are robust to changes in the number of threads, and have (in practice) zero overhead compared to sequential scans when executed with a single thread. This is important when parallel scans are used in a setting with task parallelism, where one might not know in advance how many threads will work in parallel on the scan.
Our three-phase scans use two-sided parallel loops. The first thread works in one direction performing a sequential scan, while the other threads assist it by working in parallel from the other direction, hence the names assisted scan-then-propagate and assisted reduce-then-scan. Only the part of the input not handled by the first thread needs to be handled in subsequent (parallel) phases. This optimisation also improves the multi-threaded performance by around 5% (in our prefix-sum benchmark). Our assisted reduce-then-scan outperformed the standard three-phase scans and the implementations in oneTBB and ParlayLib.
Chained scans are, however, even faster in most cases. We showed that these chained scans (originally designed for GPUs) are also suitable for CPUs. Our adaptive chained scan starts sequentially, and adapts to a parallel mode when a second thread joins the computation. It has zero single-threaded overhead, while preserving the parallel performance of chained scans. Chained scans make good use of the L1 cache, and only require 2n global memory operations, compared to 3n for reduce-then-scan. When the program is not limited by memory bandwidth, our assisted reduce-then-scan may be faster. In our benchmarks this was the case for compaction on two to four threads.
Future work may be aimed at applying two-sided parallel loops to other parallel algorithms, including parallel scan algorithms that reduce rounding errors on floating-point numbers [11]. Our algorithms may also be further adapted to and evaluated on heterogeneous architectures.


Figure 1. Visualisations of three-phase scans, executed by three threads. The colours denote which thread executes which part of the computation.

Figure 3. Visualisation of possible executions of the adaptive chained scan, on varying thread counts. Time is on the vertical axis. 'Sequential' denotes a sequential scan of a block, whereas 'Reduce' and 'Scan' are the parts of the parallel chained scan for a block. Arrows represent dependencies between blocks.

A Aggregate available - The aggregate is available, but the prefix not yet.
P Prefix available - Both the aggregate and the prefix are available.
• The aggregate (if available)
• The prefix (if available)
All descriptors are initialised with X as status flag. The look-back is visualised in Figure 2.

Figure 2. The look-back of block i in a chained scan.

Algorithm 3 Adaptive chained scan. Reads and writes of flags have ordering acquire and release.
    struct Data { T* input, T* output, size, block_count, uint32 work_index, Descriptor* descriptors }
    struct Descriptor { uint32 flag, T aggregate, T prefix }
    function adaptive_chained_scan(input, output, size)
        uint32 block_count ← (size + BLOCK_SIZE − 1) / BLOCK_SIZE
        ⋮
        if block_idx ≥ d.block_count then return
        start ← block_idx * d.block_size
        len ← min(d.block_size, d.size − start)
        if !seq then
            ⋮ // Don't switch back to sequential mode
        else if block_idx = 0 then
            prefix ← … // Identity element
        else if d.descriptors[block_idx−1] …