CPMA: An Efficient Batch-Parallel Compressed Set Without Pointers

This paper introduces the batch-parallel Compressed Packed Memory Array (CPMA), a compressed, dynamic, ordered set data structure based on the Packed Memory Array (PMA). Traditionally, batch-parallel sets are built on pointer-based data structures such as trees because pointer-based structures enable fast parallel unions via pointer manipulation. When compared with cache-optimized trees, PMAs were slower to update but faster to scan. The batch-parallel CPMA overcomes this tradeoff between updates and scans by optimizing for cache-friendliness. On average, the CPMA achieves 3× faster batch-insert throughput and 4× faster range-query throughput compared with compressed PaC-trees, a state-of-the-art batch-parallel set library based on cache-optimized trees. We further evaluate the CPMA compared with compressed PaC-trees and Aspen, a state-of-the-art dynamic-graph-processing system, on a real-world application of dynamic-graph processing. The CPMA is on average 1.2× faster on a suite of graph algorithms and 2× faster on batch inserts when compared with compressed PaC-trees. Furthermore, the CPMA is on average 1.3× faster on graph algorithms and 2× faster on batch inserts compared with Aspen.


Introduction
The dynamic ordered set data type (also called a key store) is one of the most fundamental collection types and appears in many programming languages as either a built-in basic type or in standard libraries [21,33,70]. Ordered sets enable efficient scan-based operations (i.e., operations that use ordered iteration) such as range queries and maps.
Due to their role in large-scale data processing, dynamic ordered sets have been targeted for efficient batch-parallel implementations [11,33,36,42,47,70,74]. Since point operations (e.g., single-element insertion) are often not worth parallelizing due to their sublinear complexity, modern libraries parallelize batch updates that insert or delete multiple elements. Direct support for batch updates simplifies update parallelism and reduces the overall work of updates by sharing work between updates.
Existing set implementations demonstrate the importance of optimizing for the memory subsystem to achieve high performance. Almost all fast batch-parallel set implementations are built on pointer-based data structures (e.g., trees) [11,21,33,36,42,47,70,74]. Unfortunately, the main bottleneck in the scalability of these sets is memory bandwidth limitations due to pointer chasing [21,47]. Dhulipala et al. [33,36] mitigated these issues in trees by improving spatial locality via blocking and compression.
Even with these improvements, cache-optimized trees inherently leave performance on the table because the random memory accesses from following pointers are slower than contiguous memory accesses [15,60,81]. In theory, cache-friendly trees such as B-trees [13] are asymptotically optimal in the classical external-memory model [4] for both updates and scans. Empirically, array-based data structures support scans over 2× faster than tree-based data structures due to prefetching and the cost of pointer chasing [60,78].
Exploiting sequential access with PMAs. This paper introduces a work-efficient batch-parallel Compressed Packed Memory Array (CPMA) based on the Packed Memory Array (PMA) [16,17,50], a dynamic array-based data structure optimized for cache-friendliness (i.e., spatial locality). The PMA appears in domains such as graph processing [30,32,60,64,77,80,81], particle simulations [39], and computer graphics [73]. Existing PMAs suffer from low update throughput compared to batch-parallel trees because they lack direct algorithmic support for parallel batch updates [81]. At a high level, batch-update algorithms can be implemented with unions/differences [21]. Previous work [31] introduced a serial batch-update algorithm for PMAs based on local merges but stopped short of parallelization.
Supporting theoretically and practically efficient parallel unions in a PMA requires novel algorithmic development because existing parallel batch-update algorithms rely heavily on pointer adjustments, which do not easily translate to contiguous memory layouts.
As an additional optimization, this paper adds compression to PMAs. Previous work on compressed blocked trees [33,36] demonstrates the potential for compression to alleviate memory bandwidth limitations by reducing the number of bytes transferred. This paper applies the same techniques to PMAs.
Results summary. The CPMA's cache-friendliness translates into performance: the CPMA overcomes the traditional tradeoff between updates and queries in trees and PMAs. Figures 1 and 2 demonstrate that the CPMA achieves on average 3× faster batch-insert throughput and 4× faster range-query throughput compared to Parallel Compressed trees (PaC-trees) [33], a state-of-the-art batch-parallel set implementation based on cache-optimized blocked trees. (PaC-trees are implemented in a library called CPAM, but we use "PaC-trees" and "C-PaC" in this paper to avoid confusion with "CPMA." Similarly, P-trees are implemented in a library called PAM.)
To understand the improved locality of the PMA compared to PaC-trees, we measured the number of cache misses during batch inserts in both. The PMA incurs at least 3× fewer cache misses when compared to PaC-trees because the PMA takes advantage of contiguous memory access, as can be seen in Table 1.
Furthermore, to demonstrate the applicability of the CPMA, we built F-Graph, a dynamic-graph-processing system built on the CPMA, because PMAs have been used extensively in graph processing [30,32,60,64,77,80,81]. F-Graph is on average 1.2× faster on a suite of graph algorithms and 2× faster on batch updates compared to C-PaC, a dynamic-graph-processing framework based on compressed PaC-trees. F-Graph uses marginally less space to store the graphs when compared to C-PaC. We also evaluate Aspen [36], a state-of-the-art dynamic-graph-processing framework based on compressed blocked trees. We find that F-Graph is on average 1.3× faster on graph algorithms, 2× faster on batch inserts, and uses about 0.6× the space when compared to Aspen.

Contributions
• The design and analysis of a theoretically efficient parallel batch-update algorithm for PMAs (and for CPMAs).
• An implementation of the PMA and CPMA with the parallel batch-update algorithm in C++.
• An evaluation of the PMA/CPMA compared with PaC-trees and P-trees.
• An evaluation of F-Graph, a dynamic-graph-processing system based on the CPMA, compared to C-PaC and Aspen.

Related work
This section describes how this work relates to prior work in parallel data structures. Specifically, it discusses concurrent versus batch-parallel data structures and their use cases.

Table 2. Asymptotic bounds for operations in a CPMA and compressed PaC-tree. We use B to denote the cache-line size, k to denote the size of the batch, s to denote the number of elements returned by the range query, and b to denote the user-specified tree node block size in the PaC-tree (called B in [33]). Bounds with † are amortized. All bounds are Ω(1).
There is extensive work on concurrent data structures such as trees [6,9,22,24,41,52,56,58], skip lists [49,61], and PMAs [81]. Concurrent data structures are mostly orthogonal to this paper, which focuses on batch-parallel data structures. Existing concurrent trees typically support mostly point operations (i.e., linearizable inserts/deletes and finds), whereas the CPMA in this paper also supports range queries and maps (and associated operations such as filter and reduce). Some recent work studies range queries in concurrent trees [12,44]. On the other hand, concurrent trees support asynchronous updates, which are more general than batch updates because batch updates require a single writer. Therefore, fairly comparing concurrent and batch-parallel data structures on update throughput is challenging, as their update functionalities are different.
The PAM paper [70] demonstrated that batch-parallel binary trees can achieve orders of magnitude higher insertion throughput compared to concurrent cache-optimized trees [56,76,88].
Batch-parallel and concurrent data structures are suited for different use cases. For example, batch-parallel data structures have recently become popular for both practical [36,40,53,55] and theoretical [1,37,38,46,59,74] dynamic-graph algorithms and containers. They are well-suited for applications with a large number of requests in a short time, such as stream processing or loop join [51]. In contrast, concurrent data structures have been used extensively in key-value stores for online transaction processing applications that emphasize point operations such as put and get [28].

Packed Memory Array
This section reviews the Packed Memory Array (PMA) [16,50] data structure to understand the improvements in later sections. First, it introduces the theoretical models used to analyze the PMA. It then describes the PMA's structure and supported operations. Finally, it details how to perform point updates in a PMA, which forms the basis for the batch-update algorithm in Section 4.
Analysis method. Table 2 summarizes the bounds for key parallel operations in the CPMA and compressed PaC-tree in the work-span model [29, Chapter 27] and the external-memory model [4]. The work is the total time to execute the entire algorithm in serial. The span is the longest serial chain of dependencies in the computation. In the work-span model with binary forking, a parallel for loop with n iterations and O(1) work per iteration has O(n) work and O(log(n)) span.
The external-memory model introduces the cache-line-size parameter B and measures algorithm cost in terms of cache-line transfers.
Design and operations. The PMA maintains elements in sorted order in an array with (a constant factor of) empty spaces between its elements. Specifically, a PMA with N elements uses m = Θ(N) cells. The empty cells enable fast updates by reducing the amount of data movement necessary to maintain the elements in order. The primary feature of a PMA is that it stores data in contiguous memory, which enables fast cache-efficient iteration through the elements.
A PMA exposes four operations:
• insert(x): inserts element x into the PMA.
• delete(x): deletes element x from the PMA, if it exists.
• search(x): returns a pointer to the smallest element in the PMA that is at least x.
• range_map(start, end, f): applies the function f to all elements in the range [start, end).
In this paper, we use the terms "range map" and "range query" interchangeably. Range queries can be implemented with the more general range map, but we use the more popular term "range query." The PMA supports point queries (search) in O(log(N)) cache-line transfers and updates in O((log²(N))/B + log(N)) (amortized and worst-case) cache-line transfers [16-18, 82-84]. The PMA supports efficient iteration of the elements in sorted order, enabling fast scans and range queries. Specifically, the PMA supports the range_map operation on s elements in O(log(N) + s/B) transfers. It implements range_map with a search for the first element in the range, then a scan until the end of the range. PMAs are asymptotically worse than PaC-trees for all inserts/deletes and match them for search and range queries (Table 2).
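To make this interface concrete, the following toy sketch models the four operations in C++. For brevity it is backed by a plain sorted std::vector rather than a real PMA with empty cells, so only the operation semantics match the text; the class and method names are ours, not the paper's API.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy model of the PMA's set interface. A real PMA keeps empty cells
// between elements; here a dense sorted vector stands in for it.
class ToyOrderedSet {
  std::vector<uint64_t> data_;  // sorted, duplicate-free
public:
  void insert(uint64_t x) {
    auto it = std::lower_bound(data_.begin(), data_.end(), x);
    if (it == data_.end() || *it != x) data_.insert(it, x);
  }
  // delete(x) in the paper; "delete" is a C++ keyword, so we rename it.
  void remove(uint64_t x) {
    auto it = std::lower_bound(data_.begin(), data_.end(), x);
    if (it != data_.end() && *it == x) data_.erase(it);
  }
  // Smallest element that is at least x, or nullptr if none exists.
  const uint64_t* search(uint64_t x) const {
    auto it = std::lower_bound(data_.begin(), data_.end(), x);
    return it == data_.end() ? nullptr : &*it;
  }
  // Apply f to all elements in [start, end) in sorted order.
  template <typename F>
  void range_map(uint64_t start, uint64_t end, F f) const {
    auto it = std::lower_bound(data_.begin(), data_.end(), start);
    for (; it != data_.end() && *it < end; ++it) f(*it);
  }
};

int main() {
  ToyOrderedSet s;
  for (uint64_t x : {5, 1, 9, 3}) s.insert(x);
  uint64_t sum = 0;
  s.range_map(2, 10, [&](uint64_t x) { sum += x; });  // 3 + 5 + 9
  std::printf("sum in [2,10) = %llu\n", (unsigned long long)sum);
}
```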
The PMA defines an implicit binary tree with leaves of size Θ(log(N)) cells. That is, the implicit tree has Θ(N/log(N)) leaves and height Θ(log(N/log(N))). Every node in the PMA tree has a corresponding region of cells. Each leaf i ∈ {0, ..., N/log(N) − 1} has the region [i·log(N), (i+1)·log(N)), and each internal node's region encompasses all of the regions of its descendants. The density of a region in the PMA is the fraction of occupied cells in that region. Each node of the PMA tree has an upper density bound that determines the allowed number of occupied cells in that node. If an insert causes a node's density to exceed its upper density bound, the PMA enforces the density bound by redistributing elements with that node's sibling, equalizing the densities between them. The density bound of a node depends on its height.
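The following minimal sketch shows one common way to interpolate height-dependent density bounds. The constants are illustrative (leaves up to 90% full, matching the figures in this paper, and a root bound of 0.5 from the broader PMA literature); the paper does not specify its exact constants here.

```cpp
#include <cstdio>

// Illustrative density-bound schedule for the implicit PMA tree: looser
// bounds at the leaves, tighter bounds toward the root, interpolated
// linearly with depth. Constants are assumptions, not the paper's.
double upper_density_bound(int depth, int tree_height) {
  const double root_bound = 0.5, leaf_bound = 0.9;  // depth 0 = root
  return root_bound +
         (leaf_bound - root_bound) * (double)depth / (double)tree_height;
}

int main() {
  const int height = 5;
  for (int d = 0; d <= height; ++d)
    std::printf("depth %d: allow up to %.0f%% occupancy\n",
                d, 100.0 * upper_density_bound(d, height));
}
```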
Updating a PMA. A PMA maintains spaces between elements for efficient updates. Since deletions are symmetric to insertions, we omit the discussion of deletes.
As shown in Figure 3, the four main steps in a PMA insertion are as follows:
1. Search for the location that the element should be inserted into to maintain global sorted order.
2. Place the element at that location, potentially shifting some elements to make room.
3. If the leaf that was inserted into violates its density bound, count the density of nodes in the PMA to find a sibling to redistribute into.
4. If necessary, redistribute elements to maintain the correct distribution of empty spaces in the PMA.
Resizing a PMA. If the leaf-to-root traversal after an insert reaches the root and finds that its density bound has been violated, the entire PMA is copied to a larger array and the elements are distributed evenly among the leaves of the new PMA.

Parallel Batch Updates in a PMA
Batching updates in a PMA improves throughput by sharing work between updates and simplifying parallelization. This section describes how to apply batch inserts in a PMA (batch deletes are symmetric).
We present a work-efficient parallel batch-insert algorithm for PMAs. A work-efficient parallel algorithm performs at most a constant factor more operations than the serial algorithm for the same problem. Serially inserting k elements into a PMA with N elements takes O(k(log(N) + (log²(N))/B)) cache-line transfers. Unfortunately, a naive algorithm that parallelizes over the inserts is not work-efficient because it may recount densities to determine which regions to redistribute. Supporting work-efficient batch inserts requires careful algorithm design to avoid redundant work. Finally, we conclude the section with a microbenchmark that demonstrates the serial and parallel scalability of batch inserts in PMAs.
The batch-insert problem for PMAs takes as input a PMA with N elements and a batch of k sorted elements to insert. An unsorted batch can be converted into a sorted batch in O(k log(k)) work.
The optimal strategy for applying a batch of updates depends on the size of the batch. At one extreme, if k is small (e.g., k < 100), the overheads from the batch-update algorithm outweigh the benefits, so point updates are more efficient than batch updates. At the other extreme, if k is large (e.g., k ≥ N/10), the optimal algorithm is to rebuild the entire data structure with a linear two-finger merge. The batch-insert algorithm for PMAs performs local merges to address the intermediate case between these two extremes.
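A sketch of this size-based dispatch, using the illustrative cutoffs from the text (the function and enum names are ours):

```cpp
#include <cstddef>
#include <cstdio>

// Pick an update strategy from the batch size k and the current PMA
// size n. The thresholds (100 and n/10) are the examples given in the
// text, not tuned constants.
enum class BatchStrategy { PointInserts, LocalMerges, FullRebuild };

BatchStrategy choose_strategy(std::size_t k /* batch size */,
                              std::size_t n /* elements in the PMA */) {
  if (k < 100) return BatchStrategy::PointInserts;    // batching overhead dominates
  if (k >= n / 10) return BatchStrategy::FullRebuild; // linear two-finger merge
  return BatchStrategy::LocalMerges;                  // the algorithm in this section
}

int main() {
  std::printf("k=50, n=1000000 -> strategy %d\n",
              (int)choose_strategy(50, 1000000));      // PointInserts
  std::printf("k=500000, n=1000000 -> strategy %d\n",
              (int)choose_strategy(500000, 1000000));  // FullRebuild
}
```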
The parallel batch-insert algorithm for PMAs applies a batch of updates efficiently in the case where neither point insertions nor a complete merge are the best options. Therefore, we focus on the case where k = ω(1) and k = o(N).
The batch-insert algorithm consists of three phases: (1) a batch-merge phase, (2) a counting phase, and (3) a redistribute phase. The phases proceed in serial, but each phase is parallelized internally. The phases adapt the steps of a PMA insertion described in Section 3 to the batch setting. The batch-merge phase combines the search and place steps, and the counting and redistribute phases generalize their counterparts from point inserts.
At a high level, the parallel batch-merge phase divides the PMA and the batch recursively into independent sections and operates independently on those sections. Each recursive step first merges elements from the batch into one leaf of the PMA and then recurses down on the remaining left and right portions of the batch. This recursive merge phase is inspired by recursive join-based algorithms in batch-parallel trees [2,3,21,33,36,70]. Existing join algorithms for tree layouts rely on pointer adjustments, which do not easily translate into array layouts.
At each step of the recursion, we perform a PMA search for the midpoint (median) of the current batch and merge the relevant elements from the batch destined for that leaf into the target leaf. Finding the bounds in the batch of all elements in that leaf takes two searches (one backwards and one forwards).
Once the endpoints have been found, we fork the merge of all relevant elements from the current batch into the target leaf.
If the number of elements destined for a leaf is sufficiently large, we use a parallel merge algorithm with load-balancing guarantees to achieve parallelism [5]. Finally, we recurse on the remaining left and right sides of the batch in parallel.
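The following runnable toy condenses the recursive batch-merge phase described above. The PMA is modeled as fixed-range sorted leaves, and the parallel forks (of the leaf merge and of the two recursive calls) are elided, so this shows only the recursion structure, not the paper's implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy "PMA": leaf i owns keys in [i*RANGE, (i+1)*RANGE). Overflow
// buffers, density bounds, and parallelism are all elided.
struct ToyPma {
  static constexpr uint64_t RANGE = 100;
  std::vector<std::vector<uint64_t>> leaves;
  int leaf_containing(uint64_t key) const { return (int)(key / RANGE); }
  void merge_into_leaf(int leaf, const uint64_t* first, const uint64_t* last) {
    auto& l = leaves[leaf];
    l.insert(l.end(), first, last);
    std::sort(l.begin(), l.end());  // stand-in for a proper linear merge
  }
};

void batch_merge(ToyPma& pma, const uint64_t* batch, std::size_t k) {
  if (k == 0) return;
  std::size_t mid = k / 2;                    // median of the current batch
  int leaf = pma.leaf_containing(batch[mid]); // one search into the PMA
  std::size_t lo = mid, hi = mid + 1;         // run destined for this leaf
  while (lo > 0 && pma.leaf_containing(batch[lo - 1]) == leaf) --lo;
  while (hi < k && pma.leaf_containing(batch[hi]) == leaf) ++hi;
  pma.merge_into_leaf(leaf, batch + lo, batch + hi); // forked in the paper
  batch_merge(pma, batch, lo);                // left remainder (in parallel)
  batch_merge(pma, batch + hi, k - hi);       // right remainder (in parallel)
}

int main() {
  ToyPma pma{{{10, 20}, {110, 150}, {250}}};
  std::vector<uint64_t> batch = {15, 30, 120, 260, 270};  // sorted
  batch_merge(pma, batch.data(), batch.size());
  for (std::size_t i = 0; i < pma.leaves.size(); ++i)
    std::printf("leaf %zu has %zu elements\n", i, pma.leaves[i].size());
}
```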
Lemma 1. Given a batch of k sorted elements, the work of the batch-merge phase is O(k log(N)), and the span is O(log(k) log(N)).
Proof. The height of the recursion is O(log(k)), and each search in the PMA takes O(log(N)) work. Finding the first and last element in the batch destined for the leaf takes O(log(k)) work with exponential searches, which is smaller than O(log(N)). Therefore, the work and span of finding the bounds for the recursion is O(log(k) log(N)).
In the worst case for the work, each element in the batch could be destined for a different leaf, so the total work of merging k elements into k leaves is O(k log(N)), which is asymptotically larger than the work to perform the recursion.
The worst-case span for any one of the merges is O(log(k)), so the total worst-case span of all the merges is O(log²(k)), which is less than O(log(k) log(N)). □

When merging elements from the batch into a leaf, the target leaf may overflow because it does not have enough space to hold all the elements destined for it. To resolve this issue, the batch merge copies all elements into separate memory and keeps the size of the extra memory as well as a pointer to it in the leaf. This extra data is then cleaned up after the merge during the redistribution phase. Figure 4 illustrates a batch merge, leaf overflow, and subsequent redistribution.
During the recursive batch merge, we keep track of all modified PMA leaves in a thread-safe set for use in the counting and redistribution phases.
Counting phase. After merging all elements into the PMA, the batch-insert algorithm performs a counting phase where it finds the PMA nodes that violate their density bounds for later redistribution. The O((log²(N))/B) work bound for point insertions in the PMA comes from amortized analysis of the counting and redistribution phases [50], so efficiently counting and redistributing in the batch-parallel setting is critical to achieving work-efficiency. To understand how to avoid redundant work, we start with a presentation of an efficient serial algorithm and describe how simply parallelizing this algorithm can lead to extra work. We then present our work-efficient parallel algorithm.
An efficient serial batch algorithm must count each required cell exactly once. The algorithm starts with the set of leaves that were touched in the batch-merge phase. The ancestors of these leaves in the implicit PMA tree may need to be redistributed. The serial algorithm checks every leaf in turn. If a leaf violates its density bound, the algorithm then walks up the implicit PMA tree from that leaf until it finds a node that respects its density bound. Finding the density of a node involves counting all of its descendants. By caching every result and checking the cached results before counting, the serial algorithm counts every required cell exactly once.
Unfortunately, simply parallelizing this serial algorithm over the leaves is not work-efficient because the algorithm may recount PMA nodes whose densities have not been cached yet. Therefore, the parallel algorithm may recount the same region more than a constant number of times if many leaves share the same ancestor to be redistributed.
To resolve this issue, we devise a new work-efficient parallel counting algorithm that counts each required PMA cell exactly once. Figure 5 presents a worked example of this counting algorithm. The counting algorithm takes as input the leaves that were modified in the batch merge and outputs the set of PMA nodes that need to be redistributed.
This parallel algorithm avoids redundant work by processing the levels serially from the leaves to the root and saving any counts for later lookups by nodes in higher levels. At each level, we maintain a thread-safe set of nodes that need to be counted. This set is initialized with the leaves that were affected by the batch merge. The levels are processed serially, but all nodes at each level are processed in parallel. If any node at some level h exceeds its density bound, the algorithm adds its parent to the set of nodes to be counted at level h+1. The algorithm terminates when there are no more nodes to be counted or when it has reached the PMA root.
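The toy program below illustrates this level-by-level counting on a small example. Counts are memoized so no region is summed twice; the levels run serially, while the loop over nodes within a level is the part that runs in parallel in the paper. The single density bound of 0.8 and the leaf counts are illustrative.

```cpp
#include <cstdio>
#include <functional>
#include <map>
#include <set>
#include <vector>

int main() {
  // Toy instance: 8 leaves with these element counts, 4 cells per leaf.
  std::vector<long> leaf_count = {4, 4, 1, 2, 3, 1, 0, 2};
  const long leaf_cap = 4;
  const int height = 3;      // log2(number of leaves)
  const double bound = 0.8;  // one bound for all nodes, for simplicity

  // Memoized count for node (h, i), covering leaves [i*2^h, (i+1)*2^h).
  std::map<std::pair<int, int>, long> cache;
  std::function<long(int, int)> count = [&](int h, int i) -> long {
    auto it = cache.find({h, i});
    if (it != cache.end()) return it->second;  // cached: never recount
    long c = (h == 0) ? leaf_count[i]
                      : count(h - 1, 2 * i) + count(h - 1, 2 * i + 1);
    return cache[{h, i}] = c;
  };

  std::set<int> level = {0, 1};  // leaves touched by the batch merge
  for (int h = 0; h <= height && !level.empty(); ++h) {
    std::set<int> next;
    for (int i : level) {        // processed in parallel in the paper
      double density = (double)count(h, i) / (double)(leaf_cap << h);
      if (density <= bound) {
        std::printf("node (h=%d, i=%d): density %.2f, redistribute here\n",
                    h, i, density);
      } else if (h < height) {
        next.insert(i / 2);      // parent must be counted at level h+1
      } else {
        std::printf("root violates its bound: grow the whole array\n");
      }
    }
    level = std::move(next);
  }
}
```

On this input, leaves 0 and 1 are full, their parent (1, 0) is also over the bound, and the grandparent (2, 0) respects its bound (density 11/16), so the region of leaves 0-3 is the one redistributed. Note that counting (2, 0) reuses the cached count of (1, 0).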
Lemma 2. The parallel counting algorithm is work-efficient.
Proof. The parallel counting algorithm caches results from each counted region as it processes the levels of the PMA tree. Due to the serial iteration of levels, all nodes to be counted at a level are counted in parallel. When a node needs to be counted, no other node at that level will need to count any of its cells, because the regions of distinct nodes at the same level are disjoint, and any of its descendants that were already counted are found in the cache. Therefore, every required cell is counted exactly once. □

Lemma 3. The worst-case span of the counting phase is O(log²(N)).

Proof. The counting algorithm serially iterates over at most O(log(N)) levels of the PMA because the height of the PMA tree is bounded by O(log(N)). In the worst case, for each level h, the algorithm may have to recurse down h levels to count, so the worst-case span of traversing the PMA tree levels is Σ_{h=1}^{O(log(N))} O(h) = O(log²(N)). The PMA leaves are O(log(N)) cells each, so the total span of counting is O(log²(N)). □

Redistribution phase. Once the counting phase has identified the correct regions to redistribute, the PMA redistributes regions by performing two copies of the relevant data. The first copy packs the regions to redistribute from the PMA into a buffer, and the second copy equalizes the densities in the regions to redistribute by spreading the elements evenly from the buffer into the target leaves.
Lemma 4. Given a batch of k sorted elements, the work of the redistribute phase is O((k log²(N))/B) amortized cache-line transfers, and the worst-case span is O(log²(N)).
Proof. The work of the redistribute phase is bounded above by the work of the counting phase because the number of elements that need to be redistributed is at most the number of elements that need to be counted. From Lemma 2, the counting step is work-efficient, so it takes no more than the serial amortized work bound of O((k log²(N))/B) cache-line transfers. The span of the redistribute phase is bounded above by O(log²(N)) because there are at most k independent sections to redistribute, and redistributing each one involves a parallel copy in and out, which has span O(log(N)). □

Putting it all together. Analyzing the entire batch-insert algorithm just involves summing the work and span of the merge, counting, and redistribute phases of the batch-insert algorithm.

Theorem 5. The batch-insert algorithm for PMAs inserts a batch of k sorted elements in O(k(log(N) + log²(N)/B)) amortized work and O(log²(N)) worst-case span.

Batch insert microbenchmark. Table 3 reports the throughput of batch inserts as a function of batch size (using the setup described in Section 6). The PMA under test starts with 100 million elements and we add an additional 100 million elements. On one core, the batch-insert algorithm is up to 3× faster than point inserts when the batch is large. Batch inserts in a PMA save computation over point insertions by reducing the number of searches, the length of each search, and the number of redistributions. The batch algorithm performs only one binary search per updated leaf because the remaining elements in the batch destined for that leaf are merged in directly. Additionally, the searches are smaller because they often search only a subsection of the PMA. Finally, the counting algorithm combines ancestor ranges to redistribute in the PMA, potentially skipping levels of redistribution.
Furthermore, Table 3 shows that batch inserts in a PMA achieve parallel speedup of up to about 19× on 64 cores (128 threads) as the batch size grows. The main bottleneck in the parallel scalability of batch updates is memory bandwidth. Section 5 mitigates these issues by adding compression to the batch-parallel PMA to reduce data movement.

Compressed Packed Memory Array
This section introduces, analyzes, and empirically evaluates the Compressed Packed Memory Array (CPMA). Adding compression does not affect the PMA's asymptotic bounds. Empirically, the CPMA achieves better parallel scalability than the PMA because the parallel operations are memory-bound, so the CPMA's smaller size makes better use of memory bandwidth.

Data compression techniques.
The CPMA exploits the fact that elements are stored in sorted order in a PMA to apply delta encoding [69] to the elements. Delta encoding stores differences (deltas) between sequential elements rather than the full elements. Given a sorted array A of n elements, delta encoding results in a new array A′ such that A′_0 = A_0 and A′_i = A_i − A_{i−1} for all i = 1, 2, ..., n−1. These deltas can then be stored in byte codes, which store an integer as a series of bytes [19,86]. Each byte uses one bit as a continue bit, which indicates whether the following byte starts a new element or is a continuation of the previous element.
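As a concrete illustration, the following sketch delta-encodes a sorted array into byte codes and decodes it back. Each byte here carries 7 payload bits and uses its high bit as the continue bit; this is a standard varint layout, and the paper's exact bit assignment may differ.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Encode a sorted sequence as byte-coded deltas. The high bit of each
// byte is the continue bit: set when more bytes of this delta follow.
void encode(const std::vector<uint64_t>& sorted, std::vector<uint8_t>& out) {
  uint64_t prev = 0;
  for (uint64_t x : sorted) {
    uint64_t delta = x - prev;  // difference from the previous element
    prev = x;
    do {
      uint8_t byte = delta & 0x7f;
      delta >>= 7;
      if (delta) byte |= 0x80;  // continue bit: more bytes follow
      out.push_back(byte);
    } while (delta);
  }
}

// Decode byte-coded deltas back into the original sorted values.
std::vector<uint64_t> decode(const std::vector<uint8_t>& in) {
  std::vector<uint64_t> out;
  uint64_t prev = 0, delta = 0;
  int shift = 0;
  for (uint8_t byte : in) {
    delta |= (uint64_t)(byte & 0x7f) << shift;
    if (byte & 0x80) { shift += 7; continue; }  // continuation byte
    prev += delta;              // undo the delta encoding
    out.push_back(prev);
    delta = 0; shift = 0;
  }
  return out;
}

int main() {
  std::vector<uint64_t> xs = {1000, 1007, 1309, 500000};
  std::vector<uint8_t> bytes;
  encode(xs, bytes);
  std::printf("%zu values -> %zu bytes\n", xs.size(), bytes.size());
  for (uint64_t x : decode(bytes)) std::printf("%llu ", (unsigned long long)x);
  std::printf("\n");
}
```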
We use delta encoding with byte codes in the CPMA because they are fast to decode and achieve most of the memory savings of shorter codes [19,36,67].
CPMA structure. The CPMA maintains the same implicit binary tree structure as a PMA and compresses the leaves. Just like in the PMA, a CPMA with N elements and m = Θ(N) cells maintains leaves of size Θ(log(N)) in order to achieve its asymptotic time bounds. The CPMA applies the packed-left optimization, which packs elements to the left in PMA leaves, for ease of compression [81]. Packing the elements to the left does not affect the PMA's (or CPMA's) asymptotic bounds because the bounds depend only on the density of the elements in the PMA leaves [81].
A CPMA leaf stores its head, or its first element, uncompressed, and stores subsequent elements compressed with delta encoding and byte codes. That is, in a CPMA with elements of type T, the first sizeof(T) bytes in each leaf contain the uncompressed head. All following cells take 1 byte each rather than sizeof(T) bytes.
The density bounds in a CPMA count byte density rather than element density. The density of a CPMA node is the ratio of the number of filled bytes to the total number of bytes available in the node.

CPMA Operations
The CPMA maintains the same asymptotic bounds as the PMA for point queries (searches) and point updates. Furthermore, compression does not affect concurrency schemes for PMAs [81] or the batch-update algorithm from Section 4.
The PMA's asymptotic bounds are derived from its implicit tree structure and related density bounds. The main change in the CPMA is the compression of each individual leaf, which does not affect the high-level implicit tree structure.
The uncompressed head allows for efficient searching to find which leaf contains an element. The compressed leaves in the CPMA do not affect the high-level tree structure or searches because each leaf can still be processed independently in O(log(N)) work.
Point queries. A CPMA on N elements supports point queries in O(log(N)) cache-line transfers. There are two steps in a point query in a CPMA: a binary search on leaf heads, and then a pass through the leaf at the end of the binary search to find the closest element. There are O(N/log(N)) leaves, so a binary search takes O(log(N)) cache-line transfers. The leaf heads are stored uncompressed, so there is no additional cost to perform the binary search on leaf heads compared to a search in a PMA. After finding the target leaf, the CPMA performs a search within that leaf. The size of each leaf is bounded by O(log(N)), so it takes O(log(N)/B) cache-line transfers to search a compressed leaf.
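The following runnable toy sketches a point query on this layout: a binary search over the uncompressed heads followed by a single decoding pass through one leaf. Leaves are modeled as (head, deltas) pairs rather than packed byte codes, and the names are ours.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy model of a CPMA leaf: an uncompressed head followed by deltas.
struct ToyLeaf { uint64_t head; std::vector<uint64_t> deltas; };

// Return the smallest stored element that is at least `key`.
uint64_t search(const std::vector<ToyLeaf>& leaves, uint64_t key) {
  // Binary search on heads: find the last leaf whose head is <= key.
  std::size_t lo = 0, hi = leaves.size();
  while (hi - lo > 1) {
    std::size_t mid = (lo + hi) / 2;
    if (leaves[mid].head <= key) lo = mid; else hi = mid;
  }
  if (leaves[lo].head >= key) return leaves[lo].head;
  // Single pass through the compressed leaf, reconstructing elements.
  uint64_t cur = leaves[lo].head;
  for (uint64_t d : leaves[lo].deltas) {
    cur += d;
    if (cur >= key) return cur;
  }
  // Not in this leaf: the answer is the next leaf's head (if any).
  return lo + 1 < leaves.size() ? leaves[lo + 1].head : UINT64_MAX;
}

int main() {
  // Elements: 10, 15, 20 | 40, 42, 50 | 90, 100
  std::vector<ToyLeaf> leaves = {{10, {5, 5}}, {40, {2, 8}}, {90, {10}}};
  std::printf("successor of 17 = %llu\n",
              (unsigned long long)search(leaves, 17));  // prints 20
}
```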
Point updates. A CPMA on N elements supports point updates in O(log(N) + (log²(N))/B) cache-line transfers. We will focus on the case of inserts, since deletes are symmetric to inserts. Figure 6 presents a worked example of the same insert in a PMA and a CPMA.
The CPMA follows the same four steps of a PMA point update described in Section 3. We will focus on steps (2)-(4) (place, count, and redistribute), since we already analyzed point queries. After performing a point query to find the target leaf, the CPMA places an element by adding a delta to the leaf and updating the following delta. Updating the leaf can be done in a single pass, which modifies up to O(log(N)) cells because the size of the leaf is bounded by O(log(N)). The CPMA matches the PMA's asymptotic bound on the number of cells modified during the place step.
Once the target leaf has been updated, the CPMA traverses up the leaf-to-root path and redistributes any nodes that violate their density bounds, just as in a PMA. The amortized insert time bound comes from the checking and maintenance of density bounds, which the CPMA supports in the same asymptotic cache-line transfers as a PMA. Just as in a PMA, counting and redistributing in a CPMA takes cache-line transfers linear in the size of the region.
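The toy below sketches the place step on the same (head, deltas) leaf model as the point-query sketch above: it splices in a new delta and shrinks the following delta so that all later elements are untouched. Density checking, byte recoding, and redistribution are elided.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct ToyLeaf { uint64_t head; std::vector<uint64_t> deltas; };

// Insert x into one compressed leaf (set semantics, single pass).
void leaf_insert(ToyLeaf& leaf, uint64_t x) {
  if (x < leaf.head) {  // new minimum: x becomes the uncompressed head
    leaf.deltas.insert(leaf.deltas.begin(), leaf.head - x);
    leaf.head = x;
    return;
  }
  uint64_t cur = leaf.head;
  for (std::size_t i = 0; i < leaf.deltas.size(); ++i) {
    uint64_t next = cur + leaf.deltas[i];
    if (x == cur || x == next) return;  // already present
    if (x < next) {
      leaf.deltas[i] = next - x;        // update the following delta
      leaf.deltas.insert(leaf.deltas.begin() + i, x - cur);  // new delta
      return;
    }
    cur = next;
  }
  if (x != cur) leaf.deltas.push_back(x - cur);  // new maximum
}

int main() {
  ToyLeaf leaf{10, {10, 20}};  // elements: 10, 20, 40
  leaf_insert(leaf, 25);       // elements: 10, 20, 25, 40
  uint64_t cur = leaf.head;
  std::printf("%llu", (unsigned long long)cur);
  for (uint64_t d : leaf.deltas)
    std::printf(" %llu", (unsigned long long)(cur += d));
  std::printf("\n");
}
```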
Parallelizing the CPMA. The compression in the CPMA does not conflict with existing lock-based multiple-writer parallelism for PMAs [81] because the locking scheme depends on the implicit PMA tree structure and locking at a leaf granularity. Furthermore, compression does not affect the theoretical performance of concurrent PMAs because the CPMA also supports single-pass operations within leaves. The CPMA supports multiple readers because reads are non-modifying.
Finally, the batch-update algorithm in the CPMA is identical to the batch-update algorithm for PMAs described in Section 4. The design and analysis of the batch-update algorithm also depend only on single-pass operations on leaves.

Scalability analysis
We measure the scalability of both the PMA and CPMA on batch inserts and range queries using the setup described in Section 6. In each experiment, the PMA and CPMA start with 100 million elements. In each batch-insert experiment, we add 100 batches of 1 million elements each. In each range-query experiment, we perform 100,000 range queries in parallel, where each query is expected to return about 1.5 million elements. We measure the effect of core count on the performance of the PMA/CPMA. The extended version of the paper contains the raw data.
Figure 7 shows that the CPMA achieves better scalability than the PMA on batch inserts because compression maximizes the CPMA's usage of available memory bandwidth. The PMA achieves up to 19× speedup and the CPMA achieves up to 43× speedup for batch inserts on 64 cores (128 threads). The CPMA achieves better batch-insert throughput compared to the PMA when the number of cores is sufficiently large (at least 16). When the number of cores is too small, the additional computational overhead from compression outweighs the benefits of decreased memory traffic.
Similarly, Figure 8 demonstrates that the PMA achieves about 41× speedup and the CPMA achieves about 118× speedup for range queries on 64 cores (128 threads). The PMA's/CPMA's scalability on range queries is much better than its scalability on updates because the queries proceed in parallel and do not need to coordinate. The PMA's range-query throughput in terms of bytes transferred per second reaches the memory bandwidth of the machine, but its overall range-query throughput is limited because of the large size per element. The CPMA alleviates the memory bandwidth issue by decreasing the size per element, enabling it to process more elements per byte transferred.

Evaluation
To measure the improvements described in Sections 4 and 5, this section evaluates the PMA/CPMA compared to uncompressed/compressed PaC-trees [33] and P-trees [70] on range queries, batch inserts, and space usage. We use the terms "U-PaC" and "C-PaC" to denote the uncompressed and compressed versions of PaC-trees, respectively, in this section.
This section then evaluates the CPMA, C-PaC, and Aspen [36], a state-of-the-art dynamic-graph-processing system based on compressed trees, on an application benchmark of dynamic-graph processing because both PMAs and trees appear frequently as dynamic-graph containers [30,32,33,36,60,64,77,80,81]. We introduce F-Graph, a system for processing dynamic graphs that uses the CPMA as its underlying data structure.
Additional experiments and data tables can be found in an extended version of this paper [79].
Microbenchmarks summary. At a high level, the CPMA achieves the best of both worlds in terms of performance. On average, it achieves 4× faster range-query throughput and 3× faster batch-insert throughput when compared to compressed PaC-trees. According to the theoretical prediction in Table 2, PaC-trees asymptotically match or beat CPMAs for all operations. However, in practice, the CPMA supports both fast queries and updates due to its locality. Finally, CPMAs use about the same space as compressed PaC-trees, but they use less than half the space of uncompressed PMAs. When compared with PAM, an uncompressed data structure, the uncompressed PMA achieves 1.5× faster throughput for batch insertions and 20× faster range-query throughput.
Graph benchmark summary. For graph workloads, we found that F-Graph is on average 1.2× faster on a suite of graph algorithms, achieves 2× faster throughput for batch updates, and uses marginally less space to store the graphs compared to C-PaC. Furthermore, F-Graph is on average 1.3× faster on graph algorithms, achieves 2× faster throughput for batch updates, and uses 0.6× the space to store the graphs compared to Aspen.

Systems setup. We implemented the PMA and CPMA as a C++ library on top of the search-optimized PMA [78] and compiled them with clang++-14. To match the parallelization method from the PaC-trees library, we parallelized the PMA/CPMA with the Parlaylib toolkit [20]. The PMA and CPMA are currently implemented as key stores (sets). The code can be found at https://github.com/wheatman/Packed-Memory-Array.git.
Each external library is compiled using the default configuration of g++-11 and Parlaylib (or PBBSlib, a precursor to Parlaylib, for Aspen) for parallelization. Each is implemented as a C++ library. We used the in-place set mode of P-trees and PaC-trees for a fair comparison (although the libraries also support a less efficient functional mode). The PaC-trees library block size is set to the default for sets at 256, which corresponds to a maximum node size of 4108 bytes. To initialize PaC-trees, we used the library-provided recursive build routine, which lays out the tree nodes non-contiguously in memory.
We also tested the Rewired PMA (RMA) [31] and compiled it with the default provided scripts, which use clang++-14. Since the RMA is serial, there is no parallelization framework.
All experiments were run on a 64-core 2-way hyper-threaded Intel® Xeon® Platinum 8375C CPU @ 2.90GHz with 256 GB of memory from AWS [7]. Across all the cores, the machine has 3 MiB of L1 cache, 80 MiB of L2 cache, and 108 MiB of L3 cache. All performance results are the average of 10 trials after a single warm-up trial.

Evaluation on microbenchmarks
We first evaluate the RMA, P-trees, and PaC-trees compared to the PMA/CPMA on a suite of microbenchmarks.
Experimental setup. We first evaluate batch-update throughput with 40-bit uniform random numbers. 40-bit numbers give a balance between the compression ratio and the number of duplicates. Uniform random is the worst case for compressed data structures because it maximizes the deltas between elements and therefore minimizes the compression ratio. Uniform random is also the worst case for batch inserts because it minimizes the amount of shared work between updates that the algorithm can eliminate. However, uniform random is the best case for redistributes in PMAs/CPMAs.
We also evaluate batch-update throughput by starting with 40-bit uniform random numbers and then adding elements according to a zipfian distribution. The zipfian distribution generates 34-bit numbers with skew parameter α = 0.99 (parameter taken from the YCSB [28]). For additional batch-insert experiments on skewed distributions, we test the data structures on a skewed RMAT distribution [27] in the graph-processing application benchmark at the end of this section.
We measure range-query performance of each data structure when it contains 100 million elements by performing 100,000 range queries in parallel. We varied the size of the range queries across experiments. We measure batch-insert performance by inserting 100 million elements in batches into a data structure that starts with 100 million elements. We varied the batch size across experiments. If the batch-insert performance was slower than the non-batched insert done in a loop, the non-batched insert number was reported.
To measure the space usage, we vary the number of elements and report the size.
Finally, we evaluate the serial batch-update algorithm from the Rewired PMA (RMA) [31] with the provided test code and build scripts. For a fair comparison, we ran the batch-update algorithm for PMAs from Section 4 on one core.
The RMA's provided tests use the numbers [1, 2, ..., n] sampled without replacement, where n is the total number of elements after the test. Although this is not exactly the same set of numbers as in our PMA experiments (with uniform random 40-bit numbers), the experiments are equivalent because both data structures are uncompressed, so only the ordering of the numbers matters.
Batch inserts on uniform random inputs. Figure 1 demonstrates that the throughput of parallel batch inserts in the CPMA is on average 3× faster than in compressed PaC-trees. Similarly, parallel batch inserts in the PMA achieve on average 1.5× faster throughput than in P-trees. The PMA's/CPMA's cache-friendliness enables it to support faster updates than the theory suggests. As mentioned in Section 3, PMAs (and by extension, CPMAs) support point updates in O((log²(N))/B + log(N)) work. Trees theoretically dominate PMAs for point updates: balanced binary trees support updates in O(1 + log(N)) work [29], and cache-friendly trees such as B-trees [13] support updates in O(1 + log_B(N)) work. However, in practice, batch updates in a PMA/CPMA are faster than batch updates in trees because the PMA/CPMA takes advantage of contiguous memory access.

Table 4. Serial batch insert throughput (inserts per second) of the uncompressed PMA and RMA as a function of batch size. We use point insertions for small batches when the batch update algorithm does not provide practical benefits.
Table 4 evaluates the batch-insert algorithm for uncompressed PMAs from Section 4 on one core compared to the existing serial batch-insert algorithm for RMAs, an optimized version of PMAs [31]. On average, the batch-insert algorithm in this paper is about 1.2× faster than the existing batch-insert algorithm for RMAs.
Batch inserts on skewed inputs. Just as in the uniform random case, the CPMA outperforms C-PaC on small batches and is slightly slower on large batches of skewed inserts.
The batch-parallel PMA is well-suited for the case of all insertions targeting the same leaf. In contrast, for non-batched PMAs, this is the worst case. The batch-insert PMA mitigates the worst case by (1) sharing the work of searches between inserts, reducing overall work, and (2) skipping levels of redistribution with larger batches, improving overall work and parallelism. Due to these factors, the PMA/CPMA achieves higher throughput on zipfian batch inserts compared to uniform random batch inserts, as can be seen in Table 5.
Batch deletes. On average, the PMA performs uniform random batch deletions 1.9× faster than uniform random batch insertions. Similarly, the CPMA achieves on average 1.5× higher throughput for uniform random batch deletions compared to uniform random batch insertions, as can be seen in Table 5. We see a similar trend for the zipfian distribution. Batch deletions are faster than batch insertions when the batch is large because deletes do not have to allocate temporary space, as they will never overflow the PMA leaves.
Range queries. Figure 2 shows that the CPMA supports range queries between 1.2× and 10× faster than compressed PaC-trees. Similarly, the PMA supports range queries between 8.9× and 27.4× faster than P-trees. The PMA/CPMA is faster to scan than compressed PaC-trees because the PMA's/CPMA's contiguous layout enables prefetching, while trees require pointer chasing between tree nodes. Furthermore, for small ranges, the PMA/CPMA are at least 4× faster due to the preexisting search-layout optimizations for PMAs, which are orthogonal to the optimizations in this paper [78].
Furthermore, the CPMA supports range queries 1.3× faster than the PMA on the largest range because the CPMA's smaller size enables it to fetch more elements before reaching memory bandwidth. However, the PMA is faster for small range queries because of the added overhead of decompression in the CPMA.
Space usage. Table 6 shows that CPMAs are similar in size to C-PaC and are over 2× smaller than uncompressed PMAs. The space savings of the compressed data structures improve with the number of elements because the distance between elements decreases as the number of elements increases. The CPMA uses more space than C-PaC for smaller inputs but less space than C-PaC when the input is sufficiently large (at least 100M elements) because the CPMA leaf size, which defines the ratio of uncompressed to compressed elements, grows with the number of elements. As an uncompressed data structure, P-trees take a fixed 32 bytes per element.

Evaluation on graph workloads
We use the CPMA as the basis for a dynamic-graph container called F-Graph and evaluate it on a suite of dynamic-graph workloads as an application benchmark for the CPMA. We first describe how F-Graph processes dynamic graphs with a single CPMA. Then we present the results of the benchmark for F-Graph, C-PaC, and Aspen.
F-Graph description. F-Graph is built on a single batch-parallel CPMA with delta compression and byte codes. It differs from traditional graph representations because it uses only a single array to store both the vertex and edge data.
To understand the distinction, consider the canonical Compressed Sparse Row (CSR) [72] representation. For unweighted graphs, CSR uses two arrays: an edge array to store the edges in sorted order (by source and then by destination), and a vertex array to store offsets into the edge array corresponding to the start of each vertex's neighbor list. The vertex array saves space: the edge array then only needs to store destinations and not sources.
In contrast, storing graphs in a CPMA takes only one array. Using a CPMA, F-Graph stores edges in 64-bit words by representing the source in the upper 32 bits and the destination in the lower 32 bits. The start of each vertex's neighbors is implicit and can be restored with a search into the underlying CPMA. The delta compression in the CPMA elides the source vertex in all edges except for the edges in the uncompressed PMA leaf heads and the first edge of each vertex. F-Graph supports batch updates and graph algorithms by adopting the popular approach of phasing updates and algorithms separately [8,25,26,40,45,48,57,62,63,65,71,75,85]. It supports batch updates with one writer and therefore does not use locks.
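A small sketch of this edge encoding (the helper names are ours): packing makes lexicographic (source, destination) order coincide with 64-bit integer order, and the first neighbor of a vertex v can be located by searching the CPMA for pack_edge(v, 0).

```cpp
#include <cstdint>
#include <cstdio>

// F-Graph's edge encoding as described above: source in the upper 32
// bits, destination in the lower 32 bits, so sorting edges as 64-bit
// integers groups them by source and then by destination.
static inline uint64_t pack_edge(uint32_t src, uint32_t dst) {
  return ((uint64_t)src << 32) | (uint64_t)dst;
}
static inline uint32_t edge_src(uint64_t e) { return (uint32_t)(e >> 32); }
static inline uint32_t edge_dst(uint64_t e) { return (uint32_t)(e & 0xffffffffu); }

int main() {
  uint64_t e = pack_edge(7, 42);
  // Vertex v's neighbor list starts at the successor of pack_edge(v, 0).
  std::printf("edge %llu: src %u, dst %u\n",
              (unsigned long long)e, edge_src(e), edge_dst(e));
}
```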
Finally, F-Graph currently supports unweighted graphs because the CPMA is currently a key store. F-Graph also currently supports algorithms on undirected graphs because it is built on a single CPMA, but it could easily be extended to support algorithms on directed graphs with two CPMAs: one for incoming edges and one for outgoing edges. Many graph algorithms (e.g., all the ones in this paper, among others) can be run with only the graph topology. Future work includes extending the CPMA to a key-value store, which would allow F-Graph to store weighted graphs.
The CPMA under F-Graph has a growing factor of 1.2×.
C-PaC and Aspen description. C-PaC and Aspen support dynamic-graph processing with compressed trees (one per vertex) and enable concurrent updates and graph algorithms without locking in functional mode. Since we are not concurrently performing updates and algorithms, we use C-PaC's and Aspen's in-place unweighted modes for a fair comparison.
Systems setup. All systems run the same algorithms via the Ligra interface, which is based on the VertexSubset/EdgeMap abstraction [66]. Therefore, all algorithms implemented with C-PaC and Aspen can be run on top of F-Graph with minor syntactic changes [34,35,68].

Datasets. Table 7 lists the graphs used in the evaluation and their sizes. We tested on real social network graphs and a synthetic graph. We used a few social network graphs of various sizes: the LiveJournal (LJ) [10], Community Orkut (CO) [87], Twitter (TW) [14], and Friendster (FS) [54] graphs. Additionally, we generated an Erdős-Rényi (ER) graph [43] with n = 10^7 and p = 5·10^-6.
Graph algorithms. We evaluate the performance of F-Graph, C-PaC, and Aspen on three fundamental graph algorithms: PageRank (PR) [23], connected components (CC), and single-source betweenness centrality (BC). Figure 9 presents the results of the evaluation, and the full version of the paper contains all of the data. The algorithms are from the Ligra framework [66].

Traversals in graph kernels can be organized on a continuum depending on how many long scans they contain, which depends on the order of vertices accessed. On one extreme, arbitrary-order algorithms such as PR access vertices in any order and can be cast as a straightforward pass through the data structure. On the other extreme, topology-order algorithms such as BC access vertices depending on the graph topology and are therefore more likely to incur cache misses by accessing a random vertex's neighbors. CC is in between arbitrary order and topology order because it starts with large scans in the beginning of the algorithm, but it converges to smaller scans as fewer vertices remain under consideration.
Systems with a flat layout such as F-Graph have an advantage when the algorithm is closer to arbitrary order: they support fast scans of neighbors because all of the data is stored contiguously. For example, F-Graph is 1.5× faster than C-PaC on average on PR. In contrast, tree-based systems such as C-PaC incur more cache misses during large scans due to pointer chasing.
Since F-Graph uses a single edge array in its flat layout, it must incur a fixed cost to reconstruct the vertex array of offsets in all algorithms besides PR (because PR accesses all of the edges in each iteration). The relative cost of building the vertex array in F-Graph compared to the cost of the algorithm depends on the amount of other work in the algorithm. For example, building the vertex array in F-Graph takes about 10% of the total time in BC. The relative cost of building the offset array also depends on the average degree: a higher average degree corresponds to a smaller overhead compared to the cost of the algorithm. Finally, although this experiment rebuilds the vertex array with each run of the algorithm, the vertex array could be reused across computations (e.g., from different sources) if there have been no updates.
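The following runnable toy sketches the offset-array reconstruction on a plain sorted array of packed edges. F-Graph does this with searches into the CPMA; here a single linear pass suffices to convey the idea (the names are ours).

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Rebuild a CSR-style offset array from sorted packed edges (upper 32
// bits = source). offsets[v] is the index of v's first edge, so v's
// degree is offsets[v+1] - offsets[v].
std::vector<std::size_t> build_offsets(const std::vector<uint64_t>& edges,
                                       uint32_t num_vertices) {
  std::vector<std::size_t> offsets(num_vertices + 1, edges.size());
  uint32_t v = 0;
  for (std::size_t i = 0; i < edges.size(); ++i) {
    uint32_t src = (uint32_t)(edges[i] >> 32);
    while (v <= src) offsets[v++] = i;  // first edge of every vertex up to src
  }
  while (v <= num_vertices) offsets[v++] = edges.size();  // trailing empty vertices
  return offsets;
}

int main() {
  auto pack = [](uint32_t s, uint32_t d) { return ((uint64_t)s << 32) | d; };
  std::vector<uint64_t> edges = {pack(0, 1), pack(0, 2), pack(2, 0), pack(2, 3)};
  auto off = build_offsets(edges, 4);
  for (uint32_t v = 0; v < 4; ++v)
    std::printf("vertex %u: degree %zu\n", v, off[v + 1] - off[v]);
}
```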
Update throughput. F-Graph does not sacrifice updatability for its improved algorithm speed: on average, F-Graph is 2× faster than C-PaC and Aspen on batch inserts. Figure 10 shows that F-Graph achieves faster updates than C-PaC and Aspen despite the theoretical dominance of trees over PMAs in terms of point and batch updates.
To evaluate insertion throughput, we first insert all edges from the FS graph (the largest graph we tested on). We then add a new batch of directed edges (with potential duplicates) to the existing graph in both systems. To generate edges for inserts, we sample directed edges from an RMAT generator [27] (with a = 0.5, b = c = 0.1, and d = 0.3 to match the distribution from the PaC-tree paper [33]).
We note that the distribution of inserts here differs from the one in Section 6. We see that even with a skewed distribution, which is traditionally challenging for PMAs, the batch-parallel CPMA achieves good insert throughput due to the work sharing in the batch-insert algorithm.
Space usage. Finally, we consider the space usage of F-Graph, C-PaC, and Aspen. Table 7 shows that F-Graph uses marginally less space than C-PaC and about 0.6× the space that Aspen uses because F-Graph collocates small neighbor sets by using only one array to store all of the data (rather than two levels of trees for vertices and edges in C-PaC and Aspen).

Conclusion
This paper optimizes traditional PMAs with parallel batch updates and data compression. On average, the compressed PMA (CPMA) outperforms compressed trees (PaC-trees) by 3× on parallel batch updates and 4× on range queries due to the CPMA's cache-friendliness. The CPMA uses similar space compared to compressed PaC-trees and uses 2-3× less space compared to uncompressed representations. Compression enables the CPMA to scale better with the number of cores compared to the PMA because its smaller size mitigates memory bandwidth issues with reduced memory traffic.
To further demonstrate the real-world applicability of the CPMA, we introduce F-Graph, a dynamic-graph-processing system built on a single CPMA, and compare it to C-PaC, a state-of-the-art dynamic-graph-processing system built on compressed PaC-trees. We found that F-Graph is 1.2× faster on graph algorithms, 2× faster on batch updates, and slightly smaller when compared to C-PaC.
The empirical advantage of the CPMA over compressed PaC-trees demonstrates the importance of optimizing parallel data structures for the memory subsystem. Specifically, the CPMA's array-based layout enables it to take advantage of the speed of contiguous memory accesses. Despite the theoretical prediction, the batch-parallel CPMA empirically overcomes the update/scan tradeoff with compressed PaC-trees due to its locality.

A Artifact

To run PAM/CPAM, you will also need jemalloc.
To make the plots, you will need python with matplotlib.
The test machine should have multiple threads but does not necessarily need 128 threads. In terms of memory, the known minimum necessary to run the graph evaluation is 118 GB. This amount of memory is needed to run on the largest graph we tested (Friendster).
The code should compile and run on non-x86 machines, but the performance was only tested on the machine above.
Get the code.To get the code via git, clone the repo, go to the for_artifact branch, and set up the submodules:

B Data tables
This section contains the data used to generate the plots in Sections 1, 5, and 6. The growing factor in the PMA/CPMA in the microbenchmarks is 1.2×. The growing factor is the amount by which the underlying array in the PMA grows when it becomes too dense. The asymptotic bounds of the PMA still hold as long as the growing factor is a constant greater than 1.
We chose the growing factor based on the microbenchmarks in Section C.

Table 13. Parallel batch insertion throughput (inserts per second) for inserts from a zipfian distribution on all cores in P-trees, PaC-trees, and the PMA/CPMA.

Table 15. Parallel batch insertion throughput (inserts per second) on all cores in Aspen, C-PaC, and F-Graph. The base graph is the FS graph. The new insertions are sampled from the RMAT distribution.

Section 1
Table 9 contains the data used to generate Figure 1, and Table 10 contains the data used to generate Figure 2. Table 1 reports the cache misses of each data structure mentioned in Section 1.

Section 5
Table 11 contains the data for Figure 7, and Table 12 contains the data for Figure 8.

Section 6
Batch inserts with zipfian distribution.
Figure 11 and Table 13 contain the data for zipfian batch inserts.

Evaluation on graph workloads.
Table 14 contains the data for Figure 9, and Table 15 contains the data for Figure 10.

C Growing factor sensitivity
We evaluate how the space usage, batch-insertion throughput, and scan throughput of the CPMA change with the growing factor.
To measure the effect of growing factor on the CPMA, we performed the following experiment on CPMAs with growing factors 1.1×, 1.2×, ..., 2.0×. We started with an empty CPMA and added 1 billion elements in parallel batches of 1 million elements each (for a total of 1,000 batches). After each batch, we measured the space usage of the CPMA and performed a parallel scan over all of the elements. We also measured the batch-insertion throughput as a function of growing factor.
Figure 12 demonstrates that a smaller growing factor results in smaller average space usage and therefore better average scan performance because smaller sizes require less memory traffic. Figure 13 shows that the growing factor bounds the worst-case space usage of the CPMA: a CPMA with a higher growing factor has a higher worst-case space usage. However, the exact space usage and scan performance depend not only on the growing factor but also on the state of the CPMA (i.e., how far it is from a growth).
Moreover, the relationship between insert time and growing factor is not as straightforward as the relationship between size/scan and growing factor. Figure 12 shows that the CPMA with growing factor 1.5× achieves the best insertion throughput. Small growing factors (e.g., 1.1×) increase the number of array copies since the CPMA incurs more growths, but they also decrease the size of the array, which improves other parts of inserts such as the binary search and rebalances. On the other hand, large growing factors (e.g., 2×) incur fewer array copies but longer searches, which contribute to more expensive inserts.

Figure 1. Insert throughput as a function of batch size.

Figure 2. Range query throughput as a function of range size.

Figure 3. Example of an insertion in a PMA with leaf density bound of 0.9 and leaf size of 4.

Figure 4. Example of batch insertion in a PMA with leaf density bound of 0.9 and leaf size of 4. After the merge, there are more elements in the second leaf than the leaf size, so the number of elements is stored in the leaf, and the elements are stored out-of-place until the redistribute.

Figure 5. An example of the work-efficient counting algorithm for batch updates. The blocks at the top represent the PMA leaves, and the dots represent elements in the PMA. The pink blocks with arrows represent leaves that were touched during a batch update. The tree below the PMA is the implicit PMA tree of nodes, labeled with a tuple of (height, index) (indices are assigned left to right). The blue solid circles represent PMA nodes that must be counted because their sibling or child violated its density bound. The tan dotted circles represent PMA nodes that did not need to be counted.

Figure 6. An example of inserting the same element in a PMA and CPMA with the same elements. The density bound in all leaves is 0.9. Here, sizeof(T) is 8 bits, and a byte is 4 bits. The blue bits in the CPMA represent continue bits. The green shaded cells in both the PMA and CPMA contain new data after the insert. The PMA redistributes its elements after the insertion, but the CPMA does not because the insertion did not violate the leaf density bound.

Figure 7. Scalability of batch inserts in the PMA/CPMA. We use 64 to denote all physical cores and 64h to denote all 128 hyperthreads.

Figure 8. Scalability of range queries in the PMA/CPMA. We use 64 to denote all physical cores and 64h to denote all 128 hyperthreads.

Figure 11. Insert throughput as a function of batch size with batches generated from a zipfian distribution.

Figure 12. Effect of growing factor on performance and size.

Figure 13. Size (bytes) and scan time (ns) per element after each batch insertion in CPMAs with different growing factors.

Table 3. Throughput (TP) of serial and parallel batch insertions in the PMA. We use point insertions for small batches when the batch update algorithm does not provide practical benefits. Overall speedup is the speedup over serial point inserts.

Table 5. Parallel batch inserts and deletes (updates per second) for uniform and zipfian distributions for the PMA and CPMA.

Table 6. Bytes per element in each of the data structures and compression ratios. The sizeof(T) is 8 bytes.

Table 7. Graph sizes (N = number of vertices, M = number of edges, all in millions) and the memory used to store the graphs in all of the systems in gigabytes. A number below 1 in the F/C or F/A column means that F-Graph was smaller.

Table 8. Experiments in the paper and their associated scripts:
• CPMA uniform batch inserts (serial): run-table-2.sh (assuming you did the parallel ones via run-fig-1.sh)
• PMA/CPMA batch insert scalability (strong scaling): run-serial-fig-7.sh, run-parallel-fig-7.sh
• PMA/CPMA range query scalability (strong scaling): run-serial-fig-8.sh, run-parallel-fig-8.sh

This section summarizes how to download and use the code. The full details (including how to compile the original binaries and reproduce the experiments in the paper) can be found in a pdf called "CPMA artifact readme" in the top-level directory of both the git repo and the Zenodo record (at https://zenodo.org/records/10222939).

Machine specs. Please use a machine with preinstalled g++ (at least version 11) and git. We have tested the artifact on an Amazon c6i.metal instance (running Ubuntu 20.04 with 128 threads and 256 GB of memory) and g++ 11.4.

Table 9. Parallel batch insertion throughput (inserts per second) on all cores in P-trees, PaC-trees, and the PMA/CPMA.

Table 10. Range query throughput (elements per second) on all cores in P-trees, PaC-trees, and the PMA/CPMA.

Table 11. Batch insert scalability as a function of the number of cores. We use 64 to denote all physical cores and 64h to denote all 128 hyperthreads.

Table 12. Range query scalability as a function of the number of cores. We use 64 to denote all physical cores and 64h to denote all 128 hyperthreads.

Table 14. Running times (seconds) of Aspen, C-PaC, and F-Graph on PR, CC, and BC with all (64) threads. A number above 1 in the ratio columns means that F-Graph was faster.