PIM-trie: A Skew-resistant Trie for Processing-in-Memory

Memory latency and bandwidth are significant bottlenecks in designing in-memory indexes. Processing-in-memory (PIM), an emerging hardware design approach, alleviates this problem by embedding processors in memory modules, enabling low-latency memory access whose aggregated bandwidth scales linearly with the number of PIM modules. Despite recent work on balanced comparison-based indexes for PIM systems, building efficient tries for PIMs remains an open challenge due to tries' inherently unbalanced shape. This paper presents the PIM-trie, the first batch-parallel radix-based index for PIM systems that provides load balance and low communication under adversary-controlled workloads. We introduce trie matching, i.e., matching the query trie of a batch against the compressed data trie, as a key building block for PIM-friendly index operations. Our algorithm combines (i) hash-based comparisons for coarse-grained work distribution/elimination and (ii) bit-by-bit comparisons for fine-grained matching. Combined with other techniques (meta-block decomposition, selective recursive replication, differentiated verification), PIM-trie supports LongestCommonPrefix, Insert, and Delete in O(log P) communication rounds per batch and O(l/w) communication volume per string, where P is the number of PIM modules, l is the string length in bits, and w is the machine word size. Moreover, work and communication are load-balanced among modules with high probability (whp), even under worst-case skew.


INTRODUCTION
As data-intensive applications have become increasingly prominent in the past decades, the ever-widening gap between computation speed and memory access speed has made data movement the dominant cost and main bottleneck in modern systems. This problem, often referred to as the memory wall [65], can now be potentially solved by emerging processing-in-memory technologies [42].
Processing-in-memory (PIM), a.k.a. near-data-processing (NDP), enables computation to be executed on computation units embedded in memory modules. Instead of fetching data through the memory and cache hierarchy to the CPU as in traditional von Neumann systems, PIM pushes computation to the memory modules, reducing the energy consumption of data movement and leveraging the performance gains arising from having aggregated memory bandwidth that scales linearly with the number of modules.
Although the idea of PIM dates back to the 1970s [56], it is now regaining research attention due to the development of 3D-stacked memory [28], which enables the production of real PIM systems. Hundreds of academic works on PIM architecture design have been published (see the references of [5,32,42]), and real-world commercialized PIM systems have also been launched [34,58].
Radix-based indexes (radix trees, tries) are important search structures introduced in textbooks [48] and widely used in the Linux kernel [45,46], in-memory storage [68], and IP routing [54,63]. They are the only family of search structures designed to inherently support variable-length keys. Moreover, they are often faster in practice than other search structures on shared-memory machines [6,16,36]. A recent convincing example is from SetBench [4], the most efficient open-source implementation of comparison-based trees. In their latest evaluation [55], the radix-based ART-OLC [37] outperforms all other comparison-based structures in most cases.

[Table 1 caption: here l is the bit-string length of the given operation, and s is the span of the radix tree (i.e., the fanout is 2^s). (#) denotes that the structure supports only fixed-sized strings with l = O(w) bits, in which case l/w = O(1). A Subtree query returns a trie (with parameters as defined in Table 2). (*) denotes a whp bound (with high probability in n). (†) denotes an amortized bound.]
Nevertheless, because the shape of a radix-based index depends on the set of strings being stored, it can be highly imbalanced, with height up to the length of the longest string. This creates two main challenges for adapting existing radix trees or tries to the PIM setting: (C1) how to map their nodes/edges to PIM modules in a way that achieves good load balance across the modules while minimizing the communication required to answer queries (as noted in [29], there is an inherent tension for PIM between load balance and low communication), and (C2) how to avoid serial bottlenecks when dealing with long strings. Table 1 (first two rows) summarizes the performance of taking traditional radix trees and x-fast tries and randomly hashing them to PIM modules, in terms of the space for each data structure, the number of communication rounds (IO rounds), and the total communication, for various operations (LCP, Insert/Delete, Subtree); full details are in Sections 2 and 3. The family of fast tries (x-fast tries [62], y-fast tries [62], z-fast tries [8]) is explicitly designed to address challenge C2, but x-fast tries and y-fast tries support only fixed-length keys, and how to address challenge C1 for z-fast tries is still an open problem.
To address these challenges, we present PIM-trie, a skew-resistant batch-parallel radix-based tree with good asymptotic guarantees in the PIM Model (Table 1, third row). It is not only the first radix-based index designed for PIM systems, but also the first radix tree that asymptotically benefits from batch-parallel processing under worst-case data and query skew. It builds upon the idea, from z-fast tries, of combining a radix tree with a hash table, addressing challenge C1 among others.
We introduce trie matching as the core idea in our algorithm design, which exploits the benefits of batch-parallel processing. The set of strings in the index is stored on PIM modules as a data trie, a hybrid of a radix tree and hash values, to further exploit PIM parallelism. A query trie is constructed from the batched operations, and the matched trie information between this query trie and the data trie is vital to all operations. The tree component of the data trie is decomposed into blocks. Each block is a sub-trie stored on the same PIM module, while distributed meta-blocks are used to organize the block metadata. We apply selective recursive replication of meta-blocks into child meta-blocks to ensure load balance and to further reduce communication without increasing the space bound.
To resolve potential false-positive query results due to hash collisions, we introduce a verification procedure. It does not increase our asymptotic bounds, thanks to our differentiated handling of critical blocks, whose number we prove to be bounded, and non-critical blocks, which are handled via their attached last bytes.
In summary, our PIM-trie supports batch-parallel operations (LongestCommonPrefix, Subtree Query, Insert, and Delete) on variable-length bit strings. These operations have efficient bounds in the PIM Model even under adversarial query and data skew, outperforming PIM-based radix trees and fast tries. The paper is organized as follows. Section 2 reviews the PIM Model and other preliminaries. In Section 3, we discuss the most closely related work, identifying the building blocks we adopt and addressing their drawbacks in the setting of PIM-friendly radix-based trees. Section 4 describes the key techniques of PIM-trie and their analyses in the PIM Model. Section 5 introduces the procedures of the operations supported by PIM-trie and their asymptotic bounds. Section 6 presents conclusions.

PIM MODEL
We use the Processing-in-Memory Model (PIM Model) [29] as the theoretical abstraction of PIM systems. Prior work [30] has shown experimentally that the PIM Model is a good match for the well-studied commercial PIM system from UPMEM, which is an example of a class of PIM systems referred to as bank-level in-memory processing (BLIMP) [20]. We believe that the PIM Model is a good match for BLIMP systems and other near-data-processing (NDP) systems, although not a good match for processing-using-memory systems [42,49] (because such systems can perform only a few operations in the memory module, not arbitrary code).
The PIM Model consists of a host CPU side and P PIM modules (the PIM side). The CPU side is a multicore processor with a shared on-chip cache of M words. Each PIM module combines a small memory of Θ(n/P) words (where n denotes the problem size), called the PIM memory, and a weak but general-purpose compute unit called the PIM processor. The host CPU can load programs to PIM modules, launch them, and detect their completion. The host CPU can access both its on-chip cache and the local memory of PIM modules, but each PIM module can access only its own PIM memory. The host CPU communicates with PIM modules by directly reading/writing their respective local memories in parallel. The model assumes that programs run in BSP-like synchronous rounds [59], where at each round, the CPU side can (1) perform local computations, (2) write a buffer of data to each PIM module's local memory, (3) launch PIM programs and wait for their completion, and (4) read a buffer of data from each PIM module's local memory.
To analyze algorithms with both CPU and PIM sides, the model combines both shared-memory and distributed metrics. For local computations on the multicore CPU, it assumes a binary forking model [13] with a work-stealing scheduler [15] and measures the CPU work (total number of instructions of all cores) and CPU depth (work on the critical path). For CPU-PIM communication, it measures the number of IO rounds and the IO time, which is the maximum number of word-sized messages to/from any PIM module. For PIM programs, it measures the PIM time, which is the maximum work on any one PIM processor. For algorithms with multiple rounds, the maximums are derived separately for each round, and summed across rounds. Because both IO time and PIM time consider the maximum across all PIM modules, it is critical to design algorithms that ensure good load balance among PIM modules, even under adversarially chosen (skewed) workloads.

Definition 1 ([29]). An algorithm is PIM-balanced if it takes O(W/P) PIM time and O(C/P) IO time, where W and C are, respectively, the sums of PIM work and communication across all PIM modules.
In other words, each PIM module asymptotically performs an equal fraction of the total work and total amount of communication.
In this paper, we will bound the total communication and prove PIM-balance whp, thereby bounding the IO time. We frequently use the following weighted balls-into-bins lemma to prove PIM-balance.

Lemma 2.1 ([29,47]). Placing weighted balls with total weight W, each of weight less than W/(P log P), into P bins uniformly at random yields O(W/P) weight in each bin whp.
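As an illustration of the lemma (not part of any algorithm in the paper), a quick Monte Carlo experiment shows the balance it promises; all names here are hypothetical:

```python
import random

def max_over_avg_load(weights, bins, seed=0):
    """Throw weighted balls into `bins` bins uniformly at random and
    return the ratio of the maximum bin weight to the average bin weight."""
    rng = random.Random(seed)
    loads = [0.0] * bins
    for w in weights:
        loads[rng.randrange(bins)] += w
    return max(loads) / (sum(weights) / bins)

# 100k unit-weight balls into 16 bins: with no ball dominating the total
# weight, the heaviest bin stays within a small factor of the average.
ratio = max_over_avg_load([1.0] * 100_000, 16)
```

The hypothesis that no single ball weighs more than W/(P log P) is essential: a single giant ball lands in one bin and makes balance impossible.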

Tries and Variants
Trie. Tries are tree structures that store key-value pairs with bit-string keys. All descendants of an inner node in the tree share a common prefix formed by the path to this node. When searching through a trie, an inner node decides which child to traverse based on the query key. A binary trie, for instance, determines whether to go to the left or right child based on each bit of the query key. Binary tries do not necessarily perform well, since their height equals the key length, l, in bits. Patricia tries [41] introduce the path compression technique, which reduces the tree height by omitting nodes with only one child, thus compressing the paths.
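A minimal sketch of what path compression retains, using bit strings as keys (names are ours, purely illustrative):

```python
def patricia_nodes(keys):
    """Return the set of bit-string prefixes that survive path
    compression: branching prefixes and the keys themselves."""
    prefixes = {p for k in keys for p in (k[:i] for i in range(len(k) + 1))}
    surviving = set()
    for p in prefixes:
        children = {c for c in ("0", "1") if any(k.startswith(p + c) for k in keys)}
        if len(children) == 2 or p in keys:
            surviving.add(p)
    return surviving

keys = {"10100", "10111", "0110"}
# A plain binary trie materializes every prefix of every key;
# path compression keeps only branching nodes and key endpoints.
all_prefixes = {p for k in keys for p in (k[:i] for i in range(len(k) + 1))}
compressed = patricia_nodes(keys)
```

Here the uncompressed trie has 12 nodes, while the Patricia trie keeps only 5: the root, the branching prefix "101", and the three keys.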
Radix Tree. Radix trees (or compressed tries) allow each node to represent s bits of a key instead of just one. They also use path compression. Each inner node can support an array of 2^s child pointers, and an s-bit chunk of the key is used to index into each inner node when querying. The parameter s is called the span. The tree height is reduced to at most l/s. However, the 2^s-sized child array is often not fully utilized in practice, causing space inefficiency.
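The per-node child lookup for a span of s bits is a simple shift-and-mask over the key; a minimal sketch assuming fixed-length integer keys (the function name is ours):

```python
def child_index(key: int, depth: int, span: int, key_len: int) -> int:
    """Return the span-bit chunk of `key` (an integer of `key_len` bits,
    most significant bit first) that indexes the 2**span-sized child
    array of the inner node at the given depth."""
    shift = key_len - (depth + 1) * span
    return (key >> shift) & ((1 << span) - 1)

# 8-bit key 0b1011_0100 with span 4: the root consumes chunk 0b1011,
# its child consumes chunk 0b0100.
```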
Adaptive radix trees (ARTs) [36] resolve this problem by dynamically adapting the node structure with a variable array size based on the number of children, enabling both a large span and good space efficiency. The worst-case space overhead is proven to be a constant number of bytes per key-value pair, for arbitrarily long keys.
Height optimized tries (HOTs) [10,11] introduce another solution that dynamically varies the number of bits considered at each node and introduces compound nodes, which enable a consistently high fanout and thereby good cache efficiency. However, there is no nontrivial bound on the height of a compound tree.
Fast Trie. Fast tries are another family of tries that leverage hashing to achieve logarithmic query costs relative to key length. The earliest work, on x-fast tries [62], constructs hash tables on each level of the original binary trie. Meanwhile, each inner node in the trie that does not have a left (right) child stores a pointer to its predecessor (successor) leaf node. When querying, the x-fast trie carries out a binary search on the query string to find whether a prefix string exists in the hash table on the corresponding level. For a string of length l, this binary search answers the longest-prefix as well as predecessor/successor queries in O(log l). However, the x-fast trie takes O(nl) space as well as O(l) update cost. The y-fast trie [62] was designed to reduce these costs. Buckets of comparison-based indexes with size Θ(l) are constructed near the leaves, and x-fast tries are used only in the top levels to index into the O(n/l) buckets. In this way, y-fast tries achieve O(n) space and O(log l) update cost, while keeping O(log l) query cost, for fixed-length keys.
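The level-wise binary search of x-fast tries can be sketched as follows; the per-level hash sets and names here are illustrative, not the paper's implementation. The key observation making binary search valid is monotonicity: if a length-m prefix of the query is stored, so is every shorter prefix.

```python
def longest_prefix_length(key: str, levels) -> int:
    """x-fast-style lookup: levels[i] is a hash set of all length-i
    prefixes stored in the trie. Binary search over prefix lengths
    finds the longest stored prefix of `key` in O(log l) probes."""
    lo, hi = 0, min(len(key), len(levels) - 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if key[:mid] in levels[mid]:
            lo = mid          # prefix exists: try longer prefixes
        else:
            hi = mid - 1      # prefix absent: try shorter prefixes
    return lo

# Build one hash set per level from a small key set.
keys = {"10100", "0110"}
maxlen = max(map(len, keys))
levels = [set() for _ in range(maxlen + 1)]
for k in keys:
    for i in range(len(k) + 1):
        levels[i].add(k[:i])
```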
The dynamic z-fast trie [8] was proposed to support arbitrary-length key strings. For machine word size w, it can handle strings of length up to 2^w, and queries/updates of a key x are supported in O(|x|/w + log |x|) time and O(|x|/w + log |x|) I/Os (in the cache-oblivious model [21]). The key mechanism is fat binary search [7].

PIM-Friendly Indexes
Most indexes today are bottlenecked by memory bandwidth. Previous works try to overcome this bottleneck with the large aggregated memory bandwidth of PIM modules. However, the design of the index needs to be reconsidered to make effective use of many independent PIM modules. Two key challenges must be considered: (1) the algorithm should have low CPU-PIM communication, otherwise it will hit the memory wall like traditional algorithms for non-PIM indexes, and (2) the algorithm should have balanced communication, work, and space requirements across PIM modules, otherwise any stragglers will slow down the whole system. We divide previous PIM-friendly indexes into two categories, as follows.
Range-partitioned Indexes. Some prior works utilize PIM by using range partitioning [18,19,40]. The key space is divided into disjoint key ranges using a small set of separator keys that fit into the host CPU cache. Elements in the key ranges are then partitioned among PIM modules. The separators are managed by the CPU.
This type of index achieves low CPU-PIM communication (constant per element for both point and range operations) because, after local CPU lookups, the operations are sent directly to and executed by the corresponding PIM module. A limitation of this solution, however, is load imbalance among PIM modules under skewed workloads. In the worst case, all queries target the key range of a single PIM module, serializing the entire batch of queries.
Skew-resistant Indexes. To solve this problem, two prior approaches focus on the load imbalance issue under skewed workloads [29,30]. Both solutions build a batch-parallel PIM-optimized skiplist for integer keys that supports operations with both exact keys (get, update, insert, delete) and inexact keys (predecessor, range query). They first randomly distribute skiplist nodes to PIM modules for skew resistance, then horizontally divide the index into multiple layers and use different replication policies in each layer to reduce communication. The PIM-tree data structure [30], for example, is a three-layer comparison-based index for PIMs. It uses full replication in the top layer, a partial selective replication method called shadow subtrees in the middle layer, and purely distributed storage in the bottom layer. This solution is not readily applicable to tries, however, because tries can be arbitrarily unbalanced.

Building Blocks
Although the approach for PIM-trees [30] cannot directly be used for tries, two key ideas, push-pull search and selective replication, can be adapted, so we describe them here in more detail.
Push-Pull Search. Push-Pull Search is introduced to avoid imbalance among PIM modules, as randomly distributing nodes in a tree does not guarantee load balance without further design. In traditional non-PIM skip lists, point queries are executed by pointer chasing from root to leaf, forming a search path. A straightforward but inefficient query algorithm for PIM is to arbitrarily distribute the nodes among the PIM modules and visit them one by one along the search path by remote accesses. In addition to being communication inefficient, this approach suffers from load imbalance in skewed workloads: many queries can share nodes on their search paths, causing imbalance across PIM modules. For example, predecessor queries with different keys but the same answer will have exactly the same search path, causing contention on every node.
In Push-Pull search, this straightforward algorithm is called the Push method. The Pull method is then introduced to alleviate the load imbalance: when the number of queries to the same node exceeds a fixed number, the node is fetched to the CPU side and comparisons are executed on the CPU side instead. The combination of the Push and Pull methods guarantees load balance for any workload, although by itself this approach does not reduce communication.
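A minimal sketch of the push/pull decision, assuming a simple fixed threshold (the names and threshold policy are ours, simplified from [30]):

```python
from collections import Counter

def plan_push_pull(query_targets, threshold):
    """Decide, per target node, whether to push the queries to the node's
    PIM module or pull the node to the CPU. Returns (pushed, pulled):
    `pushed` maps a node to its query count, `pulled` is the set of hot
    nodes fetched to the CPU side."""
    counts = Counter(query_targets)
    pulled = {node for node, c in counts.items() if c > threshold}
    pushed = {node: c for node, c in counts.items() if c <= threshold}
    return pushed, pulled
```

For a skewed batch where one node is hit by most queries, that node is pulled once to the CPU instead of serializing all its queries on one PIM module:

```python
pushed, pulled = plan_push_pull(["a"] * 10 + ["b"] * 2, threshold=4)
# node "a" is pulled to the CPU; the two queries to "b" are pushed
```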
Selective Replication. Selective replication is a replication strategy used in PIM-based balanced search trees, guaranteeing load balance and constant communication per tree search. For each of the selected randomly-distributed inner nodes in a PIM-based tree, its entire subtree is replicated and stored on the same PIM module as the node. When a search query reaches this inner node, all further searching downwards can be carried out locally on the same PIM module, avoiding any further pointer chasing across modules. Selective replication is applicable only when the subtree(s) can fit into the local memory of a single PIM module. The primary benefit of selective replication is that it guarantees load balance when combined with Push-Pull search. When there are few searches to a subtree, these queries are pushed through the selective replicas and only constant communication is needed per search. When the number of searches on a subtree is large enough to cause load imbalance if all were executed by the same replica, the node on the current level is pulled to the CPU for the searches to proceed. Selective replicas are used on the nodes of the next layer if the search distribution is relatively balanced; otherwise nodes are recursively pulled. This pulling incurs only amortized constant communication per query.
Applying selective replication to all nodes in the bottom log n levels of a balanced tree would increase by a Θ(log n) factor both (i) the space and (ii) the work required to update the replicas under insertions/deletions. Instead, PIM-trees do not replicate their bottom Θ(log log n) levels. This guarantees linear space, no asymptotic increase in insertion/deletion costs, and O(log log n) IO rounds.

Limitations of Prior Work
We cannot simply use algorithms from prior PIM-friendly indexes to obtain ideal tries. Range-partitioned indexes are not suitable for skewed workloads because they suffer from severe load imbalance. Techniques from prior skew-resistant indexes cannot be applied to tries directly, because they require a balanced tree with (i) limited height and (ii) a constant-factor decrease in size from the bottom-level leaves to the top-level root. Specifically, as tries are not balanced, they cannot be horizontally divided into layers with decreasing size bounds. Directly applying the PIM-tree replication strategy can cause a factor of Θ(l) (or even Θ(n)) space amplification.
Deploying prior radix-based indexes on a PIM system is also non-trivial. Table 1 illustrates the CPU-PIM communication bounds for queries under different approaches, showing that simple transformations of traditional indexes fail to reach competitive bounds. The first approach is to build a PIM-friendly radix tree by distributing tree nodes uniformly at random to PIM modules, in order to mitigate load imbalance. The CPU-PIM communication required by this approach, however, is no smaller than the CPU-memory communication of traditional in-memory radix trees, providing no benefit from having PIM modules. Moreover, querying a string of l bits can take O(l/s) words of communication. This bound is worse than ours because the radix span s must be several times smaller than w, since inner nodes support 2^s child pointers. The number of IO rounds is also higher. The second approach is to adapt x-fast tries to PIM systems by using PIM hash tables [30], but such tries can support only integers of l = O(w) bits and require O(l) space (in words) per integer. We could reduce the space consumption to O(l/w) words if an efficient distribution strategy were devised for the buckets in y-fast tries, but y-fast tries still support only integer keys. The z-fast trie algorithm is a serial algorithm with good bounds in both work and space, obtained by combining a radix tree and a hash table. We are motivated by this combination, and observe that a hash table can act as a load distributor that parallelizes the workload to utilize PIM modules; this use does not exist in serial z-fast tries. Also, all these algorithms have poor IO-round bounds for Subtree queries.

OUR APPROACHES
Overview. PIM-tries are batch-parallel, skew-resistant, PIM-friendly binary radix trees supporting LongestCommonPrefix (LCP) queries, Insert, Delete, and Subtree Query for arbitrary-length bit-string keys. Being a batch-parallel algorithm, PIM-tries take a batch of same-type operations as input and execute them in parallel, as in [30,50]. Minimum batch sizes are required for load balance, and batch sizes are assumed to be O(M) so that a batch fits in the CPU cache. Motivated by radix trees, fast tries, and skew-resistant indexes, we design PIM-tries with three key optimizations: (1) a new execution layout that avoids replicated computation within the input batch: we build a query trie upon the queried keys and then perform a parallel lookup for the entire query trie; (2) hash value comparisons for parallelism, load distribution, and work elimination; and (3) a hash value manager supporting effective hash value comparison through a recursive decomposition strategy over dynamic tries.
The hash value comparison in PIM-tries may generate false positives in the case of hash collisions. The collision probability can be reduced by using more bits while hashing; we also introduce a verification process to eliminate false positives.
Basic Structures and Terminology. Before introducing PIM-tries, we clarify the underlying trie structure and the terminology used. We use a binary radix tree (i.e., a binary compressed trie) as the basic structure in this paper. For simplicity, the term "trie" refers to a binary radix tree rather than its standard definition, an uncompressed k-ary search tree. For a standard radix tree storing n (key, value) pairs, path compression omits all nodes with only one child except those that are the end of a key, leaving only O(n) nodes and edges in total. We call these nodes compressed nodes and these edges compressed edges, as they remain after path compression. A compressed node either has two children, or is the endpoint of a stored key, or both. Compressed nodes do not represent all prefixes stored in a radix tree, and those not included are also valid prefixes required for LCP, Insert, etc. We introduce hidden nodes to represent these implicit prefixes. There can be multiple hidden nodes on each compressed edge. Combining both types of nodes, we have a bijection between all nodes and all valid prefixes. The node depth of a node is the length of its represented string (in bits). The word "node" refers to both types of nodes unless explicitly specified.
All compressed nodes/edges physically exist in PIM-tries and are each referred to by a (PIM module ID, local memory address) pair (called a PIM address). Every node has pointers to all its adjacent edges, and vice versa. Hidden nodes do not exist physically, so we refer to them by pairing the address of their host edge with their position on the edge (in bits). The node representing the key of a (key, value) pair holds the value (assumed to take O(1) words) locally.
Although nodes and edges are distributed between the CPU and PIM modules, pointers between them are never remote pointers, because PIM-tries always store a trie either at the CPU or at a single PIM module. A PIM-trie achieves this by decomposing itself into a collection of unconnected small tries called blocks and distributing data at block granularity. Every trie with n strings stores all its compressed nodes, its compressed edges, and an array of pointers to all compressed nodes, to enable efficient parallel tree operations. For example, the treefix operations [53], including rootfix and leaffix operations, can be executed in O(n) work and O(log n) depth whp. This array also enables efficient decomposition: given a set of partition nodes, we can generate stand-alone tries, where each node contains the ID of its corresponding node in the original trie, in O(n) work and O(log n) depth whp. We use the parameter L to denote the total number of trie nodes, which is also the aggregated length of all edges in bits. The space consumption for a trie is O(L/w + n) words. PIM-tries use hashing as a key technique. The term node hash denotes the hash value of the string represented by a node. Hash values are stored in hash tables [24] with linear space and, whp, O(b) work (and O(log* b) PRAM depth or O(log b) binary-forking depth) for batched lookups, inserts, and deletes with batch size b.

Query Trie and Trie Matching
PIM-trie relies on two key structures: the data trie and the query trie. The data trie is the main structure containing all data stored in the index, and the query trie is a novel structure containing all keys considered by a batch of operations. A query trie and a data trie (before further optimizations) are shown in Figure 1. Processing operation batches using a whole query trie, rather than one by one, enables PIM-trie to avoid processing the shared common prefixes among strings in the batch. This idea improves both computation and communication, and also helps reduce contention.
Query Trie Construction. We build the query trie as a preprocessing step for every new batch. It is built and stored in the CPU cache. The construction algorithm is shown in Algorithm 1, and the theoretical guarantees in Lemma 4.1.

Proof. Regarding construction in Algorithm 1, string sorting [26] takes O(n(1 + L/(nw)) log log n) work and O(log² n / log log n) depth in the binary forking model. Constructing the LCP array between adjacent string pairs [52] takes O(L/w) work and O(log² n) depth. Constructing a Patricia trie from the LCP array [14] costs O(L/w) work and O(log² n) depth whp. □

Trie Matching Operation. We use the trie matching operation as a key subroutine in our method. It compares the query trie generated by the current batch against the data trie to derive the part shared between them. This shared part, called the matched trie, represents all common prefixes between the two tries, and is stored as a collection of node reference pairs between query trie nodes and data trie nodes. Figure 1 shows an example, with the matched trie in red.

[Figure 2 caption: Left side: the data trie of Figure 1, decomposed and distributed randomly among PIM modules, with mirror nodes marked as dashed circles and the hash value manager omitted. Right side: a query trie decomposed by data trie block root hashes, with each block in a gray box marked with its matching block ID. The final matched trie is in red.]

Let m denote the total number of matched trie pairs. The full matching information contains m = O(L) pairs, taking O(m) communication to record data trie node references, thus breaking our bounds. Therefore, we derive only the O(n) matchings for compressed nodes of the matched trie and the query trie. As shown in Figure 1, a trie matching operation builds the red dashed arrows.
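The sort-then-LCP structure of the construction can be sketched serially (Algorithm 1 itself is parallel; the names here are hypothetical). In a Patricia trie over the batch, every inner (branching) node corresponds to the LCP of some adjacent pair in sorted order:

```python
def adjacent_lcps(batch):
    """Sort a batch of bit strings and compute the LCP length of each
    adjacent pair. These LCP values determine the branching nodes of
    the query trie built over the batch."""
    keys = sorted(set(batch))

    def lcp(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    return keys, [lcp(keys[i], keys[i + 1]) for i in range(len(keys) - 1)]
```

For the batch {"10100", "10111", "0110"}, the sorted order is ["0110", "10100", "10111"] and the adjacent LCPs are [0, 3], i.e., the query trie branches at the root and at the prefix "101".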

Hybrid Hash Trie
The PIM-trie stores string data among the PIM modules and enables efficient search by using a hybrid method of trie and hash comparisons. We combine both methods because using only one would fail to fully utilize the PIM system. For instance, simply randomly distributing the trie nodes to the PIM modules suffers from serially pointer-chasing up to Θ(l) steps when processing highly skewed data.
Using hash comparisons only, on the other hand, is also insufficient. Consider a fast-trie-style structure where we compute the node hash for every data trie node and store a pointer to the node in a hash table; trie matching is then executed as a hash join: we look up query trie nodes in the hash table, and all matched node pairs are found because they share the same node hash. However, this solution poses a dilemma in the presence of path compression: if we store only the compressed nodes of the data trie in the hash table, we miss potential matched node pairs where a data trie hidden node matches a query trie node; if we store all data trie nodes, the hash table takes O(L) words of space, Θ(w) times more than storing only the trie. A similar dilemma exists on the query trie side: we miss possible answers if we look up only compressed nodes, or cause too much communication otherwise. One example is shown in Figure 1, where compressed nodes 1, 3, 4 match with compressed nodes while node 2 matches with a hidden node; the common prefix "10100" is represented by hidden nodes in both tries. With this approach, we get worse space and communication bounds for a correct result.
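The hash-join view of trie matching can be sketched as follows (a simplified, collision-oblivious illustration; the names are ours). It also makes the dilemma visible: the join finds only pairs whose prefixes are actually stored in the table, so any hidden node left out of `data_prefixes` produces a missed match.

```python
def hash_match(data_prefixes, query_prefixes, h=hash):
    """Hash join between data trie nodes and query trie nodes: a pair
    matches when the node hashes are equal. Hash collisions can yield
    false positives, which a verification step must later filter out."""
    table = {h(p): p for p in data_prefixes}
    return {q: table[h(q)] for q in query_prefixes if h(q) in table}

# Data side stores prefixes "", "1", "10", "101"; the query prefix
# "100" is absent from the table, so only "10" matches.
matches = hash_match({"", "1", "10", "101"}, {"10", "100"})
```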
Tree Decomposition with Hash Values. PIM-tries combine the trie structure with hashing. We decompose the data trie into blocks of similar sizes, compute node hashes for their roots as metadata, and distribute these blocks uniformly randomly to PIM modules. An example is shown on the left side of Figure 2 where the data trie in Figure 1 is decomposed and distributed. Each (data trie) block contains the root node hash and the trie block. This decomposition replicates block root nodes as mirror nodes in the block containing the node's parent, represented by dashed circles in Figure 2. We omit details about metadata management by the hash value manager (see Section 4.4) and focus on trie blocks in this section.
Unlike the approach without hashing, we avoid remote pointer chasing by using block root hashes instead, abandoning all inter-block remote pointers. The trie matching algorithm starts with a comparison between block root hashes and the hashes of all nodes in the query trie; a trie matching operation between a block and the subtree of a query trie node v is triggered if and only if the block root matches v. Furthermore, trie matching operations over all blocks can run in parallel, without even waiting for the results of their parent blocks.
Block Size and Blocking Algorithm. Data tries are divided into blocks of B = Θ(log² n) words, making O((L/w + n)/B) blocks. All block roots are compressed nodes, and long compressed edges of more than B words are cut into pieces by adding compressed nodes in the middle to avoid oversized blocks, introducing O(L/(w·B)) new compressed nodes.
For the blocking algorithm, PIM-tries reduce the problem to a parallel tree partitioning problem with weighted nodes, where the weight of each node is the total size of itself and its two child edges. We use the parallel tree partitioning algorithm of [9], but extend it to a weighted version. To divide a tree of m nodes into blocks of B nodes, the unweighted algorithm (1) generates the Euler tour of the tree, (2) marks one out of every B nodes as base nodes, and (3) marks all lowest common ancestors of base nodes. The marked node set yields an ideal partition of the tree. For the weighted version, we assign the node weights to the Euler tour array, calculate the prefix sum, and pick as base nodes those nodes whose prefix sum crosses a multiple of B. This algorithm generates roots for O(m/B) blocks of size O(B), in O(m) work and O(log m) depth whp.

Proof. Every PIM-trie contains multiple distributed trie blocks and a hash value manager. The aggregated size of the trie blocks is O(L/w + n), as the data trie takes O(L/w + n) words of space before decomposition. Only O(L/(w·B) + n/B) additional space is required afterwards, including O(L/(w·B)) new compressed nodes from long-edge cuts and O(1)-sized data per block. The space complexity of the hash value manager is proved in Lemma 4.7. □
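The weighted base-node selection step can be sketched as follows (a serial sketch with hypothetical names; the real algorithm computes the prefix sums in parallel and additionally marks the LCAs of base nodes):

```python
def pick_base_nodes(euler_tour_weights, B):
    """Mark base nodes along the Euler tour: a position becomes a base
    node when the weighted prefix sum crosses a multiple of B."""
    bases, total = [], 0
    for i, w in enumerate(euler_tour_weights):
        before, total = total, total + w
        if total // B > before // B:  # the running sum crossed a multiple of B
            bases.append(i)
    return bases

# Weights 3,4,2,5,1 along the tour with B = 5: prefix sums 3,7,9,14,15
# cross multiples of 5 at positions 1, 3, and 4.
```

Because each ball of weight is at most B (long edges are cut beforehand), consecutive base nodes delimit weight at most O(B), which is what bounds the block sizes.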

Trie Matching Algorithm
The data trie block root hashes enable parallel block matching by decomposing the query trie, but matching all blocks in parallel would break our communication bound. In the worst case, each query trie node can match a different data trie block root, dividing the query trie over a batch of m strings into O(ml/w) blocks for parallel matching, breaking our bounds even if only one word of communication is used per block. However, most blocks are not critical. Though there can be up to O(ml/w) blocks, all but O(m) of them are simply edges connecting two matched hidden nodes as endpoints and contain no compressed nodes, as the query trie only has O(m) compressed nodes. We call these blocks non-critical blocks. For example, a query trie after root hash matching is shown in Figure 2, where blocks are in gray boxes, and the nodes representing the root, "101", and "1010" are the roots of blocks 1, 2, and 3, respectively. Blocks 1 and 3 are critical blocks, but block 2 is not. During the trie matching process, we ignore non-critical blocks and only match critical blocks unless verification is required. The verification process will be discussed in Section 4.4.3.

Algorithm 2 depicts pseudocode for our trie matching algorithm. The hash value manager generates critical block roots together with references to their matching data trie blocks, and stores them on the CPU side (Line 1). We then expand these roots into stand-alone blocks (Line 2) on the CPU side. Blocks generated by this expansion can be larger than the actual critical blocks by absorbing all child non-critical blocks (for example, block 2 will be absorbed into block 1 in Figure 2), but these additional bits are filtered out automatically in the local trie matching process. Since the actual data trie blocks are in PIM memory, we use the Push-Pull technique to decide where to match local trees (Lines 6-13). We get the final result by merging the results of different blocks (Line 14).
The actual Push-Pull process takes 5 rounds: (1) push small query trie blocks (Line 8), (2) PIM calculation (Line 9), (3) fetch results (Line 10), (4) pull for small data trie blocks (Line 12), and (5) CPU calculation (Line 13). In this paper, we merge these rounds into a single one in the pseudocode for simplicity. A serial depth-first search is used as the local matching algorithm between blocks; its details are given at the end of Section 4.4.2. A minimum batch size is required for the Push-Pull technique. The hash value manager and the solution to hash collisions will be introduced in Section 4.4.

Theorem 4.3. For a query trie of size ml/w = Ω(P log⁵ n), the trie matching algorithm requires O(ml/(wP)) IO time, O(log P) IO rounds, O(ml/w) CPU work, O(log² n + log m log w + log(ml/w)) CPU depth, and O(ml log n/(wP)) PIM time, with all bounds holding whp.

Proof. We prove the IO bounds for Algorithm 2 here and leave the work bounds to Lemma 4.11. The bounds for the hash value manager are proven in Lemmas 4.8 and 4.10. All bounds hold whp because of the verification costs for potential hash collisions.
According to the Push-Pull technique, the communication for each critical query trie block is the minimum of its own size and the block size limit B. Therefore, the total communication is ∑ᵢ min(sᵢ, B) = O(ml/w) words, where sᵢ is the size of the i-th critical block, giving O(ml/(wP)) IO time by load balance.
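A toy sketch of this per-block Push-Pull decision (the names and the word-count model are ours, for illustration):

```python
def push_pull_plan(query_block_sizes, B):
    """For each critical block: push the query block if it is at most B words;
    otherwise pull the (at-most-B-word) data trie block to the CPU."""
    return ["push" if s <= B else "pull" for s in query_block_sizes]

def total_io_words(query_block_sizes, B):
    """Each block contributes min(size, B) words of communication."""
    return sum(min(s, B) for s in query_block_sizes)
```

Since every term is capped at B and there are O(m) critical blocks, the sum is bounded by both the total query trie size and m·B.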

Hash Value Manager
The hash value manager manages the metadata of data trie blocks: their root hashes. It is used in the first step of trie matching, where we derive critical blocks by matching the hash values of all O(ml/w) query trie nodes against those of all O(N/B) data trie block roots in O(ml/w + P) words of communication. Designing a hash value manager is challenging when we want both low communication and load balance. If it calculates node hashes on the CPU side, sending all O(ml/w) node hashes breaks our communication bound. Another solution is to send the trie and calculate node hashes on the PIM side. This approach overcomes the communication problem, but brings additional challenges, as it requires each matching pair of trie node and block root hash to reside on the same PIM module.
We observe that we need sufficient locality for low communication, and sufficient randomness for load balance. The selective replication technique (Section 3.3) is such a combination, but its design is only applicable to balanced search trees, not to tries, which can have much greater height. Motivated by this technique, we design a new distribution strategy in the hash value manager.
The hash value manager organizes block root hashes into a meta-tree, a directed tree whose nodes represent blocks and whose edges represent block connections: if block x contains a mirror node of the root of block y, the nodes for x and y are connected in the meta-tree. Every meta-tree node contains the root hash and the PIM address of the block it represents. The meta-tree does not physically exist as a whole tree; instead, it is stored as similar-sized pieces distributed randomly among PIM modules, forming meta-blocks. Each meta-block also maintains a hash table that maps block root hashes to its meta-tree nodes. Similarly, a master-tree and its hash table are built to organize meta-blocks. The master-tree physically exists and is replicated in all PIM modules. An example is shown in Figure 3. A meta-tree with 12 nodes is shown in the top left corner. It is decomposed into two meta-blocks with their IDs and PIM IDs attached, shown on the right side. The master-tree and its hash table are drawn in red in the bottom left corner. As for Figure 2, its meta-tree can be represented as block 1 → block 2 → block 3.
We set the meta-block size threshold to T = PB. Meta-trees have degree up to O(B) = O(log² n), because all O(B) leaves of a data trie block can be mirror nodes of child block roots.
Trie Matching. Every meta-block represents multiple blocks that form a single connected component in the data trie. Therefore, the data trie now has a two-layer decomposition: first by meta-block roots, then by block roots. The trie matching algorithm runs by applying the same decomposition to the query trie: it first divides the query trie into meta-blocks, then into blocks, and finally performs the actual matching. The notion of critical blocks extends naturally to critical query trie meta-blocks. The first step, which derives the critical meta-blocks, is shown in Algorithm 4: we first divide the query trie into O(P log P) similar-sized master-blocks (Line 1), then run HashMatching between the hash table of the master-tree and each block (Algorithm 3) at the PIM modules to generate query trie meta-block roots (Lines 2-6).
Algorithm 4 is load-balanced because it sends similar-sized blocks to random PIM modules, but a load-balanced matching between meta-blocks to divide the query trie into blocks is hard in the presence of large query trie meta-blocks. The load-balanced matching algorithm is introduced in Section 4.4.1; optimizations to reduce work in Section 4.4.2; and verification in Section 4.4.3.
Hash Function. PIM-tries place requirements on the hash function, the minimum of which is that it be incremental: after decomposing a query trie into blocks, the full string of a node may not exist within the block, so the hash function must be able to generate node hashes from a prefix hash (the block root hash) and a suffix string (see Definition 2). Many incremental hash functions commonly used in practice are furthermore binary associatively incremental (see Definition 3), such as rolling polynomial hashing [31] and CRC [44]. This property enables internal parallelism in node hash generation via parallel prefix sum [12] and rootfix scan [53].

Figure 4: Meta-block trees generated from Figure 3, with cut nodes for meta-blocks in blue.

Definition 3. A hash function h(·) for bit-strings is binary associatively incremental if, for any bit-string s formed by concatenating a and b, there is a binary associative operation ⊕ that outputs h(s) = h(a) ⊕ h(b), using only the hash values h(a) and h(b) and their lengths, without knowing the bit-strings a or b.
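As a concrete instance, a rolling polynomial hash satisfies Definition 3: the combining operation needs only the two hash values and the length of the second string (the modulus MOD and base X below are illustrative choices, not taken from the paper):

```python
MOD, X = (1 << 61) - 1, 131  # illustrative modulus and base

def h(bits):
    """Polynomial hash of a bit-string given as a str of '0'/'1' characters."""
    v = 0
    for c in bits:
        v = (v * X + int(c) + 1) % MOD  # +1 keeps leading zeros significant
    return v

def combine(ha, hb, len_b):
    """The associative operation (Definition 3's oplus):
    h(a + b) from h(a), h(b), and |b| alone."""
    return (ha * pow(X, len_b, MOD) + hb) % MOD
```

Associativity follows from modular arithmetic, so node hashes can be produced by a parallel prefix computation over edge labels, as noted above.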

Recursive Meta-block Decomposition
Trying to match meta-blocks directly at PIM modules can cause load imbalance, because we may send a large critical meta-block to a single PIM module. For example, if no hash match is found between the query trie and any meta-block root, the whole query trie is sent to the PIM module of the root meta-block. A similar problem occurs in block matching, where the Push-Pull technique is applied as the solution. However, the same solution does not work for meta-blocks because of the size difference (log² n vs. PB words, respectively). Simply fetching a meta-block of PB words to the CPU can cause imbalance given our Ω(P log⁵ n) lower limit on batch sizes.
To solve this size problem, we divide meta-blocks into smaller meta-blocks in order to re-enable the Push-Pull technique. We recursively decompose meta-blocks while ensuring that each still represents a connected component in the data trie. In each step, we select the node that minimizes the size of the largest remaining meta-block, then cut all its child edges. We apply the division recursively until the size drops below the threshold t = B = log² n.
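The cut-node selection (formalized in Lemma 4.5 below) can be found by a centroid-style walk: descend into a child whose subtree holds at least half of the nodes until no such child exists. A serial sketch with assumed names:

```python
def subtree_sizes(children, root=0):
    """Subtree sizes via a pre-order pass processed in reverse."""
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    size = {}
    for u in reversed(order):  # children are finished before their parents
        size[u] = 1 + sum(size[c] for c in children.get(u, []))
    return size

def cut_node(children, root=0):
    """Deepest node whose subtree holds >= (n+1)/2 nodes; removing its
    out-edges leaves every remaining tree with <= (n+1)/2 nodes."""
    size = subtree_sizes(children, root)
    n, u = size[root], root
    while True:
        heavy = [c for c in children.get(u, []) if 2 * size[c] >= n + 1]
        if not heavy:
            return u
        u = heavy[0]
```

On exit, every child subtree of the cut node has fewer than (n+1)/2 nodes, and the remainder has n − size(u) + 1 ≤ (n+1)/2 nodes, matching the lemma's guarantee.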
In the two examples in Figure 4, we set T = 7 and t = 3; each meta-block tree is in a gray box, and the cut nodes are marked in blue. Child meta-blocks are distributed and stored at random PIM modules, and a meta-block tree is generated, connecting them by remote pointers. The height of this tree is O(log P) according to Lemmas 4.5 and 4.6. Only the root meta-blocks of meta-block trees are inserted into the master-tree.

Lemma 4.5. For any unweighted directed out-tree of n′ nodes, there exists a cut node such that if we remove all its out-edges, the maximum remaining tree contains no more than (n′ + 1)/2 nodes.

Lemma 4.6. For a meta-block tree with O(P) nodes and a pre-defined constant α < 1, if every meta-block's children are each smaller than an α fraction of the meta-block's own size, then the meta-block tree height is O(log P).

Proof (of Lemma 4.7, the space bound of the hash value manager). The size of the hash value manager is linear in the number of root hashes it stores after internal replication, including the root hashes of O(N/B) = O(N/log² n) distinct blocks and O(N/(BP)) = O(N/(P log² n)) distinct root meta-blocks. Block root hashes are replicated O(log P) times according to the height of the meta-block tree, and meta-block tree root hashes are replicated P times in the master-tree. Thus, the hash value manager contains O((N/B) log P) hash values in O(N) space. □

Note that after this decomposition, the master-tree and the meta-block trees combine to form a hierarchical decomposition of the data trie of bounded height. The data trie is first decomposed into meta-blocks by the master-tree, then further divided by the root hashes of the meta-block tree nodes, layer by layer, into blocks. With this balanced hierarchy, we can now apply the Push-Pull strategy to divide large query trie meta-blocks into blocks without load imbalance: whenever a query trie meta-block's size exceeds the threshold log⁴ n, we pull the matching meta-block's O(log² n) child root hashes to divide it into smaller ones.
We do this recursively until the block is small enough for a push, or until the meta-block tree leaves are reached and pulled. The full algorithm for deriving critical blocks from the query trie is described in Algorithm 5, where the recursive decomposition is in Lines 4-16 and Push-Pull in Lines 17-27.
Lemma 4.8. IO bounds in Theorem 4.3 hold for Algorithm 5.
Proof. We prove the IO time separately for push and pull. In push rounds, blocks are sent from CPU to PIM for HashMatching, where the query trie is sent only twice (once in Algorithm 4) in blocks of O(ml/(wP log P)) words, guaranteeing load balance. There are O(log P) pull rounds according to the meta-block tree height, and in each round we pull O(log² n) words for each of the O(ml/(w log⁴ n)) oversized blocks, adding up to O(ml/(wP log n)) total IO time. □

The key challenge in the meta-block design is its low-cost compatibility with Inserts and Deletes: newly generated or removed blocks may break the assumption of Lemma 4.6. We discuss this in Section 5.2.

Optimizations to Reduce Work
The work inefficiency of Algorithm 5 comes from two sides: (1) it calculates hashes for all O(ml/w) nodes in HashMatching; and (2) it decomposes large query trie blocks of total size O(ml/w) with linear work in each pull round, adding up to O((ml/w) log P) total work. In this section, we introduce optimizations that reduce the CPU work to O(ml/w) whp and the PIM time to O(ml log n/(wP)) whp to meet our bounds. We first reduce the number of hashes by string alignment, then use a batch-parallel Euler tour tree [57] to maintain query trie blocks. A local matching algorithm for blocks is introduced at the end, which follows similar work-reduction ideas.
Efficient HashMatching. To reduce the number of hashes computed, we compute hashes for only a small fraction of nodes, cutting the trie into fixed-height sub-tries, then complete the matching using z-fast tries.
To be specific, we select pivot nodes on the query trie: nodes whose depth (string length) is a multiple of w bits. We generate all O(ml/w) pivot node hashes on the CPU when generating the query trie. We define the bottommost pivot ancestor of a node as its host pivot, and a pivot node is the host pivot of an edge if it is the host pivot of any node on this edge. We would miss matching points if we only considered pivot nodes, so additional steps are introduced for a complete result, starting with a data augmentation on block root hashes.

Figure 5: An example of HashMatching. Pivot nodes appear as red squares. One pivot-rooted block is outlined in red. The blue circle represents a critical block root, which matches the blue meta-tree node. On the right is the two-layer index, including the hash table and y-fast tries combined with validity vectors. Gray strings in the meta-tree do not exist physically.

For each meta-tree node, assume it represents a data trie block whose root represents a string s, so that the node physically contains hash(s) and the block's PIM address. We now add to this node: (1) the length of s, (2) the hash value of the longest prefix p of s whose length is a multiple of w, denoted hash(p), and (3) the suffix r of s after p. Note that |r| < w. The HashMatching algorithm is modified to utilize this additional data. Our modification differs between CPU-side and PIM-side execution, as they have different inputs. In CPU-side execution (pull), O(log² n) words about child meta-blocks are fetched to the CPU to divide a large query trie block. We augment the query trie for efficient division: (1) a hash table is built to map pivot hashes to pivot nodes, and (2) a z-fast trie of height w is built for every pivot node as a shortcut to all the compressed nodes it hosts. (Z-fast tries are compressed binary radix trees supporting lookups costing O(log h) whp for height h, with linear additional space.)
For the fetched information of every block root in a pull round, we first look up its hash(p) in the hash table, then its r in the corresponding z-fast trie to find the exact match position, each taking O(log w) work whp.
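The pivot hashes themselves can be produced in one incremental pass per root-to-leaf path. A serial sketch, assuming a pivot spacing of w bits and illustrative rolling-hash parameters (the paper instead uses parallel prefix sums and rootfix scans):

```python
MOD, X = (1 << 61) - 1, 131  # illustrative rolling-hash parameters

def pivot_hashes(bits, w):
    """Hashes of every prefix of `bits` whose length is a multiple of w,
    i.e., the hashes of the pivot nodes along one root-to-leaf path."""
    out, v = {}, 0
    for i, c in enumerate(bits, 1):
        v = (v * X + int(c) + 1) % MOD  # extend the running prefix hash
        if i % w == 0:
            out[i] = v                  # a pivot sits at depth i bits
    return out
```

Each prefix hash extends the previous one by a single bit, so the whole path costs O(l/w) word operations beyond reading the string.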
In PIM-side execution (push), query trie blocks of size O(log⁴ n) are sent to meta-blocks to find critical block roots. To support this operation, we change the hash table in each meta-block into a two-layered index, where the first layer maps hash(p) to the second layer, and the second layer maps r to meta-tree nodes. During the matching process, we look up all pivot hashes in the hash table; then, for every edge, we pick its bottommost hit host pivot node as its critical pivot. The match point is no more than w bits deeper than the critical pivot, so we build the string r′ going downward from the critical pivot until gathering w bits or reaching the end of the edge. However, the critical block root may not be on this string (see Figure 5 for an example), making it impossible to directly find the node by a simple lookup of r′ in the second layer.
To solve this problem, we do not find the critical block root directly, but rather it or one of its direct children. The second-layer index can be described as follows: (1) it maintains a set S of strings, all shorter than w bits; (2) for a query string q, it returns, in O(log w) time, the string t ∈ S whose LCP with q is the longest among all strings in S, such that no other string with the same LCP is a proper prefix of t. The last requirement ensures that we find a direct child rather than an arbitrary node in the subtree. We store the r of block roots in this structure. Once we query r′, it returns the r of either the critical block root or one of its direct children, which leads us to the node via an additional hash table from r to meta-tree node addresses.

The second-layer index combines a y-fast trie and a table. Every r of a block root is padded into two w-bit integers, r0 and r1, by appending 0s and 1s, respectively, and both are inserted into the y-fast trie. As different strings may be padded into the same integer, a w-bit validity vector is stored in the table for every padded integer, recording all valid prefixes. For a query, its string is padded similarly into q0 and q1, and we look up their predecessors and successors in the y-fast trie. Taking the predecessor of q0, denoted p0, as an example, we calculate its LCP with the query string, then binary search the validity vector of p0 for the shortest valid string longer than the LCP (or, if none is longer, the longest valid string shorter than the LCP), and take it as a candidate. The same process is executed for the predecessors and successors of both q0 and q1, and among the candidates with the longest LCP, the shortest is the final result. This structure supports lookups, insertions, and deletions of strings in O(log w) time whp, and a single query is guaranteed to find the correct string. Figure 5 gives an example of the case where the critical block root is an ancestor of the critical pivot, yet we still find its child through the two-layer index. In this example, w = 3.
We first look up hash("000000") in the hash table, then pad r′ = "0" into "011" (and "000") and search the y-fast trie for its predecessor (an exact match in this case), and finally binary search the validity vector to find r = "01" for the child of our target meta-tree node (marked in blue). The whole process is marked in red.

Proof. We first apply parallel prefix sum [12] to hash the pivots on each edge locally, in O(ml/w) work and O(log(ml/w)) depth; then we use a rootfix scan [53] on the trie, which costs O(m) work in expectation and O(log m) depth whp. □

Efficient Block Partition. According to the height of the meta-block trees, Algorithm 5 uses O(log P) pull rounds (Lines 4-16) to divide oversized blocks into smaller ones that will not cause load imbalance. In each round, there are O(ml/(w log⁴ n)) oversized blocks with total size O(ml/w), and we fetch O(log² n) child meta-block root hashes and references to divide each block, adding up to O(ml/(w log² n)) divisions. Although the positions of all divisions can be found in O(ml log w/(w log² n)) time, it takes O(ml/w) work per round to partition the oversized blocks physically into multiple stand-alone tries, adding up to O((ml/w) log P) total CPU work. We observe that this tree decomposition problem is a dynamic forest problem with edge deletions and subtree size queries, and choose the batch-parallel Euler tour tree [57] as the solution. The basic idea of this data structure is to maintain the Euler tours in augmented skip lists, which support edge insertions and deletions by splits and merges of skip lists, and subtree queries by values augmented on the skip list nodes. For a forest of f nodes, this algorithm solves a batch of k edge insertions, edge deletions, or subtree size queries in O(k log f) work and O(log f) depth whp.

Proof. Pre-processing all pivot hashes takes O(ml/w) CPU work and O(log(ml/w)) depth. We then prove the work bounds separately for push and pull in Algorithm 5.
For push, the total work (number of instructions) on the PIM modules is O(ml/w + m log w) whp, because a y-fast trie lookup with O(log w) whp cost happens only once per edge; but the total PIM time is O(ml log n/(wP)) because of load imbalance when some blocks have more nodes than others. For pull, there are O(log P) rounds. In each round we pull O(ml/(w log² n)) node hashes to divide oversized query trie meta-blocks. Locating these divisions takes O(ml log w/(w log² n)) work whp and O(log w) depth whp using z-fast tries, and dividing the meta-blocks takes O(ml/(w log n)) work whp and O(log n) depth whp using the dynamic forest algorithm. Adding these up gives our bound. □

Efficient Local Matching. Using a serial depth-first search as the local trie matching algorithm takes work linear in the total size of the two blocks, which can be a factor of O(B) more than the size of a small query block, breaking our bounds on PIM time. The method for reducing it to O(log w) per node is similar to the one used in the hash value manager: we pick pivot nodes in the data block (at depths kw, k ∈ ℤ), and build a z-fast trie of height w bits on every pivot as a shortcut to the compressed nodes hosted by this pivot and the pivots of the next level (at depth (k + 1)w). Trie matching is then executed by a DFS over all pivot nodes and compressed nodes in the query block.
Lemma 4.11. Work bounds in Theorem 4.3 hold for Algorithm 2.
Proof. The pull side on the CPU takes work linear in the query trie size and depth linear in the block size O(log² n). For the PIM work, the number of z-fast trie lookups is the total number of pivot nodes and compressed nodes in the query block, which is linear in its size. □

Verification
Hash collisions can cause false positives and thus incorrect results. We design a verification process to correct this efficiently. By setting a hash length of 5 log₂ n = O(w) bits and triggering a global re-hash once the tree size is squared, hash collisions occur with probability O(1/n). A global PIM hash table [30] ensures no hash collisions among data trie block roots by triggering a global re-hash whenever a collision is found. Given the infrequency of re-hashing, the maintenance cost does not change our (amortized) Insert bounds.
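To see why such a hash length suffices, a union-bound calculation (our illustration, treating hashes as uniformly random) shows that with c·log₂ n hash bits the probability of any pairwise collision is at most n^(2−c)/2:

```python
from math import log2

def collision_bound(n, c=5):
    """Union bound over the ~n^2/2 pairs of uniformly random
    c*log2(n)-bit hashes: Pr[any collision] <= (n^2/2) / 2^(c*log2 n)."""
    hash_bits = c * log2(n)
    pairs = n * n / 2
    return pairs / (2 ** hash_bits)
```

With c = 5 the bound is n^(−3)/2, comfortably o(1/n), which is why re-hashing is rare.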
A collision between a query trie and a data trie can cause an incorrectly partitioned query trie and thus a wrong matching process.
The key challenge is to detect these collisions. Once a collision is detected, we rectify the partitioning and redo the matching process, repeating until no collision remains. The number of redo rounds is linear in the number of simultaneous collisions, whose probability decreases exponentially to O(1/n), making all bounds in Theorem 4.3 hold whp.
All critical blocks go through the bit-by-bit matching process, whose results report hash collisions. However, we cannot afford to verify all O(ml/w) non-critical blocks this way, because the required communication would break our bounds. We instead verify them during HashMatching in the hash value manager. Notice that non-critical blocks exist only as consecutive sequences in the middle of edges, and every such sequence has a critical block root at its bottommost end. For PIM-side HashMatching, after finding a critical block root and its matching meta-tree node v, we locate the matching meta-tree nodes of these non-critical blocks, which are the closest ancestors of v. We augment each meta-tree node with (1) the last w bits of its represented string s, denoted e, and (2) a shortcut pointer to an arbitrary meta-tree ancestor whose depth is between w/2 and w bits smaller than the node's own depth. Short non-critical blocks of fewer than w bits are verified using e, chasing shortcut pointers for speedup, while long ones are returned as critical blocks. CPU-side HashMatching matches all blocks, so short non-critical blocks are verified directly using e. The bounds in Theorem 4.3 still hold with verification introduced.

PIM-TRIE OPERATIONS
In this section, we introduce operations supported by PIM-trie, and show how trie matching can be used to efficiently implement them.

LongestCommonPrefix
A LongestCommonPrefix (LCP) query finds the LCP between a query string and all strings stored in the PIM-trie. Because the result is a prefix of the query string, we only need its length. This query is easily answered from the trie matching results: the length is the node depth of the bottommost matched ancestor of the query trie node representing the string. For example, in Figure 1, the string "101001" is represented by node 6; its LCP length of 5 is the node depth of that node's bottommost matched ancestor.
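On a single machine, the same answer can be read off a plain binary trie (a toy stand-in for the distributed data trie; names are ours):

```python
def build_trie(strings):
    """Build an uncompressed binary trie as nested dicts."""
    root = {}
    for s in strings:
        node = root
        for c in s:
            node = node.setdefault(c, {})
    return root

def lcp_length(root, q):
    """Depth of the deepest trie node matching a prefix of q,
    i.e., the length of the LCP of q with all stored strings."""
    node, d = root, 0
    for c in q:
        if c not in node:
            break
        node, d = node[c], d + 1
    return d

def batch_lcp(strings, queries):
    """Answer a batch of LCP queries against one data set."""
    root = build_trie(strings)
    return [lcp_length(root, q) for q in queries]
```

PIM-trie obtains the same lengths without pointer chasing, by trie matching followed by a parallel rootfix over the query trie.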
To find the LCPs for a batch of strings, we build the query trie, perform trie matching, then run a parallel rootfix on the query trie to collect the results. Combining these bounds gives the final bound.

Insert and Delete
Insert. To Insert a batch of strings, we insert the unmatched subtrees of the query trie into the data trie, which updates both distributed data trie blocks and the hash value manager.
Update blocks. Unmatched query trie subtrees are inserted at the matching nodes of their roots. For example, in Figure 2, the unmatched subtrees are in black and should be inserted into blocks 1 and 3. The required information is derived from the matched trie after a trie matching. The in-block insert process is the same as in traditional tries, and is executed on the CPU (for large subtrees) or PIM (for small subtrees) based on the Push-Pull threshold. Blocks may become oversized after insertion; in this case, they are sent to the CPU and re-partitioned by the blocking algorithm of Section 4.2, which also derives the additional information for the block roots. Newly generated blocks are re-distributed to random PIM modules.
Update the hash value manager. The root hashes (and other auxiliary data) of newly generated blocks must be inserted into the hash value manager. As all new blocks are children of existing blocks, only the meta-blocks holding the parent blocks' root hashes need insertions, and we send the new root hashes to them.
Insertions into the meta-block tree bring new problems: the cut vertex may no longer be the optimal cut in the new subtree, breaking the balanced structure of the meta-block tree and its height bound. For example, adversaries can keep inserting new blocks at the end of a chain of blocks, degenerating the meta-block tree into a flat list.
Motivated by the scapegoat tree [22], we maintain the meta-block tree height bound by rebuilding on sufficient imbalance. Specifically, for any meta-block in the meta-block tree with a child whose size exceeds an α fraction of its own size, a new cut vertex is selected and its meta-block subtree is rebuilt completely. The rebuild is always executed on the CPU side, in O(s log s) work and O(log² s) depth whp for a meta-block subtree of s nodes. Setting α as a predefined parameter larger than 0.5 (and smaller than 1), this rebalancing protocol ensures that the tree height bound of Lemma 4.6 holds. Besides imbalanced meta-blocks, oversized ones also need maintenance to meet the size limits for the roots and leaves of the meta-block tree. For a leaf meta-block exceeding t nodes, we select its cut vertex to generate its children in the meta-block tree. For a root meta-block exceeding T nodes, all its children are upgraded to roots of separate meta-block trees, and their root hashes are inserted into the master-tree. This policy may bring O(log² n) undersized root meta-blocks into the master-tree, but the space bound is preserved.
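The scapegoat-style trigger can be sketched as follows (ALPHA and the names are illustrative; any constant strictly between 0.5 and 1 satisfies Lemma 4.6):

```python
ALPHA = 0.7  # illustrative; any constant strictly between 0.5 and 1 works

def needs_rebuild(u, children, size):
    """A meta-block tree node u is rebuilt when some child subtree exceeds
    an ALPHA fraction of u's own subtree size (scapegoat-style trigger)."""
    return any(size[c] > ALPHA * size[u] for c in children.get(u, []))
```

As in scapegoat trees, the rebuild cost is amortized against the insertions that caused the imbalance, which is why the Insert bounds are stated amortized.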
Delete. To Delete a batch of strings, we remove the matched trie from the data trie. The process is similar to Insert, by updating the distributed trie blocks first and then the hash value manager.
For block-level deletion, query trie blocks are sent to data trie blocks for local deletion. Special care is required for mirror nodes (block root replicas), as a mirror node must not be deleted until the entire subtree of the block root it represents has been completely removed. For example, in Figure 2, we must not remove the mirror node in block 1 unless both blocks 2 and 3 are removed completely. To handle this, we record each query trie block that attempts to completely delete its matching data trie block, use a parallel leaffix operation to find the completely deleted subtrees, and finally perform the local deletions. Undersized blocks are merged into their parents. Another subtlety is that non-critical blocks may also need deletion. Long non-critical blocks are treated as critical, and short ones never completely delete a block, so only those with a critical block as a child need deletion, and their number is linear in the number of critical blocks.
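The leaffix step can be sketched as a bottom-up pass (a serial illustration with assumed names): a block's subtree is completely deleted iff its own full deletion is attempted and all of its child blocks are completely deleted.

```python
def completely_deleted(children, attempted, root):
    """Bottom-up (leaffix) pass over the block tree: return the set of
    blocks whose entire subtree is removed, so their mirror nodes can go."""
    order, stack = [], [root]
    while stack:                      # pre-order traversal
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    done = {}
    for u in reversed(order):         # children are resolved before parents
        done[u] = attempted.get(u, False) and \
            all(done[c] for c in children.get(u, []))
    return {u for u, d in done.items() if d}
```

In the Figure 2 scenario, block 2 can be fully removed while block 1's mirror node survives because block 3 is not completely deleted.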
For hash value manager updates, block removals are applied to meta-blocks as node removals. The rebalancing protocol is the same as for Insert. Undersized internal nodes of the meta-block tree merge in all their children, and an undersized root meta-block is merged into its parent meta-block in the master-tree if both are undersized.
Load Balance. When a skewed Insert/Delete batch falls entirely into the same meta-block, there can be O(ml/(w log² n)) updates to a single meta-block in one round, which is severely load-imbalanced for batch sizes of Ω(P log⁵ n).
Note that the meta-block tree is itself selectively replicated, as every meta-block tree node caches the information of its subtree in its meta-tree piece. The data in each meta-block can be categorized into O(log² n) words of critical information, including its own root hash, the pointers and root hashes of its child meta-blocks, and meta-tree pieces for leaves; and O(PB) words of non-critical information, including meta-tree pieces for internal nodes. If we set the threshold in Algorithm 5 to 0, we get correct trie matching results using only critical information, though with asymptotically more communication.
Our solution to load imbalance is similar to [29]: always keep the critical information up-to-date, and delay non-critical updates on some meta-blocks, marking them unfinished. Whenever the number of unfinished nodes exceeds a predefined threshold, additional update rounds are activated until the number drops below it. With a bounded number of unfinished nodes, Theorem 4.3 still holds.
Compared to [29], PIM-trie faces two new challenges. First, reading meta-blocks of O(PB) words to the CPU for merging/re-partitioning causes load imbalance. We solve this by not fetching these blocks directly, but instead merging their child meta-blocks stored on different PIM modules. Second, while y-fast trie insertions (and deletions) take amortized O(log w) time, they take O(w) time in the worst case, which can cause PIM time imbalance. They can be de-amortized by using a weight-balanced tree as the internal binary search tree, and by de-amortizing the internal x-fast trie with a layer of indirection, borrowing ideas from concurrent data structures.

Subtree Query
A Subtree Query returns a trie containing all (key, value) pairs whose keys contain the given string as a prefix. Traditional indexes answer it by pointer chasing, which may take O(l/w) IO rounds in the worst case; PIM-tries answer it in O(log P) IO rounds.
Using trie matching, we find a target node for every non-trivial Subtree Query; the result is the subtree of this node. A key observation is that if a block (or meta-block) root is in the subtree, the whole block is. Thus, the result consists of the subtree within the target node's block plus the child blocks of this block. We fetch and merge all these components by slightly modifying trie matching to return all child block references rather than only the matching block, which takes no additional IO rounds. To ensure load balance, we do not fetch large meta-tree subtrees (containing block references) directly from a single meta-block, but instead merge them from its children in O(log P) more IO rounds.

CONCLUSION
This paper presents PIM-trie, the first batch-parallel radix-based index designed for processing-in-memory (PIM) systems. It supports LCP, Subtree Query, Insert, and Delete on variable-length strings, and simultaneously achieves high load balance, low communication, and low space overhead, with good theoretical bounds in the PIM model regardless of adversarial skew in data and queries. The key techniques introduced in this paper are (1) trie matching between a query trie and the data trie, (2) a block-wise decomposition and selective recursive replication of the data stored on the PIM side, supported by hash-based management, and (3) a verification procedure with differentiated handling of critical and non-critical blocks. Future work includes designing PIM-friendly algorithms and data structures on top of these key methods (such as suffix trees and graph processing), as well as implementing PIM-trie on real PIM systems.