Perseid: A Secondary Indexing Mechanism for LSM-Based Storage Systems

LSM-based storage systems are widely used for their superior write performance on block devices. However, they currently fail to efficiently support secondary indexing, since a secondary index query operation usually needs to retrieve multiple small values that are scattered across multiple LSM components. In this work, we revisit secondary indexing in LSM-based storage systems with byte-addressable persistent memory (PM). Existing PM-based indexes are not directly suitable for efficient secondary indexing. We propose Perseid, an efficient PM-based secondary indexing mechanism for LSM-based storage systems, which takes into account the characteristics of both PM and secondary indexing. Perseid consists of (1) a specifically designed secondary index structure that achieves high-performance insertion and query, (2) a lightweight hybrid PM-DRAM and hash-based validation approach to filter out obsolete values with negligible overhead, and (3) two adapted optimizations on primary table searching issued from secondary indexes to accelerate non-index-only queries. Our evaluation shows that Perseid outperforms existing PM-based indexes by 3–7× and achieves about two orders of magnitude higher performance than state-of-the-art LSM-based secondary indexing techniques, even when they run on PM instead of disks.


INTRODUCTION
Log-Structured Merge trees (LSM-trees) feature outstanding write performance and thus have been widely adopted in modern key-value (KV) stores, such as RocksDB [28] and Cassandra [1]. Different from in-place-update storage structures (e.g., B+-Tree), LSM-trees buffer writes in memory and flush them to storage devices in batches periodically to avoid random writes, which enables high write performance and low device write amplification. Besides high write performance, many database applications also require high-performance queries not only on primary keys but also on other specific values [13], thus necessitating secondary indexing techniques.
LSM-trees' attributes make it challenging to design efficient secondary indexing. Modern LSM-based storage systems typically store a secondary index as another LSM-tree [54] (e.g., a column family in RocksDB [51]). However, designed for block devices and optimized for write performance, LSM-trees are not competent data structures for secondary indexes, which require high search performance. First, since secondary indexes usually only store primary keys instead of full records as values, KV pairs in secondary indexes are small. LSM-trees' heavy lookup operations are inefficient for these small KV pairs. Second, secondary keys are not unique and can have multiple associated primary keys. LSM-trees' out-of-place write pattern scatters these non-consecutively-arriving values (i.e., associated primary keys) into multiple pieces at different levels. Consequently, query operations need to search all levels of the LSM-based secondary index to fetch these value pieces. Besides the device I/O overhead, LSM-trees have non-negligible CPU and memory overheads (i.e., indexing and Bloom filters) [21,25,40].
Moreover, the consistency of secondary indexes is another issue in LSM-based storage systems. As an LSM-based primary table adopts the blind-write pattern to insert, update, and delete records (appending new data without checking prior data, versus read-modify-write in B+-Trees) for high write performance, it is unable to delete the obsolete entry in a secondary index without acquiring the old secondary key. Consequently, when querying a secondary index, the system should validate each entry by checking the primary table before returning the results to users, which introduces many unnecessary but expensive lookups on the primary table for obsolete entries. Some systems fetch old records when updating or deleting records to keep secondary indexes up-to-date synchronously [11,51], whereas this method discards the blind-write attribute and thus degrades write performance.
Though many efforts have been made to optimize these predicaments [42,47,54,59], they fail to solve the problems discussed above well, sacrificing either the write performance of LSM-based storage systems or the query performance of the secondary index.
As secondary indexing demands low-latency queries and the KV pairs of secondary indexes are small, we argue that leveraging persistent memory (PM) to provide a new solution for secondary indexing is promising. PM has many attractive advantages, such as byte-addressability, DRAM-comparable access latency, and data persistency, which are well suited to secondary indexing. Though there are many state-of-the-art PM-based indexes [17,31,37,39,43,44,52,53,72,73], none of them are designed for secondary indexing. Without considering the non-unique feature of secondary indexes and the consistency issue in LSM-based KV stores, simply adopting existing general PM-based indexes as secondary indexes can overshadow their performance.
In this work, we propose Perseid [61], a new persistent-memory-based secondary indexing mechanism for LSM-based KV stores. Perseid contains PS-Tree, a specifically designed data structure on PM for secondary indexes. PS-Tree can leverage state-of-the-art PM-based indexes and enhance them with a specific value layer, which considers the characteristics of both PM and secondary indexing. The value layer of PS-Tree works as a blend of a log-structured approach and B+-Tree leaf nodes, which is both PM-friendly and secondary-index-friendly. Specifically, new values are appended to value pages for efficient insertion on PM. During the value page split, multiple values (i.e., associated primary keys) that belong to the same secondary key are reorganized to be stored contiguously for efficient querying.
Moreover, Perseid retains the blind-write attribute of LSM-based KV stores for high write performance without sacrificing secondary index query performance. This is achieved by a lightweight hybrid PM-DRAM and hash-based validation approach in Perseid. Perseid uses a hash table on PM to record the latest versions of primary keys. However, multiple random accesses on PM still incur high latencies. Thus, Perseid adopts a small mirror of the validation hash table on DRAM, which only contains the information useful for validation. During validation, the volatile hash table absorbs random accesses to PM and thus reduces the validation overhead. The small volatile hash table not only saves DRAM space but also reduces cache pollution.
Perseid has a fairly low latency for index-only queries. However, the overhead of non-index-only queries is still dominated by the LSM-based primary table. Therefore, we further propose two optimizations for non-index-only queries in Perseid. First, as querying the primary table issued by the secondary index is an internal operation, we can locate KV pairs with additional auxiliary information much more efficiently, reducing cumbersome indexing operations. By matching the tiering compaction strategy [24,48], we can further bypass Bloom filter checking. Second, as one secondary index query may need to search for multiple independent records in the primary table, we parallelize these searching operations with multiple threads. Since search latencies on the LSM-based primary table may vary largely, we apply a worker-active manner to the parallel threads to avoid load imbalance among threads and improve utilization.
We implement Perseid and evaluate it against state-of-the-art PM-based indexes and LSM-based secondary indexing techniques on PM. The evaluation results show that Perseid outperforms existing PM-based indexes by 3-7× for queries, and achieves about two orders of magnitude higher performance than state-of-the-art LSM-based secondary indexing techniques even when they run on PM instead of disks, while maintaining the high write performance of LSM-based storage systems.
In summary, this paper makes the following contributions:
• Analysis of the inefficiencies of LSM-based secondary indexing techniques and existing PM-based indexes when adopted as secondary indexes for LSM-based KV stores.
• Perseid, an efficient PM-based secondary indexing mechanism, which includes a secondary-index-friendly structure, a lightweight validation approach, and two optimizations on primary table searching issued from secondary indexes.
• Experiments that demonstrate the advantages of Perseid.

BACKGROUND
2.1 Log-Structured Merge Trees
The LSM-tree applies out-of-place updates and performs sequential writes, which achieves superior write performance compared to other in-place-update storage structures.
The LSM-tree has a multi-level structure on storage, and each level comprises one or several sorted runs. The size of Level i is several times (e.g., 10×) larger than that of Level i−1. Each sorted run contains sorted KV pairs and is further partitioned into multiple small components called SSTables. In LSM-trees, new key-value pairs are first buffered in a memory component called a MemTable. When the MemTable fills up, it turns into an immutable MemTable and gets flushed to storage as a sorted run. Since sorted runs have overlapping key ranges, a query operation needs to search multiple sorted runs. To limit the number of sorted runs and improve search efficiency, LSM-trees conduct compaction periodically to merge several components and remove obsolete KV pairs. Two typical compaction strategies and their variants are commonly used in LSM-trees [24,48]. The leveling strategy [28,30] allows each level (besides Level 0) to have only one sorted run. When a level (Level i) exceeds its size limit, one or more SSTables from Level i and all overlapping SSTables from the higher level Level i+1 are sort-merged to generate new SSTables in Level i+1. The tiering strategy [55,64] allows each level (besides Level 0) to have multiple sorted runs to reduce write amplification. To compact SSTables at Level i, several SSTables in a range partition are merged into new SSTables written to Level i+1 directly, without rewriting existing SSTables at Level i+1. Compared with the leveling strategy, the tiering strategy has a much smaller write amplification ratio and thus higher write performance. However, since query operations need to search multiple sorted runs in each level, LSM-trees with a tiering strategy have much lower read performance.
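To make the read-path difference concrete, the following C++ sketch (hypothetical types and names, not taken from any particular KV store) enumerates the sorted runs a point lookup must probe: under leveling each level contributes at most one run, while under tiering every run in every level may have to be checked.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical sorted run, modeled as an ordered map for illustration only.
struct SortedRun {
    std::map<std::string, std::string> kvs;
    std::optional<std::string> lookup(const std::string& key) const {
        auto it = kvs.find(key);
        if (it == kvs.end()) return std::nullopt;
        return it->second;
    }
};

struct Level {
    std::vector<SortedRun> runs;  // leveling: one run per level (besides Level 0); tiering: several
};

// A point lookup probes data from newest to oldest: each level from Level 0
// downward, and within a level every sorted run (newest first). Under tiering,
// more runs per level must be checked, which is why read performance drops.
std::optional<std::string> point_lookup(const std::vector<Level>& levels,
                                        const std::string& key) {
    for (const Level& level : levels)
        for (const SortedRun& run : level.runs)
            if (auto v = run.lookup(key)) return v;
    return std::nullopt;
}
```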

Secondary Index in LSM-based Systems
Many applications require queries on specific values other than primary keys. Without an index based on those specific values, database systems need to scan the whole table to find relevant data. Thus, secondary indexing is an indispensable technique in database systems. For example, in Facebook's database service for social graphs, secondary keys are heavily used, such as finding the IDs of users who liked a specific photo [13,51]. In this work, we mainly discuss stand-alone secondary indexes, which are separate index structures apart from the primary table and are commonly used in database systems [54]. A stand-alone secondary index maintains mappings from each secondary key to its associated primary keys. As secondary keys are not unique, a single secondary key can have multiple associated primary keys.
Consistency strategy. Since LSM-based KV stores update or delete records by out-of-place blind-writes, maintaining the consistency of secondary indexes becomes a challenge in LSM-based storage systems. There are two strategies to handle this issue: Synchronous and Validation.
For the Synchronous strategy, whenever a record is written in the primary table, the secondary index is maintained synchronously to reflect the latest and valid status (e.g., AsterixDB [11], MongoDB [5], MyRocks [51]). For example, as shown in Figure 1(a), when writing a new record {pk2→sk1} (pk denotes the primary key, sk denotes the secondary key, and other fields are omitted for simplicity) into the primary table, the storage system also fetches the old record of pk2 to get its old secondary key sk2. Then the storage system inserts not only a new entry {sk1→pk2} but also a tombstone to delete the obsolete entry {sk2→pk2} in the secondary index. Nevertheless, this strategy discards the blind-write attribute and thus degrades the write performance, which is the main advantage of LSM-based KV stores.
By contrast, as shown in Figure 1(b), the Validation strategy only inserts the new entry {sk1→pk2} but does not maintain the consistency of obsolete entries in secondary indexes (e.g., Cassandra [1,2], DELI [59], and the secondary indexing proposed by Luo et al. [47]). However, secondary index query operations need to validate all relevant entries by checking the primary table to filter out obsolete mappings. Though previous work proposed some approaches to reduce the validation overhead, their benefits are limited. For example, DELI [59] lazily repairs the secondary index along with compaction of the primary table. Luo et al. [47] propose to store an extra timestamp for each entry in the secondary index and to use a primary key index, which only stores primary keys and their latest timestamps, for validation. The primary key index is checked instead of the primary table. However, since the primary key index is also an LSM-tree, though it filters out unnecessary point lookups on the primary table, it still requires point lookups on itself.
Index type. As a secondary key can have multiple associated primary keys, LSM-based secondary indexes have two types around this issue: the composite index and the posting list [54]. The key in a composite index (i.e., the composite key) is a concatenation of a secondary key and a primary key. The composite index is easy to implement and is adopted by many systems [20,51,54]. However, it turns a secondary lookup operation into a prefix range search operation.
The posting list stores multiple associated primary keys in the value of a KV pair. Entries in each posting list can be sorted by primary key or recency. When a new record is inserted, there are two update strategies. The eager update strategy conducts a read-modify-write, fetching the old posting list and merging the new primary key into it. The lazy update strategy blindly inserts a new posting list that only includes the new primary key, leaving posting list merging to compaction. However, a secondary lookup then needs to search all levels to fetch all relevant entries.
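As a concrete illustration of the two index types (our own hypothetical encoding, assuming fixed-width 64-bit keys as in our evaluation workloads), the sketch below builds a composite key as the concatenation of a secondary key and a primary key, so a secondary lookup becomes a prefix range scan, whereas a posting list keeps all primary keys of one secondary key in a single value.

```cpp
#include <cstdint>
#include <string>

// Fixed-width 64-bit keys, big-endian so lexicographic order matches numeric order.
static std::string encode_be64(uint64_t v) {
    std::string s(8, '\0');
    for (int i = 7; i >= 0; --i) { s[i] = static_cast<char>(v & 0xff); v >>= 8; }
    return s;
}

// Composite index: one KV pair per (secondary key, primary key) mapping.
// A secondary lookup for `skey` becomes a range scan over the prefix encode_be64(skey).
std::string make_composite_key(uint64_t skey, uint64_t pkey) {
    return encode_be64(skey) + encode_be64(pkey);
}

// Posting list: all primary keys of one secondary key packed into one value.
// Eager update: read-modify-write the list; lazy update: blindly insert a
// one-element list and leave list merging to compaction.
std::string append_to_posting_list(const std::string& old_list, uint64_t pkey) {
    return old_list + encode_be64(pkey);
}
```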
Limitations. Even though there are multiple strategies, types, and optimizations, LSM-based secondary indexes have to sacrifice either the write performance of the storage system or the secondary index query performance, which results from the incompatibility between the inherent attributes of LSM-trees and the characteristics of secondary indexes.

Persistent Memory
Persistent Memory (PM), also called Non-Volatile Memory (NVM) or Storage Class Memory (SCM), provides several attractive benefits for storage systems, such as byte-addressability, DRAM-comparable access latency, and data persistency. CPUs can access data on PM directly with load and store instructions. Besides, compared to DRAM, PM has a much larger capacity, lower cost, and lower power consumption. Therefore, both academia and industry have proposed plenty of work to harness PM's benefits in storage systems [18,26,33,46,53,56,58,66,67,73]. In addition to DDR-bus-connected PM (e.g., Intel Optane DCPMM), the recent high-bandwidth and low-latency I/O interconnect, Compute Express Link (CXL) [4,35], brings a new form of SCM: CXL device-attached memory (e.g., Samsung's CMM-H (CXL Memory Module - Hybrid) [9,10]).
However, PM also has some performance idiosyncrasies. For example, the current commercial PM hardware (i.e., Intel Optane DCPMM) has a physical media access granularity of 256 bytes, leading to high random access latency (about 3× that of DRAM for reads) and write amplification for small random writes, which needs to be considered when designing PM systems [18,60,63,65,68,71]. These performance idiosyncrasies are likely to be a general problem, or even more pronounced, in other PM devices due to physical media characteristics (e.g., the flash page in CXL-SSDs).
Though Intel Optane DCPMM is currently the only available commercial device, we believe it can represent other emerging PM devices to some extent. In this paper, we mainly focus on PM's general characteristics described above rather than the specific numbers of Optane's attributes.

MOTIVATION
Though recent work introduces some techniques to optimize secondary indexing in LSM-based systems, we find that the performance of LSM-based secondary indexing is still unsatisfactory due to the incompatibility between the inherent attributes of LSM-trees and the characteristics of secondary indexing. On the one hand, the LSM-tree is not a competent data structure for secondary indexes, since the characteristics of secondary indexes exacerbate the deficiency of LSM-tree read operations: (1) KV pairs are usually small in secondary indexes, to which the LSM-tree's cumbersome lookup operations are unfriendly; (2) secondary keys are not unique and can have multiple values, and the LSM-tree's out-of-place updates scatter these values, exacerbating query inefficiency. On the other hand, the blind-write attribute of LSM-based primary tables makes the consistency of secondary indexes troublesome.
Therefore, we are motivated to find a better solution for secondary indexes in LSM-based storage systems. As PM provides attractive features such as byte-addressability, DRAM-comparable access latency, and data persistency, we argue that it is promising to build secondary indexing on PM.
Though there are many state-of-the-art PM-based index structures, they are not specifically designed for secondary indexing. To adopt them as secondary indexes (e.g., to support the multi-value feature), naive approaches include the composite index or using a conventional allocator to organize posting lists (§2.2). However, simply adopting these naive approaches to use existing PM-based indexes as secondary indexes will overshadow their superior advantages.
Why not use a PM-based composite index? Though this method is straightforward and easy to implement in LSM-based systems, it is not ideal for tree-based persistent indexes. First, when adding or removing a primary key for a secondary key, a value update operation turns into a new composite key insert or delete operation for composite indexes. Insert and delete operations are more expensive than update operations in a PM-based tree index because they may cause shift operations or structural modification operations (SMOs). Second, composite indexes store every pair of mappings as an individual KV pair, expanding the number of KV pairs, which increases the height of the tree index and thus degrades its query performance. Third, storing the same secondary keys repeatedly in multiple composite keys wastes PM space, which can be a dominant overhead for some real-world databases [70].
Why not use a conventional allocator for posting lists? One may use a conventional allocator, such as a slab-based allocator or a log-structured approach, to allocate space for values (posting lists) outside the index. Nevertheless, these are not suitable for the values of secondary keys. One way is to allocate space for a whole posting list for each secondary key with a general-purpose allocator such as a slab-based allocator. However, these general-purpose allocators usually have high overheads on PM since they employ expensive mechanisms for crash consistency (e.g., logging) and perform many small writes on the metadata necessary for recovery [7]. Though some PM allocators relieve allocation overheads with techniques such as deferring garbage collection to post-failure [14,15], slab-based allocators have low memory utilization due to memory fragmentation [57], which cannot be eliminated by restarting on PM [22]. Worse still, these issues are more severe for secondary indexes. In secondary indexes, the posting list of a secondary key is changed by inserting or removing primary keys, which means the size of the posting list (the total size of the associated primary keys) changes constantly. This characteristic requires frequent reallocations and copy-on-writes. Another way is to allocate space for each individual new value (primary key) of a posting list and use pointers to link them together. One can use a lightweight and PM-friendly allocator, such as a log-structured approach, for its sequential-write pattern. However, it scatters the values (primary keys) associated with the same secondary key into multiple pieces and thus reduces query performance due to poor data locality.
Our experiments (§5.2) show that these naive approaches on PM-based indexes lead to several times performance degradation. This motivates us to explore a new PM-based secondary indexing mechanism for LSM-based KV stores. In addition, an efficient validation approach is required to retain the blind-write attribute of LSM-based KV stores.

PERSEID DESIGN
4.1 Overview
Motivated by the analysis above, we propose Perseid, a PM-based secondary indexing mechanism for LSM-based storage systems, which overcomes the deficiencies of traditional LSM-based secondary indexes. Figure 2 shows the overall architecture of an LSM-based storage system with Perseid.
• Perseid contains a PM-based secondary index, PS-Tree, which is both PM-friendly and secondary-index-friendly: by adopting log-structured insertion, PS-Tree achieves fast insertion on PM; by storing primary keys associated with the same secondary key close together and further rearranging them to be adjacent, PS-Tree supports efficient query operations (§4.2).
• Perseid retains the blind-write attribute of the LSM primary table for write performance (i.e., taking the Validation strategy (§2.2)) without sacrificing query performance, by introducing a lightweight hybrid PM-DRAM and hash-based validation approach. The validation approach contains a persistent hash table to record version information of primary keys, and a volatile and lite hash table to absorb random accesses to PM (§4.3).
• To accelerate non-index-only queries, Perseid adapts two optimizations on primary table searching issued from secondary indexes. Perseid filters out irrelevant component searching with sequence numbers and parallelizes primary table searching in an efficient way (§4.4).

PS-Tree Design
Perseid introduces PS-Tree, a PM-based secondary index designed with the multi-value feature and PM characteristics in mind. We first present PS-Tree's structure (§4.2.1) and then describe its operations (§4.2.2).

Structure.
The overall structure of PS-Tree is shown in Figure 3. PS-Tree consists of two layers: the SKey Layer for indexing secondary keys and the PKey Layer for storing values. Specifically, the SKey Layer resembles a normal in-memory index, which maintains mappings from secondary keys to posting lists in the PKey Layer. Thus, the SKey Layer can leverage an existing high-performance PM-based index (e.g., P-Masstree [39,50] and FAST&FAIR [31]). The PKey Layer stores a variable number of values (i.e., primary keys and other user-specified values) for each secondary key in a manner that blends B+-Tree leaf nodes and log-structured approaches, combining the advantages of the two. The value of a secondary key in the SKey Layer is a pointer, which points to the corresponding primary keys in the PKey Layer. Each pointer is a combination of the address of a PKey Page and an offset within the page. In the PKey Layer, primary key entries (PKey Entries) are stored in PKey Pages. Each PKey Entry has an 8-byte metadata header and a primary key. The header consists of a 2-byte size, a 1-bit obsolete flag, and a 47-bit sequence number (SQN) of the primary key. The SQN is internally used for multi-version concurrency control (MVCC) in LSM-based KV stores [28,30]. Each new record (including updates and deletes) in the primary table gets a monotonically increasing SQN. Perseid leverages the SQN mechanism to guarantee data consistency between the primary table and secondary indexes, and also for validation, which will be described in §4.3. PKey Pages are aligned to the PM physical media access granularity (e.g., 256 bytes for Intel Optane DCPMM [68]). PS-Tree inserts PKey Entries into PKey Pages in a log-structured manner to reduce the write overhead and ease crash consistency on PM.
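The sketch below renders this layout in C++; the field order and the pointer packing are our own guesses matching the sizes stated above, not Perseid's actual code.

```cpp
#include <cstdint>

// 8-byte PKey Entry header matching the sizes stated above: a 2-byte entry
// size, a 1-bit obsolete flag, and a 47-bit sequence number (SQN).
struct PKeyEntryHeader {
    uint64_t size     : 16;
    uint64_t obsolete : 1;
    uint64_t sqn      : 47;
};
static_assert(sizeof(PKeyEntryHeader) == 8, "header must stay 8 bytes");

// One plausible packing of the SKey-Layer value (ours, not necessarily
// Perseid's): a 48-bit PKey Page address in the low bits and a 16-bit
// in-page offset in the high bits, mirroring the 48-bit address used by
// the Group Header described below.
constexpr uint64_t kAddrMask = 0x0000FFFFFFFFFFFFull;

uint64_t make_skey_pointer(uint64_t page_addr, uint16_t offset) {
    return (page_addr & kAddrMask) | (static_cast<uint64_t>(offset) << 48);
}
uint64_t pointer_page_addr(uint64_t ptr) { return ptr & kAddrMask; }
uint16_t pointer_offset(uint64_t ptr)    { return static_cast<uint16_t>(ptr >> 48); }
```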
Nevertheless, traditional log-structured approaches scatter different values of the same secondary key across the log, resulting in poor data locality and degraded query performance. To improve data locality, PS-Tree stores PKey Entries of contiguous SKeys in the same PKey Page, similar to the leaf node of a B+-Tree. Furthermore, during a PKey Page split, PS-Tree rearranges PKey Entries that belong to the same secondary key to be stored contiguously as a PKey Group. Each PKey Group has an 8-byte Group Header and one or multiple PKey Entries. The lower 48 bits of a group header are the address of the previous PKey Group of the same secondary key, or null if the current group is the last one. Thus all PKey Groups belonging to one secondary key are linked as a list. The remaining 16 bits store the number of total entries and the number of obsolete entries in the group.
Log-structured Insert. Algorithm 1 describes the process of the insert operation in PS-Tree. First, PS-Tree searches for the SKey and its pointer in the SKey Layer. From the pointer, PS-Tree locates the previous PKey Group and the corresponding PKey Page (Lines 1-3). If the SKey is not found, the PKey Page is located from the pointer of the previous SKey, i.e., the largest SKey smaller than the new SKey (Line 5).
Second, PS-Tree appends a new PKey Group to that PKey Page (Lines 11-12). The new PKey Group contains one entry with the new PKey and other values if specified, and its header points to the previous PKey Group if one exists.
Third, the new pointer of the SKey (i.e., the address of the new PKey Group) is updated or inserted in the SKey Layer (Line 13). Thus, the insert request usually performs an update operation in the SKey Layer. PKey Entries of a secondary key are always linked in order of recency to facilitate query operations, which usually require the most recent entries [13,54].
Search. Algorithm 2 describes the process of the search operation in PS-Tree, which starts by searching for the secondary key and its pointer in the SKey Layer (Line 2). Then, from the latest PKey Group indicated by the pointer, primary keys and other user-specified values can be retrieved in order of recency. Perseid adopts the Validation strategy (§2.2) for its high ingestion performance. Therefore, all primary keys are first validated before being returned (Line 7). The validation process identifies obsolete entries and marks them as deleted by setting their obsolete flags. The LSM-based primary table supports MVCC by attaching one snapshot per query and using reference counters to protect components from being deleted [28,30]. In PS-Tree, we adopt an epoch-based approach: readers publish their snapshot numbers during query operations, and obsolete entries whose sequence number is larger than any reader's snapshot number are guaranteed not to be removed physically.
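A simplified rendering of the insert path (Algorithm 1) is sketched below; the interface names and types are ours, and the SKey Layer is assumed to be backed by an existing PM index such as P-Masstree or FAST&FAIR.

```cpp
#include <cstdint>
#include <string>

// Hypothetical interfaces standing in for the real PS-Tree components.
struct PKeyGroup;                 // group header + one or more PKey Entries
struct PKeyPage {
    // Log-structured append of a single-entry group linked to `prev`.
    PKeyGroup* append_group(const std::string& pkey, uint64_t sqn, PKeyGroup* prev);
};
struct SKeyLayer {
    PKeyGroup* find(const std::string& skey);             // exact match or null
    PKeyGroup* find_predecessor(const std::string& skey); // largest key < skey
    void upsert(const std::string& skey, PKeyGroup* newest_group);
};
PKeyPage* page_of(PKeyGroup* group);  // page containing a group

void pstree_insert(SKeyLayer& skeys, const std::string& skey,
                   const std::string& pkey, uint64_t sqn) {
    // 1. Find the previous PKey Group of this SKey; for a new SKey, locate the
    //    page through the predecessor SKey instead (Algorithm 1, Lines 1-5).
    PKeyGroup* prev = skeys.find(skey);
    PKeyGroup* anchor = prev ? prev : skeys.find_predecessor(skey);
    PKeyPage* page = page_of(anchor);
    // (If the page is full, the locality-aware split of Algorithm 3 runs first.)
    // 2. Append a new single-entry PKey Group linked to the previous group
    //    (Lines 11-12): a pure log append, which is PM-friendly.
    PKeyGroup* group = page->append_group(pkey, sqn, prev);
    // 3. Publish the insert by updating (or inserting) the SKey's pointer to
    //    the newest group in the SKey Layer (Line 13).
    skeys.upsert(skey, group);
}
```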
Update and Delete. PS-Tree has no update or delete operations (from the point of view of secondary indexes rather than the data structure). Since updating the primary key of a record in the primary table is commonly not supported in database systems, there is no need to update values (i.e., primary keys) in secondary indexes. With the Validation strategy, PS-Tree does not delete obsolete entries synchronously with the primary table. PS-Tree leaves obsolete entry cleaning to garbage collection.
Locality-aware PKey Page Split with Garbage Collection. When a PKey Page does not have enough space for a new entry, it splits into two new PKey Pages in a copy-on-write manner. Algorithm 3 shows the process of the PKey Page split operation. Since insertions are performed in a log-structured manner, the PKey Entries associated with one SKey may be scattered discontinuously. Querying these entries may require multiple random accesses on PM. As PM has non-negligible read latencies compared to DRAM (e.g., about 300 ns with Intel Optane DCPMM [68]), query operations can have high overheads. Therefore, as shown in Figure 4, to improve locality, PS-Tree reorganizes PKey Entries when the PKey Page splits. Specifically, PS-Tree iterates over all secondary keys associated with the PKey Page (Line 3). For each secondary key, PS-Tree collects the scattered PKey Entries and puts them together in a new PKey Page (Lines 4-24). PS-Tree rearranges PKey Entries belonging to the same SKey into one PKey Group, so these entries are stored contiguously, and the storage overhead of the Group Header is reduced since multiple PKey Entries share one Group Header.
Besides, entries not marked as deleted in the current PKey Page are validated by a lightweight approach (described in §4.3), and obsolete entries are physically removed during reorganization to reduce space overhead (Lines 9-11 in Algorithm 3).
A skewed secondary key may have so many primary keys that they occupy more than one PKey Page. For those PKey Entries not in the current PKey Page, PS-Tree lazily garbage-collects them when the number of obsolete entries exceeds half of the total number of entries in that PKey Group (Lines 15-17). The size of the collected PKey Entries (i.e., the new PKey Group) of a skewed secondary key may exceed one PKey Page. To keep page management simple, instead of using variable-sized PKey Pages, PS-Tree allocates multiple PKey Pages on demand in the append operation (Line 24).
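A minimal sketch of the locality-aware split (Algorithm 3) follows; the container-based page model, byte accounting, and the `is_valid` validation hook are illustrative assumptions rather than Perseid's implementation.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Entry {
    std::string pkey;
    uint64_t sqn;
    bool obsolete;
};

// Page contents modeled with STL containers for illustration only: each
// secondary key maps to the entries currently stored for it in the page.
using PageContents = std::map<std::string, std::vector<Entry>>;

// Hypothetical validation hook backed by the hybrid hash tables (§4.3).
bool is_valid(const std::string& pkey, uint64_t sqn);

// Locality-aware split: for every secondary key of the overflowing page,
// collect its scattered entries, drop obsolete ones, and emit the survivors
// as one contiguous PKey Group; extra pages are allocated on demand.
std::vector<PageContents> locality_aware_split(const PageContents& old_page,
                                               size_t page_capacity_bytes) {
    std::vector<PageContents> new_pages(1);
    size_t used = 0;
    for (const auto& [skey, entries] : old_page) {
        std::vector<Entry> group;
        for (const Entry& e : entries)
            if (!e.obsolete && is_valid(e.pkey, e.sqn))   // GC during the split
                group.push_back(e);
        size_t group_bytes = 8;                           // one shared Group Header
        for (const Entry& e : group) group_bytes += 8 + e.pkey.size();
        if (used + group_bytes > page_capacity_bytes) {   // new page on demand
            new_pages.emplace_back();
            used = 0;
        }
        new_pages.back()[skey] = std::move(group);
        used += group_bytes;
    }
    return new_pages;
}
```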
To support MVCC, PS-Tree retains obsolete entries whose sequence number is larger than the minimum snapshot number of concurrent readers. Obsolete entries may be retained for a long time if there exist long-running queries. Perseid can be enhanced with techniques similar to recent work [36,41] to mitigate this issue.
Crash Consistency. Perseid relies on the existing write-ahead log (WAL) of the LSM-based primary table to guarantee atomic durability between the primary table and secondary indexes. During recovery with the WAL, Perseid redoes uncompleted operations on the PS-Tree.
PS-Tree also handles its own crash consistency issues. Insert operations are committed only when the pointers in the SKey Layer are updated. If the system crashes before updating the pointer but after allocating a new PKey Page, the PKey Page is unreachable. After restart, a background thread scans the allocated pages and the PS-Tree to find and reclaim unreachable pages. Besides, PS-Tree allows concurrent insertion into one PKey Page. A thread obtains a piece of space to write new entries by compare-and-swap (CAS) on the tail pointer of the PKey Page. Thus, the space may leak if a thread obtains it but does not update the pointer in the SKey Layer before the system crashes. PS-Tree tolerates this situation and leaves these leaks to page splitting, which naturally reclaims the leaked space.
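A minimal sketch of that CAS-based space reservation, assuming a hypothetical in-DRAM/PM page layout with a 32-bit tail offset (the page size and field layout are our assumptions):

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

struct PKeyPage {
    static constexpr uint32_t kCapacity = 4096;   // hypothetical page size
    std::atomic<uint32_t> tail{0};                // append offset within the page
    char data[kCapacity];

    // Reserve `bytes` at the current tail with compare-and-swap. The caller
    // writes its entry into the reserved region and then publishes it by
    // updating the SKey Layer pointer; if it crashes in between, the reserved
    // space is simply leaked and later reclaimed by the page split.
    std::optional<uint32_t> reserve(uint32_t bytes) {
        uint32_t old_tail = tail.load(std::memory_order_relaxed);
        do {
            if (old_tail + bytes > kCapacity) return std::nullopt;  // page full
        } while (!tail.compare_exchange_weak(old_tail, old_tail + bytes,
                                             std::memory_order_acq_rel));
        return old_tail;  // offset of the reserved region
    }
};
```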

Hybrid PM-DRAM Validation
Perseid adopts the Validation strategy (see §2.2) for high write performance, which necessitates a lightweight validation approach. Since update-intensive workloads are quite common nowadays [13,16], if the validation approach is heavy, validating a large number of obsolete entries yields no results but generates huge overhead. Perseid adopts a hash table on PM to store version information for primary keys. The hash table is indexed by the primary key and stores its latest sequence number (§4.2.1). Nevertheless, even though point lookups on a PM-based hash table are much faster than on a tree, the validation time is comparable to the query time of PS-Tree. This is because one secondary key has multiple primary keys to validate, and PM has non-negligible random access latency. Simply placing the hash table on DRAM would occupy a large memory footprint. However, as validation only needs to check whether a version of a primary key is valid, not to obtain the specific latest version number, Perseid builds another volatile hash table on DRAM which only stores versions for primary keys that have been updated or deleted. In this way, Perseid only needs to query the small volatile hash table, and thus the validation overhead is further reduced.
4.3.1 Structure. Perseid introduces a lightweight validation approach based on the requirement of validation. Figure 5 illustrates the hybrid PM-DRAM validation approach. The values in the hash tables consist of the sequence number of the record (6 bytes) and a 2-byte counter. The counter is used to determine whether a primary key has obsolete versions. There is a slight difference between the counters of the two hash tables. In the volatile hash table, each counter indicates the number of logically existing entries related to a primary key in the secondary index. By contrast, each counter in the persistent hash table indicates the number of physically existing entries in the secondary index.
Upsert. The process of the upsert operation on the validation hash tables is shown in Algorithm 4. When a new record (including updates and deletes) is inserted into the primary table, the primary key is inserted or updated with its sequence number in the persistent hash table. If the persistent hash table did not contain this primary key before, its counter is set to one (Line 9), which means this primary key has only one version and no obsolete entries of this primary key exist in the secondary index. For example, in Figure 5, at t2, key c is inserted for the first time, and it is inserted into the persistent hash table. Otherwise, the primary key's counter in the persistent hash table is increased by one (Line 4); besides, the primary key is inserted or updated with its sequence number in the volatile hash table, and the counter in the volatile hash table is set to two if it is an insertion or increased by one if it is an update (Lines 5-7). For example, when key c is updated with a new version v2 at t3 in Figure 5, the entry in the persistent hash table is updated, and a new entry is inserted into the volatile hash table.
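The sketch below follows the upsert logic just described (Algorithm 4); std::unordered_map stands in for the real CCEH/CLHT hash tables, and all names are ours.

```cpp
#include <cstdint>
#include <unordered_map>

struct VersionInfo {
    uint64_t sqn;      // latest sequence number of this primary key
    uint16_t counter;  // PM-HT: physically existing entries; DRAM-HT: logically existing
};

// Stand-ins for the persistent (CCEH) and volatile (CLHT) hash tables.
using HashTable = std::unordered_map<uint64_t, VersionInfo>;

// Called on every write (insert, update, delete) to the primary table.
void validation_upsert(HashTable& pm_ht, HashTable& dram_ht,
                       uint64_t pkey, uint64_t new_sqn) {
    auto it = pm_ht.find(pkey);
    if (it == pm_ht.end()) {
        // First version of this key: no obsolete secondary entries can exist.
        pm_ht[pkey] = {new_sqn, 1};
        return;
    }
    it->second.sqn = new_sqn;
    it->second.counter += 1;            // one more physically existing entry
    auto vit = dram_ht.find(pkey);
    if (vit == dram_ht.end()) {
        dram_ht[pkey] = {new_sqn, 2};   // the old entry plus the new one
    } else {
        vit->second.sqn = new_sqn;
        vit->second.counter += 1;       // one more logically existing entry
    }
}
```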
Validate. The secondary index validates an entry by querying the volatile hash table, as shown in Algorithm 5. Specifically, the entry is valid if the sequence number of the entry matches the latest sequence number stored in the hash table, or if the hash table does not contain the primary key, which means there are no obsolete entries of this primary key (Line 2 in Algorithm 5). Otherwise, the entry may be obsolete. If the version of the hash table entry is smaller than the global minimum read snapshot number, which means all readers can see the newer version, Perseid further marks the entry as obsolete and decreases the counter of the entry in the volatile hash table by one (Lines 9-13). For example, when key a is checked with an obsolete version v1 at t2 in Figure 5, the result is false, and the counter is decreased from 3 to 2. If the counter is decreased to 1, which means all obsolete entries have been marked, the entry is removed from the volatile hash table to restrict the hash table size (Lines 14-16). For example, when key a is checked with an obsolete version v2 at t3 in Figure 5, the counter is decreased to one, the validation returns false, and the entry is removed. We describe other corner cases regarding snapshots in §4.3.3.
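The corresponding validation step (Algorithm 5, including the concurrent-writer fallback described in §4.3.3) can be sketched as follows; the return type and callback are our own framing.

```cpp
#include <cstdint>
#include <unordered_map>

struct VersionInfo { uint64_t sqn; uint16_t counter; };
using HashTable = std::unordered_map<uint64_t, VersionInfo>;

enum class Validity { kValid, kObsolete, kCheckPrimaryTable };

// Validate one PKey Entry against the volatile hash table. `entry_sqn` is the
// SQN stored in the PKey Entry, `snapshot` the reader's snapshot number,
// `min_snapshot` the global minimum snapshot of concurrent readers, and
// `mark_obsolete` sets the entry's obsolete flag in its PKey Page.
template <typename MarkFn>
Validity validate(HashTable& dram_ht, uint64_t pkey, uint64_t entry_sqn,
                  uint64_t snapshot, uint64_t min_snapshot, MarkFn mark_obsolete) {
    auto it = dram_ht.find(pkey);
    if (it == dram_ht.end() || it->second.sqn == entry_sqn)
        return Validity::kValid;               // no obsolete versions, or the latest one
    if (it->second.sqn > snapshot)
        return Validity::kCheckPrimaryTable;   // concurrent writer: fall back (§4.3.3)
    if (it->second.sqn < min_snapshot) {       // every reader sees the newer version
        mark_obsolete(pkey, entry_sqn);        // physically mark the PKey Entry
        if (--it->second.counter == 1)
            dram_ht.erase(it);                 // all obsolete entries have been marked
    }
    return Validity::kObsolete;
}
```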
During validation for secondary index queries, Perseid only operates on the volatile validation hash table. Thus, the validation overhead is quite small.
Garbage Collection. During a PKey Page split, entries that are not marked as obsolete are also validated to remove obsolete entries (§4.2.2). Since this step physically removes obsolete entries, Perseid decreases the corresponding counters in the persistent hash table. If a counter is decreased to one, Perseid removes the corresponding hash pair from the volatile hash table.
Recovery. When the system restarts from a crash or a normal shutdown, the volatile hash table needs to be recovered. Perseid iterates over the whole persistent hash table and inserts primary keys whose counter is greater than one into the volatile hash table. Now the counters in the volatile hash table are the numbers of physically existing entries, which may be larger than the actual numbers of logically existing entries. Therefore, some false-positive primary keys may exist in the volatile hash table. However, this does not affect validation accuracy, and these primary keys can be removed by garbage collection.

4.3.3 Together with PS-Tree. Each write operation in the LSM-based storage system starts by getting a monotonically increasing sequence number (SQN). After writing the write-ahead log (WAL) and inserting the new record into the MemTable of the LSM primary table, Perseid inserts the PKey Entry into the PS-Tree and inserts or updates the version of the primary key (i.e., the SQN) in the validation hash tables. Thus, the record in the LSM primary table, the PKey Entry in PS-Tree, and the version in the validation hash tables are all tagged with the same SQN. After that, the write operation is committed.
Each query operation first gets the latest committed snapshot number. Then it searches the PS-Tree with the secondary key and gets the corresponding PKey Entries visible in the current snapshot (i.e., whose SQN is not larger than the snapshot number). Perseid validates the candidate primary key entries via the volatile hash table and then returns the valid entries.
A rare scenario is that the volatile hash table reports a new sequence number larger than the current reader's snapshot number, which means a concurrent writer has updated this primary key. In this case, Perseid cannot directly confirm whether this entry is still valid in this snapshot, since there may exist a version that is newer than the entry and valid in the snapshot, so Perseid has to validate it against the primary table (Lines 5-7 in Algorithm 5). The other scenario, where the volatile hash table reports an older version than the requested PKey Entry, cannot happen. Perseid commits a write operation only after it has inserted the new secondary entry into PS-Tree and updated the validation hash tables, so readers only get consistent snapshots and ignore entries in PS-Tree whose version is larger than the reader's snapshot number.

Non-Index-Only Query Optimizations
Though Perseid significantly reduces the overhead of secondary indexing, the overhead of non-index-only queries (which require full records) is still dominated by the LSM-based primary table. Thus, Perseid further introduces two optimizations for non-index-only queries.

Locating Components with Sequence Number.
A secondary index query operation may need to search the primary LSM table multiple times for all its associated records. LSM-trees have mediocre read performance due to their multi-level structure. Besides device I/Os, even when data is cached in memory or fast storage devices are used, LSM-trees have non-negligible overheads for probing components (i.e., indexing and checking Bloom filters) [21,25,71]. Since LSM-based KV stores usually employ Bloom filters for each data block [28,30], the indexing overhead includes indexing not only SSTables but also data blocks. Moreover, read performance gets worse with the tiering compaction strategy, since more components (SSTables) need to be checked and read.
Nevertheless, we find that many components are unnecessary to probe in searches issued from the secondary index. Previous work uses zone maps, which store the minimum and maximum values of an attribute, to skip irrelevant data blocks or components during searching [11,12,54]. We find that this technique can also be used by secondary indexes when searching the primary table. Since we have already recorded the sequence numbers of primary keys in the secondary index, the sequence number can be used as an additional attribute to skip irrelevant components. Perseid builds a zone map that records a sequence number range (i.e., the minimum and maximum sequence numbers of records) for each component (including MemTables).
Moreover, as shown in Figure 6, since tiering compaction merges SSTables from a lower level (Level i) to generate new SSTables in the higher level (Level i+1) and does not rewrite other SSTables in the higher level (except for the last level), for a range partition, the sequence number ranges of different levels and even different sorted runs are strictly divided. For primary tables adopting the tiering strategy, with the primary key to search SSTables horizontally and the additional sequence number to search sorted runs vertically, Perseid can directly locate the exact component that contains the record. Besides, since Perseid has already validated the version, the record must exist in that component, so Perseid can further skip the Bloom filter check. Thus, the indexing overheads are greatly reduced and the overheads of checking Bloom filters are almost eliminated. This optimization fits the leveling strategy less effectively. The sequence number ranges in different levels may overlap because compaction rewrites SSTables in higher levels with blended sequence numbers from lower levels. However, since most LSM-based KV stores adopt the tiering strategy at least on Level 0 [28,30], this optimization is still effective to some extent.
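The pruning step can be sketched as follows; the per-component metadata struct is a hypothetical stand-in for the zone map described above.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Per-component zone map maintained for MemTables and SSTables.
struct Component {
    uint64_t min_sqn, max_sqn;     // sequence number range of its records
    std::string min_key, max_key;  // primary key range
};

// A record reached through the secondary index carries its exact SQN, so only
// components whose SQN range (and key range) covers it need to be probed.
// Under tiering, SQN ranges of different sorted runs are disjoint, so this
// usually pinpoints a single component, and the Bloom filter check can be
// skipped because the record is known to exist there.
std::vector<const Component*> candidates(const std::vector<Component>& components,
                                         const std::string& pkey, uint64_t sqn) {
    std::vector<const Component*> out;
    for (const Component& c : components)
        if (sqn >= c.min_sqn && sqn <= c.max_sqn &&
            pkey >= c.min_key && pkey <= c.max_key)
            out.push_back(&c);
    return out;
}
```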

Parallel Primary Table Searching. A single secondary key usually has multiple associated primary keys, and queries on these primary keys are independent. Therefore, using multiple threads to accelerate primary table searching is a natural optimization. One naive approach is to assign primary keys to threads equally (e.g., in a round-robin fashion, as shown in Figure 7(a)). However, point lookups on LSM-trees may have a large latency gap, since some KV pairs can be fetched from the MemTable or the block cache directly while others may reside at a relatively high level and need several disk I/Os due to Bloom filter false positives. It cannot be known in advance how much time each point lookup will take. Therefore, the naive approach may result in load imbalance among parallel threads, where some threads have finished their tasks and become idle while others are still stuck and some unfinished tasks remain.
To relieve this issue, we apply a worker-active fashion, as shown in Figure 7(b). Perseid publishes primary keys as tasks into a lock-free shared queue, and each parallel worker thread fetches one task from the queue. An element in the shared queue is a required primary key and the corresponding sequence number. When a worker thread finishes the current task, it tries to fetch another task from the queue. In this way, though each thread may perform a different number of tasks, parallel threads are utilized more adequately and the latencies of query requests are further reduced.
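A minimal worker-active sketch is shown below; for brevity it uses a mutex-protected queue instead of the lock-free queue described above, and the lookup callback is a placeholder for the actual primary table search.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Task { std::string pkey; uint64_t sqn; };

// Worker-active parallel primary-table search: tasks are published to a shared
// queue and each worker keeps fetching the next task as soon as it finishes,
// which balances load even when point-lookup latencies vary widely.
void parallel_lookup(std::vector<Task> tasks, int num_workers,
                     const std::function<void(const Task&)>& lookup_primary_table) {
    std::mutex mu;            // the paper uses a lock-free queue; a mutex keeps this short
    std::queue<Task> queue;
    for (auto& t : tasks) queue.push(std::move(t));

    auto worker = [&] {
        for (;;) {
            Task task;
            {
                std::lock_guard<std::mutex> lock(mu);
                if (queue.empty()) return;
                task = std::move(queue.front());
                queue.pop();
            }
            lookup_primary_table(task);  // fetch the record for this primary key
        }
    };
    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i) workers.emplace_back(worker);
    for (auto& w : workers) w.join();
}
```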

EVALUATION
In this section, we evaluate Perseid against existing PM-based indexes with the naive approaches and against state-of-the-art LSM-based secondary indexing techniques [47,54]. After describing the experimental setup (§5.1), we evaluate these secondary indexing mechanisms with microbenchmarks to show their performance on basic operations (§5.2). Then, we evaluate the systems' overall performance with mixed workloads (§5.3) and their recovery time (§5.4).

Experimental Setup
Platform. Our experiments are conducted on a server with an 18-core Intel Xeon Gold 5220 CPU, which runs Ubuntu 20.04 LTS with Linux 5.4. The system is equipped with 64 GB of DRAM, two 128 GB Intel Optane DC Persistent Memory modules in AppDirect mode, and a 480 GB Intel Optane 905P SSD.
Implementation. Perseid can leverage any existing state-of-the-art PM-based index as the SKey Layer of PS-Tree. In our implementation, we build PS-Tree based on two typical PM-based indexes, FAST&FAIR [31] and P-Masstree [39,50]. FAST&FAIR is a B+-Tree that leverages total store ordering (TSO) in the x86 architecture to tolerate transient inconsistency caused by incomplete write transactions, thus avoiding expensive copy-on-write or logging. P-Masstree is a version of Masstree [50] converted for PM [39], which is a trie-like concatenation of B+-Trees. Indexes use their original memory allocators, allocating space from memory-mapped files on PM.
For the hybrid PM-DRAM validation hash table, depending on the different usages of the two hash tables, we deploy CLHT [23] as the volatile hash table and CCEH [52] as the persistent hash table. CLHT is a cache-friendly hash table providing high search performance. CCEH is an extendible hash table optimized for PM that achieves high insert performance by mitigating rehashing overhead.
Compared Systems. We compare Perseid against the two original PM-based indexes (FAST&FAIR and P-Masstree) and the LSM-based secondary index with the validation strategy (denoted as LSMSI) from LevelDB++ [54]. The compared PM-based indexes are implemented as secondary indexes via the composite index approach and the log-structured approach (denoted as FAST&FAIR-composite, FAST&FAIR-log, P-Masstree-composite, and P-Masstree-log, respectively). For the log-structured approach, we simply provide enough space for allocation and disable garbage collection to avoid its influence and present its ideal performance [60]. We enhance the other PM-based indexes with Perseid's hybrid PM-DRAM validation approach (§4.3) and LSMSI with the primary key index [47] (§2.2) for validation. For a fair comparison, we also implement the LSM-based secondary indexing approaches on PM (LSMSI-PM). In addition, an LSM-based secondary index with the synchronous strategy (§2.2, denoted as LSMSI-PM-sync) is evaluated for comparison. We use PebblesDB [55], a state-of-the-art tiering-based KV store, as the primary table.
Workloads. Since common benchmarks for key-value stores such as YCSB [19] do not have operations on secondary indexes, as in previous work [42,47,54], we implemented a secondary index workload generator based on an open-source Twitter-like workload generator [3] for evaluation. With this generator, we generate several microbenchmark workloads and mixed workloads. The primary key (e.g., ID) and secondary key (e.g., UserID) are randomly generated 64-bit integers. The key spaces of primary keys and secondary keys are 100 million and 4 million, respectively. Thus, the average number of records per secondary key is about 25. The size of each record is 1 KB.
KV Store Configurations. For the primary table, according to the configuration tuning guide [29], the MemTable size is set to 64 MB and the Bloom filters are set to 10 bits per key. As our workloads generate a primary table larger than 100 GB, we set a 16 GB block cache for the primary table and a 1 GB block cache for the LSM-based secondary index. Compression is turned off to reduce other influencing factors.

Microbenchmarks
In this section, we evaluate the basic single-threaded performance and scalability of compared secondary indexing mechanisms.

Insert and Update.
The Insert workload (i.e., no updates) has 100 million unique records.Figure 8(a) shows the average latency of insert operations of each secondary index.
Perseid performs about 10-38% faster than the corresponding composite indexes, but 25% slower than the ideal log-structured approach without garbage collection, due to the page split overhead in PS-Tree. The composite index approach results in inferior performance, as we analyzed in §3. The other approaches have higher performance due to their sequential-write pattern.
The upsert workloads contain 100 million insert operations and 100 million update operations. Operations are shuffled to avoid all newer entries being valid in secondary indexes. In the Uniform workload (Figure 8(b)), both primary keys and secondary keys follow a uniform distribution. In the Skewed-Pri workload (Figure 8(c)), primary keys follow a Zipfian distribution with skewness parameter 0.99, and secondary keys are selected randomly. In the Skewed-Sec workload (Figure 8(d)), secondary keys follow a Zipfian distribution (parameter 0.99), and primary keys are uniform. Thus, hot secondary keys have many associated primary keys, which represents low-cardinality columns.
LSMSI-PM-sync has the largest upsert overhead due to its synchronous strategy, which needs to fetch old records from the LSM primary table and delete old secondary index entries (by inserting tombstones) synchronously. The skewness of primary keys has a large impact on the synchronous strategy. Hence, LSMSI-PM-sync has about 50% higher upsert latencies in the Uniform and Skewed-Sec workloads than in the other workloads.
Among the other validation-based secondary indexes, composite indexes perform even worse in upsert workloads than the other secondary indexes. This is because, with the additional upsert operations, composite indexes have more KV pairs and larger tree heights. By contrast, PS-Tree and the log-structured approach do not increase the number of KV pairs in the index part.
Figure 9 shows the normalized memory usage of the persistent hash table (PM-HT) and the volatile hash table (DRAM-HT) of Perseid after each upsert workload. For a fair comparison, we evaluate the memory usage of the PM-HT with the same hashing structure (CLHT) as the DRAM-HT. The PM-HT stores all 100 million primary keys with their latest sequence numbers, so it contains about 46 million hashing buckets (including linked collision buckets), which occupy about 2.7 GiB of memory. By contrast, since the DRAM-HT only stores versions for primary keys that have been updated (§4.3), it has a smaller memory footprint than the PM-HT. Specifically, the DRAM-HT is empty after the Insert workload because there are no updates. Besides, Perseid reduces the memory usage of the DRAM-HT to 37.8%, 10.4%, and 77.3% of the whole PM-HT in Uniform, Skewed-Pri, and Skewed-Sec, respectively. Though in the Uniform and Skewed-Sec workloads most primary keys have been updated, PS-Tree conducts garbage collection and validation during PKey Page splits, so primary keys whose obsolete versions have been cleaned up are removed from the DRAM-HT.

Query.
In this experiment, we evaluate the performance of index-only queries after loading the insert workload or the upsert workloads. An index-only query reflects the performance of a secondary index itself and is a common query technique (i.e., the covering index [6,8]) to avoid looking up the primary table. We show two different selectivities by specifying a limit N (10 and 200) on the returned results. The most recent and valid N entries are returned. For a limit of 200, the actual average number of returned entries per query is 25 and 142 for Skewed-Pri and Skewed-Sec, respectively. Figure 10 shows the results of index-only query performance. From the results, we have the following observations.
First, PM-based indexes have significantly lower latencies than LSM-based secondary indexes. Putting LSMSI on PM (LSMSI-PM) yields very limited improvement, because LSMSI already benefits from the block cache and the OS page cache. Even so, LSMSI is still inefficient due to the high overhead of indexing and Bloom filter checking. Besides, LSMSI has a high overhead for validating against the primary key index. LSMSI-PM-sync has much higher query performance than LSMSI-PM, as it does not require validation. However, this comes at the cost of poor write performance (§5.2.1). From the gap between LSMSI-PM and LSMSI-PM-sync, we can see that validating against the primary key index has a huge overhead. This is because validating each primary key requires one heavy point search on the LSM-tree. Despite being exempted from validation, LSMSI-PM-sync still has higher query latencies than the PM-based indexes that conduct validation with Perseid's approach.
Second, Perseid outperforms existing PM-based indexes with the composite index approach and the log-structured approach by up to 4.5× and 4.3×, respectively. The log-structured approach has poor locality, since relevant values are scattered across the whole log and require multiple random accesses to fetch them all. Composite indexes are inferior due to the larger number of KV pairs in the indexes and the range-scan operations, as we analyzed in §3. They are especially inefficient under the Skewed-Sec workload with a large limit (e.g., 200), where they fetch a large number of entries and fail to enjoy the cache effect. By contrast, the performance of Perseid is much more stable across different workloads, owing to the locality-aware design of PS-Tree. For a limit of 10, PM-based secondary indexes benefit from higher cache hit ratios under the Skewed-Sec workload, thus achieving better performance than under the other upsert workloads. Composite indexes also occupy about 4× more PM space than PS-Tree, because they repeatedly store secondary keys and have more index nodes. In addition, P-Masstree-composite has higher latencies than FAST&FAIR-composite, because trie-based indexes are less efficient than B+-Trees in range search, since their leaf nodes do not have sibling pointers to neighbor nodes.
Third, under upsert workloads, all systems need to validate more primary keys to exclude obsolete entries, which contributes to the higher overheads than under the insert workload. For LSMSI, since the primary key index needs multiple heavy point lookups on LSM-trees, validating against the primary key index accounts for the lion's share of the total cost of an index-only query. LSMSI has lower latencies under the Skewed-Pri workload than the other upsert workloads, since the primary key index benefits from the data locality of primary keys. By contrast, Perseid (and the other PM-based indexes) validates against a volatile hash table, which takes up less than half of the total cost. The overhead of Perseid increases little, owing to the locality-aware design of PS-Tree and the lightweight validation approach.
Figure 11 demonstrates the necessity and the benefit of the volatile hash table of Perseid. Directly validating multiple primary keys against the persistent hash table (PM-HT) has a large overhead, since it requires multiple random accesses on PM. This prominent overhead can overshadow the advantage of PS-Tree. Thus, Perseid validates against the volatile hash table (DRAM-HT), which is 2.7-6.6× faster than validating against the PM-HT. As shown in Figure 9, the DRAM-HT is much smaller under the other workloads than under the Skewed-Sec workload, and the improvement brought by the DRAM-HT is more evident under those workloads. This result shows that the benefit of the DRAM-HT comes not only from the lower access latency of DRAM, but also from the smaller size of the DRAM-HT, which is more cache-friendly and incurs fewer hash collisions.
Range query results are shown in Figure 12. Range queries need to search more KV pairs from ten different secondary keys, showing a more pronounced difference between these secondary indexes than low-limit query operations. Perseid outperforms LSMSI-PM, the composite P-Masstree, and the log-structured approach by up to 92×, 5.2×, and 1.6×, respectively, under the Skewed-Sec workload. Though LSMSI-PM benefits from PM access latency and DRAM caching, it still has a fairly high latency. This is because the range operation in LSM-trees needs to merge-sort multiple iterators of components. The composite index needs to perform more range searching than Perseid in the index, since Perseid groups primary keys outside of the index.

5.2.4 Multi-threaded Performance. Figure 13 shows the multi-threaded performance of the compared secondary indexes. We take the results of the Skewed-Pri and Skewed-Sec workloads as representatives. For Skewed-Sec, we show the result with a limit of 200; the result with a limit of 10 is similar to that of Skewed-Pri. For upsert operations, Perseid scales up to 24 threads, achieving 2.8× and 16× the upsert throughput of the composite P-Masstree and LSMSI-PM, respectively, and is slightly slower than the ideal log-structured approach. For query operations, Perseid scales well and achieves 7× and 3× the query throughput of P-Masstree-composite and P-Masstree-log under the Skewed-Sec workload, owing to the locality-aware design of PS-Tree. LSMSI has poor scalability due to its coarse-granularity lock and non-concurrent logging mechanism. Though using the same index structure (P-Masstree), the composite index turns update operations into insert operations, so the index operations limit its write scalability; and because it expands the number of KV pairs and thus has a larger tree height, the increased indexing overhead limits its query scalability. As for the log-structured approach, its poor data locality restricts its query performance, especially for large-range queries.
Fig. 13. Multi-threaded performance.
5.2.5 Non-Index-Only Query. We next evaluate non-index-only query operations. Besides the basic compared secondary indexes, we also enhance them by applying the two optimizations (§4.4) sequentially: the sequence number zone map on the primary table (+SEQ), naive parallel primary table searching (+PAR), and worker-active parallel primary table searching (+PAR-WA). In this experiment, we use 4 threads for parallel primary table searching. Figure 14 shows the performance and time breakdown of non-index-only query operations. Note that the breakdown of primary table time for +PAR only shows the time not covered by the secondary index and validation. Perseid brings considerable improvements over LSMSI-PM, even when LSMSI-PM has the two optimizations applied. Perseid outperforms LSMSI-PM by up to 62% and 2.3×, without and with the optimizations on primary table lookups (the sequence number zone map and parallel primary table searching), respectively. Though the primary key index indeed reduces unnecessary point lookup operations on the primary table for LSMSI-PM, with advanced low-latency storage devices and sufficient DRAM caching, it also has significant overhead. In contrast, the hybrid PM-DRAM validation of Perseid reduces primary table lookups with negligible extra overhead. Perseid's optimizations on primary table searching can also boost the other compared secondary indexes. The zone map improves the overall query performance of the KV store with Perseid by about 50%, and the worker-active parallel primary table searching further improves it by up to 3.1×. The worker-active parallel searching exceeds the naive parallel searching by up to 30% for Perseid. This effect is more evident when the limit on returned results is small, as the load imbalance among parallel worker threads is more prominent. However, the numbers are only 20-36% and up to 2.4× for LSMSI-PM, respectively. This is because these optimizations only accelerate the primary table lookups, and LSMSI-PM still has huge overheads elsewhere. In addition, LSMSI-PM has to conduct the heavy validation first before it can pass the lookup tasks to parallel worker threads. Therefore, parallel threads cannot work adequately for LSMSI-PM. For the same reason, the worker-active parallel searching helps little over the naive parallel searching.
We also implement the secondary indexes and conduct the experiments on a leveling-based LSM primary table (LevelDB [30]). Figure 15 shows the results of Skewed-Sec as an example. The main difference is that the sequence number zone map is less effective on leveling-based LSM primary tables. However, the zone map is still effective when the limit is small, since the latest few records stay in MemTables or in SSTables at lower levels such as Level 0, and these components can be filtered out by sequence number with high probability.

Mixed Workloads
In this section, we evaluate Perseid, the composite P-Masstree, and LSMSI-PM under mixed workloads. The mixed workloads consist of interleaved operations of various types, which are more representative of real-world workloads. Each workload has 40 million operations, containing both Skewed-Pri and Skewed-Sec operations. Table 1 describes these workloads' traits. Each system is prefilled with 80 million records before performing the workloads. We also enable Perseid's optimizations on primary table searching (i.e., the sequence number zone map and worker-active parallel primary table searching) for all systems.
Figure 16 reports the average operation latencies every million operations. At the beginning of the Write-Heavy workload and the Balanced workload, the PM-based secondary indexes have a spike in latency, which is mainly caused by seek-driven compaction in the LSM primary table. Perseid outperforms LSMSI-PM significantly under the different mixed workloads. Even though the overhead of the primary table dominates the whole operation, Perseid still has visible advantages over the other PM-based indexes. Note that PS-Tree has much lower capacity overhead than the composite index. As we set the limit on returned results to 10 for query operations, the log-structured approach is not affected too much by its poor data locality.

Table 1. Operation ratios (upsert, get, index-only query, and non-index-only query) of the mixed workloads.

Recovery Time
We evaluate the recovery time of Perseid and LSMSI-PM after a Zipfian upsert workload that contains 200 million upsert operations with a single thread. Since we only need to recover the volatile validation hash table in Perseid, it takes 2.7 seconds to scan the persistent hash table and rebuild the volatile hash table. Note that the recovery of the validation hash table can be performed in the background, and validation can be served from the persistent hash table until the volatile hash table is restored. By contrast, it takes 2.3 seconds and 1.4 seconds to recover the LSM-based secondary index and the primary key index, respectively. Their recovery time is mainly spent on rebuilding MemTables from logs and varies with the size of the MemTables.

RELATED WORK
Secondary Indexing in LSM-based KV stores. Qader et al. [54] conduct a comparative study of secondary indexing techniques in LSM-based systems. They describe and evaluate several common secondary indexing techniques, including the filter-based embedded index, the composite index, and the posting list. DELI [59] proposes an index maintenance approach that defers expensive index repair to compaction of the primary table. Luo et al. [47] propose several techniques for LSM-based secondary indexes, improving data ingestion and query performance. However, their techniques mainly reduce random device I/Os for traditional disk devices, at the cost of more sequential reads. Based on key-value separation [45], SineKV [42] keeps both the primary index and the secondary indexes pointing to the record values. Thus, secondary index queries can get records directly without searching the primary index. However, SineKV has to discard the blind-write attribute and maintain index consistency synchronously. Cuckoo Index [38] enhances filter-based indexing with a cuckoo filter. However, as a filter-based index, Cuckoo Index does not support range queries. Though many optimizations have been proposed, LSM-based secondary indexing is not efficient enough due to the nature of LSM-trees. In this work, we revisit the design of the secondary index with persistent memory.
Improving LSM-based KV stores with PM. There is a lot of work on optimizing LSM-based KV stores with PM. NoveLSM [33] introduces a large mutable MemTable on PM to lower compaction frequency and avoid logging. SLM-DB [32] utilizes a B+-Tree on PM to index KV pairs on disks; SSTables on disks are organized in a single level, which reduces compaction requirements. MatrixKV [69] places Level 0 on PM and adopts fine-granularity, parallel column compaction to reduce write stalls in LSM-trees. Facebook redesigns the block cache on PM to reduce DRAM usage and thus the total cost of ownership (TCO) [27,34]. Different from these efforts, this work revisits secondary indexing for LSM-based KV stores with PM.

Fig. 1. Stand-alone secondary indexing in LSM-based systems with the Synchronous strategy and the Validation strategy [54]. The shaded entries indicate that they are invisible in the index.

Fig. 4. An example of a PKey Page split. PEs (PKey Entries) with the same color belong to the same secondary key; PEs in gray are obsolete.


Fig. 6. Sequence number ranges of components after tiering compaction. The black text in components indicates the key range. The blue text below each component indicates the sequence number range.