Abstract
Key-value (KV) stores support many crucial applications and services. They perform fast in-memory processing but are still often limited by I/O performance. The recent emergence of high-speed commodity non-volatile memory express solid-state drives (NVMe SSDs) has propelled new KV system designs that take advantage of their ultra-low latency and high bandwidth. Meanwhile, switching to entirely new data layouts and scaling entire databases up to high-end SSDs requires considerable investment. As a compromise, we propose SpanDB, an LSM-tree-based KV store that adapts the popular RocksDB system to utilize selective deployment of high-speed SSDs. SpanDB allows users to host the bulk of their data on cheaper and larger SSDs (and even hard disk drives with certain workloads), while relocating write-ahead logs (WAL) and the top levels of the LSM-tree to a much smaller and faster NVMe SSD. To better utilize this fast disk, SpanDB provides high-speed, parallel WAL writes via SPDK, and enables asynchronous request processing to mitigate inter-thread synchronization overhead and work efficiently with polling-based I/O. To ease the live data migration between fast and slow disks, we introduce TopFS, a stripped-down file system providing familiar file interface wrappers on top of SPDK I/O. Our evaluation shows that SpanDB simultaneously improves RocksDB's throughput by up to 8.8× and reduces its latency by 9.5–58.3%. Compared with KVell, a system designed for high-end SSDs, SpanDB achieves 96–140% of its throughput, with 2.3–21.6× lower latency, on a cheaper storage configuration.
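The selective-deployment idea above can be sketched as a simple placement policy: latency-critical WAL files and the top (hottest) LSM-tree levels go to the small, fast NVMe SSD, while the bulk of the SSTables stay on the larger, cheaper device. This is a hedged illustration only — the function name `place`, the `top_levels` parameter, and the device labels are invented for this sketch and are not SpanDB's actual interface.

```python
from typing import Optional

FAST_DISK = "nvme"   # small, fast NVMe SSD (SPDK-managed in SpanDB)
SLOW_DISK = "sata"   # large, cheap SSD (or HDD for some workloads)

def place(kind: str, level: Optional[int] = None, top_levels: int = 2) -> str:
    """Decide which device should host a file (illustrative policy only).

    kind: "wal" for write-ahead logs, "sst" for LSM-tree table files.
    level: LSM-tree level of an SSTable (0 = topmost, most write-hot).
    top_levels: how many top levels to pin on the fast disk.
    """
    if kind == "wal":
        return FAST_DISK                 # WAL writes are latency-critical
    if kind == "sst" and level is not None and level < top_levels:
        return FAST_DISK                 # hot, frequently rewritten levels
    return SLOW_DISK                     # bulk of the data: colder levels

# Example: a level-0 flush lands on the fast disk, a level-5 SSTable does not.
print(place("wal"), place("sst", level=0), place("sst", level=5))
```

Because only the WAL and the top levels occupy the fast device, its capacity can stay far below the total database size — the cost advantage the abstract claims.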
Index Terms
Leveraging NVMe SSDs for Building a Fast, Cost-effective, LSM-tree-based KV Store