Abstract
We propose Vigil-KV, a hardware and software co-designed framework that nearly eliminates long-tail latency by introducing strong latency determinism. To make Get latency deterministic, Vigil-KV first enables a predictable latency mode (PLM) interface on a real datacenter-scale NVMe SSD, drawing on knowledge of the underlying flash technologies. At the system level, Vigil-KV then hides the non-deterministic time window (associated with the SSD’s internal tasks and/or write services) by internally scheduling the different PLM device states across multiple physical functions. Vigil-KV further schedules compaction/flush operations and client requests with awareness of PLM’s restrictions, thereby integrating strong latency determinism into LSM KVs. We implement Vigil-KV on a 1.92TB NVMe SSD prototype and Linux 4.19.91, but other LSM KVs can adopt its concept. We evaluate diverse Facebook and Yahoo scenarios with Vigil-KV, and the results show that Vigil-KV reduces the tail latency of a baseline KV system by 3.19× while reducing the average latency by 34%, on average.
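The idea of hiding the non-deterministic window by scheduling PLM states across physical functions can be illustrated with a small sketch. The abstract does not specify the actual scheduling policy, so the code below assumes a simple staggered rotation in which exactly one physical function at a time is in its non-deterministic window, and it assumes each key is readable from at least two physical functions. The class `PlmSchedule`, its parameters `num_pfs` and `ndwin_ms`, and the method names are hypothetical and are not taken from Vigil-KV or from any NVMe driver API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlmSchedule:
    """Hypothetical staggered PLM window schedule (illustrative assumption only)."""

    num_pfs: int   # number of physical functions sharing the schedule
    ndwin_ms: int  # length of each non-deterministic window in milliseconds

    @property
    def period_ms(self) -> int:
        # One full rotation gives every physical function one non-deterministic slot.
        return self.num_pfs * self.ndwin_ms

    def ndwin_owner(self, now_ms: int) -> int:
        # Index of the single physical function that is non-deterministic at now_ms.
        return (now_ms % self.period_ms) // self.ndwin_ms

    def is_deterministic(self, pf: int, now_ms: int) -> bool:
        return pf != self.ndwin_owner(now_ms)

    def route_get(self, replicas: list[int], now_ms: int) -> int:
        # Serve a Get from the first replica that is in a deterministic window.
        for pf in replicas:
            if self.is_deterministic(pf, now_ms):
                return pf
        raise RuntimeError("no replica is in a deterministic window")


sched = PlmSchedule(num_pfs=4, ndwin_ms=100)
chosen = sched.route_get(replicas=[1, 2], now_ms=150)  # PF 1 is busy, so PF 2 is chosen
```

Under these assumptions a Get never waits on a physical function that is performing internal tasks or servicing writes, as long as its data is reachable through at least one other physical function.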
Realizing Strong Determinism Contract on Log-Structured Merge Key-Value Stores