Abstract
As the cost per bit of solid-state disks falls quickly, SSDs are supplanting HDDs in many settings, including as the primary storage of key-value stores. However, simply deploying LSM-tree-based key-value stores on commercial SSDs is inefficient: it induces heavy write amplification and severe garbage collection overhead under write-intensive conditions. The main cause of these issues is the triple-redundant management functionality in the LSM-tree, the file system, and the flash translation layer, which prevents key-value stores and flash devices from being aware of each other. Furthermore, we observe that merely eliminating these redundant layers improves the performance of LSM-tree-based key-value stores only slightly, because the I/O stack, including the cache and scheduler, is not optimized for the LSM-tree’s unique I/O patterns.
To address these issues, we propose FlashKV, an LSM-tree-based key-value store running on open-channel SSDs. FlashKV eliminates the redundant management and semantic isolation by managing the raw flash devices directly in the application layer. Using domain knowledge of the LSM-tree together with the open-channel information, FlashKV employs a parallel data layout to exploit the internal parallelism of the flash device, and it optimizes the compaction, caching, and I/O scheduling mechanisms accordingly. Evaluations show that, compared to LevelDB, FlashKV improves system performance by 1.5× to 4.5× and reduces write traffic by up to 50% under heavy write conditions.
FlashKV: Accelerating KV Performance with Open-Channel SSDs