Abstract
Remote access to NVMe Flash enables flexible scaling and high utilization of Flash capacity and IOPS within a datacenter. However, existing systems for remote Flash access either introduce significant performance overheads or fail to isolate the multiple remote clients sharing each Flash device. We present ReFlex, a software-based system for remote Flash access that provides nearly identical performance to accessing local Flash. ReFlex uses a dataplane kernel to closely integrate networking and storage processing, achieving low latency and high throughput with low resource requirements. Specifically, ReFlex can serve up to 850K IOPS per core over TCP/IP networking while adding 21 µs of latency over direct access to local Flash. ReFlex uses a QoS scheduler that can enforce tail latency and throughput service-level objectives (SLOs) for thousands of remote clients. We show that ReFlex allows applications to use remote Flash while maintaining their original performance with local Flash.
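To make the two headline figures concrete, the following is a minimal back-of-the-envelope sketch, not part of ReFlex itself. It uses only the 850K IOPS per core and 21 µs numbers quoted in the abstract; the function names and the example device (1M IOPS, 80 µs local read latency) are hypothetical values chosen for illustration. Because 850K IOPS is a stated peak, the core count is best read as a lower bound.

```python
import math

# Figures quoted in the abstract.
IOPS_PER_CORE = 850_000   # peak IOPS served per core over TCP/IP ("up to")
ADDED_LATENCY_US = 21     # latency added over direct access to local Flash

def cores_needed(target_iops: int) -> int:
    """Lower bound on server cores required to serve target_iops."""
    return math.ceil(target_iops / IOPS_PER_CORE)

def remote_latency_us(local_latency_us: float) -> float:
    """Estimated remote access latency given a local Flash latency."""
    return local_latency_us + ADDED_LATENCY_US

if __name__ == "__main__":
    # Hypothetical device: 1M IOPS and 80 us local latency (assumed values,
    # not taken from the source).
    print(cores_needed(1_000_000))
    print(remote_latency_us(80.0))
```

Under these assumed device numbers, the estimate is two cores and roughly 101 µs per remote access; real deployments would depend on request sizes, read/write mix, and load.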
ReFlex: Remote Flash ≈ Local Flash
ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems