Abstract
In a data center, an IO from an application to distributed storage traverses not only the network but also several software stages with diverse functionality. This set of ordered stages is known as the storage or IO stack. Stages include caches, hypervisors, IO schedulers, file systems, and device drivers. Indeed, in a typical data center, the number of these stages is often larger than the number of network hops to the destination. Yet, while packet routing is fundamental to networks, no notion of IO routing exists on the storage stack. The path of an IO to an endpoint is predetermined and hard coded. This forces IO with different needs (e.g., requiring different caching or replica selection) to flow through a one-size-fits-all IO stack structure, resulting in an ossified IO stack.
This article proposes sRoute, an architecture that provides a routing abstraction for the storage stack. sRoute comprises a centralized control plane and “sSwitches” on the data plane. The control plane sets the forwarding rules in each sSwitch to route IO requests at runtime based on application-specific policies. A key strength of our architecture is that it works with unmodified applications and Virtual Machines (VMs). This article shows significant benefits of customized IO routing to data center tenants: for example, a factor of 10 for tail IO latency, more than 60% better throughput for a customized replication protocol, a factor of 2 in throughput for customized caching, and enabling live performance debugging in a running system.
- Michael Abd-El-Malek, William V. Courtright, II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, and Jay J. Wylie. 2005. Ursa minor: Versatile cluster-based storage. In Proceedings of the 4th Conference on USENIX Conference on File and Storage Technologies, Volume 4 (FAST’05). USENIX Association, Berkeley, CA, 5--5. Google Scholar
Digital Library
- Sebastian Angel, Hitesh Ballani, Thomas Karagiannis, Greg O’Shea, and Eno Thereska. 2014. End-to-end performance isolation through virtual datacenters. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Berkeley, CA, 233--248. Google Scholar
Digital Library
- Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau. 2001. Information and control in gray-box systems. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP’01). ACM, New York, NY, 43--56. Google Scholar
Digital Library
- Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Lakshmi N. Bairavasundaram, Timothy E. Denehy, Florentina I. Popovici, Vijayan Prabhakaran, and Muthian Sivathanu. 2006. Semantically-smart disk systems: Past, present, and future. SIGMETRICS Perform. Eval. Rev. 33, 4 (Mar. 2006), 29--35. Google Scholar
Digital Library
- Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Nathan C. Burnett, Timothy E. Denehy, Thomas J. Engle, Haryadi S. Gunawi, James A. Nugent, and Florentina I. Popovici. 2003. Transforming policies into mechanisms with infokernel. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP’03). ACM, New York, NY, 90--105. Google Scholar
Digital Library
- Jason Baker, Chris Bond, James C. Corbett, J. J. Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, and Vadim Yushprakh. 2011. Megastore: Providing scalable, highly available storage for interactive services. In Proceedings of CIDR. Retrieved from http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf.Google Scholar
- Hitesh Ballani, Paolo Costa, Christos Gkantsidis, Matthew P. Grosvenor, Thomas Karagiannis, Lazaros Koromilas, and Greg O’Shea. 2015. Enabling end-host network functions. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM’15). ACM, New York, NY, 493--507. Google Scholar
Digital Library
- Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. 2004. Using magpie for request extraction and workload modelling. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design 8 Implementation, Volume 6 (OSDI’04). USENIX Association, Berkeley, CA, 18--18. Google Scholar
Digital Library
- B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. E. Fiuczynski, D. Becker, C. Chambers, and S. Eggers. 1995. Extensibility safety and performance in the SPIN operating system. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP’95). ACM, New York, NY, 267--283. Google Scholar
Digital Library
- Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. 2011. Windows azure storage: A highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP’11). ACM, New York, NY, 143--157. Google Scholar
Digital Library
- Pei Cao, Edward W. Felten, Anna R. Karlin, and Kai Li. 1996. Implementation and performance of integrated application-controlled file caching, prefetching, and disk scheduling. ACM Trans. Comput. Syst. 14, 4 (Nov. 1996), 311--343. Google Scholar
Digital Library
- Martin Casado, Michael J. Freedman, Justin Pettit, Jianying Luo, Nick McKeown, and Scott Shenker. 2007. Ethane: Taking control of the enterprise. In Proceedings of the 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM’07). ACM, New York, NY, 1--12. Google Scholar
Digital Library
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2006. Bigtable: A distributed storage system for structured data. In Proceedings of USENIX OSDI. Retrieved from http://dl.acm.org/citation.cfm?id=1267308.1267323 Google Scholar
Digital Library
- Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni. 2008. PNUTS: Yahoo!’s hosted data serving platform. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1277--1288. Google Scholar
Digital Library
- James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s globally-distributed database. In Proceedings USENIX OSDI. Google Scholar
Digital Library
- Brendan Cully, Jake Wires, Dutch Meyer, Kevin Jamieson, Keir Fraser, Tim Deegan, Daniel Stodden, Geoffrey Lefebvre, Daniel Ferstay, and Andrew Warfield. 2014. Strata: Scalable high-performance storage on virtualized non-volatile memory. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST’14). USENIX Association, Berkeley, CA, 17--31. Google Scholar
Digital Library
- Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale. Commun. ACM 56, 2 (Feb. 2013), 74--80. Google Scholar
Digital Library
- Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon’s highly available key-value store. In Proceedings of 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP’07). ACM, New York, NY, 205--220. Google Scholar
Digital Library
- D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. 1995. Exokernel: An operating system architecture for application-level resource management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP’95). ACM, New York, NY, 251--266. Google Scholar
Digital Library
- Andrew D. Ferguson, Arjun Guha, Chen Liang, Rodrigo Fonseca, and Shriram Krishnamurthi. 2013. Participatory networking: An API for application control of SDNs. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM (SIGCOMM’13). ACM, New York, NY, 327--338. Google Scholar
Digital Library
- Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. 2007. X-trace: A pervasive network tracing framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design 8 Implementation (NSDI’07). USENIX Association, Berkeley, CA, 20. Google Scholar
Digital Library
- FreeBSD. 2014. FreeBSD GEOM storage framework. Retrieved from http://www.freebsd.org/doc/handbook/.Google Scholar
- Albert Greenberg. 2015. SDN for the Cloud (Sigcomm 2015 Keynote). Retrieved from http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/keynote.pdf.Google Scholar
- Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta. 2009. VL2: A scalable and flexible data center network. In Proceedings of ACM SIGCOMM. Retrieved from Google Scholar
Digital Library
- Kieran Harty and David R. Cheriton. 1992. Application-controlled physical memory using external page-cache management. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V). ACM, New York, NY, 187--197. Google Scholar
Digital Library
- Dave Hitz, James Lau, and Michael Malcolm. 1994. File system design for an NFS file server appliance. In Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference (WTEC’94). USENIX Association, Berkeley, CA, 19--19. Google Scholar
Digital Library
- Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. 2013. An analysis of facebook photo caching. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP’13). ACM, New York, NY, 167--181. Google Scholar
Digital Library
- Intel Corporation. 2014. IoMeter Benchmark. Retrieved from http://www.iometer.org/. (2014).Google Scholar
- Michael Isard. 2007. Autopilot: Automatic data center management. SIGOPS Oper. Syst. Rev. 41, 2 (April 2007), 60--67. Google Scholar
Digital Library
- Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, Jon Zolla, Urs Hölzle, Stephen Stuart, and Amin Vahdat. 2013. B4: Experience with a globally-deployed software defined wan. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM (SIGCOMM’13). ACM, New York, NY, 3--14. Google Scholar
Digital Library
- M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Hector M. Briceño, Russell Hunt, David Mazières, Thomas Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. 1997. Application performance and flexibility on exokernel systems. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP’97). ACM, New York, NY, 52--65. Google Scholar
Digital Library
- Peyman Kazemian, Michael Chang, Hongyi Zeng, George Varghese, Nick McKeown, and Scott Whyte. 2013. Real time network policy checking using header space analysis. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (nsdi’13). USENIX Association, Berkeley, CA, 99--112. Google Scholar
Digital Library
- Teemu Koponen, Martin Casado, Natasha Gude, Jeremy Stribling, Leon Poutievski, Min Zhu, Rajiv Ramanathan, Yuichiro Iwata, Hiroaki Inoue, Takayuki Hama, and Scott Shenker. 2010. Onix: A distributed control platform for large-scale production networks. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI’10). USENIX Association, Berkeley, CA, 351--364. Google Scholar
Digital Library
- Keith Krueger, David Loftesness, Amin Vahdat, and Thomas Anderson. 1993. Tools for the development of application-specific virtual memory management. In Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications (OOPSLA’93). ACM, New York, NY, 48--64. Google Scholar
Digital Library
- Avinash Lakshman and Prashant Malik. 2010. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev. 44, 2 (Apr. 2010), 35--40. Google Scholar
Digital Library
- Leslie Lamport. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133--169. Google Scholar
Digital Library
- Cheng Li, Daniel Porto, Allen Clement, Johannes Gehrke, Nuno Preguiça, and Rodrigo Rodrigues. 2012. Making geo-replicated systems fast as possible, consistent when necessary. In Proceedings of USENIX OSDI (OSDI’12). Google Scholar
Digital Library
- Robert Love. 2010. Linux Kernel Development (3rd ed.). Addison-Wesley Professional. Google Scholar
Digital Library
- Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. 2008. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Comput. Commun. Rev. 38, 2 (Mar. 2008), 69--74. Google Scholar
Digital Library
- Michael Mesnier, Feng Chen, Tian Luo, and Jason B. Akers. 2011. Differentiated storage services. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP’11). ACM, New York, NY, 57--70. Google Scholar
Digital Library
- Microsoft. 2010. Virtual Hard Disk Performance. Retrieved from http://download.microsoft.com/download/0/7/7/0778C0BB-5281-4390-92CD-EC138A18F2F9/WS08_R2_VHD_Performance_WhitePaper.docx.Google Scholar
- Microsoft Corporation. 2014a. Minidrivers, Miniport drivers, and driver pairs. Retrieved from https://msdn.microsoft.com/en-us/library/windows/hardware/hh439643.aspx.Google Scholar
- Microsoft Corporation. 2014b. File System Minifilter Drivers (MSDN). Retrieved from https://msdn.microsoft.com/en-us/windows/hardware/drivers/ifs/file-system-minifilter-drivers.Google Scholar
- Dushyanth Narayanan, Austin Donnelly, and Antony Rowstron. 2008a. Write Off-loading: Practical power management for enterprise storage. Trans. Storage 4, 3, Article 10 (Nov. 2008), 23 pages. Google Scholar
Digital Library
- Dushyanth Narayanan, Austin Donnelly, Eno Thereska, Sameh Elnikety, and Antony Rowstron. 2008b. Everest: Scaling down peak loads through I/O Off-loading. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI’08). USENIX Association, Berkeley, CA, USA, 15--28. Google Scholar
Digital Library
- Netronome. 2016. Agilio Server Networking Platform. Retrieved from https://www.netronome.com/products/overview/.Google Scholar
- Oracle. 2010. Oracle Solaris ZFS Administration Guide. Retrieved from http://docs.oracle.com/cd/E19253-01/819-5461/index.html.Google Scholar
- Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2014. Arrakis: The operating system is the control plane. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, 1--16. Google Scholar
Digital Library
- Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A reconfigurable fabric for accelerating large-scale datacenter services. In Proceeding of the 41st Annual International Symposium on Computer Architecuture (ISCA’14). IEEE Press, Piscataway, NJ, 13--24. Google Scholar
Digital Library
- Zafar Ayyub Qazi, Cheng-Chun Tu, Luis Chiang, Rui Miao, Vyas Sekar, and Minlan Yu. 2013. SIMPLE-fying middlebox policy enforcement using SDN. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM (SIGCOMM’13). ACM, New York, NY, 27--38. Google Scholar
Digital Library
- Mark Reitblatt, Nate Foster, Jennifer Rexford, Cole Schlesinger, and David Walker. 2012. Abstractions for network update. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM’12). ACM, New York, NY, 323--334. Google Scholar
Digital Library
- Microsoft Research. 2014. Microsoft Research Storage Toolkit. Retrieved from https://www.microsoft.com/en-us/research/project/software-defined-stora ge-architectures/.Google Scholar
- Ohad Rodeh, Josef Bacik, and Chris Mason. 2013. BTRFS: The linux B-tree filesystem. Trans. Stor. 9, 3, Article 9 (Aug. 2013). Google Scholar
Digital Library
- Margo I. Seltzer, Yasuhiro Endo, Christopher Small, and Keith A. Smith. 1996. Dealing with disaster: Surviving misbehaved kernel extensions. In Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation (OSDI’96). ACM, New York, NY, 213--227. Google Scholar
Digital Library
- Justine Sherry, Shaddi Hasan, Colin Scott, Arvind Krishnamurthy, Sylvia Ratnasamy, and Vyas Sekar. 2012. Making middleboxes someone else’s problem: Network processing as a cloud service. In Proceedings of the ACM SIGCOMM. Helsinki, Finland, 12. Google Scholar
Digital Library
- Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report. Google, Inc.Google Scholar
- SNIA. 2007. Exchange server traces. Retrieved from http://iotta.snia.org/traces/130. (2007).Google Scholar
- Ioan Stefanovici, Eno Thereska, Greg O’Shea, Bianca Schroeder, Hitesh Ballani, Thomas Karagiannis, Antony Rowstron, and Tom Talpey. 2015. Software-defined caching: Managing caches in multi-tenant data centers. In Proceedings of the 6th ACM Symposium on Cloud Computing (SoCC’15). ACM, New York, NY, 174--181. Google Scholar
Digital Library
- Douglas B. Terry, Vijayan Prabhakaran, Ramakrishna Kotla, Mahesh Balakrishnan, Marcos K. Aguilera, and Hussam Abu-Libdeh. 2013. Consistency-based service level agreements for cloud storage. In Proceedings of ACM SOSP. Google Scholar
Digital Library
- D. B. Terry, M. M. Theimer, Karin Petersen, A. J. Demers, M. J. Spreitzer, and C. H. Hauser. 1995. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proceedings of ACM SOSP. Google Scholar
Digital Library
- Eno Thereska, Hitesh Ballani, Greg O’Shea, Thomas Karagiannis, Antony Rowstron, Tom Talpey, Richard Black, and Timothy Zhu. 2013. IOFlow: A software-defined storage architecture. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP’13). ACM, New York, NY, 182--196. Google Scholar
Digital Library
- Niraj Tolia, Michael Kaminsky, David G. Andersen, and Swapnil Patil. 2006. An architecture for internet data transfer. In Proceedings of the 3rd Conference on Networked Systems Design 8 Implementation, Volume 3 (NSDI’06). USENIX Association, Berkeley, CA, 19--19. Google Scholar
Digital Library
- Transaction Processing Performance Council. 2014. TPC Benchmark E - Rev. 1.14.0. Standard.Google Scholar
- Carl A. Waldspurger, Nohhyun Park, Alexander Garthwaite, and Irfan Ahmad. 2015. Efficient MRC construction with SHARDS. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST’15). USENIX Association, Santa Clara, CA. Google Scholar
Digital Library
- Theodore M. Wong and John Wilkes. 2002. My cache or yours? Making storage more exclusive. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (ATEC’02). USENIX Association, Berkeley, CA, 161--175. Google Scholar
Digital Library
- Hong Yan, David A. Maltz, T. S. Eugene Ng, Hemant Gogineni, Hui Zhang, and Zheng Cai. 2007. Tesseract: A 4D network control plane. In Proceedings of the 4th USENIX Conference on Networked Systems Design 8 Implementation (NSDI’07). USENIX Association, Berkeley, CA, 27--27. Google Scholar
Digital Library
Index Terms
Treating the Storage Stack Like a Network
Recommendations
Deconstructing Network Attached Storage systems
Network Attached Storage (NAS) has been gaining general acceptance, because it can be managed easily and files shared among many clients, which run different operating systems. The advent of Gigabit Ethernet and high speed transport protocols further ...
Using Working Set Reorganization to Manage Storage Systems with Hard and Solid State Disks
ICPPW '14: Proceedings of the 2014 43rd International Conference on Parallel Processing WorkshopsScientific applications from many problem domains produce and/or access large volumes of data. To support these applications, designers of high-end computing (HEC) systems have greatly increased the capacity of storage systems in recent years. However, ...
An update-aware storage system for low-locality update-intensive workloads
ASPLOS '12Traditional storage systems provide a simple read/write interface, which is inadequate for low-locality update-intensive workloads because it limits the disk scheduling flexibility and results in inefficient use of buffer memory and raw disk bandwidth. ...






Comments