Abstract
Cloud computing services often replicate data and may require ways to coordinate distributed actions. Here we present Derecho, a library for such tasks. The API provides interfaces for structuring applications into patterns of subgroups and shards, supports state machine replication within them, and includes mechanisms that assist in restart after failures. Running over 100Gbps RDMA, Derecho can send millions of events per second in each subgroup or shard and throughput peaks at 16GB/s, substantially outperforming prior solutions. Configured to run purely on TCP, Derecho is still substantially faster than comparable widely used, highly-tuned, standard tools. The key insight is that on modern hardware (including non-RDMA networks), data-intensive protocols should be built from non-blocking data-flow components.
- {n. d.}. LibPaxos: Open-source Paxos. Retrieved from http://libpaxos.sourceforge.net/.Google Scholar
- {n. d.}. RDMA-Paxos: Open-source Paxos. Retrieved from https://github.com/wangchenghku/RDMA-PAXOS.Google Scholar
- 2011. Vsync Reliable Multicast Library. Retrieved from http://vsync.codeplex.com/.Google Scholar
- 2012. Gbcast Protocol. Retrieved from https://en.wikipedia.org/wiki/Gbcast.Google Scholar
- Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. 2013. CORFU: A distributed shared log. ACM Trans. Comput. Syst. 31, 4, Article 10 (Dec. 2013), 24 pages. Google Scholar
Digital Library
- Bela Ban. 2002. JGroups Reliable Multicast Library. Retrieved from http://jgroups.org/.Google Scholar
- Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. 2009. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP’09). ACM, New York, NY, 29--44. Google Scholar
Digital Library
- Behrens, Jonathan and Birman, Ken and Jha, Sagar and Tremel, Edward. 2018. RDMC: A reliable RDMA multicast for large objects. In Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’18). IEEE Computer Society, Los Alamitos, CA, 1--12.Google Scholar
- Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. 2014. IX: A protected dataplane operating system for high throughput and low latency. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Broomfield, CO, 49--65. Google Scholar
Digital Library
- Kenneth Birman. 2012. Guide to Reliable Distributed Systems. Number XXII in Texts in Computer Science. Springer-Verlag, London. Google Scholar
Digital Library
- Kenneth P. Birman. 1985. Replication and fault-tolerance in the isis system. In Proceedings of the 10th ACM Symposium on Operating Systems Principles (SOSP’85). ACM, New York, NY, 79--86. Google Scholar
Digital Library
- Kenneth P. Birman. 2012. Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services. Springer Verlag Texts in Computer Science, New York, NY. Google Scholar
Digital Library
- Kenneth P. Birman and Thomas A. Joseph. 1987. Exploiting virtual synchrony in distributed systems. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP’87). ACM, New York, NY, 123--138. Google Scholar
Digital Library
- Kenneth P. Birman and Thomas A. Joseph. 1987. Reliable communication in the presence of failures. ACM Trans. Comput. Syst. 5, 1 (Jan. 1987), 47--76. Google Scholar
Digital Library
- Eric Brewer. 2010. A certain freedom: Thoughts on the CAP theorem. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC’10). ACM, New York, NY, 335--335. Google Scholar
Digital Library
- Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. 2011. Windows azure storage: A highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP’11). ACM, New York, NY, 143--157. Google Scholar
Digital Library
- Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. 2007. Paxos made live: An engineering perspective. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing (PODC’07). ACM, New York, NY, 398--407. Google Scholar
Digital Library
- Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable failure detectors for reliable distributed systems. J. ACM 43, 2 (Mar. 1996), 225--267. Google Scholar
Digital Library
- K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. 3, 1 (Feb. 1985), 63--75. Google Scholar
Digital Library
- Jo-Mei Chang and N. F. Maxemchuk. 1984. Reliable broadcast protocols. ACM Trans. Comput. Syst. 2, 3 (Aug. 1984), 251--273. Google Scholar
Digital Library
- Neil Conway, William R. Marczak, Peter Alvaro, Joseph M. Hellerstein, and David Maier. 2012. Logic and lattices for distributed programming. In Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC’12). ACM, New York, NY. Google Scholar
Digital Library
- James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, 251--264. http://dl.acm.org/citation.cfm?id=2387880.2387905 Google Scholar
Digital Library
- Heming Cui, Rui Gu, Cheng Liu, Tianyu Chen, and Junfeng Yang. 2015. Paxos made transparent. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP’15). ACM, New York, NY, 105--120. Google Scholar
Digital Library
- Huynh Tu Dang, Daniele Sciascia, Marco Canini, Fernando Pedone, and Robert Soulé. 2015. NetPaxos: Consensus at network speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR’15). ACM, New York, NY, Article 5, 7 pages. Google Scholar
Digital Library
- Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI’14). USENIX Association, Berkeley, CA, 401--414. http://dl.acm.org/citation.cfm?id=2616448.2616486 Google Scholar
Digital Library
- Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr. 1985), 374--382. Google Scholar
Digital Library
- Roy Friedman and Robbert van Renesse. 1997. Packing messages as a tool for boosting the performance of total ordering protocols. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing (HPDC’97). Google Scholar
Digital Library
- Prasanna Ganesan and M. Seshadri. 2005. On cooperative content distribution and the price of barter. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS’05). 81--90. Google Scholar
Digital Library
- Seth Gilbert and Nancy Lynch. 2002. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (Jun. 2002), 51--59. Google Scholar
Digital Library
- Rachid Guerraoui, Ron R. Levy, Bastian Pochon, and Vivien Quéma. 2010. Throughput optimal total order broadcast for cluster environments. ACM Trans. Comput. Syst. 28, 2, Article 5 (Jul. 2010), 32 pages. Google Scholar
Digital Library
- Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. 2016. RDMA over commodity ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM’16). ACM, New York, NY, 202--215. Google Scholar
Digital Library
- Joseph Y. Halpern and Yoram Moses. 1990. Knowledge and common knowledge in a distributed environment. J. ACM 37, 3 (Jul. 1990), 549--587. Google Scholar
Digital Library
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. Using RDMA efficiently for key-value services. In Proceedings of the 2014 ACM Conference on SIGCOMM (SIGCOMM’14). ACM, New York, NY, 295--306. Google Scholar
Digital Library
- Idit Keidar and Alexander Shraer. 2006. Timeliness, failure-detectors, and consensus performance. In Proceedings of the 25th Annual ACM Symposium on Principles of Distributed Computing (PODC’06). ACM, New York, NY, 169--178. Google Scholar
Digital Library
- Leslie Lamport. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133--169. Google Scholar
Digital Library
- Butler Lampson. 2001. The ABCD’s of Paxos. In Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing (PODC’01). ACM, New York, NY, 13--.Google Scholar
Digital Library
- Jialin Li, Ellis Michael, Naveen Kr Sharma, Adriana Szekeres, and Dan RK Ports. 2016. Just say NO to Paxos overhead: Replacing consensus with network ordering. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). Google Scholar
Digital Library
- Parisa Jalili Marandi, Samuel Benz, Fernando Pedonea, and Kenneth P. Birman. 2014. The performance of paxos in the cloud. In Proceedings of the 2014 IEEE 33rd International Symposium on Reliable Distributed Systems (SRDS’14). IEEE Computer Society, Los Alamitos, CA, 41--50. Google Scholar
Digital Library
- D. Mazieres. 2007. Paxos made practical. Technical report. Retrieved from http://www.scs.stanford.edu/dm/home/papers.Google Scholar
- Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference (USENIX ATC’13). USENIX Association, Berkeley, CA, 103--114. Google Scholar
Digital Library
- Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’14). USENIX Association, Berkeley, CA, 305--320. Google Scholar
Digital Library
- John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2011. The case for RAMCloud. Commun. ACM 54, 7 (Jul. 2011), 121--130. Google Scholar
Digital Library
- Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2014. Arrakis: The operating system is the control plane. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Berkeley, CA, 1--16. Google Scholar
Digital Library
- Marius Poke and Torsten Hoefler. 2015. DARE: High-performance state machine replication on RDMA networks. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’15). ACM, New York, NY, 107--118. Google Scholar
Digital Library
- Marius Poke, Torsten Hoefler, and Colin Glass. 2017. AllConcur: Leaderless concurrent atomic broadcast. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’17). ACM, New York, NY, 18. Google Scholar
Digital Library
- Roberto De Prisco, Butler W. Lampson, and Nancy A. Lynch. 1997. Revisiting the paxos algorithm. In Proceedings of the 11th International Workshop on Distributed Algorithms (WDAG’97). Springer-Verlag, London, 111--125. Google Scholar
Digital Library
- Dan Pritchett. 2008. BASE: An ACID alternative. Queue 6, 3 (May 2008), 48--55. Google Scholar
Digital Library
- Fred B. Schneider. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4 (Dec. 1990), 299--319. Google Scholar
Digital Library
- Steve Shankland. 2008. Google’s Jeff Dean spotlights data center inner workings. C|Net Reviews (May 2008). Retrieved from https://www.cnet.com/news/google-spotlights-data-center-inner-workings/.Google Scholar
- Weijia Song, Theo Gkountouvas, Ken Birman, Qi Chen, and Zhen Xiao. 2016. The freeze-frame file system. In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC’16). ACM, New York, NY, 307--320. Google Scholar
Digital Library
- Robbert Van Renesse and Deniz Altinbuken. 2015. Paxos made moderately complex. ACM Comput. Surv. 47, 3, Article 42 (Feb. 2015), 36 pages. Google Scholar
Digital Library
- Robbert van Renesse and Fred B. Schneider. 2004. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Conference on Symposium on Operating Systems Design 8 Implementation, Volume 6 (OSDI’04). USENIX Association, Berkeley, CA, USA, 7--7. Google Scholar
Digital Library
- T. von Eicken, A. Basu, V. Buch, and W. Vogels. 1995. U-Net: A user-level network interface for parallel and distributed computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP’95). ACM, New York, NY, 40--53. Google Scholar
Digital Library
- Cheng Wang, Jianyu Jiang, Xusheng Chen, Ning Yi, and Heming Cui. 2017. APUS: Fast and scalable paxos on RDMA. In Proceedings of the 8th ACM Symposium on Cloud Computing (SoCC’17). ACM, Santa Clara, CA, 14. Google Scholar
Digital Library
- Michael Wei, Amy Tai, Christopher J. Rossbach, Ittai Abraham, Maithem Munshed, Medhavi Dhawan, Jim Stabile, Udi Wieder, Scott Fritchie, Steven Swanson, Michael J. Freedman, and Dahlia Malkhi. 2017. vCorfu: A cloud-scale object store on a shared log. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI’17). USENIX Association, 35--49. Google Scholar
Digital Library
- Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast In-memory transaction processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP’15). ACM, New York, NY, 87--104. Google Scholar
Digital Library
Index Terms
Derecho: Fast State Machine Replication for Cloud Services
Recommendations
Hermes: A Fast, Fault-Tolerant and Linearizable Replication Protocol
ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating SystemsToday's datacenter applications are underpinned by datastores that are responsible for providing availability, consistency, and performance. For high availability in the presence of failures, these datastores replicate data across several nodes. This is ...
Consistent and automatic replica regeneration
Reducing management costs and improving the availability of large-scale distributed systems require automatic replica regeneration, that is, creating new replicas in response to replica failures. A major challenge to regeneration is maintaining ...
OSSR: Optimal Single Site Replication
ISPA '10: Proceedings of the International Symposium on Parallel and Distributed Processing with ApplicationsReplication is a technique widely used for achieving qualities like scalability, high performance, high availability and fault tolerance in computer systems. Consistency of the system state is another quality that may also vary as a side effect of ...






Comments