research-article
Open Access

Full-Stack Architecting to Achieve a Billion-Requests-Per-Second Throughput on a Single Key-Value Store Server Platform

Published: 06 April 2016

Abstract

Distributed in-memory key-value stores (KVSs), such as memcached, have become a critical data serving layer in modern Internet-oriented data center infrastructure. Their performance and efficiency directly affect the QoS of web services and the efficiency of data centers. Traditionally, these systems have had significant overheads from inefficient network processing, OS kernel involvement, and concurrency control. Two recent research thrusts have focused on improving key-value performance. Hardware-centric research has started to explore specialized platforms including FPGAs for KVSs; results demonstrated an order of magnitude increase in throughput and energy efficiency over stock memcached. Software-centric research revisited the KVS application to address fundamental software bottlenecks and to exploit the full potential of modern commodity hardware; these efforts also showed orders of magnitude improvement over stock memcached.

We aim at architecting high-performance and efficient KVS platforms, and start with a rigorous architectural characterization across system stacks over a collection of representative KVS implementations. Our detailed full-system characterization not only identifies the critical hardware/software ingredients for high-performance KVS systems but also leads to guided optimizations atop a recent design to achieve a record-setting throughput of 120 million requests per second (MRPS) (167 MRPS with client-side batching) on a single commodity server. Our system delivers the best performance and energy efficiency (RPS/watt) demonstrated to date over existing KVSs including the best-published FPGA-based and GPU-based claims. We craft a set of design principles for future platform architectures, and via detailed simulations demonstrate the capability of achieving a billion RPS with a single server constructed following our principles.
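To put the quoted throughput figures in proportion, the following minimal sketch computes the relative gain from client-side batching and the gap between the measured 120 MRPS and the billion-RPS target. The constant and function names are illustrative assumptions, not taken from the paper. The abstract states no power figures, so `rps_per_watt` only encodes the RPS/watt definition and expects a caller-supplied wattage.

```python
# Throughput figures quoted in the abstract, in requests per second.
BASELINE_RPS = 120e6  # 120 MRPS on a single commodity server
BATCHED_RPS = 167e6   # 167 MRPS with client-side batching
TARGET_RPS = 1e9      # billion-RPS goal demonstrated via simulation


def speedup(new_rps: float, old_rps: float) -> float:
    """Return how many times larger new_rps is than old_rps."""
    return new_rps / old_rps


def rps_per_watt(rps: float, watts: float) -> float:
    """Energy efficiency as the abstract defines it: requests per second per watt.

    The wattage is a hypothetical input; the abstract gives no power numbers.
    """
    if watts <= 0:
        raise ValueError("watts must be positive")
    return rps / watts


if __name__ == "__main__":
    print(f"batching gain: {speedup(BATCHED_RPS, BASELINE_RPS):.2f}x")
    print(f"gap to a billion RPS: {speedup(TARGET_RPS, BASELINE_RPS):.2f}x")
```

Under these figures, batching yields roughly a 1.4x gain, while the billion-RPS target is a bit more than 8x the measured single-server throughput.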

