Abstract
To provide low-latency and high-throughput guarantees, most large key-value stores keep the data in the memory of many servers. Despite the natural parallelism across lookups, load imbalance, introduced by heavy skew in the popularity distribution of keys, limits performance. To avoid violating tail latency service-level objectives, systems tend to keep server utilization low and organize the data in micro-shards, which provide units of migration and replication for the purpose of load balancing. These techniques reduce the skew but incur additional monitoring, data replication, and consistency maintenance overheads.
In this work, we introduce RackOut, a memory pooling technique that leverages the one-sided remote read primitive of emerging rack-scale systems to mitigate load imbalance while respecting service-level objectives. In RackOut, the data are aggregated at rack-scale granularity, with all of the participating servers in the rack jointly servicing all of the rack’s micro-shards. We develop a queuing model to evaluate the impact of RackOut at the datacenter scale. In addition, we implement a RackOut proof-of-concept key-value store, evaluate it on two experimental platforms based on RDMA and Scale-Out NUMA, and use these results to validate the model. We devise two distinct approaches to load balancing within a RackOut unit: RackOut_static, based on random selection of nodes, and RackOut_adaptive, based on an adaptive load balancing mechanism. Our results show that RackOut_static increases throughput by up to 6× for RDMA and 8.6× for Scale-Out NUMA compared to a scale-out deployment, while respecting tight tail latency service-level objectives. For workloads with 20% writes, RackOut_adaptive improves throughput by 30% over RackOut_static.
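The pooling idea can be illustrated with a small, idealized sketch. It is not the paper's queuing model or implementation. It assumes a Zipf-like key popularity, a hypothetical random key-to-server placement, a read-only workload, and a perfectly even split of each rack's load among its servers (roughly what random node selection would give in expectation). All function names and parameters are illustrative assumptions.

```python
import random


def zipf_weights(num_keys, skew):
    """Normalized Zipf-like popularity: weight of rank r is proportional to 1 / r**skew."""
    raw = [1.0 / (rank ** skew) for rank in range(1, num_keys + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def shard_loads(weights, num_servers, seed=0):
    """Place each key on a random server (assumed placement) and sum the load per server."""
    rng = random.Random(seed)
    loads = [0.0] * num_servers
    for w in weights:
        loads[rng.randrange(num_servers)] += w
    return loads


def pool_by_rack(loads, rack_size):
    """Idealized pooling: every server in a rack serves an equal share of the rack's load."""
    pooled = []
    for start in range(0, len(loads), rack_size):
        rack = loads[start:start + rack_size]
        share = sum(rack) / len(rack)
        pooled.extend([share] * len(rack))
    return pooled


def imbalance(loads):
    """Ratio of the most loaded server to the average server (1.0 means perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


if __name__ == "__main__":
    weights = zipf_weights(num_keys=100_000, skew=0.99)
    loads = shard_loads(weights, num_servers=64, seed=1)
    for rack_size in (1, 4, 16, 64):
        ratio = imbalance(pool_by_rack(loads, rack_size))
        print(f"rack size {rack_size:2d}: max/avg load = {ratio:.2f}")
```

In this simplified setting, the max-to-average load ratio can only shrink as the pooling unit grows, because a rack's average load never exceeds the load of its hottest server. The sketch ignores writes, remote-read cost, and queuing effects, all of which the paper's model and evaluation address.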
Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling