Abstract
High-performance servers are Non-Uniform Memory Access (NUMA) machines. To fully leverage these machines, programmers need efficient concurrent data structures that are aware of the NUMA performance artifacts. We propose Node Replication (NR), a black-box approach to obtaining such data structures. NR takes an arbitrary sequential data structure and automatically transforms it into a NUMA-aware concurrent data structure satisfying linearizability. Using NR requires no expertise in concurrent data structure design, and the result is free of concurrency bugs. NR draws ideas from two disciplines: shared-memory algorithms and distributed systems. Briefly, NR implements a NUMA-aware shared log, and then uses the log to replicate data structures consistently across NUMA nodes. NR is best suited for contended data structures, where it can outperform lock-free algorithms by 3.1x, and lock-based solutions by 30x. To show the benefits of NR to a real application, we apply NR to the data structures of Redis, an in-memory storage system. The result outperforms other methods by up to 14x. The cost of NR is additional memory for its log and replicas.
- http://redis.io.Google Scholar
- Dmitry Vyukov. Distributed Reader-Writer Mutex. http://www.1024cores.net/home/ lock-free-algorithms/reader-writer-problem/ distributed-reader-writer-mutex.Google Scholar
- http://goog-perftools.sourceforge.net/doc/tcmalloc.html.Google Scholar
- M. K. Aguilera. Thread-local malloc. https://github.com/mkaguilera/tmalloc.Google Scholar
- J. H. Anderson and M. Moir. Universal constructions for large objects. IEEE Transactions on Parallel and Distributed Systems, 10(12):1317--1332, Dec. 1999. Google Scholar
Digital Library
- M. Balakrishnan, D. Malkhi, J. P. Davis, V. Prabhakaran, M. Wei, and T. Wobber. CORFU: A distributed shared log. ACM Transactions on Computer Systems, 31(4), Dec. 2013. Google Scholar
Digital Library
- A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In ACM Symposium on Operating Systems Principles, pages 29--44, Oct. 2009. Google Scholar
Digital Library
- J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Munin: distributed shared memory based on type-specific memory coherence. In ACM Symposium on Principles and Practice of Parallel Programming, pages 168--176, Mar. 1990. Google Scholar
Digital Library
- S. Blagodurov, S. Zhuravlev, M. Dashti, and A. Fedorova. A case for NUMA-aware contention management on multicore systems. In USENIX Annual Technical Conference, Oct. 2011.Google Scholar
- W. J. Bolosky, R. P. Fitzgerald, and M. L. Scott. Simple but effective techniques for NUMA memory management. In ACM Symposium on Operating Systems Principles, pages 19--31, Dec. 1989. Google Scholar
Digital Library
- S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: an operating system for many cores. In Symposium on Operating Systems Design and Implementation, pages 43--57, Dec. 2008.Google Scholar
Digital Library
- S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich. OpLog: a library for scaling update-heavy data structures. Technical Report TR-2014-019, MIT CSAIL, Sept. 2014.Google Scholar
- A. Braginsky, A. Kogan, and E. Petrank. Drop the anchor: Lightweight memory management for non-blocking data structures. In ACM Symposium on Parallelism in Algorithms and Architectures, pages 33--42, July 2013. Google Scholar
Digital Library
- T. Brown, A. Kogan, Y. Lev, and V. Luchangco. Investigating the performance of hardware transactions on a multi-socket machine. In ACM Symposium on Parallelism in Algorithms and Architectures, pages 121--132, July 2016. Google Scholar
Digital Library
- E. Bugnion, S. Devine, K. Govil, and M. Rosenblum. Disco: running commodity operating systems on scalable multiprocessors. ACM Transactions on Computer Systems, 15(4):412--447, Nov. 1997. Google Scholar
Digital Library
- I. Calciu, D. Dice, T. Harris, M. Herlihy, A. Kogan, V. J. Marathe, and M. Moir. Message passing or shared memory: Evaluating the delegation abstraction for multicores. In International Conference on Principles of Distributed Systems, pages 83--97, Dec. 2013. Google Scholar
Digital Library
- I. Calciu, J. E. Gottschlich, and M. Herlihy. Using delegation and elimination to implement a scalable NUMA-friendly stack. In USENIX Workshop on Hot Topics in Parallelism, June 2013.Google Scholar
- B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In ACM Symposium on Cloud Computing, pages 143--154, June 2010. Google Scholar
Digital Library
- A. L. Cox and R. J. Fowler. The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with PLATINUM. In ACM Symposium on Operating Systems Principles, pages 32--44, Dec. 1989. Google Scholar
Digital Library
- M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth. Traffic management: A holistic approach to memory placement on NUMA systems. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 381--394, Mar. 2013. Google Scholar
Digital Library
- T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In ACM Symposium on Operating Systems Principles, pages 33--48, Nov. 2013. Google Scholar
Digital Library
- T. David, R. Guerraoui, and V. Trigonakis. Asynchronized concurrency: The secret to scaling concurrent search data structures. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 631--644, Mar. 2015. Google Scholar
Digital Library
- T. David, R. Guerraoui, and M. Yabandeh. Consensus inside. In International Middleware Conference, pages 145--156, Dec. 2014. Google Scholar
Digital Library
- P. Fatourou and N. D. Kallimanis. A highly-efficient wait-free universal construction. In ACM Symposium on Parallelism in Algorithms and Architectures, pages 325--334, June 2011. Google Scholar
Digital Library
- K. Fraser. Practical lock-freedom. Technical Report UCAM-CL-TR-579, University of Cambridge, Computer Laboratory, Feb. 2004.Google Scholar
- M. L. Fredman, R. Sedgewick, D. D. Sleator, and R. E. Tarjan. The pairing heap: a new form of self-adjusting heap. Algorithmica, 1(1):111--129, Jan. 1986. Google Scholar
Digital Library
- B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. In Symposium on Operating Systems Design and Implementation, pages 87--100, Feb. 1999.Google Scholar
Digital Library
- S. K. Haider, W. Hasenplaugh, and D. Alistarh. Lease/release: Architectural support for scaling contended data structures. In ACM Symposium on Principles and Practice of Parallel Programming, Mar. 2016. Article 17. Google Scholar
Digital Library
- P. Hawkins, A. Aiken, K. Fisher, M. Rinard, and M. Sagiv. Concurrent data representation synthesis. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 417--428, June 2012. Google Scholar
Digital Library
- D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In ACM Symposium on Parallelism in Algorithms and Architectures, pages 355--364, June 2010. Google Scholar
Digital Library
- D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Scalable flat-combining based synchronous queues. In ACM Symposium on Parallelism in Algorithms and Architectures, pages 355--364, Sept. 2010. Google Scholar
Cross Ref
- D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. J. Parallel Distrib. Comput., 70(1):1--12, Jan. 2010. Google Scholar
Digital Library
- M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 11(1):124--149, Jan. 1991. Google Scholar
Digital Library
- M. Herlihy and I. Calciu. Work in progress: Shared nothing transactional memory. In Workshop on Systems for Future Multicore Architectures, Apr. 2011.Google Scholar
- M. Herlihy, V. Luchangco, P. Martin, and M. Moir. Nonblocking memory management support for dynamic-sized data structures. ACM Trans. Comput. Syst., 23(2):146--196, May 2005. Google Scholar
Digital Library
- M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. ACM SIGARCH Computer Architecture News, 21(2):289--300, May 1993. Google Scholar
Digital Library
- M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.Google Scholar
Digital Library
- M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463--492, July 1990. Google Scholar
Digital Library
- A. Kalia, M. Kaminsky, and D. G. Andersen. Using RDMA efficiently for key-value services. In ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 295--306, Aug. 2014. Google Scholar
Digital Library
- O. Krieger, M. Auslander, B. Rosenburg, R. W. Wisniewski, J. Xenidis, D. D. Silva, M. Ostrowski, J. Appavoo, M. Butrico, M. Mergen, A. Waterland, and V. Uhlig. K42: building a complete operating system. In European Conference on Computer Systems, pages 133--145, Apr. 2006. Google Scholar
Digital Library
- R. Lachaize, B. Lepers, and V. Quéma. MemProf: A memory profiler for NUMA multicore systems. In USENIX Annual Technical Conference, pages 53--64, June 2012.Google Scholar
Digital Library
- C. Lameter. NUMA (non-uniform memory access): An overview. ACM Queue, 11(7), July 2013. Google Scholar
Digital Library
- L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133--169, May 1998. Google Scholar
Digital Library
- V. Leis, P. Boncz, A. Kemper, and T. Neumann. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In International Conference on Management of Data, pages 743--754, June 2014. Google Scholar
Digital Library
- Y. Li, I. Pandis, R. Mueller, V. Raman, and G. Lohman. NUMA-aware algorithms: the case of data shuffling. In Conference on Innovative Data Systems Research, Jan. 2013.Google Scholar
- H. Lim, D. Han, D. G. Andersen, and M. Kaminsky. MICA: A holistic approach to fast in-memory key-value storage. In Symposium on Networked Systems Design and Implementation, pages 429--444, Apr. 2014.Google Scholar
Digital Library
- Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In European Conference on Computer Systems, pages 183--196, Apr. 2012. Google Scholar
Digital Library
- A. Matveev, N. Shavit, P. Felber, and P. Marlier. Read-log-update: a lightweight synchronization mechanism for concurrent programming. In ACM Symposium on Operating Systems Principles, pages 168--183, Oct. 2015. Google Scholar
Digital Library
- P. E. McKenney and J. D. Slingwine. Read-copy-update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Systems, pages 509--518, Oct. 1998.Google Scholar
- Z. Metreveli, N. Zeldovich, and M. F. Kaashoek. CPHash: a cache-partitioned hash table. In ACM Symposium on Principles and Practice of Parallel Programming, pages 319--320, Feb. 2012. Google Scholar
Digital Library
- M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems, 15(6):491--504, June 2004. Google Scholar
Digital Library
- D. Ongaro, S. M. Rumble, R. Stutsman, J. K. Ousterhout, and M. Rosenblum. Fast crash recovery in RAMCloud. In ACM Symposium on Operating Systems Principles, pages 29--41, Oct. 2011. Google Scholar
Digital Library
- D. Porobic, E. Liarou, P. Tözün, and A. Ailamaki. ATraPos: Adaptive transaction processing on hardware islands. In International Conference on Data Engineering, pages 688--699, Mar. 2014. Google Scholar
Cross Ref
- W. Pugh. Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6):668--676, June 1990. Google Scholar
Digital Library
- F. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4):299--319, Dec. 1990. Google Scholar
Digital Library
- O. Shahmirzadi, T. Ropars, and A. Schiper. High-throughput maps on message-passing manycore architectures: Partitioning versus replication. In International Conference on Parallel Processing, pages 536--547, Aug. 2014. Google Scholar
Cross Ref
- O. Shalev and N. Shavit. Predictive log-synchronization. In European Conference on Computer Systems, pages 305--316, Apr. 2006. Google Scholar
Digital Library
- N. Shavit and D. Touitou. Software transactional memory. In ACM Symposium on Principles of Distributed Computing, pages 204--213, Aug. 1995. Google Scholar
Digital Library
- P. Stuedi, A. Trivedi, B. Metzler, and J. Pfefferle. DaRPC: Data center RPC. In ACM Symposium on Cloud Computing, pages 1--13, Nov. 2014. Google Scholar
Digital Library
- H. Sutter. Lock-free code: A false sense of security. Dr. Dobb's, Sept. 2008.Google Scholar
- R. K. Treiber. Systems programming: Coping with parallelism. Technical report, IBM Almaden Research Center, Apr. 1986.Google Scholar
- B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279--289, Oct. 1996. Google Scholar
Digital Library
- L. Xiang and M. L. Scott. Compiler aided manual speculation for high performance concurrent data structures. In ACM Symposium on Principles and Practice of Parallel Programming, pages 47--56, Feb. 2013. Google Scholar
Digital Library
Index Terms
Black-box Concurrent Data Structures for NUMA Architectures
Recommendations
Black-box Concurrent Data Structures for NUMA Architectures
ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating SystemsHigh-performance servers are Non-Uniform Memory Access (NUMA) machines. To fully leverage these machines, programmers need efficient concurrent data structures that are aware of the NUMA performance artifacts. We propose Node Replication (NR), a black-...
Black-box Concurrent Data Structures for NUMA Architectures
Asplos'17High-performance servers are Non-Uniform Memory Access (NUMA) machines. To fully leverage these machines, programmers need efficient concurrent data structures that are aware of the NUMA performance artifacts. We propose Node Replication (NR), a black-...
Concurrent Data Structures for Near-Memory Computing
SPAA '17: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and ArchitecturesThe performance gap between memory and CPU has grown exponentially. To bridge this gap, hardware architects have proposed near-memory computing (also called processing-in-memory, or PIM), where a lightweight processor (called a PIM core) is located ...







Comments