Best Paper

Black-box Concurrent Data Structures for NUMA Architectures

Published: 04 April 2017

Abstract

High-performance servers are Non-Uniform Memory Access (NUMA) machines. To fully leverage these machines, programmers need efficient concurrent data structures that are aware of the NUMA performance artifacts. We propose Node Replication (NR), a black-box approach to obtaining such data structures. NR takes an arbitrary sequential data structure and automatically transforms it into a NUMA-aware concurrent data structure satisfying linearizability. Using NR requires no expertise in concurrent data structure design, and the result is free of concurrency bugs. NR draws ideas from two disciplines: shared-memory algorithms and distributed systems. Briefly, NR implements a NUMA-aware shared log, and then uses the log to replicate data structures consistently across NUMA nodes. NR is best suited for contended data structures, where it can outperform lock-free algorithms by 3.1x, and lock-based solutions by 30x. To show the benefits of NR to a real application, we apply NR to the data structures of Redis, an in-memory storage system. The result outperforms other methods by up to 14x. The cost of NR is additional memory for its log and replicas.
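The core idea in the abstract, a shared log that keeps one replica per NUMA node consistent, can be illustrated with a minimal single-threaded sketch. It is not the paper's implementation: the names `SharedLog` and `Replica` are hypothetical, and all concurrency control, NUMA-aware memory placement, and log recycling are omitted because the abstract does not describe them. The sketch only shows how replaying one ordered log lets every replica of an unmodified sequential structure reach the same state.

```python
class SharedLog:
    """Append-only list of (method_name, args) update records (hypothetical name)."""

    def __init__(self):
        self.entries = []

    def append(self, op, args):
        self.entries.append((op, args))
        return len(self.entries)  # position one past the new entry


class Replica:
    """One private copy of a sequential structure, e.g. one per NUMA node."""

    def __init__(self, log, factory):
        self.log = log
        self.state = factory()   # the unmodified sequential data structure
        self.applied = 0         # number of log entries replayed so far

    def _catch_up(self, upto):
        # Replay missing log entries in order; return the last result.
        result = None
        while self.applied < upto:
            op, args = self.log.entries[self.applied]
            result = getattr(self.state, op)(*args)
            self.applied += 1
        return result

    def update(self, op, *args):
        # Updates are ordered by the log, then applied locally.
        tail = self.log.append(op, args)
        return self._catch_up(tail)

    def read(self, op, *args):
        # Reads first replay everything logged so far, then run locally.
        self._catch_up(len(self.log.entries))
        return getattr(self.state, op)(*args)


if __name__ == "__main__":
    log = SharedLog()
    node0 = Replica(log, dict)
    node1 = Replica(log, dict)
    node0.update("__setitem__", "k", 1)
    print(node1.read("get", "k"))  # node1 sees the update made via node0
```

The sequential structure (`dict` here) is used as a black box: the sketch calls its existing methods by name and never changes its code, which mirrors the black-box property the abstract claims for NR. The extra memory for the log and for each replica is also visible in this form.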
