skip to main content
research-article

Black-box Concurrent Data Structures for NUMA Architectures

Published: 04 April 2017 Publication History

Abstract

High-performance servers are Non-Uniform Memory Access (NUMA) machines. To fully leverage these machines, programmers need efficient concurrent data structures that are aware of the NUMA performance artifacts. We propose Node Replication (NR), a black-box approach to obtaining such data structures. NR takes an arbitrary sequential data structure and automatically transforms it into a NUMA-aware concurrent data structure satisfying linearizability. Using NR requires no expertise in concurrent data structure design, and the result is free of concurrency bugs. NR draws ideas from two disciplines: shared-memory algorithms and distributed systems. Briefly, NR implements a NUMA-aware shared log, and then uses the log to replicate data structures consistently across NUMA nodes. NR is best suited for contended data structures, where it can outperform lock-free algorithms by 3.1x, and lock-based solutions by 30x. To show the benefits of NR to a real application, we apply NR to the data structures of Redis, an in-memory storage system. The result outperforms other methods by up to 14x. The cost of NR is additional memory for its log and replicas.

References

[1]
http://redis.io.
[2]
Dmitry Vyukov. Distributed Reader-Writer Mutex. http://www.1024cores.net/home/ lock-free-algorithms/reader-writer-problem/ distributed-reader-writer-mutex.
[3]
http://goog-perftools.sourceforge.net/doc/tcmalloc.html.
[4]
M. K. Aguilera. Thread-local malloc. https://github.com/mkaguilera/tmalloc.
[5]
J. H. Anderson and M. Moir. Universal constructions for large objects. IEEE Transactions on Parallel and Distributed Systems, 10(12):1317--1332, Dec. 1999.
[6]
M. Balakrishnan, D. Malkhi, J. P. Davis, V. Prabhakaran, M. Wei, and T. Wobber. CORFU: A distributed shared log. ACM Transactions on Computer Systems, 31(4), Dec. 2013.
[7]
A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In ACM Symposium on Operating Systems Principles, pages 29--44, Oct. 2009.
[8]
J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Munin: distributed shared memory based on type-specific memory coherence. In ACM Symposium on Principles and Practice of Parallel Programming, pages 168--176, Mar. 1990.
[9]
S. Blagodurov, S. Zhuravlev, M. Dashti, and A. Fedorova. A case for NUMA-aware contention management on multicore systems. In USENIX Annual Technical Conference, Oct. 2011.
[10]
W. J. Bolosky, R. P. Fitzgerald, and M. L. Scott. Simple but effective techniques for NUMA memory management. In ACM Symposium on Operating Systems Principles, pages 19--31, Dec. 1989.
[11]
S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: an operating system for many cores. In Symposium on Operating Systems Design and Implementation, pages 43--57, Dec. 2008.
[12]
S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich. OpLog: a library for scaling update-heavy data structures. Technical Report TR-2014-019, MIT CSAIL, Sept. 2014.
[13]
A. Braginsky, A. Kogan, and E. Petrank. Drop the anchor: Lightweight memory management for non-blocking data structures. In ACM Symposium on Parallelism in Algorithms and Architectures, pages 33--42, July 2013.
[14]
T. Brown, A. Kogan, Y. Lev, and V. Luchangco. Investigating the performance of hardware transactions on a multi-socket machine. In ACM Symposium on Parallelism in Algorithms and Architectures, pages 121--132, July 2016.
[15]
E. Bugnion, S. Devine, K. Govil, and M. Rosenblum. Disco: running commodity operating systems on scalable multiprocessors. ACM Transactions on Computer Systems, 15(4):412--447, Nov. 1997.
[16]
I. Calciu, D. Dice, T. Harris, M. Herlihy, A. Kogan, V. J. Marathe, and M. Moir. Message passing or shared memory: Evaluating the delegation abstraction for multicores. In International Conference on Principles of Distributed Systems, pages 83--97, Dec. 2013.
[17]
I. Calciu, J. E. Gottschlich, and M. Herlihy. Using delegation and elimination to implement a scalable NUMA-friendly stack. In USENIX Workshop on Hot Topics in Parallelism, June 2013.
[18]
B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In ACM Symposium on Cloud Computing, pages 143--154, June 2010.
[19]
A. L. Cox and R. J. Fowler. The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with PLATINUM. In ACM Symposium on Operating Systems Principles, pages 32--44, Dec. 1989.
[20]
M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth. Traffic management: A holistic approach to memory placement on NUMA systems. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 381--394, Mar. 2013.
[21]
T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In ACM Symposium on Operating Systems Principles, pages 33--48, Nov. 2013.
[22]
T. David, R. Guerraoui, and V. Trigonakis. Asynchronized concurrency: The secret to scaling concurrent search data structures. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 631--644, Mar. 2015.
[23]
T. David, R. Guerraoui, and M. Yabandeh. Consensus inside. In International Middleware Conference, pages 145--156, Dec. 2014.
[24]
P. Fatourou and N. D. Kallimanis. A highly-efficient wait-free universal construction. In ACM Symposium on Parallelism in Algorithms and Architectures, pages 325--334, June 2011.
[25]
K. Fraser. Practical lock-freedom. Technical Report UCAM-CL-TR-579, University of Cambridge, Computer Laboratory, Feb. 2004.
[26]
M. L. Fredman, R. Sedgewick, D. D. Sleator, and R. E. Tarjan. The pairing heap: a new form of self-adjusting heap. Algorithmica, 1(1):111--129, Jan. 1986.
[27]
B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. In Symposium on Operating Systems Design and Implementation, pages 87--100, Feb. 1999.
[28]
S. K. Haider, W. Hasenplaugh, and D. Alistarh. Lease/release: Architectural support for scaling contended data structures. In ACM Symposium on Principles and Practice of Parallel Programming, Mar. 2016. Article 17.
[29]
P. Hawkins, A. Aiken, K. Fisher, M. Rinard, and M. Sagiv. Concurrent data representation synthesis. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 417--428, June 2012.
[30]
D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In ACM Symposium on Parallelism in Algorithms and Architectures, pages 355--364, June 2010.
[31]
D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Scalable flat-combining based synchronous queues. In ACM Symposium on Parallelism in Algorithms and Architectures, pages 355--364, Sept. 2010.
[32]
D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. J. Parallel Distrib. Comput., 70(1):1--12, Jan. 2010.
[33]
M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 11(1):124--149, Jan. 1991.
[34]
M. Herlihy and I. Calciu. Work in progress: Shared nothing transactional memory. In Workshop on Systems for Future Multicore Architectures, Apr. 2011.
[35]
M. Herlihy, V. Luchangco, P. Martin, and M. Moir. Nonblocking memory management support for dynamic-sized data structures. ACM Trans. Comput. Syst., 23(2):146--196, May 2005.
[36]
M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. ACM SIGARCH Computer Architecture News, 21(2):289--300, May 1993.
[37]
M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.
[38]
M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463--492, July 1990.
[39]
A. Kalia, M. Kaminsky, and D. G. Andersen. Using RDMA efficiently for key-value services. In ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 295--306, Aug. 2014.
[40]
O. Krieger, M. Auslander, B. Rosenburg, R. W. Wisniewski, J. Xenidis, D. D. Silva, M. Ostrowski, J. Appavoo, M. Butrico, M. Mergen, A. Waterland, and V. Uhlig. K42: building a complete operating system. In European Conference on Computer Systems, pages 133--145, Apr. 2006.
[41]
R. Lachaize, B. Lepers, and V. Quéma. MemProf: A memory profiler for NUMA multicore systems. In USENIX Annual Technical Conference, pages 53--64, June 2012.
[42]
C. Lameter. NUMA (non-uniform memory access): An overview. ACM Queue, 11(7), July 2013.
[43]
L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133--169, May 1998.
[44]
V. Leis, P. Boncz, A. Kemper, and T. Neumann. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In International Conference on Management of Data, pages 743--754, June 2014.
[45]
Y. Li, I. Pandis, R. Mueller, V. Raman, and G. Lohman. NUMA-aware algorithms: the case of data shuffling. In Conference on Innovative Data Systems Research, Jan. 2013.
[46]
H. Lim, D. Han, D. G. Andersen, and M. Kaminsky. MICA: A holistic approach to fast in-memory key-value storage. In Symposium on Networked Systems Design and Implementation, pages 429--444, Apr. 2014.
[47]
Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In European Conference on Computer Systems, pages 183--196, Apr. 2012.
[48]
A. Matveev, N. Shavit, P. Felber, and P. Marlier. Read-log-update: a lightweight synchronization mechanism for concurrent programming. In ACM Symposium on Operating Systems Principles, pages 168--183, Oct. 2015.
[49]
P. E. McKenney and J. D. Slingwine. Read-copy-update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Systems, pages 509--518, Oct. 1998.
[50]
Z. Metreveli, N. Zeldovich, and M. F. Kaashoek. CPHash: a cache-partitioned hash table. In ACM Symposium on Principles and Practice of Parallel Programming, pages 319--320, Feb. 2012.
[51]
M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems, 15(6):491--504, June 2004.
[52]
D. Ongaro, S. M. Rumble, R. Stutsman, J. K. Ousterhout, and M. Rosenblum. Fast crash recovery in RAMCloud. In ACM Symposium on Operating Systems Principles, pages 29--41, Oct. 2011.
[53]
D. Porobic, E. Liarou, P. Tözün, and A. Ailamaki. ATraPos: Adaptive transaction processing on hardware islands. In International Conference on Data Engineering, pages 688--699, Mar. 2014.
[54]
W. Pugh. Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6):668--676, June 1990.
[55]
F. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4):299--319, Dec. 1990.
[56]
O. Shahmirzadi, T. Ropars, and A. Schiper. High-throughput maps on message-passing manycore architectures: Partitioning versus replication. In International Conference on Parallel Processing, pages 536--547, Aug. 2014.
[57]
O. Shalev and N. Shavit. Predictive log-synchronization. In European Conference on Computer Systems, pages 305--316, Apr. 2006.
[58]
N. Shavit and D. Touitou. Software transactional memory. In ACM Symposium on Principles of Distributed Computing, pages 204--213, Aug. 1995.
[59]
P. Stuedi, A. Trivedi, B. Metzler, and J. Pfefferle. DaRPC: Data center RPC. In ACM Symposium on Cloud Computing, pages 1--13, Nov. 2014.
[60]
H. Sutter. Lock-free code: A false sense of security. Dr. Dobb's, Sept. 2008.
[61]
R. K. Treiber. Systems programming: Coping with parallelism. Technical report, IBM Almaden Research Center, Apr. 1986.
[62]
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279--289, Oct. 1996.
[63]
L. Xiang and M. L. Scott. Compiler aided manual speculation for high performance concurrent data structures. In ACM Symposium on Principles and Practice of Parallel Programming, pages 47--56, Feb. 2013.

Cited By

View all
  • (2024)Opca: Enabling Optimistic Concurrent Access for Multiple Users in Oblivious Data StorageIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.344162335:11(1891-1903)Online publication date: 1-Nov-2024
  • (2022)PREP-UCProceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3490148.3538568(217-229)Online publication date: 11-Jul-2022
  • (2022)Using skip graphs for increased NUMA localityJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.04.006167:C(31-49)Online publication date: 1-Sep-2022
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 52, Issue 4
ASPLOS '17
April 2017
811 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3093336
Issue’s Table of Contents
  • cover image ACM Conferences
    ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
    April 2017
    856 pages
    ISBN:9781450344654
    DOI:10.1145/3037697
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 April 2017
Published in SIGPLAN Volume 52, Issue 4

Check for updates

Badges

  • Best Paper

Author Tags

  1. black-box techniques
  2. concurrent data structures
  3. log
  4. numa architecture
  5. replication

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)117
  • Downloads (Last 6 weeks)12
Reflects downloads up to 13 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Opca: Enabling Optimistic Concurrent Access for Multiple Users in Oblivious Data StorageIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.344162335:11(1891-1903)Online publication date: 1-Nov-2024
  • (2022)PREP-UCProceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3490148.3538568(217-229)Online publication date: 11-Jul-2022
  • (2022)Using skip graphs for increased NUMA localityJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.04.006167:C(31-49)Online publication date: 1-Sep-2022
  • (2020)Using Skip Graphs for Increased NUMA Locality2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)10.1109/SBAC-PAD49847.2020.00031(157-166)Online publication date: Sep-2020
  • (2019)An adaptive concurrent priority queue for NUMA architecturesProceedings of the 16th ACM International Conference on Computing Frontiers10.1145/3310273.3323164(135-144)Online publication date: 30-Apr-2019
  • (2024)WASP: Workload-Aware Self-Replicating Page-Tables for NUMA ServersProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640369(1233-1249)Online publication date: 27-Apr-2024
  • (2023)Leaf: Modularity for Temporary Sharing in Separation LogicProceedings of the ACM on Programming Languages10.1145/36227987:OOPSLA2(31-58)Online publication date: 16-Oct-2023
  • (2023)Using Local Cache Coherence for Disaggregated Memory SystemsACM SIGOPS Operating Systems Review10.1145/3606557.360656157:1(21-28)Online publication date: 28-Jun-2023
  • (2023)DiffLex: A High-Performance, Memory-Efficient and NUMA-Aware Learned Index using Differentiated ManagementProceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605590(62-71)Online publication date: 7-Aug-2023
  • (2023)Beyond isolation: OS verification as a foundation for correct applicationsProceedings of the 19th Workshop on Hot Topics in Operating Systems10.1145/3593856.3595899(158-165)Online publication date: 22-Jun-2023
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media