DOI: 10.1145/3579990.3580013

WARDen: Specializing Cache Coherence for High-Level Parallel Languages

Published: 22 February 2023

ABSTRACT

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required.

We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), that holds by construction in HLPL programs. We then design a new coherence protocol, WARDen, that uses WARD to selectively disable coherence.

We evaluate WARDen with a widely used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy consumption (by 23%) by eliminating unnecessary data movement and coherence messages.


      • Published in

        CGO 2023: Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization
        February 2023
        262 pages
        ISBN: 9798400701016
        DOI: 10.1145/3579990

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate: 312 of 1,061 submissions, 29%
