ABSTRACT
High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required.
We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), that holds by construction in HLPL programs. We design a new coherence protocol, WARDen, that uses WARD to selectively disable coherence.
We evaluate WARDen with a widely used HLPL benchmark suite on both current and future x64 machine structures. WARDen accelerates the benchmarks (by an average of 1.46x) and reduces energy consumption (by 23%) by eliminating unnecessary data movement and coherence messages.