The semantics of shared memory in Intel CPU/FPGA systems

Published: 15 October 2021

Abstract

Heterogeneous CPU/FPGA devices, in which a CPU and an FPGA can execute together while sharing memory, are becoming popular in several computing sectors. In this paper, we study the shared-memory semantics of these devices, with a view to providing a firm foundation for reasoning about the programs that run on them. Our focus is on Intel platforms that combine an Intel FPGA with a multicore Xeon CPU. We describe the weak-memory behaviours that are allowed (and observable) on these devices when CPU threads and an FPGA thread access common memory locations in a fine-grained manner through multiple channels. Some of these behaviours are familiar from well-studied CPU and GPU concurrency; others are weaker still. We encode these behaviours in two formal memory models: one operational, one axiomatic. We develop executable implementations of both models, using the CBMC bounded model-checking tool for our operational model and the Alloy modelling language for our axiomatic model. Using these, we cross-check our models against each other via a translator that converts Alloy-generated executions into queries for the CBMC model. We also validate our models against actual hardware by translating 583 Alloy-generated executions into litmus tests that we run on CPU/FPGA devices; when doing this, we avoid the prohibitive cost of synthesising a hardware design per litmus test by creating our own 'litmus-test processor' in hardware. We expect that our models will be useful for low-level programmers, compiler writers, and designers of analysis tools. Indeed, as a demonstration of the utility of our work, we use our operational model to reason about a producer/consumer buffer implemented across the CPU and the FPGA. When the buffer uses insufficient synchronisation -- a situation that our model is able to detect -- we observe that its performance improves at the cost of occasional data corruption.
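
To make the notions of a litmus test and an operational memory model concrete, the sketch below enumerates every outcome of the classic store-buffering test under two toy models: sequential consistency, and a model in which each thread has a FIFO store buffer (in the style of x86-TSO). This is an illustration only. It is not the paper's CBMC or Alloy model, and it does not model the FPGA or its channels. The names `SB`, `explore`, and `use_store_buffers` are hypothetical and chosen for this example.

```python
# Store-buffering litmus test: each thread writes one location, then reads the other.
# Instruction format (an assumption of this sketch):
#   ("st", location, value)  or  ("ld", location, register)
SB = (
    (("st", "x", 1), ("ld", "y", "r0")),
    (("st", "y", 1), ("ld", "x", "r1")),
)


def explore(program, use_store_buffers):
    """Return the set of final register valuations reachable by any interleaving."""
    n = len(program)
    # State: (program counters, per-thread store buffers, memory, registers).
    start = ((0,) * n, ((),) * n, (), ())
    stack, seen, results = [start], set(), set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        memd, regd = dict(mem), dict(regs)
        finished = True
        for t, thread in enumerate(program):
            # Transition 1: thread t executes its next instruction.
            if pcs[t] < len(thread):
                finished = False
                op, loc, arg = thread[pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                new_bufs, new_mem, new_regs = bufs, dict(memd), dict(regd)
                if op == "st":
                    if use_store_buffers:
                        # The store is queued and is not yet visible to other threads.
                        new_bufs = bufs[:t] + (bufs[t] + ((loc, arg),),) + bufs[t + 1:]
                    else:
                        new_mem[loc] = arg
                else:
                    # A load reads the newest matching store in its own buffer,
                    # otherwise shared memory (locations start at 0).
                    pending = [v for (l, v) in bufs[t] if l == loc]
                    new_regs[arg] = pending[-1] if pending else memd.get(loc, 0)
                stack.append((new_pcs, new_bufs,
                              tuple(sorted(new_mem.items())),
                              tuple(sorted(new_regs.items()))))
            # Transition 2: the oldest buffered store of thread t reaches memory.
            if bufs[t]:
                finished = False
                (loc, val), rest = bufs[t][0], bufs[t][1:]
                new_mem = dict(memd)
                new_mem[loc] = val
                stack.append((pcs, bufs[:t] + (rest,) + bufs[t + 1:],
                              tuple(sorted(new_mem.items())), regs))
        if finished:
            results.add(regs)
    return results


weak = (("r0", 0), ("r1", 0))  # both loads miss both stores
sc = explore(SB, use_store_buffers=False)
tso = explore(SB, use_store_buffers=True)
print(weak in sc, weak in tso)  # prints: False True
```

The outcome in which both loads return 0 is forbidden under sequential consistency but reachable once stores can be buffered. Litmus tests of this kind are what the abstract describes running against hardware; the behaviours the paper reports for CPU/FPGA systems include some that are weaker than this toy model allows.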

Supplemental Material

Auxiliary Presentation Video

This is a presentation video for our OOPSLA 2021 paper "The Semantics of Shared Memory in Intel CPU/FPGA Systems". Dan Iorga is the presenter in the video.
