Abstract
Heterogeneous CPU/FPGA devices, in which a CPU and an FPGA can execute together while sharing memory, are becoming popular in several computing sectors. In this paper, we study the shared-memory semantics of these devices, with a view to providing a firm foundation for reasoning about the programs that run on them. Our focus is on Intel platforms that combine an Intel FPGA with a multicore Xeon CPU. We describe the weak-memory behaviours that are allowed (and observable) on these devices when CPU threads and an FPGA thread access common memory locations in a fine-grained manner through multiple channels. Some of these behaviours are familiar from well-studied CPU and GPU concurrency; others are weaker still. We encode these behaviours in two formal memory models: one operational, one axiomatic. We develop executable implementations of both models, using the CBMC bounded model-checking tool for our operational model and the Alloy modelling language for our axiomatic model. Using these, we cross-check our models against each other via a translator that converts Alloy-generated executions into queries for the CBMC model. We also validate our models against actual hardware by translating 583 Alloy-generated executions into litmus tests that we run on CPU/FPGA devices; when doing this, we avoid the prohibitive cost of synthesising a hardware design per litmus test by creating our own 'litmus-test processor' in hardware. We expect that our models will be useful for low-level programmers, compiler writers, and designers of analysis tools. Indeed, as a demonstration of the utility of our work, we use our operational model to reason about a producer/consumer buffer implemented across the CPU and the FPGA. When the buffer uses insufficient synchronisation -- a situation that our model is able to detect -- we observe that its performance improves at the cost of occasional data corruption.
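To make the notion of a litmus test and a weak-memory behaviour concrete, the sketch below enumerates the outcomes of the classic message-passing (MP) test. It is a toy illustration and not the paper's operational or axiomatic model. The names `PRODUCER`, `CONSUMER`, `interleavings`, `run` and `outcomes` are hypothetical, and the "weak" variant simply assumes that the producer's two writes may reach shared memory in either order.

```python
# Message-passing (MP) litmus test (illustrative names):
#   producer: data = 1; flag = 1
#   consumer: r0 = flag; r1 = data
PRODUCER = [("W", "data", 1), ("W", "flag", 1)]
CONSUMER = [("R", "flag", "r0"), ("R", "data", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(trace):
    """Execute one interleaving against a single shared memory."""
    mem = {"data": 0, "flag": 0}
    regs = {}
    for kind, loc, x in trace:
        if kind == "W":
            mem[loc] = x          # x is the value written
        else:
            regs[x] = mem[loc]    # x is the destination register
    return (regs["r0"], regs["r1"])


def outcomes(producer_orders):
    """Collect every (r0, r1) reachable for the given producer write orders."""
    result = set()
    for producer in producer_orders:
        for trace in interleavings(producer, CONSUMER):
            result.add(run(trace))
    return result


# Sequential consistency: the producer's writes take effect in program order.
sc_outcomes = outcomes([PRODUCER])

# Toy relaxation (an assumption for illustration only): the two writes
# may take effect in either order.
weak_outcomes = outcomes([PRODUCER, PRODUCER[::-1]])

print("SC:  ", sorted(sc_outcomes))
print("weak:", sorted(weak_outcomes))
```

Under the sequentially consistent enumeration, the consumer never sees `flag == 1` together with `data == 0`. Under the toy relaxation that outcome becomes reachable, which is the kind of stale read that can corrupt data in an under-synchronised producer/consumer buffer.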
Index Terms
The semantics of shared memory in Intel CPU/FPGA systems