
Efficient Virtual Memory Sharing via On-Accelerator Page Table Walking in Heterogeneous Embedded SoCs

Published: 27 September 2017

Abstract

Shared virtual memory is key, for both programmability and performance, in heterogeneous systems on chip (SoCs) that combine a general-purpose host processor with a many-core accelerator. In contrast to the full-blown, hardware-only solutions predominant in modern high-end systems, lightweight hardware-software co-designs are better suited to the more power- and area-constrained context of embedded systems, and they provide additional benefits in terms of flexibility and predictability. As a downside, the latter solutions require the host to handle, in software, both the synchronization required on page misses and the miss handling itself. This may incur considerable run-time overheads.
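To make the host-side overhead concrete, the following minimal sketch models a software-managed accelerator TLB that has no hardware page table walker: on a miss it can only record the faulting page and wait until host software refills the entry. Everything here is an illustrative assumption, not the paper's design: the class `SoftwareManagedTLB`, the function `host_handle_misses`, 4 KiB pages, FIFO replacement, and a page table modeled as a plain dict.

```python
from collections import deque

PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class SoftwareManagedTLB:
    """Hypothetical accelerator TLB without a hardware walker: on a miss it
    only records the faulting page and waits for software to refill it."""

    def __init__(self, num_entries=4):
        self.num_entries = num_entries
        self.entries = {}              # vpn -> pfn
        self.fifo = deque()            # insertion order for FIFO replacement
        self.pending_misses = deque()  # misses the host has not handled yet

    def translate(self, va):
        vpn = va >> PAGE_SHIFT
        if vpn in self.entries:
            return (self.entries[vpn] << PAGE_SHIFT) | (va & PAGE_MASK)
        if vpn not in self.pending_misses:
            self.pending_misses.append(vpn)  # the accelerator would stall here
        return None

    def refill(self, vpn, pfn):
        if vpn in self.entries:
            self.entries[vpn] = pfn
            return
        if len(self.entries) == self.num_entries:
            victim = self.fifo.popleft()     # evict the oldest entry
            del self.entries[victim]
        self.entries[vpn] = pfn
        self.fifo.append(vpn)


def host_handle_misses(tlb, page_table):
    """Hypothetical host-side handler: drain the miss queue, look up each
    page in the process page table (a dict here), and refill the TLB."""
    handled = 0
    while tlb.pending_misses:
        vpn = tlb.pending_misses.popleft()
        tlb.refill(vpn, page_table[vpn])
        handled += 1
    return handled


page_table = {0x3: 0x103, 0x4: 0x2A0}       # toy vpn -> pfn mapping
tlb = SoftwareManagedTLB()
print(tlb.translate(0x3ABC))                # None: miss, the host must step in
print(host_handle_misses(tlb, page_table))  # 1 miss handled on the host
print(hex(tlb.translate(0x3ABC)))           # 0x103abc after the refill
```

Every miss forces a round trip to the host, which must notice the miss, walk its own page tables, and write the TLB before the accelerator can continue; this is the run-time overhead the paragraph refers to.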

In this work, we present a novel hardware-software virtual memory management approach for many-core accelerators in heterogeneous embedded SoCs. It exploits an accelerator-side helper thread concept that enables the accelerator to manage its virtual memory hardware autonomously while operating cache-coherently on the page tables of the user-space processes of the host. This greatly reduces overhead with respect to host-side solutions while retaining flexibility. We have validated the design with a set of parameterizable benchmarks and real-world applications covering various application domains. For purely memory-bound kernels, the accelerator performance improves by a factor of 3.8 compared with host-based management and lies within 50% of a lower-bound ideal memory management unit.
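The core idea of the accelerator-side helper thread can be sketched as a page table walk executed on the accelerator itself, so that a miss is resolved locally instead of interrupting the host. The sketch below is hypothetical and deliberately simplified: it assumes a generic two-level table for 32-bit virtual addresses split 10/10/12, 4-byte entries with bit 0 as the valid flag and bits [31:12] holding the next-level table or frame address. This is not the ARM descriptor format or the paper's implementation. Shared memory is modeled as a dict from physical address to word, and cache-coherent access to the host's page tables is simply assumed. The names `walk`, `helper_thread_refill` and `PageFault` are illustrative.

```python
PAGE_SHIFT = 12
L2_BITS = 10
L1_SHIFT = PAGE_SHIFT + L2_BITS        # VA bits [31:22] index the L1 table
IDX_MASK = (1 << L2_BITS) - 1
VALID = 0x1
FRAME_MASK = ~((1 << PAGE_SHIFT) - 1) & 0xFFFFFFFF


class PageFault(Exception):
    """Raised when a table entry along the walk is not valid."""


def walk(mem, root, va):
    """Hypothetical two-level walk over a simplified 32-bit page table.
    mem maps a physical address to a 32-bit entry; missing words read as 0."""
    l1_entry = mem.get(root + ((va >> L1_SHIFT) & IDX_MASK) * 4, 0)
    if not l1_entry & VALID:
        raise PageFault(hex(va))
    l2_base = l1_entry & FRAME_MASK
    l2_entry = mem.get(l2_base + ((va >> PAGE_SHIFT) & IDX_MASK) * 4, 0)
    if not l2_entry & VALID:
        raise PageFault(hex(va))
    return (l2_entry & FRAME_MASK) | (va & ((1 << PAGE_SHIFT) - 1))


def helper_thread_refill(tlb, mem, root, va):
    """Accelerator-side miss handling: walk the shared page table and
    refill the TLB (a vpn -> pfn dict here) without involving the host."""
    pa = walk(mem, root, va)
    tlb[va >> PAGE_SHIFT] = pa >> PAGE_SHIFT
    return pa


# Toy tables: L1 table at 0x1000, entry 1 points to an L2 table at 0x2000,
# whose entry 3 maps the page to frame 0x103.
mem = {0x1004: 0x2000 | VALID, 0x200C: 0x00103000 | VALID}
tlb = {}
print(hex(helper_thread_refill(tlb, mem, 0x1000, 0x00403ABC)))  # 0x103abc
print(tlb)                                                      # {1027: 259}
```

How the actual design treats an unmapped page is not described in this abstract; the sketch simply raises `PageFault` in that case.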

