research-article

Memory Interface Design for 3D Stencil Kernels on a Massively Parallel Memory System

Published: 11 September 2015

Abstract

Massively parallel memory systems are designed to deliver high bandwidth at relatively low clock speeds for memory-intensive applications implemented on programmable logic. For example, the Convey HC-1 provides 1,024 DRAM banks to each of four FPGAs through a full crossbar, presenting a peak bandwidth of 76.8 GB/s to the user logic. Such highly parallel memory systems suffer from high latency, and their effective bandwidth is highly sensitive to access ordering. To achieve high performance, the user must use a customized memory interface that combines scheduling, latency hiding, and data reuse. In this article, we describe the design of a custom memory interface for 3D stencil kernels on the Convey HC-1 that incorporates these features. Experimental results show that, compared to a naive memory interface, the proposed memory interface achieves a runtime speedup of 2.2× for a 6-point stencil and 9.5× for a 27-point stencil.
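To make the two kernel shapes concrete, the following is a minimal software sketch of a 6-point and a 27-point 3D stencil. It is an illustration only and is not the article's hardware design: the function names `stencil_offsets` and `apply_stencil`, the uniform averaging weights, and the small 6×6×6 test grid are all assumptions introduced here. The sweep deliberately re-reads every neighbor for every output point and counts those reads, which is roughly the access pattern a naive memory interface would generate and the redundancy that data reuse is meant to remove.

```python
from itertools import product


def stencil_offsets(points):
    """Return neighbor offsets (dz, dy, dx) for a 6- or 27-point 3D stencil."""
    if points == 6:
        # The six face neighbors: one step along exactly one axis.
        return [(-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1)]
    if points == 27:
        # The full 3x3x3 cube around the point, center included.
        return list(product((-1, 0, 1), repeat=3))
    raise ValueError("only 6- and 27-point stencils are sketched here")


def apply_stencil(grid, offsets):
    """Naive sweep: every interior output re-reads all of its neighbors.

    Uniform averaging weights are an assumption made for this sketch.
    Returns the output grid and the total number of grid reads issued.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    # Boundary points are copied through unchanged.
    out = [[[grid[z][y][x] for x in range(nx)] for y in range(ny)] for z in range(nz)]
    reads = 0
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                total = 0.0
                for dz, dy, dx in offsets:
                    total += grid[z + dz][y + dy][x + dx]
                    reads += 1
                out[z][y][x] = total / len(offsets)
    return out, reads


# Hypothetical test input: a small grid whose value is z + y + x.
n = 6
grid = [[[float(z + y + x) for x in range(n)] for y in range(n)] for z in range(n)]
for points in (6, 27):
    _, reads = apply_stencil(grid, stencil_offsets(points))
    print(points, "point stencil:", reads, "reads for", n ** 3, "stored values")
```

On this 6×6×6 grid the naive sweep issues 384 reads for the 6-point stencil and 1,728 for the 27-point stencil, against only 216 stored values. These counts describe the toy example alone and say nothing about the measured speedups reported in the abstract; they only show that the denser stencil has more redundant accesses available for reuse.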

    • Published in

      ACM Transactions on Reconfigurable Technology and Systems, Volume 8, Issue 4
      October 2015
      134 pages
      ISSN: 1936-7406
      EISSN: 1936-7414
      DOI: 10.1145/2822909
      • Editor: Steve Wilton

      Copyright © 2015 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 11 September 2015
      • Accepted: 1 June 2015
      • Revised: 1 May 2015
      • Received: 1 November 2014

      Qualifiers

      • research-article
      • Research
      • Refereed
