Abstract
Massively parallel memory systems are designed to deliver high bandwidth at relatively low clock speeds for memory-intensive applications implemented on programmable logic. For example, the Convey HC-1 provides 1,024 DRAM banks to each of four FPGAs through a full crossbar, presenting a peak bandwidth of 76.8 GB/s to the user logic. Such highly parallel memory systems suffer from high latency, and their effective bandwidth is highly sensitive to access ordering. To achieve high performance, the user must use a customized memory interface that combines scheduling, latency hiding, and data reuse. In this article, we describe the design of a custom memory interface for 3D stencil kernels on the Convey HC-1 that incorporates these features. Experimental results show that, compared to a naive memory interface, the proposed memory interface achieves a runtime speedup of 2.2x for a 6-point stencil and 9.5x for a 27-point stencil.
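To make the two kernel shapes concrete, the following is a minimal software sketch of a 6-point and a 27-point 3D stencil sweep. It is not the memory interface described in the article: the uniform averaging weights, the grid layout, and the names `make_grid`, `OFFSETS_6`, `OFFSETS_27`, and `stencil_sweep` are all assumptions chosen for illustration. The sketch counts how many input reads a sweep issues when every neighbor is fetched separately, with no reuse.

```python
from itertools import product


def make_grid(n, fill):
    """Build an n x n x n grid as nested lists, with values from fill(x, y, z)."""
    return [[[fill(x, y, z) for z in range(n)] for y in range(n)] for x in range(n)]


# Hypothetical neighbor sets: the 6-point stencil reads the six face
# neighbors; the 27-point stencil reads the full 3x3x3 cube, center included.
OFFSETS_6 = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
OFFSETS_27 = list(product((-1, 0, 1), repeat=3))


def stencil_sweep(grid, offsets):
    """Average the points at `offsets` around each interior point.

    Boundary points are copied unchanged. Returns (new_grid, reads), where
    `reads` counts every input fetch as if nothing were reused.
    Uniform weights are an assumption, not taken from the article.
    """
    n = len(grid)
    out = [[row[:] for row in plane] for plane in grid]
    reads = 0
    for x in range(1, n - 1):
        for y in range(1, n - 1):
            for z in range(1, n - 1):
                total = 0.0
                for dx, dy, dz in offsets:
                    total += grid[x + dx][y + dy][z + dz]
                    reads += 1
                out[x][y][z] = total / len(offsets)
    return out, reads


grid = make_grid(5, lambda x, y, z: float(x + y + z))
out6, reads6 = stencil_sweep(grid, OFFSETS_6)
out27, reads27 = stencil_sweep(grid, OFFSETS_27)
print(reads6, reads27)
```

On this 5 x 5 x 5 grid there are 27 interior points, so the sweep issues 162 reads for the 6-point stencil and 729 for the 27-point stencil. Without reuse, the 27-point kernel fetches each input value far more often, which is consistent with (though not a confirmed explanation of) the larger speedup the abstract reports for that kernel.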
Memory Interface Design for 3D Stencil Kernels on a Massively Parallel Memory System