Abstract
High-level abstractions separate algorithm design from platform implementation, allowing programmers to focus on algorithms while building complex systems. This separation also gives system programmers and compilers an opportunity to optimize platform services on an application-by-application basis. In field-programmable gate arrays (FPGAs), platform-level malleability extends to the memory system: unlike general-purpose processors, in which memory hardware is fixed at design time, the capacity, associativity, and topology of FPGA memory systems may all be tuned to improve application performance. Since application kernels may explicitly use only a few memory resources, substantial memory capacity may be available to the platform for use on behalf of the user program. In this work, we present Scavenger, which utilizes spare resources to construct program-optimized memories, and we perform an initial exploration of methods for automating the construction of these application-specific memory hierarchies. Although exploiting spare resources can be beneficial, naïvely consuming all memory resources may degrade design frequency. To relieve timing pressure in large block RAM (BRAM) structures, we provide microarchitectural techniques that trade memory latency for design frequency. Across a set of benchmarks, our scalable cache microarchitecture achieves performance gains of 7% to 74% (26% geometric mean) over the baseline cache microarchitecture when scaling first-level caches to their maximum size.
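The tradeoff the abstract describes can be sketched with a simple back-of-the-envelope model. This is an illustration only, not the paper's model or measurements: all workload numbers (compute cycles, access counts, miss rates, frequencies, the `DEPENDENT_FRACTION` of accesses assumed to be latency-critical) are invented. It shows why naïvely enlarging a BRAM cache (lower miss rate, degraded clock) can lose to a pipelined large cache (lower miss rate, extra hit-latency cycles, full clock).

```python
# Hypothetical model of the latency-vs-frequency tradeoff (illustration
# only; all constants are assumptions, not results from the paper).

# Assumed workload: 1M compute cycles and 400K memory accesses, of which
# 30% are latency-critical (dependent loads that cannot be overlapped
# behind pipelined cache hits).
COMPUTE_CYCLES = 1_000_000
ACCESSES = 400_000
DEPENDENT_FRACTION = 0.3
MISS_PENALTY_NS = 100.0  # assumed off-chip miss penalty

def runtime_ns(freq_mhz, hit_cycles, miss_rate):
    """Estimate total runtime: all cycles execute at the design frequency,
    and extra hit-latency cycles are charged only to latency-critical
    accesses, since pipelined hits otherwise overlap."""
    cycle_ns = 1000.0 / freq_mhz
    mem_cycles = ACCESSES * (1 + (hit_cycles - 1) * DEPENDENT_FRACTION)
    return (COMPUTE_CYCLES + mem_cycles) * cycle_ns \
        + ACCESSES * miss_rate * MISS_PENALTY_NS

baseline = runtime_ns(freq_mhz=200, hit_cycles=1, miss_rate=0.10)  # small cache
naive    = runtime_ns(freq_mhz=120, hit_cycles=1, miss_rate=0.02)  # large cache, degraded clock
piped    = runtime_ns(freq_mhz=200, hit_cycles=3, miss_rate=0.02)  # large cache, pipelined

print(f"baseline {baseline/1e6:.1f} ms, "
      f"naive {naive/1e6:.1f} ms, "
      f"pipelined {piped/1e6:.1f} ms")
```

Under these assumed numbers, the naïve large cache runs slower than the small baseline because the degraded clock taxes every cycle, while the pipelined large cache wins: it pays extra hit cycles only on dependent accesses but keeps the full design frequency and the reduced miss rate.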
(FPL 2015) Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies