Abstract
Many modern workloads compute on large amounts of data, often with irregular memory accesses. Current architectures perform poorly for these workloads, as existing prefetching techniques cannot capture the memory access patterns; these applications end up heavily memory-bound as a result. Although a number of techniques exist to explicitly configure a prefetcher with traversal patterns, gaining significant speedups, they do not generalise beyond their target data structures. Instead, we propose an event-triggered programmable prefetcher combining the flexibility of a general-purpose computational unit with an event-based programming model, along with compiler techniques to automatically generate events from the original source code with annotations. This allows more complex fetching decisions to be made, without needing to stall when intermediate results are required. Using our programmable prefetching system, combined with small prefetch kernels extracted from applications, we achieve an average 3.0x speedup in simulation for a variety of graph, database and HPC workloads.
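The event-based model described above can be illustrated with a toy sketch for an indirect access pattern of the form `data[index[i]]`. This is not the authors' implementation: the class `ToyPrefetcher`, its methods, the `lookahead` parameter, and the event name `index_filled` are all hypothetical, and the model ignores timing, cache capacity, and memory latency. It only shows the structure: one short kernel runs when the core's load is observed, and a second runs when the prefetched index value arrives, so no kernel ever stalls waiting for an intermediate result.

```python
from collections import deque


class ToyPrefetcher:
    """Toy event-triggered prefetcher (hypothetical; no timing is modelled)."""

    def __init__(self, index, lookahead=4):
        self.index = index            # the index array used in data[index[i]]
        self.lookahead = lookahead    # how far ahead of the core to prefetch
        self.events = deque()         # pending (kind, value) events
        self.prefetched = set()       # everything prefetched so far

    def on_core_load(self, i):
        # Event 1: the core loaded index[i]; request index[i + lookahead].
        j = i + self.lookahead
        if j < len(self.index):
            self.prefetched.add(("index", j))
            # The arrival of that value is itself an event, so this kernel
            # returns immediately instead of waiting for the data.
            self.events.append(("index_filled", self.index[j]))

    def run_pending(self):
        # Event 2: an index value arrived; use it to prefetch the data element.
        while self.events:
            kind, value = self.events.popleft()
            if kind == "index_filled":
                self.prefetched.add(("data", value))


def simulate(index, lookahead=4):
    """Count core accesses to data[index[i]] that were prefetched beforehand."""
    pf = ToyPrefetcher(index, lookahead)
    hits = 0
    for i in range(len(index)):
        pf.on_core_load(i)
        pf.run_pending()
        if ("data", index[i]) in pf.prefetched:
            hits += 1
    return hits


if __name__ == "__main__":
    print(simulate([7, 3, 9, 1, 5, 8, 2, 6], lookahead=4))
```

In this sketch the first `lookahead` accesses miss because nothing has run ahead of them yet; every later access finds its data element already requested, provided the index values are distinct.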
An Event-Triggered Programmable Prefetcher for Irregular Workloads
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems