Abstract
Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting solution is software prefetching, where special non-blocking loads bring data into the cache hierarchy just before it is required. However, such prefetches are difficult to insert in a way that reliably improves performance, and techniques for automatic insertion are currently limited.
This article develops a novel compiler pass to automatically generate software prefetches for indirect memory accesses, a special class of irregular memory accesses often seen in high-performance workloads. We evaluate this across a wide set of systems, all of which benefit from the technique. We then evaluate the extent to which good prefetch instructions are architecture dependent and which classes of programs are particularly amenable to the technique. Across a set of memory-bound benchmarks, our automated pass achieves average speedups of 1.3× for an Intel Haswell processor, 1.1× for both an ARM Cortex-A57 and Qualcomm Kryo, 1.2× for a Cortex-A72 and an Intel Kaby Lake, and 1.35× for an Intel Xeon Phi Knights Landing, each of which is an out-of-order core, and performance improvements of 2.1× and 2.7× for the in-order ARM Cortex-A53 and first-generation Intel Xeon Phi.
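To make the term concrete, an indirect memory access is one whose address depends on a value loaded from another array, as in `values[index[i]]`. The following is a minimal hand-written sketch of a software prefetch for such an access, not the output of the compiler pass described above. It assumes a GCC- or Clang-compatible compiler providing `__builtin_prefetch`; the function name `indirect_sum` and the constant `PREFETCH_DISTANCE` are hypothetical, and the distance of 64 iterations is an arbitrary placeholder that would need tuning per machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical look-ahead distance in loop iterations (assumption:
 * an arbitrary placeholder, a real value would be tuned per machine). */
#define PREFETCH_DISTANCE 64

/* Sum values[index[i]] for i in [0, n). The load from `values` is an
 * indirect access: its address depends on data loaded from `index`. */
int64_t indirect_sum(const int32_t *values, const uint32_t *index, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        /* The look-ahead read of index[] is a real load, so it must
         * stay in bounds; the prefetch itself is only a hint. */
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, 0 = read access, 3 = high temporal locality. */
            __builtin_prefetch(&values[index[i + PREFETCH_DISTANCE]], 0, 3);
        }
        total += values[index[i]];
    }
    return total;
}
```

The prefetch does not change the result of the loop; it only requests that the cache line holding a future element of `values` be fetched early. The bounds check is the part that is easy to get wrong by hand, since reading `index` past its end is an ordinary load that can fault.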
Software Prefetching for Indirect Memory Accesses: A Microarchitectural Perspective