research-article

An Event-Triggered Programmable Prefetcher for Irregular Workloads

Published: 19 March 2018

Abstract

Many modern workloads compute on large amounts of data, often with irregular memory accesses. Current architectures perform poorly for these workloads, as existing prefetching techniques cannot capture the memory access patterns; these applications end up heavily memory-bound as a result. Although a number of techniques exist to explicitly configure a prefetcher with traversal patterns, gaining significant speedups, they do not generalise beyond their target data structures. Instead, we propose an event-triggered programmable prefetcher combining the flexibility of a general-purpose computational unit with an event-based programming model, along with compiler techniques to automatically generate events from the original source code with annotations. This allows more complex fetching decisions to be made, without needing to stall when intermediate results are required. Using our programmable prefetching system, combined with small prefetch kernels extracted from applications, we achieve an average 3.0x speedup in simulation for a variety of graph, database and HPC workloads.
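The event-based model described in the abstract can be illustrated with a toy software simulation. This is only a sketch under stated assumptions, not the paper's hardware design or programming interface: the class `ToyEventPrefetcher`, the event names `load_b` and `fill_b`, the base addresses, and the lookahead distance are all hypothetical, and prefetched data is treated as arriving immediately. It models an indirect access `A[B[i]]`: one kernel reacts to the core loading `B[i]` by prefetching a later element of `B`, and a second kernel runs when that element arrives and uses its value to prefetch from `A`, so no kernel waits on an intermediate result.

```python
from collections import deque

B_BASE, A_BASE = 100, 1000   # hypothetical base addresses of arrays B and A
LOOKAHEAD = 2                # hypothetical prefetch distance, in elements


class ToyEventPrefetcher:
    """Toy model of a prefetcher whose kernels run in response to events."""

    def __init__(self, memory):
        self.memory = memory      # address -> value
        self.kernels = {}         # event name -> kernel function
        self.pending = deque()    # queued (event name, address) pairs
        self.issued = []          # prefetched addresses, in issue order

    def on(self, event, kernel):
        self.kernels[event] = kernel

    def raise_event(self, event, addr):
        self.pending.append((event, addr))

    def prefetch(self, addr, fill_event=None):
        if addr not in self.memory:
            return                # ignore out-of-range addresses
        self.issued.append(addr)
        if fill_event is not None:
            # The follow-up kernel is queued rather than waited for.
            self.raise_event(fill_event, addr)

    def run(self):
        while self.pending:
            event, addr = self.pending.popleft()
            self.kernels[event](self, addr, self.memory[addr])


def on_load_b(pf, addr, value):
    # The core loaded B[i]: fetch B[i + LOOKAHEAD] and request an event on arrival.
    pf.prefetch(addr + LOOKAHEAD, fill_event="fill_b")


def on_fill_b(pf, addr, value):
    # B[i + LOOKAHEAD] has arrived: its value indexes A, so prefetch that element.
    pf.prefetch(A_BASE + value)


def demo():
    b, a = [3, 0, 2, 1], [10, 11, 12, 13]
    memory = {B_BASE + i: v for i, v in enumerate(b)}
    memory.update({A_BASE + i: v for i, v in enumerate(a)})
    pf = ToyEventPrefetcher(memory)
    pf.on("load_b", on_load_b)
    pf.on("fill_b", on_fill_b)
    pf.raise_event("load_b", B_BASE)   # the core loads B[0]
    pf.run()
    return pf.issued
```

Calling `demo()` returns the addresses the toy prefetcher issued: first the address of `B[2]`, then the address of `A[B[2]]`. Splitting the work into two small kernels, each triggered by its own event, is what lets the second address be computed only once the first value is available.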

Published in

ACM SIGPLAN Notices, Volume 53, Issue 2 (ASPLOS '18), February 2018, 809 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3296957

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, March 2018, 827 pages
ISBN: 9781450349116
DOI: 10.1145/3173162

Copyright © 2018 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
