Abstract
The importance of irregular applications such as graph analytics is rapidly growing with the rise of Big Data. However, parallel graph workloads tend to perform poorly on general-purpose chip multiprocessors (CMPs) due to poor cache locality, low compute intensity, frequent synchronization, uneven task sizes, and dynamic task generation. At high thread counts, execution time is dominated by worklist synchronization overhead and cache misses. Researchers have proposed hardware worklist accelerators to address scheduling costs, but these proposals often harden a specific scheduling policy and do not address high cache miss rates. We address this with Minnow, a technique that augments each core in a CMP with a lightweight Minnow accelerator. Minnow engines offload worklist scheduling from worker threads to improve scalability. The engines also perform worklist-directed prefetching, a technique that exploits knowledge of upcoming tasks to issue nearly perfectly accurate and timely prefetch operations. On a simulated 64-core CMP running a parallel graph benchmark suite, Minnow improves scalability and reduces L2 cache misses from 29 to 1.2 MPKI on average, resulting in 6.01x average speedup over an optimized software baseline for only 1% area overhead.
Minnow: Lightweight Offload Engines for Worklist Management and Worklist-Directed Prefetching
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems