Quantifying Data Locality in Dynamic Parallelism in GPUs

Published: 21 December 2018

Abstract

GPUs are becoming prevalent in various domains of computing and are widely used for streaming (regular) applications. However, they are highly inefficient when executing irregular applications with unstructured inputs, due to load imbalance. Dynamic parallelism (DP) is a feature of emerging GPUs that allows new kernels to be generated and scheduled from the device side (GPU) without host-side (CPU) intervention, thereby increasing parallelism. To support DP efficiently, one of the major challenges is to saturate the GPU processing elements and provide them with the required data in a timely fashion. There have been considerable efforts focused on exploiting data locality in GPUs. However, there is a lack of quantitative analysis of how irregular applications using dynamic parallelism behave in terms of data reuse. In this paper, we quantitatively analyze the data reuse of dynamic applications at three granularities of schedulable units: kernel, work-group, and wavefront. We observe that, for DP applications, data reuse is highly irregular and heavily dependent on the application and its input. Thus, existing techniques cannot exploit data reuse effectively for DP applications. To this end, we first conduct a limit study on the performance improvements achievable by hardware schedulers that are provided with accurate data reuse information. This limit study shows that, on average, performance improves by 19.4% over the baseline scheduler. Based on the key observations from the quantitative analysis of our DP applications, we then propose LASER, a Locality-Aware SchedulER, in which the hardware schedulers employ data reuse monitors to help make scheduling decisions that improve data locality at runtime. Our experimental results on 16 benchmarks show that LASER improves performance by 11.3%, on average.
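The abstract's quantitative analysis centers on data reuse. A standard way to quantify reuse is the reuse (stack) distance: for each memory access, the number of distinct addresses touched since the previous access to the same address. The sketch below is illustrative only, not the paper's implementation; the function name and LRU-stack approach are our own assumptions.

```python
def reuse_distances(trace):
    """For each access in `trace`, return the number of distinct
    addresses accessed since the previous access to the same address
    (infinity for a cold, first-time access)."""
    stack = []   # LRU stack: most recently used address at the end
    dists = []
    for addr in trace:
        if addr in stack:
            # Distinct addresses accessed since the last use of `addr`
            # are exactly those above it on the LRU stack.
            d = len(stack) - 1 - stack.index(addr)
            stack.remove(addr)
        else:
            d = float('inf')  # cold access: no prior use
        stack.append(addr)    # `addr` becomes most recently used
        dists.append(d)
    return dists
```

A trace like `['a', 'b', 'a']` yields distances `[inf, inf, 1]`: the second access to `a` saw one distinct intervening address (`b`). Short distances indicate reuse that a scheduler can exploit by co-scheduling the work-groups or wavefronts that share data.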

