LaPerm: locality aware scheduler for dynamic parallelism on GPUs

Published: 18 June 2016

Abstract

Recent developments in GPU execution models and architectures have introduced dynamic parallelism to facilitate the execution of irregular applications, where control flow and memory behavior can be unstructured, time-varying, and hierarchical. The changes brought about by this extension to the traditional bulk synchronous parallel (BSP) model also create new challenges in exploiting the current GPU memory hierarchy. One of the major challenges is that the reference locality that exists between the parent and child thread blocks (TBs) created during dynamic nested kernel and thread block launches cannot be fully leveraged using current TB scheduling strategies. These strategies were designed for the current implementations of the BSP model but fall short when dynamic parallelism is introduced, since they are oblivious to this hierarchical reference locality.

We propose LaPerm, a new locality-aware TB scheduler that exploits such parent-child locality, both spatial and temporal. LaPerm adopts three different scheduling decisions to i) prioritize the execution of child TBs, ii) bind them to the streaming multiprocessors (SMXs) occupied by their parent TBs, and iii) maintain workload balance across compute units. Experiments with a set of irregular CUDA applications employing dynamic parallelism, executed on a cycle-level simulator, demonstrate that LaPerm achieves an average of 27% performance improvement over the baseline round-robin TB scheduler commonly used in modern GPUs.
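The three scheduling decisions above can be illustrated with a small software sketch. This is not the paper's hardware implementation, only a hypothetical model of the policy: child TBs are dispatched before pending parent-level TBs, a child is preferentially bound to the SMX that ran its parent, and when that SMX is full the TB falls back to the least-loaded SMX to preserve balance. All class and method names here are illustrative.

```python
# Hypothetical model of a LaPerm-style TB scheduling policy (not the paper's
# actual hardware design). It demonstrates the three decisions described in
# the abstract: child-first dispatch, parent-SM binding, and load balancing.
from dataclasses import dataclass
from typing import Optional, Tuple, List

@dataclass
class ThreadBlock:
    tb_id: int
    parent_sm: Optional[int] = None  # SMX that ran the parent TB, if any

class LaPermLikeScheduler:
    def __init__(self, num_sms: int, slots_per_sm: int):
        self.load = [0] * num_sms               # resident TBs per SMX
        self.slots = slots_per_sm               # TB capacity of each SMX
        self.child_queue: List[ThreadBlock] = []   # decision (i): served first
        self.parent_queue: List[ThreadBlock] = []

    def submit(self, tb: ThreadBlock) -> None:
        # Dynamically launched (child) TBs carry their parent's SMX id.
        if tb.parent_sm is not None:
            self.child_queue.append(tb)
        else:
            self.parent_queue.append(tb)

    def dispatch(self) -> Optional[Tuple[int, int]]:
        """Pop the next TB and return (tb_id, chosen_sm); None if idle or full."""
        queue = self.child_queue or self.parent_queue
        if not queue:
            return None
        tb = queue[0]
        # Decision (ii): bind a child to its parent's SMX when a slot is free,
        # so it can reuse data the parent left in that SMX's cache.
        if tb.parent_sm is not None and self.load[tb.parent_sm] < self.slots:
            sm = tb.parent_sm
        else:
            # Decision (iii): otherwise pick the least-loaded SMX.
            sm = min(range(len(self.load)), key=self.load.__getitem__)
            if self.load[sm] >= self.slots:
                return None  # every SMX is saturated
        queue.pop(0)
        self.load[sm] += 1
        return tb.tb_id, sm
```

For example, with two SMXs holding one TB each, a child whose parent ran on SMX 0 is dispatched first and lands on SMX 0, while a waiting parent-level TB is balanced onto SMX 1.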

