LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs (ISCA '16)

Abstract
Recent developments in GPU execution models and architectures have introduced dynamic parallelism to facilitate the execution of irregular applications, where control flow and memory behavior can be unstructured, time-varying, and hierarchical. The changes brought about by this extension to the traditional bulk synchronous parallel (BSP) model also create new challenges in exploiting the current GPU memory hierarchy. One of the major challenges is that the reference locality that exists between parent and child thread blocks (TBs), created during dynamic nested kernel and thread block launches, cannot be fully leveraged by current TB scheduling strategies. These strategies were designed for existing implementations of the BSP model but fall short when dynamic parallelism is introduced, since they are oblivious to this hierarchical reference locality.
We propose LaPerm, a new locality-aware TB scheduler that exploits such parent-child locality, both spatial and temporal. LaPerm makes three scheduling decisions: i) prioritize the execution of child TBs, ii) bind them to the streaming multiprocessors (SMXs) occupied by their parent TBs, and iii) maintain workload balance across compute units. Experiments with a set of irregular CUDA applications employing dynamic parallelism, executed on a cycle-level simulator, demonstrate that LaPerm achieves an average 27% performance improvement over the baseline round-robin TB scheduler commonly used in modern GPUs.
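The three scheduling decisions above can be sketched as a small priority-queue simulation. This is an illustrative model, not the paper's implementation: the class name, the overload threshold, and the TB/SMX identifiers are all hypothetical, and a real hardware scheduler would operate on per-SMX queues rather than Python objects.

```python
import heapq

class LocalityAwareScheduler:
    """Hypothetical sketch of the three LaPerm scheduling decisions:
    1. prioritize child TBs over parent TBs,
    2. bind a child TB to the SMX that ran its parent,
    3. fall back to the least-loaded SMX to keep work balanced."""

    def __init__(self, num_smx):
        self.load = [0] * num_smx   # outstanding TBs per SMX
        self.parent_smx = {}        # tb_id -> SMX that ran its parent
        self.queue = []             # min-heap: (priority, seq, tb_id)
        self.seq = 0                # tie-breaker preserving submission order

    def submit(self, tb_id, is_child=False, parent_smx=None):
        # Decision (i): child TBs get higher priority (lower number).
        prio = 0 if is_child else 1
        if parent_smx is not None:
            self.parent_smx[tb_id] = parent_smx
        heapq.heappush(self.queue, (prio, self.seq, tb_id))
        self.seq += 1

    def dispatch(self):
        """Pop the highest-priority TB and choose an SMX for it."""
        if not self.queue:
            return None
        _, _, tb_id = heapq.heappop(self.queue)
        preferred = self.parent_smx.get(tb_id)
        least_loaded = min(range(len(self.load)), key=self.load.__getitem__)
        # Decision (ii): prefer the parent's SMX while it is not overloaded;
        # decision (iii): otherwise fall back to the least-loaded SMX.
        if preferred is not None and self.load[preferred] <= self.load[least_loaded] + 1:
            smx = preferred
        else:
            smx = least_loaded
        self.load[smx] += 1
        return tb_id, smx
```

With this sketch, a child TB submitted after a parent is still dispatched first, and it lands on its parent's SMX unless that SMX has fallen well behind the least-loaded one, which captures the tension between locality (decision ii) and balance (decision iii) in a few lines.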