
HARP: Harnessing inactive threads in many-core processors

Published: 28 March 2014

Abstract

SIMT accelerators are equipped with thousands of computational resources. Conventional accelerators, however, fail to fully utilize these resources due to branch and memory divergence. This underutilization manifests as two underlying inefficiencies: pipeline width underutilization and pipeline depth underutilization. Width underutilization occurs when SIMD execution units are not entirely utilized due to branch divergence; this reduces lane activity and results in SIMD inefficiency. Depth underutilization takes place when the pipeline runs out of active threads and is forced to leave pipeline stages idle. This work addresses both inefficiencies by harnessing inactive threads available to the pipeline. We introduce Harnessing inActive thReads in many-core Processors (or simply HARP) to improve width and depth utilization in accelerators. We show how using inactive yet ready threads can enhance performance. Moreover, we investigate implementation details and study the microarchitectural changes needed to build a HARP-enhanced accelerator. Furthermore, we evaluate HARP under a variety of microarchitectural design points, measure its area overhead, and compare it to conventional alternatives. On Fermi-like GPUs, HARP provides a 10% speedup on average (up to 1.6X) at the cost of 3.5% area overhead. Our analysis shows that HARP performs better under narrower SIMD and shorter pipelines.
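The width underutilization described above can be made concrete with a toy model (not from the paper; all names and numbers here are illustrative): when a warp diverges at a branch, the taken and not-taken paths execute serially, and in each pass only the threads on that path occupy SIMD lanes.

```python
# Toy model of SIMD width underutilization under branch divergence.
# A warp of WIDTH threads hits a branch; the two paths are serialized,
# and in each pass only the threads on that path do useful work.

WIDTH = 32  # SIMD width (lanes per warp), Fermi-like

def simd_efficiency(active_mask, taken_cycles, not_taken_cycles):
    """Fraction of issued lane-cycles doing useful work across a branch."""
    taken = sum(active_mask)                 # threads on the taken path
    not_taken = WIDTH - taken                # threads on the other path
    useful = taken * taken_cycles + not_taken * not_taken_cycles
    issued = WIDTH * (taken_cycles + not_taken_cycles)  # serialized paths
    return useful / issued

# Example: both paths take 10 cycles, split 8 vs. 24 threads.
mask = [1] * 8 + [0] * 24
print(simd_efficiency(mask, 10, 10))  # 0.5: half the lane-cycles are idle
```

Whenever both sides of a branch are non-empty, efficiency drops below 1.0, which is the slack HARP aims to reclaim by filling the idle lanes with inactive yet ready threads.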

