HARP: Harnessing Inactive Threads in Many-Core Processors

Abstract
SIMT accelerators are equipped with thousands of computational resources. Conventional accelerators, however, fail to fully utilize these resources in the face of branch and memory divergence. This underutilization manifests as two underlying inefficiencies: pipeline width underutilization and pipeline depth underutilization. Width underutilization occurs when branch divergence leaves some SIMD lanes idle, reducing lane activity and SIMD efficiency. Depth underutilization occurs when the pipeline runs out of active threads and is forced to leave pipeline stages idle. This work addresses both inefficiencies by harnessing the inactive threads available to the pipeline. We introduce Harnessing inActive thReads in many-core Processors (or simply HARP) to improve width and depth utilization in accelerators. We show how issuing inactive yet ready threads can enhance performance. Moreover, we investigate implementation details and study the microarchitectural changes needed to build a HARP-enhanced accelerator. Furthermore, we evaluate HARP under a variety of microarchitectural design points, measure its area overhead, and compare it to conventional alternatives. On Fermi-like GPUs, we show that HARP provides a 10% speedup on average (maximum of 1.6X) at the cost of 3.5% area overhead. Our analysis shows that HARP performs better under narrower SIMD widths and shorter pipelines.
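To make the width-underutilization problem concrete, the following sketch models a single divergent branch in a lock-step SIMT warp. This is an illustrative model written for this summary, not HARP's mechanism or a real GPU simulator: it assumes both branch paths are one instruction long and that the taken and not-taken paths serialize, as in conventional stack-based reconvergence.

```python
# Illustrative model (not HARP itself) of SIMD "width underutilization":
# under lock-step SIMT execution, a divergent branch serializes the taken
# and not-taken paths, so some lanes idle on every issued instruction.

def simd_efficiency(width: int, taken: int) -> float:
    """Fraction of useful lane-cycles when a `width`-lane warp diverges,
    with `taken` lanes on one path and `width - taken` on the other.
    Assumes both paths are one instruction long."""
    path_counts = [n for n in (taken, width - taken) if n > 0]
    useful_lane_cycles = sum(path_counts)          # active lanes on each serialized path
    issued_lane_cycles = len(path_counts) * width  # every issue occupies all lanes
    return useful_lane_cycles / issued_lane_cycles

print(simd_efficiency(32, 32))  # 1.0 (no divergence: all lanes active)
print(simd_efficiency(32, 16))  # 0.5 (each path leaves half the lanes idle)
print(simd_efficiency(32, 1))   # 0.5 (1/32 active, then 31/32 active)
```

In this simplified model any divergence of a one-instruction branch halves efficiency, regardless of how the lanes split; the idle lane slots on each serialized path are exactly the resource that techniques like HARP aim to fill with other ready threads.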