Abstract
The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel's SSE4.2 vector units, as well as accelerators using Intel's AVX-512 units.
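To make the idea concrete, here is a minimal sketch (not the paper's actual transformation) of how a recursive, task-parallel computation can be re-expressed in a level-by-level, data-parallel form: instead of executing one recursive call at a time, all live tasks at a given depth are collected into a frontier, and the same operation is applied to every element of that frontier, which is the shape a vector unit can execute. The names `frontier` and `tree_sum_breadth_first` are illustrative, not from the paper.

```python
def tree_sum_recursive(node):
    """Baseline divide-and-conquer form: each call is a task."""
    if isinstance(node, int):
        return node
    left, right = node
    return tree_sum_recursive(left) + tree_sum_recursive(right)

def tree_sum_breadth_first(root):
    """Transformed form: process all same-depth tasks together.

    Every iteration of the outer loop applies one uniform operation
    (test leaf / reduce / expand) across the whole frontier, exposing
    the latent data parallelism as a vectorizable loop.
    """
    frontier = [root]
    total = 0
    while frontier:
        next_frontier = []
        for node in frontier:               # conceptually one vector step
            if isinstance(node, int):
                total += node               # leaf task: reduce
            else:
                next_frontier.extend(node)  # interior task: spawn children
        frontier = next_frontier
    return total

# Both forms compute the same result on an example tree.
tree = ((1, 2), (3, (4, 5)))
assert tree_sum_recursive(tree) == tree_sum_breadth_first(tree) == 15
```

In a real implementation the frontier would hold fixed-width blocks of task records so each step maps onto SIMD lanes; the scheduling policies the abstract mentions govern how large that frontier is allowed to grow.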
Efficient execution of recursive programs on commodity vector hardware. PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation.