research-article
Public Access

Efficient execution of recursive programs on commodity vector hardware

Published: 03 June 2015

Abstract

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel’s SSE4.2 vector units, as well as accelerators using Intel’s AVX512 units.
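The core idea of the abstract — that recursive, task-parallel programs contain latent data parallelism — can be illustrated with a small sketch. The code below is a hypothetical, simplified illustration (not the paper's actual transformation): it rewrites a recursive Fibonacci computation as a worklist of independent subproblems processed in batches of `W` tasks, where each batch plays the role of one SIMD step across `W` vector lanes.

```python
# Hypothetical sketch of the latent-data-parallelism idea: a recursive,
# task-parallel computation (fib) re-expressed as a worklist whose pending
# tasks are processed in batches of W, mimicking W vector lanes.
# This is an illustration only, not the transformation from the paper.

W = 8  # simulated vector width (number of "lanes")

def fib_vectorized(n):
    total = 0
    work = [n]                             # pending subproblems ("tasks")
    while work:
        batch, work = work[:W], work[W:]   # grab up to W tasks at once
        for m in batch:                    # conceptually one SIMD step
            if m < 2:
                total += m                 # base case contributes its value
            else:
                work.append(m - 1)         # "spawn" the two recursive calls
                work.append(m - 2)
    return total

print(fib_vectorized(10))  # → 55
```

Because every task in a batch runs the same code on different data, the inner loop over `batch` is exactly the shape a vector unit executes efficiently; the scheduling question the paper addresses is how to keep such batches full while bounding the size of `work`.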



Published in

ACM SIGPLAN Notices, Volume 50, Issue 6 (PLDI '15), June 2015, 630 pages. ISSN: 0362-1340; EISSN: 1558-1160; DOI: 10.1145/2813885. Editor: Andy Gill.

PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2015, 630 pages. ISBN: 9781450334686; DOI: 10.1145/2737924.

Copyright © 2015 ACM. Publisher: Association for Computing Machinery, New York, NY, United States.
