GPU multisplit

Published: 27 February 2016
Abstract

Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. However, sort does more work than necessary to implement multisplit, and is thus inefficient. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small number of buckets. In our implementations, we exploit the computational hierarchy of the GPU to perform most of the work locally, with minimal usage of global operations. We also use warp-synchronous programming models to avoid branch divergence and reduce memory usage, as well as hierarchical reordering of input elements to achieve better coalescing of global memory accesses. On an NVIDIA K40c GPU, for key-only (key-value) multisplit, we demonstrate a 3.0-6.7x (4.4-8.0x) speedup over radix sort, and achieve a peak throughput of 10.0 G keys/s.
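To make the primitive's semantics concrete, the following is a minimal sequential reference of multisplit, not the paper's GPU algorithm: it uses the same histogram/prefix-sum/scatter structure the abstract alludes to, with a caller-supplied `bucket_of` function. All names here are illustrative, not from the paper.

```python
def multisplit(keys, bucket_of, num_buckets):
    """Stable multisplit: permute keys so bucket 0's elements come first,
    then bucket 1's, and so on. Sequential CPU sketch for semantics only."""
    # 1. Histogram: count how many keys fall into each bucket.
    counts = [0] * num_buckets
    for k in keys:
        counts[bucket_of(k)] += 1
    # 2. Exclusive prefix sum over counts gives each bucket's start offset.
    offsets = [0] * num_buckets
    for b in range(1, num_buckets):
        offsets[b] = offsets[b - 1] + counts[b - 1]
    # 3. Scatter each key to its bucket's next free slot (preserves order
    #    within a bucket, so the split is stable).
    out = [None] * len(keys)
    for k in keys:
        b = bucket_of(k)
        out[offsets[b]] = k
        offsets[b] += 1
    return out

# Example: two buckets, even keys before odd keys.
print(multisplit([5, 2, 8, 3, 6], lambda k: k & 1, 2))  # [2, 8, 6, 5, 3]
```

Note that, unlike a sort, nothing orders keys *within* a bucket; this is the extra work a sort-based multisplit pays for unnecessarily.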


  • Published in

    ACM SIGPLAN Notices, Volume 51, Issue 8 (PPoPP '16), August 2016, 405 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/3016078

    PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2016, 420 pages
    ISBN: 9781450340922
    DOI: 10.1145/2851141

    Copyright © 2016 ACM

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Qualifiers

    • research-article
