Abstract
Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. However, sort does more work than necessary to implement multisplit, and is thus inefficient. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small number of buckets. In our implementations, we exploit the computational hierarchy of the GPU to perform most of the work locally, with minimal usage of global operations. We also use warp-synchronous programming models to avoid branch divergence and reduce memory usage, as well as hierarchical reordering of input elements to achieve better coalescing of global memory accesses. On an NVIDIA K40c GPU, for key-only (key-value) multisplit, we demonstrate a 3.0-6.7x (4.4-8.0x) speedup over radix sort, and achieve a peak throughput of 10.0 G keys/s.
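To make the definition concrete, the following is a minimal sequential sketch of what multisplit computes. It is not the GPU implementation described in this work. The function name `multisplit`, the `bucket_of` callback, and the example bucket function `k % 3` are illustrative assumptions. The sketch uses the common histogram, exclusive prefix sum, and scatter decomposition, and it happens to keep the input order within each bucket.

```python
from itertools import accumulate
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def multisplit(
    keys: Sequence[T],
    bucket_of: Callable[[T], int],  # programmer-supplied bucket function (hypothetical name)
    num_buckets: int,
) -> Tuple[List[T], List[int]]:
    """Sequential reference: permute keys into contiguous buckets.

    Returns the permuted keys and the starting offset of each bucket.
    """
    # Step 1: histogram, counting how many keys fall into each bucket.
    counts = [0] * num_buckets
    for k in keys:
        counts[bucket_of(k)] += 1

    # Step 2: exclusive prefix sum of the counts gives each bucket's start.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Step 3: scatter each key to the next free slot of its bucket.
    cursor = offsets.copy()
    out: List[T] = [None] * len(keys)  # type: ignore[list-item]
    for k in keys:
        b = bucket_of(k)
        out[cursor[b]] = k
        cursor[b] += 1
    return out, offsets


# Example with an assumed bucket function: three buckets chosen by k % 3.
keys = [7, 2, 9, 4, 1, 8, 3]
result, offsets = multisplit(keys, lambda k: k % 3, 3)
print(result)   # [9, 3, 7, 4, 1, 2, 8]
print(offsets)  # [0, 2, 5]
```

Within a bucket the keys are left unordered relative to their values (bucket 1 holds 7, 4, 1), which is the work a full sort would additionally perform and multisplit avoids.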