research-article

Accelerating CUDA graph algorithms at maximum warp

Published: 12 February 2011

Abstract

Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems, but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method achieves a speedup of up to 9x over previous GPU algorithms and 12x over single-threaded CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
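The trade-off between workload imbalance and ALU underutilization can be illustrated with a toy cost model. The sketch below is not the paper's implementation or its evaluation method. It assumes, purely for illustration, that each vertex is assigned to a group of `virtual_warp` lanes, that those lanes stride over the vertex's neighbor list together, and that a 32-lane physical warp finishes only when its slowest virtual warp does. The function name `simd_steps` and the example degree lists are hypothetical.

```python
from math import ceil

PHYSICAL_WARP = 32  # lanes per hardware warp on NVIDIA GPUs


def simd_steps(degrees, virtual_warp, physical_warp=PHYSICAL_WARP):
    """Count lockstep neighbor-loop iterations under a toy cost model.

    Assumptions (illustrative only, not taken from the paper):
    - each vertex is handled by `virtual_warp` lanes that split its
      neighbor list, so it needs ceil(degree / virtual_warp) iterations;
    - a physical warp holds physical_warp // virtual_warp vertices and
      runs until the slowest of them is done.
    """
    if physical_warp % virtual_warp != 0:
        raise ValueError("virtual_warp must divide physical_warp")
    vertices_per_warp = physical_warp // virtual_warp
    total = 0
    for start in range(0, len(degrees), vertices_per_warp):
        group = degrees[start:start + vertices_per_warp]
        # Lockstep execution: the whole warp waits for its longest loop.
        total += max(ceil(d / virtual_warp) for d in group)
    return total


# One high-degree hub among low-degree vertices (irregular graph).
irregular = [512] + [2] * 63
# Every vertex has the same small degree (regular graph).
regular = [2] * 64

for w in (1, 32):
    print(w, simd_steps(irregular, w), simd_steps(regular, w))
# prints:
# 1 514 4
# 32 79 64
```

Under these assumptions, a virtual warp size of 1 (one thread per vertex) stalls 31 lanes behind the hub on the irregular input, while a size of 32 removes that imbalance but leaves most lanes idle on the regular input. Intermediate sizes sit between the two extremes, which is the kind of tuning knob the abstract describes.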



        • Published in

          ACM SIGPLAN Notices  Volume 46, Issue 8
          PPoPP '11
          August 2011
          300 pages
          ISSN:0362-1340
          EISSN:1558-1160
          DOI:10.1145/2038037
          • ACM Conferences
            PPoPP '11: Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
            February 2011
            326 pages
            ISBN:9781450301190
            DOI:10.1145/1941553
            • General Chair: Calin Cascaval
            • Program Chair: Pen-Chung Yew

          Copyright © 2011 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

