Abstract
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems, but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of the underlying GPU architecture to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single-threaded CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to the performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
Accelerating CUDA graph algorithms at maximum warp
PPoPP '11: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming