Abstract
In this paper, we present FlexBFS, a parallelism-aware implementation of breadth-first search on GPUs. Our implementation dynamically adjusts the computation resources it uses according to feedback on the available parallelism. We also optimize the program in three ways: (1) simplified two-level queue management, (2) a combined kernel strategy, and (3) a specialization approach for high-degree vertices. Our experimental results show that FlexBFS achieves a 3x to 20x speedup over the fastest serial version, and that it outperforms both the TBB-based multi-threaded CPU version and the most effective previous GPU version on all types of input graphs.
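To make the idea of parallelism-aware resource adjustment concrete, the following is a minimal CPU-side sketch in Python. It is not the FlexBFS implementation and does not model GPU kernels or the two-level queue. It only illustrates a level-synchronous BFS in which the number of (simulated) workers is chosen per level from the frontier size. The names `pick_workers` and `bfs_levels`, the parameters `max_workers` and `items_per_worker`, and the sizing policy are all assumptions introduced for illustration.

```python
from math import ceil


def pick_workers(frontier_size, max_workers=8, items_per_worker=2):
    """Hypothetical policy: scale the worker count with the frontier size,
    using at least one worker and at most max_workers."""
    return max(1, min(max_workers, ceil(frontier_size / items_per_worker)))


def bfs_levels(adj, source, max_workers=8, items_per_worker=2):
    """Level-synchronous BFS over an adjacency-list dict.

    Returns (levels, workers_per_level), where levels maps each reachable
    vertex to its BFS depth and workers_per_level records how many
    simulated workers were assigned at each level.
    """
    levels = {source: 0}
    frontier = [source]
    workers_per_level = []
    depth = 0
    while frontier:
        # Feedback step: the current frontier size is the available parallelism.
        workers = pick_workers(len(frontier), max_workers, items_per_worker)
        workers_per_level.append(workers)
        # Split the frontier into one contiguous chunk per simulated worker.
        chunk = ceil(len(frontier) / workers)
        next_frontier = []
        for w in range(workers):
            for v in frontier[w * chunk:(w + 1) * chunk]:
                for u in adj[v]:
                    if u not in levels:
                        levels[u] = depth + 1
                        next_frontier.append(u)
        frontier = next_frontier
        depth += 1
    return levels, workers_per_level


# Small example graph (assumed for illustration).
adj = {0: [1, 2, 3, 4], 1: [5], 2: [5], 3: [6], 4: [], 5: [], 6: []}
levels, workers_per_level = bfs_levels(adj, 0)
```

In this sketch the workers run sequentially, so no synchronization is needed. A real GPU version would have to handle concurrent updates to the visited set and the next-level queue.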
FlexBFS: a parallelism-aware implementation of breadth-first search on GPU
PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming