Abstract
One important type of parallelism exploited in many applications is reduction-type parallelism. In these applications, the order of the read-modify-write updates to a shared data object can be arbitrary, as long as each read-modify-write update is performed as an indivisible unit. The typical way to parallelize such applications is to let every thread first perform its local computation and save the results in thread-private data objects, and then merge the results from all worker threads in a reduction stage. All applications that fit the MapReduce framework belong to this category, and machine learning, data mining, numerical analysis, and scientific simulation applications may also benefit from reduction-type parallelism. However, this parallelization scheme, based on thread-private data objects, may not be viable in massively parallel GPU applications: because the number of concurrent threads is extremely large (at least tens of thousands), creating thread-private data objects can lead to memory space explosion.
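The privatization-plus-merge scheme described above can be sketched as follows, using a histogram as the running example. This is an illustration under our own assumptions, not the paper's code: CPU threads stand in for GPU threads, and the function and variable names are ours.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of the privatization scheme: each worker accumulates into a
// thread-private histogram (no contention), then a reduction stage
// merges all private copies into the final shared result.
std::vector<int> histogram_privatized(const std::vector<int>& data,
                                      int nbins, int nthreads) {
    // One private histogram per thread.
    std::vector<std::vector<int>> priv(nthreads, std::vector<int>(nbins, 0));
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += nthreads)
                ++priv[t][data[i] % nbins];  // private update: no synchronization
        });
    }
    for (auto& w : workers) w.join();

    // Reduction stage: merge the thread-private copies.
    std::vector<int> result(nbins, 0);
    for (int t = 0; t < nthreads; ++t)
        for (int b = 0; b < nbins; ++b)
            result[b] += priv[t][b];
    return result;
}
```

On a GPU with tens of thousands of threads, the `priv` array in this sketch would need one copy per thread, which is exactly the memory-explosion problem the abstract points out.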
In this paper, we propose a novel approach to shared data object management for reduction-type parallelism on GPUs. Our approach exploits fine-grained parallelism while maintaining good programmability, and is based on the use of intrinsic hardware atomic instructions. Atomic operations may appear expensive, since they serialize threads that atomically update the same memory object at the same time. However, we discovered that, with appropriate atomic-collision reduction techniques, an atomic implementation can outperform a non-atomic implementation, even for benchmarks known to have high-performance non-atomic GPU implementations. At the same time, the use of atomics greatly reduces coding complexity, as neither thread-private object management nor explicit thread communication (for the shared data objects protected by atomic operations) is necessary.
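For contrast with the privatization scheme, the atomic alternative can be sketched as follows. Again this is only an illustration under our own assumptions: `std::atomic` and CPU threads stand in for the GPU's hardware `atomicAdd` and its thread grid, and the names are ours.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of the atomic scheme: all workers update one shared histogram
// directly via atomic read-modify-write. No thread-private copies and
// no explicit merge stage are needed; on a GPU this update would be a
// hardware atomicAdd.
std::vector<int> histogram_atomic(const std::vector<int>& data,
                                  int nbins, int nthreads) {
    // vector(n) value-initializes each std::atomic<int> to zero.
    std::vector<std::atomic<int>> bins(nbins);
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += nthreads)
                // Atomic update: colliding threads serialize on this bin,
                // but correctness needs no other coordination.
                bins[data[i] % nbins].fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();

    std::vector<int> result(nbins);
    for (int b = 0; b < nbins; ++b) result[b] = bins[b].load();
    return result;
}
```

The memory footprint here is one shared histogram regardless of thread count; the cost moves from storage to contention, which is what the paper's atomic-collision reduction techniques target.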
Massive atomics for massive parallelism on GPUs. In ISMM '14: Proceedings of the 2014 International Symposium on Memory Management.