
Massive atomics for massive parallelism on GPUs

Published: 12 June 2014

Abstract

One important type of parallelism exploited in many applications is reduction-type parallelism. In these applications, the order of the read-modify-write updates to a shared data object can be arbitrary, as long as each read-modify-write update is performed as one indivisible step. The typical way to parallelize such applications is to first let every thread perform local computation and save the results in thread-private data objects, and then merge the results from all worker threads in a reduction stage. All applications that fit the map-reduce framework belong to this category; machine learning, data mining, numerical analysis, and scientific simulation applications may also benefit from reduction-type parallelism. However, parallelization via thread-private data objects may not be viable in massively parallel GPU applications: because the number of concurrent threads is extremely large (at least tens of thousands), creating thread-private data objects may lead to memory-space explosion.
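The thread-private scheme described above can be illustrated with a hypothetical histogram kernel (the kernel names, grid shape, and the 256-bin size are assumptions for illustration, not taken from the paper). Each thread updates its own copy of the bins without synchronization, and a second kernel merges the copies; with T threads and B bins this requires T*B counters, which is the memory explosion the abstract warns about:

```cuda
// Sketch: reduction via thread-private data objects (hypothetical names).
// private_bins must hold gridDim.x * blockDim.x * BINS counters,
// i.e., one full histogram copy per thread.
#define BINS 256

__global__ void histogram_private(const unsigned char *in, int n,
                                  unsigned int *private_bins) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    unsigned int *my_bins = private_bins + (size_t)tid * BINS;
    for (int i = tid; i < n; i += nthreads)
        my_bins[in[i]]++;               // private copy: no synchronization needed
}

// Reduction stage: one thread per bin sums that bin across all copies.
__global__ void merge_private(const unsigned int *private_bins, int nthreads,
                              unsigned int *bins) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= BINS) return;
    unsigned int sum = 0;
    for (int t = 0; t < nthreads; ++t)
        sum += private_bins[(size_t)t * BINS + b];
    bins[b] = sum;
}
```

With, say, 50,000 threads and 256 four-byte bins, the private copies alone occupy about 50 MB, independent of the input size.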

In this paper, we propose a novel approach to shared-data-object management for reduction-type parallelism on GPUs. Our approach exploits fine-grained parallelism while maintaining good programmability, and is based on intrinsic hardware atomic instructions. Atomic operations may appear expensive, since they serialize threads when multiple threads atomically update the same memory object at the same time. However, we discovered that, with appropriate atomic-collision reduction techniques, the atomic implementation can outperform the non-atomic implementation, even for benchmarks known to have high-performance non-atomic GPU implementations. At the same time, the use of atomics greatly reduces coding complexity, since neither thread-private object management nor explicit thread communication (for the shared data objects protected by atomic operations) is necessary.
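One widely used atomic-collision reduction technique of the kind the abstract alludes to is per-block privatization in on-chip shared memory: threads issue atomicAdd on a block-local copy, so most collisions are resolved in fast shared memory and only one global atomic per bin per block reaches DRAM. The sketch below is a minimal illustration under that assumption (names and the 256-bin size are hypothetical, not the paper's code):

```cuda
// Sketch: atomic reduction with per-block shared-memory privatization.
#define BINS 256

__global__ void histogram_atomic(const unsigned char *in, int n,
                                 unsigned int *bins) {
    __shared__ unsigned int block_bins[BINS];

    // Zero the block-local histogram cooperatively.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        block_bins[b] = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads)
        atomicAdd(&block_bins[in[i]], 1u);   // on-chip atomic: cheap collisions
    __syncthreads();

    // Flush: one global atomic per bin per block.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&bins[b], block_bins[b]);
}
```

Note the memory footprint: a single global histogram plus one 1 KB shared-memory copy per resident block, rather than one copy per thread, and no separate merge kernel is needed.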



Published in

ACM SIGPLAN Notices, Volume 49, Issue 11 (ISMM '14)
November 2014, 121 pages
ISSN: 0362-1340, EISSN: 1558-1160
DOI: 10.1145/2775049
Editor: Andy Gill

ISMM '14: Proceedings of the 2014 International Symposium on Memory Management
June 2014, 136 pages
ISBN: 9781450329217
DOI: 10.1145/2602988

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research article
