Research article

Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

Published: 23 February 2013

Abstract

The performance of Graphics Processing Units (GPUs) is sensitive to irregular memory references. Recent work has shown the promise of data reorganization for eliminating the non-coalesced memory accesses that irregular references cause. However, all previous studies have employed simple, heuristic methods to determine the new data layouts to create. As a result, they either provide no performance guarantee or are effective only in limited scenarios. This paper contributes a fundamental study of the problem. It systematically analyzes the inherent complexity of the problem in various settings and, for the first time, proves that the problem is NP-complete. It then points out the limitations of existing techniques and reveals that, in practice, designing an appropriate data reorganization algorithm reduces to a tradeoff among space, time, and complexity. Based on that insight, it develops two new data reorganization algorithms that overcome the limitations of previous methods. Experiments show that an assembly composed of the new algorithms and a previous algorithm can circumvent the inherent complexity of finding optimal data layouts, making it feasible to minimize non-coalesced memory accesses for a variety of irregular applications and settings that are beyond the reach of existing techniques.
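To make the core idea concrete, the following is an illustrative sketch (not taken from the paper): it models how many memory segments a 32-thread warp touches when thread i reads `data[idx[i]]`, and how reorganizing the data so the accessed elements become contiguous reduces that count. The segment size, the index array, and the remapping scheme are all hypothetical choices made for illustration; the paper's own algorithms address the much harder problem of choosing layouts across many such access patterns.

```python
SEGMENT = 32  # elements per 128-byte memory transaction, assuming 4-byte data


def transactions(indices, seg=SEGMENT):
    """Number of distinct memory segments touched by one warp's accesses.

    Each distinct segment costs one (non-coalesced) memory transaction;
    a fully coalesced warp touches exactly one segment.
    """
    return len({i // seg for i in indices})


# A hypothetical irregular index array: 32 threads scattering across memory.
idx = [7, 95, 33, 260, 12, 141, 68, 5] * 4

before = transactions(idx)  # segments touched under the original layout

# Data reorganization: place the accessed elements contiguously in a new
# array and remap the index array to the new positions.
remap = {old: new for new, old in enumerate(sorted(set(idx)))}
idx_reorg = [remap[i] for i in idx]

after = transactions(idx_reorg)  # segments touched after reorganization
print(before, after)  # the reorganized layout needs far fewer transactions
```

In this toy example the warp's accesses shrink from five memory segments to one, which is the effect data reorganization aims for; the paper's complexity results concern finding such layouts when many warps and references must be served by a single layout.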

