Research Article

Improving parallelism and locality with asynchronous algorithms

Published: 09 January 2010

Abstract

As multicore chips become the main building blocks of high-performance computers, many numerical applications face a performance impediment due to the limited hardware capacity to move data between the CPU and off-chip memory. This is especially true for large problems solved by iterative algorithms, because of the large data sets they typically touch on every iteration. Loop tiling, also known as loop blocking, has previously been shown to be an effective way to enhance data locality, and hence to reduce memory bandwidth pressure, for a class of iterative algorithms executed on a single processor. Unfortunately, tiled programs suffer from reduced parallelism because only the loop iterations within a single tile can be easily parallelized. In this work, we propose to use the asynchronous model to enable effective loop tiling so that parallelism and locality can be attained simultaneously. Asynchronous algorithms were previously proposed to reduce the communication cost and synchronization overhead between processors. Our new discovery is that carefully controlled asynchrony, combined with loop tiling, can significantly improve the performance of parallel iterative algorithms on multicore processors, because data locality and loop-level parallelism are attained at the same time. We present supporting evidence from experiments with three well-known numerical kernels.
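To make the core idea concrete, the following is a minimal sketch (not the authors' code) of an asynchronously tiled 1D Jacobi solver. The function name, problem size, and tile parameters are illustrative assumptions. Each tile performs several inner sweeps back to back, reading whatever neighbor values it currently sees, stale or fresh, with no barrier between sweeps; this is the kind of relaxed synchronization that lets tiling across both space and time coexist with tile-level parallelism. For clarity the sketch simulates the asynchrony sequentially.

```python
import numpy as np

def async_tiled_jacobi(n=32, tile=8, inner=4, outer=800):
    """Illustrative asynchronous tiled Jacobi for the 1D Laplace equation.

    Boundary conditions u[0] = 0 and u[n-1] = 1; the exact steady-state
    solution is a linear ramp. All names and parameters are hypothetical.
    """
    u = np.zeros(n)
    u[-1] = 1.0
    for _ in range(outer):
        # Visit each tile and let it sweep `inner` times in a row before
        # moving on. A tile reads its halo points as they currently are
        # (possibly stale), i.e. no global barrier after each sweep --
        # this keeps the tile's working set hot in cache.
        for start in range(1, n - 1, tile):
            end = min(start + tile, n - 1)
            for _ in range(inner):
                u[start:end] = 0.5 * (u[start - 1:end - 1] + u[start + 1:end + 1])
    return u
```

In a parallel version, each tile would be owned by a thread and the inner sweeps would proceed without inter-tile barriers; convergence tolerates the resulting staleness for this class of contraction-based iterations, which is the property the asynchronous model exploits.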

