Abstract
As multicore chips become the main building blocks for high performance computers, many numerical applications face a performance impediment due to the limited hardware capacity to move data between the CPU and the off-chip memory. This is especially true for large problems solved by iterative algorithms, which typically operate on large data sets. Loop tiling, also known as loop blocking, was previously shown to be an effective way to enhance data locality, and hence to reduce memory bandwidth pressure, for a class of iterative algorithms executed on a single processor. Unfortunately, tiled programs suffer from reduced parallelism because only the loop iterations within a single tile can be easily parallelized. In this work, we propose using the asynchronous model to enable effective loop tiling such that both parallelism and locality can be attained simultaneously. Asynchronous algorithms were previously proposed to reduce the communication cost and synchronization overhead between processors. Our new discovery is that carefully controlled asynchrony combined with loop tiling can significantly improve the performance of parallel iterative algorithms on multicore processors, thanks to simultaneously attained data locality and loop-level parallelism. We present supporting evidence from experiments with three well-known numerical kernels.
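The core idea can be sketched in a few lines. Below is a minimal, sequential emulation of asynchrony-enabled tiling on a 1-D Jacobi (Laplace) relaxation: each tile performs several local sweeps before its updates become visible to neighboring tiles, so tile-edge reads may use stale values. The function name `tiled_async_jacobi` and all parameters here are illustrative assumptions, not the paper's actual kernels or implementation.

```python
import numpy as np

def tiled_async_jacobi(n=32, tile=8, outer=300, local_iters=4):
    """Sketch (not the paper's code) of tiling with relaxed synchronization
    on the 1-D Laplace relaxation u[i] <- 0.5*(u[i-1] + u[i+1]),
    with u[0] = 0 and u[n-1] = 1 held fixed.

    Each tile runs several local Jacobi sweeps on data that stays
    cache-resident; values read at tile edges are whatever neighboring
    tiles last published, i.e. possibly stale. In a real parallel
    version each tile would run on its own core with no barrier
    between sweeps."""
    u = np.zeros(n)
    u[-1] = 1.0
    for _ in range(outer):
        for start in range(1, n - 1, tile):          # one pass over the tiles
            end = min(start + tile, n - 1)
            for _ in range(local_iters):             # repeated local sweeps
                # NumPy evaluates the right-hand side before assigning,
                # so this is a proper Jacobi step within the tile; the
                # tile-edge neighbors u[start-1] and u[end] may be stale.
                u[start:end] = 0.5 * (u[start - 1:end - 1] + u[start + 1:end + 1])
    return u
```

Despite the relaxed update ordering, the iteration still converges to the exact linear profile for this problem — the robustness property that asynchronous iterative methods rely on — while the repeated local sweeps per tile are what recover the temporal locality that synchronous tiling forfeits.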
Improving parallelism and locality with asynchronous algorithms
PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming