Abstract
Existing work on parallelizing complicated reductions and scans focuses on formalism and hardly deals with implementation. To bridge the gap between formalism and implementation, we have integrated parallelization via matrix multiplication into compiler construction. Our framework can deal with complicated loops that existing compiler techniques cannot parallelize. Moreover, we have refined our framework by developing two sets of techniques: one enhances its capability for parallelization by extracting max-operators automatically, and the other improves the performance of parallelized programs by eliminating redundancy. We have implemented our framework and techniques as a parallelizer in a compiler. Experiments on examples that existing compilers cannot parallelize demonstrate the scalability of programs parallelized by our implementation.
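To illustrate the core idea behind parallelization via matrix multiplication (a minimal sketch, not the paper's implementation — the function names and the use of a sequential `reduce` are assumptions for exposition): the sequential recurrence `x = a*x + b` carries a loop-carried dependence, but each step can be encoded as a 2×2 matrix, and the whole loop becomes a product of matrices. Since matrix multiplication is associative, that product can be computed by a parallel tree reduction instead of a sequential scan.

```python
# Hypothetical sketch of parallelization via matrix multiplication.
# The loop-carried recurrence  x[i+1] = a[i]*x[i] + b[i]  is rewritten
# so each iteration is a 2x2 matrix; the loop becomes a matrix product,
# which is associative and therefore admits a parallel tree reduction.
from functools import reduce

def step_matrix(a, b):
    # This matrix maps (x, 1)^T to (a*x + b, 1)^T.
    return ((a, b),
            (0, 1))

def matmul(m, n):
    # 2x2 matrix product; associative, hence parallelizable.
    return tuple(
        tuple(sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def run_recurrence(x0, coeffs):
    # Sequentially this is:  for a, b in coeffs: x = a*x + b.
    # Here we fold the step matrices instead; a real parallelizer would
    # split this reduce across processors as a balanced tree.
    ms = [step_matrix(a, b) for a, b in coeffs]
    m = reduce(matmul, reversed(ms))   # m = M_{n-1} ... M_1 M_0
    return m[0][0] * x0 + m[0][1]
```

For example, `run_recurrence(1, [(2, 1), (3, 0), (1, 5)])` agrees with the sequential loop. The same trick generalizes to the loops with max-operators mentioned in the abstract by working over the max-plus semiring, where `max` plays the role of addition and `+` the role of multiplication.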
Automatic parallelization via matrix multiplication. In PLDI '11: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation.