Abstract
MATLAB is an array language that was initially popular for rapid prototyping but is now increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also contain control-flow-dominated scalar regions that affect the program's execution time. Today's computer systems offer tremendous computing power in the form of traditional CPU cores and throughput-oriented accelerators such as graphics processing units (GPUs). Thus, an approach that maps the control-flow-dominated regions to the CPU and the data-parallel regions to the GPU can significantly improve program performance.
In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input to identify data-parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies the data-parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. We present heuristics that map the identified kernels to either the CPU or the GPU so that kernel execution on the two processors happens synergistically and the amount of data transfer needed is minimized. To ensure the required data movement for dependencies across basic blocks, we propose a data flow analysis and an edge splitting strategy. Thus, our compiler automatically handles the composition of kernels, the mapping of kernels to the CPU and GPU, scheduling, and the insertion of required data transfers. We implemented the proposed compiler, and an experimental evaluation on a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data-parallel benchmarks over native execution of MATLAB.
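The pipeline described above (kernel composition, device mapping, and transfer insertion) can be illustrated with a small Python sketch. It is not MEGHA's algorithm: the abstract does not give the clustering constraints or the mapping heuristics, so the sketch assumes a greedy merge of consecutive data-parallel statements, a hypothetical `max_live` limit standing in for the clustering constraint, and a fixed rule that sends every data-parallel kernel to the GPU and every scalar statement to the CPU. The names `Stmt`, `compose_kernels`, and `count_transfers` are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Stmt:
    name: str            # variable defined by this statement
    uses: tuple          # variables read by this statement
    data_parallel: bool  # True for element-wise array operations


def compose_kernels(stmts, max_live=4):
    """Greedily merge consecutive data-parallel statements into kernels.

    Assumption: a kernel is closed once it would touch more than
    `max_live` distinct variables (a stand-in for a real resource
    constraint). Scalar statements become single-statement CPU kernels.
    """
    kernels, current = [], []
    for s in stmts:
        if not s.data_parallel:
            if current:
                kernels.append(("GPU", current))
                current = []
            kernels.append(("CPU", [s]))
            continue
        touched = {v for t in current + [s] for v in (t.name, *t.uses)}
        if current and len(touched) > max_live:
            kernels.append(("GPU", current))
            current = []
        current.append(s)
    if current:
        kernels.append(("GPU", current))
    return kernels


def count_transfers(kernels, inputs):
    """List the (variable, destination) copies needed by a kernel schedule.

    Assumption: program inputs start on the CPU, and a copied variable
    stays valid on both devices until it is redefined.
    """
    location = {v: {"CPU"} for v in inputs}
    transfers = []
    for device, body in kernels:
        for s in body:
            for v in s.uses:
                if device not in location[v]:
                    transfers.append((v, device))
                    location[v].add(device)
            location[s.name] = {device}
    return transfers


# Toy program: two element-wise statements, a scalar step, one more
# element-wise statement that consumes the scalar result.
program = [
    Stmt("c", ("a", "b"), True),
    Stmt("d", ("c", "a"), True),
    Stmt("n", ("d",), False),
    Stmt("e", ("d", "n"), True),
]
kernels = compose_kernels(program)
transfers = count_transfers(kernels, inputs=["a", "b"])
for device, body in kernels:
    print(device, [s.name for s in body])
print(transfers)
```

In this toy schedule the first two statements fuse into one GPU kernel, so the intermediate `c` never leaves the GPU; only the inputs, the array `d` needed by the scalar step, and the scalar `n` cross the CPU-GPU boundary.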
Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors
PLDI '11: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation