ABSTRACT
Heterogeneous computers with processors and accelerators are becoming widespread in scientific computing. However, it is difficult to program hybrid architectures and there is no commonly accepted programming model. Ideally, applications should be written in a way that is portable to many platforms, but providing this portability for general programs is a hard problem.
By restricting the class of programs considered, we can make this portability feasible. We present Liszt, a domain-specific language for constructing mesh-based PDE solvers. We introduce language statements for interacting with an unstructured mesh and for storing data at its elements. Program analysis of these statements enables our compiler to expose the parallelism, locality, and synchronization of Liszt programs. Using this analysis, we generate applications for multiple platforms: a cluster, an SMP, and a GPU. This approach allows Liszt applications to perform within 12% of hand-written C++, scale to large clusters, and experience order-of-magnitude speedups on GPUs.
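To make the idea of mesh statements and exposed parallelism concrete, here is a minimal Python sketch. It is not Liszt syntax and is not taken from the paper: the mesh, the `temperature` field, and the helpers `color_edges` and `diffuse_step` are all hypothetical names chosen for illustration. It assumes one common way to remove write conflicts in an edge loop, namely coloring edges so that edges of the same color share no vertex; the abstract does not say which strategy Liszt uses on each platform.

```python
from collections import defaultdict

# Hypothetical tiny unstructured mesh: 4 vertices, 5 edges as (vertex, vertex) ids.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

# Data stored at mesh elements: one value per vertex (illustrative field).
temperature = [100.0, 0.0, 0.0, 0.0]


def color_edges(edges):
    """Greedy coloring: any two edges that share a vertex get different colors."""
    used = defaultdict(set)  # vertex -> colors already taken by its incident edges
    colors = []
    for a, b in edges:
        c = 0
        while c in used[a] or c in used[b]:
            c += 1
        colors.append(c)
        used[a].add(c)
        used[b].add(c)
    return colors


def diffuse_step(edges, field, rate=0.1):
    """One explicit step: every edge moves a flux between its two end vertices."""
    colors = color_edges(edges)
    old = list(field)  # values read by every edge
    new = list(field)  # values accumulated into by every edge
    for color in range(max(colors) + 1):
        # Edges in one color batch touch disjoint vertices, so the writes in
        # this inner loop could be executed in parallel without locking.
        for (a, b), c in zip(edges, colors):
            if c != color:
                continue
            flux = rate * (old[a] - old[b])
            new[a] -= flux
            new[b] += flux
    return new


if __name__ == "__main__":
    print(color_edges(edges))
    print(diffuse_step(edges, temperature))
```

Because the loop body only reads `old` and only accumulates into `new` at the two vertices of each edge, the set of elements an iteration can touch is known from the mesh topology alone. That is the kind of information a restricted language lets a compiler extract, whereas a general-purpose program with arbitrary pointer access would hide it.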
Index Terms
Liszt: a domain specific language for building portable mesh-based PDE solvers