Abstract
Exposing massive parallelism in 3D unstructured mesh computations with efficient load balancing and minimal synchronization is challenging. Current approaches relying on domain decomposition and mesh coloring struggle to scale with the increasing number of cores per node, especially on new many-core processors. In this paper, we propose a hybrid approach that combines domain decomposition to exploit distributed-memory parallelism, Divide-and-Conquer (D&C) to exploit shared-memory parallelism and improve locality, and mesh coloring at the core level to exploit vector units. It illustrates a new trade-off for many-core processors between structuredness, memory locality, and vectorization. We evaluate our approach on the finite element matrix assembly of an industrial fluid dynamics code developed by Dassault Aviation, and compare D&C to both domain decomposition and mesh coloring. D&C achieves high parallel efficiency, good data locality, and improved bandwidth usage. On current nodes it competes with the optimized pure MPI version, with a speed-up of at least 10%. D&C also shows a 319x strong-scaling speed-up on 512 cores (32 nodes) with only 2,000 vertices per core. Finally, the Intel Xeon Phi version delivers performance similar to 10 Intel E5-2665 Xeon Sandy Bridge cores, with 95% parallel efficiency on the 60 physical cores. Running on 4 Xeon Phi coprocessors (240 cores), D&C reaches 92% efficiency on the physical cores and performance similar to 33 Intel E5-2665 Xeon Sandy Bridge cores.
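The shared-memory layer of the scheme can be sketched as follows: elements are recursively bisected so that the two halves touch disjoint vertices and can be assembled as independent tasks, while a "separator" set of elements straddling the cut is assembled after both halves complete; at the leaves, elements are greedily colored so that no two elements of the same color share a vertex, making each color's inner loop free of write conflicts (the part a compiler can vectorize). This is a minimal, sequential Python sketch of that structure, not the paper's implementation; all names (`bisect`, `color`, `dc_assemble`) and the toy x-median bisection are illustrative assumptions.

```python
# Illustrative sketch of D&C over mesh elements with coloring at the leaves.
# Elements are tuples of vertex ids; coords maps vertex id -> coordinates.

def bisect(elements, coords):
    """Toy bisection: split elements at the median of their centroid x-coordinate."""
    elements = sorted(elements, key=lambda e: sum(coords[v][0] for v in e) / len(e))
    mid = len(elements) // 2
    left, right = elements[:mid], elements[mid:]
    left_verts = {v for e in left for v in e}
    # Elements of `right` touching a vertex of `left` form the separator:
    # they must be assembled after both halves to avoid concurrent writes.
    separator = [e for e in right if any(v in left_verts for v in e)]
    right = [e for e in right if e not in separator]
    return left, right, separator

def color(elements):
    """Greedy coloring: two elements sharing a vertex get different colors."""
    vertex_colors = {}   # vertex id -> set of colors already used at that vertex
    buckets = {}         # color -> list of elements
    for e in elements:
        used = set().union(*(vertex_colors.get(v, set()) for v in e))
        c = next(c for c in range(len(elements) + 1) if c not in used)
        for v in e:
            vertex_colors.setdefault(v, set()).add(c)
        buckets.setdefault(c, []).append(e)
    return buckets

def dc_assemble(elements, coords, accumulate, leaf_size=4):
    """Recursive D&C: in real code the two halves run as parallel tasks."""
    if not elements:
        return
    if len(elements) <= leaf_size:
        for c, bucket in sorted(color(elements).items()):
            for e in bucket:      # elements of one color share no vertex,
                accumulate(e)     # so this inner loop has no write conflicts
        return
    left, right, separator = bisect(elements, coords)
    dc_assemble(left, coords, accumulate, leaf_size)       # task 1
    dc_assemble(right, coords, accumulate, leaf_size)      # task 2 (parallel)
    dc_assemble(separator, coords, accumulate, leaf_size)  # after the join
```

In a real implementation the two recursive calls would be spawned as tasks (e.g. Cilk or OpenMP tasks), and the leaf loops over same-color elements are the vectorizable kernels.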
Scalable and efficient implementation of 3D unstructured meshes computation: a case study on matrix assembly. PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.