
Scalable and efficient implementation of 3d unstructured meshes computation: a case study on matrix assembly

Published: 24 January 2015

Abstract

Exposing massive parallelism in 3D unstructured mesh computations with efficient load balancing and minimal synchronization is challenging. Current approaches relying on domain decomposition and mesh coloring struggle to scale with the increasing number of cores per node, especially on new many-core processors. In this paper, we propose a hybrid approach that uses domain decomposition to exploit distributed-memory parallelism, Divide-and-Conquer (D&C) to exploit shared-memory parallelism and improve locality, and mesh coloring at the core level to exploit vectorization. It illustrates a new trade-off for many-core processors between structure, memory locality, and vectorization. We evaluate our approach on the finite element matrix assembly of an industrial fluid dynamics code developed by Dassault Aviation, and compare D&C to domain decomposition and to mesh coloring. D&C achieves high parallel efficiency, good data locality, and improved bandwidth usage. On current nodes, it competes with the optimized pure MPI version, with a minimum 10% speed-up. D&C shows an impressive 319x strong scaling on 512 cores (32 nodes) with only 2000 vertices per core. Finally, the Intel Xeon Phi version performs on par with 10 Intel Xeon E5-2665 Sandy Bridge cores, with 95% parallel efficiency on the 60 physical cores. Running on 4 Xeon Phi processors (240 cores), D&C reaches 92% efficiency on the physical cores and performance similar to 33 Intel Xeon E5-2665 Sandy Bridge cores.
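The D&C idea described in the abstract can be sketched as follows: recursively bisect the mesh elements into two vertex-disjoint halves plus a separator of elements that straddle the cut; the two halves can be assembled as independent tasks with no synchronization, and the separator is assembled only after both children finish. This is a minimal illustrative sketch, not the authors' implementation: the toy 1D line-element mesh, the unit "element stiffness", and all function names are assumptions made for illustration (the real code would spawn the recursive calls as shared-memory tasks and apply coloring/vectorization at the leaves).

```python
# Hedged sketch of divide-and-conquer (D&C) FEM matrix assembly.
# Toy 1D mesh and names are illustrative assumptions, not the paper's code.

import numpy as np

def assemble_element(K, elem):
    # Toy "element stiffness": add 1.0 to every (i, j) vertex pair.
    for i in elem:
        for j in elem:
            K[i, j] += 1.0

def dc_assemble(K, elems, vmin, vmax):
    # Leaf of the recursion: assemble directly. In the real code, mesh
    # coloring and vectorization would be applied at this level.
    if len(elems) <= 2 or vmax - vmin <= 1:
        for e in elems:
            assemble_element(K, e)
        return
    mid = (vmin + vmax) // 2
    left  = [e for e in elems if max(e) <= mid]       # entirely in left half
    right = [e for e in elems if min(e) > mid]        # entirely in right half
    sep   = [e for e in elems if min(e) <= mid < max(e)]  # straddles the cut
    # The halves touch disjoint vertex sets, so they could run as
    # independent parallel tasks with no locks or atomics.
    dc_assemble(K, left, vmin, mid)
    dc_assemble(K, right, mid + 1, vmax)
    # Separator elements touch both halves; assembling them after both
    # children is the only synchronization point.
    for e in sep:
        assemble_element(K, e)

n = 8
elems = [(i, i + 1) for i in range(n - 1)]  # 1D mesh of line elements
K_dc = np.zeros((n, n))
dc_assemble(K_dc, elems, 0, n - 1)

# Sanity check: D&C order produces the same matrix as sequential assembly.
K_seq = np.zeros((n, n))
for e in elems:
    assemble_element(K_seq, e)
assert np.array_equal(K_dc, K_seq)
```

The recursion yields a binary task tree whose leaves are conflict-free, which is what allows work-stealing runtimes to balance load while keeping each task's working set small and cache-resident.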

