DOI: 10.1145/3394277.3401846
research-article

Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications

Published: 29 June 2020

ABSTRACT

Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems, each of which involves a Cholesky factorization of a symmetric positive-definite covariance matrix, a dense factorization that is demanding in both memory footprint and computation. We propose a novel solution to this problem. At the mathematical level, we reduce the computational requirement by exploiting the data-sparsity structure of the matrix's off-diagonal tiles by means of low-rank approximations. At the programming-paradigm level, we integrate PaRSEC, a dynamic, task-based runtime, to reach unparalleled levels of efficiency for solving extreme-scale linear algebra matrix operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance. Performance results are reported using 3D synthetic datasets of up to 42M geospatial locations on 130,000 cores, which represent a cornerstone toward fast and accurate predictions for environmental applications.
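To make concrete why each MLE iteration needs a Cholesky factorization, the following is a minimal, dense, single-core sketch of evaluating a Gaussian log-likelihood through the factor L of the covariance matrix. It is not the tile low-rank, PaRSEC-based method of the paper. The exponential kernel, the function names, and the parameters `variance` and `length_scale` are assumptions chosen for illustration only.

```python
import math

def exp_covariance(points, variance, length_scale):
    """Dense covariance matrix from an exponential kernel (illustrative assumption)."""
    n = len(points)
    return [[variance * math.exp(-math.dist(points[i], points[j]) / length_scale)
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Return lower-triangular L with a = L L^T (unblocked, O(n^3) work)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = s / low[j][j]
    return low

def forward_solve(low, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i, row in enumerate(low):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y

def gaussian_log_likelihood(points, z, variance, length_scale):
    """Zero-mean Gaussian log-likelihood of observations z at the given points."""
    n = len(z)
    low = cholesky(exp_covariance(points, variance, length_scale))
    y = forward_solve(low, z)
    # log|Sigma| = 2 * sum(log L_ii); z^T Sigma^{-1} z = ||L^{-1} z||^2
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    quad = sum(v * v for v in y)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

An optimizer would call `gaussian_log_likelihood` repeatedly with different kernel parameters, so the cubic cost of `cholesky` dominates each iteration. That cost is what compressing off-diagonal tiles with low-rank approximations aims to reduce.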

