ABSTRACT
Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems, each of which involves a Cholesky factorization of a symmetric positive-definite covariance matrix, a dense factorization that is demanding in both memory footprint and computation. We propose a novel solution to this problem: at the mathematical level, we reduce the computational requirement by exploiting the data sparsity structure of the matrix off-diagonal tiles by means of low-rank approximations; and, at the programming-paradigm level, we integrate PaRSEC, a dynamic, task-based runtime, to reach unparalleled levels of efficiency for solving extreme-scale linear algebra matrix operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance. Performance results are reported using 3D synthetic datasets of up to 42M geospatial locations on 130,000 cores, which represent a cornerstone toward fast and accurate predictions of environmental applications.
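To make the role of the Cholesky factorization in the MLE procedure concrete, the following is a minimal dense sketch of one zero-mean Gaussian log-likelihood evaluation. It is an illustration under stated assumptions, not the paper's implementation: the exponential kernel, its parameters, and all function names are hypothetical choices, and the factorization is a plain dense one in pure Python, with no tile low-rank compression and no PaRSEC.

```python
import math
import random


def exp_covariance(locs, variance, length_scale):
    """Dense covariance matrix from an exponential kernel (assumed for illustration)."""
    n = len(locs)
    return [[variance * math.exp(-math.dist(locs[i], locs[j]) / length_scale)
             for j in range(n)] for i in range(n)]


def cholesky(a):
    """Return lower-triangular L with L L^T = a; a must be symmetric positive definite."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i, row in enumerate(L):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y


def gaussian_loglik(z, locs, variance, length_scale):
    """Zero-mean Gaussian log-likelihood evaluated through a Cholesky factor."""
    n = len(z)
    L = cholesky(exp_covariance(locs, variance, length_scale))
    y = forward_solve(L, z)  # y = L^{-1} z, so y.y equals z^T Sigma^{-1} z
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(v * v for v in y)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


# Synthetic example: 50 random 2D locations with random observations.
random.seed(0)
locs = [(random.random(), random.random()) for _ in range(50)]
z = [random.gauss(0.0, 1.0) for _ in range(50)]
ll = gaussian_loglik(z, locs, variance=1.0, length_scale=0.1)
```

An optimizer would call a function like `gaussian_loglik` once per candidate parameter set, so the cubic-cost factorization is repeated at every iteration. That repeated cost is what the abstract's low-rank approximation of off-diagonal tiles is meant to reduce.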
Index Terms
Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications