Abstract
The four-index integral transform is a fundamental and computationally demanding calculation used in many computational chemistry suites such as NWChem. It transforms a four-dimensional tensor from one basis to another. This transformation is most efficiently implemented as a sequence of four tensor contractions that each contract a four-dimensional tensor with a two-dimensional transformation matrix. Differing degrees of permutation symmetry in the intermediate and final tensors in the sequence of contractions cause intermediate tensors to be much larger than the final tensor and limit the number of electronic states in the modeled systems.
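As a rough illustration of the "sequence of four tensor contractions" described above, the following sketch compares a direct evaluation of the transform with a four-step evaluation on a tiny dense tensor. The names `A`, `C`, `transform_direct`, `contract_and_rotate`, and `transform_four_steps` are assumptions made for this example, not identifiers from NWChem or the paper. The sketch ignores permutation symmetry, tiling, and parallel distribution, and it uses small integers so the two results can be compared exactly.

```python
import itertools
import random


def transform_direct(A, C, n):
    """B[a,b,c,d] = sum over i,j,k,l of C[a][i]*C[b][j]*C[c][k]*C[d][l]*A[i,j,k,l]."""
    idx = list(itertools.product(range(n), repeat=4))
    return {
        (a, b, c, d): sum(
            C[a][i] * C[b][j] * C[c][k] * C[d][l] * A[i, j, k, l]
            for i, j, k, l in idx
        )
        for a, b, c, d in idx
    }


def contract_and_rotate(T, C, n):
    """R[j,k,l,a] = sum over i of C[a][i]*T[i,j,k,l]; the new index moves to the back."""
    rng = range(n)
    return {
        (j, k, l, a): sum(C[a][i] * T[i, j, k, l] for i in rng)
        for j, k, l, a in itertools.product(rng, repeat=4)
    }


def transform_four_steps(A, C, n):
    """Apply one contraction per index; four rotations restore the index order."""
    T = A
    for _ in range(4):
        T = contract_and_rotate(T, C, n)
    return T


# Hypothetical toy input: a dense 3x3x3x3 tensor and a 3x3 transformation matrix.
random.seed(0)
n = 3
A = {idx: random.randint(-3, 3) for idx in itertools.product(range(n), repeat=4)}
C = [[random.randint(-2, 2) for _ in range(n)] for _ in range(n)]

B_direct = transform_direct(A, C, n)
B_steps = transform_four_steps(A, C, n)
```

For a dense tensor with extent n in every dimension, the direct form performs on the order of n^8 multiply-adds, while the four-step form performs on the order of n^5, which is why the stepwise formulation is preferred even though it introduces intermediate tensors.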
Loop fusion, in conjunction with tiling, can be very effective in reducing the total space requirement, as well as data movement. However, the large number of possible choices for loop fusion and tiling, and data/computation distribution across a parallel system, make it challenging to develop an optimized parallel implementation for the four-index integral transform. We develop a novel approach to address this problem, using lower bounds modeling of data movement complexity. We establish relationships between available aggregate physical memory in a parallel computer system and ineffective fusion configurations, enabling their pruning and consequent identification of effective choices and a characterization of optimality criteria. This work has resulted in the development of a significantly improved implementation of the four-index transform that enables higher performance and the ability to model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.
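To make the space saving from loop fusion concrete, here is a minimal sketch that fuses the first two contractions over their shared outer indices. It shows one possible fusion choice on a toy dense tensor and is not the configuration selected in the paper. The function names and the element counts returned alongside the results are assumptions introduced for this example, and tiling, symmetry, and data distribution are left out.

```python
import itertools
import random


def two_steps_unfused(A, C, n):
    """Materialize the full intermediate T1[a,j,k,l] before the second contraction."""
    rng = range(n)
    T1 = {
        (a, j, k, l): sum(C[a][i] * A[i, j, k, l] for i in rng)
        for a, j, k, l in itertools.product(rng, repeat=4)
    }
    T2 = {
        (a, b, k, l): sum(C[b][j] * T1[a, j, k, l] for j in rng)
        for a, b, k, l in itertools.product(rng, repeat=4)
    }
    # Second value: number of intermediate elements that were live at once.
    return T2, len(T1)


def two_steps_fused(A, C, n):
    """Fuse both contractions over the shared (k, l) loops."""
    rng = range(n)
    T2 = {}
    peak = 0
    for k, l in itertools.product(rng, repeat=2):
        # Only an n-by-n slice of the intermediate exists for this (k, l).
        t1_slice = {
            (a, j): sum(C[a][i] * A[i, j, k, l] for i in rng)
            for a in rng
            for j in rng
        }
        peak = max(peak, len(t1_slice))
        for a, b in itertools.product(rng, repeat=2):
            T2[a, b, k, l] = sum(C[b][j] * t1_slice[a, j] for j in rng)
    return T2, peak


# Hypothetical toy input with integer entries so results compare exactly.
random.seed(1)
n = 3
A = {idx: random.randint(-3, 3) for idx in itertools.product(range(n), repeat=4)}
C = [[random.randint(-2, 2) for _ in range(n)] for _ in range(n)]

T2_unfused, size_unfused = two_steps_unfused(A, C, n)
T2_fused, peak_fused = two_steps_fused(A, C, n)
```

In this toy setting the fused version keeps n^2 intermediate elements live instead of n^4 while producing the same result. Different choices of which loops to fuse trade intermediate size against recomputation and data movement, which is the search space the abstract refers to.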
Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming