ABSTRACT
Lattice Quantum Chromodynamics (LQCD) is a computationally challenging problem that solves the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product, known as the Dslash operator. We have developed a novel multicore architecture-friendly implementation of the Wilson-Dslash operator which delivers 75 Gflops (single-precision) on an Intel® Xeon® Processor X5680, achieving 60% computational efficiency for datasets that fit in the last-level cache. For datasets larger than the last-level cache, this performance drops to 50 Gflops. Our performance is 2-3x higher than that of a well-known implementation from the Chroma software suite running on the same hardware platform. The novel implementation of LQCD reported in this paper is based on the recently published 3.5D spatial and 4.5D temporal tiling schemes. Both blocking schemes significantly reduce the external memory bandwidth requirements of LQCD, delivering a more compute-bound implementation. The performance advantage of our schemes will become more significant as the gap between compute flops and external memory bandwidth continues to grow. We demonstrate very good cluster-level scalability of our implementation: for a lattice of 32^3 x 256 sites, we achieve over 4 Tflops when strong-scaled to a 128-node system (1536 cores total). For the same lattice size, a full Conjugate Gradients solver built on the Wilson-Dslash operator achieves 2.95 Tflops.
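The cluster-level figures above can be broken down per node with simple arithmetic. The following is a minimal sketch that uses only the numbers quoted in the abstract; the function and constant names are hypothetical, and "over 4 Tflops" is treated as exactly 4.0 Tflops, so the Dslash values are lower bounds rather than measured results.

```python
# Figures quoted in the abstract (nothing here is measured by this script).
LATTICE_DIMS = (32, 32, 32, 256)  # 32^3 x 256 sites
NODES = 128
CORES = 1536
DSLASH_TFLOPS = 4.0               # "over 4 Tflops": treated as a lower bound
CG_TFLOPS = 2.95


def lattice_sites(dims):
    """Total number of lattice sites: the product of the four extents."""
    total = 1
    for extent in dims:
        total *= extent
    return total


def gflops_per_unit(total_tflops, units):
    """Aggregate Tflops spread evenly over `units` nodes (or cores), in Gflops."""
    return total_tflops * 1000.0 / units


sites = lattice_sites(LATTICE_DIMS)
sites_per_node = sites // NODES       # strong scaling: fixed lattice, more nodes
cores_per_node = CORES // NODES

print("total sites:          ", sites)
print("sites per node:       ", sites_per_node)
print("cores per node:       ", cores_per_node)
print("Dslash Gflops per node:", gflops_per_unit(DSLASH_TFLOPS, NODES))
print("CG Gflops per node:    ", gflops_per_unit(CG_TFLOPS, NODES))
print("CG / Dslash ratio:     ", CG_TFLOPS / DSLASH_TFLOPS)
```

Under these assumptions each node holds 65,536 sites and sustains at least 31.25 Gflops in Dslash. Because the 4 Tflops figure is a lower bound, the CG-to-Dslash ratio printed by the script is an upper bound.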