
High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach

Published: 12 November 2011

ABSTRACT

Lattice Quantum Chromodynamics (LQCD) is a computationally challenging problem that solves the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product known as the Dslash operator. We have developed a novel multi-core-architecture-friendly implementation of the Wilson-Dslash operator which delivers 75 Gflops (single precision) on an Intel® Xeon® Processor X5680, achieving 60% computational efficiency for datasets that fit in the last-level cache. For datasets larger than the last-level cache, performance drops to 50 Gflops. Our performance is 2-3X higher than that of a well-known implementation from the Chroma software suite running on the same hardware platform. The implementation of LQCD reported in this paper is based on the recently published 3.5D spatial and 4.5D temporal tiling schemes. Both blocking schemes significantly reduce LQCD's external memory bandwidth requirements, delivering a more compute-bound implementation. The performance advantage of our schemes will become more significant as the gap between compute flops and external memory bandwidth continues to grow. We demonstrate very good cluster-level scalability of our implementation: for a lattice of 32^3 × 256 sites, we achieve over 4 Tflops when strong-scaled to a 128-node system (1,536 cores total). For the same lattice size, a full Conjugate Gradients solver built on our Wilson-Dslash operator achieves 2.95 Tflops.
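The Dslash operator described above is, structurally, a nearest-neighbor stencil: each lattice site accumulates contributions from its forward and backward neighbors in the four space-time directions, each contribution multiplied by an SU(3) gauge-link matrix. The following is a minimal illustrative sketch of that access pattern, not the paper's implementation: it omits the Wilson spin projection and all of the SIMD, threading, and 3.5D/4.5D tiling machinery, and the function name `dslash_like` is hypothetical. NumPy is used only to make the neighbor-sum structure explicit.

```python
# Sketch of a Dslash-like nearest-neighbor stencil (illustrative only; the
# real Wilson-Dslash also applies spin projection and is heavily optimized).
import numpy as np

def dslash_like(psi, U):
    """psi: site field, shape (T, Z, Y, X, 3), complex "color" vectors.
       U:   gauge links, shape (4, T, Z, Y, X, 3, 3), one 3x3 matrix per
            direction mu and site, pointing from the site to site + mu.
       Returns the sum over the 8 neighbors, each multiplied by the
       appropriate link (U for forward, U-dagger for backward)."""
    out = np.zeros_like(psi)
    for mu in range(4):                      # four lattice directions
        fwd = np.roll(psi, -1, axis=mu)      # field value at site + mu
        bwd = np.roll(psi, +1, axis=mu)      # field value at site - mu
        U_bwd = np.roll(U[mu], +1, axis=mu)  # link carried from site - mu
        # out(x) += U_mu(x) psi(x+mu) + U_mu(x-mu)^dagger psi(x-mu)
        out += np.einsum('...ab,...b->...a', U[mu], fwd)
        out += np.einsum('...ab,...b->...a',
                         np.conj(np.swapaxes(U_bwd, -1, -2)), bwd)
    return out
```

Because every site touches eight neighbors, a straightforward sweep streams the whole field from memory repeatedly; the cache-blocking schemes discussed in the paper exist precisely to keep such neighbor reuse inside the last-level cache.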


  • Published in

    SC '11: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2011, 866 pages
    ISBN: 9781450307710
    DOI: 10.1145/2063384
    Copyright © 2011 ACM
    Publisher: Association for Computing Machinery, New York, NY, United States

    Qualifiers

    • research-article

    Acceptance Rates

    SC '11 paper acceptance rate: 74 of 352 submissions, 21%. Overall acceptance rate: 1,516 of 6,373 submissions, 24%.
