ABSTRACT

Computer architects have increased hardware parallelism and power efficiency by integrating massively parallel hardware accelerators (coprocessors) into compute systems. Many modern HPC clusters now consist of multi-CPU nodes along with additional hardware accelerators in the form of graphics processing units (GPUs). Each CPU and GPU is integrated with system memory via communication links (QPI and PCIe) and multi-channel memory controllers. The increasing density of these heterogeneous computing systems has resulted in complex performance phenomena including non-uniform memory access (NUMA) and resource contention that make application performance hard to predict and tune. This paper presents the Topology Aware Resource Usability and Contention (TARUC) benchmark. TARUC is a modular, open-source, and highly configurable benchmark useful for profiling dense heterogeneous systems to provide insight for developers who wish to tune application codes for specific systems. Analysis of TARUC performance profiles from a multi-CPU, multi-GPU system is also presented.