ABSTRACT
The critical path of internode communication on large-scale systems comprises multiple components. When a supercomputing application initiates a message transfer using a high-level communication routine such as MPI_Send, the message payload traverses multiple software stacks, the I/O subsystems of both the host and target nodes, and network components such as the switch. In this paper, we analyze where, why, and how much time is spent on the critical path of communication by modeling the overall injection overhead and end-to-end latency of a system. We focus our analysis on the performance of small messages, since fine-grained communication is becoming increasingly important as the number of cores per node continues to grow. The analytical models provide an accurate and detailed breakdown of the time spent in internode communication. We validate the models on Arm ThunderX2-based servers connected with Mellanox InfiniBand; this is the first work of its kind on Arm. Alongside our breakdown, we describe the methodology for measuring the time spent in each component, so that readers with access to precise CPU timers and a PCIe analyzer can measure breakdowns on systems of their interest. Such a breakdown is crucial for guiding the optimization efforts of software developers, system architects, and researchers. As researchers ourselves, we use the breakdown to simulate the impact, and discuss the likelihood, of a set of optimizations that target the bottlenecks in today's high-performance communication.
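The kind of what-if analysis the abstract describes can be illustrated with a minimal sketch: model end-to-end small-message latency as the sum of per-component costs, then rescale one component to simulate an optimization. The component names and nanosecond values below are hypothetical placeholders, not the paper's measured breakdown.

```python
# Illustrative sketch only: end-to-end small-message latency modeled as the
# sum of time spent in each component on the critical path. All names and
# values here are hypothetical, not measurements from the paper.

BASELINE_NS = {
    "host_software_stack": 200,   # e.g., MPI + communication library layers
    "host_pcie_write": 300,       # CPU-to-NIC injection over PCIe
    "nic_processing": 250,        # NIC packet processing
    "switch_and_wire": 150,       # network fabric traversal
    "target_io_and_stack": 300,   # target-side I/O and software stack
}

def end_to_end_latency(components):
    """One-way latency is the sum of the per-component costs."""
    return sum(components.values())

def apply_optimization(components, name, speedup):
    """Simulate an optimization that speeds up one component by `speedup`x."""
    tuned = dict(components)
    tuned[name] = components[name] / speedup
    return tuned

baseline = end_to_end_latency(BASELINE_NS)
tuned = end_to_end_latency(apply_optimization(BASELINE_NS, "host_pcie_write", 2.0))
print(baseline, tuned)  # -> 1200 1050.0
```

Halving the (hypothetical) PCIe injection cost improves the modeled latency by only 12.5%, which is exactly why a per-component breakdown matters: it shows how much each optimization can move the end-to-end number.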
Breaking Band: A Breakdown of High-performance Communication