DOI: 10.1145/3295500.3356189
research-article

Network-accelerated non-contiguous memory transfers

Published: 17 November 2019

Abstract

Applications often communicate data that is non-contiguous in the send or receive buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non-contiguous transfers are well supported in HPC (e.g., through MPI derived datatypes), they can still be up to 5x slower than contiguous transfers of the same size. As we enter the era of network acceleration, we need to investigate which tasks to offload to the NIC. In this work, we argue that non-contiguous memory transfers can be transparently network-accelerated, truly achieving zero-copy communication. We implement and extend sPIN, a packet streaming processor, within a Portals 4 NIC SST model, and evaluate strategies for NIC-offloaded processing of MPI datatypes, ranging from datatype-specific handlers to general solutions for any MPI datatype. We observe up to 8x speedup in the unpack throughput of real applications, demonstrating that non-contiguous memory transfers are a first-class candidate for network acceleration.
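To make the matrix-column example concrete, the following is a minimal sketch in plain C of the host-side pack and unpack copies that a non-contiguous transfer implies when no offload is available. The helper names `pack_column` and `unpack_column` are hypothetical and are not taken from the paper, sPIN, or any MPI library; the sketch only illustrates the strided access pattern, not the NIC-offloaded handlers the paper evaluates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): gather column `col` of a
 * row-major rows x cols matrix into a contiguous buffer. This is the
 * "pack" copy a sender performs before a contiguous transfer. */
void pack_column(const double *matrix, size_t rows, size_t cols,
                 size_t col, double *packed)
{
    assert(col < cols);
    for (size_t r = 0; r < rows; ++r) {
        /* Consecutive column elements lie `cols` doubles apart. */
        packed[r] = matrix[r * cols + col];
    }
}

/* Hypothetical helper (not from the paper): scatter a contiguous buffer
 * into column `col` of a row-major matrix. This is the "unpack" copy on
 * the receiver, i.e. the kind of step the abstract proposes to offload. */
void unpack_column(const double *packed, size_t rows, size_t cols,
                   size_t col, double *matrix)
{
    assert(col < cols);
    for (size_t r = 0; r < rows; ++r) {
        matrix[r * cols + col] = packed[r];
    }
}
```

In MPI, the same layout can be described declaratively with a derived datatype such as one built by `MPI_Type_vector` (count = rows, block length = 1, stride = cols), which leaves the library, or in the paper's proposal the NIC, free to choose how the strided elements are moved.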

    Published In

    SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2019
    1921 pages
    ISBN:9781450362290
    DOI:10.1145/3295500

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Funding Sources

    • European Science Foundation
    • European Research Council (ERC)

    Conference

    SC '19

    Acceptance Rates

    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Article Metrics

    • Downloads (Last 12 months): 28
    • Downloads (Last 6 weeks): 7
    Reflects downloads up to 23 Nov 2024

    Cited By

    • (2024) Canary: Congestion-aware in-network allreduce using dynamic trees. Future Generation Computer Systems 152 (70-82). DOI: 10.1016/j.future.2023.10.010. Online publication date: Mar-2024
    • (2023) Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries. ACM Computing Surveys 56:2 (1-40). DOI: 10.1145/3604932. Online publication date: 15-Sep-2023
    • (2023) Achieving Zero-copy Serialization for Datacenter RPC. 2023 IEEE International Performance, Computing, and Communications Conference (IPCCC) (304-312). DOI: 10.1109/IPCCC59175.2023.10253859. Online publication date: 17-Nov-2023
    • (2023) Simplifying non-contiguous data transfer with MPI for Python. The Journal of Supercomputing 79:17 (20019-20040). DOI: 10.1007/s11227-023-05398-7. Online publication date: 7-Jun-2023
    • (2022) Accelerating Data Serialization/Deserialization Protocols with In-Network Compute. 2022 IEEE/ACM International Workshop on Exascale MPI (ExaMPI) (22-30). DOI: 10.1109/ExaMPI56604.2022.00008. Online publication date: Nov-2022
    • (2022) Strided DMA for Multidimensional Array Copy and Transpose. Intelligent Computing (375-393). DOI: 10.1007/978-3-031-10461-9_26. Online publication date: 7-Jul-2022
    • (2021) Practice of Streaming Processing of Dynamic Graphs: Concepts, Models, and Systems. IEEE Transactions on Parallel and Distributed Systems (1-1). DOI: 10.1109/TPDS.2021.3131677. Online publication date: 2021
    • (2021) High-Performance Routing With Multipathing and Path Diversity in Ethernet and HPC Networks. IEEE Transactions on Parallel and Distributed Systems 32:4 (943-959). DOI: 10.1109/TPDS.2020.3035761. Online publication date: 1-Apr-2021
    • (2021) A RISC-V in-network accelerator for flexible high-performance low-power packet processing. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) (958-971). DOI: 10.1109/ISCA52012.2021.00079. Online publication date: Jun-2021
    • (2021) Layout-aware Hardware-assisted Designs for Derived Data Types in MPI. 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC) (302-311). DOI: 10.1109/HiPC53243.2021.00044. Online publication date: Dec-2021
