MPI Application Binary Interface Standardization

MPI is the most widely used interface for high-performance computing (HPC) workloads. Its success lies in its embrace of libraries and ability to evolve while maintaining backward compatibility for older codes, enabling them to run on new architectures for many years. In this paper, we propose a new level of MPI compatibility: a standard Application Binary Interface (ABI). We review the history of MPI implementation ABIs, identify the constraints from the MPI standard and ISO C, and summarize recent efforts to develop a standard ABI for MPI. We provide the current proposal from the MPI Forum’s ABI working group, which has been prototyped both within MPICH and as an independent abstraction layer called Mukautuva. We also list several use cases that would benefit from the definition of an ABI while outlining the remaining constraints.


INTRODUCTION
MPI [32] has always been an Application Programming Interface (API) standard, which means that it is standardized in terms of the C and Fortran programming languages. Implementations are not constrained in how they define opaque types (for example, MPI_Comm), which means they compile into different binary representations. This is fine for users who only use one implementation, or are content to recompile their software for each of these. Many users, including those building both traditional C/C++/Fortran libraries and new languages that use MPI via the C ABI, are tired of the duplication of effort required because MPI lacks a standard Application Binary Interface (ABI).
The potential for implementation agnosticism [15,39], and specifically an ABI [31], has been recognized for many years. However, no serious effort was made to standardize an ABI, for a variety of reasons. Some of the forces acting against ABI standardization were the diversity of HPC systems, the prevalence of static linking, and the lack of adoption of third-party languages. Over the past 20 years, the HPC hardware and software ecosystem has changed dramatically. Distributing software packages through shared libraries is now common. Package managers, including HPC-oriented ones such as Spack [14], distribute binaries that depend on MPI. There is increasing adoption of MPI by applications written in languages other than C and Fortran [4,5,9]. The MPICH ABI Initiative [29,40] was the first serious effort to create mutually interoperable MPI implementations, by reconciling small differences between the ABIs of MPICH and MPICH-based implementations. This allows applications compiled against appropriate versions of MPICH, Intel MPI, Cray MPI, MVAPICH2 and other implementations to run using the shared libraries from any of the other implementations. This is especially useful to leverage the level of platform-specific specialization that goes into some of these libraries.
Since 2014, the appetite for MPI implementation compatibility has grown dramatically for at least two reasons. First, containers are an increasingly popular mechanism for distributing HPC software. Singularity [34,45] and Shifter [3,33], among others, now allow complex scientific applications to be shared more easily by packing them as self-contained software images. However, container portability is hindered [26] by both the lack of a common launch methodology and the absence of an MPI ABI, which prevents the advent of portable containers featuring MPI programs. Second, MPI is now used by applications written in languages like Python, Julia, and Rust, which are currently required to build and test against all supported implementations and to support end-user installation of their MPI bindings against the implementation on the user's system. A standard ABI would eliminate the per-implementation cost of packaging and simplify testing. The per-implementation costs due to MPI implementation ABIs are not unique to these languages.
In the rest of this paper, we describe the constraints associated with an MPI ABI, the potential benefits for the HPC ecosystem, and the proposed ABI implementation as defined by the ABI working group. Performance experiments demonstrate that a high-quality implementation of the standard ABI in MPICH has negligible overhead, while the third-party implementation in Mukautuva has a tolerable overhead. We also discuss important considerations for compatibility besides the C ABI, including library naming, launchers, and Fortran.

BACKGROUND AND RELATED WORK
The HPC user community has been actively working to address the issue of ABI compatibility in MPI implementations. For a long time, the requirements associated with ABI compatibility in MPI have led to complexities in terms of software deployment, particularly in large computing centers.
Wi4MPI is a wrapper interface that implements ABI interoperability for MPI, supporting both Fortran and C languages [30]. It can be used in two ways. First, users can compile their applications against the generic MPI interface from Wi4MPI and then redirect them to their implementation of choice. Alternatively, they can redirect one implementation to another. For Wi4MPI to work, its wrapper interface needs to be compiled for each source and target MPI. The performance overhead has been shown to be minimal, making it an effective tool for running containers in a portable manner. Wi4MPI is leveraged in the Extreme-scale Scientific Software Stack [24] as a support tool in the e4s-cl container launcher tool, which implements on-the-fly MPI detection and library translation at container launch time [37].
A similar effort to Wi4MPI was undertaken at the Perimeter Institute for Theoretical Physics, leading to MPItrampoline [35], which defines its own ABI enabling applications' portability on several MPI runtimes.
It is known that a patent [41] exists for a specific method of interoperating different MPI ABIs, preventing its use by the open-source community.
In general, the availability of a standard ABI will simplify the tasks of these converters. Instead of having to implement conversions between the two APIs, these adaptation layers will primarily focus on compilation-related tasks, such as fixing dependency detection and enabling the replacement of one MPI with another.
After the MPI ABI working group was formed, two efforts were started to prototype the proposed designs and to understand their feasibility. The first of these was Mukautuva [22], which is a standalone ABI abstraction layer that maps from its own ABI (i.e., an approximation to the one under discussion in the working group) to MPICH and Open MPI by redirecting MPI symbols through a translation layer to the underlying MPI implementation, with renamed symbols (via dlsym) to avoid conflicts. The final design of Mukautuva is unintentionally quite similar to MPItrampoline; this convergence may be an indication of the suitability of their design. Meanwhile, a prototype was developed in MPICH [46]. Working together, these efforts revealed the relative ease of implementing the ABI proposal both internally and externally to an existing implementation. They also exposed non-portable assumptions in various MPI test suites.

CURRENT ABI DESIGNS
There are multiple aspects to an MPI ABI. Here are a few:
(1) The integral types of MPI_Aint, MPI_Offset, and MPI_Count.
(2) The MPI_Status object. This is a C struct with three standard members as well as hidden fields used by the implementation.
(3) Opaque handles such as MPI_Comm. Implementations can define these to be anything that satisfies the required properties.
(4) Callback functions, e.g., MPI_User_function. These callback functions usually do not allow registering any data with the function pointers, which is a challenge to intercepting and forwarding registered functions.
(5) Values for both integer and handle constants, as well as predefined callbacks. Some of these are arbitrary, while others must be chosen carefully. MPI 4.0 requires that most constants be usable in C for initialization and assignments, but not case statements, which means they need not be compile-time constants. Fortran requires they be compile-time constants, which constrains the C ABI when constants are the same in both languages. Buffer address constants cannot be used for initialization/assignment, while string length constants must be suitable as sizes in array declarations.
MPICH [10] has elected to provide compile-time constants, which is necessary on some operating systems that do not support link-time constants, and works in both C and Fortran. Open MPI [13] does not have compile-time constant predefined handles in C, and has an indirection table from Fortran integer handles to the C ones.
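The difference between compile-time and link-time handle constants can be illustrated with a minimal sketch. All names and values below (demo_int_comm, DEMO_PTR_COMM_WORLD, and so on) are hypothetical and are not taken from any MPI implementation; the struct is defined in the same file only so that the sketch is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Style 1: integer handles whose predefined values are integer constant
 * expressions (compile-time constants). The values are illustrative. */
typedef int demo_int_comm;
#define DEMO_INT_COMM_WORLD ((demo_int_comm)0x44000000)
#define DEMO_INT_COMM_SELF  ((demo_int_comm)0x44000001)

/* Compile-time constants can be checked by the compiler itself. */
static_assert(DEMO_INT_COMM_WORLD != DEMO_INT_COMM_SELF,
              "predefined integer handles are distinct");

/* Style 2: pointer handles whose predefined values are addresses of
 * objects owned by the library (link-time constants). A real library
 * would expose only an incomplete struct in its public header. */
struct demo_comm_s { int size; };
typedef struct demo_comm_s *demo_ptr_comm;
struct demo_comm_s demo_world_object = { 4 };
struct demo_comm_s demo_self_object  = { 1 };
#define DEMO_PTR_COMM_WORLD (&demo_world_object)
#define DEMO_PTR_COMM_SELF  (&demo_self_object)

/* Both styles are usable for initialization and assignment. */
demo_int_comm demo_default_int = DEMO_INT_COMM_WORLD;
demo_ptr_comm demo_default_ptr = DEMO_PTR_COMM_WORLD;

/* Only compile-time constants may appear in case labels. */
int demo_classify_int(demo_int_comm c)
{
    switch (c) {
    case DEMO_INT_COMM_WORLD: return 0;
    case DEMO_INT_COMM_SELF:  return 1;
    default:                  return -1;
    }
}

/* Link-time constants must be compared at run time instead. */
int demo_classify_ptr(demo_ptr_comm c)
{
    if (c == DEMO_PTR_COMM_WORLD) return 0;
    if (c == DEMO_PTR_COMM_SELF)  return 1;
    return -1;
}
```

Replacing the if chain in demo_classify_ptr with a switch would not compile, since addresses are not integer constant expressions; this is the sense in which such handles satisfy the C requirements of MPI 4.0 without being compile-time constants.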

MPI integer types
The types MPI_Aint and MPI_Offset are used to store addresses and file offsets, respectively. MPI_Count was added in MPI-3 for the large-count effort, and this type is required to hold values of MPI_Aint and MPI_Offset, so it is at least as large as these. MPI_Aint is somewhat challenging since it must hold both absolute addresses and relative displacements of pointers, so it is similar to (u)intptr_t and ptrdiff_t from C. However, because it must also work in Fortran as INTEGER(KIND=MPI_ADDRESS_KIND), it must be treated as if it is signed (because Fortran does not support unsigned integers). Another complication is that pointers, addresses, and differences of pointers may not always be the same size. In the past, segmented addressing meant that addresses could be larger than pointers, whereas there are now platforms where the reverse is true, and MPI_Aint must be able to hold a pointer [21] to support struct datatypes, for example.

The status object
This section describes multiple implementations of the MPI_Status object and their history.

New MPICH (MPICH ABI Initiative).
Below is the status object in MPICH, which was made consistent with Intel MPI, in order to establish the MPICH ABI initiative. This meant that applications and libraries compiled against Intel MPI could be run using many implementations.

Old MPICH.
Prior to being consistent with Intel MPI, MPICH had the following status object. This definition included unused fields as a hedge against future needs, but also allowed for platform-specific fields, which meant that MPICH builds on different platforms could be ABI-incompatible.

MPItrampoline.
MPItrampoline defines a status object that holds the three public fields as well as a union of structs equivalent to the status objects of MPICH and Open MPI. This definition is not space efficient but convenient for converting between the trampoline definition and the underlying implementation one, although it stores the public fields redundantly.

[Code listings: typedef struct MPI_Status ... for each implementation]
We see here that all variants have the required fields, MPI_SOURCE, MPI_TAG and MPI_ERROR, and the old MPICH ABI matched the Open MPI ABI in having both at least one bit for the canceled state plus a count field that supports at least 63-bit values. The question for ABI standardization is what sort of hidden fields may need to exist in the future, since there is little to no slack space to add new fields in the current implementations.
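To make the "little to no slack space" observation concrete, the following is a minimal sketch of a status object that stores the three public fields plus a 63-bit count and a canceled bit in two hidden 32-bit words. The type demo_status, its field names, and the bit positions are assumptions chosen for illustration and are not copied from any implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical five-word status: two hidden words, three public fields. */
typedef struct demo_status {
    uint32_t count_lo;               /* hidden: low 32 bits of the count   */
    uint32_t count_hi_and_cancelled; /* hidden: bit 0 = canceled flag,
                                        bits 1..31 = high 31 count bits    */
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
} demo_status;

/* Every bit is accounted for, so a new hidden field cannot be added
 * without changing the size of the struct. */
static_assert(sizeof(demo_status) == 2 * sizeof(uint32_t) + 3 * sizeof(int),
              "status has no padding or spare fields");

/* Store a non-negative count of up to 63 bits, keeping the canceled bit. */
void demo_status_set_count(demo_status *s, int64_t count)
{
    uint64_t u = (uint64_t)count;
    uint32_t cancelled = s->count_hi_and_cancelled & 1u;
    s->count_lo = (uint32_t)(u & 0xffffffffu);
    s->count_hi_and_cancelled = (uint32_t)(((u >> 32) << 1) | cancelled);
}

int64_t demo_status_get_count(const demo_status *s)
{
    uint64_t hi = (uint64_t)(s->count_hi_and_cancelled >> 1);
    return (int64_t)((hi << 32) | s->count_lo);
}

/* Set or clear the canceled flag without disturbing the count. */
void demo_status_set_cancelled(demo_status *s, int flag)
{
    s->count_hi_and_cancelled =
        (s->count_hi_and_cancelled & ~1u) | (flag ? 1u : 0u);
}

int demo_status_get_cancelled(const demo_status *s)
{
    return (int)(s->count_hi_and_cancelled & 1u);
}
```

A layout of this kind is compact, but it leaves no room for future hidden state, which is the trade-off a standard status object has to resolve.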

MPI handle types
MPI datatypes are opaque objects although the constraints on them limit the implementation choices. The MPI standard requires that opaque objects can be compared for equality and inequality. For the C language, this means that they need to have a built-in type, which reasonably only allows integer and pointer types, and excludes union and struct types.
The other important constraint on handles is related to attributes: "Attributes in C are of type void* [...] Attributes are scalar values, equal in size to, or larger than a C-language pointer. Attributes can always hold an MPI handle." Because MPI handles must be able to be held in a type void*, they cannot be larger than a pointer.
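The attribute constraint can be expressed as a compile-time check, sketched below under the assumption of two hypothetical handle representations (demo_int_handle and demo_ptr_handle, neither taken from an implementation). The integer round trip through intptr_t relies on the implementation-defined but widely supported conversion between integers and pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Two hypothetical handle representations. */
typedef int demo_int_handle;
struct demo_obj;                          /* incomplete on purpose */
typedef struct demo_obj *demo_ptr_handle;

/* A handle must fit in an attribute value, i.e., in a void*. */
static_assert(sizeof(demo_int_handle) <= sizeof(void *),
              "integer handle fits in an attribute");
static_assert(sizeof(demo_ptr_handle) <= sizeof(void *),
              "pointer handle fits in an attribute");

/* Store an integer handle in an attribute and recover it. */
void *demo_int_handle_to_attr(demo_int_handle h)
{
    return (void *)(intptr_t)h;
}

demo_int_handle demo_attr_to_int_handle(void *attr)
{
    return (demo_int_handle)(intptr_t)attr;
}

/* Pointer handles convert to and from void* directly. */
void *demo_ptr_handle_to_attr(demo_ptr_handle h)
{
    return (void *)h;
}

demo_ptr_handle demo_attr_to_ptr_handle(void *attr)
{
    return (demo_ptr_handle)attr;
}
```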
Since Fortran only supports signed integers, and older versions of C provide a limited set of integer types, one can expect implementations to use a 32-bit integer, a 64-bit integer, or a pointer for handles, although an 8- or 16-bit integer would be permitted. We see that MPICH uses a C int (32 bits on all supported platforms) and Open MPI uses incomplete struct pointers. The utility of incomplete struct pointers is that they allow for compiler type checking. That is, MPI_Comm and MPI_Group, for example, are recognizable as different types and the compiler can issue warnings about invalid handle arguments. On the other hand, the MPICH design allows for zero-overhead conversion between C and Fortran, as well as the encoding of information in the handle values themselves. Open MPI does not utilize this capability since handles to C objects are not compile-time constants.
The MPICH datatype handles reveal how information is encoded within them. These handles encode the size of built-in datatypes, which can be queried trivially with this macro:

    #define MPIR_Datatype_get_basic_size(a) (((a)&0x0000ff00)>>8)

There are other macros that take advantage of the hidden structure of the MPI_Datatype handle that the reader can study in mpir_datatype.h.
Open MPI's mpi.h defines the datatype handle to be a pointer to an incomplete struct, which is resolved externally at link-time. The definition of the structure is only visible when building the MPI library itself; otherwise, the compiler only knows its name. This means that the data pointed to by a handle need not be the same at runtime, because the MPI application or library does not depend on it. The runtime cost of querying handles is different in Open MPI relative to MPICH. Open MPI has to look up the size of the datatype inside of a 352-byte struct, which is not a concerning overhead since the type of MPI code that will notice such an overhead is going to pass the same datatype over and over, in which case the CPU is going to cache and correctly branch-predict the lookup and associated use every time.
    static inline int32_t opal_datatype_type_size(const opal_datatype_t *pData, size_t *size)
    {
        *size = pData->size;
        return 0;
    }

Wi4MPI defines all the opaque handles to be size_t. This ensures they are at least as large as MPICH's int handles and Open MPI's pointer handles on most platforms (technically, intptr_t must be used for this to be strictly true but the exceptions are obscure [43]).
Wi4MPI defines the built-in datatypes to be sequential integers, which means they are not attempting to encode useful information the way MPICH does, although they are compile-time constants, unlike Open MPI.

Analysis.
There are advantages to both approaches. MPICH optimizes for the common case of built-in types, and does a lookup for others, while Open MPI always performs a pointer lookup, but then has what it needs in both cases.
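The two size-query strategies can be contrasted in a small sketch. The macro DEMO_INT_TYPE_BASIC_SIZE uses the same bit layout as the MPIR_Datatype_get_basic_size macro quoted earlier; every other name, the handle values, and the one-field descriptor struct are hypothetical and chosen only to be consistent with that layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Strategy 1: integer handles with the size stored in bits 8..15. */
typedef int demo_int_type;
#define DEMO_INT_TYPE_CHAR   ((demo_int_type)0x4c000101)
#define DEMO_INT_TYPE_INT    ((demo_int_type)0x4c000405)
#define DEMO_INT_TYPE_DOUBLE ((demo_int_type)0x4c00080b)
#define DEMO_INT_TYPE_BASIC_SIZE(a) (((a) & 0x0000ff00) >> 8)

/* The encoded sizes are only valid if they match the platform types. */
static_assert(DEMO_INT_TYPE_BASIC_SIZE(DEMO_INT_TYPE_INT) == (int)sizeof(int),
              "encoded size of int matches the platform");
static_assert(DEMO_INT_TYPE_BASIC_SIZE(DEMO_INT_TYPE_DOUBLE) == (int)sizeof(double),
              "encoded size of double matches the platform");

/* No memory access is needed: the answer is in the handle itself. */
size_t demo_int_type_size(demo_int_type t)
{
    return (size_t)DEMO_INT_TYPE_BASIC_SIZE(t);
}

/* Strategy 2: pointer handles that refer to a descriptor object. */
typedef struct demo_type_desc {
    size_t size; /* a real descriptor would carry many more fields */
} demo_type_desc;
typedef const demo_type_desc *demo_ptr_type;

const demo_type_desc demo_desc_char   = { sizeof(char) };
const demo_type_desc demo_desc_int    = { sizeof(int) };
const demo_type_desc demo_desc_double = { sizeof(double) };
#define DEMO_PTR_TYPE_CHAR   (&demo_desc_char)
#define DEMO_PTR_TYPE_INT    (&demo_desc_int)
#define DEMO_PTR_TYPE_DOUBLE (&demo_desc_double)

/* One load through the handle, for built-in and derived types alike. */
int32_t demo_ptr_type_size(demo_ptr_type t, size_t *size)
{
    *size = t->size;
    return 0;
}
```

The integer strategy answers the query for built-in types without touching memory, while the pointer strategy pays one load but treats every datatype uniformly.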
The other advantage of the MPICH approach is with Fortran. In Fortran, handles are INTEGER or a type with a single member that is an INTEGER. MPICH conversions between C and Fortran are trivial. Open MPI has to maintain a lookup table to map Fortran handles to C objects.
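A minimal sketch of the two conversion styles follows. The type demo_fint stands in for a Fortran INTEGER, and the fixed-size table with a linear search is a deliberate simplification; none of the names correspond to functions in MPICH or Open MPI.

```c
#include <assert.h>
#include <stddef.h>

typedef int demo_fint; /* stand-in for a default Fortran INTEGER */

/* Integer handles: the C and Fortran values are the same number. */
typedef int demo_int_comm;
static_assert(sizeof(demo_int_comm) == sizeof(demo_fint),
              "integer handle fits a Fortran INTEGER");

demo_fint demo_int_comm_c2f(demo_int_comm c) { return (demo_fint)c; }
demo_int_comm demo_int_comm_f2c(demo_fint f) { return (demo_int_comm)f; }

/* Pointer handles: Fortran sees an index into a table of C pointers. */
struct demo_comm_s { int id; };
typedef struct demo_comm_s *demo_ptr_comm;

#define DEMO_TABLE_SIZE 8
static demo_ptr_comm demo_f2c_table[DEMO_TABLE_SIZE];
static int demo_table_used = 0;

/* Return the Fortran index for a C handle, registering it if needed.
 * Returns -1 when the (hypothetical) fixed-size table is full. */
demo_fint demo_ptr_comm_c2f(demo_ptr_comm c)
{
    for (int i = 0; i < demo_table_used; i++) {
        if (demo_f2c_table[i] == c) {
            return (demo_fint)i;
        }
    }
    if (demo_table_used == DEMO_TABLE_SIZE) {
        return -1;
    }
    demo_f2c_table[demo_table_used] = c;
    return (demo_fint)demo_table_used++;
}

/* Map a Fortran index back to the C handle, or NULL if it is unknown. */
demo_ptr_comm demo_ptr_comm_f2c(demo_fint f)
{
    if (f < 0 || f >= demo_table_used) {
        return NULL;
    }
    return demo_f2c_table[f];
}
```

The integer conversion compiles to nothing, whereas the pointer conversion needs shared mutable state, which is the maintenance and runtime cost referred to above.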
An advantage of the Open MPI approach of using pointer types to represent opaque types is increased type safety. This enables the compiler to flag type mismatches, e.g., when an MPI_Comm and an MPI_Datatype argument have accidentally been swapped.

Functions
Function prototypes in MPI follow naturally from the definitions of their arguments, which are either opaque handles, MPI integer types, or intrinsic language types. What is essential for ABI purposes is that the calling convention be fixed. This can be done by specifying the aforementioned types and defining the calling convention to be "as if" compiled by the platform C compiler. In most cases, all of the C compilers on a given platform share a calling convention but there are at least historical cases where this was not true. As long as the MPI library uses the platform C compiler calling convention, it will be compatible with libraries and applications built with it, or another compatible compiler.

ECOSYSTEM IMPACT
One of the main motivations for an ABI is the ability to simplify the end user's life, thus improving the usability of the various MPI implementations through standardization. In this section, we detail particular points of interest for the community which would directly benefit from the availability of an ABI.

Python
The Python language provides MPI bindings through the mpi4py package [9]. mpi4py uses Cython [2], a super-set of the Python language with C extensions. The Cython compiler generates C code calling into the Python C-API and the MPI C-API. The wrapper C code has to be compiled and linked against a specific MPI implementation to generate a Python extension module.
The lack of a standardized MPI ABI presents several drawbacks. The mpi4py testing infrastructure built on publicly available services like GitHub Actions and Azure Pipelines requires adding both MPICH and Open MPI to the build matrix, effectively duplicating the required resources for running continuous integration. The mpi4py maintainers cannot distribute pre-built binary Python wheels via the Python Package Index, effectively forcing end users to set up a working C and Python development environment and build mpi4py from a source distribution. The conda-forge [8] project somewhat alleviates these issues by featuring the conda package manager and its ability to install different variants of pre-built binaries in user-defined non-system locations. Nonetheless, the lack of a standardized MPI ABI prevents conda-forge binaries from using MPI implementations that are not ABI-compatible with either MPICH or Open MPI. In addition, conda-forge also suffers from the doubling of required resources to generate binaries for every downstream application or library using MPI.
A standardized MPI ABI would allow mpi4py to explore alternative implementations based on the runtime loading of dynamic/shared libraries and C foreign function interface (FFI) mechanisms. Such an approach would circumvent the generation of platform-specific binaries, allowing any pure Python code to access the MPI library and its features in a platform-agnostic way.

Julia
The Julia language provides MPI bindings through the MPI.jl package [5]. Unlike Python, Julia does not make use of a C compiler to call into external libraries. Instead, the user provides the types of the function signature to the ccall command. To support this, the developers of MPI.jl had to (a) define the constants and type definitions for each MPI ABI, (b) develop heuristics to detect which ABI a particular MPI library is using, and (c) provide a mechanism to switch between the ABI definitions, invalidating Julia's cache of pre-compiled code. This code has been a significant source of issues, hampering its usability and requiring significant engineering effort on the part of its volunteer maintainers.
A key benefit of a standardized ABI will be making it easier to provide downstream binaries. The Julia package manager provides prebuilt binaries of many MPI-enabled libraries, such as ADIOS2, HYPRE, P4est, and PETSc, but the support is rather cumbersome, especially when users wish to use non-bundled MPI implementations. The ABI would boost usability, especially for the long tail of users on lower-end systems.

Rust
The Rust programming language provides MPI bindings through the libraries in the rsmpi project [38]. rsmpi combines a thin static library that re-exports underspecified identifiers (providing symbols where MPI implementations are allowed to use macros) and uses bindgen to create the raw Rust interface. bindgen relies on libclang to parse the header (of the thin library and mpi.h) to generate Rust bindings that conform to the C ABI of the MPI implementation at hand. This approach works for any compliant MPI and does not require tedious definition or maintenance of stubs. The disadvantages are many and include long initial compilation times due to having to fetch and build the dozens of dependencies of bindgen, a need for pre-installed libclang, long testing times from building against multiple different MPI implementations, and the need for users to rebuild to pick a different MPI implementation.
Rust is known for reliable package management and tooling for binary distribution, including cross-compilation (across OS and ISA). A standardized MPI ABI would allow rsmpi to provide a thin and stable Rust binding that can be built without any dependencies and simply links against a dynamic library on the target platform. For example, a single CI/CD job for an application could publish binaries for macOS (x86-64 and ARM), Android, Windows, and Linux (x86-64, POWER, and ARM) without needing to think about MPI idiosyncrasies. ABI stability would make it easier to support more MPI features and would also enable low-level idiomatic Rust features that improve safety and static analysis for C FFI bindings, but that are hard to incorporate into the current bindgen approach.

Fortran
Currently, all implementations of MPI Fortran wrappers are integrated with MPI implementations. Vapaa [20] is the first attempt to write the MPI Fortran 2008 interface as a standalone project, based on calling the C interface, without any use of the internal state. Because Fortran constants must be compile-time constants, not just link-time constants, when Open MPI is used, Fortran interfaces must define their own set of constants and translate them to the C ones at runtime. Furthermore, to handle status objects, the status ABI must be known. Thus, Vapaa ends up implementing its own ABI and translating all constants and status objects. A standard ABI would simplify the translation process and eliminate the need for implementation-specific status handling. If the Fortran 2008 interface had the type of MPI_VAL equivalent to handle types in C, then no translation would be necessary. However, this would be both an ABI and an API change for mpi_f08.mod; it also offers nothing to users of older MPI Fortran interfaces.
In addition, due to differences in name mangling among Fortran compilers, running a Fortran program that calls MPI functions inside a container can result in linker dependencies that are not fixed and depend on the specific Fortran module and compiler convention used. This prevents an MPI Fortran program from having its MPI implementation replaced through interposition (i.e., LD_PRELOAD). Having an external Fortran implementation that relies on the ABI would enable the static linking of the Fortran adaptation layer in the target binary, abstracting away from these language-dependent variations and restoring ABI compatibility.

Packaging
The availability of an ABI is highly important for packaging MPI applications. MPI is a fundamental package for most scientific software. However, there are several libraries that provide MPI support, including vendor-specific ones. Therefore, building a binary for MPI can become cumbersome when managing a long chain of dependencies between packages, resulting in repetitive building. While the ABI alone is insufficient to solve packaging dependency issues, it is a significant step in the right direction. When running an executable, the loader is responsible for locating various dynamic shared objects (DSOs) to fulfill execution dependencies, based on the system's configuration and environment (e.g., LD_LIBRARY_PATH and search paths). Thus, the ABI alone has no impact on the loading of libraries when running a program. Nonetheless, it is still a crucial step towards achieving drop-in replacement for MPI, which involves changing the MPI of a given binary. This goal can be attained by defining a common library naming scheme or by developing specific stub libraries in charge of bridging implementations, with the binary being linked to the stubs.
Linux package managers such as APT and RPM ship binaries for two dependency chains, with packages like hdf5-openmpi and hdf5-mpich. APT manages these through the /etc/alternatives mechanism while RPM (Fedora) deliberately rejects that usage due to their rule [12]: "If a non-root user would gain value by switching between the variants then alternatives MUST NOT be used." As such, Debian/Ubuntu users developing code that depends on an MPI-enabled HDF5 can have a default implementation (that might change unexpectedly at the whims of their sysadmin) while Fedora/Red Hat users must use verbose paths unlikely to be found by configure scripts. In this example, Arch Linux provides hdf5-openmpi in the default repository (binary distribution) while hdf5-mpich (and mpich itself) are in the user repository and must be installed from source. Homebrew provides only hdf5-mpi, which uses Open MPI. This complexity requires maintenance and communication to users, increases testing time and frequency of bugs, and harms reproducibility. With an ABI, there could be a single hdf5-mpi package, with mpiexec or ldconfig/LD_LIBRARY_PATH (since these distributions eschew RPATH) determining which MPI to execute with.
One possibility for retargeting binaries is to change the RPATHs embedded in the executable. Spack [14], which relies extensively on RPATHs, has implemented such rewriting techniques [44] when deploying binary packages to systems, relocating the RPATHs through carefully planned compilation and clever rewriting techniques. This approach was required because end users may have deployed their Spack tree at a different location from where it was initially compiled. By doing so, Spack manages to restructure complex dependency chains; the process is analogous to what would be required to change the runtime of a given MPI-enabled binary. The Anaconda [27] software distribution and its conda package manager also rely on RPATH rewriting to allow binary relocation.

Testing
The advantage of the ABI support for testing is less direct. Indeed, when running a program with a given MPI, the MPI is also part of the equation to be validated. Gains could be envisioned in the building of the test cases, but it is not obvious that MPI implementations will be directly interoperable even in the event of a unified ABI. Indeed, as for packaging, the build chain for MPI involves compiler wrappers, and implementations are free to choose the name of the MPI Dynamic Shared Object (DSO), which prevents drop-in replacement of MPI. Overall, for testing, the ability to retarget MPI programs to another implementation requires more than an ABI: either a clearly defined object layout for MPI or a dedicated redirection layer similar to what is provided in "trampoline" interfaces [30,35]. Analogous dependencies on the naming of the DSO are also present for containers as discussed in the next section.

Containers
Containers are an abstraction built on Linux namespaces: they are the systematic use of such namespaces to run "software images" in a custom environment. The main advantage of containers is the ability to move images around systems to avoid recreating complex software environments. In HPC networks, it is common to use OS bypass techniques to optimize network performance. This involves creating a separate networking layer that operates independently of the operating system, allowing for faster and more efficient communication between nodes. As a result, the networking namespace (IP bound) is not commonly used in HPC networks, which prefer faster fabrics than TCP/IP. Similarly, to prevent security issues such as privilege escalation, the user namespace, which allows mimicking root behavior inside the container, is not used for containers in non-virtualized environments [7]. Overall, the namespace leveraged by HPC container runtimes such as Singularity and Shifter is the mount namespace.
Using the mount namespace, it is possible to change the mount point seen by the running program. In HPC, the user's home is often bind-mounted inside the container, stacking the container's view of the file system atop the preexisting shared one. With a containerized program compiled against MPI, the corresponding MPI is likely present in the container image. This MPI has to be compatible with a wide range of interconnects, unlike the host MPI from the system, since it cannot anticipate its target environment. While this can be mitigated with communication libraries such as libfabric [16], which manage to unify high-speed network interfaces, there may be features that are not possible outside of the native MPI environment, perhaps because they are proprietary and not generally available, or because they require system awareness (e.g., network topology information) that cannot be included in the widely distributed implementations of MPI.
On this aspect, using the host MPI (as opposed to a container MPI) would allow the guest binary to take advantage of all the custom features of the system. It also obviates the need for application containers to redistribute MPI at all. For this purpose, approaches such as e4s-cl [37] recursively locate all dependencies of MPI and inject them into the container. The target binary may depend indirectly on libraries such as hwloc that are required by MPI. There are ways to mitigate this issue, like embedding all dependencies or using symbol versioning for standard HPC libraries to reduce possibilities for symbol conflicts. Another method is to ensure that MPI shared libraries do not cause transitive dependencies so that the binary only requires MPI, and the MPI implementation takes care of its dependencies directly. A second challenge faced by the MPI container is launching the application. Indeed, MPI is in many cases relying on the Process Management Interface (PMI) to wire up its processes, which also has to be mapped into the container. There have been studies on this point as part of a complete rework of this interface in the PMIx standard [6], enabling PMI portability.
To summarize, containers need support from MPI to allow binary retargeting, i.e., the ability to change the MPI implementation on a binary compiled against another MPI implementation. Note that changing the guest MPI to the host MPI also allows PMI disambiguation, removing complexity on the launch side. Having an ABI is compulsory as retargeting does not allow recompilation of the application. This last point is the main blocker for accepting and distributing MPI containers.

Performance and Debugging Tools
MPI tools often use the profiling interface (PMPI) to intercept function calls and extract the current state of MPI and to time operations, for performance and debugging purposes [25,28]. Since this interception operates on the compiled library code, all MPI tools must be compiled against the relevant implementation ABIs. A standard ABI makes it possible for PMPI interposition tools to be compiled only once and reused with different MPI implementations.
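The interposition pattern can be sketched without an MPI installation by using stand-ins. In the sketch below, demo_comm, demo_PMPI_Barrier, and demo_MPI_Barrier are hypothetical placeholders: in a real tool the wrapper would be named MPI_Barrier, would forward to PMPI_Barrier, and would take its types from mpi.h.

```c
#include <assert.h>

/* Stand-in for the implementation's communicator handle type. A tool
 * built against this definition only works with libraries that use it. */
typedef int demo_comm;
static_assert(sizeof(demo_comm) == sizeof(int),
              "the wrapper signature depends on the handle representation");

/* Stand-in for the library entry point reached through the PMPI name. */
static long demo_library_calls = 0;

int demo_PMPI_Barrier(demo_comm comm)
{
    (void)comm;
    demo_library_calls++;
    return 0; /* success */
}

/* The tool's wrapper: record the call, then forward it unchanged. */
static long demo_tool_barrier_count = 0;

int demo_MPI_Barrier(demo_comm comm)
{
    demo_tool_barrier_count++;
    return demo_PMPI_Barrier(comm);
}

long demo_tool_get_barrier_count(void) { return demo_tool_barrier_count; }
long demo_library_get_calls(void)      { return demo_library_calls; }
```

Because the wrapper's argument types are baked in at compile time, changing demo_comm from an int to a pointer would require rebuilding the tool, which is the recompilation burden a standard ABI removes.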
The Tools working group in the MPI Forum is working on the QMPI interface [11]. This interface is designed to support multi-instrumentation, following the approach pioneered in [36]. As with PMPI, the absence of a standard ABI requires each QMPI tool to be compiled for every ABI, and potentially more, if any of these tools modify ABI-related properties of the interface. One of the advantages of the proposed status object for the standard ABI is that it has additional space that allows tools to hide state in the reserved fields. Managing this in a layered context is complicated and is left as an exercise for the implementers of such tools.

PROPOSAL
This section outlines the current proposal for the MPI standard ABI, based upon detailed analysis of requirements from MPI as well as the behavior of platforms MPI should support. Following Section 3, the ABI proposal defines MPI integer types, the status object, opaque handles, and constant values. The calling convention must be equivalent to the platform C compiler.

MPI integer types
The purpose of MPI_Aint is to hold addresses or pointers, whichever is larger, because its usage requires both. It should also be signed because Fortran does not support unsigned integers. The only standard C type that meets this requirement is intptr_t. It is necessary to use this integer type on platforms with so-called "wide pointers" [43], although this situation is rare. There is no C integer type associated with filesystem offsets, but all modern systems should use at least 64-bit integers. There are some platforms where the underlying filesystem offset may be 128 bits, but there is no need for MPI to define MPI_Offset this way since MPI files greater than 8 EiB are unlikely. Additionally, 128-bit integers are not implemented natively on most systems and thus may perform poorly, so it is undesirable to force the use of 128-bit integers for offset and count to support impossibly large files. On the other hand, most systems with 32-bit addressing have 64-bit filesystems, so there are at least some scenarios where the MPI ABI should be flexible enough to support different sizes of address and offset types.
To ensure all relevant target platforms can be supported, the MPI ABI should be described in terms of the size of MPI_Aint and MPI_Offset, while MPI_Count matches the larger of these two (which will be MPI_Offset on most systems). The integer sizes of the MPI ABI can be denoted $A_{\mathit{Address}}O_{\mathit{Offset}}$, to denote the number of bits in the MPI_Aint and MPI_Offset types, respectively. This is similar to how platform ABIs are described using I L LL P notation, to denote the size of C int, long, long long, and void*, respectively. For example, modern Linux platforms are described as LP64, meaning that long and void* are 64-bit. Today, essentially all MPI ABIs are A32O64 or A64O64 ABIs, because we have 32- or 64-bit addresses, but most filesystems are 64-bit. An A64O128 ABI is possible, although, for the aforementioned reasons, it is neither necessary nor desirable.
The potential for more than one MPI ABI on a given platform is undesirable. Current trends in filesystem technology suggest that an MPI_Offset larger than 64 bits will not be necessary for at least 20 years. For these reasons, we propose to prescribe the MPI ABI for platforms with 32- and 64-bit pointers as follows:

    typedef intptr_t MPI_Aint;
    typedef int64_t MPI_Offset;
    typedef int64_t MPI_Count;

This ABI definition covers essentially all relevant platforms since the introduction of LFS [1] until the availability of filesystems far in excess of 8 exabibytes. These types are part of C99 and C++11, but implementations can use older equivalents for compiler portability.
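As a minimal sketch of how the A/O notation follows from these typedefs, the C11 fragment below restates them under hypothetical abi_ names (chosen to avoid clashing with a real mpi.h), checks the relationship between the sizes at compile time, and formats the resulting ABI label. The helper abi_label is an illustration only and is not part of the proposal.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names mirroring the proposed typedefs. */
typedef intptr_t abi_aint;   /* stands in for MPI_Aint   */
typedef int64_t  abi_offset; /* stands in for MPI_Offset */
typedef int64_t  abi_count;  /* stands in for MPI_Count  */

/* The count type must be at least as wide as the other two. This fails to
 * compile on a platform with pointers wider than 64 bits, which matches
 * the scope of the proposal (A32O64 and A64O64 only). */
_Static_assert(sizeof(abi_count) >= sizeof(abi_aint), "count narrower than address");
_Static_assert(sizeof(abi_count) >= sizeof(abi_offset), "count narrower than offset");

/* Writes a label such as "A64O64" into buf and returns the value of
 * snprintf, i.e. the label length when buf is large enough. */
int abi_label(char *buf, size_t len)
{
    return snprintf(buf, len, "A%zuO%zu",
                    sizeof(abi_aint) * CHAR_BIT,
                    sizeof(abi_offset) * CHAR_BIT);
}
```

On an LP64 Linux system the label is A64O64; on a platform with 32-bit pointers it is A32O64.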
One may observe that intptr_t is optional in C and, in theory, a system may lack an integer type capable of satisfying its requirements. This is uncommon, and exists to accommodate systems with 128-bit pointers but where supporting intptr_t would force a change in intmax_t, which would be a breaking change in the platform ABI [17]. We note that MPICH requires intptr_t and platforms that do not provide it are not supported, so a reasonable portion of the MPI ecosystem is unconcerned with this situation.
While we have considered the case of 128-bit pointers, the current proposal will only include A32O64 and A64O64. It is appropriate for the MPI community to gain more experience with such platforms before attempting to standardize for them. For example, CHERI [42] has 128-bit pointers but does not necessarily require 128-bit file offsets. However, if MPI_Aint and therefore MPI_Count have to be 128 bits wide, it might be prudent to make offsets the same width, so that there is only one MPI ABI for all 128-bit platforms.
The one MPI integer that cannot be prescribed like the others is MPI_Fint, since this corresponds to Fortran INTEGER, which is not fixed, but varies as a function of Fortran compiler flags. It seems appropriate to have a runtime query to allow C code to know the size of a Fortran integer and work with it appropriately. This requires code changes compared to the current situation where MPI_Fint is known at compile-time, but the C code that relies on this is rare. Alternatively, the standard ABI could force MPI_Fint to be a C int, and disallow MPI Fortran interfaces from supporting larger integer sizes. This would please Fortran purists who loathe the compiler feature that allows changing the Fortran default integer size, but displease users of existing implementations that support it.
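If the size of a Fortran INTEGER is only known at runtime, C code that handles such values has to branch on that size instead of relying on a compile-time type. The sketch below assumes that some runtime query has already returned the size in bytes (the parameter fint_size); the function name load_fint is ours and is not a proposed MPI interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Loads one Fortran INTEGER of fint_size bytes from p and widens it to
 * int64_t. Returns 0 on success and -1 for an unsupported size. */
int load_fint(const void *p, size_t fint_size, int64_t *out)
{
    if (fint_size == sizeof(int32_t)) {
        int32_t v;
        memcpy(&v, p, sizeof v);  /* memcpy sidesteps alignment and aliasing issues */
        *out = v;
        return 0;
    }
    if (fint_size == sizeof(int64_t)) {
        int64_t v;
        memcpy(&v, p, sizeof v);
        *out = v;
        return 0;
    }
    return -1;
}
```

Widening to a single 64-bit type keeps the rest of the C code independent of the Fortran compiler flags, at the cost of one branch per access.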

The status object
The proposed standard status object is:

    typedef struct MPI_Status {
        int MPI_SOURCE;
        int MPI_TAG;
        int MPI_ERROR;
        int reserved[5];
    } MPI_Status;
This object is 32 bytes in size, which leads to good alignment when arrays of statuses are used, and provides at least two more fields than current implementations use.
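The size claim can be checked mechanically. The following sketch uses a stand-in struct with the three public fields that the MPI standard requires (MPI_SOURCE, MPI_TAG, MPI_ERROR; their order here is an assumption) followed by five reserved integers. The helpers tool_stash and tool_fetch are hypothetical and only illustrate how a tool might keep state in a reserved slot, assuming the implementation leaves that slot unused.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the proposed status object; field order is an assumption. */
typedef struct abi_status {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
    int reserved[5];
} abi_status;

/* Eight ints and no padding: 32 bytes wherever int is 4 bytes wide. */
_Static_assert(sizeof(abi_status) == 8 * sizeof(int), "unexpected padding");

/* Hypothetical tool helpers: store and read back one int of tool state
 * in the last reserved slot. */
void tool_stash(abi_status *s, int token) { s->reserved[4] = token; }
int tool_fetch(const abi_status *s) { return s->reserved[4]; }
```

Which reserved slots an implementation actually uses is not specified here, so a real tool would need an agreed convention before writing to any of them.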

Handle types
In order to have type-safety in handles, incomplete struct pointers are proposed; Open MPI has used this design and its properties are well understood. The incomplete struct name will become part of the ABI, so that compiler warning messages are clear:

    typedef struct MPI_ABI_Comm *MPI_Comm;
    typedef struct MPI_ABI_Request *MPI_Request;
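The type-safety argument can be illustrated with a small sketch. It declares handles under hypothetical abi_ names as pointers to incomplete structs and a function that accepts only communicator handles. The constant ABI_COMM_NULL_SKETCH is an invented value for illustration and is not taken from the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Incomplete struct types: never defined, only pointed to. */
typedef struct MPI_ABI_Comm    *abi_comm;
typedef struct MPI_ABI_Request *abi_request;

/* Hypothetical predefined value, for illustration only. */
#define ABI_COMM_NULL_SKETCH ((abi_comm)(uintptr_t)0x100)

/* Accepts only communicator handles. */
int comm_is_null(abi_comm c) { return c == ABI_COMM_NULL_SKETCH; }

/* Passing a request where a communicator is expected is diagnosed by the
 * compiler, and the message names struct MPI_ABI_Request:
 *
 *     abi_request r = 0;
 *     comm_is_null(r);   // incompatible pointer type
 *
 * With plain integer handles the same mistake compiles silently.
 */
```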

Constants
Constants in MPI come in different forms. They include:
• Error codes, which start with MPI_SUCCESS=0.
• Buffer address constants, e.g., MPI_BOTTOM, which must have special values distinguishable from user buffers.
• Handle constants.
• Integer constants that must have special values to avoid conflicts; for example, MPI_ANY_SOURCE can never be a valid rank, and thus should be a negative number.
• Integer constants that must be powers of two, to support combination using bitwise OR.
• Integer constants that correspond to string lengths.
• Integer constants that can have any value.
• Predefined attribute callback functions.
One desirable property raised by users and implementers is uniqueness of integer constants, so that errors can be identified precisely. For example, if a user passes MPI_ANY_TAG as a rank, this can be identified precisely if the constant value is unique with respect to all other constants, especially MPI_ANY_SOURCE. Another desirable property is the ability to encode information in handle constants, as MPICH does. For maximum portability, integer constants cannot be larger than 32767, because that is the largest value of type int guaranteed by the C standard. This constraint is strictly academic for the relevant systems, but there was no reason to violate it either.
For handle constants, the working group discussed designs with and without unique values as well as the use of one or more lookup tables versus a Huffman code. The current proposal uses a Huffman code but is sufficiently compact so as to require a relatively small lookup table, for implementations that choose to use one. The Huffman code uses 10 bits and therefore fits into the zero page of common operating systems; as a result, implementations that allocate user handles from the heap need not verify that they do not conflict with predefined constants.
As datatypes make up the majority of MPI's predefined handles, half of the Huffman code bits are reserved for datatypes, although fewer than 100 values are used. The language, numerical properties, and sizes of all fixed-size types are encoded in the handles. For example, MPI_CHAR can be determined to be a 1-byte type immediately. Unfortunately, MPI_INT is not a fixed-size type, so its size is not encoded, as that would mean that the constant value was a function of the platform ABI. While it would be possible for some use cases to handle this, it is undesirable to force higher-level languages like Julia to determine the platform ABI in order to use MPI.
Other handles can be decoded quickly using the bit pattern alone. The value zero is always an invalid handle, which allows uninitialized handles to be detected as errors instead of being confused with legal null handles. Legal null handles use the non-zero bits of the handle kind followed by zeros. The current Huffman code has a sufficient amount of free space to allow for many new handle types and new handle constants for existing types to be added, without requiring breaking changes.
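Two of the properties above (zero is never valid, and every predefined constant fits in 10 bits) translate into very cheap checks. The sketch below uses a hypothetical abi_comm handle type and function names of our own. It assumes, as the text does, that the first page of the address space is never handed out by the allocator.

```c
#include <assert.h>
#include <stdint.h>

typedef struct MPI_ABI_Comm *abi_comm;  /* hypothetical stand-in handle type */

enum { ABI_HANDLE_BITS = 10 };          /* width of the Huffman code */

/* Zero is never a valid handle, so a zero-initialized handle is caught. */
int handle_is_invalid(abi_comm h) { return (uintptr_t)h == 0; }

/* Predefined constants occupy values 1..1023. Pointers returned by the
 * allocator cannot fall in that range when the zero page is unmapped, so
 * this test separates predefined handles from user-created ones. */
int handle_is_predefined(abi_comm h)
{
    uintptr_t v = (uintptr_t)h;
    return v != 0 && v < ((uintptr_t)1 << ABI_HANDLE_BITS);
}
```

An implementation that stores user handles as heap pointers could therefore dispatch on handle_is_predefined first and dereference only in the other case.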
The values of integer constants for string lengths, e.g., MPI_MAX_LIBRARY_VERSION_STRING, and constants that can be combined with bitwise OR, e.g., MPI_MODE_NOCHECK, are not particularly interesting. For the former, the largest known values used in existing implementations were chosen. There was some concern that stack allocation of 8192 bytes could be a problem, but (1) nothing prevents users from allocating such strings on the heap and (2) no issues with this value (used by MPICH) have ever been reported.
The other integer constants are unique negative numbers, which means that an implementation can tell the user by name which constant they passed when the user passes an incorrect one.
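The benefit of unique negative values can be sketched as follows. The numeric values below are hypothetical (they are unique and negative, as the proposal requires, but are not quoted from it), and both function names are ours.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical values, for illustration only. */
enum {
    SK_ANY_SOURCE = -1,
    SK_ANY_TAG    = -2,
    SK_PROC_NULL  = -3
};

/* Returns the name of a special integer constant, or NULL if v is not one. */
const char *special_constant_name(int v)
{
    switch (v) {
    case SK_ANY_SOURCE: return "MPI_ANY_SOURCE";
    case SK_ANY_TAG:    return "MPI_ANY_TAG";
    case SK_PROC_NULL:  return "MPI_PROC_NULL";
    default:            return NULL;
    }
}

/* The source argument of a receive may be a nonnegative rank,
 * MPI_ANY_SOURCE, or MPI_PROC_NULL. Anything else is an error that can be
 * reported by name using special_constant_name. */
int source_arg_is_valid(int source)
{
    return source >= 0 || source == SK_ANY_SOURCE || source == SK_PROC_NULL;
}
```

If two constants shared a value, as is permitted when uniqueness is not required, the lookup could not tell a wrongly passed MPI_ANY_TAG from a legitimate MPI_ANY_SOURCE.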
For simplicity, predefined attribute callbacks were set to 0x0 for MPI_XXX_NULL_COPY_FN and MPI_XXX_NULL_DELETE_FN, and 0xD for MPI_XXX_DUP_FN. Since compilers can detect incompatible function pointer arguments, there is little need to detect errors at runtime.
The encoding of operation handles is provided in Appendix A.1. The gaps in the ranges for the different operation types are intentional since they provide room for future extensions. Moreover, the modified Huffman encoding enables fast error checking by implementations, simply by applying a bitmask.
Handles for opaque objects are encoded in a similar way, as shown in Appendix A.2. The encoding leaves room for future extensions for each handle type, making it possible to add new handles without requiring special case handling.
Examples of datatype handles are provided in Appendix A.3. Types with variable size (e.g., C int, float) are encoded with the prefix 0b1000XXXXXX. Fixed-size types are encoded with the prefix 0b1001XXXXXX, with the base-2 logarithm of the size encoded in the bits at positions 4-6. For example, types with size 1 are encoded with prefix 0b1001000XXX (e.g., MPI_BYTE with 0b1001000111 and size $2^{000_2} = 2^0$), while types with size 4 are encoded with prefix 0b1001010XXX (e.g., MPI_INT32_T with 0b1001010000 and size $2^{010_2} = 2^2$).
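The decoding described here amounts to two shifts and a mask. The sketch below applies it to the two bit patterns quoted in the text; the macro and function names are ours, and the handles are treated as plain integers for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Bit patterns quoted in the text, written in hex because binary
 * literals are not available before C23. */
#define SK_BYTE    0x247u  /* 0b1001000111 */
#define SK_INT32_T 0x250u  /* 0b1001010000 */

/* The top four of the ten bits select the datatype class. */
int dt_is_fixed_size(uintptr_t h)    { return (h >> 6) == 0x9u; }  /* 0b1001 */
int dt_is_variable_size(uintptr_t h) { return (h >> 6) == 0x8u; }  /* 0b1000 */

/* For fixed-size types, the next three bits hold log2 of the size in
 * bytes. Returns 0 when the handle does not encode a size. */
unsigned dt_fixed_size_bytes(uintptr_t h)
{
    if (!dt_is_fixed_size(h))
        return 0;
    return 1u << ((h >> 3) & 0x7u);
}
```

For a variable-size type such as MPI_INT the function returns 0, and the caller has to fall back to a platform-dependent lookup.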
The full definition of the Huffman code for handle constants can be found in [18], while the other constants are listed in [19].

EXPERIMENTS
In this section, we present three experiments regarding the implementation of the standard ABI. First, we measure the performance impact of different ABIs for querying the size of a type. Second, we measure the message rate for MPICH-based implementations, with and without standard ABI support. Third, we describe Mukautuva, which demonstrates the feasibility of implementing the standard ABI outside of any existing implementation. Finally, we mention the effort required to implement the standard ABI in MPICH. For both implementations (the one outside of an MPI implementation and the one within MPICH), we see that the cost of ABI translation is small.

Performance
Historically, there has been a performance argument in favor of MPICH's integer handles for datatypes because information like type size is encoded directly in handles, whereas with Open MPI, it must be fetched from the internal state. Measured by throughput, a call to MPI_Type_size takes ≈ 11.5 nanoseconds with both implementations on an AMD EPYC 7413 CPU. Not only is the difference between the two implementations negligible, both are negligible relative to the network cost of sending a single message, which is at least 500 nanoseconds.
Table 1 shows the message rate determined by the OSU MPI Benchmarks 7.0.1 for three different builds of MPICH: the latest Intel MPI and MPICH development versions built with UCX using the MPICH ABI and the standard ABI prototype, with and without Mukautuva. We see that adding the indirection from Mukautuva has a noticeable impact, but it is likely acceptable as a worst-case implementation of the standard ABI.

Mukautuva
Mukautuva [22] ("Adaptable" in Finnish) was created both as an ABI compatibility layer and as a way to prototype the ABI proposal being developed for the MPI Forum. It represents a worst-case scenario for implementing the standard ABI, namely the one in which implementers refuse to support it.
Mukautuva (MUK) consists of two shared libraries. The first library provides the MPI interface symbols. The second library is implementation-specific: its WRAP functions call the underlying implementation, with the appropriate conversion of handles and constants.
The vast majority of MPI features can be translated from one ABI to another with trivial overhead. The exceptions to this come in two forms: first, when callbacks are involved, and second, when vectors of handles are required. For callbacks, MUK must translate to IMPL handles to call IMPL functions, but then translate IMPL handles back to MUK handles, because the callback functions compiled as user code utilize the MUK ABI. The callback interfaces do not always make this easy, but it can be done in all cases, using methods described in the README.md. The situation with vector arguments is similar to [23], where vectors of datatype handles must be converted from one ABI to another, and freed upon completion, which is tricky in the case of nonblocking alltoallw operations. For these cases, like with callbacks, we use a map, currently implemented with std::map from the C++ standard library, to associate a temporary state with a handle. Callback function trampolines or request completion operations look up the temporary state associated with handles when needed. The worst-case overhead will arise when the user has initiated a nonblocking alltoallw operation, followed by a large number of nonblocking point-to-point operations to be completed via MPI_Testall, for example. In this case, every call to MPI_Testall will look up every request in the map associated with nonblocking alltoallw operations. This is not currently optimized, due to the low probability of such a scenario in real applications.
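The association between a request and its temporary state can be sketched in C. MUK itself uses std::map; the fragment below swaps in a singly linked list purely to keep the example short and in one language, and all names in it are hypothetical. It is not thread-safe.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Temporary state kept alive until a nonblocking operation completes,
 * e.g. a converted array of datatype handles. */
typedef struct temp_state {
    uintptr_t request;          /* key: the request handle value        */
    void *converted_types;      /* heap buffer to free upon completion  */
    struct temp_state *next;
} temp_state;

static temp_state *pending = NULL;

/* Associates buf with a request. Returns 0 on success, -1 if allocation fails. */
int state_attach(uintptr_t request, void *buf)
{
    temp_state *s = malloc(sizeof *s);
    if (s == NULL)
        return -1;
    s->request = request;
    s->converted_types = buf;
    s->next = pending;
    pending = s;
    return 0;
}

/* Called for each completed request. Frees the associated state if any.
 * Returns 1 if state was found and released, 0 otherwise. */
int state_release(uintptr_t request)
{
    for (temp_state **pp = &pending; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->request == request) {
            temp_state *s = *pp;
            *pp = s->next;
            free(s->converted_types);
            free(s);
            return 1;
        }
    }
    return 0;
}
```

The worst case described above is visible here: while any alltoallw state is pending, a completion sweep over many unrelated requests calls state_release once per request, and each call searches the container without finding anything.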
During the development of MUK, we identified flaws in the early ABI proposals as well as in MPI test suites. The MPICH test suite, for example, assumed the MPICH ABI in many places, which meant that it could not be used to test other implementations, or ABI translation layers such as Wi4MPI, MPItrampoline, and MUK. Most if not all of these issues have been resolved in the meantime.
MUK now passes all of the MPICH test suite tests except for a handful that use dynamic process management; these failures appear to be related to environmental problems that are yet to be investigated. MUK also passes all tests associated with ARMCI-MPI, the Intel MPI Benchmarks (IMB), and the OSU MPI Benchmarks (OMB). It complies with MPI-4 except for sessions, which are expected to suffer from the same issues observed in dynamic process management functionality. Calling functions before initialization or after finalization is not fully supported, and will be fixed in the future.

MPICH
While it has been demonstrated that the standard ABI can be implemented without any change to existing implementations, doing the translation inside of an MPI implementation has lower overheads. Hui Zhou has implemented support for the standard ABI in MPICH [46]. The changes consist primarily of abstracting away prior assumptions about the types of handles and callback signatures and inserting the appropriate conversions, where necessary. Most of the changes are in the interface code generator or guarded by a preprocessor token, and hence have no impact on execution time. The most expensive conversions are for datatypes and reduction operations, with a worst case that requires $O(N_{\text{predefined}})$ comparisons.

OTHER CONSIDERATIONS
A standard ABI is necessary but insufficient to provide seamless compatibility of MPI software across implementations. For example, MPI applications often require a parallel launcher, e.g., mpiexec, which is not part of the ABI, but interacts with the MPI program in non-standard ways, such as through environment variables.
There are at least two solutions for portable launching. The first is that the launcher determines the MPI shared library to be used, in which case the launcher and the library will be compatible. Another is the use of a launcher that is supported by multiple MPI implementations, such as the ones provided by popular schedulers like SLURM and PBS.
Applications also need to know what shared library to use. As libmpi.so is used by a number of implementations already, the name libmpi_abi.so is proposed for implementations of the standard ABI. Standardizing a new, descriptive name is especially important since it is expected that implementations will continue to support their existing ABIs, using the existing library name(s). It is expected that libmpi_abi.so will follow the platform-specific conventions for versioning to allow for future (hopefully backwards-compatible) changes.
Obviously, much of the MPI ABI is contained in the header file, mpi.h. The same filename will be used for the standard ABI, to ensure source compatibility, but applications must use exactly one ABI, and therefore every component of an application will need to be compiled against the same header. We expect that the standard ABI will be implemented in a header file provided by the MPI Forum that can be used with any implementation that supports the standard ABI, to ensure consistency in its definition. Implementations can provide this header in a different path from their own header, and perhaps help users with appropriate pkg-config definitions or compiler wrapper scripts, e.g., mpi_abi.pc or mpicc_abi, but these aspects of MPI are not standardized, nor are they part of the ABI.

Fortran
This paper focused on a standard C ABI for MPI, but many codes use MPI from Fortran. Fortran presents its own ABI challenges, not the least of which is that INTEGER, used for MPI handles in mpif.h and mpi.mod (and the MPI_VAL in typed handles defined by mpi_f08.mod), varies in size depending on compiler flags. Furthermore, each Fortran compiler has its own ABI and its own runtime library, in contrast to C, where it is common for C compilers to reuse the system C runtime, and thus be ABI compatible (e.g., Intel and GCC on Linux).
The current ABI proposal for Fortran follows the C one; many constants are required to be the same in both languages anyway. While Fortran handles may be too small to hold the C handle values in general, implementations can optimize for the case of predefined handles because the C constants will be representable in Fortran integers and do not require a translation table.
The overhead of translation for user-defined handles could be avoided with a new implementation of mpi_f08.mod, where MPI_VAL is INTEGER(kind=c_intptr_t), although this is a breaking change and would require a new module, which could be called mpi_f08_abi. One could also imagine a module mpi_abi that requires handles be INTEGER(kind=c_intptr_t). No new MPI Fortran interfaces or modules are currently proposed.

CONCLUSIONS
We have reviewed the current practices for MPI ABIs in the popular implementations, MPICH and Open MPI, as well as ABI abstraction layers like Wi4MPI and MPItrampoline. The motivations for standardizing an MPI ABI come from multiple sources, including the packaging and distribution of MPI applications and libraries in binary form, the use of MPI from languages other than C (or C++), and the development of implementation-agnostic MPI performance and debugging tools. The MPI ABI working group has developed a proposal for a standard MPI ABI, which satisfies all of the requirements and relies only on ISO C language features. The standard ABI has been prototyped in both MPICH and Mukautuva, and has been found to be both practical and performant. We identified issues with compatibility and portability not related to the ABI that are expected to be solved by the MPI ecosystem.
The next steps for the proposed MPI ABI are (1) it must be standardized by the MPI Forum, (2) it must be implemented by the major implementations and/or ABI abstraction layers, and (3) users of MPI must recompile against the standard ABI. Work towards 1 is underway, and this paper has provided sufficient evidence that 2 is either already done or straightforward.
We cannot predict the behavior of all MPI users, and certainly, some may be reluctant, either because they expect the MPI standard ABI to be less reliable than existing ABIs or because they expect it to break in the near future. There is obviously a large one-time cost of recompiling everything against the MPI ABI, but it is no worse than the cost of compiling everything against a new major release of Open MPI, for example. Fortunately, there is no immediate need for users to adopt the MPI ABI. It is expected that both MPICH and Open MPI will support their existing ABIs for as long as users require them, and will consider a transition to using the standard ABI natively only after there is sufficient understanding of its use across a wide range of platforms.
Regardless of how long it takes to realize the full potential of a standard ABI, we expect that it will significantly reduce the pain of using MPI in a variety of contexts, and encourage greater use of MPI in new domains.
    typedef int MPI_Datatype;
    #define MPI_CHAR ((MPI_Datatype)0x4c0

    #define OMPI_PREDEFINED_GLOBAL(type, global) ((type) ((void *) &(global)))
    ...
    typedef struct ompi_datatype_t *MPI_Datatype;
    ...
    #define MPI_CHAR OMPI_PREDEFINED_GLOBAL(MPI_Datatype, ompi_mpi_char)
    #define MPI_DOUBLE OMPI_PREDEFINED_GLOBAL(MPI_Datatype, ompi_mpi_double)
    ...
    extern struct ompi_predefined_datatype_t ompi_mpi_char;
    extern struct ompi_predefined_datatype_t ompi_mpi_double;

    /* C datatypes */
    #define MPI_DATATYPE_NULL 0
    #define MPI_BYTE 1
    #define MPI_PACKED 2
    #define MPI_CHAR 3
    #define MPI_SHORT 4
    #define MPI_INT 5
    #define MPI_LONG 6
    #define MPI_FLOAT 7
    #define MPI_DOUBLE 8

MPItrampoline uses uintptr_t internally in its ABI, and incomplete struct pointers in its public API for type safety:

    typedef struct MPItrampoline_Comm *MPI_Comm;
    typedef struct MPItrampoline_Datatype *MPI_Datatype;

    intptr_t ip; } MUK_Handle;
    typedef MUK_Handle MPI_Comm;
    // during initialization
    ...
    MUK_Comm_size = MUK_DLSYM(wrap_so_handle, "WRAP_Comm_size"); // wraps dlsym()
    ...
    int MPI_Comm_size(MPI_Comm comm, int *size)
    {
        return MUK_Comm_size(comm, size);
    }

impl-wrap.so:

    #include <mpi.h> // implementation details
    static inline MPI_Comm CONVERT_MPI_Comm(WRAP_Comm comm)
    {
        if (comm.ip == (intptr_t)MUK_COMM_WORLD) {
            return MPI_COMM_WORLD;
        } else if (comm.ip == (intptr_t)MUK_COMM_SELF) {
            return MPI_COMM_SELF;
        } else if (comm.ip == (intptr_t)MUK_COMM_NULL) {
            return MPI_COMM_NULL;
        } else {
    #ifdef MPICH
            return comm.i;
    #elif OPEN_MPI
            return comm.p;
    #else
    #error NO ABI
    #endif
        }
    }
    // success is the common case, so static inline it.
    int ERROR_CODE_IMPL_TO_MUK(int error_c);
    static inline int RETURN_CODE_IMPL_TO_MUK(int error_c)
    {
        if (error_c == 0) return 0;
        return ERROR_CODE_IMPL_TO_MUK(error_c);
    }
    int WRAP_Comm_size(WRAP_Comm comm, int *size)
    {
        MPI_Comm impl_comm = CONVERT_MPI_Comm(comm);
        int rc = IMPL_Comm_size(impl_comm, size);
        return RETURN_CODE_IMPL_TO_MUK(rc);
    }

applied mathematics program. Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development and by the Province of Ontario through the Ministry of Colleges and Universities. This work was supported partly by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.

Table 1: Message rate (8-byte messages) determined by osu_mbw_mr on an Intel i7-1165G7 CPU running Linux 5.19.0 (Ubuntu 22.04). Build options unrelated to ABI (the shared-memory performance of UCX versus OFI) have a significant impact on message rate. The MPICH dev UCX results show no difference between the MPICH ABI and the proposed standard ABI.

Applications compiled against MUK are relying on its ABI, which is a proxy for a future MPI standard ABI. At runtime, the first shared library determines which implementation will be used, and activates it via dlopen and dlsym. MPI symbols call a wrapper layer with the MUK namespace. MUK symbols are function pointers to the WRAP namespace in the implementation-specific shared library. WRAP functions call the implementation, with the appropriate conversion of handles and constants. An excerpt for the case of MPI_Comm_size follows.