Embracing Irregular Parallelism in HPC with YGM

YGM is a general-purpose asynchronous distributed computing library for C++/MPI, designed to handle the irregular data access patterns and small messages of graph algorithms and data science applications. It uses data serialization to provide an easy-to-use active message interface and message aggregation to maximize application throughput. Our design philosophy makes a tradeoff that increases network bandwidth utilization at the cost of added latency. We provide a suite of benchmarks showcasing YGM's performance. Compared to similar distributed active message benchmark implementations that do not provide message buffering, we achieve over 10x the throughput on thousands of cores at a latency cost that can be as small as 2x or as large as 100x, depending on the machine being used. For applications that can be written to be latency-tolerant, this represents a significant potential performance improvement from using YGM.


INTRODUCTION
The world of high-performance computing offers a wide variety of programming languages and libraries with the goal of enabling scientists and programmers to write highly performant distributed-memory code without excessive effort. Among these efforts, many use a partitioned global address space (PGAS) approach, including UPC [1], UPC++ [2], Chapel [3], and Co-array Fortran [4]. PGAS languages often present a user with a global view of the memory available to the execution units running on a system and rely on compilers or runtime libraries to handle the network communication necessary for these execution units to function.
In the batch-processing realm in which many of these languages are used, a user either has a large amount of data warranting distributed memory and needs to perform a global analytic requiring all or most of the data, or wants to compute many small analytics involving small subsets of data. For example, de novo genome assemblers [5] fall into the first category, while graph analysis tasks such as single-source shortest path calculations [6] and computing clustering coefficients of vertices [7] fall into the second. In these situations, the total time to solution is the most important metric, and users seek high throughput with little concern for the latencies of individual operations.
The above scenario is contrasted by some multi-user problems with large datasets in which individual users only require small portions of the data for analysis. This may occur for approximate nearest neighbor queries [8], in which each user may be an individual accessing a webpage, for instance. In situations such as this, throughput is important but cannot be increased at the expense of usable levels of latency.
In this work, we introduce YGM, our modern C++ communication library explicitly designed for high-throughput HPC applications. YGM is built around its asynchronous active message runtime, which allows processors to instruct remote processors to execute active messages using the data stored at the destination along with data sent from the origin. This runtime works in an SPMD programming style and, paired with the ability to explicitly make global pointers to objects, provides many PGAS advantages to users, including transparent communication when using our distributed data structures. A collection of data structure containers and related utilities is built on top of the asynchronous active message runtime to give users the tools to accomplish common data analysis tasks. During its development, YGM provided the flexibility to easily express faster algorithms for triangle counting at massive scales while also handling metadata for subsequent analysis tasks [9].
The main contributions of this work are:
• a synthesis of previous work on message buffering and routing [10, 11, 12] into an open-source, fully asynchronous active message runtime;
• a reimagining of distributed data structure containers designed for one-sided asynchronous active messages;
• a suite of benchmarks demonstrating the scalability of YGM, including one making use of a dataset of 12.5 billion Reddit comments.
In the remainder of this paper, we begin with a discussion of related distributed memory languages and libraries. We then discuss the design of YGM's asynchronous runtime and its containers before providing details about the implementation making YGM appropriate for high-performance analysis applications. This is followed by a presentation of the suite of benchmarks used to test YGM's performance.

RELATED WORK
A number of previous projects seek to provide languages and libraries to enable developers to easily write high performance distributed software. Many of these works make use of a partitioned global address space (PGAS) programming model, including UPC [1, 13], Coarray Fortran [4], Chapel [3], OpenSHMEM [14], and UPC++ [2, 15]. PGAS languages typically provide a programming model that resembles shared memory programming in order to be more accessible to users. PGAS languages such as these often use one-sided Remote Memory Access (RMA) primitives to access data stored remotely without involving its associated processor, as opposed to explicit message passing. DASH [16] is a C++ PGAS library which relies on the MPI-3 RMA backend for remote communication. DASH provides global pointers, but it lacks support for RPCs or active messages.
UPC++ [15] is a C++ Partitioned Global Address Space (PGAS) library for implementing distributed algorithms, developed at Lawrence Berkeley National Laboratory. UPC++ provides APIs for RPC, one-sided Remote Memory Access (RMA), remote atomics, and collectives. UPC++ enables non-contiguous RMA transfers (vector, indexed, and strided). In UPC++, RPCs support user-provided functions and lambdas; operations that involve communication are non-blocking and are managed through an API that includes futures and promises [15]; user programs have the ability to control progress of pending or ongoing communication requests and process completed requests. UPC++ uses GASNet-EX [17] for distributed messaging and supports multiple low-level protocols for different network architectures, e.g., InfiniBand Verbs (IBV), OpenFabrics Interfaces (OFI), and UDP.
While differing in its use of message passing, YGM shares many features of these PGAS languages, such as UPC's use of barriers to synchronize, with relaxed assumptions on execution order between barriers to allow maximum performance. Both also support collective for_all-type operations to iterate over all pieces of a distributed object [1]. UPC++ adds remote function invocation that serializes function arguments to be transferred to another processor [2]. UPC++ chooses to implement futures to retrieve return values from these asynchronous calls, a feature explicitly excluded from YGM to avoid allowing users to wait for a return.
Mercury [18] is an asynchronous RPC interface for HPC systems. Mercury's user-facing API exposes the semantics required for non-blocking RPC as well as for supporting large data arguments. Mercury's underlying network implementation is abstracted, which allows easy porting to different transport mechanisms.
Mochi [19] is a framework for implementing HPC solutions with diverse data processing needs and multiphase workflows. Mochi provides a methodology and tools for communication, data storage, concurrency management, and group membership. Mochi enables composition of specialized distributed data services, tailored to application needs and access patterns, from a collection of connectable modules and subservices. Mochi's C++ RPC library, Thallium, is based on Mercury.
Charm++ [20, 21], Legion [22], and Regent [23] provide task-based approaches to exploiting parallelism. In these systems, tasks are defined to operate on pieces of data. Runtime systems are then responsible for scheduling these tasks, accounting for the necessary data and the dependencies of tasks. These systems avoid making a user explicitly consider data movement, but provide a programming model that is farther from standard shared memory programming than PGAS languages. Charm++ gives an object-oriented runtime system with distributed objects that communicate through messages.
HPX is a task-based runtime system for parallel C++ programs [24]. The HPX runtime enables execution of standard C++ algorithms in distributed systems and offers efficiency through adaptive resource management.
Several parallel programming libraries have used data containers as an API construct to expose parallelism. The Berkeley Container Library (BCL) is a C++ distributed data structures / container library [25]. BCL exploits one-sided communication primitives (e.g., remote get and put operations) that can be executed using RDMA hardware. BCL offers the flexibility of using a number of distributed middleware / transport libraries for communication, e.g., MPI, GASNet-EX, and OpenSHMEM. BCL uses a high-level data serialization abstraction called ObjectContainers to allow the storage of arbitrarily complex datatypes inside BCL data structures. BCL avoids coordination between CPUs and instead relies on remote memory atomics to maintain consistency. The Standard Template Adaptive Parallel Library (STAPL) [26] is a parallel programming library compatible with the ANSI C++ Standard Template Library (STL). STAPL provides an SPMD model of parallelism and offers several different algorithms for some library routines, selecting among them adaptively at runtime. Coarray C++ and Hierarchically Tiled Arrays [27] are distributed C++ libraries, each focusing on its respective data structure. Multipol [28] and Global Arrays [29] are two earlier efforts that offer high-level data structures for implementing parallel applications.
The message aggregation strategies used in YGM are similar to those seen in biological applications for genome assembly [5, 10]. Libraries targeting more general irregular applications provide similar buffering approaches with multi-hop routing schemes to allow for further message aggregation, leading to increased scalability in many applications. Conveyors features 2-hop and 3-hop routing schemes prioritizing intranode communication for message aggregation [11]. TRAM adds message aggregation and routing to Charm++ and maps virtual routing topologies to physical networks to improve performance on Blue Gene systems [30]. These approaches use individual buffer sizes to determine when to send messages. YGM uses a slightly different strategy by placing destinations in a FIFO queue that is popped from when the total size of all of a rank's outgoing buffers reaches a threshold. YGM also adds fully asynchronous message exchanges and more fully-featured RPC semantics to improve programmability.

YGM DESIGN
In this section we discuss the design of YGM. The base of YGM is its asynchronous active message runtime, which includes its async calls to execute active messages, barrier calls to execute all pending active messages across all processors, and creation of ygm_ptr objects to act as global pointers to the local address space of a processor. On top of the asynchronous active message runtime, we have a collection of templated distributed-memory containers to aid with many of the analysis tasks we perform.

YGM Communicator
The central object in YGM is the YGM communicator.The communicator is constructed from an MPI communicator.As an alternative, the communicator can be constructed from argc and argv and use MPI_COMM_WORLD as the underlying MPI communicator by default.
In the second alternative, calls to MPI_Init and MPI_Finalize are handled by the YGM communicator in RAII style. YGM assigns each process in the communicator a rank for identification. This is done using the underlying MPI communicator, but can be replaced with other mechanisms for identifying processing elements in other communication libraries compatible with SPMD programming, such as OpenSHMEM.
3.1.1 YGM async. The main functionality of YGM is provided by the communicator's async method. An async call is used to begin execution of a function on a determined rank within the system. The syntax of an async is comm.async(rank, func, args...), resulting in an invocation of func(args...) on the processor specified by rank. In order for this active message to execute correctly, the function arguments provided by the variadic args... and a function pointer that can be understood at the destination rank must be serialized into a buffer of bytes that will be sent to its destination. Details of this process will be covered in Section 4. A simple example program using an async is given in Code 1. In this code, a YGM communicator is constructed. Then, every rank sends an async message to rank 0, instructing rank 0 to print a greeting. C++ lambdas present a natural method of specifying the function to execute remotely, as is done here.
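Code 1 itself is not reproduced in this excerpt. The calling convention it illustrates can be sketched with a minimal single-process stand-in for the communicator; mock_comm, hello_from, and the log vector below are hypothetical names for illustration, not part of the YGM API, and the "send" simply executes the lambda in place rather than over MPI.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Minimal single-process stand-in for a YGM-style communicator. It
// mimics the calling convention comm.async(rank, func, args...) by
// executing the function immediately instead of sending it over MPI.
struct mock_comm {
  std::vector<std::string> log;  // records what "rank 0" would print

  template <typename Fn, typename... Args>
  void async(int dest, Fn&& fn, Args&&... args) {
    // A real runtime would serialize fn and args and enqueue them for
    // dest; here we just invoke the function in place.
    (void)dest;
    fn(*this, std::forward<Args>(args)...);
  }

  void barrier() { /* no-op in a single process */ }
};

// The shape of the greeting example: a rank asks rank 0 to greet it.
inline void hello_from(mock_comm& comm, int sender) {
  comm.async(0, [](mock_comm& c, int from) {
    c.log.push_back("Hello, rank " + std::to_string(from) + "!");
  }, sender);
}
```

The lambda's optional leading communicator argument mirrors the behavior described below for recursive messages.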
The YGM communicator also provides an async_bcast method. This variant functions the same as the standard async but does not take a rank argument and instead executes the function at all ranks in the system. This asynchronous broadcast operation uses the same message buffers as all other messages and is routed along a tree derived from the NLNR routing scheme described in Section 4.4. This is done to avoid the O(p) work of the originating rank sending directly to all p ranks, while also limiting any broadcasted message to at most 3 hops to reach all ranks. Unlike the broadcast functionality in MPI, the async_bcast is one-sided. This operation is useful in cases where any rank seeing a large number of redundant messages can broadcast its state to either tell other ranks not to repeat this message, or to allow other ranks to view its value without an explicit request.
YGM's async calls support recursion when given appropriately crafted functions to execute. C++ lambdas are convenient for use in YGM, but they are unable to call themselves recursively. Instead of using lambdas, recursion is handled by writing functor classes. Code 2 gives a demonstration of code to compute n! by starting at rank n and successively passing a message to rank r − 1 to multiply the current factorial value by the current rank r. At the conclusion of this process, rank 0 prints n!. Although this example is an impractical way of computing n!, this same technique of recursive async calls is useful for expressing graph traversals.
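Under a single-process assumption, the recursive pattern of Code 2 can be sketched with a functor; here "sending to rank r − 1" degenerates to a direct recursive call, and result_out stands in for rank 0 printing the answer (fact_functor and factorial_via_messages are illustrative names, not YGM's).

```cpp
#include <cassert>
#include <cstdint>

// Single-process sketch of the recursive factorial pattern: in real
// YGM the functor would be sent via comm.async(r - 1, ...); here that
// "message" is just a direct recursive call.
struct fact_functor {
  void operator()(int r, std::uint64_t partial,
                  std::uint64_t& result_out) const {
    if (r == 0) {
      result_out = partial;  // rank 0 reports n!
      return;
    }
    // Multiply the running product by the current rank r, then
    // forward the message to rank r - 1.
    fact_functor{}(r - 1, partial * static_cast<std::uint64_t>(r),
                   result_out);
  }
};

inline std::uint64_t factorial_via_messages(int n) {
  std::uint64_t result = 0;
  fact_functor{}(n, 1, result);  // start the chain at rank n
  return result;
}
```

Because the functor is a named class, it can reference itself, which a capture-free lambda cannot.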
The code in Code 2 also demonstrates more intricacies of the async. YGM passes an optional parameter to every active message that is a pointer to the communicator being used. At compile time, it is determined which of func(args...) and func(ptr_to_comm, args...) is callable, and the correct version is executed at runtime. This functionality is necessary for recursive functions to access the YGM communicator to keep making recursive calls.
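This kind of compile-time signature selection can be written with std::is_invocable_v and if constexpr; the sketch below is a plausible mechanism, not YGM's actual internals (comm_t and dispatch are hypothetical names).

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

struct comm_t {};  // stands in for the YGM communicator type

// Invoke fn with a leading communicator argument if its signature
// accepts one; otherwise invoke it with the arguments alone. The
// branch is resolved at compile time via if constexpr.
template <typename Fn, typename... Args>
auto dispatch(comm_t& comm, Fn&& fn, Args&&... args) {
  if constexpr (std::is_invocable_v<Fn, comm_t&, Args...>) {
    return std::forward<Fn>(fn)(comm, std::forward<Args>(args)...);
  } else {
    static_assert(std::is_invocable_v<Fn, Args...>,
                  "function not callable with or without a comm argument");
    return std::forward<Fn>(fn)(std::forward<Args>(args)...);
  }
}
```

Only the branch that matches the function's signature is instantiated, so both styles of user function compile cleanly.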

YGM barrier.
Synchronization in YGM programs is accomplished through the communicator's barrier method. This barrier makes the following guarantees: (1) all ranks have reached the barrier before any rank continues, and (2) all async operations issued before the barrier complete before any rank progresses. The guarantee in (2) includes all recursive async calls. Individual ranks may become inactive and reactivate multiple times during the process of flushing all send buffers and processing all received messages before the full system is able to reach quiescence.
Between barriers, YGM makes very limited guarantees on the ordering of async messages. Two messages originating from a single rank and destined for a single other rank will execute in the order they are issued. This guarantee is derived from the ordering guarantees of sends on a single MPI communicator. YGM provides no other promises on the order in which async executions occur. This relaxed consistency model is chosen to allow YGM to process messages as quickly as possible. Application programmers must keep this point in mind to avoid introducing bugs arising from invalid assumptions on the order in which remote async calls will be executed.
3.1.3 YGM Pointers. YGM pointers provide a mechanism for ranks to access specific data on a remote rank when sending it an async message. This functionality differs from the global pointers of other systems such as UPC [1] and UPC++ [15], in which pointers can refer to data explicitly stored on another processor. YGM pointers point to data stored on every rank, but each rank may, and often will, have unique data stored at that location. Because YGM pointers are accessible on every rank, the creation of YGM pointers is a collective operation. YGM pointers support dereferencing through a * operator. YGM pointers are, however, not equivalent to C-style arrays. Each pointer must be created separately and points to a single item on every rank. For this reason, YGM pointers have no notion of pointer arithmetic. Code 3 gives an example of creating and accessing YGM pointers through async calls.
The example in Code 3 creates a YGM pointer pointing to a unique int stored on every rank. Ranks 0 and 1 then use async calls with these pointers to assign values to each other's integers. At the end of the execution of these asyncs, my_int will have the value 1 on rank 0 and 0 on rank 1.
YGM pointers are useful for allowing ranks to communicate about objects they want to operate on. They do not provide a way for a rank to directly access data stored remotely. Instead, sending async functions through these pointers is the mechanism by which we implement the containers we will discuss in Section 3.2. In this context, YGM pointers allow ranks to spawn async functions that operate on a partition of data held in the container on a different rank.
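One way to picture this is that a YGM pointer is a small index that is identical on every rank (because creation is collective) but resolves into rank-local storage. The sketch below models two "ranks" inside one process; mock_rank, mock_ygm_ptr, and make_ygm_ptr are illustrative names, not the library's.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each "rank" is modeled as a vector of int slots in one process;
// a pointer is an index valid on every rank because creation is
// collective, but it resolves to rank-local data.
struct mock_rank {
  std::vector<int> slots;  // rank-local storage addressed by index
};

struct mock_ygm_ptr {
  std::size_t index;  // same value on every rank

  // Dereference against a particular rank's local storage.
  int& on(mock_rank& r) const { return r.slots[index]; }
};

// Collective creation: every rank allocates a slot at the same
// position, so the returned index is valid everywhere.
inline mock_ygm_ptr make_ygm_ptr(std::vector<mock_rank*> ranks, int init) {
  std::size_t idx = ranks[0]->slots.size();
  for (mock_rank* r : ranks) r->slots.push_back(init);
  return mock_ygm_ptr{idx};
}
```

Note that there is no pointer arithmetic: each created pointer names exactly one item per rank.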
3.1.4 Fire-and-Forget Asynchronous Execution. YGM uses a model of computation we call fire-and-forget. In this paradigm, remote functions have no natural return values. PGAS languages have a notion of Remote Memory Access in which a processor can perform a get to obtain a value from the memory of another processor. In instances where the hardware is not able to hide the latency of this remote access and the programmer does not explicitly overlap this communication with other computation, this type of remote access leads to a stall on the system.
UPC++ includes future objects that give programmers a handle to test the completion of a remote function invocation and obtain the result of the computation. Testing for completion in a loop is a more obvious potential bottleneck to a programmer than stalling on a remote memory access, but it is still a potential pitfall when writing high performance code. In YGM, all remote functions are invoked through an async call. In the case that a return value is desired, this is handled by performing an async that computes remotely and then spawns an additional async to return the value to the original process. Upon receiving this return message, the original process can resume its computation. This removes a simple avenue for a processor to stall, leaving severe load imbalance leading to work starvation as the remaining way for work to stall.

YGM Containers
YGM includes a collection of data containers designed specifically to perform well within YGM's asynchronous runtime. Inspired loosely by C++'s Standard Template Library (STL), these containers provide improved programmability by allowing developers to consider an algorithm as the operations that need to be performed on the data stored in a container, without concern for where that data is located or how to gain access to it. Most containers feature operations that fall into the classes of for_all operations and async_visit operations.
Both for_all and async_visit operations expect a function as a primary argument, similar to the YGM communicator operation async. The passed function signature must match the contents of the container. Value store containers storing value_type objects expect these functions to address objects with the syntax [](value_type &data_item){}. Key-value store objects expect these functions instead to support separate key_type (which must be immutable) and value_type arguments with the syntax [](key_type key, value_type &value){}. Although for_all and async_visit operations agree as to how contained objects are addressed by functions, the interfaces are subtly different and support additional optional features.
The for_all is a family of collective operations that induce ranks to iteratively apply a function to all locally-held data. Functions passed to the for_all interface do not support additional variadic parameters, and so only support the arguments described in the prior paragraph. However, these functions are stored and executed locally on each rank, and so can capture objects in rank-local scope. The syntax for using a for_all is given in Code 4.
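On a single rank, a for_all reduces to applying the passed function to each locally held element, with rank-local captures available; the local sketch below illustrates that shape (local_for_all and sum_local are hypothetical stand-ins, not YGM container methods).

```cpp
#include <cassert>
#include <vector>

// Local sketch of what a for_all does on one rank: apply the passed
// function to every locally held element. Because the function runs
// locally, it may capture rank-local state (here, a running sum).
template <typename T, typename Fn>
void local_for_all(std::vector<T>& local_data, Fn fn) {
  for (T& item : local_data) fn(item);
}

inline long sum_local(std::vector<long>& local_data) {
  long total = 0;  // rank-local capture target
  local_for_all(local_data, [&total](long& x) { total += x; });
  return total;
}
```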
A for_all includes a YGM barrier to ensure all previous async functions have been executed before the for_all begins, but otherwise does not inherently involve any communication. The functions executed may, however, induce communication. A for_all is often used to begin a phase of computation after data has been assembled into a relevant container.
The async_visit operations provide a mechanism for executing a function at a particular piece of data stored within a container. YGM handles the sending of an appropriately crafted YGM communicator async call to perform the function on the correct data, freeing the user to consider the operations which need to be performed, rather than how to access a certain piece of data. Relying on the YGM communicator means these operations automatically receive the performance benefits of YGM's message buffering and routing functionality. As we will discuss, not all containers naturally support an async_visit type of operation, and others naturally support a number of variations. Lambdas passed to async_visit operations cannot capture, but do accept additional variadic arguments that are packaged with the message and used by the receiver, similar to async. Also like async, async_visit operations support an optional parameter that is a YGM pointer to the container itself. This is necessary for recursive async_visit operations that need to spawn additional messages to visit other container elements, such as matrix multiplication. The general syntax for using an async_visit operation is shown in Code 5.
YGM's containers are not designed to adhere to the standards or functionality of associated C++ Standard Template Library containers. When appropriate, inspiration is drawn from the STL for functionality of YGM containers, but the top priority is providing expressive tools that perform well within YGM's asynchronous active message computing model. YGM containers also support conventional query functions, such as a global size function. Furthermore, key-value containers can accept default_value constructor arguments that instruct the container how to create values for keys that have not been explicitly set by the user.
Next, we will discuss some of the specific features of individual containers.

YGM Bag.
The ygm::bag is a dynamic, unordered, and partitioned collection of values. We observed that for many use cases the classic std::vector<T> was used as a holding 'bag' where order was not relevant to the application. Due to the inefficiencies in maintaining a dynamically ordered container on parallel and distributed systems, we have focused on this unordered bag abstraction. As new values are inserted into the bag, they are load balanced across partitions. The bag abstraction does not distinguish between individual elements, so ygm::bag does not support any async_visit functions; bag items are instead only accessible through the for_all interface.
3.2.2 YGM Set. The ygm::set and ygm::multiset support behavior similar to std::set and std::multiset in a distributed datastore. Both containers support the value store version of the for_all interface, which iterates over all set contents on all ranks. async_insert performs a distributed insert, placing the inserted object in local memory on some rank if it is not already contained in the set. async_erase similarly removes an element from the set if it is stored on some rank. async_exe_if_contains and async_exe_if_missing execute a visitor function with variadic arguments if the passed key object is or is not contained in the set, respectively. async_insert_exe_if_missing and async_insert_exe_if_contains are similar, except they additionally attempt to insert the passed key into the set.

YGM Map.
The ygm::map and ygm::multimap support behavior similar to std::map and std::multimap in a distributed datastore. Both containers support the key-value store version of the for_all interface. async_insert attempts to insert a novel key-value pair into the map, while async_set attempts to reset the value of an existing key. async_insert_if_missing only performs an insert if the corresponding key is not already set. Similarly, async_visit_if_exists accepts a visitor function with variadic arguments, but instructs the receiver only to perform it if the corresponding key has a set value. async_insert_if_exists_else_visit attempts an insert, but executes a visitor function if the inserted key already has a set value. async_visit executes a visitor on the specified key, adding a default value if the key was not already present.

YGM Array.
Unlike other keyed data structures like ygm::set and ygm::map, ygm::array initializes a contiguous array of objects on all ranks at creation time. The number of elements in this array is fixed at creation time. The initial value in each array element is informed by the default_value, which can be set by the constructor. YGM arrays support a key-value for_all interface, as well as basic async_set and async_visit functions that are similar to the corresponding ygm::map functions. YGM arrays also support async_binary_op_update_value, which modifies a key's value with some binary operation on a passed value, such as addition or multiplication, and async_unary_op_update_value, which applies a passed unary operation such as increment.
3.2.5 YGM Disjoint Set. ygm::disjoint_set implements the popular disjoint set data structure, which YGM uses for algorithms such as connected components, as described in Section 5.5. The disjoint set structure is encoded with a map linking set elements to their parents in a forest structure. Elements linked to the same root belong to the same disjoint set within the universe of sets. async_union collapses the trees containing its two arguments into a single tree following a zigzag algorithm from [31], a variant of Rem's algorithm which has been shown to empirically perform well [32]. Interleaving two tree traversals naturally fits the asynchronous computing model in YGM. Additionally, we are able to cache frequently-accessed nodes in the forest underlying the disjoint set, as the tree traversals are tolerant to stale data. all_compress simplifies the forest structure of the data by getting all elements to point to their roots through a pointer jumping approach that takes a number of steps logarithmic in the height of each tree. The YGM disjoint set also supports a key-value for_all interface, where the key and value will be the held objects and their parents in their local trees, respectively. num_sets returns the number of trees in the forest, i.e., the number of disjoint sets specified by the data structure.
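The union and compress operations described above can be sketched sequentially; the code below is a single-process illustration of a Rem-style splicing union and a pointer-jumping compress, not the distributed, caching implementation the paper describes (mock_disjoint_set is a hypothetical name, and letting smaller-id roots win is an arbitrary tie-breaking choice).

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>

// Sequential sketch of the disjoint-set forest: a map from element to
// parent, a Rem-style union interleaving two upward traversals, and a
// pointer-jumping compress.
struct mock_disjoint_set {
  std::unordered_map<int, int> parent;  // element -> parent in the forest

  // Look up an element's parent, inserting it as its own root if new.
  int parent_of(int x) {
    auto it = parent.find(x);
    if (it == parent.end()) { parent[x] = x; return x; }
    return it->second;
  }

  // Rem-style union: walk both elements upward simultaneously,
  // splicing the tree with the larger parent id toward the other.
  void union_sets(int a, int b) {
    int pa = parent_of(a), pb = parent_of(b);
    while (pa != pb) {
      if (pa < pb) { std::swap(a, b); std::swap(pa, pb); }
      parent[a] = pb;        // splice a toward the smaller-id tree
      if (pa == a) return;   // a was a root; the trees are now joined
      a = pa;
      pa = parent_of(a);
    }
  }

  int find_root(int x) {
    while (parent_of(x) != x) x = parent_of(x);
    return x;
  }

  // Pointer jumping: repeatedly replace each parent with the
  // grandparent until every element points directly at its root.
  void all_compress() {
    bool changed = true;
    while (changed) {
      changed = false;
      for (auto& kv : parent) {
        int gp = parent.at(kv.second);  // grandparent (always present)
        if (gp != kv.second) { kv.second = gp; changed = true; }
      }
    }
  }
};
```

The interleaved upward walk in union_sets is what maps naturally onto chains of asynchronous messages in the distributed setting.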

Requirements for Stored Types.
In all YGM containers except for the bag and array, the rank responsible for storing an item must be computable from anywhere based only on the item itself. To accomplish this, types stored in any container other than a bag or array must be hashable. Additionally, this same set of containers currently makes internal use of std::set and std::map structures, which require types to be comparable.
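In practice, "computable from the item alone" usually means hashing: every rank can independently agree on an item's owner by reducing a hash modulo the number of ranks. A sketch under that assumption (owner_rank is a hypothetical helper using std::hash; YGM's actual partitioning function may differ):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Any rank can locate an item's owner from the item alone: hash the
// item and reduce modulo the number of ranks. This determinism is
// what lets sets and maps route asyncs without a directory lookup.
template <typename T>
int owner_rank(const T& item, int num_ranks) {
  return static_cast<int>(std::hash<T>{}(item) %
                          static_cast<std::size_t>(num_ranks));
}
```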

Best Practices
YGM presents users with the task of managing parallel applications with high degrees of asynchronicity, allowing for potential misuses. Common erroneous application behavior stems from users assuming an implicit ordering of messages or chains of messages. Many YGM applications are written as an iteration over data (potentially in a YGM container), where the iteration spawns the asynchronous messages necessary for the computation, with YGM barriers to ensure all messages are processed before the next iteration or phase of computation. With this design, barriers are meant as a heavyweight synchronization point. It is important for any intra-barrier race conditions to be benign. In the case that data is being read and gathered by messages at the same time it is being updated by other messages, it is important that algorithms are designed to be tolerant of stale data.

YGM IMPLEMENTATION
In this section, we will describe some of the implementation details of YGM. This includes how function arguments and function pointers are packaged for sending, the possible routes messages take to open up more possibilities for message aggregation, and how function arguments are reconstructed upon reaching their destination.

Data Serialization
Before an async can be sent across the network, the function arguments must be serialized into a byte stream. This task is accomplished by the cereal [33] library. Cereal is a C++ serialization library with native support for C++ built-in types and STL containers. For custom classes, a single extra serialize method must be added to specify the member variables that need to be serialized and deserialized. This serialization step allows YGM to handle datatypes not natively supported in MPI, such as strings.
When a message reaches its destination, the contents of the function arguments must be reconstructed. Unfortunately, when the message arrives, the destination rank only has a function pointer and a collection of bytes. To reassemble the function arguments, the sending rank wraps the user-provided function in another function that deserializes the byte stream and then passes the reconstructed arguments to the user's function. This wrapper function is the function pointer that gets sent to the destination.
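This wrapper mechanism can be sketched in plain C++: the "function pointer" that travels is a trampoline that knows the argument types and deserializes before calling the user's function. The sketch below substitutes a trivial byte-copy for cereal and handles a single trivially copyable argument; wire_message and pack are hypothetical names.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>
#include <vector>

// The "message" that travels: a trampoline function pointer plus the
// serialized argument bytes. The receiver knows nothing about the
// argument types; only the trampoline does.
struct wire_message {
  void (*trampoline)(const std::vector<unsigned char>&,
                     int& /* receiver-side state */);
  std::vector<unsigned char> bytes;
};

// Build a message for a user function taking one trivially copyable
// argument. The capture-free trampoline is instantiated per
// (UserFn, Arg) pair at compile time, so the receiver can call it blind.
template <auto UserFn, typename Arg>
wire_message pack(const Arg& arg) {
  static_assert(std::is_trivially_copyable_v<Arg>,
                "sketch only supports byte-copyable arguments");
  wire_message m;
  m.bytes.resize(sizeof(Arg));
  std::memcpy(m.bytes.data(), &arg, sizeof(Arg));  // "serialize"
  m.trampoline = [](const std::vector<unsigned char>& b, int& state) {
    Arg a;
    std::memcpy(&a, b.data(), sizeof(Arg));        // "deserialize"
    UserFn(a, state);                              // call the user's function
  };
  return m;
}

// A sample user function: add the sent value into receiver-side state.
inline void add_to_state(int x, int& state) { state += x; }
```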

Function Pointers
To avoid manually registering every active message lambda or functor, and to simplify the overall user experience, YGM employs techniques at both compile time and static variable initialization time to build an index of all active messages for an executable. This list is generated at static initialization time in the same order on all SPMD MPI ranks and therefore can be used as an index when referencing active messages. The actual data packed into send buffers when async enqueues an active message is this message index as an integer, followed by the serialized contents of the function's arguments.
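A stable cross-rank index can be produced by registering handlers into a global table during static initialization; because every SPMD rank runs the same executable, the registration order, and hence each index, matches across processes. The sketch below shows the idea with hypothetical names (handler_table, register_handler); it is an assumed mechanism, not YGM's exact one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using handler_t = int (*)(int);

// Global table of message handlers, filled during static
// initialization. Since all ranks run the same binary, initializers
// run in the same order in every process, so an index into this
// table identifies the same handler everywhere.
inline std::vector<handler_t>& handler_table() {
  static std::vector<handler_t> table;
  return table;
}

// Registration returns the handler's stable index.
inline std::size_t register_handler(handler_t h) {
  handler_table().push_back(h);
  return handler_table().size() - 1;
}

inline int double_it(int x) { return 2 * x; }
inline int negate_it(int x) { return -x; }

// Registration happens at static-initialization time.
inline const std::size_t kDoubleIdx = register_handler(double_it);
inline const std::size_t kNegateIdx = register_handler(negate_it);
```

On the wire, only the integer index would be sent; the receiver looks the handler up in its identical table.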

Message Buffering
Before sending data across the network, YGM messages are buffered to reduce the number of MPI calls and increase the size of messages. Messages are sent using MPI two-sided communication methods (MPI_Isend/MPI_Irecv) with wildcard sources in the MPI_Irecv operations. Buffering techniques such as this have been shown to dramatically increase the effective bandwidth seen by applications [10].
YGM creates buffers on-demand as a rank encounters destinations it must send to. Ranks may not have to send to all ranks due to communication patterns in the application being implemented or because of YGM's routing, as we will discuss in Section 4.4.
Previous works involving routing schemes have determined when to send individual buffers based on a buffer's size [10]. YGM creates a FIFO queue of destinations to send to. A destination is added to this queue when a new message buffer is created. YGM begins flushing buffers based on the total size of all buffers. Once this total size reaches a threshold, a destination is popped from the FIFO queue of destinations, and the associated message buffer is sent. This continues until the total buffer size drops below the threshold. The next time a message gets queued for one of these destinations just sent to, a new buffer must be created, and the destination is again added to the FIFO queue.
This buffering strategy is designed to reduce the message latency for destinations a rank rarely communicates with.By only considering an individual buffer's size, some buffers may contain very stale data with no possibility of being sent until a barrier is reached.By incorporating the FIFO queue of ranks to send to, small buffers will still be sent eventually, even without reaching a barrier.
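A minimal model of this policy might look like the following; buffer_manager and its members are illustrative names, and the real implementation hands flushed buffers to MPI_Isend rather than recording them in a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <vector>

// Per-destination buffers created on demand, a FIFO queue of destinations
// (ordered by buffer creation), and flushing driven by the TOTAL size of
// all buffers rather than the size of any single buffer.
struct buffer_manager {
  std::map<int, std::vector<char>> buffers;  // destination rank -> bytes
  std::deque<int> fifo;                      // order buffers were created
  std::size_t total_bytes = 0;
  std::size_t threshold;
  std::vector<int> sends;  // record of flushed destinations (stand-in for MPI_Isend)

  explicit buffer_manager(std::size_t t) : threshold(t) {}

  void enqueue(int dest, const std::vector<char> &msg) {
    auto it = buffers.find(dest);
    if (it == buffers.end()) {
      // New buffer: the destination joins the back of the FIFO queue.
      it = buffers.emplace(dest, std::vector<char>{}).first;
      fifo.push_back(dest);
    }
    it->second.insert(it->second.end(), msg.begin(), msg.end());
    total_bytes += msg.size();
    // Flush the oldest buffers until the total size drops below threshold,
    // so rarely used (and therefore old) buffers are not starved.
    while (total_bytes >= threshold && !fifo.empty()) {
      int oldest = fifo.front();
      fifo.pop_front();
      total_bytes -= buffers[oldest].size();
      sends.push_back(oldest);
      buffers.erase(oldest);
    }
  }
};
```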

Message Routing
YGM provides three different routing schemes to allow applications to further aggregate messages and improve scalability. These routing schemes are (a) no routing, which sends messages directly to the destination, (b) Node-Remote (NR) routing, which sends messages in at most 2 hops to their destination, and (c) Node-Local Node-Remote (NLNR) routing, which sends messages in at most 3 hops to their destinations. NR and NLNR are named to indicate where node-level message bundling and unbundling occurs; NR has messages from a rank unbundled and distributed to ranks on the remote node, while NLNR adds additional bundling on a sending rank's local node. Routing schemes similar to these have proven effective in similar libraries at large scales [11] [12]. In both approaches, the additional message aggregation works to combat the decrease in average message bundle size as the number of compute nodes increases. Both of these techniques rely on the assumption that on-node communication is fast relative to communication over the network.

4.4.1 NR Routing. In NR routing, every message takes at most 2 hops to reach its destination. Every rank chooses a single rank on every other compute node to send its messages to. For rank r and compute node n, call this designated communication partner p(r, n). When rank r sends a message to rank s on compute node n, the message is sent to p(r, n) on node n first. Then p(r, n) forwards the message to s. In this way, rank r is able to send all of its messages destined for compute node n to a single rank, rather than splitting them up into several smaller sends to each rank on node n.

4.4.2 NLNR Routing. NLNR routing takes the ideas of NR routing a step further to expose more aggregation potential. In NLNR routing, for every pair of compute nodes m and n, a pair of ranks, p(m, n) on m and p(n, m) on n, are identified as the ranks to handle all traffic from ranks on m to ranks on n. A message starting at rank r on node m and destined for rank s on node n will be sent to p(m, n), forwarded to p(n, m), and then finally arrive at rank s. In this case, all messages from node m to node n can be aggregated into even larger messages.
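The hop sequences under the three schemes can be sketched with simple rank arithmetic, assuming a block assignment of ranks_per_node consecutive ranks per node; the partner-selection functions below (partner_on, pair_rank) are illustrative stand-ins for YGM's internal choices.

```cpp
#include <cassert>
#include <vector>

struct topo {
  int ranks_per_node;

  int node_of(int rank) const { return rank / ranks_per_node; }

  // NR: the single rank on node `node` that rank `rank` sends all of its
  // traffic for that node to (one possible deterministic choice).
  int partner_on(int rank, int node) const {
    return node * ranks_per_node + rank % ranks_per_node;
  }

  // NLNR: the rank on `on_node` handling all traffic to/from `other_node`.
  int pair_rank(int on_node, int other_node) const {
    return on_node * ranks_per_node + other_node % ranks_per_node;
  }

  // Sequence of ranks a message visits from src to dst under each scheme.
  std::vector<int> route_none(int src, int dst) const { return {src, dst}; }

  std::vector<int> route_nr(int src, int dst) const {
    if (node_of(src) == node_of(dst)) return {src, dst};
    int hop = partner_on(src, node_of(dst));  // unbundled on the remote node
    if (hop == dst) return {src, dst};
    return {src, hop, dst};  // at most 2 hops
  }

  std::vector<int> route_nlnr(int src, int dst) const {
    if (node_of(src) == node_of(dst)) return {src, dst};
    std::vector<int> path{src};
    int local = pair_rank(node_of(src), node_of(dst));   // bundle locally
    int remote = pair_rank(node_of(dst), node_of(src));  // unbundle remotely
    if (local != path.back()) path.push_back(local);
    if (remote != path.back()) path.push_back(remote);
    if (dst != path.back()) path.push_back(dst);
    return path;  // at most 3 hops
  }
};
```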

EXPERIMENTS
We present a collection of benchmarks of YGM, demonstrating its performance characteristics. For simple benchmarks, we provide comparisons against MPI-only, UPC++, and conveyors [11] implementations.

Testbeds
These experiments were performed on the Ruby and Lassen clusters at LLNL. Ruby features compute nodes with dual Intel Xeon CLX-8276L processors totaling 56 cores per node, paired with 192 GB of DRAM, and connected with a Cornelis Networks Omni-Path interconnect. On Ruby, MVAPICH2 version 2.3.7 is used for MPI along with GCC 10.2.1. Lassen compute nodes consist of dual-socket IBM Power9 CPUs, giving 40 usable cores, 4 Nvidia Tesla V100 GPUs (unused in these experiments), and 256 GB of DRAM, and are connected with a Mellanox 100 Gb/s EDR Infiniband network. On Lassen, Spectrum MPI and GCC 8.3.1 are used.
For comparisons against UPC++, we use UPC++ revision 2022.9.0 (the latest at the time of writing) with GASNet version 2022.3.0. We implement the microbenchmarks using the RPC interface, which incurs lower overhead compared to the future-promise RPC API. We configured UPC++ to use the Infiniband Verbs (IBV) conduit (transport back-end) on Lassen and the OpenFabrics Interfaces (OFI) conduit on Ruby. The maximum number of processes per node supported by the OFI conduit is less than the number of available cores, giving a maximum of 18 processes per node on Ruby. On Lassen, GASNet is configured to use all four Host Channel Adapters (HCAs) available on each node.
For comparisons against conveyors, we use version 0.6.0 of conveyors configured to use its MPI backend. MPI was chosen to provide the most direct comparison possible with YGM on Ruby and Lassen, which already have optimized MPI installations. The conveyors implementation was configured to use its auto-tuning feature to establish its optimal message size.

Around The World
We start off by demonstrating the least-favorable experiment for YGM due to its inherent latency sensitivity. Our around-the-world benchmark is designed to circumnavigate the world of processors by having each rank send a message to the subsequent rank when it receives a message. This process begins with rank 0 sending an initial message to rank 1 and continues until the single in-flight message visits every rank num_trips times. This benchmark is a test of the cumulative latency of sending messages, as there is only a single message active at any time. Because of the block assignment of ranks to compute nodes, messages traverse all ranks in a compute node before making a single hop to the next compute node.
We compare our YGM implementation of around-the-world to MPI-only and UPC++ implementations. All three setups use the same assignment of ranks to cores, ensuring the same number of on-node and off-node messages are sent in all three. On Ruby, we test all three implementations with 18 processes per node, the maximum available through UPC++. On Lassen, we used all of the 40 available ranks per node. Results from this experiment are given in Figure 1.
We see from these results that YGM passes a single message among its ranks more slowly than UPC++ or MPI. This result is to be expected, as message buffering strategies inherently increase message latency in order to improve overall throughput. This single-message case is a worst-case scenario for YGM. UPC++ is able to perform much closer to MPI, with relatively small overheads necessary for its remote procedure calls. On the Ruby cluster, YGM achieves at least roughly 45% of MPI's hop rate and at least roughly 60% of UPC++'s. This is significantly better than the performance of YGM seen on Lassen, where the rate at which YGM is able to pass the message drops below 5000 hops per second at high node counts, a single-digit percentage of the MPI and UPC++ performance. The discrepancy between these machines is most likely due to differences in networking hardware.

Histogram
Our histogram benchmark is based on the histogram app in bale [34]. In this benchmark, a large table of integers is initialized in distributed memory. Each process then generates a collection of indices that are to have their associated values incremented in the distributed table. The timed portion of the benchmark then performs all of these updates. Performance is measured in terms of billions of global updates performed per second. The histogram benchmark is set up as a weak scaling study, in which the table size and number of total updates scale with the number of processes.
This benchmark is very similar to the RandomAccess benchmark of the High Performance Computing Challenge (HPCC) [35]. A large table is set up in each benchmark, with random updates originating from all cores on the system that may need to access the memory of any other core. The main difference between the two is that RandomAccess limits the number of updates queued at any time to not exceed 1024 for any individual process in order to limit data locality. In distributed systems, this limit on pending updates limits performance [36]. Optimized RandomAccess algorithms use message routing schemes to aggregate messages as much as possible under these constraints [37] [38]. For our histogram benchmark, we do not impose the same limit on the number of pending updates. We do not do any partial reductions or sorting of updates beyond buffering them to be sent to the correct destination rank, limiting the data locality we are exploiting while also allowing us to run under conditions closer to ideal for the networks in our systems.
In YGM, the histogram is implemented using a ygm::array for the table, and updates are performed using an async_visit on each of the indices being updated. Code 6 provides YGM code to create the ygm::array and perform the updates for the histogram benchmark. This code assumes a YGM communicator named world exists and the indices corresponding to the values to update have been stored in indices.
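As a rough single-process model of the access pattern (not the ygm::array API itself), each index can be assigned to an owner rank that holds the corresponding slice of the table; all names below are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct dist_histogram {
  int nranks;
  std::vector<std::vector<long>> local;  // one table slice per simulated rank

  dist_histogram(int ranks, long table_size) : nranks(ranks), local(ranks) {
    for (int r = 0; r < ranks; ++r)
      local[r].assign((table_size + ranks - 1) / ranks, 0);
  }

  // Cyclic owner-computes partitioning of the table.
  int owner(long idx) const { return static_cast<int>(idx % nranks); }

  // In YGM this would be an async_visit landing on the owning rank.
  void update(long idx) { ++local[owner(idx)][idx / nranks]; }

  long count(long idx) const { return local[owner(idx)][idx / nranks]; }
};
```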
Comparisons of YGM to UPC++ and conveyors implementations of the histogram benchmark are given in Figure 2 under the assumption that indices for updates are generated uniformly. On Ruby, we show results with UPC++ using 18 ranks per node, the most we were able to use. On Lassen, we show UPC++ results with the number of ranks that gave the best performance at every number of compute nodes. In Figure 3, we show how the optimal number of ranks changes with the number of compute nodes.
Comparing to UPC++, we see that YGM's message buffering gives significantly improved throughput across all numbers of compute nodes on both machines tested. On Ruby at 128 compute nodes, YGM achieves 11.3x the rate of updates of the UPC++ implementation when each uses 18 ranks per node. Increasing the number of ranks per node for YGM from 18 to 56 gives a further 2.14x performance improvement. This is slightly less than the 3.11x theoretical maximum improvement for increasing the ranks used per node, but this value is consistent across all numbers of compute nodes used, suggesting we are likely reaching a per-node hardware limit on pushing messages to the network or on memory bandwidth. The code to implement this benchmark is very simple in YGM, as we have seen, but still provides high performance due to YGM's containers sitting atop the YGM communicator's async, which handles message aggregation and routing invisibly to the user.
Comparing to conveyors, we can see the effects of some of the overheads of YGM. On Ruby, YGM achieved 65% of the performance of conveyors, and on Lassen, YGM was 13.7% faster than conveyors at 128 nodes. Conveyors is very similar to YGM in terms of routing and buffering, but it lacks serialization capabilities and can only send messages of fixed sizes when using non-elastic conveyors. Additionally, YGM must send an additional 2 bytes with every message to determine the function to execute on the remote rank. This overhead is significant in the histogram benchmark, where message payloads are only 8 bytes.
In addition to the above comparison to conveyors, we ran the histogram benchmark with elastic conveyors, which allow conveyors to send variable-length messages. This is still short of the full functionality of an active message runtime but makes conveyors behave more similarly to YGM. Under these circumstances, we see YGM achieve 84-104% of the performance of the elastic conveyors on Ruby. We leave the non-elastic conveyors as the main comparison point to YGM, as this gives the fastest message routing and aggregation scheme, but these results suggest much of the difference is due to optimizations in conveyors when messages are of fixed size.
One of YGM's goals is to be capable of handling the imbalanced and irregular communication patterns often experienced in graph and data science applications. To test this, we also perform our histogram study on Ruby using update indices coming from an R-MAT graph generator [39]. These results are shown in Figure 4 using the full 56 ranks per node on Ruby. The R-MAT generator is biased to produce certain indices more often than others. This produces two competing effects. On a local scale, the rank responsible for storing and updating a value that is seen more frequently is likely to have that value in its cache, leading to potential performance improvements. On a global scale, all ranks are producing indices with the same bias, leading to some ranks having significantly more messages to receive and updates to perform, which may limit scalability. In this case, we see caching effects improve performance until 64 nodes are used, at which point the load imbalance begins to impact scalability.

Vertex Neighborhood Embedding
To benchmark the performance of YGM in a vertex neighborhood embedding pipeline, we used sketches available as part of the krowkee library [40]. In this setup, distributed YGM containers are created to hold a sketch object representing a vertex's neighborhood for every vertex in a graph. We then iterate over a distributed collection of edges, adding v to the sketch of u for every edge (u, v) using an async_visit operation on the distributed container of sketches. These edges are partitioned randomly, so there is no assumption about where edge (u, v) is generated relative to the location of the sketch of u. We test the performance of the ygm::map and ygm::array containers for holding sketches of graphs generated from Erdős-Rényi and R-MAT graph models.
In the experiments performed, we use a graph with 2^26 vertices per compute node. Each rank generates 10 million edges locally to be added to the distributed sketches. The sketches are configured to provide an embedding of vertex neighborhoods into 8 dimensions. Figure 5 shows the results of these vertex neighborhood embedding experiments on Ruby.
We see performance for the ygm::array is approximately 2-4x that of the ygm::map for this benchmark, due to the slow performance of the ygm::map's search tree for finding a sketch in local memory relative to the direct indexing of the ygm::array. Both containers show better scaling behavior for the Erdős-Rényi graph due to the load imbalance caused by the R-MAT graphs at large scales. This imbalance affects the scaling of the ygm::array more noticeably when no routing is done.
To test the effects of routing on the scaling of embeddings, we use a ygm::array of sketches for an R-MAT graph under each of the routing schemes provided by YGM. This test configuration is chosen as it shows the largest degradation in scalability in Figure 5a. Figure 5b shows scalability is greatly improved when using NR or NLNR routing over no routing at large numbers of compute nodes. At 128 compute nodes or above, running without routing gives the worst performance. Scaling from 64 to 512 compute nodes, no routing exhibits a parallel efficiency of just 26%, while NR routing has an efficiency of 61%. Over this same increase in compute nodes, NLNR routing shows a parallel efficiency of 98%. Below 64 nodes, NLNR routing exhibits superlinear speedup because it does not dedicate some cores to off-node communication until the number of compute nodes is at least as large as the number of cores on a node.

Connected Components
We implement two versions of connected components algorithms in YGM. The first is a simple iterative label propagation algorithm in which every active vertex forwards its current label to its neighbors, and every vertex that receives an improved label activates itself for the next round. This process can take a number of iterations that is proportional to a graph's diameter. Improvements to this approach include shortcutting to allow vertices to get labels from their grandparents [41] or using approaches based on the Shiloach-Vishkin algorithm [42], both of which reduce the number of iterations to be logarithmic in the number of vertices, but neither is done in this simple benchmark.
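A single-process sketch of the label propagation loop follows; in the YGM version the label exchanges are asynchronous messages rather than reads of a shared array, but the convergence behavior is the same.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Repeatedly sweep the edge list, propagating the smaller label across each
// edge, until a sweep changes nothing. The number of sweeps grows with the
// graph's diameter, mirroring the iteration count discussed above.
inline std::vector<int> label_propagation(
    int n, const std::vector<std::pair<int, int>> &edges) {
  std::vector<int> label(n);
  for (int i = 0; i < n; ++i) label[i] = i;  // each vertex starts as its own label
  bool changed = true;
  while (changed) {
    changed = false;
    for (auto [u, v] : edges) {
      int m = std::min(label[u], label[v]);
      if (label[u] != m) { label[u] = m; changed = true; }
      if (label[v] != m) { label[v] = m; changed = true; }
    }
  }
  return label;
}
```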
Our second connected components algorithm uses YGM's disjoint set. We add all of the graph's edges to the disjoint set using async_union calls and periodically compress the trees in the forest underlying the disjoint set to prevent them from getting too tall.
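A single-process sketch of the underlying union-find structure: unite plays the role of async_union, and compress_all models the periodic tree compression; the distributed version partitions the items across ranks.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct disjoint_set {
  std::vector<int> parent;

  explicit disjoint_set(int n) : parent(n) {
    for (int i = 0; i < n; ++i) parent[i] = i;
  }

  // Find with path halving keeps trees shallow as a side effect.
  int find(int x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];
    return x;
  }

  // Corresponds to async_union on an edge (a, b).
  void unite(int a, int b) {
    a = find(a);
    b = find(b);
    if (a != b) parent[std::max(a, b)] = std::min(a, b);
  }

  // Models the periodic compression that flattens every tree, preventing
  // the forest from getting too tall between batches of unions.
  void compress_all() {
    for (int i = 0; i < static_cast<int>(parent.size()); ++i) parent[i] = find(i);
  }
};
```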
To test these algorithms, we use a dataset derived from 12.5 billion Reddit comments posted between 2005 and 2022, obtained using pushshift [43]. In this case, we build a bipartite graph connecting every user on Reddit to the pages they post comments on. This graph contains 1.05 billion vertices and 12.5 billion edges. Figure 6 gives the results of each algorithm on the Reddit dataset using YGM's NR routing. The label propagation algorithm is consistently faster than the disjoint set algorithm. Each algorithm scales well up to 256 nodes, but the disjoint set achieves a higher parallel efficiency, 74%, over the range of allocations tested, compared to label propagation's 65%. While still slower than the label propagation algorithm, this is, to the best of our knowledge, the first demonstration of a disjoint set scaling to tens of thousands of processes. Previous work uses disjoint sets in distributed memory for density-based clustering on thousands of cores [44].

String Sorting
Next, we demonstrate YGM sorting a collection of strings. The algorithm we use picks random pivots from a subsample of the data, which are used to partition the data across ranks. Then an in-memory std::sort is used to complete the sort on each rank.
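The pivot-selection and partitioning steps can be sketched as follows, assuming evenly spaced pivots drawn from a sorted subsample; pick_pivots and dest_rank are illustrative names, not YGM's API.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Choose nranks-1 evenly spaced pivots from a sorted subsample of the data.
inline std::vector<std::string> pick_pivots(std::vector<std::string> sample,
                                            int nranks) {
  std::sort(sample.begin(), sample.end());
  std::vector<std::string> pivots;
  for (int i = 1; i < nranks; ++i)
    pivots.push_back(sample[i * sample.size() / nranks]);
  return pivots;
}

// A string belongs to the rank whose pivot range contains it; after this
// all-to-all exchange, each rank finishes with an in-memory std::sort.
inline int dest_rank(const std::string &s,
                     const std::vector<std::string> &pivots) {
  return static_cast<int>(
      std::upper_bound(pivots.begin(), pivots.end(), s) - pivots.begin());
}
```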
To showcase our string sorting, we use the same 12.5 billion comment Reddit dataset used in the connected components experiments. In this case, we use the full comment text as our strings to sort. This dataset takes 3.7 TB of disk space when compressed. Timing results of sorting these comments on Ruby are given in Figure 7. This experiment requires at least 64 compute nodes to hold the entire string dataset in memory. Performance is best at low node counts without using routing. Sorting without routing stops scaling beyond 128 nodes, however, while NR routing continues scaling and shows improved performance at 256 compute nodes.

Triangle Counting
Counting triangles in large graphs has become an important benchmark for graph analytics hardware and algorithms, including in the annual GraphChallenge competition [45].
We include a triangle counting benchmark here that makes use of a degree-ordered directed graph, a common technique in triangle counting to reduce the imbalance caused by high-degree vertices [46]. We use formulas derived in [47] for the number of triangles in the non-stochastic Kronecker product of two graphs to verify our results.
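The wedge-check style of counting on a degree-ordered directed graph can be sketched as follows (using vertex ID as a stand-in for degree order); count_triangles is an illustrative single-process version, not the distributed implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Direct each undirected edge from lower to higher vertex ID (a proxy for
// degree order) and test whether each wedge (u->v, u->w) is closed by the
// edge (v, w). Assumes a simple graph with no repeated edges.
inline long count_triangles(int n,
                            const std::vector<std::pair<int, int>> &edges) {
  std::vector<std::vector<int>> out(n);
  std::set<std::pair<int, int>> edge_set;
  for (auto [u, v] : edges) {
    if (u == v) continue;
    if (u > v) std::swap(u, v);
    out[u].push_back(v);
    edge_set.insert({u, v});
  }
  long triangles = 0;
  for (int u = 0; u < n; ++u)
    for (std::size_t i = 0; i < out[u].size(); ++i)
      for (std::size_t j = i + 1; j < out[u].size(); ++j) {  // one wedge check
        int v = std::min(out[u][i], out[u][j]);
        int w = std::max(out[u][i], out[u][j]);
        if (edge_set.count({v, w})) ++triangles;
      }
  return triangles;
}
```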
Figure 8 shows the results of our weak-scaling triangle counting experiments on Ruby. For these experiments, we count triangles in the Kronecker product of two R-MAT graphs, with a resulting Kronecker graph that has 2^23 vertices per compute node. We measure performance in terms of wedge checks per second, a standard measure of the rate of work in triangle counting [48]. Once again, we see excellent scalability from YGM, with a parallel efficiency of 70% when scaling from 2 to 128 compute nodes.

CONCLUSION
We have presented YGM, an asynchronous active message library built on MPI. YGM is constructed around its message buffering and routing capabilities to improve throughput and scalability. A collection of data containers sits on top of this functionality with APIs that support the one-sided active message model of programming in YGM.
We have provided a collection of benchmark applications to demonstrate YGM's scalability. Through comparisons to UPC++, we see YGM is able to provide over 10x the throughput using thousands of cores in benchmarks involving small messages through its use of message buffering. This increased throughput comes at a latency cost which, depending on the machine being used, can be smaller than 2x or as much as 100x. We have also demonstrated that YGM's routing schemes are effective at maintaining the scalability of applications by increasing the size of message bundles sent over the network. The benchmarks presented include connected components and string sorting on a Reddit dataset with 12.5 billion comments, requiring many terabytes of DRAM to keep in memory. We also feature a disjoint set connected components experiment on this dataset, which is, to the best of our knowledge, the first time a disjoint set has been scaled to billions of items on over ten thousand cores.

Code 3: Example using YGM pointers

int my_int;
auto int_ptr = world.make_ygm_ptr(my_int);
if (world.rank() == 0) {
  world.async(1, [](auto ptr) { *ptr = 0; }, int_ptr);
} else if (world.rank() == 1) {
  world.async(0, [](auto ptr) { *ptr = 1; }, int_ptr);
}

Code 4: General for_all usage

int my_variable(1);
// if my_container is a value store
auto my_func = [&my_variable](auto &data_item) {
  std::cout << data_item << ", " << my_variable << std::endl;
};
// if my_container is a key-value store
auto my_func = [&my_variable](key_type key, value_type &value) {
  std::cout << key << ", " << value << ", " << my_variable << std::endl;
};
my_container.for_all(my_func);


Figure 1: Number of message hops per second for around-the-world implementations.

Figure 2: Number of inserts per second for histogram implementations.

Figure 3: Histogram performance of UPC++ on Lassen with varying numbers of ranks and compute nodes.

Figure 4: Histogram performance of YGM using indices from a uniform generator and an R-MAT graph generator.
Figure 5: (a) Rate of edge insertions into sketches on Ruby for ygm::map and ygm::array containers under Erdős-Rényi and R-MAT graph models with no YGM routing. (b) Rate of edge insertions into sketches on Ruby for the ygm::array container under the R-MAT graph model with various YGM routing schemes.

Figure 6: Connected components times using label propagation and disjoint sets.


Figure 7: String sorting times without routing and with NR routing.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.