Low-latency Communication in RISC-V Clusters

Low-latency inter-node communication is important in HPC clusters. In this work, we design and integrate a low-cost interconnect capable of low-latency user-level communication with open-source RISC-V processors, obviating the need for bulky and expensive network interface cards connected over PCIe. Our lean network interface is connected next to the Load/Store (LD/ST) stage of the RISC-V processor, which we modify to achieve back-to-back stores for the address range dedicated to the NI. The primitives that we examine are suitable for many-to-one communication and optimized for small messages, while offering reliable delivery and hardware-level protection using protection domains. We also describe our runtime system that presents the NI to user processes with minimal overheads. Our design achieves sub-microsecond (720 ns) user-level latency for small packet generation and transmission between adjacent FPGA nodes containing Ariane RISC-V soft-cores running at 100 MHz. We also present an analytical latency breakdown including key hardware and software components.


INTRODUCTION
High-Performance Computing touches many aspects of our everyday life, including universe and climate simulations, galaxy formation, physics theory validation and others. Low-latency, high-throughput communication is a precious element of modern high-performance clusters. However, efficient communication does not come for free. A common rule of thumb states that one 1 GHz core is needed in order to run the host software for 1 Gb/s (unidirectional) Ethernet-based traffic using sockets.
caRVnet is a custom network interconnect that we develop in order to test interesting network technologies in actual hardware testbeds [8, 11, 13]. The architecture of caRVnet revolves around the following principles for efficient communication:
• low-latency user-level transfer initiation, with complete OS bypass in order to minimize latency and CPU overheads,
• virtualization using fast hardware primitives to allow sharing the network with deterministic hardware-grade performance,
• fast hardware-based congestion management that is both more accurate and faster than TCP [7],
• shallow buffers inside the core to reduce latency and cost, with flow control for nearly lossless operation.
As shown in Fig. 1, the caRVnet Network Interface (NI) supports user-level, zero-copy Remote Direct Memory Accesses (henceforth RDMA transfers) from user memory at the source to user memory at the destination that completely bypass the kernel. caRVnet offers protected virtualization through multiple descriptor channels (software pages that are mapped to on-chip SRAM scratchpads) that can be allocated to different processes. In this way, processes can safely initiate multiple concurrent transfers without any kernel involvement. caRVnet multiplexes the payload of different transfers (possibly initiated by different users) at per-network-packet granularity, facilitated by hardware primitives implemented in the caRVnet NI, in order to improve Quality of Service, e.g. when an earlier request is blocked due to congestion. System-level protection is ensured by Protection Domain IDs (PDIDs) that are configured by a system manager and used by the host OS and the caRVnet NI endpoints.
A first implementation of this architecture is currently installed in the full-scale ExaNeSt prototype [9, 12], which comprises 12 blades, each with four (4) Quad FPGA DaughterBoards (QFDBs) [4], each with four (4) Xilinx Zynq MPSoC (Multiprocessor System-on-Chip) 64-bit ARMv8 cores. The interconnect uses a 3D torus topology over 10 Gb/s links [2]. The testbed includes 192 ARMv8 cores and runs real HPC applications using a custom MPI library [13]. We currently also use this testbed to evaluate efficient hardware congestion control for RDMA [7] while running real HPC applications together with adversarial background traffic.
We present and evaluate the next generation of the caRVnet architecture, which supports 100 Gb/s links when running at 200 MHz using a 512-bit datapath. In this generation, we designed new hardware for RDMA QoS that replaces a previous co-processor implementation, offering much lower latency for RDMA transfers and a higher processing rate for datacenter workloads. In this paper, we focus on the tight coupling of the caRVnet interconnect with RISC-V processors, in a testbed where both hardware components run at 100 MHz. In order to minimize latency, we optimize the interface between the core and the NI, the latency and the user-level access of the NI blocks, as well as the user-level library, using inline assembly specific to the RISC-V processor.
Our contributions in this paper are summarized below.
(1) We present the new hardware and the system software of the Packetizer-Mailbox building blocks of caRVnet, which can be used for low-latency many-to-one communication.
(2) We modify the LD/ST stage of the Ariane RISC-V core to optimize the connection with the caRVnet network interface, achieving a descriptor issue latency of less than 10 processor cycles.
(3) We run ping-pong tests between two nodes containing Ariane RISC-V softcores, running at just 100 MHz, while being tightly coupled with the caRVnet NI, also running at 100 MHz. In this platform we achieve a user-level latency of 720 ns, 200 ns of which are spent in the network link, including link traversal time.
(4) We present a detailed breakdown of user-level communication across two Ariane RISC-V cores.

Figure 1: User-level communication, obviating copies in intermediate buffers. The buffer size inside the NICs may vary between implementations. Even with zero-copy RDMA, the NIC fetches the entire payload of a network packet from memory before injecting it into the network (the same applies to caRVnet). At the receiving side, caRVnet writes incoming packets directly to memory or to a mailbox without extra copies, although some NICs may copy packets or even message chunks inside NIC buffers first.

The remainder of this paper is organized as follows: in Section 2, we present the caRVnet network architecture and its tight coupling with the Ariane RISC-V core. Then, in Section 3, we present the system software, including the optimized runtime for the RISC-V processor, the driver hierarchy and the user-level program used in measurements. In Section 4, we present, analyze and discuss the results we obtained in a 2-node cluster. Finally, we present related work in Section 5, and conclude in Section 6.

CARVNET ARCHITECTURE
TIGHTLY-COUPLED WITH RISC-V CORES

Our testbed consists of two FPGA boards, each running a design based on the Ariane/CVA6 64-bit RISC-V core (https://docs.openhwgroup.org/projects/cva6-user-manual/) programmed on its FPGA and running at 100 MHz. The two boards are connected in a ring topology over the caRVnet network, using a pair of High Speed Serial (HSS) links.
caRVnet is a custom packet-based network protocol that allows distributed memory access across nodes, using low-latency, user-level RDMA and Packetizer-Mailbox primitives. The RDMA engine implements reliable RDMA Write/Read, supports end-to-end retransmissions that offload the host, can generate transfer-completion notification messages at the destination, supports multi-path routing (NI transactions, such as RDMA transfer segments, may travel through different paths), and features dynamic context bookkeeping at the receiver and rate limiting at the source that can be used for hardware-level congestion control [7]. Inside the node (FPGA), the data path of caRVnet has a word width of 512 bits, and supports header-payload and header-only packet types, along with two "valid" signals to determine which kind of data (header/payload) is currently present on the data bus. The caRVnet packet header is depicted in Figure 3.
The caRVnet protocol offers reliability by using hardware timeouts and end-to-end acknowledgements, sent from the destination endpoint back to the source, in order to ensure that the packet has successfully reached its target. caRVnet packets include a header-CRC field to protect routing and other metadata information in the packet header. When the header is received at a node, the header CRC is calculated again and compared to the existing CRC field. If these match, the packet is permitted to continue on its path; otherwise the packet must be dropped because its header information is corrupted. As shown in Fig. 4, the generation of the CRC covering the packet payload takes place right before the packet enters the HSS link (caRVnet does not carry a footer field for the CRC inside the node). However, when crossing the HSS links, the packet payload is followed by a CRC word, similar to Ethernet. caRVnet performs cut-through and thus the 32-bit payload CRC is ready to be sent together with the last payload word of the packet. The Serial TX block in the figure corresponds to the high-speed serial (10 Gb/s) link of the Xilinx FPGA. We implement the serial link (including the link-level flow control) using custom logic with a one-way latency of approximately 100 ns, excluding the cross-clock latency occurring inside the elastic buffers shown in Fig. 4. At the source, we also generate the header CRC, which is then validated by every receiving transceiver before routing the packet to the next hop or local endpoint.

caRVnet Network Interface
The network packets are formed by the caRVnet Network Interface (NI), which is connected to the LD/ST stage of the Ariane core, as can be seen in Fig. 2. The main hardware primitives of the NI examined in this work are a Packetizer and a Mailbox. These serve low-latency, many-to-one communication of small messages (up to 64 bytes in our current implementation) and can be used to send messages in eager protocols [13]. The hardware primitives offer multiple channels for virtualization purposes. The channels are implemented in FPGA memories and are organized in pages, as described in the following sections, offering low deterministic latency. The NI also includes an RDMA engine, whose performance we do not evaluate in this paper.
In our tests, we evaluate user-level communication, in which processes issue network packets via the Packetizer and receive them by polling the Mailbox. Network packets are routed in the ring network via the caRVnet interconnect switch with 1-cycle cut-through latency inside the FPGA. In the following sections we describe the hardware architecture of the NI primitives.

caRVnet Virtualized Packetizer
The caRVnet Virtualized Packetizer, shown in Fig. 5, is a network peripheral responsible for the generation of small, low-latency packets carrying the payload of small messages for local or remote Mailboxes. The Packetizer receives the packet's payload, destination address and size from the RISC-V core via store instructions, sets a user-defined time-out for each packet that it issues, and waits for an end-to-end acknowledgement. Additionally, the Packetizer hardware sets the protection domain ID in the packet header, which we use to guard the system (endpoints) from unwanted accesses.

The Packetizer offers 16 virtual pages with 4 channels per page, each of which can produce a caRVnet packet with up to 64 bytes of payload. Each channel has private destination coordinates; on the other hand, the protection domain (set by the OS) is shared among the channels within each page.
The Packetizer maintains internal memory, implemented in BRAMs and registers. BRAMs are used for storing the packet payload, trigger information, destination address and protection domain. Registers are used for channel status bookkeeping. All the memory modules, except the one that keeps the protection domain, can be written directly by user processes in order to form the header and the payload of the caRVnet packet.
We use dual-port memory blocks, shown in Fig. 5 in red outline, with FPGA resource utilization shown in Fig. 10. These are eight (8) blocks (64x64 bits) for payload data, one (1) (64x64 bits) memory for destination coordinates, and one (1) (16x16 bits) memory that stores the protection domains, to which the kernel has exclusive access. Finally, we use one (1) (64x28 bits) memory to implement a FIFO for the trigger info (size of packet, timeout value and channel ID) and one (1) memory block (48x28 bits) (not included in the figure) for the auto-chaining mechanism discussed below. The address space of the Packetizer that is exposed to the processor is divided between kernel-accessible space and user-process-accessible space. The address space is further divided into 16 virtual/HW-isolated pages (4 KB each). Each page hosts four (4) physical channels, accessible by the owner user process using specific in-page offsets.
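For illustration, the constants below sketch one plausible encoding of this user-visible address map in C. The page count, page size and channel count come from the text; the per-channel stride, field offsets and names are our assumptions, not the actual register map.

/* Illustrative address-map constants for the Packetizer. The 16 pages,
 * 4 KB page size and 4 channels per page are from the text; the in-page
 * offsets below are hypothetical. */
#define PKT_NUM_PAGES      16      /* virtual/HW-isolated pages          */
#define PKT_PAGE_SIZE      0x1000  /* 4 KB per page                      */
#define PKT_CHANS_PER_PAGE 4       /* physical channels per page         */

/* Hypothetical in-page offsets of the per-channel regions. */
#define CHAN_PAYLOAD_OFF(c)  ((c) * 0x100 + 0x000) /* 64 B of payload    */
#define CHAN_DEST_OFF(c)     ((c) * 0x100 + 0x080) /* destination word   */
#define CHAN_TRIGGER_OFF(c)  ((c) * 0x100 + 0x088) /* trigger word       */
#define CHAN_STATUS_OFF(c)   ((c) * 0x100 + 0x090) /* status register    */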
The Packetizer has a single AXI-4 slave port connected to the LD/ST stage of Ariane, as shown in Fig. 2. Back-to-back single-word store and load requests from the processor can be served by the Packetizer, without any latency between requests. Each incoming AXI address from the processor is translated by the ADDR_TRNSL unit and the request is forwarded to the internal (payload or destination-address) data buffers, the status registers or the trigger queue.
A user process issues a packet by writing the payload words (chunks of 64 bits) and the destination address (42 bits for the in-node address and 22 bits for the destination node coordinates) via store instructions into an allocated page of the payload buffer and the destination-coordinates buffer, and then concludes by storing the trigger word, which contains the size (14 bits) and the channel timeout value (8 bits), into the trigger queue; this final store triggers packet generation.
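As a minimal sketch of this store sequence, assuming the hypothetical offsets above and a 32-byte message: the field widths (42/22-bit destination, 14-bit size, 8-bit timeout) follow the text, but the exact bit packing is our assumption.

#include <stdint.h>

/* Sketch of packet issue; "page" is the user-mapped base of an allocated
 * Packetizer page. Field packing below is illustrative. */
static inline void pkt_send_32B(volatile uint64_t *page, int chan,
                                const uint64_t payload[4],
                                uint64_t dest_coords, uint64_t in_node_addr,
                                uint64_t size, uint64_t timeout)
{
    volatile uint64_t *p = page + CHAN_PAYLOAD_OFF(chan) / 8;
    for (int i = 0; i < 4; i++)          /* four 64-bit payload stores   */
        p[i] = payload[i];

    /* 22-bit destination-node coordinates packed above the 42-bit
       in-node address (packing is an assumption).                       */
    page[CHAN_DEST_OFF(chan) / 8] = (dest_coords << 42) | in_node_addr;

    /* Trigger word: 14-bit size, 8-bit timeout. This store enqueues the
       channel ID into the trigger queue and starts packet formation.    */
    page[CHAN_TRIGGER_OFF(chan) / 8] = (timeout << 14) | size;
}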
The Packetizer is optimized for low packet latency, and can produce a caRVnet packet with just one cycle of latency after all stores have completed. When a channel is triggered, its ID (channel index) is enqueued into a queue of triggered channel IDs. The transmission controller (TX in Fig. 5) consumes the entry at the head of this queue and collects the data needed to form the packet header on the caRVnet interface. In the following cycles, it reads the payload of the packet from the payload buffers. The Packetizer is capable of producing caRVnet packets without bubbles as long as this queue remains non-empty.
We also implement an auto-chaining function that enforces ordering for the packets within a page. The user triggers a channel with the auto-chaining bit enabled in the trigger word, and the controller first checks whether the previous channel in the same page has received its (end-to-end) acknowledgment packet: if not, the channel's ID is temporarily stored in a small, special trigger buffer. Once the previous channel receives an acknowledgement, the channel's ID is immediately enqueued into the triggered-channel ID queue, letting the transmission controller start collecting the data of the next channel's packet without any additional user-process intervention.
Each channel in a page has a status register, readable from the AXI interface. The user process (or library) must poll the status of each channel before re-using it. A channel can be used only when it is in the idle state, shown in the FSM of Fig. 8; otherwise a mechanism prevents any illegal actions and asserts an error flag on the AXI interface. After a read of a channel's status while in a non-busy state, the status returns to the idle state. An optional hardware optimization can be enabled that allows the user to re-trigger a channel without first reading the channel's status, but only if the channel concluded its previous task successfully, without any errors.

caRVnet Virtualized Mailbox
The Virtualized Mailbox is an endpoint component of the caRVnet NI, suitable for receiving small messages from one or many Packetizer(s) and consuming them (by local processes that poll the Mailbox) in First-In-First-Out (FIFO) order. It accepts caRVnet packets through its caRVnet interface and stores their payload to be accessed by user-level processes through an AXI-Read interface. As shown in Fig. 6, the Mailbox is virtualized (multiple processes can use the primitive concurrently). It implements multiple (16) physical queues (Mailboxes), addressed in a way that protects processes from unwanted accesses. At reception, the hardware checks the protection domain of the incoming packet. Any queue/Mailbox can be addressed by an incoming packet, but enqueue is permitted only if the protection domain ID set for the specific Mailbox by the local OS, via the AXI-Write interface, matches the protection domain ID in the packet header.
Additionally, when a packet arrives, the hardware checks the payload-CRC indication of the transceiver that received the packet and checks the capacity of the targeted Mailbox. If the data is corrupted or there is not enough space in the 4 KB queue, the packet is dropped. When all the payload words of the packet have been enqueued in the targeted Mailbox or dropped, the hardware generates and issues a response packet (end-to-end acknowledgement) to the sender Packetizer to report the success or failure of the enqueue task.
User-level processes can be informed of the status of a specific Mailbox by reading the size, in bytes, of the enqueued packet at the head of the FIFO, using load instructions on a separate status queue (one per virtual Mailbox/physical queue). After reading the status, the user process can start reading the payload stored in the targeted virtual Mailbox, one (64-bit) word at a time, under the supervision of the internal dequeue controller, as can be seen in Fig. 2.
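A matching receive sketch, under the same caveats as before: we assume "mbox" and "status" are the user-mapped dequeue and status-queue locations, and that a status read returns 0 while the Mailbox is empty.

/* Poll and dequeue one message. Returns the packet size in bytes,
 * or 0 if nothing is enqueued. MMIO layout is an assumption. */
static inline int mbox_poll_dequeue(volatile uint64_t *mbox,
                                    volatile uint64_t *status,
                                    uint64_t buf[8])
{
    uint64_t bytes = *status;            /* size of packet at FIFO head  */
    if (bytes == 0)
        return 0;                        /* Mailbox currently empty      */

    /* Dequeue one 64-bit word at a time, under control of the internal
       dequeue controller described in the text.                         */
    for (uint64_t i = 0; i < (bytes + 7) / 8; i++)
        buf[i] = *mbox;                  /* each load pops one word      */

    return (int)bytes;
}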

Tightly Coupling RISC-V with Network Interface
In this section, we describe our tight integration of the NI with the Ariane RISC-V core. The Ariane core is connected with external memories via the LD/ST stage, through which a load or store instruction may address either a cacheable memory region or non-cacheable I/O. We chose a RISC-V processor core like the Ariane/CVA6 because RISC-V is an open-source, standardized architecture. It is designed to be modular and extensible, which makes it easier to customize processor cores and integrate them seamlessly with custom hardware such as network interfaces and accelerators. Furthermore, Ariane/CVA6 is a relatively simple (i.e., 6-stage, single-issue, in-order) 64-bit, OS-capable CPU. This underpins our belief that integrating custom hardware to augment the capabilities of a RISC-V processor, as in our case with the Ariane/CVA6 CPU, can likewise be implemented in other RISC-V cores, with a straightforward trade-off between performance and ease of implementation depending on how advanced the microarchitecture of the CPU of interest is. We first connected the caRVnet NI as an I/O device. However, as schematically shown in Fig. 9, this slowed down packet formation, because the standard I/O port used in Ariane for non-cacheable accesses could issue 1 store every 7 clock cycles. The formation of a small 32 B packet requires 6 such stores: two for control info and 4 for the packet payload. Issuing one store every 7 clock cycles would tax the packet with a latency of 42 (processor) clock cycles, or 420 ns in our 100 MHz testbed. For this reason, we decided to modify the LD/ST stage of Ariane. Essentially, we created a new port in the LD/ST stage of Ariane, responsible for handling accesses to the NI, allowing back-to-back stores.
Ariane's interface to memory from the LD/ST stage uses a custom protocol. We designed special hardware to convert this custom protocol of Ariane to the AXI4-Lite protocol with no added latency, which we use to connect the caRVnet hardware blocks with the RISC-V core.
While integrating caRVnet with Ariane, we were tempted to develop a custom "peripheral" interconnect instead of AXI, in order to avoid read-after-write (RAW) hazards. Eventually, we abandoned this approach and resolved RAW hazards at the endpoints, as described below. We also tested a variety of AXI switches to connect Ariane cores with caRVnet peripherals, and measured their throughput and latency.

Ariane to AXI converters
Ariane's interface to memory uses a custom protocol. In this section, we describe the special hardware that converts this custom protocol of Ariane to the AXI4-Lite protocol, which we use to connect the caRVnet hardware blocks with the RISC-V core.

As shown in Fig. 7, the load and store instructions that target the NI are first converted to AXI commands by a custom AXI adapter. We use separate converters for the store and load interfaces of Ariane: the store commands are served by the Ar2AXI WR converter, which creates AXI4 Write transactions, and the load commands by the Ar2AXI RD converter, which generates AXI4 Read transactions. With AXI4, there are separate data and address bus pairs (AXI channels) for read and write transactions.
Store instructions are immediately acknowledged to the processor, which allows back-to-back store instructions inside the core. This is feasible in our setup, since the store commands are always consumed by the NI and the system software is responsible for issuing legitimate accesses to the NI. The AXI adapter nevertheless registers bad AXI responses for debug purposes.
Next to the adapter sits a small, optimized AXI4 interconnect (crossbar), which connects the Ariane with the NI peripherals. This interconnect can forward back-to-back store requests to the peripheral, as desired to minimize packet latency. In addition, the small crossbar adds zero (cycle-level) latency overhead. These modifications tightly couple the Ariane core with the primitives of the network interface. However, implementing the custom port inside the RISC-V core required a number of modifications, which we outline in the following paragraphs.

Avoiding RAW Hazards
We are aiming for small end-to-end latency, in which fast transfer initiation is essential. During this process, we observed an unwanted behavior: the load instructions that address the read-status offset may overtake the store instructions that send the descriptor to an NI peripheral. This happens because we immediately acknowledge the store requests from Ariane. Thus, a load instruction may reach the peripheral and read the status before the peripheral has received the stores.
In order to avoid such RAW hazards, we deploy a specially crafted state machine for the channel status in our NI peripherals. The FSM shown in Fig. 8 tolerates such RAW hazards at the peripheral in a way that no misinformation about the status can occur. The FSM has the following states:
• IDLE: The channel has not yet been triggered.
• BUSY: The channel has been triggered and waits for an acknowledgment.
• ACKED: The channel has received a positive acknowledgment.
• FAILURE: The channel has received a negative acknowledgment or has timed out.
When a transmission task completes (and the endpoint acknowledgment is received), the FSM does not return to the IDLE state but rather waits for the CPU to read the status first. In this way, if any read bypasses previous writes, it will return the IDLE state to the CPU, indicating that the channel has not been triggered yet.
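The sketch below shows how a user library can poll a channel status register under this FSM, using the hypothetical offsets from earlier; the numeric state encodings are our assumption. Note how a load that overtook the trigger stores is harmless: it merely observes IDLE and polls again.

/* Channel-status encoding matching the FSM states in the text;
 * the numeric values are assumptions. */
enum chan_status { CH_IDLE = 0, CH_BUSY = 1, CH_ACKED = 2, CH_FAILURE = 3 };

/* Poll until the channel's transmission concludes. Returns 0 on a
 * positive acknowledgment, -1 on NACK/timeout. */
static inline int chan_wait(volatile uint64_t *page, int chan)
{
    for (;;) {
        uint64_t s = page[CHAN_STATUS_OFF(chan) / 8];
        if (s == CH_ACKED)
            return 0;        /* this read also returns the FSM to IDLE  */
        if (s == CH_FAILURE)
            return -1;       /* negative acknowledgment or timeout      */
        /* CH_IDLE (trigger not yet seen by the peripheral) or CH_BUSY
           (ack still pending): keep polling.                            */
    }
}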

Modifications Inside the LD/ST Stage of Ariane
In this section we describe the modifications needed inside the Ariane RISC-V core in order to route accesses (a region of physical addresses) to the NI. The Ariane is an open-source 6-stage RISC-V core [17]. We believe that similar changes can be implemented in other RISC-V cores as well.
Inside the Ariane, store instructions initially enter a speculative queue and, later, when an operation has clearance to proceed, a commit queue. Intervening in the flow of a store operation is considerably more straightforward than in that of a load operation. This is because an ongoing store spends time in the store buffer, where it is eventually flagged as ready to commit, and it already carries the translated physical address. Based on this physical address, it is easy to check whether the target address falls within the designated NI region. If so, we remove (steal) this store from the top of the store buffer and route it over the custom, optimized port to the Network Interface. The store is thus considered committed, while it reaches the custom NI instead of the cache subsystem used by regular (cacheable) accesses.
As seen in Fig. 9, the store buffer is controlled by an FSM. Initially, we tried to modify this FSM in order to steal stores. We managed to do so, achieving a functional design that pivots the designated store operations and sends them over a newly added interface. This connection could handle a new store operation every 4 clock cycles, a clear improvement over the 7 clock cycles of the initial Ariane core implementation with the default AXI interface. The delay of 4 clock cycles was due to the FSM, but also to the clock cycles lost between the time a ready-to-commit store operation reaches the top of the write buffer and the time the FSM takes this signal, makes the proper comparisons, and initiates a dequeue.
To further improve performance, our solution adds new logic that checks the next-to-come top entry every time the current top entry is dequeued from the store buffer. When this second entry falls into the designated address space, a dequeue for the next clock cycle is also prepared. Upon reaching this next cycle, the write operation now at the top of the store buffer is dequeued immediately and reaches the NI port of the processor in the same clock cycle. The FSM remains ignorant of this process: the NI-targeted write accesses are "stolen", while the rest of the processor behaves as if they never existed. With this enhancement, a network-interface write (store) reaches the NI port immediately; at the same time, consecutive write accesses (stores) in a program enter and travel the processor pipeline back-to-back (i.e., with no intervening bubbles), and likewise exit the processor's NI port without hiccups.
Dealing with read requests (loads) is not as straightforward as for writes. In a read operation, a request is first signaled, usually along with the target address, and the response, i.e. the corresponding data, arrives later in time. The data response must be handed back to the execution stage, and, following a common memory model, only then is a consecutive read operation allowed to proceed. This means that we cannot simply "steal" and "vanish" a load, as we did with a store. The Ariane RISC-V core incorporates an optimization that allows a second load in flight. However, appropriate bookkeeping logic has to associate the data response with a proper load identifier, in order to match the data with its initiating load.
In order to speed up load operations by sending them over the new optimized port to the network-interface primitives, we had to thoroughly alter the basic LD/ST FSM. When a new load enters the LD/ST stage, the load unit immediately sends a request to the cache to save time, before the address translation is resolved, as the cache is virtually indexed, physically tagged for efficiency. By the time we have the physical address, the cache request is in flight, so if rule checking designates the NI port, we have to send a kill-request signal to the cache controller. In addition, we have to leave intact all the logic that checks for hazards, such as store-buffer aliasing. The same applies to the logic that allows rolling back from servicing a load operation in case of a TLB miss.

In future work, we want to port this design to cache-coherent RISC-V cores, in which we plan to implement mailboxes inside the main memory and its caches.

caRVnet Resource Consumption
The total CLB utilization of the Ariane and the NI is about 45% of the Programmable Logic (PL), of which almost 22% are LUTs and 8% registers. The total BRAM tile utilization of the Ariane and the NI is close to 24% of the PL. In a modern FPGA like the Virtex UltraScale+, the combined CLB utilization of the Ariane and our NI is close to 5%.

SYSTEM SOFTWARE
The optimized hardware of the NI, together with its tight coupling to the Ariane core, sets a good basis for low-latency communication. However, in order to deliver ultra-low latency to user processes, we had to carefully re-design the system software and, especially, the user-level library through which processes send and receive messages.
The software required for the user to send and receive data using the Packetizer/Mailbox hardware is shown in Fig. 11 and consists of the following layers:
• The Packetizer and Mailbox back-end drivers, which handle basic interactions with the hardware, and the front-end driver, which exposes a single interface for both devices to the user level.
• The user-space library, which implements a set of functions that allow the user to communicate with the kernel-space drivers in order to send and receive messages using the Packetizer/Mailbox hardware.

Packetizer/Mailbox Drivers
To provide user-level access to our hardware, we first need the OS to identify the base address of each hardware module. The kernel module that implements this functionality consists of two backend drivers, which have direct access to the kernel-accessible memory of each device, as well as a frontend driver that acts as an intermediary between the backend drivers and user space. Both backend drivers read and store the physical base address of their corresponding device from the device tree. They are tasked with monitoring the virtualized hardware's availability, configuring each allocated instance upon allocation, and providing the information necessary for accessing the allocated hardware in kernel space. Every page of the two hardware modules has a dedicated register, to which the protection domain of the process using this page is written during the page's configuration.
The frontend driver exposes a set of system calls to user space, through which the user can gain access to the user-accessible memory page of an allocated Packetizer/Mailbox pair. The driver achieves this by calling the functions exposed by the backend drivers, in order to allocate and configure one instance of virtualized hardware. The driver then maps the base address of this instance to a virtual address, which is returned to the user via the system call.
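A minimal sketch of what this mmap path could look like from the user side (it underlies the library's MBOX_ATTACH function described next). The device node name and the mmap-offset convention are hypothetical, since the text does not specify them.

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Obtain the user-accessible 4 KB page of an allocated Packetizer/Mailbox
 * pair via the frontend driver. "/dev/carvnet" is a hypothetical node. */
void *attach_page(void)
{
    int fd = open("/dev/carvnet", O_RDWR);
    if (fd < 0)
        return NULL;
    /* The driver's mmap handler allocates one virtualized-hardware
       instance and maps its base address into the process. */
    void *page = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);
    return page == MAP_FAILED ? NULL : page;
}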

User-space Library
The user-space library is a set of hardware-aware user-level functions that call upon the system calls of the frontend driver to obtain a pair of Packetizer/Mailbox devices and use them to send and receive packets between remote nodes. With this library, the user no longer needs to know how the hardware works and needs only follow the library documentation. The user interface of the library is made up of the following functions:
• MBOX_ATTACH: allocates a Packetizer/Mailbox pair by invoking the mmap system call.
• MBOX_DETACH: deallocates the allocated Packetizer/Mailbox pair (invokes the release system call).
• MBOX_GET_ID: returns the protection ID of the process to which the Packetizer/Mailbox pair has been allocated (invokes the unlocked_ioctl system call).
• PACK_SEND: writes a payload to the Packetizer and triggers the packet formation and transmission.
• MBOX_DEQUEUE: reads the payload of an incoming packet from the Mailbox.
• CHECK_STATUS: checks the status of all channels of the allocated page.
The library also includes a few additional functions that are either called internally and are not necessary to the user, or are simply variations of the enqueue/dequeue functions that allow, for example, non-blocking Mailbox reading.
In the early stages of development, the PACK_SEND function would first check the status of the channels on each page. This added a significant and unnecessary delay, since it was executed for every packet sent. Thus, we opted to offload this task to the user, who now needs to check the status of the channels only after they have been triggered, or when one needs to make sure that a channel is available again. In several scenarios, like the ping-pong tests in our evaluation, after we receive a (pong) packet we can be certain that the channel used to send the (ping) packet is available again, so the user program never needs to check the status.

User-level Program Measurement
To measure the user-space latency of our system, we implemented a program that calls the aforementioned library functions. Latency between two adjacent remote nodes (Node 0, Node 1) is measured on one of the two nodes by a user-space program that implements a repeating ping-pong test. Node 0 sends a 32-byte packet to the remote Mailbox of Node 1 and starts polling its own Mailbox for a response. Once Node 1 receives the packet by polling its own Mailbox, it sends back another 32-byte response packet. The ping-pong iteration concludes with Node 0 reading the response packet from its Mailbox. Latency is measured per iteration, on Node 0 (see Fig. 2), using the RISC-V performance counters by means of the rdcycle assembly instruction.
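The following condensed sketch of one measured iteration on Node 0 reuses the helpers sketched in Section 2; the rdcycle instruction is the one named in the text, while the destination constants are hypothetical and the nanosecond conversion assumes the 100 MHz clock (10 ns per cycle) and a symmetric path.

#include <stdint.h>

#define DEST_NODE1     1ULL   /* hypothetical coordinates of Node 1      */
#define DEST_MBOX_ADDR 0ULL   /* hypothetical in-node Mailbox address    */

/* Read the RISC-V cycle counter. */
static inline uint64_t rdcycle(void)
{
    uint64_t c;
    asm volatile("rdcycle %0" : "=r"(c));
    return c;
}

/* One ping-pong iteration: send a 32-byte ping, poll the local Mailbox
 * for the pong, and report the one-way latency in nanoseconds. */
uint64_t pingpong_once(volatile uint64_t *page, volatile uint64_t *mbox,
                       volatile uint64_t *status)
{
    uint64_t ping[4] = {0}, pong[8];

    uint64_t start = rdcycle();
    pkt_send_32B(page, 0, ping, DEST_NODE1, DEST_MBOX_ADDR, 32, 0xff);
    while (mbox_poll_dequeue(mbox, status, pong) == 0)
        ;                                 /* spin until the pong arrives */
    uint64_t rtt_cycles = rdcycle() - start;

    return rtt_cycles * 10 / 2;           /* 10 ns per cycle at 100 MHz  */
}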

Code Optimization of Library for RISC-V
Running the measurement software initially returned an average round-trip latency of 50 µs. This gives a one-way latency of 25 µs, which was far too slow for the context of this work. However, we managed to reduce this figure significantly using the following methods: (1) We compiled both the library and the measurement program using compile-time optimizations, such as the -O3 directive, bringing the latency down to 6 µs. (2) We removed legacy code in the user-space library that was not implemented for Ariane, and we cleaned up the code, which further reduced the latency down to 5 µs. (3) We used link-time optimizations such as -flto and -fPIC to pre-compile the user-space library; we now compile the measurement software afterwards, including the output files of the user-space library compilation. These optimizations brought the round-trip latency down to 1.86 µs, which gives us a one-way latency of 0.93 µs.
These optimizations improved the overall latency significantly. However, since our purpose is to gather measurements that are as close to the hardware as possible, we removed application-specific code that might be unnecessary in some use cases: (1) We removed the thread-safety features of the code that we had for ARM processors [12]. This means that each Packetizer or Mailbox page can now be used only by a single process, reducing the number of processes that can simultaneously use the hardware to a value equal to the number of pages. We also substituted the C code that writes to the Packetizer and reads from the Mailbox with inline assembly that uses only back-to-back store and load instructions to achieve the same results (see the sketch after this list). This reduced the one-way latency down to 9 µs. (2) We stopped checking the channel status in the library before triggering the formation of a packet. This can still be done by the user if the application requires it. This reduced the one-way latency further to 2.1 µs.
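The flavor of that inline-assembly substitution is sketched below: four consecutive sd instructions emit the payload stores back-to-back. The actual library routine is not shown in the paper, so the operand constraints and scheduling here are illustrative.

/* Emit four 64-bit stores back-to-back to a Packetizer payload region.
 * Illustrative only; the real library code may differ. */
static inline void pkt_store4(volatile uint64_t *p,
                              uint64_t w0, uint64_t w1,
                              uint64_t w2, uint64_t w3)
{
    asm volatile(
        "sd %1, 0(%0)\n\t"
        "sd %2, 8(%0)\n\t"
        "sd %3, 16(%0)\n\t"
        "sd %4, 24(%0)\n\t"
        :
        : "r"(p), "r"(w0), "r"(w1), "r"(w2), "r"(w3)
        : "memory");
}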
These optimizations provided us with a fully optimized, lean runtime that includes only the code features absolutely necessary to complete a packet transaction between two remote nodes. The function calls in the library were further optimized for reduced latency, at the instruction level, using inline assembly specific to RISC-V processors. After the one-off overhead of allocating the hardware with the MBOX_ATTACH function and configuring the node's caRVnet address with MBOX_GET_ID, the user can send and receive messages using the PACK_SEND function, bypassing the OS and its inherent overhead. The PACK_SEND function has a minimal overhead of only 9 assembly instructions in the critical path before the store instructions that write the packet to the hardware (verified by reviewing the assembly code compiled from the library source).

EXPERIMENTAL RESULTS
We evaluate our hardware primitives and the system-software runtime in the two-node RISC-V testbed depicted in Fig. 2. Each RISC-V is a softcore [10] running at 100 MHz, with an L1 cache of 32 KB and access to 1 GB of DRAM over a 128-bit AXI interface. The Ariane cores run a customized Linux OS. In our experiments, caRVnet also runs at 100 MHz, and can thus achieve a bandwidth of up to 50 Gb/s on its 512-bit on-chip data lanes. In this testbed, we run the user-level ping-pong test described in Section 3.3 for 32-byte packets. We measure the latency of each individual iteration using the rdcycle RISC-V assembly instruction.

Initial Latency Measurements
After optimizing both the runtime and the library, as described in Section 3.4, we achieved a one-way user-level latency of 930 ns between two adjacent nodes in the ring network. However, observations of packet transfers in real time using the Vivado Logic Analyzer consistently showed a one-way latency of 720 ns. This discrepancy implied the existence of outliers, which we subsequently measured by taking per-iteration latency samples.
As can be seen in the first graph of Fig. 12, the latency measurements vary significantly, between 720 ns and 600 µs. We notice a concentration of values at 4 main levels. The vast majority of the measurements are around 720 ns, while 0.03% of them gather at either 400 or 600 µs.
Based on the duration and the consistency of these outliers, they can be attributed to context switches caused by timer interrupts. The remaining 0.06% of the values reach up to 2000 ns and are twice as numerous as the interrupt-related outliers. We assume these happen due to page faults that occur at the start and the end of each context switch, and which are resolved by the Memory Management Unit (MMU) through page walks.

Eliminating Stragglers due to Interrupts
To prove these conjectures, we developed a hardware mechanism that allows us to disable timer interrupts during measurements. This can be done from software, using inline assembly, without adding any unwanted overhead. The results can be seen in the second graph of Fig. 12, where no latency measurement exceeds 1000 ns. The average one-way latency is 725 ns, with the majority of measurements at 720 ns.

RELATED WORK
For medium- and large-scale systems, the user traffic between processors, such as MPI communication, is usually handled by a high-performance network such as Mellanox InfiniBand, Cray Aries or Intel Omni-Path Architecture (OPA). InfiniBand is the most widely used network in large supercomputers, such as the Summit supercomputer [16]. The latest commercially available InfiniBand products are based on the NDR (Next Data Rate) standard, supporting up to 400 Gb/s per port. In 2015, Intel introduced Omni-Path, an HPC interconnect running at 100 Gb/s per port. Lately, Atos introduced the Bull eXascale Interconnect (BXI) [6], which is integrated in the Atos BullSequana XH [15].
The Tofu D interconnect is connected through a custom interface to SPARC processors and achieves a latency of just 500 ns (one-way Put latency excluding software runtime overheads) [1]. The same paper also shows a comprehensive latency breakdown. The end-to-end latency of the Cray Slingshot interconnect is between 1 and 2 µs, but it is tolerant to congestion [5]. On the other hand, 2023 InfiniBand networks advertise an end-to-end latency of 600 ns.
For comparison, the latency we obtained in this paper is 720 ns in our FPGA prototype with a single RISC-V softcore running at 100 MHz. Without the network link, the latency is below 600 ns, or 60 processor cycles, i.e. just 60 ns in a system with the processor and NI running at 1 GHz.
Compared to the aforementioned network interfaces, caRVnet provides a customizable solution that offers ultra-low latency, consumes very few resources and can therefore be integrated in the same chip with the main processor. Integrating the network interface in the same chip (or package) with the processors and the memory interconnect offers the possibility of using the same block both for on-chip and for system-level communication, thus saving cost and reducing the silicon-area footprint. In [14], we show that the NI can exploit the IOMMU (Input/Output Memory Management Unit), which is close to the processor, in order to translate process-level virtual addresses to physical memory pages, thus avoiding the need for a separate, synchronized translation mechanism inside the network interface card. Moreover, in this way we do not need to pin the pages involved in the communication, since we handle the occasional page faults that may occur in RDMA transfers by re-transmitting the failing packets in hardware.

CONCLUSIONS
Low latency is a critical component of high-performance computing clusters. Typically, low latency is associated with higher cost. In this work, we design a lean network interface and its runtime from the principles that governed the micro-architecture of the first RISC processors: simple designs that offer the essential functions for high-speed networks: user-level transfer initiation, low latency, reliability, and protection. In order to minimize latency, we optimized the interface between the RISC-V core and the NI, the latency and the user-level access of the NI, as well as the user-level library, using inline assembly specific to the RISC-V processor. The Packetizer generates the packet immediately after receiving the trigger word, which contains the destination address. At reception, the packet is received by a Mailbox that is polled by the runtime on the RISC-V core once every 5 clock cycles, and the packet is dequeued by the runtime in 30 processor clock cycles. We identified stragglers, which we eliminated by disabling the interrupts that caused them. The design of the network is slim regarding FPGA CLB utilization (4.6%) compared to that of the RISC-V core (22.2%). The Mailbox uses a large number of BRAMs to implement the 16 mailboxes, of 4 KB each. In future work, we want to port this design to cache-coherent RISC-V cores, in which we plan to implement mailboxes inside the main memory and its caches.

Figure 4: The caRVnet packet CRC generation in front of the elastic buffers that connect to the 10 Gb/s serial-link transceivers.

Figure 7: Ariane to AXI conversion for memory requests targeting the NI.

Figure 9: In our Fast Interface implementation in Ariane, we steal commands targeting the Network Interface from the Store (commit) queue in order to allow back-to-back store instructions to a non-cacheable region. Using Ariane's default I/O (AXI) interface, we could issue one store per 7 cycles.

Figure 10: The resource utilization of Ariane and the caRVnet NI on a Zynq UltraScale+ FPGA.

Figure 11: The Packetizer and Mailbox software stack.

Figure 12: Latency results for different transfer iterations. The first graph (left) shows results with interrupts enabled; the second graph (right) shows results with interrupts disabled.

Figure 13 describes the sequence of events we see on the Vivado Logic Analyzer snapshot of the requesting node (indexed 0) and estimates the timing of these events on the answering node (Node 1). For Node 0 we have the following series of events:
• 0 CC (Clock Cycle): The Packetizer AWVALID signal is raised, indicating the Packetizer configuration registers are being