CLAY: CXL-based Scalable NDP Architecture Accelerating Embedding Layers

An embedding layer is one of the most critical building blocks of deep neural networks, especially for recommender systems and graph neural networks. The embedding layer dominates a large portion of the total execution time due to its large memory requirements and little data reuse in its operations. To accelerate embedding layers, dual in-line memory module (DIMM) based near-data processing architectures have been proposed. They amplify bandwidth by adding a processing unit to the DIMM's buffer. However, prior architectures have limited capacity scalability due to the limited number of memory channels. More crucially, their performance improvement is bounded by the load imbalance problem and by the multi-drop bus structure between the processing units and the host in DIMM-based memory systems. In this paper, we propose CLAY, a CXL-based scalable near-data processing architecture that accelerates general embedding layers in DNNs. Breaking away from conventional memory channel structures, CLAY interconnects its DRAM modules to reduce the data transfer overhead among them. Furthermore, we devise a dedicated memory address mapping to mitigate load imbalance in CLAY and a packet duplication scheme that enables full utilization of CLAY by reducing the required instruction transmission bandwidth. We also propose a method of scaling CLAY and a software stack to use CLAY. Compared to the state-of-the-art NDP architectures FeaNMP and G-NMP, CLAY achieves an end-to-end speedup of up to 1.87× and 2.77× for recommender systems and graph neural networks, respectively.


INTRODUCTION
Deep Neural Networks (DNNs) [13,22,43,53,74] have been attracting attention, given their high performance in various fields. DNNs have primarily been composed of dense layers, such as the multi-layer perceptron (MLP), which are the main targets of traditional DNN accelerators [6,34,44]. However, embedding layers, which have different characteristics from dense layers, have recently become prominent as a major building block of emerging DNNs such as recommender systems (RecSys) [10,60,85] and graph neural networks (GNNs) [21,42,79] for effective learning from massive data. Embedding layers have memory-intensive and capacity-bound characteristics [20], which leads to a long execution time relative to the amount of computation. Figure 1 shows that the embedding layers of Meta's RecSys (DLRM) [20] and GNNs with various graph datasets [23,50] dominate the total execution time.
Accelerating the embedding layer with traditional accelerators [6,34,44] or GPUs can be inefficient due to their limited memory capacity. As models and datasets grow, the amount of memory required by embedding layers can exceed hundreds of GB [23,50]. A memory system with a large capacity is required to store the embedding data, whereas the memory capacity of the latest GPUs [62,63] is up to 80 GB. To match the capacity requirement, tens of GPUs are needed. However, using expensive GPUs for the embedding layer is inefficient, considering sparse data access patterns and simple reduction operations such as element-wise summation.
To address these problems, near-data processing (NDP) architectures have been proposed [3,37,46,64,72,82,83,86]. They exploit DIMM-based memory systems, which have larger memory capacity than GPU memory. They place a processing unit (PU) per rank in the buffer chip of a load-reduced DIMM to accelerate embedding layers by utilizing rank-level parallelism. However, they have scalability issues in terms of both capacity and performance. The capacity can be expanded by increasing the number of DIMM slots per channel or the number of memory channels in a CPU, but such memory capacity expansion is limited due to the high cost of increasing the physical pin count and signal integrity issues.
With regard to performance, they suffer from (1) data transfer time between the host and PUs and (2) load imbalance among PUs on multiple ranks. First, the NDP architectures are bottlenecked by memory-channel bandwidth while the host performs the final reduction. The partial results from the reduction of each PU (local reduction) must be transferred to the host via memory channels (data transfer) to produce the final result (global reduction). This process cannot exploit rank-level parallelism. As the number of ranks in a channel increases, data transfer takes more time, making scalable performance improvement difficult. Second, due to the sparse nature of the embedding layer, it is hard to distribute the load equally to all ranks. When the load is imbalanced across ranks, performance is bound by the slowest rank. This problem becomes more severe as the number of ranks increases, resulting in less performance improvement.
Compute Express Link™ (CXL) [8,9], a new open industry standard cache-coherent interconnect, enables a new organization of the NDP architecture. CXL is emerging as a way to solve the scalability problem of DIMM-based systems by supporting memory pooling, expansion, and disaggregation. CXL memory communicates with the host using the CXL protocol rather than the traditional DDR interface. Thus, the internal implementation design and internal memory address mapping of CXL memory have a higher degree of flexibility than the conventional memory system.
In this paper, we propose CLAY, the first scalable CXL-based NDP architecture accelerating general embedding layers. Based on a design exploration of CXL-based NDP architectures, CLAY breaks away from the multi-drop bus connection of DIMM-based memory systems to reduce data transfer time. CLAY consists of DRAM clusters, each of which has an NDP module and multiple DRAM devices. The DRAM clusters are interconnected on board so that they can communicate directly with each other without the need for the host. To reduce local reduction time, we explore the performance of CLAY in terms of load imbalance over embedding data distributions. Based on this analysis, we propose a memory address mapping for CLAY that shows the best performance.
We propose a multi-CLAY system using a CXL-switch for even greater scalability, allowing for operations on huge embedding tables. To fully utilize the multi-CLAY system, the host must send commands to more DRAM clusters at the same time. We propose packet duplication to relieve the bandwidth burden between the host and DRAM clusters. CLAY accelerates end-to-end generic DNNs by cooperating with existing processors (e.g., GPUs) for the remaining layers. Finally, we make it easy to use CLAY by suggesting a software stack that integrates into applications.
In this paper, we make the following key contributions:
• Breaking away from the multi-drop bus of conventional memory systems, we propose CLAY, the first CXL-based NDP architecture for the embedding layer, using an on-board interconnect with fine-grained address mapping.
• We propose a multi-CLAY system that cooperates with other processors to perform end-to-end inference and propose packet duplication to fully utilize the multi-CLAY system.
• Our evaluation shows that CLAY improves the end-to-end performance for RecSys and GNN by up to 4.92× (1.87×) and 9.85× (2.77×), respectively, compared to the baseline CPU system (and the state-of-the-art NDP architectures FeaNMP and G-NMP, respectively).

BACKGROUND

DNNs with Embedding Layers
A deep-learning-based recommender system (RecSys) usually predicts the click-through rate, indicating the probability of the user clicking the item [60,85]. DLRM [20], a representative RecSys model, consists of embedding lookups, a bottom MLP, and a top MLP (Figure 2(a)). In an embedding lookup, the embedding vectors are indexed from the embedding table based on the sparse feature components and combined into an intermediate embedding vector through an element-wise sum. A Graph Neural Network (GNN) is a deep-learning-based model for graph analysis. Each node has an embedding vector, and the set of these vectors is called the feature matrix (Figure 2(b)). GNN models, including GCN [42], GIN [79], and SAGEConv [21], are composed of several GNN layers, each consisting of an aggregation phase and a combination phase. The aggregation phase is performed for all nodes, where each node reads the embedding vectors of its adjacent nodes and produces a single intermediate vector through an element-wise average or summation. This intermediate vector is passed to an MLP, referred to as the combination phase. Unlike RecSys, the result of a GNN layer needs to be stored in memory for the subsequent GNN layer.
Both the embedding lookups in RecSys and the aggregation phase in GNN are embedding layers. An embedding layer gathers embedding vectors (gathering) from specific indices of an embedding table (DLRM) or a feature matrix (GNN) and then reduces the vectors into a single embedding vector through a simple element-wise operation (e.g., weighted summation). In general, embedding layers access a small number of vectors, leading to sparse memory access patterns overall, but exhibit dense access patterns within each vector, as all elements of the selected vectors are accessed [20,83]. Hereafter, we refer to both the embedding lookup in RecSys and the per-node aggregation in GNN as pooling.
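As an illustration, the following C++ sketch shows one pooling operation as described above: a gather of selected embedding vectors followed by a weighted element-wise sum. The table layout and types are illustrative assumptions, not the layout used by CLAY.

```cpp
#include <cstddef>
#include <vector>

// One pooling operation: gather the embedding vectors selected by `indices`
// from a table and reduce them with a weighted element-wise sum.
std::vector<float> pool(const std::vector<std::vector<float>>& table,
                        const std::vector<std::size_t>& indices,
                        const std::vector<float>& weights) {
    const std::size_t dim = table.empty() ? 0 : table[0].size();
    std::vector<float> out(dim, 0.0f);
    for (std::size_t i = 0; i < indices.size(); ++i) {
        const std::vector<float>& vec = table[indices[i]];  // gathering (sparse across vectors)
        for (std::size_t d = 0; d < dim; ++d)
            out[d] += weights[i] * vec[d];                   // reduction (dense within a vector)
    }
    return out;
}
```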

Compute Express Link (CXL)
Compute Express Link™ (CXL), the open industry standard cache-coherent interconnect from the CXL consortium, opens up the possibility of memory expansion. CXL runs on the physical layer of PCIe and offers high-bandwidth and low-latency connectivity between a host processor and devices such as accelerators, memory buffers, and smart I/O devices.
CXL supports three protocols: CXL.io, CXL.cache, and CXL.mem. CXL.io is based on standard PCIe and is utilized to connect, configure, and manage CXL devices. CXL.cache allows a CXL device to access the host processor's memory while guaranteeing cache coherency, and CXL.mem enables the host processor to access CXL device memory. According to the CXL protocols they support, CXL devices are categorized into three types: Type 1 (CXL.io, CXL.cache), Type 2 (CXL.io, CXL.cache, CXL.mem), and Type 3 (CXL.io, CXL.mem). Typical accelerator devices are Type 1 (i.e., devices without local memory) or Type 2 (i.e., devices with local memory) devices. Type 3 devices do not support the cache-coherent access protocol, CXL.cache, so they serve as memory expanders that expand the memory capacity and bandwidth of a host processor.

Dual In-Line Memory Module vs. CXL
A conventional main-memory system consists of one or more channels, each connecting a memory controller (MC) to dual in-line memory modules (DIMMs).A DIMM is composed of multiple ranks, each comprising several DRAM devices that operate in tandem by receiving the same command/address (C/A) signal.Server DIMMs employ a buffer device to repeat the C/A signal to mitigate the signal integrity issues when populating multiple DRAM devices.
CXL can provide the scalability of memory capacity through flexible memory expansion with Type-3 CXL devices (CXL memory) and CXL-switches, which is advantageous for DNN applications requiring large amounts of data. While the host processor communicates with main-memory DIMMs according to JEDEC standards [30,31], CXL memory is connected to a root complex in the host processor through the CXL-switch and communicates via packetized read/write requests (Figure 3).

LIMITATIONS OF DIMM-BASED NDP

DIMM-based NDP Architecture
To accelerate the embedding layers in RecSys and GNN models, various DIMM-based near-data processing (NDP) architectures have been proposed [3,37,39,46,64,72,82,86]. The embedding layer has a low data reuse rate due to sparse data access, and its aggregate function is a simple element-wise operation. Therefore, the ratio of computation to data in the embedding layer (i.e., operations per byte) is low. Moreover, the embedding table size exceeds hundreds of GB, so the performance of the embedding layer depends on memory bandwidth. Such a memory-intensive embedding layer is suitable for acceleration with NDP. As the embedding tables are larger than HBM capacity [62,63], prior NDP architectures are built on DIMM-based memory systems.
In a DIMM-based memory system, a memory channel operates as a multi-drop bus. A huge embedding table must be stored across multiple ranks, but only one rank can occupy the memory channel and transfer data at a time. Thus, prior DIMM-based NDP architectures place a processing unit (PU) at each buffer device to access multiple ranks concurrently (rank-level parallelism). Each PU performs the embedding reduction in parallel using the data stored in its rank (local reduction). To get the final results, the host processor reads the results of the local reductions and adds or concatenates them (global reduction).

Limitations of Prior Works
Limited capacity scalability. DIMM-based main-memory systems have limitations in capacity expansion. Buffer-on-board solutions (e.g., IBM's Centaur [69]) can increase the memory capacity in a DIMM. However, the number of memory channels supported per socket is limited due to the cost of increasing physical pins. Also, the signal integrity issue bounds the number of DIMMs per channel.
In RecSys and GNN, the size of the embedding table continues to increase to hundreds of GB, even reaching several TBs, to accommodate more items or nodes and to achieve higher model accuracy [49,77]. Thus, it becomes more challenging to allocate the entire embedding tables in the main-memory system. It is possible to expand memory capacity by creating an NDP accelerator on a dedicated board (not CXL), but it is inefficient because normal applications cannot use the large memory space of the NDP accelerator as main memory while the accelerator is not utilized for NDP operation.
Limited performance scalability. Prior DIMM-based NDP architectures all utilize rank-level parallelism, so they are required to equip more DRAM ranks for higher performance. However, these architectures cannot improve performance in proportion to the number of ranks for the following reasons. First, if the number of DIMMs increases, memory-channel bandwidth becomes the bottleneck during data transfer (Figure 4). Although multiple PUs can process local reduction in parallel, the host processor performs global reduction by reading each PU's partial results sequentially (data transfer) through one channel. Also, as the number of DIMMs increases, the number of partial results to read increases. Therefore, NDP architectures with more DIMMs take more time for data transfer, which is a primary inhibitor of performance improvement. Second, the load imbalance issue is exacerbated when populating more ranks for NDP, increasing local reduction time. Since each rank has a different number of embedding vectors to read, the execution time is bound by the most heavily loaded rank. Figure 5 shows the load imbalance rate according to the number of ranks across which the embedding table is stored. The load imbalance rate is the ratio of the largest load among ranks to the average load. For example, if four loads are placed on a 2-rank system and the ranks take one and three, respectively, the load imbalance rate becomes 3/2 = 1.5. As the number of ranks increases, the load imbalance rate becomes higher. To distribute loads equally across all ranks, we can make each rank store a fragment of every embedding vector so that each rank has the same number of vectors to read [46]. However, if the size of the embedding vector fragment is smaller than the DRAM granularity, memory bandwidth is wasted, degrading performance.
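The following is a minimal C++ sketch of the load imbalance rate defined above (largest per-rank load divided by the average per-rank load); the vector-of-loads representation is only an illustrative assumption.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Load imbalance rate = (largest per-rank load) / (average per-rank load).
// For four lookups split 1 and 3 between two ranks: 3 / 2 = 1.5.
double load_imbalance_rate(const std::vector<double>& per_rank_loads) {
    double max_load = *std::max_element(per_rank_loads.begin(), per_rank_loads.end());
    double avg_load = std::accumulate(per_rank_loads.begin(), per_rank_loads.end(), 0.0)
                      / per_rank_loads.size();
    return max_load / avg_load;  // 1.0 means a perfectly balanced distribution
}
```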

CLAY: SCALABLE NDP ARCHITECTURE
To address the scalability challenges of the prior DIMM-based NDP architectures, we propose CLAY, a CXL-based scalable NDP architecture. CLAY is built on a Type-3 CXL device to provide better scalability than DIMM-based memory systems through switches, allowing greater memory capacity. More importantly, the CXL interface enables a flexible internal design of the NDP architecture while maintaining the functionality of normal CXL memory.
We first explore the design space of CXL-based NDP architectures for CLAY, focusing on reducing data transfer time. To reduce local reduction time, we estimate the performance of different embedding data allocation methods from the perspective of load imbalance. Based on the analysis, we propose CLAY's memory address mapping. Finally, we elaborate on the operation flow.

Exploring NDP Design Space
Given the CXL-based NDP architecture, the PU can be located on the CXL controller or DRAM device side (see Figure 6). CXL memory consists of a CXL controller, which communicates with the host or other CXL devices, and DRAM devices. DRAM devices in a rank are connected to the MC in the CXL controller through a channel. Before sending data to the host, the PUs can compute on the CXL controller (Figure 6(a)) or the datapath of each rank (Figure 6(b)), but there are still limitations.
Placing the PU on the CXL controller has the disadvantage of low scalability. The PUs can only exploit the bandwidth of the memory channels connected to the CXL controller. To increase internal bandwidth, more channels must be connected to the CXL controller, but it is limited by the constraint of increasing the CXL controller's pin count. Also, from a CXL memory perspective, it is inefficient to have a larger internal bandwidth between the CXL controller and the DRAM devices under the limited external bandwidth.
To utilize internal rank-level bandwidth, we can put PUs in the middle of the data path for each DRAM rank, similar to previous DIMM-based NDP structures (Figure 6(b)). In this case, each PU can utilize the full bandwidth of its rank regardless of the number of channels. However, it suffers from long data transfer time due to the multi-drop bus structure between PUs, as discussed in Section 3. To reduce the amount of transmitted data in data transfer, additional PUs can be placed on the path between an MC and the rank PUs [3,37,64]. However, they cannot always reduce data transfer time; it depends on how the embedding table is partitioned across ranks. For example, the additional PU is ineffective when it must concatenate the local reduction results from each rank PU, as it cannot reduce the size of the transferred data. In conclusion, the data transfer time issue arises because the PU in the CXL controller processes the global reduction entirely. It could be solved if the PUs at each rank could communicate with each other and process the global reduction themselves.
To make scalable connections between PUs, CLAY organizes multiple DRAM clusters, each being a multi-chip module composed of DRAM devices and a PU, on the printed circuit board and interconnects them (Figure 7(a)). DRAM devices in a DRAM cluster receive the same C/A signals and behave like a rank. Unlike the previous DIMM-based NDP architectures, direct data transfer between PUs is allowed, and each PU processes part of the global reduction, reducing the time spent on data transfer. Further, each PU can concurrently write its global reduction result to its DRAM cluster, reducing write-back time.
Depending on the target application, the number of DRAM devices in a DRAM cluster can be configured to achieve different read granularities. Also, the optimal topology and routing algorithm of the interconnect can vary depending on the traffic pattern. Because most data traffic inside CLAY is all-to-all communication, an interconnect with high bisection bandwidth is appropriate. In this paper, we model CLAY using a 2D-mesh topology and XY routing, which is relatively simple but sufficient for high performance and energy efficiency improvements (details in Section 8).
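For reference, the following is a minimal C++ sketch of dimension-ordered (XY) routing on a 2D mesh, the routing scheme modeled above; the port names and coordinate encoding are illustrative assumptions rather than CLAY's actual router implementation.

```cpp
// Dimension-ordered (XY) routing: a packet first travels along the X dimension
// until it reaches the destination column, then along the Y dimension.
enum class Port { Local, East, West, North, South };

Port xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (dst_x > cur_x) return Port::East;   // correct X first
    if (dst_x < cur_x) return Port::West;
    if (dst_y > cur_y) return Port::North;  // then correct Y
    if (dst_y < cur_y) return Port::South;
    return Port::Local;                     // arrived at the destination DRAM cluster
}
```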

Organization of the CLAY Architecture
CLAY consists of a CLAY controller and DRAM clusters. The CLAY controller communicates with external devices using the CXL protocol. It receives the embedding reduction requests from the host processor and distributes the requests to each cluster. The DRAM cluster is a unit module that performs local embedding vector reduction; each can simultaneously reduce the vectors locally. It consists of multiple DRAM devices, a DRAM controller, a vector processing unit (vector PU), buffers, a router, and a cluster controller (Figure 7(a)). The DRAM controller generates and schedules DRAM commands. The vector PU is made up of multiply-accumulate units (MACs) and performs weighted vector summation for local reduction and global reduction. The buffers temporarily store partial sums or a portion of the embedding vectors. The router processes packets forwarded over the interconnect. The cluster controller controls the other units according to the requests from the CLAY controller.

The CLAY controller packetizes NDP instructions or normal memory read/write requests and sends them to the corresponding DRAM cluster. Each packet consists of a packet number, packet type, source ID, destination ID, and payload (Figure 7(b)). The packet number identifies the packet. The packet type could be a pooling request, data transfer, or normal CXL memory read/write. Source and destination IDs are used for routing a packet on the interconnect. The payload includes embedding vector reduction information or the address/data for a normal read/write request. If the data size exceeds the maximum packet size, it will be divided into multiple packets. The packet is routed over the interconnect and forwarded to the corresponding DRAM cluster. Then, the cluster controller decodes the packet, accesses the DRAM devices with the DRAM controller, and performs operations.
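To make the packet format concrete, a possible C++ layout of the fields listed above is sketched below; the field widths, enum encoding, and use of a byte-vector payload are assumptions for illustration, not the actual on-wire format.

```cpp
#include <cstdint>
#include <vector>

// Illustrative packet layout following the fields described in the text.
enum class PacketType : uint8_t { PoolingRequest, DataTransfer, MemRead, MemWrite };

struct ClayPacket {
    uint32_t packet_number;        // identifies the packet (shared by its fragments)
    PacketType type;               // pooling request, data transfer, or normal read/write
    uint8_t src_id;                // source DRAM cluster / controller ID for routing
    uint8_t dst_id;                // destination DRAM cluster ID for routing
    std::vector<uint8_t> payload;  // reduction info, or address/data of a read/write
};
// A payload larger than the maximum packet size is split across packets
// that share the same packet number.
```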
CLAY can also be used as normal CXL memory. When the CLAY controller receives a normal memory request from an external device, it sends the request to the target DRAM cluster. Then, the cluster controller accesses the data in the DRAM devices using the DRAM controller. After that, for reads, it packetizes the data and sends them to the CLAY controller. Then, the CLAY controller sends it to the external device via the CXL protocol. In this way, memory access requests are transferred over the interconnect, increasing memory access latency. In our configuration with 16 DRAM clusters (details in Section 7), the worst-case CXL memory access latency is increased by about 13 ns. It is insignificant, considering that the CXL memory access latency is approximately 200 ns [51]. Also, it is preferred to allocate applications less sensitive to memory access latency to slower CXL memory instead of faster DIMM-based main memory in heterogeneous memory systems [71]. Therefore, it will cause only slight performance degradation for normal applications.

Distributing Load Among Clusters
Each DRAM cluster in CLAY performs local reduction using the data allocated in its DRAM devices. Thus, depending on how data are distributed across DRAM clusters, the amount of data to be processed by each cluster varies. Figure 8 shows the load distribution of each DRAM cluster according to the form of embedding table allocation. Each of the four DRAM clusters stores a quarter of the embedding table and gathers and reduces the orange-colored data. Figure 8(a) presents a case where embedding vectors are divided horizontally, so each vector is stored in one DRAM cluster entirely. In this case, the required amount of memory access of each DRAM cluster differs according to the gathering list, leading to load imbalance. The DRAM cluster completing the reduction in advance must wait for the other clusters to finish, being underutilized. This load imbalance worsens as the number of DRAM clusters increases.
Figure 8(c) shows the opposite case, where each embedding vector is divided vertically and distributed to all DRAM clusters. Each DRAM cluster reads its part of each embedding vector in the gathering list and performs the reduction. The loads are perfectly balanced, as all DRAM clusters read the same amount of data. However, if the size of the embedding vector partition stored in a DRAM cluster is smaller than the DRAM cluster's read granularity, DRAM bandwidth is wasted (red shading) due to reading unused data.
The best allocation method to reduce local reduction time is to divide each embedding vector at the memory access granularity and distribute the pieces across the DRAM clusters (Figure 8(b)). This minimizes load imbalance without wasting DRAM bandwidth. We conducted experiments on the performance difference among the data allocations to verify this (Figure 9). We configured one CLAY consisting of 16 DRAM clusters and measured the execution time of local reduction in RecSys and GNN embedding layers (more details in Section 7). In the graph, CLAY-N denotes that one embedding vector is distributed across N DRAM clusters. CLAY performs best when each cluster stores 16 elements of one embedding vector, which matches the DRAM cluster's read granularity (64 bytes). These results show that using the finest-grained division that does not cause DRAM bandwidth waste performs best.

Address Mapping
We design a memory address mapping for CLAY to allocate the embedding data considering load balance between DRAM clusters (Figure 10). To support the best allocation of Figure 8(b), we put the DRAM cluster index bits at the bottom of the CLAY address map. Upon receiving the embedding data from the host, the CLAY controller sends it to the corresponding DRAM clusters according to the CLAY address mapping.
To request a reduction operation from CLAY, the host transmits a pooling request containing a model index, table index, and embedding vector index. Due to CLAY's large capacity, two or more models can be placed together in CLAY. The host can send pooling requests for multiple models in sequence, and CLAY can process embedding layers for multiple models. The CLAY controller receiving a request first generates the start address of the embedding table using the model index and table index. Then, by adding an offset computed from the embedding vector index and the vector size, the CLAY controller generates the address of the corresponding embedding vector. Because each vector is striped across clusters at the read granularity, the size of the embedding vector determines the number of DRAM clusters within the cluster group. Each cluster within a group is referred to by the cluster group offset (Figure 10), and the remaining DRAM cluster index bits, excluding the cluster group offset, become the cluster group index. Using this address map, the CLAY controller sends pooling requests, including the vector index in the cluster, to the DRAM clusters that store the corresponding embedding vector. Then, each DRAM cluster generates a DRAM address to read the vector.
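The following C++ sketch illustrates the address-mapping idea described above, assuming a 64-byte DRAM-cluster read granularity and power-of-two vector sizes; the helper names and exact arithmetic are illustrative assumptions rather than CLAY's actual bit layout.

```cpp
#include <cstdint>

// Assumed read granularity of one DRAM cluster (bytes per access).
constexpr uint64_t kReadGranularity = 64;

// The number of clusters in a cluster group is set by the vector size,
// e.g., a 512-byte vector (128 fp32 elements) spans 512 / 64 = 8 clusters.
uint64_t clusters_per_group(uint64_t vector_bytes) {
    return vector_bytes / kReadGranularity;
}

// Which cluster inside the group holds the s-th 64-byte slice of any vector:
// the low-order cluster-index bits act as the cluster group offset.
uint64_t cluster_group_offset(uint64_t slice_index) {
    return slice_index;
}

// In-cluster byte offset of a vector: every cluster in the group keeps exactly
// one 64-byte slice per vector, so the offset advances by one granule per vector.
uint64_t in_cluster_offset(uint64_t table_base_in_cluster, uint64_t vector_index) {
    return table_base_in_cluster + vector_index * kReadGranularity;
}
```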
With this mapping, the CLAY controller must send the same pooling request to multiple clusters within the cluster group. For example, the pooling request for the first embedding vector only needs to be sent to C0 in Figure 8(a) but must be sent to C0 and C1 in Figure 8(b). Therefore, the CLAY controller needs more bandwidth for sending pooling requests. To solve this problem, we propose packet duplication in Section 5.1.

Operational Flow of CLAY
CLAY supports embedding layer operation for batched inference to improve throughput. CLAY splits a batch into multiple windows based on each DRAM cluster's buffer size in order to store partial reduction results. In each window, local reduction and global reduction are performed sequentially (Figure 11). After all clusters perform the local reduction of one window, CLAY performs global reduction on the local reduction results. CLAY uses the same computing units for global reduction as for local reduction. Each cluster is responsible for a part of the global reduction in one window, and all other clusters send the local results needed for that computation over the interconnect (data transfer). Each cluster performs global reduction on the fly while receiving local results from other clusters. Then, the final results are written back to DRAM (write back) in all DRAM clusters at the same time. When the operations of all windows are complete by repeating these processes, the embedding layer for one batch is finished.

For even greater scalability, multiple CLAYs can be connected through a CXL-switch (Figure 12). When an embedding table spans multiple CLAYs, it is partitioned across the CLAYs in the vertical manner of Figure 8(c), to avoid data transmission between CLAYs through the CXL protocol. Therefore, each CLAY produces a fragment of the complete reduction result, and the cooperative processor (i.e., CPU or GPU) concatenates them before use. CLAYs can even be scaled out to rack level because CXL can be used as a low-latency rack-level interconnect through the CXL-switch [67].

CLAY exploits packet duplication to send pooling request packets using less bandwidth. To fully utilize CLAYs, the host should provide enough pooling requests to them. Figure 12 shows the system structure with multiple CLAYs using a CXL-switch. Sending pooling request packets to each DRAM cluster directly requires large bandwidth, especially between the host and the CXL-switch, through which all packets must pass. However, as described in Section 4.4, multiple DRAM clusters work with the same pooling request because one embedding vector is distributed to the DRAM clusters in a cluster group. Also, in a system with multiple CLAYs, all CLAYs that divide up one embedding table operate on the same pooling request. If the pooling requests are duplicated along the way instead of being sent as separate packets directly to each DRAM cluster on each CLAY, we can reduce the bandwidth required to send pooling requests.
Packet duplication is performed in two stages, at the CXL-switch and within CLAY. First, since the path between the CXL-switch and each CLAY has a separate channel, the CXL-switch duplicates the packets received from the host and sends them to each CLAY at the same rate. During this duplication, the address of the request packet from the host is mapped to the address space of each CLAY, which is described in Section 6. The second packet duplication takes place inside CLAY. If the CLAY controller sent packets directly to all DRAM clusters in the target cluster group, the path between the CLAY controller and the DRAM clusters would be overloaded. Thus, to reduce this load, the CLAY controller sends a request packet to only one cluster, and each cluster receiving the packet forwards it to the next remaining target cluster. When the CLAY controller receives pooling request packets from the host, it includes in the packet the IDs of the first and subsequent target DRAM clusters in order. The DRAM cluster receiving the packet checks the next target cluster ID and forwards the packet to it.
We can calculate the required bandwidth for sending pooling request packets with packet duplication. Suppose that a system has four CLAYs, each with 32 DRAM clusters (CLAY specification in Table 1), the largest system in our experiment, and the embedding vector dimension is 128 (512 bytes), each vector being divided across the four CLAYs. Due to the address mapping of CLAY, the embedding vector portion stored on each CLAY is divided into two DRAM clusters. Thus, each DRAM cluster receives one request packet of 8 bytes (i.e., an index and a weight value of 4 bytes each) to gather its embedding vector portion (64 bytes) and perform reduction. To get the most out of the DRAM clusters, the host should send packets to each DRAM cluster at 2.6 GB/s, one-eighth of 20.8 GB/s (a DRAM cluster's maximum read bandwidth). Without packet duplication, the host must send packets to the CXL-switch at a bandwidth of 332.8 GB/s for the 128 DRAM clusters in four CLAYs, exceeding the bandwidth of PCIe 5.0 ×16. Moreover, the CLAY controller and DRAM clusters must be connected with a bandwidth of 83.2 GB/s for the 32 DRAM clusters in a CLAY. Using packet duplication, we can reduce the required bandwidth to 41.6 GB/s, as an identical packet is duplicated for each CLAY and for the DRAM clusters that divide one embedding vector within the same cluster group of each CLAY.
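A back-of-the-envelope check of these figures, using only the constants quoted above, is sketched below; it is a sanity-check calculation, not part of the CLAY design.

```cpp
#include <cstdio>

int main() {
    const double cluster_read_bw   = 20.8;  // GB/s, max read bandwidth per DRAM cluster
    const double request_bytes     = 8.0;   // index (4 B) + weight (4 B) per pooling request
    const double slice_bytes       = 64.0;  // embedding vector portion read per request
    const int    clusters_per_clay = 32;
    const int    num_clays         = 4;

    // Each 8-byte request triggers a 64-byte read, so the request stream needs
    // one-eighth of a cluster's read bandwidth: 20.8 / 8 = 2.6 GB/s per cluster.
    double per_cluster_req_bw = cluster_read_bw * (request_bytes / slice_bytes);

    double host_bw_no_dup = per_cluster_req_bw * clusters_per_clay * num_clays;  // 332.8 GB/s
    double clay_bw_no_dup = per_cluster_req_bw * clusters_per_clay;              //  83.2 GB/s

    // With duplication, the 4 CLAYs sharing a vector and the 2 clusters sharing each
    // CLAY's portion all receive copies of one packet: 332.8 / (4 * 2) = 41.6 GB/s.
    double host_bw_dup = host_bw_no_dup / (num_clays * 2);

    std::printf("no-dup host %.1f GB/s, per-CLAY %.1f GB/s, dup host %.1f GB/s\n",
                host_bw_no_dup, clay_bw_no_dup, host_bw_dup);
    return 0;
}
```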

End-to-End Inference with CLAY
CLAY cooperates with other processors (e.g., a CPU, GPU, or other accelerators) for end-to-end DNN inference. CLAY processes the embedding layer while the cooperative processor deals with the other layers. CLAY operates as a CXL memory, so other cooperative processors can write input data or read output data of the embedding layer in CLAY through the CXL.mem protocol. Through the CXL.io protocol, similar to standard PCIe, the host manages the entire process by receiving the status of the CLAYs and other processors and controlling them. The host checks the completion of each layer from the CXL devices and synchronizes their operations.

In the case of RecSys, CLAY handles the embedding lookups while the cooperative processor (hereafter exemplified by a GPU) handles the MLPs. Under host control, the bottom MLP and the embedding layer can be executed independently and simultaneously on the GPU and CLAY, respectively. After both layers are complete, CLAY and the GPU notify the host of their completion. Then, the host controls the GPU through the CXL.io protocol (e.g., launches a kernel) to read the result of the embedding layer from CLAY and execute the top MLP.
In the case of GNN, CLAY performs the aggregation phase, while the GPU performs the combination phase. As the combination phase uses the output of the aggregation phase, the GPU must read the aggregation result from CLAY through the CXL.mem protocol. Because the combination results are also used in the next aggregation phase, the GPU writes the combination result back to CLAY memory. CLAY and the GPU repeat this process to complete each GNN layer.
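The per-layer interaction described above can be summarized by the following C++ sketch of the host-side orchestration; every function name is a hypothetical placeholder for the CXL.io/CXL.mem interactions, not a real driver API.

```cpp
#include <cstdio>

// Hypothetical placeholders standing in for CXL.io control and CXL.mem data movement.
void clay_run_aggregation(int layer)           { std::printf("CLAY aggregation, layer %d\n", layer); }
void wait_for_clay_completion(int layer)       { (void)layer; /* host polls CLAY status via CXL.io */ }
void gpu_read_aggregation_from_clay(int layer) { (void)layer; /* GPU reads CLAY memory via CXL.mem */ }
void gpu_run_combination(int layer)            { std::printf("GPU combination, layer %d\n", layer); }
void gpu_write_combination_to_clay(int layer)  { (void)layer; /* input of the next aggregation */ }
void wait_for_gpu_completion(int layer)        { (void)layer; }

// One GNN layer: aggregation on CLAY, combination on the GPU, result written back to CLAY.
void run_gnn_layer(int layer) {
    clay_run_aggregation(layer);
    wait_for_clay_completion(layer);
    gpu_read_aggregation_from_clay(layer);
    gpu_run_combination(layer);
    gpu_write_combination_to_clay(layer);
    wait_for_gpu_completion(layer);
}
```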
We assume that processing the embedding layer with CLAY and processing other layers with a cooperative processor that uses data in CLAY are performed sequentially. Nevertheless, if other devices need to access CLAY memory during the embedding layer, CLAY prioritizes the memory accesses from those devices over embedding layer operations to minimize memory access latency. When the CLAY controller receives such a memory access request, it sends the request to the target DRAM cluster. However, the cluster controller needs to complete the ongoing pooling request before performing the memory access request; thus, additional memory access latency may occur.

Data Coherence
CLAY can act as a memory expander of the host processor; thus, a system with CLAY must ensure the coherence of data allocated to CLAY. Because the data in CLAY may be modified inside the CXL memory by NDP operations, the host can only allocate the memory region for CLAY as uncacheable through the CLAY software stack. When other CXL devices (e.g., GPUs or accelerators) modify the data in CLAY memory (e.g., after the combination phase of GNN), they need to ensure coherence with CLAY by flushing their caches; in cases where those CXL devices do not support cache flush, this can be replaced by writing dummy data to the cache and thereby evicting all data of CLAY memory from the cache. In end-to-end inference, the host synchronizes the CXL devices at each layer, as in Section 5.2; thus, the embedding layer operation in CLAY and other operations carried out on the same data by different devices cannot be performed simultaneously. Therefore, CLAY does not modify data while other devices use the same data stored in their caches. In the case of the DRAM clusters in a CLAY, the local reduction results of each cluster are exchanged between clusters during global reduction. Because the DRAM clusters do not guarantee coherency with each other, they leverage message-passing style communication to exchange data.

SOFTWARE STACK FOR CLAY
We propose a software stack for using CLAY (see Figure 13(a)). A DNN application can store embedding data and perform embedding layer operations through the CLAY library. The CLAY device driver supports sending and receiving data to and from the CLAY controller via the CXL protocols (i.e., CXL.io and CXL.mem). There are two memory spaces in the CLAY controller: the configuration space and the pooling request space. These spaces are allocated as memory-mapped regions by the BIOS at boot time. The host can access these regions through the CXL.io protocol to control CLAY. In the configuration space, model configurations are stored, such as the number of embedding tables, the number of embedding vectors, the embedding vector dimension, and the starting address of each embedding table and the output. The host uses the pooling request space to send pooling requests to CLAY.
Figure 13(b) shows exemplar pseudo-code for CLAY, which initializes CLAY, generates appropriate pooling requests, and transmits them to the CLAY controller.
CLAY::initialize_embedding. The host allocates memory for the embedding table and the embedding layer output in a contiguous address space using the CLAY library. Then, it stores the embedding data in CLAY through the CXL.mem protocol. The CLAY controller receiving the CXL.mem request sends the data to the proper DRAM clusters using CLAY's address mapping described in Section 4.4.
CLAY::set_configure. The host processor sets the embedding layer configuration (i.e., the number of embedding tables, the number of embedding vectors, and the embedding vector dimension) of each model in the configuration space using the CXL.io protocol. Also, the host sets the memory offset of each embedding table and output. The CLAY controller uses these offsets to determine the memory address where each embedding vector is stored.
CLAY::generate_request. To operate the embedding layer in CLAY, the host processor transmits the pooling request to the CLAY controller. The pooling request includes indices and weights for one pooling operation in a specific batch of an embedding table of a model. The CLAY controller receiving the pooling request specifies the target DRAM cluster and sends the packet containing the model index, table index, batch index, pooling index, and the pairs of vector index and weight to the DRAM clusters. Then, each cluster controller reads data from the DRAM address corresponding to the vector index in the cluster and performs a reduction.
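Since the pseudo-code of Figure 13(b) is not reproduced here, the sketch below shows how an application might drive the three CLAY library calls just described; the argument lists and stub bodies are illustrative assumptions, not the actual CLAY API.

```cpp
#include <cstdint>
#include <vector>

namespace CLAY {
// Assumed signatures mirroring the three calls described in the text.
void initialize_embedding(int model, const std::vector<std::vector<float>>& tables)
    { (void)model; (void)tables; /* store tables to CLAY via CXL.mem */ }
void set_configure(int model, int num_tables, int num_vectors, int vector_dim)
    { (void)model; (void)num_tables; (void)num_vectors; (void)vector_dim; /* CXL.io writes */ }
void generate_request(int model, int table, int batch, int pooling,
                      const std::vector<uint32_t>& indices, const std::vector<float>& weights)
    { (void)model; (void)table; (void)batch; (void)pooling; (void)indices; (void)weights;
      /* write to the pooling request space */ }
}  // namespace CLAY

// Possible usage for one pooling of one table in one batch of a model.
void example(const std::vector<std::vector<float>>& tables) {
    CLAY::set_configure(/*model=*/0, /*num_tables=*/32, /*num_vectors=*/1 << 20, /*vector_dim=*/128);
    CLAY::initialize_embedding(/*model=*/0, tables);
    CLAY::generate_request(/*model=*/0, /*table=*/3, /*batch=*/0, /*pooling=*/0,
                           /*indices=*/{17, 42, 96}, /*weights=*/{1.0f, 1.0f, 1.0f});
}
```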

EXPERIMENTAL SETUP
Simulation framework. We modeled CLAY with a trace-driven, cycle-level simulator by modifying Ramulator [40] and Booksim2 [33]. We used the timing parameters of the DDR5 JEDEC specification [31]. We modeled the interconnect of CLAY as a 2D-mesh topology (4×2, 4×4, and 4×8 for 8, 16, and 32 DRAM clusters, respectively). We used the XY routing algorithm and six virtual channels per port. Each port is connected by a 20 GB/s link [2,73]. The CLAY controller is connected to four DRAM clusters. The interconnect bisection bandwidth is 80, 160, and 160 GB/s for 8, 16, and 32 DRAM clusters, respectively. More details of CLAY are presented in Table 1.
Evaluated architectures. For the baseline, we modeled a CPU-only system by running the embedding layer of each model on the CPU, extracting the memory trace using Intel Pin [56], and feeding these traces into the modified Ramulator.
We compared CLAY with state-of-the-art DIMM-based NDP architectures: TensorDIMM [46], RecNMP [37], and FeaNMP [54] for RecSys, and GNNear [86], G-NMP [72], and GraNDe [82] for GNN. The performance of the embedding layer for these architectures is mainly determined by their embedding table allocations. RecNMP [37] and TensorDIMM [46] evenly divide embedding tables among ranks horizontally and vertically, respectively, as shown in Figure 8(a) and (c). GNNear [86] evenly divides embedding tables horizontally among DIMMs while dividing the portion of the embedding table in each DIMM vertically among ranks. GraNDe [82] and FeaNMP [54] adopt the optimal allocation among those of (TensorDIMM, RecNMP) and (TensorDIMM, RecNMP, GNNear), respectively, for each embedding layer. Because the data allocation of G-NMP is not specified, we assumed that G-NMP chooses the better embedding table allocation (vertical or horizontal) among DIMMs and among ranks for each layer.
We assumed that TRiM [64], also a DIMM-based NDP architecture for RecSys, exploits only rank-level parallelism and performs the same as RecNMP. Although TRiM can exploit bank-group- or bank-level parallelism, we exclude those architectures from the comparison because they require modifying DRAM devices, whereas CLAY uses conventional DRAM devices. TRiM adopts horizontal embedding table allocation among ranks, the same as RecNMP.
We assumed that the CXL memory and the DIMM-based memory system populate the same number of DRAM devices in our evaluation; thus, CLAY and the DIMM-based NDP architectures utilize the same amplified internal bandwidth.
Real-system experiments. We also conducted real-system experiments to obtain the end-to-end DNN performance improvement of CLAY. We used an Intel Xeon Gold 6230 with 256 GB of memory and a Tesla V100 [61]. To identify the pure execution time of each layer, excluding that of the PyTorch framework wrapper, we implemented DLRM (RecSys) and GCN (GNN) using the Intel MKL library [28] and used OpenMP [12] and CUDA [55] for the CPU and GPU versions, respectively. We measured the end-to-end execution time of each model by combining simulation and real-system experiment results. The execution time of the embedding layer was measured by simulation, while we used the real-system execution time for the remaining layers, including data transfer time between the NDP architectures and host/GPU memory. For GPU data transfer time, we measured the time of cudaMemcpy on PCIe 3.0 (the current machine) and estimated the time on PCIe 5.0/6.0 (CXL) by scaling based on the bandwidth ratio.
Benchmarks. For evaluation, we used DLRM and GCN, representative models of RecSys and GNN. Table 2 summarizes the characteristics of the DLRM and GCN models and each dataset. For DLRM, we used a batch size of 128 and the criteo dataset [11], which is a publicly available real trace used by previous works [3,19,48,64]. We used the four biggest tables of the criteo dataset and duplicated them to make 32 tables [37]. We modified the original data into three variants in which the number of vectors to be gathered (gathering number) is 20 (criteo 20), 40 (criteo 40), and 80 (criteo 80).
GCN consists of three GCONV layers [23]. The embedding vector dimension of the first aggregation layer is the same as the input dimension of the embedding vector of each dataset, and those of the second and third GCONV layers are set to 128 and 256, respectively. We used the arxiv, mag, products, and papers datasets from the Open Graph Benchmark [23], which provides large-scale graph datasets used by recent works [17,24,25,81,82], and the amazon dataset [50] used by [76] for GCN. The size of each element of an embedding vector in all datasets is four bytes (32-bit floating-point).
Power and area. To estimate power consumption in DRAM devices, we used the Micron DRAM power calculator [57] with the DDR5 datasheet [58]. We also estimated the power consumption of the interconnect (assuming each link is 80 mm) and the on-package I/O between the DRAM devices and the PU in CLAY, referring to [73]. For the power and area estimation of the PU, we designed the arithmetic units and routers [14] in Verilog and synthesized them using Synopsys Design Compiler with a 7 nm predictive process design kit (ASAP7 [7]). The arithmetic units run at 400 MHz, resulting in an aggregate throughput of 25.6 GB/s across 16 MACs, which is greater than the maximum bandwidth of a DRAM cluster with four DDR5-5200 devices (20.8 GB/s). The router has four input and output ports and routes one flit (128 bits) per port at a 1.5 GHz frequency, which is greater than the link speed of 20 GB/s. We referred to the area overhead and power consumption of the DRAM controller in [4], which could be smaller considering the latest technology. We modified FinCACTI [66] to match the published information of SRAM [5,27,32,35,68,78] and used it to model the SRAM-based buffers.
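As a sanity check on the PU sizing figures above, the following short calculation reproduces the quoted throughputs from the stated parameters (16 MACs at 400 MHz on 4-byte elements, 128-bit flits at 1.5 GHz); it is only an arithmetic check, not part of the design.

```cpp
#include <cstdio>

int main() {
    const double mac_units  = 16;
    const double mac_freq   = 400e6;   // Hz
    const double elem_bytes = 4;       // 32-bit floating point
    double mac_gbps = mac_units * mac_freq * elem_bytes / 1e9;     // 25.6 GB/s > 20.8 GB/s

    const double flit_bytes  = 128.0 / 8;  // one 128-bit flit
    const double router_freq = 1.5e9;      // Hz
    double router_gbps = flit_bytes * router_freq / 1e9;           // 24 GB/s > 20 GB/s link

    std::printf("MAC array: %.1f GB/s, router port: %.1f GB/s\n", mac_gbps, router_gbps);
    return 0;
}
```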

EVALUATION

Execution Time Breakdown of CLAY
To quantify the benefit of CLAY, we measured the time taken by local reduction, data transfer, and write back in CLAY and in the naïve CXL memory NDP architecture (Base) of Figure 6(b), varying the hardware configuration and embedding vector size (Figure 14). In this experiment, we configured one CLAY consisting of 8, 16, or 32 DRAM clusters. We set the same number of DRAM devices in a rank of Base as in a DRAM cluster. Base has four channels, and execution times are normalized to Base with eight ranks for each embedding vector dimension and dataset. The interconnect of CLAY reduces both data transfer and write back time. CLAY reduces data transfer, write back, and total execution time by up to 71%, 86%, and 57% compared to Base, respectively. CLAY performs global reduction on all DRAM clusters with direct data transfer between DRAM clusters. Also, all DRAM clusters perform write back concurrently using the amplified bandwidth. In contrast, Base uses a centralized controller to perform global reduction and a multi-drop bus for write back. CLAY becomes more effective as we increase the number of DRAM clusters (i.e., scale up). As the number of DRAM clusters increases, the total size of the local reduction results to transfer for global reduction increases, although the local reduction time decreases. A decrease in the embedding vector size or in the number of gatherings per pooling reduces the reduction work on each DRAM cluster, increasing the data transfer time ratio. Therefore, Base experiences a significant surge in the data transfer time ratio. However, CLAY limits this increase through direct data transfer between DRAM clusters. This result shows that CLAY has better performance scalability than Base.
Finally, CLAY eliminates the bottleneck of sending pooling request packets by reducing the required bandwidth with packet duplication, accelerating the reduction operation. Figure 15 shows the embedding layer speedup over the execution time on the CPU according to the number of CLAY devices. Without packet duplication, 4 CLAYs achieve only a 1.02× and 1.02× speedup over 2 CLAYs on DLRM and GCN, respectively, because the required bandwidth to send pooling request packets exceeds the bandwidth of PCIe, resulting in low utilization of CLAY. In contrast, with packet duplication, the performance increases linearly with the number of CLAY devices due to the reduced bandwidth requirement.

Comparison with Prior DIMM-based NDP Architectures
We performed an execution time breakdown for one embedding layer of CLAY and the prior NDP architectures (TensorDIMM, RecNMP, and FeaNMP for DLRM, and GNNear, G-NMP, and GraNDe for GCN) for comparison. We used two CLAY configurations: 16 and 32 DRAM clusters. In each case, the other architectures used a main memory system with 4Channel-2DIMM-2Rank and 4Channel-4DIMM-2Rank, respectively, consisting of the same number of DRAM devices. The embedding vector dimensions for DLRM and GCN are 128 and 256, respectively. The execution time of DLRM is the average value over a single batch request. For the prior NDP architectures, we assume that local reduction, data transfer, and sending instructions can be pipelined, and we ignore global reduction time in favor of the prior NDP architectures. We assume the instruction needed for one gathering is 8 bytes (index and weight). Also, we assume that data transfer and sending instructions utilize the maximum memory bandwidth.
CLAY performs better than the other NDP architectures in most cases by balancing the load across DRAM clusters and reducing data transfer time. Figure 16 shows the normalized execution time of one embedding layer for DLRM and GCN. The red diagonally striped box represents the overhead due to data transfer time that cannot be hidden within the local reduction time. In the case of TensorDIMM, each embedding vector is split across 16 ranks in 4Channel-2DIMM-2Rank. As the embedding vector dimension is 128 (512 bytes), the embedding vector portion stored per rank is 32 bytes, wasting half of the memory bandwidth. RecNMP stores each embedding vector entirely in one rank, not wasting memory bandwidth, but suffers from load imbalance, resulting in lower performance than TensorDIMM in 4Channel-2DIMM-2Rank. In 4Channel-4DIMM-2Rank, TensorDIMM wastes more than half of the bandwidth, resulting in lower performance than RecNMP. FeaNMP with 4Channel-2DIMM-2Rank and CLAY with 16 DRAM clusters outperform TensorDIMM and RecNMP thanks to adaptive address mapping that reduces local reduction time by mitigating the load imbalance rate compared to the previous architectures. Also, CLAY achieves higher performance than FeaNMP by exploiting more fine-grained mapping and enabling data transfer between clusters during global reduction without sending data to or receiving data from the host.
In the case of GCN, GNNear and G-NMP show a significant data transfer overhead, which is more prominent in 4Channel-4DIMM-2Rank. The result of the embedding layer in GCN must be written back to memory, and this write-back time is added to the overhead. This overhead is exacerbated as the NDP architectures equip more ranks. In contrast, GraNDe incurs low overhead because it reduces write-back time through pipelining with local reduction and broadcasting the instruction. CLAY performs slightly worse than the ideally modeled GraNDe with 16 DRAM clusters (4Channel-2DIMM-2Rank). Nevertheless, when we scale the system to 32 DRAM clusters, CLAY achieves better performance than GraNDe. CLAY's address mapping properly distributes the embedding vector as the number of DRAM clusters increases. However, GraNDe's limited memory mapping makes the portion of the embedding vector stored in a single rank smaller than the memory access granularity, leading to wasted memory bandwidth and degraded performance.

End-to-End Performance Evaluation
We evaluated the end-to-end performance by combining the simulation results of the embedding layers and the execution time of the remaining MLP layers measured on the actual CPU and GPU machines. We converted the simulation results (speedup) of the embedding layers into execution time on a real machine by multiplying them with the embedding layers' execution time on the actual CPU (baseline). We assumed the embedding layer is performed on the CPU or the NDP architectures due to the dataset size. We included the data transfer time between the CLAYs and the GPU in the execution time of the MLP layers. Figure 17 compares the execution time of end-to-end DLRM and GCN for various machine combinations. The execution time of the MLP layers is shorter on the GPU than on the CPU and with PCIe 6.0 than with PCIe 5.0. In the system with four CLAYs and a PCIe 6.0-based GPU, 4 CLAY achieves performance improvements of 8.08×, 6.24×, 6.89×, 9.03×, and 11.73× for the criteo, arxiv, amazon, mag, and products datasets compared to the baseline system, respectively.
CLAY can also provide a higher quality of service than other NDP architectures. Data centers need to process many requests with high throughput while maintaining low latency [18,20]. Figure 18 shows the throughput and latency of CLAY and the other NDP architectures under input requests with a Poisson distribution. CLAY with GPU improves throughput and latency over the other NDP architectures with its higher end-to-end performance per batch. 4 CLAY + GPU further increases throughput by up to 2.40× (1,742 batches/s).

Energy, Power, and Area Analysis
Figure 19 compares the energy consumption and power between CLAY and the other architectures. CLAY reduces the energy consumption for the embedding layer by up to 56% and 20% compared to the CPU and the state-of-the-art NDP architectures, respectively. Although CLAY's address mapping increases the number of ACTs by splitting each embedding vector into multiple DRAM clusters, CLAY reduces energy consumption by reading data through on-package I/O rather than off-chip I/O. CLAY also consumes less static energy by reducing the embedding layer execution time. The interconnect incurs additional energy consumption, but it is negligible. The CPU consumes more static energy than ACT and RD/WR energy because the CPU reads data from only one rank at a time in a channel, which leads to a longer execution time. On average, CLAY consumes about 38.45 W. The peak power of CLAY is 55.97 W, including the interconnect peak power of 8.99 W, which is under the 70 W power budget of the board [41].
CLAY with 16 DRAM clusters incurs 3.60 mm² of area overhead, excluding DRAM devices, corresponding to about 5% of a 24 Gb DDR5 device [45]. Specifically, the areas of the arithmetic unit, buffers, router, and DRAM controller in a PU per DRAM cluster are 0.004 mm², 0.038 mm², 0.003 mm², and 0.18 mm², respectively. The interconnect is implemented with wires on the PCB, which typically consists of four to eight layers. We can implement CLAY with a 2D-mesh topology that only requires connections between adjacent DRAM clusters.

DISCUSSION
Why CXL. Implementing CLAY with CXL rather than PCIe provides significant benefits. CXL's support for heterogeneous systems facilitates direct data transfers between CLAY and other processors for end-to-end inference. Also, CXL-switches enable a system to easily attach multiple CLAYs when the embedding layer is too large for a single CLAY. Moreover, CLAY can be used as a CXL memory expander. The Samsung CXL-PNM architecture [65] demonstrated CXL's efficiency in eliminating data duplication and transfers between host and accelerator memories.
Page management and fragmentation. During CLAY initialization, the CLAY device driver allocates contiguous memory space for the embedding table. The OS manages these pages in the same manner as other pages. If contiguous memory allocation is not possible due to memory fragmentation, the CLAY device driver migrates the memory spaces already allocated within the desired contiguous memory region. This migration is processed only once during initialization, so the overhead is not incurred during inference.
Concurrency for multi-process environments. CLAY may receive normal memory requests from other processes during an NDP operation because it also has to serve as a normal CXL memory expander. In such multi-process environments, where the embedding layer and other processes operate concurrently, we assume that CLAY prioritizes normal memory requests over NDP operations, considering the relatively long latency of embedding layer processing. Thus, when the cluster controller receives conventional memory requests, it pauses memory request generation for the NDP operation and resumes after the conventional request processing finishes.

RELATED WORK
Accelerating RecSys and GNN. [1,36,38,47,48,59,84] accelerate the inference or training of at-scale RecSys on a GPU or a heterogeneous computing system. [70,77] accelerate the inference of RecSys through in-memory computation on the SSD, considering the large embedding table. [15,52,80] proposed ASIC solutions to accelerate GCN models with high-bandwidth memory (HBM [29]). [16,75,76] increase data reusability in GNN by reconstructing the graph in consideration of the adjacency between graph nodes, reducing the execution time of GNN. To the best of our knowledge, CLAY is the first CXL-based architecture to accelerate generalized embedding layers.
CXL use cases. Beacon [26] proposed CXL-based NDP accelerators for genome analysis. It uses CXL to leverage abundant memory and high communication bandwidth. Pond [51], which is built on CXL, is a full-stack memory pool that satisfies the requirements of cloud providers.

CONCLUSION
We have proposed CLAY, a CXL-based NDP architecture accelerating the general memory-intensive embedding layers of various DNN models. We first identified the limitations of DIMM-based NDP architectures for accelerating embedding layers in RecSys and GNNs. Breaking away from the multi-drop bus of conventional memory systems, which hinders performance improvement, we designed CLAY to effectively reduce data transfer time by interconnecting the DRAM clusters. We introduced fine-grained memory address mapping to reduce local reduction time by mitigating load imbalance across DRAM clusters within CLAY. We suggested packet duplication to reduce the bandwidth required to send pooling request packets across CLAYs. CLAY improves the end-to-end performance of RecSys and GNN by up to 1.87× and 2.77× compared to the state-of-the-art NDP architectures FeaNMP and G-NMP, respectively.

Figure 1 :
Figure 1: Time breakdown of the recommender system and graph neural network (details in Section 7).

Figure 2 :
Figure 2: The workflows of RecSys and GNN. Each has dense layers and embedding layers, which produce a single intermediate vector by aggregating embedding vectors.

Figure 3 :
Figure 3: Full system with CXL memory. CXL memory communicates with the CPU through the CXL interface.

Figure 4 :
Figure 4: The operation process in the prior NDP architectures, which can perform local reduction in parallel, but partial-sums must be sequentially transferred to the host.

Figure 5 :
Figure 5: Box-plot graph of the load imbalance rate according to the number of ranks. The load imbalance rate is the ratio of the most loaded rank to the perfectly distributed load.

Figure 6 :
Figure 6: CXL-based NDP architecture with PUs located on (a) the CXL controller or (b) each DRAM rank.

Figure 7 :
Figure 7: (a) CLAY architecture using the interconnect between DRAM clusters and the structure of a PU. (b) Packet format.

Figure 8 :
Figure 8: The load distribution with various embedding table allocations with 64 bytes DRAM cluster access granularity.

Figure 10 :
Figure 10: Memory address mapping of CLAY.

Figure 12 :
Figure 12: A multi-CLAY system and packet duplication.

Figure 13 :
Figure 13: (a) A software stack for CLAY and (b) an exemplar pseudo-code exploiting CLAY.

Figure 15 :
Figure 15: The speedup of the embedding layer depending on the number of CLAY devices, compared to the CPU. The embedding vector dimension in both DLRM and GCN is 128.

Figure 16 :
Figure 16: Normalized execution time of NDP architectures for an embedding layer. The embedding vector dimension is 128. The red diagonally striped box represents the overhead due to data transfer.

Figure 17 :
Figure 17: The normalized end-to-end execution time of DLRM and GCN. The MLP layers are computed by the CPU or GPU, and the embedding layers are computed by the CPU or NDP architectures.

Table 1 :
Configurations used for evaluation.

Table 2 :
The datasets and model configuration used in evaluation. Each embedding vector element is a 32-bit floating-point value.

Figure 14: Normalized execution time of the naïve CXL memory NDP architecture (Base) and CLAY for the embedding layer, varying hardware configuration and embedding vector size.