A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models

Recommendation systems are of crucial importance for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc. To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data. Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale. In this paper, we provide insights into the intriguing and challenging inference domain of online recommendation systems. We propose the HugeCTR Hierarchical Parameter Server (HPS), an industry-leading distributed recommendation inference framework, that combines a high-performance GPU embedding cache with a hierarchical storage architecture, to realize low-latency retrieval of embeddings for online model inference tasks. Among other things, HPS features (1) a redundant hierarchical storage system, (2) a novel high-bandwidth cache to accelerate parallel embedding lookup on NVIDIA GPUs, (3) online training support and (4) light-weight APIs for easy integration into existing large-scale recommendation workflows. To demonstrate its capabilities, we conduct extensive studies using both synthetically engineered and public datasets. We show that our HPS can dramatically reduce end-to-end inference latency, achieving 5x to 62x speedup (depending on the batch size) over CPU baseline implementations for popular recommendation models. Through multi-GPU concurrent deployment, the HPS can also greatly increase the inference QPS.


INTRODUCTION
Recommendation Systems (RS) are used in various apps and online services, such as news feeds, e-commerce, social networks, search, etc. To provide accurate predictions, state-of-the-art algorithms rely on embedding-based deep learning models. Figure 1 illustrates the typical architecture of a deep recommendation model (DLRM). The input consists of dense features (e.g., age, price, etc.) and sparse features (e.g., user ID, category ID, etc.). The sparse features are transformed into dense embedding vectors through lookup in an embedding table, so that the result from combining these with the dense features can be fed through some densely connected deep learning model (e.g., an MLP, transformer, etc. [38,39]) to predict the Click-Through Rate (CTR).
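The lookup-then-dense-compute structure described above can be sketched in a few lines. The following is a toy illustration with NumPy stand-ins, not HugeCTR code; the table size, the two-key sparse input, and the single-layer "dense model" are assumptions chosen for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_KEYS, EMB_DIM, NUM_DENSE = 1000, 16, 13
embedding_table = rng.normal(size=(NUM_KEYS, EMB_DIM))   # sparse-feature embeddings
mlp_w = rng.normal(size=(NUM_DENSE + 2 * EMB_DIM, 1))    # toy stand-in for the dense model

def predict_ctr(dense_features, sparse_keys):
    """dense_features: (batch, NUM_DENSE); sparse_keys: (batch, 2) embedding keys."""
    emb = embedding_table[sparse_keys]                    # gather = embedding lookup
    x = np.concatenate([dense_features, emb.reshape(len(emb), -1)], axis=1)
    return 1.0 / (1.0 + np.exp(-x @ mlp_w))              # sigmoid -> CTR estimate

batch_dense = rng.normal(size=(4, NUM_DENSE))
batch_keys = rng.integers(0, NUM_KEYS, size=(4, 2))
ctr = predict_ctr(batch_dense, batch_keys)
print(ctr.shape)  # (4, 1)
```

The gather in the first line of `predict_ctr` is the operation whose acceleration this paper is concerned with; at real scale, `embedding_table` does not fit in GPU memory.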
Embeddings can consume a significant portion of the memory capacity in a data center. Often, a significant amount of time is spent to retrieve these embeddings from a centralized parameter server, which adds latency that delays downstream computations. Unlike in throughput-oriented training systems [5, 7, 12-14, 16, 17, 22, 42], online inference systems are tightly constrained by latency requirements [40]. Thus, the embedding lookup speed is essential for deep recommendation model inference performance.
During inference, each mini-batch of data usually references tens of thousands of embeddings. Realizing the exhaustive search of each embedding by its key requires the parameter server to walk certain internal data structures. The lookup of individual embeddings from an embedding table is usually independent, and, thus, easily parallelizable. At the same time, modern GPU architectures allow scheduling thousands of threads to run concurrently, and their memory subsystems adopt special memory technology that provides higher bandwidth and throughput than equivalent CPU memories [27]. These features could make GPU architectures ideal for processing embedding vector lookup workloads.
Challenges. The size of embedding tables used in state-of-the-art recommendation models can be vast, often ranging from tens of gigabytes to several terabytes, which is well beyond the memory complement of most GPUs. Furthermore, batch sizes during online inference are usually too small to efficiently utilize the massively parallel-processing-optimized computational resources of just a single GPU. Hence, embedding lookup workloads require large amounts of GPU memory, but only few computational resources. This imbalance of requirements significantly deviates from the available hardware, and diminishes GPUs' attractiveness for use in inference systems. Therefore, most existing solutions decouple the embedding lookup operation from the dense computations (i.e., the remainder of the model), which are executed in the GPU, and move it to the CPU [21]. Thereby, they forfeit the memory bandwidth advantages of GPUs, while the CPU and the communication bandwidth between the CPU and the GPU become the primary bottleneck. As a result, the disproportionate processing capabilities of GPUs sit mostly idle in such setups (i.e., resources are wasted).
Approach. It is usually not possible to retain all embedding tables entirely in GPU memory. However, empirical evidence for real-world recommendation datasets suggests that embedding key access during inference for CTR and other recommendation tasks often exhibits strong locality, and approximately follows a power-law distribution [5,7,12,17]. Hence, a significant proportion of the embedding keys per mini-batch reference only a small set of hot embeddings. Caching such hot embeddings in the GPU memory, where the remainder of the model is processed, makes partial GPU-accelerated embedding lookup possible. Based on these observations, we have built an inference framework, namely the HugeCTR Hierarchical Parameter Server (HPS), to take advantage of GPU resources, without being constrained by GPU memory limitations. In particular, HPS introduces a GPU embedding cache data structure that tries to retain hot embeddings within the GPU memory. The cache is complemented by a parameter server that keeps a full copy of all embedding tables. Our contributions can be summarized as follows:
• A hierarchical database architecture that allows utilizing cluster memory resources, and provides an asynchronous update mechanism to maintain a high GPU embedding cache hit rate during online inference.
• A high-performance dynamic GPU embedding cache that maximizes throughput by tracking and caching frequently occurring embeddings in high-throughput GPU memory, while overlapping the host/device transfers.
• An online model update mechanism for distributed inference deployments (i.e., real-time updates).
• A customizable HPS backend that provides concurrent model execution, hybrid model deployment, and ensemble model pipeline services for the NVIDIA Triton GPU inference server [31].
This paper is structured as follows. In Section 2, we provide a fundamental discussion of core concepts that underpin our approach. Then, we subsequently introduce and discuss the individual components of the HPS and how they interact in Sections 3~5. In Section 6, we discuss how our HPS realizes real-time model updates. Eventually, we conduct an experimental study to evaluate the performance of the HPS in Section 7, and provide concluding remarks in Section 8.

BACKGROUND

Embedding Tables
Current mainstream algorithms in advertising, recommendation and search adopt model structures that combine embedding tables with a deep neural network to form a deep learning recommendation model (DLRM) [24]. At the foundation of such models are embeddings, which represent learned numeric representations of user or item features as dense vectors that are aligned in some d-dimensional space (e ∈ R^d). We let E_f = {e_0^f, e_1^f, ..., e_n^f} denote some discrete subset of the embeddings for some feature f. For easy access within the model, we organize these embeddings as embedding feature tables that consist of tuples ⟨k_i^f, e_i^f⟩, where k_i^f is a key that identifies and references the i-th embedding table entry e_i^f. The key space and the value of each key depend on the underlying data or task. Usually, the key space is sparsely populated.
To evaluate a DLRM for CTR (cf. Figure 1), the driver application must first select the entries from the embedding table that are relevant to make the prediction. This is simply done by looking up¹ the keys from the query key subset Q_f for each embedding feature table (i.e., gathering the referenced embedding vectors into per-table result sets). Accelerating the retrieval of such result sets at scale is our primary objective.

Deduplication and skewness.
To avoid unnecessary double-lookups when the same embedding table entries are required multiple times, HugeCTR always applies a deduplication operator prior to executing any subsequent steps (i.e., Q* = dedup(Q)). This is particularly important for mini-batch processing, where Q is the concatenation of many input samples. Naturally, deduplication becomes more effective as the skewness of the query distribution Q increases.
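The deduplication step can be sketched as follows (an illustrative NumPy stand-in, not HugeCTR's operator): look up each unique key once, then scatter the results back to the original, possibly repeated, query positions.

```python
import numpy as np

query_keys = np.array([7, 3, 7, 7, 1, 3])          # mini-batch Q with repeated keys
unique_keys, inverse = np.unique(query_keys, return_inverse=True)   # Q* = dedup(Q)
# ... perform the (expensive) lookup only for unique_keys ...
unique_vectors = unique_keys[:, None] * 0.1        # stand-in for the looked-up embeddings
restored = unique_vectors[inverse]                 # one row per original query position
print(len(unique_keys), len(restored))             # 3 6
```

With a highly skewed query distribution, `len(unique_keys)` is much smaller than `len(query_keys)`, which is exactly why skew makes deduplication effective.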
Understanding and utilizing skewness properties of the dataset is pivotal to achieving peak efficiency. Many real-world recommendation datasets (e.g., Criteo [6]) exhibit a power-law distribution [3]. That is, certain subsets of keys are referenced more frequently than others, such that sampling k ∼ Q eventually approximates P(k) ∝ k^(−α). Figure 2 depicts a scenario where the embedding key recall statistics approximate a power-law distribution. The key space can be divided into three categories: (1) Frequent embeddings effectively appear in every batch. They represent a significant fraction of the recall/update requests. The frequent set is usually small; even for large embedding corpora, only up to a few thousand embeddings appear that regularly. (2) Stochastic embeddings appear every few batches (i.e., somewhat regularly over time). (3) Rare embeddings are at the far end of the spectrum. They appear rather infrequently in queries.
Because requests repeatedly reference frequent and stochastic embeddings, applying efficient caching methods to them improves the overall system performance the most. Our HPS design (see Section 3) builds on this observation.
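The concentration of traffic on a small hot set can be reproduced numerically. The sketch below samples query keys from a power-law distribution and measures how much of the traffic the hottest 1% of keys covers; the key count, query count, and α are illustrative choices, not parameters taken from the paper's datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_KEYS, NUM_QUERIES, ALPHA = 100_000, 1_000_000, 1.2

ranks = np.arange(1, NUM_KEYS + 1)
probs = ranks ** -ALPHA                 # P(k) proportional to k^(-alpha)
probs /= probs.sum()
queries = rng.choice(NUM_KEYS, size=NUM_QUERIES, p=probs)

counts = np.bincount(queries, minlength=NUM_KEYS)
hot = np.sort(counts)[::-1]             # key frequencies, most frequent first
top1pct_share = hot[: NUM_KEYS // 100].sum() / NUM_QUERIES
print(f"top 1% of keys cover {top1pct_share:.0%} of queries")
```

Under these parameters, the hottest 1% of keys covers well over half of all queries, which is the property the GPU embedding cache exploits.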
Such category assignments of embeddings are fully determined if the query dataset is fixed. When training HugeCTR models, we take advantage of this to achieve world-class model convergence rates [8,19]. During online inference, the recall statistics depend on the actually incoming user requests, which cannot be known in advance. Due to sudden events, changing trends or fashion, the category assignment of individual embeddings can vary over time. For most recommendation tasks, the runtime statistics are in constant flux. Thus, inference systems must be adaptive.

¹Direct key lookup is the predominantly used method. For complex models, other methods to determine the query keys Q may exist.

GPU-accelerated Inference Architecture
Parameter servers for ML inference workloads mostly rely on database operations that are trivially parallelizable with GPUs [35,37,41]. Applications that require fast response times, e.g., online transaction processing (OLTP), often benefit greatly from GPU acceleration [1]. However, GPU memory constraints pose a tough challenge. To achieve scalability, many existing GPU-accelerated database systems, as well as our approach, implement a hierarchical storage architecture that extends the available GPU memory with other storage resources. Because external memory resources cannot be accessed as efficiently as native GPU memory [27], the data exchange performance with the host system is emphasized in such systems [23]. To achieve peak performance, overlapped query processing must be used in conjunction with efficient communication patterns and data placement strategies that are actively refined at runtime [1,2,20].
Constructing a parameter server for a machine learning platform poses many challenges [2, 5, 7, 12-14, 16, 17, 20, 22, 42]. When designing mixed GPU/CPU-based architectures for inference production environments, at least two major bottlenecks must be overcome: (1) High latency due to DRAM bandwidth limitations when communicating between CPU and GPU [18,40]. (2) Deployment latency due to growing model size and complexity induced by online training, because fast-paced incremental model updates pose a great challenge with respect to data consistency and bandwidth. To address these bottlenecks, our HPS is specifically tailored for deployment as an inference parameter server for large-scale recommendation models on GPUs. It handles the data synchronization and communication to share model parameters (embedding tables) across different inference nodes [26], and performs various optimizations to improve GPU utilization during parallel multi-model/multi-GPU inference, including the organization of the distributed embedding table into partitions [25], GPU-friendly caching [30], and an asynchronous data movement mechanism [29].

HIERARCHICAL PARAMETER SERVER
Our Hierarchical Parameter Server (HPS) allows HugeCTR to use models with huge embedding tables for inference. This is achieved through extending the embedding storage space beyond the constraints of GPUs using CPU memory resources from across the cluster. The design target of the HPS is to address the three challenges that traditional CPU parameter server approaches typically suffer from most: (1) Downloading/streaming of model parameters from the centrally maintained embedding table partitions in CPU memory to the model instances on individual GPU compute devices. This issue is magnified if the embedding table cannot fit entirely into the GPU memory. HPS greatly alleviates this problem through a GPU caching mechanism that takes advantage of the locality of the data distribution. (2) Increased deployment cost caused by high-availability requirements of inference platforms and bandwidth limitations. By jointly organizing and using the distributed CPU memories of the inference cluster, the HPS saves resources and realizes immediate online model updating (i.e., training-to-inference updates). (3) Parameter update and refresh between the GPU cache and the parameter server. This is particularly challenging if only a part of the model is loaded into GPU memory, so that parameters are missed on the GPU during lookup. HPS handles additional parameter exchanges between the CPU and GPU using an asynchronous insertion and refreshing mechanism to maintain parameter consistency.

Storage Architecture
Our HPS is implemented as a 3-level hierarchical memory architecture (cf. Figure 3) that utilizes GPU GDDR and/or high-bandwidth memory (HBM), distributed CPU memory, and local SSD storage resources. The communication mechanisms between these components ensure that the most frequently used embeddings reside in the GPU embedding cache. Somewhat frequently used embeddings are cached in CPU memory, while a full copy of all model parameters, including those that rarely occur, is always kept available on the hard disk/SSD. To minimize delays, we overlap parameter updating and the migration of missing parameters from higher storage levels (SSD → CPU memory → GPU memory) with the dense model computation. The three memory architecture levels of the HPS are defined as follows:

GPU embedding cache (level 1). This is a dynamic cache designed for recommendation model inference. It attempts to improve the lookup performance for embeddings by reducing additional/repetitive parameter movements through cleverly utilizing data locality to keep frequently used features (i.e., the hot features) in the GPU memory. The GPU cache supports several operators (see Section 4), as well as a dynamic insertion and an asynchronous refresh mechanism (see Section 6) to retain a high cache hit rate.
Parameter partitions (level 2) store a partial copy of the embedding parameters in CPU memory. They act as an extension to the GPU embedding cache, and are queried if an embedding is required that is currently not present in the cache. Practitioners can choose between stand-alone deployment and cluster deployment, depending on their application scenario. In stand-alone deployment, the partitions are placed either in an optimized parallel hash-map (server-less deployment) or in a local Redis instance. Distributed deployments can make use of multi-node Redis configurations. The contents of each partition are asynchronously adjusted in response to the queries processed by all inference nodes of a deployment. To receive online updates, parameter partitions can subscribe to topics from a distributed event stream.
Parameter replications (level 3). To ensure fault-tolerance, HPS retains a full copy of all model parameters (i.e., a model replica) in a disk-based RocksDB key-value store in each inference node. This fallback storage is accessed if a lookup request towards the corresponding parameter partitions fails. Thus, given enough time budget, an HPS deployment is always able to produce a full answer to every query. To stay up to date, each node separately monitors the distributed event stream and applies online updates at its own pace.
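The fallback behaviour across the three levels can be sketched as follows (a minimal, dict-backed stand-in for the real storage layers; the class and method names are hypothetical, not HPS API):

```python
class TieredLookup:
    """Cascade: GPU embedding cache -> CPU parameter partitions -> disk replica."""

    def __init__(self, gpu_cache, cpu_partitions, disk_replica):
        self.levels = [gpu_cache, cpu_partitions, disk_replica]

    def lookup(self, keys):
        result, missing = {}, list(keys)
        for level in self.levels:            # try each level in order of speed
            still_missing = []
            for k in missing:
                if k in level:
                    result[k] = level[k]
                else:
                    still_missing.append(k)
            missing = still_missing
            if not missing:
                break
        return result, missing               # non-empty only if a key does not exist at all

hps = TieredLookup({1: "g1"}, {2: "c2"}, {1: "d1", 2: "d2", 3: "d3"})
found, missing = hps.lookup([1, 2, 3, 4])
print(found, missing)  # {1: 'g1', 2: 'c2', 3: 'd3'} [4]
```

Because level 3 holds a full replica, a real deployment answers every query for existing keys; in the sketch, only the nonexistent key 4 remains missing.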

INFERENCE GPU EMBEDDING CACHE
When processing online inference workloads, it is usually not possible to know which embedding table subsets will be required next. Therefore, our GPU embedding cache is designed as a general-purpose dynamic cache, which can accept new embeddings by evicting old ones.

Cache Data Model
The GPU embedding cache consists of a 3-level hierarchical structure, as shown in Figure 4: slots, slabs and slabsets.
Slots represent the GPU embedding cache's basic storage unit. Each slot contains an embedding key, the associated embedding vector, and an access counter.
Slabs. Modern GPU architectures manage and execute code in warps (groups of 32 threads; [28]). Peak performance can be achieved by writing warp-aware programs. Therefore, we group 32 slots into one slab, so that each warp thread is assigned to a distinct slot. When searching for matching embedding keys, we use warps to linearly probe slabs. To determine if and where a key was found in a slab, we perform register-level intra-warp communications (shuffle, ballot, etc.) to eliminate branch and memory divergences.
Slabsets. Just as cache lines are grouped into cache sets in N-way set-associative caches, slabs are packed into slabsets. To exploit the massively parallel computing power of GPUs, each embedding key is first mapped to a particular slabset, but may then occupy any slot in that slabset. This way, linear probing is confined to a single slabset, without conflicting with independent slabsets. A smaller slabset size can reduce key search latency, but also leads to increasing conflict misses. It is important to find the optimal slabset size to balance these two factors. We empirically set the slabset size to 2 for contemporary NVIDIA GPU architectures such as Ampere. To maximize GPU resource utilization and inference concurrency, inference workers can share the same embedding cache. Race conditions are prevented by only granting a single warp exclusive access to a slabset for particular cache operations, such as query and replace. This approach also implicitly ensures thread safety. Because the total number of slabsets is usually much higher (millions) than the maximum number of concurrent warps per GPU (thousands), the mutual exclusion does not incur significant stalls.
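The slot/slab/slabset data model can be sketched on the CPU as a small set-associative cache (an illustration of the structure only, not the warp-parallel GPU implementation; the sizes and the eviction heuristic here are assumptions):

```python
SLOTS_PER_SLAB, SLABS_PER_SET, NUM_SETS = 4, 2, 8   # a real slab holds 32 slots, one per warp thread

class SlabCache:
    def __init__(self):
        # slot = [key, vector, access_counter]; a None key marks an empty slot
        self.sets = [[[None, None, 0] for _ in range(SLABS_PER_SET * SLOTS_PER_SLAB)]
                     for _ in range(NUM_SETS)]

    def _set_of(self, key):
        return hash(key) % NUM_SETS                  # key is pinned to one slabset

    def query(self, key):
        for slot in self.sets[self._set_of(key)]:    # linear probing within the slabset
            if slot[0] == key:
                slot[2] += 1                         # bump the access counter
                return slot[1]
        return None                                  # miss

    def insert(self, key, vec):
        slots = self.sets[self._set_of(key)]
        # prefer an empty slot; otherwise replace the least-accessed one
        victim = min(slots, key=lambda s: (s[0] is not None, s[2]))
        victim[:] = [key, vec, 0]

cache = SlabCache()
cache.insert(42, [0.1, 0.2])
print(cache.query(42), cache.query(7))  # [0.1, 0.2] None
```

On the GPU, the probing loop is executed by all 32 threads of a warp at once, with ballot/shuffle intrinsics replacing the sequential scan.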

GPU Embedding Cache API
The GPU embedding cache supports four APIs: • Query (Algorithm 2) retrieves embedding vectors for sets of embedding keys. Missing keys are returned as a list that can be used to attempt fetching these embeddings from the parameter partitions. Query, Replace and Update share the same core algorithm (cf. Algorithms 2, 3 and 4). For each key, the assigned processing warp first locates the slabset that contains the key using a hash function.

Then it linearly probes the slabs within this slabset to either find the matching key-slot, or determine an empty/replaceable slot for insertion (replace & update only). The Dump API is trivial in that it simply copies all keys currently in the cache to the CPU memory. All APIs launch CUDA kernels that are executed asynchronously, i.e., control flow is immediately returned to the CPU. Because they are thread-safe at the slabset level (see Section 4.1), concurrent invocation of all APIs is permissible. To avoid frequent CUDA kernel launches and improve GPU resource utilization, all APIs accept mini-batches as input. The respective input keys are fairly distributed to warps, and pushed into a warp work queue.

Embedding Insertion
For failed lookups (i.e., the key is currently not present in the GPU embedding cache), a cache insertion operation is triggered to fetch the missed embeddings from the parameter partitions in the CPU memory or a replica on a local SSD. As shown in Algorithm 1, the HPS has two insertion modes, between which the GPU embedding cache switches based on the relation between the current cache hit rate and a user-defined hit rate threshold:

Asynchronous insertion is activated if the cache hit rate is higher than the predefined threshold. For any missing keys, default embedding vectors, whose values are user-configurable, are returned immediately. The actual embeddings are fetched asynchronously from higher-level storage into the GPU embedding cache to have them available for future queries. This lazy insertion mechanism ensures that the prediction accuracy loss is negligible at a high hit rate.
Synchronous insertion blocks the rest of the pipeline until the missed embeddings have been fetched.With a reasonable threshold, synchronous insertion usually occurs only during the warm-up stage, or after model updates.
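The hit-rate-based mode switch can be sketched as follows (names, the threshold value, and the default-vector shape are illustrative assumptions; this is not HugeCTR code):

```python
HIT_RATE_THRESHOLD = 0.9          # user-defined; illustrative value
DEFAULT_VECTOR = [0.0] * 4        # user-configurable default embedding

def handle_misses(missing_keys, hit_rate, fetch_async, fetch_sync):
    if hit_rate >= HIT_RATE_THRESHOLD:
        fetch_async(missing_keys)                          # lazy: insert in the background
        return {k: DEFAULT_VECTOR for k in missing_keys}   # answer immediately with defaults
    return fetch_sync(missing_keys)                        # block until embeddings arrive

fetched = []
out = handle_misses([5, 9], hit_rate=0.95,
                    fetch_async=fetched.extend,
                    fetch_sync=lambda ks: {k: [1.0] * 4 for k in ks})
print(out[5], fetched)  # [0.0, 0.0, 0.0, 0.0] [5, 9]
```

At a 95% hit rate, only 5% of keys receive the (slightly stale) default answer, which is why the accuracy impact of the lazy path stays negligible.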

CPU MEMORY AND SSD STORAGE LAYERS
To process models that scale beyond GPU memory capacity, in addition to the GPU embedding cache (Section 4), the HPS incorporates two additional layers in its storage hierarchy. These layers are constructed based on either system memory, SSDs or network storage, and are highly modularized to support various backend implementations.
Volatile database (VDB) layers (level 2 in Figure 3) reside in volatile memory, such as system memory, which a GPU accesses via NVLink or the PCIe bus. In comparison to GPU memory, system memory can be extended at lower cost. To grow even further, VDBs can take advantage of multiple low-latency system memories in an inference cluster. For example, using our RedisClusterBackend VDB template implementation, users can use distributed Redis instances as a storage backend for embeddings. Thus, VDB implementations can, but do not have to, be limited to machine boundaries. To distribute the workload, VDBs organize embedding table storage in partitions. Partitions are non-overlapping subsets of an embedding table that are stored in the same physical location. They are sparsely populated in response to the inference queries processed by all nodes that share VDB access. The maximum size (= overflow margin) and number of partitions per embedding table are configurable, and subject to a trade-off: more, smaller partitions allow for smoother load balancing, but each partition adds a small processing overhead.
VDBs are operated as an asynchronous cache. If a GPU embedding cache reports missing keys, the HPS queries the VDB next. Analogous to the embedding cache, each VDB entry contains a timestamp indicating when the entry was last accessed. For embedding vectors that were successfully retrieved, the VDB asynchronously updates this timestamp after returning the result. Missed embedding vectors are scheduled for insertion into the VDB to accelerate potential future queries. Thereby, the partition assignment of each embedding is fixed and determined by the XXH64 hash value [4] of its key. Insertions happen asynchronously so as not to stall pending lookup processes, and subsequently fill up the VDB partitions. Per-partition eviction policies determine what should be done if a partition exceeds its overflow margin. We implement multiple eviction policies. For example, the evict-oldest policy finds and prunes infrequently accessed keys.
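Partition assignment and overflow handling can be sketched as follows (dict-backed stand-ins; Python's built-in `hash` and a counter stand in for XXH64 and the real access timestamps, and the partition count and margin are illustrative):

```python
import itertools

NUM_PARTITIONS, OVERFLOW_MARGIN = 4, 2
partitions = [dict() for _ in range(NUM_PARTITIONS)]    # key -> (vector, last_access)
_clock = itertools.count()                              # monotonically increasing access stamp

def vdb_insert(key, vector):
    part = partitions[hash(key) % NUM_PARTITIONS]       # hash fixes the partition per key
    part[key] = (vector, next(_clock))
    if len(part) > OVERFLOW_MARGIN:                     # overflow margin exceeded:
        oldest = min(part, key=lambda k: part[k][1])    # "evict oldest" prunes the stalest key
        del part[oldest]

for k in (0, 4, 8):                                     # these keys all land in partition 0 here
    vdb_insert(k, [float(k)])
print(sorted(partitions[0]))  # [4, 8]
```

Because a key's partition never changes, all nodes sharing a VDB agree on where each embedding lives without coordination.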
Persistent database (PDB) layers (level 3 in Figure 3) use hard disks or SSDs to permanently store entire embedding tables (i.e., all model parameters). As such, the PDB is helpful to improve the prediction accuracy for datasets that exhibit an extreme long-tail distribution. PDB layers can serve as backup and ultimate ground truth for any number of models. To avoid key collisions, PDB implementations form separate key namespaces for each unique embedding table.
Our template implementation maps embedding tables to column groups in a RocksDB database, stored on a local SSD in each inference node. Hence, the entire model data is replicated in each inference node. This way, we achieve maximum fault tolerance, because node failures will not impair the ability of other inference nodes to fully answer each query. Continued operation is possible even if a failure in a neighbor node brings down an attached Redis VDB. Without the VDB as an intermediate cache, it can of course take somewhat longer until embedding vectors for missed keys are asynchronously migrated into the GPU embedding cache (see also Section 7). However, assuming the GPU embedding cache can retain a high enough hit rate, clients should only witness minor deviations in inference performance.

ONLINE MODEL UPDATING
Thus far, we have described how the HPS organizes resources to enable inference with pretrained models. In Figure 5, we have highlighted this portion of the data-flow graph in red (→). However, there exist many scenarios where recommendations depend on recent information (e.g., user interactions in social networks). After completing a training epoch, incremental updates have to be propagated to all inference nodes for improved recommendations. Our HPS achieves this functionality using a dedicated online updating mechanism.
Volatile & persistent database update. Model training is resource intensive, and therefore conducted by a set of nodes that is distinct from the inference cluster. Training sets for HugeCTR models are split into files that maximize the locality in the embedding cache. The model is trained by sequentially loading these files into the cache and processing the training episodes. Our online updating mechanism wraps around HugeCTR model training. It is designed as an auxiliary process (blue [→] data-flow graph in Figure 5) that can be turned on and off at any point in time.
Once training progress has been made, the training nodes dump their updates to an Apache Kafka-based message buffer [36]. This is done via our Message Producer API, which handles serialization, batching, and the organization of updates into distinct message queues for each embedding table. Inference nodes that have loaded the affected model can use the corresponding Message Source API to discover and subscribe to these message queues. As indicated in Figure 5, separate subscriptions can be created for different VDB partitions. This allows nodes that share a VDB to also share the update workload among them. If a node becomes unresponsive, its current assignment is shifted to other nodes.
Applying online updates inevitably adds overhead. Therefore, we allow updates to be consumed lazily by each node using a background process. The execution of the update process is aligned with other I/O requests. To control and adjust the impact on online inference, users can limit the update ingestion speed and frequency.
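The producer/consumer flow can be sketched with in-memory queues standing in for Kafka topics (function and topic names are illustrative, not the actual Message Producer/Source APIs):

```python
from collections import defaultdict, deque

topics = defaultdict(deque)                     # topic name -> queue of update batches

def produce(table, updates):                    # training side: one queue per embedding table
    topics[f"updates.{table}"].append(list(updates))

def consume(table, store, max_batches=1):       # inference side: drain lazily, rate-limited
    q = topics[f"updates.{table}"]
    for _ in range(min(max_batches, len(q))):   # ingest only a few batches per cycle
        for key, vec in q.popleft():
            store[key] = vec

store = {}
produce("user_emb", [(1, [0.5]), (2, [0.7])])
produce("user_emb", [(3, [0.9])])
consume("user_emb", store, max_batches=1)       # first batch applied, second still queued
print(sorted(store), len(topics["updates.user_emb"]))  # [1, 2] 1
```

The `max_batches` knob mirrors the configurable ingestion speed described above: pending updates simply wait in the queue until the next consumption cycle.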
Through message buffer subscriptions, updates are guaranteed to be in order and complete. Hence, upon fully processing all pending messages (sync), the individual database levels are guaranteed to be consistent (i.e., we guarantee eventual consistency). The lazy manner in which we apply updates implies that slight inconsistencies during model update periods have to be expected. However, in practice this does not matter, because learning rates for model retraining are usually very small. As long as the optimization process is reasonably smooth, the prediction performance should not diminish significantly [2,20]. Note that the same assumption also underpins the working principle of the GPU embedding cache's query API, which returns default embedding values for missed keys if the hit rate criterion is met (see Section 4). However, since no downtime is required to ingest updates, it is possible to achieve continuous model improvement, which makes HPS particularly suitable for use with highly active data sources.
Asynchronous GPU embedding cache refresh. The GPU embedding cache needs to be readily available when an inference request arrives. Ongoing streaming of small updates from message buffers to the GPU embedding cache would create spontaneous GPU load spikes that are hard to predict and could diminish response times. Thus, instead of ingesting updates directly from Kafka, we allow the GPU embedding cache to regularly poll the VDB/PDB for updates and replace embeddings if necessary. The refresh cycle is configurable to best fit the training schedule. When using online training, the GPU embedding cache can be configured to periodically (minutes, hours, etc.) refresh its contents. When using offline training, refreshes are triggered through signals sent by the Triton model management API [9]. Figure 3 illustrates the entire sequence until a model update becomes effective in the GPU embedding cache: (1) Monitor the message stream; dispatch and apply updates to the CPU memory partitions (VDB) and the SSD (PDB). (2) Dump the GPU embedding cache keys in batches (size is configurable) and write them into the dump key buffer. (3) Look up the embedding keys, written to the dump key buffer, from the CPU memory partitions and/or the SSD, and (4) copy the corresponding embedding key-vectors to the queried key-vector buffer. (5) Download the queried key-vector buffer into the GPU device and refresh the GPU embedding cache.
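The refresh cycle's key-dump/lookup/write-back steps can be sketched as follows (dict-backed stand-ins for the cache and the VDB/PDB layers; the function name and batch size are illustrative):

```python
def refresh_cache(gpu_cache, backing_store, batch_size=2):
    """Re-read the latest values for all cache-resident keys from backing storage."""
    keys = list(gpu_cache)                       # dump the resident keys
    for i in range(0, len(keys), batch_size):    # process the dump in configurable batches
        batch = keys[i:i + batch_size]
        # look up current values in the VDB/PDB stand-in
        fresh = {k: backing_store[k] for k in batch if k in backing_store}
        gpu_cache.update(fresh)                  # write refreshed vectors back into the cache

cache = {1: "old", 2: "old", 3: "old"}
refresh_cache(cache, {1: "new", 2: "new", 3: "new"})
print(cache[1], cache[3])  # new new
```

Note that the refresh only touches keys already resident in the cache; newly hot keys still enter through the insertion path of Section 4.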

PERFORMANCE EVALUATION
In this section, we showcase the performance of our HPS from several aspects, including end-to-end inference throughput and latency. Further, we provide empirical analyses of the GPU embedding cache and different database backends, and investigate the impact of online updates on HPS performance.

Experiment Setup
Unless specified otherwise, all experiments are carried out on a cluster consisting of NVIDIA DGX A100 [32] nodes.Each node is equipped with two AMD EPYC 7742 CPUs, 2 TB of CPU memory, eight NVSwitch-interconnected NVIDIA A100 GPUs with 80 GB GPU memory each, and eight Mellanox CX6 InfiniBand adapters for inter-node communication.
To demonstrate the HPS's capabilities, we use two publicly available and two synthetically generated datasets.For the public datasets, we trained a DLRM model to obtain an embedding table which is then used for inference.
Criteo 1 TB [6] is a large publicly available log of user click behavior in response to ads, containing 13 dense features and 26 sparse features.

MovieLens [15] is a small publicly available dataset containing movie recommendations. It contains 3 sparse features, of which one is a multi-hot feature. The embedding table is only ~20 MB large (embedding vector size = 128).
Synthetic dataset A mimics the properties of the Criteo 1 TB dataset. However, the final embedding table amounts to 650 GB. In lieu of generating a huge training dataset to obtain the embedding, we generate the embedding first by randomly creating embedding vectors of size 128. Then, we use the resulting key range to generate an inference request dataset by randomly drawing keys from a power-law distribution with α = 1.2 (see Section 2.2). In the resulting inference requests, about 95% of the embedding table lookups reference 10% of the embedding table.
Synthetic dataset B is created in a similar way as Synthetic dataset A, but contains 9 dense features and 130 sparse features. Further, we decreased the number of unique keys in each sparse feature so that the embedding table size becomes 81 GB (i.e., close to the size of the embedding table for the Criteo 1 TB dataset).

Inference performance
7.2.1 Single-GPU single-instance deployment on Triton. In this section, we evaluate the performance of HPS running on top of an NVIDIA Triton Inference Server [31], in comparison with a PyTorch CPU implementation. To measure inference performance, we utilize Triton's performance analyzer [11]. For all datasets, the size of the DLRM model's trained dense weights is at most 10 MB, which can be easily loaded into either CPU or GPU memory.
Table 1 lists the configurations for the GPU embedding cache and the VDB. Frequent embeddings are kept in GPU memory. Since our test system has 2 TB of CPU memory, we can increase the VDB capacity to fully cache the embedding table. In contrast, our PyTorch baseline keeps both the embedding table and the dense weights in CPU memory. Figure 6a compares the end-to-end inference performance of our HPS and PyTorch CPU [33] on the Criteo 1 TB dataset. We measure latency and throughput while varying the batch size from 32 to 131,072. HPS significantly outperforms PyTorch CPU in terms of average latency per batch. Because GPU compute and memory resources can be exploited better, larger batch sizes lead to higher speedups. At the maximum batch size of 131,072, a 62x speedup is achieved. The throughput ranges from 2.4 million samples per second (batch size = 1,024) to 6.4 million samples per second (batch size = 131,072). In contrast, PyTorch CPU delivers at most 0.2 million samples per second (batch size = 2,048). It is also worth noting that HPS has a 2.35x throughput advantage over a TensorFlow GPU inference solution [34] at batch size = 2,048 (1.43 million samples per second; model size = 15.6 GB).
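Per-batch latency and throughput are linked by throughput = batch_size / latency. A small conversion helper (our own, not part of Triton's performance analyzer) makes the reported figures easy to cross-check:

```python
def throughput_samples_per_s(batch_size, latency_ms):
    """Convert average per-batch latency (ms) into samples-per-second throughput."""
    return batch_size / (latency_ms / 1000.0)

def latency_ms_from_throughput(batch_size, samples_per_s):
    """Inverse conversion: derive per-batch latency from measured throughput."""
    return batch_size / samples_per_s * 1000.0

# E.g., 6.4M samples/s at batch size 131,072 implies ~20.5 ms per batch.
lat = latency_ms_from_throughput(131_072, 6_400_000)
```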
For Synthetic dataset A, we keep all settings the same except the GPU cache percentage, which we lowered to 5%, so that up to 32.5 GB of embeddings reside in GPU memory. Results are shown in Figure 6b. As with Criteo, the throughput increases with the batch size and saturates around 131,072, while the latency remains stable.

7.2.2 Multi-GPU multi-instance deployment on Triton.
To demonstrate how HPS can take advantage of multi-GPU environments, we set the batch size to 1,024 and test with both Criteo 1 TB and Synthetic dataset A. We program the Triton Inference Server to evenly distribute inference instances while varying the number of GPUs [10]. Figure 8 shows the resulting average QPS. With a single GPU, the QPS improves as up to 4 instances are deployed. This is due to enhanced GPU resource utilization from sharing the GPU embedding cache concurrently. Beyond 4 instances, increased resource contention degrades the QPS. Contention can be amortized by deploying the same number of instances on more GPUs. Consequently, the highest QPS (7.2x speedup) is achieved when deploying 8 model instances on 8 GPUs, so that each GPU has its own embedding cache. Note that they may still share VDB parameter partitions. To summarize, when deploying multiple instances, combining per-GPU scale-up with multi-GPU scale-out maximizes the QPS.

7.2.3 Warm-up and stable stage performance of the GPU embedding cache. To achieve stable performance, the HPS has to pass the warm-up stage, during which hot embeddings are fetched into the GPU embedding cache. During the warm-up stage, the hit rate keeps increasing until the cache is fully occupied. The hit rate threshold controls whether cache updates are applied in synchronous or asynchronous mode (see Section 4). First, we study the GPU embedding cache's behavior during the warm-up stage. Figures 7a (Criteo 1 TB) and 7b (Synthetic dataset A) respectively show how the hit rate and inference latency change as the inference session progresses (batch size = 1,024). With a hit rate threshold of 0.0, the inference latency stabilizes very quickly because the cache is always updated asynchronously (i.e., lazily). When setting the threshold to 1.0, the stabilization period is much longer. The overall latency is higher because cache updates block the inference pipeline. With a hit rate threshold of 0.5, the latency is at first relatively high because of blocking updates. Once the hit rate threshold is met, the latency drops and flattens. In other words, properly setting the hit rate threshold allows taking advantage of both blocking and asynchronous updates to balance latency and hit rate.
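The threshold mechanism can be illustrated with a toy model. Class and method names here are our own illustrative choices, not the actual HPS API:

```python
class CacheUpdatePolicy:
    """Toy model of the hit-rate-threshold mechanism: below the threshold,
    missing embeddings are inserted synchronously (blocking the pipeline);
    above it, insertions happen asynchronously (lazily) in the background."""

    def __init__(self, hit_rate_threshold):
        self.hit_rate_threshold = hit_rate_threshold
        self.hits = 0
        self.lookups = 0

    def observe_batch(self, batch_keys, cached_keys):
        """Record cache hits and misses for one inference batch."""
        for key in batch_keys:
            self.lookups += 1
            if key in cached_keys:
                self.hits += 1

    def update_mode(self):
        hit_rate = self.hits / self.lookups if self.lookups else 0.0
        return "synchronous" if hit_rate < self.hit_rate_threshold else "asynchronous"

# During warm-up the cache is cold, so updates block the pipeline; once the
# observed hit rate crosses the threshold, updates are applied asynchronously.
policy = CacheUpdatePolicy(hit_rate_threshold=0.5)
policy.observe_batch([1, 2, 3, 4], cached_keys={1})  # 25% hit rate -> blocking
mode_cold = policy.update_mode()
policy.observe_batch([1] * 12, cached_keys={1})      # hit rate rises to ~81%
mode_warm = policy.update_mode()
```

A threshold of 0.0 makes every update asynchronous from the start, while 1.0 keeps updates blocking indefinitely, matching the two extremes discussed above.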
Next, we analyze how the latency develops as the cache enters the stable stage, where its hit rate saturates, using the synthetic inference request data. For this experiment, the batch size and hit rate threshold are fixed at 1,024 and 1.0, respectively. Results are shown in Figure 7c. We measured with GPU cache percentage ratios of 1% and 5%. Note that cutting the cache percentage to a fifth (5% → 1%) degrades the saturated hit rate from 76% to 70%, but leads only to a 5% increase in mean latency. Thus, due to the HPS's effective exploitation of the skew in the request data, high performance can be retained without having the embedding cache occupy too much GPU memory. The distribution of the input request data also affects the cache performance. Using input request data with highly amplified locality that we generated just for this experiment (dlrm_synthetic in Figure 7c) and a GPU cache percentage ratio of 5%, the hit rate eventually saturates at 100%. The latency improves accordingly. As the hit rate surpasses 90%, the latency is already 23% lower than with the unaltered Synthetic dataset A.

7.2.4 End-to-end inference performance. To verify its overall effectiveness, we study the end-to-end inference throughput of the HPS when combining differently configured storage layers (see Section 5). Results for the Criteo 1 TB dataset are presented in Figure 10a. We can draw the following conclusions: (1) HPS provides better inference performance with large batch sizes. (2) Unsurprisingly, the best performance is obtained by caching the entire embedding table. However, HPS still attains comparable performance figures for large models that cannot fit into memory. (3) Reducing the GPU cache ratio (20% → 10%) while increasing the maximum size of the Redis VDB (40% → 45%) yields better inference performance. Thus, the VDB as a 2nd-level cache can lower the pressure on the GPU cache, while the HPS-aware GPU cache update mechanism ensures a high hit rate and greatly reduces the inference latency.
The advantages of a sophisticated hierarchical storage architecture become more pronounced as the model size increases. Figure 10b shows that granting the HPS just 5% more CPU memory resources leads to a 1.24x end-to-end throughput increase with the much larger Synthetic dataset A.

7.2.5 Performance and accuracy comparison.
To determine the influence of the hardware on HPS performance, we measure the stable-stage inference latency with different datasets and batch sizes on an NVIDIA T4 (16 GB memory), an A30 (24 GB memory), and an A100 GPU (80 GB memory). To allow a fair comparison despite the limited memory of the T4 GPU, we set the cache percentage and hit rate threshold to 10% and 1.0, respectively, for all GPUs. Hence, for the same dataset, the different GPUs stabilize at the same hit rate throughout Figure 11. For the comparatively tiny MovieLens dataset (Figure 11a), we achieve an inference latency of less than 1 ms, with the hit rate saturating at ~98.5%. With Criteo 1 TB (Figure 11b) and Synthetic dataset B (Figure 11c), the hit rate gradually decreases as the batch size increases. Synthetic dataset B simulates a recommendation task where more categorical features are used. Larger batch sizes lead to increased computational overhead in the dense layers and higher overall inference latency. Thus, although their embedding tables are similarly sized, the latency figures for Synthetic dataset B are much higher than those for the Criteo 1 TB dataset. Because large recommendation models often do not fit into GPU memory, most inference frameworks only load the dense model part into GPU memory. In contrast, HPS can deploy and accelerate such models on the GPU through its GPU cache mechanism. When using small and medium batch sizes with HPS, the inference latencies for T4 and A30 GPUs are on par with an A100 GPU, demonstrating that HPS is a scalable inference solution for recommender systems. To complete our investigation, we present the prediction accuracy, i.e., the fraction of correctly predicted samples, when varying the cache hit rate for the Criteo 1 TB dataset in Figure 9.
Here, the curves for the three different hit rate thresholds overlap almost perfectly, while their cache hit rates stabilize above 0.9. This strongly implies that the GPU embedding cache retains hot embeddings well, even when asynchronous insertion is employed.

7.3 Online update performance
Our updating mechanism consists of three major components (cf. Section 6): (1) dumping the model on the training nodes, (2) update ingestion by the inference nodes, and (3) the embedding cache refresh operation. Because model dumps are done in isolation from the overall HPS inference system, we focus only on the remaining two components.
7.3.1 Update ingestion mechanism. The delivery rate of model parameter updates depends heavily on the configuration of the intermediate Kafka message buffer storage and its network connectivity. Of the latter, we have plenty (cf. Section 7.1). Therefore, we concentrate on the receiving VDB/PDB instances. In Table 2, we report the asynchronous random batch insertion (batch size = 128 MB) speed limits for Synthetic dataset A in our test environment. The insertion speed slowly declines as the model size increases due to storage management overheads.
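The batched insertion can be sketched as a simple grouping step. `batch_updates` is our own illustrative helper; the real ingestion path goes through Kafka and the VDB/PDB backends:

```python
def batch_updates(updates, value_bytes, max_batch_bytes=128 * 1024 * 1024):
    """Group (key, embedding) update pairs into batches whose payload stays
    at or below max_batch_bytes (128 MB, as in the experiment above)."""
    batches, current, current_bytes = [], [], 0
    for update in updates:
        if current and current_bytes + value_bytes > max_batch_bytes:
            batches.append(current)  # flush the full batch before overflowing
            current, current_bytes = [], 0
        current.append(update)
        current_bytes += value_bytes
    if current:
        batches.append(current)
    return batches

# Example with small numbers: 128-byte embeddings, 1 KB batch limit
# -> 8 updates per batch, so 20 updates split into batches of 8, 8, and 4.
batches = batch_updates(list(range(20)), value_bytes=128, max_batch_bytes=1024)
```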
7.3.2 GPU embedding cache refresh. In Table 3, we analyze the embedding cache refresh performance with different cache capacities, and show how the refresh latency scales with capacity. Note how the overhead of actually dumping the embedding keys scales with the cache size, but is almost negligible in comparison to the subsequent update operation. The throughput remains stable at around 199 GB/s, so it takes only about 200 ms to refresh a 40 GB embedding cache.
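The refresh-time figures follow directly from the measured throughput. A back-of-the-envelope helper (our own, not part of HPS) makes the relation explicit:

```python
def refresh_time_ms(cache_size_gb, throughput_gb_per_s=199.0):
    """Approximate cache refresh time from the ~199 GB/s throughput above."""
    return cache_size_gb / throughput_gb_per_s * 1000.0

# A 40 GB embedding cache refreshes in roughly 200 ms at 199 GB/s.
t = refresh_time_ms(40.0)
```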
To summarize, online updates have only a minor impact on the overall inference performance. This is because the VDB/PDB update operations happen lazily and infrequently, and subsequent cache refreshes complete near-instantaneously.

CONCLUSION
In this paper, we presented and analyzed HPS, an efficient GPU-enabled hierarchical parameter server for building large-scale model inference services. Our high-performance GPU embedding cache exploits the typical properties of recommendation datasets to improve inference throughput. By extending this GPU embedding cache with other cluster storage resources (VDB & PDB), HPS can efficiently process queries for very large models. Through its asynchronous update mechanisms, HPS ensures that its GPU embedding cache retains a high hit rate over time.
Our experiments show that the HugeCTR HPS can reduce end-to-end model inference latency by 5~62x in comparison with PyTorch CPU. Furthermore, HPS offers excellent scaling and performance across different GPUs.
As for future work, we intend to continue extending Merlin HugeCTR and HPS with additional features, including better support for next-generation GPU technologies and further performance optimizations, such as relaxing the locking constraints that protect the embedding cache data and ensure thread safety.

Figure 7: Inference latency during warm-up and stable stage.

Figure 10: HPS end-to-end inference throughput. Redis refers to a VDB with 40 storage partitions spread across a 3-node Redis cluster.

Figure 11: Comparison of HPS performance with different NVIDIA GPUs.

Table 2: Volatile and persistent database random insertion.