AuRORA: Virtualized Accelerator Orchestration for Multi-Tenant Workloads

With the widespread adoption of deep neural networks (DNNs) across applications, there is a growing demand for DNN deployment solutions that can seamlessly support multi-tenant execution, i.e., simultaneously running multiple DNN workloads on heterogeneous architectures with domain-specific accelerators. However, existing accelerator interfaces directly bind the accelerator's physical resources to user threads, without an efficient mechanism to adaptively re-partition available resources. This leads to high programming complexity and performance overheads due to sub-optimal resource allocation, making scalable many-accelerator deployment impractical. To address this challenge, we propose AuRORA, a novel accelerator integration methodology that enables scalable accelerator deployment for multi-tenant workloads. In particular, AuRORA supports virtualized accelerator orchestration by co-designing the hardware-software stack of accelerators to adaptively bind current workloads onto available accelerators. We demonstrate that AuRORA achieves 2.02× higher overall SLA satisfaction, 1.33× higher overall system throughput, and 1.34× higher overall fairness compared to existing accelerator integration solutions, with less than 2.7% area overhead.

CCS CONCEPTS
• Computer systems organization → Multicore architectures; Distributed architectures; Neural networks; • Hardware → Communication hardware, interfaces and storage; Application-specific VLSI designs.


INTRODUCTION
With the slowdown in technology scaling, architects have turned to heterogeneous multi-core, many-accelerator system-on-chips (SoCs) to meet the increasing compute demands of modern workloads [25]. One particular class of applications that drives the development of many-accelerator systems is deep neural networks (DNNs). Specifically, the concurrent multi-tenant execution of DNN applications, where multiple DNN workloads are co-located on the same SoC, has become crucial for both the cloud [44,46,49,51,59] and edge devices [22,27,35] to meet stringent throughput and latency service-level agreements (SLAs). Previous research has underscored the importance of spatially co-locating DNN workload executions to improve the quality of service (QoS) [21,31,40].
However, performance variability due to contention for shared hardware resources presents a substantial challenge for these workloads. More specifically, multi-tenant systems require a flexible and efficient mechanism to dynamically partition shared resources based on application requirements and available resources. While shared-resource management for multi-core processors has been a well-studied area in computer architecture, less attention has been paid to the accelerator interface, i.e., how accelerators interact with CPUs and the system stack.
In particular, existing accelerator integration approaches restrict options for run-time accelerator management, as workloads or threads are explicitly bound to physical accelerators [12,31,40] or subarrays [21,36]. Challenges arise when the system load of the application is unknown prior to execution or when the application runs complex cascaded pipelines [27,32,35]. In these scenarios, kernel drivers must either explicitly preempt [12,21] user threads, leading to high thread migration cost, or wait for user threads to complete and release their resources [31,40], resulting in suboptimal resource partitioning and utilization. New methods have been proposed to develop virtualized interfaces aimed at reducing kernel driver overhead in accelerator deployment [14,47]. However, these approaches primarily focus on queue-based, first-come-first-serve accelerator scheduling, lacking the capability for user threads to dynamically re-partition accelerators during runtime.
To address these challenges, this work presents AuRORA, a full-stack methodology for integrating accelerators in a scalable manner for multi-tenant execution. AuRORA consists of ReRoCC (Remote RoCC), a virtualized and disaggregated accelerator integration interface for many-accelerator integration, and a runtime system for adaptive accelerator management. Similar to virtual memory systems that provide an abstraction between user memory and physical machine resources, AuRORA provides an abstraction between the user's view of accelerators and the physical accelerator instances. AuRORA's virtualized interface allows workloads to be flexibly orchestrated to available accelerators based on their latency requirements, regardless of where accelerators are physically located. This is particularly crucial for multi-tenant execution since resources must be dynamically reallocated to meet the distinct demands of concurrent workloads. To effectively support virtualized accelerator orchestration, AuRORA delivers a full-stack solution that co-designs the hardware and software layers, as shown in Figure 1, with the goal of delivering scalable performance for heterogeneous systems with multiple accelerators. Specifically, the AuRORA stack, from bottom to top, includes:
• A low-overhead shim microarchitecture to interface between cores and accelerators,
• A hardware messaging protocol between CPUs and accelerators to enable scalable and virtualized accelerator deployment on SoCs,
• ISA extensions to allow user threads to interact with the AuRORA hardware in a programmable fashion, and
• A lightweight software runtime to dynamically reallocate available resources for multi-tenant workloads.

BACKGROUND AND MOTIVATION
This section discusses challenges with running multi-tenant DNN workloads and how existing approaches for accelerator integration are insufficient for addressing these challenges.

Multi-tenant DNN Execution
Multi-tenancy refers to the scenario where multiple tasks share hardware, leading to contention for system resources such as compute and memory. Shared-resource partitioning is a thoroughly explored domain within computer architecture, where novel mechanisms have been proposed to manage multi-core architectures for data centers [11,56,57] and, more recently, GPUs [45].
On the accelerator side, Prema [12] introduced the concept of time-multiplexing a monolithic accelerator across multiple DNN tasks. However, this approach suffers from low hardware utilization for individual DNNs. To address this limitation, spatial co-location of multiple DNN tasks has been proposed, where compute resources [21,34,36,40] or memory resources [29,31] are spatially partitioned across applications. However, all existing multi-tenant accelerators bind workloads to physical accelerators or subarrays explicitly [12,21,36], leading to high performance overhead when migrating workload threads during accelerator resource reallocation. To avoid the thread migration overhead, recent works use coarser-grained scheduling to reduce the frequency of resource reallocation [31,40]. However, such coarse-grained scheduling lacks the ability to respond promptly to dynamic changes in system load.

Physical Accelerator Integration
Table 1 provides a summary of the multi-accelerator integration strategies. We classify existing methods into two main categories: physical integration, where workloads are explicitly mapped onto physical accelerators, and virtual integration, where programmers interact solely with virtualized accelerators, with the workload-to-accelerator binding managed by a separate integration layer.
On the physical accelerator integration side, the existing space can be broadly categorized into two types: tightly-coupled and loosely-coupled. Tightly-coupled accelerators are directly implemented as part of the core datapath in a general-purpose CPU. Examples of standards for CPU-coupled accelerators include the ARM Custom Instruction interface [13], the RoCC RISC-V accelerator interface [4], and the Tensilica Instruction Extension interface [23].
Since tightly-coupled accelerators can directly access the architectural state in the host thread, software support for these accelerators can be provided in the form of low-overhead, userspace-accessible custom instructions, greatly reducing software integration costs. However, tightly-coupled accelerators require expensive host thread migration when adjusting accelerator affinity, as host threads must be migrated to the appropriate control core for the target accelerator [31,40]. We measure the accelerator reallocation overhead of physically integrated accelerators when co-running four applications (ResNet50, AlexNet, GoogLeNet, and BERT-small) and observe 300-700K cycles of overhead when thread migration happens. Furthermore, physical design challenges and limited instruction encoding space prohibit scaling up the number of accelerators integrated into a single general-purpose core.
The other approach is to decouple the accelerator from the core over the SoC interconnect, most commonly by binding the accelerator to memory-mapped control registers [7,26,43]. Attaching accelerators over memory-mapped registers is supported in all standard SoC interconnect protocols, including AMBA protocols [2], TileLink [15], Wishbone [54], and CXL [1]. This allows for scalable accelerator deployment, as many accelerators can be instantiated across a single SoC, each mapped to a unique address range of control registers. Prior work [41] proposed a novel shared-memory management scheme for many-accelerator systems where accelerators are physically integrated with the MMIO interface. However, software support for memory-mapped accelerators is more burdensome, as privileged drivers must make the physical control registers visible to user threads and manage the allocation of accelerators to users, leading to significant performance overhead.

Virtual Accelerator Integration
Cumbersome physical accelerator integration does not scale to many-accelerator systems running multi-tenant workloads, especially when resources need to be frequently reallocated to meet the distinct demands of applications during execution. To improve scalability, recent research has proposed virtualized accelerator integration, which allows user threads to invoke accelerators dynamically without binding workloads to physical accelerators [14,47]. In particular, both works have proposed ISA extensions and microarchitecture mechanisms to dynamically map user threads to accelerators. However, in addition to being closed-source, these efforts only allow non-preemptive resource allocation, i.e., user threads are scheduled onto accelerators in a first-come-first-served fashion using a command queue in hardware. Furthermore, in these works, either the host CPU performs address translation for the accelerator before issuing memory requests to it [47], or the host core handles the accelerator's TLB misses with an OS handler [14]; both cases incur software overhead that prevents the CPU from performing other tasks. Such a simple accelerator allocation approach does not allow dynamic accelerator orchestration, where accelerator resources are flexibly partitioned based on the current demands of concurrent workloads. In particular, dynamic accelerator orchestration through preemptive allocation is required for multi-tenant execution, where multiple tasks share the system resources with different target requirements. To the best of our knowledge, AuRORA is the first work that supports virtualized accelerator integration with dynamic resource allocation for multi-tenant execution.

AURORA ARCHITECTURE
AuRORA is a new full-stack approach to accelerator integration for efficient multi-tenant execution on virtualized accelerators. AuRORA provides the software with an abstraction of virtualized accelerators, where user threads invoke virtual accelerators which are then dynamically mapped to physical accelerators by the AuRORA runtime. The following sections discuss the AuRORA microarchitecture (Section 3.1), hardware protocol (Section 3.2), ISA extensions (Section 3.3), and runtime system (Section 3.4).

AuRORA Microarchitecture
Figure 2 shows the key microarchitecture components of AuRORA, the Client and the Manager, and how they can be seamlessly integrated with existing CPU and accelerator designs.

Client. The AuRORA Client shim integrates with host general-purpose cores to allow communication to and from disaggregated accelerators, while providing the architectural illusion of a tightly-coupled accelerator. Each core tracks which accelerators it has currently reserved using a hardware table in the Client. The Client is implemented as a RoCC accelerator [4], allowing it to be integrated with existing RoCC-compatible cores like Rocket [5] and BOOM [62].

Manager. The AuRORA Manager shim wraps an existing accelerator and facilitates the virtualization and disaggregation of accelerators across the SoC interconnect. The Manager receives AuRORA and accelerator commands from the Client and forwards accelerator commands to the attached accelerator. The Manager also implements a shadow copy of the architectural CSRs used by the accelerator MMU. These CSRs include those that describe the host thread privilege level, memory translation mode, and page table address. A page table walker (PTW), an optional PTW cache, and an L2 TLB provide an architecturally compliant memory-management unit (MMU) to the accelerator. These modules eliminate the need for a user- or supervisor-managed IOMMU, preserving the illusion of a shared MMU between the core and accelerators. To support software-managed QoS, the Manager also implements configurable traffic throttlers, which can be used to set bandwidth limits on accelerator memory traffic. The bandwidth limit is set by writing to a configuration register in the Manager.

AuRORA Hardware Protocol
To support the integration and disaggregation of accelerators at the SoC level, AuRORA connects Clients and Managers with the AuRORA hardware communication protocol. Figure 3 shows the AuRORA hardware messaging protocol between Clients and Managers. A Manager has two states: IDLE and ACQUIRED. When a Client tries to acquire an accelerator, it sends an acquire request to the Manager. If the Manager is in the IDLE state (e.g., Client 0 to Manager 1 in Figure 3), the acquire succeeds, and an acknowledgment (i.e., the granted signal) is sent to the Client. The Client then forwards its own core's configuration registers to the acquired Manager to set up the Manager's MMU as a shadow of the core's. From this point, accelerator instructions issued to the Client are automatically forwarded to the Manager.
However, when the accelerator has already been occupied by another process (e.g., Client 1 to Manager 1), the acquire attempt will fail. If there are other accelerators of the same functionality in the system, the Client can attempt to acquire another accelerator (e.g., Client 1 to Manager 2). For these cases, the software has to configure the AuRORA Client with a set of accelerators that share the same functionality. After the Client has finished using an accelerator, it sends a release message to the Manager (e.g., Client 0 to Manager 0), returning the accelerator's Manager state to IDLE. All these transactions are non-blocking to guarantee forward progress.
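The acquire/release handshake above can be modeled as a small state machine. The following Python sketch is purely illustrative (class and method names such as `Manager.try_acquire` and `Client.acquire_any` are our own, not part of AuRORA's RTL); it captures the IDLE/ACQUIRED states and the non-blocking fallback to another Manager of the same functionality.

```python
IDLE, ACQUIRED = "IDLE", "ACQUIRED"

class Manager:
    """Models a Manager shim wrapping one physical accelerator."""
    def __init__(self, acc_id):
        self.acc_id = acc_id
        self.state = IDLE
        self.owner = None

    def try_acquire(self, client_id):
        # Non-blocking: reply immediately with granted (True) or denied (False).
        if self.state == IDLE:
            self.state, self.owner = ACQUIRED, client_id
            return True
        return False

    def release(self, client_id):
        # Only the owning Client may return the Manager to IDLE.
        if self.owner == client_id:
            self.state, self.owner = IDLE, None

class Client:
    """Models a Client shim configured with same-functionality accelerators."""
    def __init__(self, client_id, candidate_managers):
        self.client_id = client_id
        self.candidates = candidate_managers

    def acquire_any(self):
        # Walk the software-configured candidate set until a Manager grants.
        for m in self.candidates:
            if m.try_acquire(self.client_id):
                return m
        return None
```

In this model, a denied acquire simply moves on to the next candidate Manager, mirroring the Client 1 → Manager 1 → Manager 2 example in Figure 3.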
The AuRORA hardware protocol can be mapped onto various interconnect architectures, including crossbar and network-on-chip (NoC), as shown in Table 2. The AuRORA traffic can share the system interconnect with memory traffic or use a separate interconnect to avoid contention. In particular, AuRORA focuses on the interface between the accelerator and the CPU, which is orthogonal to the SoC interconnect standard that defines how data are transferred in SoCs. Our evaluation uses TileLink [15], an SoC interconnect standard that can provide coherent access across SoCs with a shared global address space, since this is common in many-core/many-accelerator SoCs. AuRORA can also be implemented using other SoC interconnect standards like CXL [1], which enables a global shared memory space between chips for multi-chip integration.

AuRORA ISA Extensions
The AuRORA ISA extensions expose the virtualized and disaggregated accelerator management to software, as specified in Table 3. The acquire and release instructions allow the Client to claim and release accelerators. When claiming an accelerator, the Client encodes a target physical accelerator acc_id in the acquire instruction so that it can be delivered to the target Manager. If the acquire succeeds, the Client assigns a virtual accelerator index acq_id to this accelerator, which is then used for the rest of the runtime. The assign instruction maps an acquired accelerator to an available opcode on its architectural thread. This allows a single architectural thread to acquire more accelerators than the available opcode space would permit. The memrate instruction configures the maximum memory request rate of an accelerator for QoS management.
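As a rough illustration of the software-visible state behind these instructions, the sketch below models a thread's acquired-accelerator table: acquire returns a virtual acq_id, assign maps an acq_id onto a limited opcode space, and memrate records a per-accelerator rate cap. All names here (e.g., `AuroraThreadState`) are hypothetical; in AuRORA the state lives in the Client hardware table, not in software.

```python
class AuroraThreadState:
    """Illustrative model of per-thread state behind acquire/assign/memrate."""
    def __init__(self, num_opcodes=4):
        self.acquired = {}      # acq_id -> physical acc_id
        self.opcode_map = {}    # opcode -> acq_id (opcode space is small)
        self.mem_rate = {}      # acq_id -> max memory request rate
        self.next_acq_id = 0
        self.num_opcodes = num_opcodes

    def acquire(self, acc_id, granted):
        # 'granted' stands in for the Manager's granted/denied reply.
        if not granted:
            return None
        acq_id = self.next_acq_id
        self.next_acq_id += 1
        self.acquired[acq_id] = acc_id
        return acq_id

    def assign(self, acq_id, opcode):
        # Map an acquired accelerator onto one of the thread's opcodes;
        # the thread may hold more accelerators than it has opcodes.
        assert acq_id in self.acquired and opcode < self.num_opcodes
        self.opcode_map[opcode] = acq_id

    def memrate(self, acq_id, rate):
        # Record the QoS memory-rate cap forwarded to the Manager.
        self.mem_rate[acq_id] = rate

    def release(self, acq_id):
        self.acquired.pop(acq_id, None)
```

Re-running assign on the same opcode remaps it, which is how a thread time-shares a small opcode space across many acquired accelerators.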

AuRORA Runtime
The AuRORA runtime offers mechanisms for provisioning and releasing accelerators using the ISA extensions introduced earlier. Specifically, this runtime operates within userspace software, utilizing the userspace-accessible custom AuRORA instructions to adaptively partition available resources for multi-tenant execution. The runtime system is designed to be lightweight and only needs to be invoked when acquiring, configuring, or releasing an accelerator. Furthermore, the AuRORA runtime maintains backward compatibility with existing RoCC-based [4] accelerator software stacks. To improve the performance of multi-tenant applications, the AuRORA runtime provides support for two key contention-aware partitioning mechanisms: compute-resource allocation and memory-resource allocation. Compute-resource allocation dynamically partitions different numbers of accelerators across tasks considering the NUMA effect, while memory-resource allocation adaptively reconfigures the available memory bandwidth across accelerators. Unlike prior works where the scheduler explicitly encodes the physical accelerator and the number of accelerators when scheduling tasks [12,21,31,40], the AuRORA runtime manages virtualized accelerator resources and dynamically partitions them during runtime. As a result, a user application only needs to specify its latency target, simplifying its interaction with the AuRORA runtime.

Compute-resource allocation.
The AuRORA runtime dynamically re-partitions compute resources based on latency targets and available compute resources. Figure 4 describes how the runtime operates to allocate compute resources. The runtime receives an end-to-end DNN network (i.e., a task i) from the task queue and is invoked before the execution of every layer. The LatencyEst module estimates the latency of each task based on its currently acquired accelerators (ACQ_i). Together with the remaining slack to its target deadline, this latency is fed into the calc_score module to calculate the task's dynamic deadline score (ddl_score_i), which indicates the likelihood of meeting the target deadline (a higher score indicates the task is more likely to meet its deadline). The analyzer compares the dynamic ddl_score_i of this task against those of other ongoing tasks (ddl_scores) and decides whether task i requires the release or acquisition of accelerators to meet its performance target while balancing system throughput and fairness. Finally, the runtime notifies the task thread's Client of the changes so that the Client can acquire or release accelerators based on the updated assignment from the AuRORA runtime.
Algorithm 1 further elaborates on this process. Upon invocation, the runtime calculates the dynamic deadline score, ddl_score, of each task based on its slack. The runtime compares the ddl_score_i of the current task with the scores of other concurrently running tasks to determine whether the release or acquisition of accelerators needs to happen and the number of accelerators affected. We use the latency estimation technique from [31], which considers the multi-level memory hierarchy, the number of processing elements, and per-layer compute-to-memory ratios for individual DNN execution, similar to other multi-tenant DNN execution work [12,21].
The Analyze function in the AuRORA runtime compares the score of task i with the scores of other tasks to decide whether task i needs to release its acquired accelerators or acquire other idle ones, based on the relative confidence in meeting the deadline target. If a release is necessary, the runtime releases acquired accelerators so that tasks with tighter deadlines can acquire them. If an acquire is needed, the runtime tries to acquire idle accelerators. All of this happens in user-space code. Thus, unlike prior works [12,21,31,40], AuRORA's accelerator scheduling does not require thread preemption, synchronization, or migration to reallocate accelerators.
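The scoring-and-analysis loop above can be sketched in a few lines. The paper does not give the exact scoring formula or thresholds at this level of detail, so the policy below (score = slack / estimated latency; release when comfortably ahead of schedule while some co-running task is behind) is an illustrative assumption, not Algorithm 1 verbatim.

```python
def ddl_score(slack_cycles, est_latency_cycles):
    # Dynamic deadline score (assumed form): > 1.0 means the task's
    # remaining slack exceeds its estimated latency, i.e., it is on track.
    return slack_cycles / est_latency_cycles

def analyze(my_score, other_scores, num_idle):
    """Return (need_release, need_acquire) for the current task,
    following the relative-confidence policy sketched in the text."""
    behind = my_score < 1.0
    if behind:
        # Behind schedule: try to grab idle accelerators if any exist.
        return (False, num_idle > 0)
    # Ahead of schedule: release only if a co-running task is behind,
    # so tighter-deadline tasks can acquire the freed accelerators.
    any_behind = any(s < 1.0 for s in other_scores)
    return (any_behind, False)
```

Because the decision is made in user space against virtual accelerators, acting on (need_release, need_acquire) is just a sequence of release/acquire instructions, with no thread migration.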

NUMA-aware compute partitioning.
Distributing accelerators and memory across an SoC's network-on-chip (NoC) interconnect inevitably causes non-uniform memory accesses (NUMA) [16,39], which adds to system heterogeneity. Alleviating the challenges of NUMA memory systems has been well-researched in the multi-core domain [9,17,37,39]. Notably, prior work has proposed scheduling by application bandwidth sensitivity as a mechanism to reduce interference in a shared multi-core or multi-accelerator system [16,53]. Previous work has also found that thread migration overhead presents a significant challenge for such NUMA-aware thread scheduling approaches [9].
The AuRORA runtime leverages its virtual accelerator abstraction to enable simple but efficient NUMA optimization. With a NoC-based interconnect, different workloads face varying degrees of NUMA effect depending on the NoC node. To capture the performance slowdown caused by NUMA effects, we build an empirical performance model based on hardware measurements that captures each workload's sensitivity to NUMA. The AuRORA runtime quantifies each task's slowdown caused by the NUMA effect based on its assigned accelerators.
When deciding on new accelerators to acquire, the AuRORA runtime compares the relative NUMA slowdown to co-running tasks across different accelerator assignments and assigns the set of accelerators that causes the lowest relative slowdown for each task. This allows the runtime to allocate resources to the task thread so as to minimize the overall system's latency degradation due to the NUMA effect. In addition, the AuRORA runtime performs an accelerator-swapping optimization before running a layer if there are idle accelerators in the system with a lower relative NUMA slowdown for the task. This swap is implemented as an atomic series of acquire and release. When the system does not exhibit NUMA properties, for example, if the interconnect is configured as a crossbar, the NUMA optimization is not enabled.
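The selection step above reduces to ranking candidate accelerators by the task's empirically measured NUMA slowdown. A minimal sketch, assuming `numa_slowdown[task][accel]` holds the measured slowdown factors (>= 1.0; both the data-structure shape and the function name are our own):

```python
def pick_accelerators(task, idle_accels, numa_slowdown, k):
    """Choose the k idle accelerators with the lowest measured NUMA
    slowdown for `task` (lower factor = closer / less degraded)."""
    ranked = sorted(idle_accels, key=lambda a: numa_slowdown[task][a])
    return ranked[:k]
```

The swap optimization is then just: if an idle accelerator ranks better than one currently held, issue an atomic acquire of the better one followed by a release of the worse one.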

Memory-resource allocation.
AuRORA also supports dynamic memory re-partitioning, as shown in Algorithm 1, Lines 22-23. It dynamically detects system-level interference and sets limits on the memory access rates of accelerators to resolve contention if necessary. AuRORA's memory re-partitioning methodology, with dynamic scoring and run-time contention detection, is implemented similarly to prior work [31]. Upon detection of contention over memory bandwidth, the AuRORA runtime triggers the task's Client to send the memrate instruction to each acquired Manager to configure its memory access rate. Based on the configured value, the Manager limits the memory requests from the target accelerator.
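To make the re-partitioning step concrete, the sketch below caps per-accelerator memory rates when aggregate demand exceeds the available bandwidth, weighting tasks with lower deadline scores (less margin) more heavily. The inverse-score weighting is an illustrative choice on our part; the paper follows [31] and does not specify this exact formula.

```python
def repartition_bandwidth(demands, scores, capacity):
    """demands: task -> measured bandwidth demand; scores: task -> ddl_score.
    Returns task -> memory rate cap (the value a memrate instruction would
    carry). No contention -> demands pass through unchanged."""
    total = sum(demands.values())
    if total <= capacity:
        return dict(demands)
    # Contention: weight inversely by ddl_score so tight-deadline tasks
    # keep a larger share of the available bandwidth.
    weights = {t: 1.0 / scores[t] for t in demands}
    wsum = sum(weights.values())
    return {t: capacity * weights[t] / wsum for t in demands}
```

Each resulting cap would be pushed to the corresponding Manager's traffic throttler via memrate.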

METHODOLOGY
This section details AuRORA's implementation, together with the workloads and metrics used for our evaluation.

AuRORA Implementation
We implement the AuRORA microarchitecture using the Chisel HDL [6] on top of Chipyard [3], an open-source framework for designing and evaluating systems-on-chip. We use Gemmini [20], a systolic-array-based DNN accelerator without multi-tenancy support, as a representative DNN accelerator in our evaluation. Additionally, we implement an AuRORA protocol adapter for the Constellation [61] NoC generator to enable evaluations on systems with a NoC-based interconnect. We evaluate AuRORA's performance on end-to-end DNN workloads using FireSim, a cycle-exact, FPGA-accelerated RTL simulator [30]. Table 4 shows the SoC configuration we use in our evaluations of AuRORA. To demonstrate how AuRORA scales to realistic many-accelerator architectures, we evaluate AuRORA in two different SoC configurations: (1) Crossbar: all components (AuRORA Client, Manager, memory system) are connected to a crossbar. This configuration provides a uniform memory system. (2) NoC: all components are integrated in a 7x4 2D mesh as illustrated in Figure 5. This configuration provides a realistic and scalable NUMA memory system for a many-core/many-accelerator SoC.
We integrate Gemmini, a TPU-style systolic-array accelerator, within an AuRORA Manager and replicate it across ten separate homogeneous Manager accelerator tiles on the same SoC. Each Gemmini accelerator is equipped with a 16x16 weight-stationary systolic array for matrix multiplications and convolutions, with private scratchpad memories to store weights and input/output activations. All the tiles share the memory subsystem, including a shared L2 cache and DRAM.
The AuRORA runtime is implemented in C++ and operates seamlessly on top of a full Linux stack.The runtime uses a lightweight software look-up table for the scoreboard, which manages the compute allocation and memory bandwidth utilization of each application on the Clients.The runtime also implements task queues, which track generated tasks.

Microarchitecture Exploration
This section explores various configurations of AuRORA and demonstrates AuRORA's adaptability to diverse deployment scenarios.

PTW cache/TLB configuration. We sweep the Manager's L2 TLB and private PTW cache to determine the optimal TLB and PTW cache sizes for multi-tenant DNN execution. Figure 6 shows the effects of the Manager's L2 TLB size and PTW cache size on the end-to-end latency of ResNet50 and AlexNet. We generate a single DNN accelerator using Gemmini [20] with the hardware configurations described in Table 4. The latency is normalized to the bare-metal test, where no address translation happens. We notice that ResNet50's performance saturates with a small L2 TLB and PTW cache, reaching the minimum latency with only a 256-entry L2 TLB and a 0.5KB (4 ways, 2 sets) PTW cache. AlexNet, on the other hand, is dominated by fully-connected (FC) layers with frequent TLB misses, where a bigger PTW cache can take advantage of spatial locality to reduce end-to-end latency. For both cases, a 0.5KB PTW cache and a 512-entry L2 TLB are sufficient.

Interconnect configuration. We also evaluate the performance and AuRORA overhead of different interconnect configurations. In particular, we generate three different configurations: (1) Crossbar: all memory traffic and AuRORA hardware protocol traffic are connected with crossbars. (2) Crossbar+NoC: memory traffic is routed via a 4x4 2D mesh NoC, while the AuRORA traffic uses a separate crossbar. (3) Shared NoC: both memory traffic and the AuRORA protocol traffic share the same 4x4 2D mesh NoC.
To capture the worst-case overhead from frequent AuRORA protocol traffic, accelerators are acquired and released before and after each layer (GEMM, convolution, residual addition). Our results clearly demonstrate that AuRORA's accelerator-management overhead is negligible, accounting for less than 1% of the total cycles across all scenarios. This underscores AuRORA's flexibility and scalability for multi-accelerator systems.

[Table: workload scenarios, including Depth Estimation (MiDaS [50]) and Plane Detection (PlaneRCNN [38]).]

To construct a multi-tenant workload from each scenario, we randomly select N different inference tasks, where N ranges from 200 to 300, for concurrent execution.

QoS targets. We set our baseline QoS based on prior works [8,40]: 25ms for AlexNet and ResNet50, 10ms for SqueezeNet and YOLO-Lite, 50ms for BERT-base, and 15ms for the rest. To assess how AuRORA performs with varying latency targets, we also adjust the baseline latency target to 1.2× and 0.8× QoS, corresponding to a 20% increase and decrease in the latency target, respectively. Specifically, QoS-H (hard) denotes a 0.8× QoS latency target, which is more difficult to achieve; QoS-L (light) represents a 1.2× QoS latency target, which is a more lenient goal; and QoS-M refers to the baseline QoS latency target.
Emerging applications. To demonstrate the utility of AuRORA for emerging applications, we also deploy a usage scenario for AR/VR, as suggested by XRBench [35], and create Workload set-XR using AR/VR gaming scenarios. We construct load generation settings following the guidelines in XRBench: inference requests are injected at the target frames-per-second (FPS) processing rate with a jitter applied to each frame.
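The XRBench-style load generation described above can be sketched as a simple arrival-time generator. The uniform ±10% per-frame jitter model below is an assumption for illustration; XRBench's exact jitter distribution may differ.

```python
import random

def xr_arrival_times(fps, num_frames, jitter_frac=0.1, seed=0):
    """Generate inference-request injection times (seconds) at a target
    FPS, with a uniform per-frame jitter of +/- jitter_frac * period."""
    rng = random.Random(seed)
    period = 1.0 / fps
    return [i * period + rng.uniform(-jitter_frac, jitter_frac) * period
            for i in range(num_frames)]
```

Each generated timestamp corresponds to one inference request entering the task queue, so burstiness from jitter directly stresses the runtime's re-partitioning.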

Metrics
We evaluate the efficacy of multi-tenant execution with AuRORA using the metrics proposed in [19], which are commonly used in multi-tenant evaluation [12,21,31]. These metrics encompass the percentage of workloads for which we meet the Service Level Agreement (SLA), the throughput of the co-located applications, and the fairness of AuRORA's resource management strategy. To determine workload latency, we measure the duration from the time a workload is generated until it completes and commits, including the time it spends in the task queue and its runtime.

SLA satisfaction rate. We set the SLA target, which is the QoS latency target constraint, for each workload based on the three QoS levels defined in the 'QoS targets' paragraph of Section 4.3. A higher SLA satisfaction rate means more queries meet the QoS latency target. We use SLA and QoS targets interchangeably in the following discussion.
Fairness. Fairness measures equal progress under multi-tenant execution compared to each task's isolated execution, and has been used in prior multi-tenant works [12,21,31]. This metric assesses AuRORA's dynamic score-based virtual accelerator management, for both compute-resource partitioning and memory-resource partitioning. As shown in Equation 1, $C_i$ represents the cycles of the $i$-th workload: $C_i^{\mathrm{single}}$ indicates the cycles of the workload running on the SoC with no other concurrent workloads, and $C_i^{\mathrm{MT}}$ denotes its multi-tenant execution cycles. We define fairness in terms of normalized progress (NP), which describes the slowdown of multi-tenant execution compared to isolated execution without interference, as suggested in [19]:

$$ NP_i = \frac{C_i^{\mathrm{single}}}{C_i^{\mathrm{MT}}}, \qquad \mathrm{Fairness} = \min_{i,j} \frac{NP_i}{NP_j} \quad (1) $$

Throughput. To evaluate the effectiveness of AuRORA in increasing overall hardware utilization, we analyze the total system throughput (STP). STP is defined as the system throughput of executing $n$ programs, which sums up each program's normalized progress, $STP = \sum_{i=1}^{n} NP_i$, ranging from 1 to $n$. Maximizing overall progress when co-locating multiple applications is crucial to maximizing STP.
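The NP, fairness, and STP definitions above can be computed directly from measured cycle counts; a minimal sketch:

```python
def normalized_progress(c_single, c_mt):
    # NP_i = C_i^single / C_i^MT: 1.0 means no multi-tenant slowdown.
    return [s / m for s, m in zip(c_single, c_mt)]

def stp(np_vals):
    # System throughput: sum of per-task normalized progress.
    return sum(np_vals)

def fairness(np_vals):
    # min over task pairs of NP_i / NP_j, i.e., min(NP) / max(NP);
    # 1.0 means all co-located tasks slow down equally.
    return min(np_vals) / max(np_vals)
```

Note that fairness is 1.0 only when every task experiences the same relative slowdown, regardless of how large that slowdown is, which is why it is reported alongside STP rather than instead of it.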
Real-time and QoE Score. To evaluate Workload set-XR, we use the metrics suggested by XRBench [35]: the Real-Time (RT) Score and the Quality-of-Experience (QoE) Score. The RT Score uses a modified sigmoid function to gradually increase or decrease the score when the inference latency is shorter or longer, respectively, than the target; we use the default value in XRBench for the parameter k. The QoE score quantifies the penalty for FPS drops due to dropped frames, which is not counted in the RT score. We set the Accuracy and Energy scores to 1, as AuRORA does not affect DNN accuracy and our evaluation focuses on homogeneous accelerators. The overall score is computed from the QoE, RT, Accuracy, and Energy scores as XRBench describes.
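To illustrate the shape of such a latency score, the function below is one plausible form of a "modified sigmoid" that is near 1 well under the target and near 0 far over it, with k controlling the transition sharpness. This is our own illustrative formula; the exact XRBench RT-score definition may differ.

```python
import math

def rt_score(latency, target, k=1.0):
    """Illustrative RT-style score: 0.5 exactly at the latency target,
    rising toward 1 for faster inferences and falling toward 0 for
    slower ones. k sets how sharply the score transitions."""
    return 1.0 / (1.0 + math.exp(k * (latency / target - 1.0)))
```

A smooth score like this rewards near-misses instead of treating the deadline as a hard cliff, which matches XRBench's intent of grading gradual degradation.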

Baselines
To evaluate the effectiveness of AuRORA's virtual accelerator management and QoS optimization, we compare AuRORA against two baselines that use physical accelerator integration and measure the performance improvement. The prior works we use as baselines are: (1) Veltair [40]: dynamic compute-resource partitioning with coarse-grained layer-blocks to avoid rescheduling overhead; (2) MoCA [31]: adaptive memory-resource re-partitioning based on system-level contention for spatially co-located DNNs. These baselines are the most recent works proposing system support for QoS management in multi-tenant DNN workloads and addressing the accelerator migration cost through coarse-grained scheduling. Note that Veltair is a joint adaptive compilation and scheduling work that targets a different hardware platform (a CPU cluster). We take Veltair's scheduling component, a layer-blocking strategy and scheduler, as a physical integration baseline. Our MoCA implementation uses AuRORA's memory access rate configuration instruction to change the memory access rate, instead of modifying the accelerator's internal DMA.
Under the physical accelerator binding baselines, each task thread requests its target number of accelerators to meet its QoS requirement by directly pinning them. On a scheduling conflict, i.e., when fewer accelerators are available in the system than requested, the thread attempts to pin an accelerator from the other thread that will finish its current layer block the earliest, and then starts execution after synchronizing and adjusting the accelerator affinity.
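The conflict-resolution policy above can be sketched as follows. All names (`Thread`, `request_accels`) are hypothetical and the model is a simplification of the baselines' behavior, omitting the synchronization step:

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    name: str
    pinned: list = field(default_factory=list)  # accelerator ids currently held
    block_finish_cycle: int = 0                 # when the current layer block ends

def request_accels(requester, threads, free_accels, want):
    """Physical-binding baseline: grab free accelerators first; on a conflict,
    take accelerators from the thread whose current layer block finishes
    earliest (re-pinning happens only at layer-block boundaries)."""
    grabbed = []
    while len(grabbed) < want and free_accels:
        grabbed.append(free_accels.pop())
    # scheduling conflict: fewer free accelerators than requested
    victims = sorted((t for t in threads if t is not requester and t.pinned),
                     key=lambda t: t.block_finish_cycle)
    for victim in victims:
        while len(grabbed) < want and victim.pinned:
            grabbed.append(victim.pinned.pop())
        if len(grabbed) >= want:
            break
    requester.pinned.extend(grabbed)
    return grabbed
```

Because re-pinning must wait for the victim's layer block and a synchronization point, this policy pays a migration cost that AuRORA's virtual binding avoids.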
For the AuRORA evaluation, we use two configurations that incrementally enable QoS optimizations, to show the effectiveness of each resource management feature: (1) AuRORA-Compute, which performs dynamic compute-resource re-partitioning with virtual accelerators; and (2) AuRORA-All, which adds NUMA-aware compute partitioning for NoC deployment scenarios and dynamic memory-resource re-partitioning for both crossbar and NoC.

EVALUATION
In this section, we evaluate the effectiveness of AuRORA for multi-tenant workloads by comparing it against two baseline solutions, Veltair [40] and MoCA [31], recent proposals that improve multi-tenant DNN execution by co-locating multiple DNNs while binding accelerators physically to user threads. Our evaluation demonstrates that AuRORA improves SLA satisfaction rates, STP, and fairness across a wide range of workload scenarios with different DNN models and QoS requirements, with a small hardware area overhead.

SLA Satisfaction Rate
We evaluate the effectiveness of AuRORA-enabled virtual accelerator management for multi-tenant execution using the three workload sets listed in Table 6, each with three QoS targets (Hard: QoS-H, Medium: QoS-M, Light: QoS-L) on two hardware platforms (crossbar and NoC), for a total of 18 runtime scenarios. We measure the SLA satisfaction rate for each scenario and compare it against the baselines to demonstrate the achieved performance improvement. AuRORA's memory management feature further enhances the ability to satisfy targets as QoS requirements become harder to meet, because its memory partitioning scheme prioritizes the memory requests of workloads with less time margin.
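The text does not spell out a formula for the SLA satisfaction rate; a straightforward interpretation, assumed here, is the fraction of inferences whose measured latency meets its QoS target:

```python
def sla_satisfaction_rate(latencies, targets):
    """Fraction of inferences whose latency meets its QoS target
    (an assumed definition; the paper does not give an explicit formula)."""
    assert len(latencies) == len(targets) and latencies
    met = sum(1 for lat, tgt in zip(latencies, targets) if lat <= tgt)
    return met / len(latencies)
```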

NoC configuration.
AuRORA's NUMA-aware virtual accelerator management is effective in NoC-deployed scenarios. The impact of system-level interference varies among distributed accelerator nodes connected via the NoC, primarily due to the NUMA effect. Furthermore, the extent of performance degradation differs across DNN models, depending on each workload's sensitivity to a NUMA-based memory system. AuRORA's NUMA-aware compute partitioning scheme captures this and optimizes through better accelerator allocation and accelerator swapping. As Figure 7b shows, compared to the baselines, AuRORA-All achieves a 2.41× geomean improvement over Veltair (max 3.99× in Workload-A/QoS-H) and 1.87× over MoCA (max 2.85× in Workload-C/QoS-H). AuRORA-All, which enables both NUMA-aware compute-resource partitioning and dynamic memory-resource management, increases the SLA satisfaction rate by 1.25× on geomean compared to AuRORA-Compute: 1.05× for Workload-A, 1.38× for Workload-B, and 1.33× for Workload-C. The NUMA and memory optimizations achieve a higher increase for heavy or mixed sets than light ones, as the NUMA effect is more pronounced for workloads that generate more memory traffic. Thus, AuRORA-All benefits these scenarios by alleviating the NUMA effect through better compute partitioning and alleviating memory contention through memory partitioning.
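The NUMA-aware allocation and swapping described above can be illustrated with a simple distance-and-sensitivity cost model. All names and the cost model itself are assumptions for illustration, not AuRORA's actual scoring policy:

```python
def allocate_numa_aware(free_accels, hops, want):
    """Greedy NUMA-aware allocation: pick the `want` free accelerators with
    the fewest NoC hops to the requesting workload's memory region."""
    return sorted(free_accels, key=lambda a: hops[a])[:want]

def swap_benefit(wl_a, wl_b, hops):
    """Estimated gain from swapping the accelerators held by two workloads:
    positive when the more NUMA-sensitive workload ends up closer to memory.
    Each workload is a (accel_id, numa_sensitivity) pair."""
    (acc_a, sens_a), (acc_b, sens_b) = wl_a, wl_b
    cost_now = sens_a * hops[acc_a] + sens_b * hops[acc_b]
    cost_swap = sens_a * hops[acc_b] + sens_b * hops[acc_a]
    return cost_now - cost_swap
```

A runtime using this sketch would allocate nearby accelerators first and trigger a swap whenever `swap_benefit` is positive, moving memory-sensitive workloads closer to their data.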

System Throughput Analysis
We evaluate the STP of multi-tenant scenarios, as described in Section 4, to demonstrate that AuRORA improves the STP compared to the baselines.
AuRORA's virtual accelerator allocation increases overall system throughput. Figure 8a shows the system throughput improvement in the crossbar-based system. AuRORA-Compute exhibits a 1.26× geomean improvement over Veltair (max 1.34× in Workload-A/QoS-H) and a 1.18× geomean improvement over MoCA (max 1.28× in Workload-C/QoS-H). Although the STP improvement is consistent across scenarios, it grows as the QoS requirement gets stricter, with the highest improvement of 1.28× over Veltair in the QoS-H group. This indicates that accelerator virtualization with AuRORA improves resource utilization in all scenarios through flexible and fast resource reallocation, especially as resource conflicts increase.
AuRORA's memory resource management improves STP.
As Figure 8a shows, AuRORA-All achieves a 1.33× improvement over Veltair and 1.25× over MoCA (max 1.38× and 1.37×, respectively, in Workload-A/QoS-L), which is a 1.06× geomean STP improvement over AuRORA-Compute. Across the workload sets, AuRORA-All improves most over AuRORA-Compute in Workload-B, by 1.12×. This is because memory access rate management alleviates performance degradation due to memory interference, which becomes more prominent with heavier workloads.
AuRORA's NUMA-aware accelerator allocation improves STP. Figure 8b shows the NoC deployment results. With both NUMA and memory resource optimizations enabled, AuRORA-All improves STP by 1.79× over Veltair (max 2.04× in Workload-C/QoS-H) and 1.59× over MoCA (max 1.97× in Workload-C/QoS-M). Compared to AuRORA-Compute, AuRORA-All achieves a 1.32× STP improvement, greater than in the crossbar scenario, which further shows the effectiveness of AuRORA-All at improving throughput in a system subject to the NUMA effect. The impact is most prominent in Workload-C, with a 1.46× improvement over AuRORA-Compute across all QoS levels. The NUMA effect is more pronounced for heavier workloads, and its variance grows with workload heterogeneity; enabling the NUMA optimization alleviates this effect, improving overall STP.

Fairness Analysis
We evaluate the overall system fairness of multi-tenant execution, as defined in Section 4, to demonstrate the effectiveness of AuRORA in improving this metric.We compare the fairness of AuRORA with the baseline strategies and normalize the results to Veltair's fairness, shown in Figure 9.
AuRORA's virtual accelerator support improves fairness. The memory resource management feature helps resolve shared memory-system contention, whose impact differs across workloads due to their different compute-to-memory ratios.

Physical Design and Area Analysis
We synthesize the AuRORA Manager-integrated Gemmini accelerator and the AuRORA Client-integrated Rocket CPU using Cadence Genus in a commercial 16nm process technology, with the configuration used in the evaluation. As shown in Table 7, AuRORA incurs an overhead of 2.7% of the total area; specifically, the Client incurs 1.2% of the CPU tile area and the Manager 3% of the accelerator tile area. The Client overhead is minimal as it only needs enough bits to track which accelerators are assigned to the current resident thread.
The Manager also incurs very low physical area overhead relative to the accelerator, as the critical architectural shadowed state is less than 100 bits of storage. The majority of its overhead comes from the page table walker and TLB, which are present in any accelerator that requires an IOMMU.

CONCLUSION
This work proposes AuRORA, a scalable accelerator integration approach that enables efficient execution of multi-tenant workloads using a virtual accelerator abstraction. Unlike existing accelerator integrations, AuRORA enables dynamic, contention-aware scheduling of multi-tenant tasks with minimal performance overhead through a full-stack architecture. We implement AuRORA's microarchitecture, messaging protocol, ISA, and runtime, and demonstrate its ability to improve end-to-end metrics for multi-tenant DNN workloads. Our evaluation across diverse workload sets, latency targets, and hardware deployments shows that, compared to existing multi-tenant solutions, AuRORA improves overall SLA satisfaction by 2.41×, STP by 1.79×, and fairness by 1.41× in NoC-deployed scenarios, and overall SLA satisfaction by 2.02×, STP by 1.33×, and fairness by 1.34× in crossbar-deployed scenarios, with 2.7% area overhead.

Figure 1 :
Figure1: AuRORA is a full-stack accelerator integration methodology for scalable accelerator deployment.

Figure 3 :
Figure 3: AuRORA's hardware protocol for how a Client manages accelerator integrated into Manager tiles.

Figure 4 :
Figure 4: AuRORA runtime takes Task  and its target latency and reconfigures the acquired accelerators for each Client.

Figure 6 :
Figure 6: Normalized latency sweeping Manager's L2 TLB and PTW cache size. Latency is normalized to the ideal case.

Figure 7 :
Figure 7: AuRORA's SLA satisfaction rate improvement over evaluated multi-tenancy baselines with different QoS targets (QoS-L/M/H: light/medium/hard latency target) and DNN workload sizes (Workload-A/B/C: light/heavy/mixed models).

Figure 8 :
Figure 8: STP improvement of AuRORA over evaluated multi-tenancy baselines (normalized to Veltair baseline) with different QoS targets and DNN workload sizes.

Figure 9 :
Figure 9: Fairness improvement of AuRORA over evaluated multi-tenancy baselines (normalized to Veltair baseline) with different QoS targets and DNN workload sizes.

Figure 10 :
Figure 10: Real-time (RT) score, QoE score and Overall score improvement of AuRORA over evaluated multi-tenancy baselines for Workload set-XR Gaming usage scenario.

Real-time and QoE Analysis
AuRORA improves meeting real-time requirements. As Figure 10 shows, AuRORA-All achieves a 1.61× RT score improvement over Veltair and 1.44× over MoCA for crossbar deployment, and 2.24× over Veltair and 2.02× over MoCA for NoC deployment. Virtual accelerator allocation alone (AuRORA-Compute) shows a 1.57× improvement over Veltair for the crossbar and 1.54× for the NoC scenario, which indicates the effectiveness of AuRORA's virtual compute-resource management.
AuRORA improves quality of experience. AuRORA's improvement of both RT and QoE scores indicates that it preserves the target FPS while maintaining the timing requirements for the executed frames. As Figure 10 shows, AuRORA-All achieves QoE improvements of 1.12× over Veltair and 1.1× over MoCA for the crossbar, and 1.41× over both Veltair and MoCA for the NoC deployment. As a result, AuRORA-All's Overall score improves by 1.66× and 1.49× over Veltair and MoCA for the crossbar scenario, and by 2.74× and 2.45× for the NoC scenario.

Table 1 :
Comparison of multi-accelerator integration methodologies.

Table 2 :
The AuRORA protocol can share the same on-chip interconnect with the memory traffic or use a separate interconnect. All listed combinations are supported in the AuRORA implementation.
Table 3 :
AuRORA pseudoinstructions, their operands, and purpose.
rerocc_acquire (success, acc_id, acq_id): acquires an accelerator and maps it to the local client; returns a success status.
rerocc_release (acq_id): releases an accelerator currently acquired by the local client.
rerocc_assign (acq_id, opcode): maps a currently acquired accelerator to an available instruction opcode.
rerocc_fence (acq_id): memory fence between core memory and an acquired accelerator.
rerocc_memrate (acq_id, rate): sets the maximum memory request rate the accelerator can make.

Table 4 :
SoC configurations used in the evaluation.

Table 5 :
AuRORA end-to-end latency overhead across SoC configurations and ResNet sizes.
minimize the end-to-end performance overhead. Thus, we use this configuration of the AuRORA Manager for further experiments. SoC Configurations. To illustrate the effectiveness of AuRORA in different SoC configurations, we run experiments with ResNet50 and ResNet18 on one, two, and four accelerators, each interconnected through three different SoC interconnect designs. Table

Table 6 :
Benchmark DNNs and workload set categorization based on model size used in the evaluation.

Table 7 :
Area breakdown of accelerator design with AuRORA.