Abstract
The wide adoption of smart devices and Internet-of-Things (IoT) sensors has led to massive growth in data generation at the edge of the Internet over the past decade. Intelligent real-time analysis of such high-volume data, particularly leveraging highly accurate deep learning (DL) models, often requires the data to be processed as close to the data sources (or at the edge of the Internet) as possible to minimize network and processing latency. The advent of specialized, low-cost, and power-efficient edge devices has greatly facilitated DL inference tasks at the edge. However, limited research has been done on improving the inference throughput (e.g., the number of inferences per second) by exploiting various system techniques. This study investigates system techniques, such as batched inferencing, AI multi-tenancy, and clusters of AI accelerators, which can significantly enhance the overall inference throughput on edge devices running DL models for image classification tasks. In particular, AI multi-tenancy enables the collective utilization of edge devices’ system resources (CPU, GPU) and AI accelerators (e.g., Edge Tensor Processing Units; EdgeTPUs). The evaluation results show that batched inferencing yields more than 2.4× throughput improvement on devices equipped with high-performance GPUs like Jetson Xavier NX. Moreover, with multi-tenancy approaches, e.g., concurrent model executions (CME) and dynamic model placements (DMP), the DL inference throughput on edge devices (with GPUs) and EdgeTPUs can be further improved by up to 3× and 10×, respectively. Furthermore, we present a detailed analysis of hardware and software factors that affect the DL inference throughput on edge devices and EdgeTPUs, thereby shedding light on areas that could be further improved to achieve high-performance DL inference at the edge.
1 INTRODUCTION
High network latency and privacy concerns have hindered the adoption of cloud computing for Internet-of-Things (IoT) and cyber-physical system (CPS) applications that require prompt responses and collect sensitive user information [27, 47]. Wide-area data transmission from IoT sensors to cloud data centers often causes intolerable network latency [44, 71]. Moreover, IoT systems that monitor user activity (e.g., smart homes and activity trackers) risk privacy breaches because all sensing data need to be stored and processed in cloud data centers [59, 74]. Edge computing, a new computing paradigm, attempts to address these issues by placing computing resources at the edge of the Internet (or closer to data sources) [45, 60, 61]. The proliferation of IoT sensors and single-board computers (SBCs), e.g., Raspberry Pi [20], enables the adoption of the edge computing paradigm across a wide range of latency-sensitive applications, including emergency alert services, traffic monitoring, wearable cognitive-assistance systems, and augmented reality [31, 61].
However, some computationally intensive edge applications and tasks still need to be offloaded to clouds. A typical example is deep learning (DL) applications. While DL technologies are increasingly leveraged in real-world edge applications, the majority of DL tasks still rely on resources in cloud data centers rather than being processed at the Internet edge [36, 44, 70]. The steep resource requirements (e.g., CPU, memory, and GPU) of DL tasks, arising from the high dimensionality of input data and the large number of floating-point operations, mean that DL tasks often need to be sent to more powerful cloud data centers [2, 3, 4, 5, 64]. Therefore, emerging edge applications, such as autonomous driving, disaster response systems, and drone-based surveillance [50, 55, 75], may offer poor quality of service (QoS) and user experience due to high network latency and privacy issues.
Significant efforts have been made to bring DL to the edge of the Internet. In particular, hardware and software optimizations are carried out to ensure that DL models fit in resource-constrained devices. For example, model compression [30, 38] and quantization [35, 68] techniques have been developed to downsize DL models without significant degradation in their accuracy [39, 42, 58]. Similarly, studies involving efficient partitioning of heavier models and subsequent offloading to the cloud [40, 44, 52] show how the edge-cloud hybrid model can be leveraged for DL inference tasks. Moreover, lightweight machine learning (ML) frameworks, e.g., TensorFlow Lite [51] and mobile neural network [43], have been developed to enable on-device inference. Finally, the advent of edge devices with GPUs, e.g., Nvidia’s Jetson series [12, 15, 16], and AI accelerators, e.g., Google’s Coral Accelerator [7] and Intel’s Neural Compute Stick [11], has played a significant role in enabling edge-based DL inferencing.
Given the significant technological advancements enabling DL inferencing at the edge, it is crucial to understand the opportunities and limitations of various edge devices and AI accelerators for real-world DL deployment scenarios. Existing works [25, 34, 48, 49, 57, 73] have quantified the efficiency of various edge devices for DL inference tasks. However, most existing studies have focused on characterizing the performance (e.g., latency and throughput) of edge devices and AI accelerators with single DL tasks, which is a significantly limiting assumption in DL deployment scenarios. Real-world edge applications widely require AI multi-tenancy-based deployments, in which multiple DL tasks co-run on edge devices. For instance, drone-based surveillance requires simultaneous execution of inference tasks on video and audio streams [72]. Furthermore, system approaches to maximize the DL inference throughput (e.g., the number of inferences per second) on edge devices have not been deeply investigated.
In this study, we start by characterizing the performance (e.g., inference throughput) of various edge devices and AI accelerators with a set of pre-trained DL models and popular DL frameworks. Through this characterization step, we obtain the baseline performance of various edge devices/AI accelerators and investigate factors that change the throughput of DL inference tasks on such devices. We then employ three techniques to maximize the DL inference throughput on the devices with single- and multi-tenancy-based use cases: batched inferencing, concurrent model executions (CME), and dynamic model placements (DMP). Batched inferencing targets single-tenancy use cases on edge devices and maximizes the DL inference throughput by exploiting the parallel computing capabilities of the computing resources on edge devices. Both CME and DMP are techniques for maximizing the DL inference throughput with AI multi-tenancy. CME deploys multiple DL models on a single resource (either GPU or EdgeTPU) and runs them in parallel, improving the overall DL inference throughput by enabling the simultaneous execution of different DL models. DMP enables AI multi-tenancy by deploying and executing DL models on different resources of an edge device at the same time, e.g., DL models on both the GPU and the EdgeTPU. DMP is particularly useful when edge devices are augmented with AI accelerators (e.g., EdgeTPUs), and it can significantly increase resource utilization and the DL inference throughput by utilizing multiple resources on the devices and the accelerators. Furthermore, we evaluate a DL deployment strategy on edge devices with multiple AI accelerators (e.g., an EdgeTPU cluster) and report the benefits and limitations of deployment scenarios with EdgeTPU clusters.
Our extensive evaluation results confirm that the DL inference throughput on edge devices and EdgeTPU accelerators can be significantly improved by leveraging the three approaches proposed in this work. For the DL single-tenancy use cases, compared to the single-batch inferencing, batched inferencing can process up to \(2.4\times\) more inferences per second on devices equipped with high-performance GPUs, including
— We provide a thorough characterization and quantitative analysis of the performance (i.e., latency and throughput) of various edge devices and AI accelerators when running DL tasks for image classification. We also provide a thorough analysis of the factors that affect the DL inference throughput.
— We propose three techniques to maximize the DL inference throughput on edge devices. In particular, batched inferencing is a throughput maximization approach for single DL tasks, while CME and DMP maximize the inference throughput with AI multi-tenancy.
— With these three techniques, we discover the empirical upper bounds of batch size, throughput, and model concurrency on edge devices and AI accelerators for DL inference tasks.
— We identify the performance benefits and limitations of adopting DMP to leverage heterogeneous resources on edge devices and EdgeTPUs (AI accelerators).
— We investigate the benefits and limitations of deploying DL models on edge devices with EdgeTPU clusters.
This work is based on our preliminary version [65] and takes a step further with a broader and more thorough evaluation of techniques for maximizing the DL inference throughput on edge devices and AI accelerators. In particular, to the best of our knowledge, this work is the first study to evaluate DMP on heterogeneous edge resources/EdgeTPUs and to characterize the performance of an EdgeTPU cluster for DL inference tasks. Therefore, the last two contributions in the above list are unique to this article.
The rest of the article is structured as follows. Section 2 describes the edge devices, EdgeTPUs, DL models, and DL frameworks used in this work. Section 3 presents our approaches for maximizing the DL inference throughput. Section 4 describes the evaluation design and the benchmark tools used for assessing edge devices and AI accelerators. Section 5 reports evaluation results with single DL tasks (single-tenancy). Section 6 provides evaluation results when deploying multiple DL models (AI multi-tenancy). Section 7 discusses related work, and Section 8 concludes this article.
2 BACKGROUND
We describe the background of edge devices, AI accelerators, DL models, and DL frameworks used in this study.
2.1 Edge Devices and EdgeTPU Accelerators
We use three categories of devices: (1) general-purpose edge devices, (2) edge devices with GPU accelerators, and (3) EdgeTPU-based AI accelerators. A summary of these devices is shown in Table 1.
| Raspberry Pi 4 (Model B) | Odroid N2 | Jetson Nano | Jetson TX2 | Jetson Xavier NX | Coral Dev Board | Coral USB Accelerator | |
|---|---|---|---|---|---|---|---|
| # CPU Cores | 4 | 6 | 4 | 6 | 6 | 4 | - |
| CPU | 4-core Cortex-A72 @ 1.5GHz | 4-core Cortex-A73 @ 1.8GHz + 2-core Cortex-A53 @ 1.9GHz | 4-core Cortex-A57 @ 1.5GHz | 2-core Denver 2 64-bit CPU @ 2.0GHz + 4-core Cortex-A57 @ 2.0GHz | 6-core Carmel ARM v8.2 64-bit CPU | 4-core Cortex-A53, Cortex-M4F | - |
| GPU | - | Mali-G52 | 128-core Nvidia Maxwell | 256-core Nvidia Pascal | 384-core Nvidia Volta | Integrated GC7000 Lite | - |
| Co-Processor | - | - | - | - | - | Google EdgeTPU (4 TOPS) | Google EdgeTPU (4 TOPS) |
| Memory | 4 GB LPDDR4 | 4 GB LPDDR4 | 4 GB LPDDR4 | 8 GB LPDDR4 | 8 GB LPDDR4 | 1 GB LPDDR4 | - |
| Power | 1.8–5.3W | 2–5W | 5–10W | 7.5–15W | 10–15W | 4 TOPS @2W | 4 TOPS @2W |
| OS | Ubuntu 18.04 LTS | Ubuntu 18.04 LTS | Ubuntu 18.04 LTS | Ubuntu 18.04 LTS | Ubuntu 18.04 LTS | Mendel Debian 10 | - |
| Price | $55.00 | $79.00 | $99.00 | $399.00 | $399.00 | $129.99 | $59.99 |
Table 1. Specifications of Edge Devices and AI Accelerators
General-purpose Edge Devices. The two devices in this category are Raspberry Pi 4 (RPi4) and Odroid N2 (see Table 1).
Edge Devices with GPU Accelerators.
Three Jetson devices developed by Nvidia are chosen for this category. Jetson Nano (J.Nano) is equipped with a 128-core Nvidia Maxwell GPU and 4 GB of LPDDR4 memory (Table 1).
Jetson TX2 (J.TX2) pairs a 256-core Nvidia Pascal GPU with 8 GB of LPDDR4 memory.
Jetson Xavier NX (J.Xavier) provides a 384-core Nvidia Volta GPU and 8 GB of LPDDR4 memory.
All three Jetson devices allow altering the power modes using Nvidia’s nvpmodel utility.
EdgeTPU-based AI Accelerators.
We use two EdgeTPU-based AI accelerators: Google’s Coral Dev Board and the Coral USB Accelerator (USB-Accelerator).
Both devices feature a Google EdgeTPU co-processor delivering 4 TOPS at about 2 W of power (Table 1).
2.2 Deep Learning Models
The accuracy of DL models keeps increasing along with the complexity of model dimensions and the number of layers. However, such huge models often do not fit into resource-constrained, low-capacity edge devices. We therefore use nine pre-trained CNN models for image classification, summarized in Table 3.
| DL Model | Year | Input Size | Num. Layers | Billion FLOPS | # Params (Millions) | Approx. File Size (MB) | PyTorch | MXNet | TF | TFLite |
|---|---|---|---|---|---|---|---|---|---|---|
| | 2012 | 224 \(\times\) 224 | 8 | 0.7 | 61 | 244 | ✓ | ✓ | ✓ | ✗ |
| | 2016 | 224 \(\times\) 224 | 161 | 7.9 | 28.7 | 115 | ✓ | ✓ | ✓ | ✗ |
| | 2015 | 299 \(\times\) 299 | 48 | 2.9 | 27.2 | 101, 25* | ✓ | ✓ | ✓ | ✓ |
| | 2017 | 224 \(\times\) 224 | 28 | 1.1 | 4.3 | 17, 4.5* | ✓ | ✓ | ✓ | ✓ |
| | 2018 | 224 \(\times\) 224 | 20 | 0.3 | 3.5 | 14, 4* | ✓ | ✓ | ✓ | ✓ |
| | 2015 | 224 \(\times\) 224 | 18 | 1.8 | 11.7 | 46 | ✓ | ✓ | ✓ | ✗ |
| | 2015 | 224 \(\times\) 224 | 50 | 4.1 | 25.6 | 102 | ✓ | ✓ | ✓ | ✗ |
| | 2016 | 224 \(\times\) 224 | 15 | 0.4 | 1.2 | 5 | ✓ | ✓ | ✓ | ✗ |
| | 2014 | 224 \(\times\) 224 | 16 | 15.4 | 138.36 | 553 | ✓ | ✓ | ✓ | ✗ |
✓ denotes that the pre-trained version of the models are available for the corresponding ML framework, ✗ denotes the unavailability of model for the ML framework, * means the model size for TF Lite.
Table 3. Overview of DL Models
Moreover, each model has different characteristics and advantages. For instance,
2.3 Deep Learning Frameworks
We use four, widely-used open-source DL frameworks; PyTorch [53], MxNet [29], TensorFlow [24], and TensorFlow-Lite [21], which can be deployed on resource-constrained edge devices. PyTorch, MxNet, and TensorFlow are used for performing CPU- and GPU-based DL inference tasks on edge devices (e.g.,
MxNet [29]. MxNet is a scalable open-source DL framework from Apache Software Foundation. It has been designed to support distributed training by leveraging a distributed parameter server. The performance of the framework is claimed to scale linearly with multiple GPUs (or CPUs) as well. It also allows mixing symbolic and imperative programming models enabling better efficiency and productivity for users. MxNet currently supports Python, Java, Scala, R, Julia, Go, Clojure, Perl, MATLAB, and JavaScript.
PyTorch [53]. PyTorch, developed by Facebook, is an open-source ML framework built on top of the Torch library and designed specifically for Python. One of the most prominent features of PyTorch is providing a NumPy-like tensor computing support (but only for CUDA-capable Nvidia GPUs). Unlike TensorFlow (version \(\lt\) 2.0.0) and MxNet, PyTorch adopts a dynamic computational graph-based approach to create the computational graph at runtime. This feature enables flexibility for developers when writing and debugging DL applications.
TensorFlow [24]. TensorFlow, from Google, is another open-source and widely popular ML framework. It uses a dataflow graph where nodes represent operations while the edges represent tensors. It can map the nodes of the graph across many machines in a distributed cluster and within a machine across multiple computational devices (CPUs, GPUs, or TPUs), thereby proving to be a scalable framework. Starting from version 2.0.0, TensorFlow supports eager execution mode, which emulates the behavior of PyTorch’s dynamic computation graphs.
TensorFlow-Lite [21]. TensorFlow-Lite is a lightweight version of TensorFlow developed to support on-device DL inferencing. TensorFlow-Lite employs several techniques to optimize memory utilization, such as intermediate tensors, shared memory buffer objects, and memory offset calculation to run DL models on resource-constrained devices, including EdgeTPUs. TensorFlow-Lite has two primary parts, which are an interpreter and a converter. The interpreter runs DL models, and the converter converts TensorFlow models into TensorFlow-Lite ones.
Table 3 also shows the DL frameworks’ support for the DL models. All nine DL models are available for PyTorch, MxNet, and TensorFlow for CPU-/GPU-based inferencing. However, only the three quantized models are available for TensorFlow-Lite, and these are the models used for EdgeTPU-based inferencing.
3 APPROACHES FOR DEEP LEARNING INFERENCE THROUGHPUT MAXIMIZATION
The goal of this study is to investigate different system approaches for maximizing the DL inference throughput on various edge devices and AI accelerators. For the DL single-tenancy use cases, we investigate the batching approach. For the AI multi-tenancy use cases, two approaches are studied: CME and DMP.
3.1 DL Throughput Maximization for AI Single-Tenancy on Edge Devices
For AI single-tenancy cases, where only one DL model is running on an edge device, we use batched inferencing (or multi-batch inferencing) to maximize the DL inference throughput. In the context of inference, batching refers to enabling a single DL model to process multiple inputs simultaneously, following the concept of single-instruction/multiple-data operations. Figure 1 illustrates single-batch inferencing and batched inferencing in single-tenancy on edge devices.
In real-world use cases, batched inferencing is beneficial because edge devices are often required to handle batches of data either from multiple IoT sensors (e.g., autonomous cars with multiple cameras) or from end devices that collect data over a period of time and send requests in batches (e.g., traffic monitoring and wearable devices). Therefore, we investigate the impact of batched inferencing on the overall throughput, specifically on the GPU-enabled devices (J.Nano, J.TX2, and J.Xavier).
3.2 DL Throughput Maximization for AI Multi-Tenancy on Edge Devices
We investigate techniques for maximizing the DL inference throughput for AI multi-tenancy, in which multiple DL models run simultaneously on the same edge device (or with AI accelerators). In particular, two techniques are investigated for AI multi-tenancy at the edge: (1) CME and (2) DMP.
Concurrent Model Executions (CMEs). CME leverages the idea of parallel processing and enables AI multi-tenancy by simultaneously executing multiple DL inference tasks (models) on edge devices’ resources (either GPU or EdgeTPUs). Figure 2(a) and (b) illustrate the CME on edge devices and AI accelerators.
CME can provide two potential benefits to edge devices and EdgeTPUs: (1) improvement in the overall DL inference throughput and (2) the ability to run multiple (often different) DL (e.g., inference) tasks. However, due to the resource-constrained nature of edge devices (e.g., limited memory sizes and number of CPU cores), it is unclear how many DL models can be concurrently executed and which level of concurrency yields the maximum throughput. Therefore, it is important to empirically identify the upper bound of throughput improvement and the concurrency level (the number of co-running DL models) achievable with CME on these devices. To this end, we measure throughput changes with different levels of concurrency. The concurrency level obtained from the last successful execution is considered the maximum concurrency level supported by the edge devices and EdgeTPUs. Furthermore, because CME provides software-level parallelism, we also evaluate the impact of introducing an EdgeTPU cluster (e.g., running DL models on multiple EdgeTPU accelerators attached to the same host device).
Dynamic Model Placements (DMPs). DMP is another approach for maximizing the DL inference throughput for AI multi-tenancy, based on the idea of leveraging heterogeneous computing resources. DMP leverages the collective power of edge devices and EdgeTPU accelerators by placing and executing DL models on different resources (e.g., GPU and EdgeTPU) at the same time.
Furthermore, because
4 EVALUATION PROCESS AND BENCHMARKER DESIGN
This study investigates systematic approaches to maximize the DL inference throughput on edge devices and AI accelerators. The primary performance metric thus is the DL inference throughput. For the baseline performance (single-tenancy case), the DL inference throughput for a single computing resource (\(T_{single}\)) is calculated as DL inferences per second, as expressed below. (1) \(\begin{equation} {T_{single}} = \dfrac{The~number~of~inferences}{Total~execution~time}. \end{equation}\)
However, the definition of the number of inferences varies with the type of experiment. For instance, when leveraging the single-tenancy with batched inferencing (feeding multiple input images into one DL model on a device), the number of inferences in Equation (1) is “batch size (\(bs\))” \(\times\) “the number of batches (\(bc\)).” The equation for batched inferencing throughput (\(T_{batch}\)) is formulated below. (2) \(\begin{equation} {T_{batch}} = \dfrac{bs \times bc}{Total~execution~time}. \end{equation}\)
On the other hand, when leveraging AI multi-tenancy with CME, the number of inferences will be calculated by “concurrency level (\(cc\))” \(\times\) “\(bs\)” \(\times\) “\(bc\)”. The DL inference throughput with CME (\(T_{cme}\)) is expressed in Equation (3). (3) \(\begin{equation} {T_{cme}} = \dfrac{cc \times bs \times bc}{Total~execution~time}. \end{equation}\)
For the DMP evaluation, which uses multiple resources like both GPUs and TPUs, the total inference throughput (\(T_{dmp}\)) is the sum of the throughput of all used resources. The equation for DMP throughput is expressed as (4) \(\begin{equation} {T_{dmp}} = \sum _{i=1}^{n}{{T_{cme_{i}}}}, \end{equation}\) where \(i\) represents various computing resources for DL inference (e.g., CPU, GPU, and TPU), and \(n\) indicates the number of different resources employed by DMP.
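Equations (1)–(4) can be expressed as small helper functions. The sketch below is illustrative only; the function and parameter names are ours, not from the benchmarker:

```python
def t_single(num_inferences, total_time_s):
    """Equation (1): inferences per second on a single resource."""
    return num_inferences / total_time_s

def t_batch(bs, bc, total_time_s):
    """Equation (2): batched inferencing throughput (batch size x batch count)."""
    return (bs * bc) / total_time_s

def t_cme(cc, bs, bc, total_time_s):
    """Equation (3): CME throughput with cc concurrently running models."""
    return (cc * bs * bc) / total_time_s

def t_dmp(per_resource_throughputs):
    """Equation (4): DMP throughput is the sum over the resources used."""
    return sum(per_resource_throughputs)

# Example: 4 co-running models, batch size 8, 100 batches, finished in 50 s
print(t_cme(cc=4, bs=8, bc=100, total_time_s=50.0))  # 64.0 inferences/s
```
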
Benchmarker Design. We develop a benchmarker that measures the DL inference throughput and collects other necessary system statistics. We deploy it along with an image classification application on the edge devices and EdgeTPU accelerators. The measurement procedure of the benchmarker is illustrated in Figure 3.
Fig. 3. Benchmark procedure.
The benchmarker is invoked from a bash script (❶ in Figure 3) that takes parameters in a config file specific to a measurement. The config file specifies the DL model, framework, and the number of iterations to run the experiment. The config file also contains other parameters that are common across experiments, such as the number of warmup executions to perform, the input batch size, the number of batches, and resources (CPU, GPU, or EdgeTPU) used for the inference task. The bash script then runs the benchmarker (written in Python) with all these configurations. Invoking the Python interpreter using the bash script ensures that the cache constructed and maintained by the Python runtime gets cleared with each new iteration. The benchmarker then prepares a framework-specific data-loader (❷) that uses the validation dataset from
The monitoring thread is responsible for collecting diverse system statistics using
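The monitoring thread can be approximated with the short sketch below. This is a hypothetical stand-in: `sample_fn` represents whatever statistics collector the benchmarker actually uses, and the sampling interval is an assumption:

```python
import threading
import time

def monitor(sample_fn, samples, stop_event, interval_s=0.5):
    """Background sampler: append one statistics snapshot per interval
    until the main thread signals completion via stop_event."""
    while not stop_event.is_set():
        samples.append(sample_fn())
        stop_event.wait(interval_s)  # acts as an interruptible sleep

# Usage: wrap the inference run with the monitoring thread.
stop, samples = threading.Event(), []
t = threading.Thread(target=monitor,
                     args=(lambda: {"ts": time.time()}, samples, stop, 0.05))
t.start()
time.sleep(0.2)  # stand-in for the DL inference loop
stop.set()
t.join()
```

Using an `Event` both as the stop flag and as the sleep primitive lets the main thread terminate the sampler promptly instead of waiting out a full interval.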
Fig. 4. Experimental setup for power measurement. Power consumption by a target edge device is being measured and transmitted to a computing board using I2C cables by INA-219 chip.
5 EVALUATION WITH DL SINGLE-TENANCY
We first report the evaluation results with single-tenancy on edge devices and EdgeTPU accelerators. As single-tenancy is a common use case for running AI tasks at the edge, it can be employed on edge devices (with CPUs or GPUs) or EdgeTPUs. Section 5.1 reports the DL inference throughput with single-tenancy on edge devices. Moreover, as an approach for maximizing the inference throughput for single-tenancy, we evaluate the impact and performance of batched inferencing, where a DL model processes a batch of input images and outputs the classification results of all the images simultaneously. Section 5.2 discusses the evaluation results of single-tenancy on EdgeTPUs. Finally, Section 5.3 thoroughly analyzes the experiment results and identifies the factors altering the DL inference throughput on edge devices and EdgeTPUs. The results reported in this section serve as the baseline performance for evaluating the throughput maximization approaches (CME and DMP) for AI multi-tenancy on edge devices.
5.1 DL Inference Throughput on Edge Devices (CPU or GPU) with Single-Tenancy
The first set of experiments measured the inference throughput of all the DL models on edge devices with a batch size of 1, i.e., a single input image per model per iteration. Figure 5 reports the average DL inference throughput for all the models using the three DL frameworks. Please note that the results of
Fig. 5. DL inference throughput variations across models, edge devices, and DL frameworks with a batch size of 1.
DL Model Size.
Figure 5 shows that the inference throughput varied significantly across different DL models. In particular, DL models with fewer parameters and floating-point operations (e.g.,
GPU vs. CPU.
Figure 5 also confirms that, for single-batch inference, the GPU-based devices’ DL inference throughput significantly outperformed the CPU-based devices’ inference throughput. The edge devices with GPUs (
DL Frameworks. Among the three DL frameworks, PyTorch showed the highest throughput on GPUs. On average, the throughput of DL models with PyTorch was \(31\%\) and \(26\%\) higher than with MxNet and TensorFlow, respectively. PyTorch’s superiority on GPUs is because of the underlying Torch library, designed to make tensor operations on GPUs faster and more efficient. On the other hand, TensorFlow significantly outperformed the other two on CPUs. The average throughput across all the models on CPUs using TensorFlow was about \(5\times\) that of MxNet and \(10\times\) that of PyTorch. This result was due to TensorFlow’s design for mapping nodes (of the computational graph) across multicore CPUs [24]. Therefore, TensorFlow could enable faster computation, processing more DL inferences on CPUs than the other two frameworks.
MxNet was the worst-performing framework on all the devices. We observed that two
Impact of Batched Inferencing.
As discussed in Section 3.1, batched inferencing is our approach for maximizing the throughput with single-tenancy. Figure 6 reports the DL inference throughput of batched inferencing with increasing batch sizes on the three DL frameworks. Please note that Figure 6 includes the results of five models on four devices due to space limitations; the omitted results follow similar patterns. We observed a significant throughput improvement with increasing batch size for the GPU-enabled devices. On average, a batch size of 32 showed \(240\%\) higher DL inference throughput than single-batch inferencing. The impact of batching on
Fig. 6. Throughput variation across DL models, edge devices, and DL framework with different batch sizes.
Another observation is that the inference throughput did not always increase as the batch size increased. When inferencing without batching, there is an interval between the computations of two consecutive activation layers in the model. With batched inferencing, the model can compute the activation layers of the other images in the same batch during this interval and store the results in memory for the next layer. The batch size is limited by an edge device’s memory capacity [63]. If the memory is sufficient to store all the activations, the batch can be processed directly [32]. However, when the memory size is insufficient, batching triggers the edge device’s memory-saving mechanism, such as swap space: data in memory are moved to storage (an SD card on edge devices). This mechanism can slow down the inference speed, hence decreasing the inference throughput. Therefore, employing the right (or optimal) batch size is a critical factor for maximizing the DL inference throughput on edge devices.
5.2 DL Inference Throughput on EdgeTPU with Single-Tenancy
EdgeTPUs are designed to support faster processing of tensors (one of the primary components of CNNs), which, in turn, can boost the DL inference throughput. Note that the
Table 5 reports the DL inference throughput on EdgeTPUs connected to different host devices; the throughput fluctuated across hosts. For example, when
| Model | Host Device + EdgeTPU | Avg. Infer. Throughput | Std. Dev. |
|---|---|---|---|
| | | 12.35 | 0.35 |
| | | 15.59 | 0.47 |
| | | 16.42 | 0.34 |
| | | 18.54 | 0.48 |
| | | 17.28 | 0.38 |
| | | 13.26 | 0.19 |
| | | 54.65 | 4.03 |
| | | 58.84 | 6.73 |
| | | 63.60 | 5.58 |
| | | 64.65 | 5.45 |
| | | 64.01 | 2.73 |
| | | 59.02 | 2.48 |
| | | 55.79 | 4.15 |
| | | 59.70 | 5.78 |
| | | 66.61 | 4.23 |
| | | 64.01 | 6.57 |
| | | 64.69 | 2.41 |
| | | 60.67 | 5.23 |
Table 5. DL Inference Throughput of the Three Quantized DL Models on EdgeTPUs
The benefits of using EdgeTPUs are confirmed by comparing the inference throughput against other edge devices’ (CPU- and GPU-based) throughput results. As shown in Figure 7,
Fig. 7. Comparison of inference throughput in CPU, GPU, and EdgeTPU. The throughput results of CPU- and GPU-based inferences are the maximum throughput results of those devices amongst all three frameworks. Please note that USB-Accelerator’s throughput in this graph is the maximum throughput from the results reported in Table 5.
Compared to GPU-based inferencing on
Moreover, we observed that
5.3 Analysis of Factors for Influencing DL Inference Throughput with Single-tenancy
In this subsection, we further discuss the analysis of factors that can affect DL inference throughput on edge devices and EdgeTPUs when employing DL single-tenancy.
Correlation Analysis Between System Factors and DL Inference Throughput. We first performed a correlation analysis to investigate the factors that change the DL inference throughput on edge devices and EdgeTPUs. The correlation analysis was performed by calculating the Pearson correlation coefficient, \(\frac{cov(x,y)}{\sigma _x\sigma _y} = \frac{\sum _{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y})}{\sqrt {\sum _{i=1}^{n}(x_i-\overline{x})^2}\sqrt {\sum _{i=1}^{n}(y_i-\overline{y})^2}}\), between the measured throughput results and resource usage statistics [26]. This coefficient represents the linear relationship between two variables, ranging from \(-1\) to 1: a coefficient of 1 indicates a perfect positive correlation, negative values indicate an inverse correlation, and 0 means there is no linear correlation between the two variables.
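For reference, the coefficient can be computed directly from two equal-length samples; a minimal pure-Python sketch:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Perfectly linear relation -> coefficient of 1.0
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```
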
Figure 8 shows the correlated factors for the DL inference throughput when using CPUs, GPUs, and EdgeTPUs. For the CPU-based inferences on
Fig. 8. Correlated factors that change the inference throughput. (BS: Batch Size, CPU: CPU usage, MEM: memory usage, PW: Power consumption, USB-IO: USB IO bandwidth usage).
| Model | Batch Size | Avg. Throughput | Avg. CPU Usage (%) | Model | Batch Size | Avg. Throughput | Avg. CPU Usage (%) |
|---|---|---|---|---|---|---|---|
| | 1 | 2.85 | 53.9 | | 1 | 4.05 | 60.9 |
| | 32 | 4.63 | 100.0 | | 32 | 4.92 | 87.4 |
| | 1 | 0.53 | 76.5 | | 1 | 2.61 | 73.1 |
| | 32 | 0.56 | 100.0 | | 32 | 2.90 | 98.3 |
| | 1 | 1.02 | 81.0 | | 1 | 1.16 | 72.8 |
| | 32 | 0.95 | 100.0 | | 32 | 1.34 | 98.1 |
| | 1 | 4.14 | 59.2 | | 1 | 5.89 | 53.3 |
| | 32 | 5.51 | 93.0 | | 32 | 7.86 | 85.8 |
Table 6. Change in CPU Usage and Inference Throughput (on TensorFlow) with Varying Batch Sizes in RPi4
For the GPU-based inference tasks on
For the inference tasks on EdgeTPU accelerators (especially
Impact of USB Bandwidth on USB-Accelerator.
Fig. 9. Difference in DL inference throughput and data transfer with USB 2.0 and 3.0 interfaces. (DT: Data Transfer Rate).
6 EVALUATION WITH AI MULTI-TENANCY
CME and DMP are two approaches to maximizing the DL inference throughput with AI multi-tenancy. This section reports our measurement results with CME (Section 6.1) and DMP (Section 6.2) and discusses the benefits and limitations of both approaches.
6.1 Concurrent Model Executions (CME)
We first describe the evaluation procedure of CME and then report CME measurement results on GPUs on edge devices and EdgeTPUs. Finally, we will discuss CME results with a cluster of EdgeTPUs. In this evaluation, we seek answers to the following research questions:
(1) What is the maximum DL inference throughput of the edge devices and EdgeTPUs with CME?
(2) What is the maximum concurrency level on the edge devices and EdgeTPUs with CME?
(3) What is the concurrency level on edge devices and EdgeTPUs that maximizes DL inference throughput?
(4) What are the benefits and limitations of leveraging multiple EdgeTPUs (a.k.a. a cluster of EdgeTPUs) with CME for maximizing the DL inference throughput?
For the rest of this study, we only use three DL models,
Evaluation Procedure.
Based on the measurement results from the single-tenancy cases (Section 5), we gradually increase the number of co-running DL models (the “concurrency level”) on the devices and EdgeTPU to find the maximum level of concurrency and the throughput improvement with CME. This process continues until the benchmarker fails to run for one of the following reasons: (1) the memory is fully saturated, or (2) the device can no longer create more DL tasks. The concurrency level obtained from the last successful execution is considered the maximum concurrency level supported by the edge devices and EdgeTPUs. In this measurement, we only report the results with leveraging CME on GPUs (
The benchmarking process described in Figure 3 (Section 4) is tweaked such that instead of running a model in the main thread (❻ in Figure 3), new threads are created to run models concurrently (i.e., separate copies of the model are created for each thread). The main thread then waits for all the models to finish execution and finally terminates the script, followed by steps similar to the previous workflow.
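The tweaked benchmarking workflow can be sketched as follows. This is a minimal illustration, not the paper's benchmarker: a NumPy matrix product stands in for the real model forward pass, and the names `run_model` and `cme_throughput` are hypothetical. The structure, however, mirrors the description above: one thread per model copy, with the main thread waiting for all of them to finish:

```python
import threading
import time
import numpy as np

def run_model(weights, n_inferences, counts, idx, batch):
    # Each thread owns a separate copy of the (stand-in) model weights.
    done = 0
    for _ in range(n_inferences):
        _ = batch @ weights  # placeholder for model.forward(batch)
        done += 1
    counts[idx] = done

def cme_throughput(concurrency, n_inferences=50, batch_size=8):
    """Aggregate inferences/second with `concurrency` co-running model copies."""
    rng = np.random.default_rng(0)
    batch = rng.standard_normal((batch_size, 256))
    counts = [0] * concurrency
    threads = []
    start = time.perf_counter()
    for i in range(concurrency):
        weights = rng.standard_normal((256, 10))  # separate copy per thread
        t = threading.Thread(target=run_model,
                             args=(weights, n_inferences, counts, i, batch))
        threads.append(t)
        t.start()
    for t in threads:  # main thread waits for all models to finish
        t.join()
    elapsed = time.perf_counter() - start
    return sum(counts) / elapsed
```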
6.1.2 CME Evaluation Results on GPU in Edge Devices.
The next evaluation is to measure the DL inference throughput of GPUs with CME and increasing concurrency levels using PyTorch (Figure 10) and MXNet (Figure 11), respectively, on
Fig. 10. Concurrency measurement results on J.Nano, J.TX2, and J.Xavier GPUs with PyTorch (BS: Batch Size).
Fig. 11. Concurrency measurement results on J.Nano, J.TX2, and J.Xavier GPUs with MXNet (BS: Batch Size).
Input batch size and concurrency level complemented each other in the performance gain, as both approaches rely on running multiple inferences simultaneously. However, due to memory and CPU constraints on edge devices, we could not increase both indefinitely to maximize performance. In our study, a concurrency level of 5 to 6 with a batch size of 8 yielded the maximum empirical throughput improvement; beyond that point, increasing either parameter (concurrency level or batch size) resulted in lower performance.
The level of concurrency was directly related to the size of the model and the available memory in edge devices.
Fig. 12. Resource utilization and inference throughput changes with CME (PyTorch). J.Nano uses a batch size of 4, and J.TX2 employs a batch size of 8.
6.1.2 CME Evaluation Results on EdgeTPUs.
The second evaluation for CME was to measure the DL inference throughput on EdgeTPUs, and Figure 13 shows CME results on TPUs (both
Fig. 13. Results of CME measurement on EdgeTPU.
Like the previous GPU results, CME on EdgeTPUs also increased throughput over the single-tenancy cases. For
We found two interesting observations about the throughput improvement. One is that CME’s throughput increase with
The second observation (Figure 13(b) and 13(c))–maximum throughput gain of
Furthermore, all three models reported much higher concurrency levels on EdgeTPUs than on GPUs. For example,
Regarding the varying concurrency levels, Figure 14 shows resource utilization changes with different concurrency levels measured on
Fig. 14. Resource utilization changes with increased concurrency level (EdgeTPUs).
We also measured the changes in the host edge device’s memory and USB bandwidth as the throughput changed. We observed that memory utilization kept increasing while the inference throughput degraded after the peak, whereas the USB bandwidth remained stable after reaching the peak throughput.
Finally, compared with CME on GPUs, the maximum throughput of EdgeTPUs was nearly 230 inferences per second when running concurrent
6.1.3 CME Evaluation Results on EdgeTPU Cluster.
As discussed in the previous subsection,
Fig. 15. EdgeTPU-cluster composed of four EdgeTPUs (USB-Accelerator) connected with J.Xavier.
We started the experiment by running each of three quantized models (
Fig. 16. DL inference throughput variation with multiple USB-Accelerators (EdgeTPU cluster).
The EdgeTPU-cluster with three or four
Fig. 17. Total USB bandwidth usage and bandwidth consumed by each USB-Accelerator when using EdgeTPU-cluster with J.TX2.
6.2 Dynamic Model Placements (DMP)
This subsection evaluates the DMP technique for AI multi-tenancy on edge devices and EdgeTPUs. DMP allows running multiple DL models simultaneously by placing DL models on an edge device’s resource (CPU and/or GPU) and other DL models on EdgeTPUs. Because
(1) What are the performance benefits (e.g., DL inference throughput) of DMP on heterogeneous resources?
(2) What are the actual performance penalties of using DMP, compared to CME, for AI multi-tenancy?
(3) What are the performance benefits and limitations of using an EdgeTPU-cluster for DMP?
Similar to the previous CME evaluations, we used three DL models (
We initially used all edge devices connected with
6.2.1 DMP Evaluation Results on Edge Device and a Single USB-Accelerator.
As described above, CME was also enabled for DMP. The first step of this evaluation was to find an empirically optimal concurrency level that could produce the maximum DL inference throughput. While Section 6.1 reported the throughput changes with different concurrency levels on either the GPU or the EdgeTPU, such high concurrency levels may not be achievable for DMP because the edge device needs to manage multiple inference tasks on both the GPU and the EdgeTPU, creating contention for the edge device’s resources (e.g., memory). Therefore, we re-measured the throughput changes with different concurrency levels on both GPU and EdgeTPU for DMP. The evaluation results are shown in Figure 18. As expected, much lower levels of concurrency were supported by edge devices and
Fig. 18. Throughput changes with different concurrency levels on both GPU and EdgeTPU when enabling DMP. We omit the results of MobileNet-V2 because the results are similar to the results of MobileNet-V1 (Figure 18(b)).
Then, we measured the overall throughput with DMP; the evaluation used the concurrency levels for GPU and EdgeTPU that produced the maximum overall (accumulated) DL inference throughput. Figure 19 shows DMP’s DL inference throughput improvement over the single-tenancy cases. All
Fig. 19. Comparison of DL inference throughput between DMP and single-tenancy.
Figure 20 reports the throughput comparison between (ideal) CME results and DMP results. The figure contains the results measured from J.TX2 when using PyTorch/MXNet (for GPU) and TFLite (for EdgeTPU). Please note that we omit the results from
Fig. 20. J.TX2’s DL inference throughput comparison between (ideal) results from CME and DMP. The (ideal) results from CME are calculated by the sum of separately measured CME throughput on GPU and EdgeTPU.
To understand the gap between DMP’s throughput and the ideal throughput, we performed a further analysis of resource consumption. Figure 21 compares the resource utilization (CPU, memory, and USB IO) of the ideal sum of CME on GPU/EdgeTPU (measured in Section 6.1) and of DMP. The figure shows that the ideal throughput often cannot be achieved with the current hardware specifications: to reach such high throughput, CPU (Figure 21(a)) and memory (Figure 21(b)) utilization would have to exceed the hardware limits of the edge devices (more than 100%). Moreover, similar to the CME analysis, memory was identified as a critical resource when enabling DMP; we observed that memory utilization reached 100% with DMP while CPU utilization did not. Based on this observation, the DL inference throughput at the point where memory is saturated can be regarded as the empirical performance upper bound for DMP. We also observed that resource contention could impact the DL inference throughput because shared resources, such as memory and CPU, were needed to manage the multiple DL models running on different processors. The decreased USB IO utilization with DMP (about 8% to 15%; Figure 21(c)) was caused by such resource contention, and the reduced USB IO utilization could, in turn, decrease the DL inference throughput of the USB-Accelerator.
Fig. 21. Resource usage comparison between (ideal) sum of CME on GPU/EdgeTPU and DMP.
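The DMP execution pattern evaluated in this subsection, one inference worker per processor with the main thread accumulating their throughput, can be sketched as below. The `gpu_infer`/`tpu_infer` callables are hypothetical placeholders (a short sleep stands in for one batched forward pass); in the actual setup they would wrap a PyTorch or MXNet model on the GPU and a TFLite model delegated to the EdgeTPU:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-resource inference callables; the sleep stands in for
# one batched forward pass on that processor.
def gpu_infer(batch_size=8):
    time.sleep(0.001)
    return batch_size  # images inferred in this call

def tpu_infer(batch_size=8):
    time.sleep(0.001)
    return batch_size

def dmp_throughput(n_batches=100):
    """Aggregate inferences/second with one worker per processor."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both workers run simultaneously, one per (simulated) processor.
        gpu_total = pool.submit(lambda: sum(gpu_infer() for _ in range(n_batches)))
        tpu_total = pool.submit(lambda: sum(tpu_infer() for _ in range(n_batches)))
        total = gpu_total.result() + tpu_total.result()
    return total / (time.perf_counter() - start)
```

In the real system the two workers contend for shared host resources (CPU time, memory, USB IO), which is why the measured DMP throughput falls short of the ideal sum shown in Figure 21.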
6.2.2 DMP Evaluation Results on Edge Device and EdgeTPU-cluster.
The next evaluation measures the throughput improvements of DMP when leveraging an EdgeTPU-cluster. As shown in Section 6.1.3, an EdgeTPU-cluster with two accelerators already achieved almost the maximum performance improvement due to the limitation of USB bandwidth. Therefore, this evaluation uses an EdgeTPU-cluster with two accelerators.
Figure 22 reports DL inference throughput changes from DMP employing EdgeTPU-cluster. Please note that the figure only shows the results from
Fig. 22. Throughput comparison between DMP with 1 EdgeTPU and DMP with 2 EdgeTPUs (EdgeTPU-Cluster).
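One simple way to spread inference requests over such a two-accelerator EdgeTPU-cluster is round-robin dispatch over per-accelerator workers. The sketch below uses plain Python callables as stand-ins; in practice each worker would be a TFLite interpreter created with an EdgeTPU delegate bound to one specific accelerator (e.g., via a delegate option such as `{'device': 'usb:0'}`, which is an assumption about the delegate API, not something taken from this study):

```python
import itertools

class EdgeTPUCluster:
    """Round-robin dispatcher over per-accelerator inference workers."""
    def __init__(self, workers):
        self._cycle = itertools.cycle(workers)

    def infer(self, frame):
        # Forward each request to the next accelerator in turn.
        return next(self._cycle)(frame)

# Two hypothetical workers that simply count how many requests they served.
calls = {0: 0, 1: 0}

def make_worker(i):
    def worker(frame):
        calls[i] += 1
        return i  # stand-in for the classification result
    return worker

cluster = EdgeTPUCluster([make_worker(0), make_worker(1)])
for frame in range(6):
    cluster.infer(frame)
# the 6 requests are split evenly: 3 per accelerator
```

Round-robin keeps both accelerators busy without coordination, but, as Figure 17 suggests, the shared USB bus remains the limiting factor regardless of how evenly the requests are distributed.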
7 RELATED WORK
Several studies have been conducted to quantify the performance of various edge devices for DL and ML inference tasks [25, 34, 48, 49, 54, 57, 73]. However, most of these studies have focused on characterizing the performance (e.g., latency and throughput) and efficiency (e.g., energy consumption) of edge devices and AI accelerators with single DL tasks.
pCamp [73] evaluated ML packages and frameworks’ performance when executing image classification tasks on edge platforms, including
More recently, Liang et al. [48] have conducted an experimental study to evaluate model splitting and compression techniques on edge devices and accelerators when performing co-inference tasks with clouds. Network latency, bandwidth usage, and resource utilization with various configurations were also reported when applying model splitting and compression to cloud-edge co-inference use cases. Additionally, the authors evaluated concurrent model executions for multi-tenancy use cases. However, their concurrency evaluation is narrow, covering only one model with a single batch size. Moreover, in addition to evaluating the CME strategy, our work also evaluates and characterizes the DMP strategy for AI multi-tenancy, which leverages heterogeneous resources on edge devices and EdgeTPUs.
8 CONCLUSION
This study investigated system approaches to maximize the DL inference throughput on resource-constrained edge devices and EdgeTPU accelerators with AI multi-tenancy.
We first evaluated various DL models’ performance with image classification tasks on edge devices and AI accelerators, including CPU, GPU, and EdgeTPU. Based on the evaluation, we further investigated three system approaches for maximizing DL inference throughput. Batched inferencing is the approach for maximizing the throughput with DL single-tenancy use cases. With batched inferencing, GPU-equipped devices showed significant throughput improvement as multiple images could be processed in parallel on the GPU resources. We then explored the feasibility and effectiveness of AI multi-tenancy at the edge. Notably, two approaches were applied—CME and DMP. CME exploits the available system resources (CPU, memory, and GPU) to load more models into the system and process multiple inference tasks in parallel. DMP, on the other hand, leverages available, heterogeneous computing resources by placing models on different processors (GPU and EdgeTPU) and processes DL inference tasks on both the processors/accelerators simultaneously. Our evaluation results confirmed that CME and DMP were viable and successfully improved the system’s overall throughput, including GPU and EdgeTPU, by a significant factor.
However, we also observed limitations of the three approaches that will be future research explorations. For batched inferencing, the performance improvements start decreasing once the batch size exceeds a certain threshold (e.g., the number of GPU cores). Besides, due to the limited memory size, there was a limit to the number of input images that could be loaded into memory simultaneously. For CME and DMP with multi-tenancy, we observed diminishing returns once the number of concurrently processed models exceeded the number of concurrent threads (or cores) supported by the CPUs. System memory also turned out to be a bottleneck as we increased the number of concurrent models. Finally, since USB bandwidth drove the rate at which
This study confirmed that AI multi-tenancy on edge devices is a promising technique to improve the performance of DL tasks. Further studies on the strategic placement of models to minimize resource contention, and on isolation mechanisms for dynamic control of DL inference throughput, can push the performance boundaries of DL inferencing. In addition, since multi-tenant applications share the same system memory, a thorough analysis of the security of individual applications (i.e., isolation from other models) is necessary for techniques like CME or DMP to be suitable for deployment.
Footnotes
1. The USB-Accelerator only has 8 MB of cache memory (SRAM).
References
- [1] 2020. Torchvision 0.5.0. Retrieved from https://pytorch.org/vision/. Accessed 9/12/2020.
- [2] 2021. Azure AI. Retrieved from https://azure.microsoft.com/en-us/overview/ai-platform/. Accessed 2/8/2021.
- [3] 2021. Cloud AI – Google Cloud. Retrieved from https://cloud.google.com/products/ai/. Accessed 2/12/2021.
- [4] 2021. IBM Watson Machine Learning. Retrieved from https://www.ibm.com/cloud/machine-learning. Accessed 2/12/2021.
- [5] 2021. Machine Learning on AWS. Retrieved from https://aws.amazon.com/machine-learning/. Accessed 2/13/2021.
- [6] 2022. Coral Dev Board datasheet. Retrieved from https://coral.ai/docs/dev-board/datasheet/. Accessed 2/16/2022.
- [7] 2022. Coral USB Accelerator datasheet. Retrieved from https://coral.ai/docs/accelerator/datasheet/. Accessed 1/27/2022.
- [8] 2022. Edge TPU Python API overview. Retrieved from https://coral.ai/docs/edgetpu/api-intro/. Accessed 1/27/2022.
- [9] 2022. Environment Variables – MXNet v1.7.0. Retrieved from https://mxnet.apache.org/versions/1.7.0/api/faq/env_var. Accessed 2/2/2022.
- [10] 2022. INA219–26V, 12-bit, i2c output current/voltage/power monitor. Retrieved from https://www.ti.com/product/INA219. Accessed 2/2/2022.
- [11] 2022. Intel Neural Compute Stick. Retrieved from https://ark.intel.com/content/www/us/en/ark/products/140109/intel-neural-compute-stick-2.html. Accessed 2/4/2022.
- [12] 2022. Jetson Nano | Nvidia Developer. Retrieved from https://developer.nvidia.com/embedded/jetson-nano. Accessed 2/3/2022.
- [13] 2022. kerascv 0.0.40. Retrieved from https://pypi.org/project/kerascv/. Accessed 2/3/2022.
- [14] 2022. NVIDIA Jetson Linux Developer Guide: Clock Frequency and Power Management. Retrieved from https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/clock_power_setup.html#. Accessed 2/5/2022.
- [15] 2022. Nvidia Jetson TX2. Retrieved from https://developer.nvidia.com/embedded/jetson-tx2. Accessed 2/5/2022.
- [16] 2022. NVIDIA Jetson Xavier NX. Retrieved from https://developer.nvidia.com/embedded/jetson-xavier-nx.
- [17] 2022. NVPModel – Nvidia Jetson TX2 Dev. Kit. Retrieved from https://www.jetsonhacks.com/2017/03/25/nvpmodel-nvidia-jetson-tx2-development-kit/. Accessed 2/5/2022.
- [18] 2022. ODROID-N2. Retrieved from https://wiki.odroid.com/odroid-n2/odroid-n2. Accessed 2/5/2022.
- [19] 2022. pi-ina219 1.4.0. Retrieved from https://pypi.org/project/pi-ina219/. Accessed 2/5/2022.
- [20] 2022. Raspberry Pi 4. Retrieved from https://www.raspberrypi.org/products/raspberry-pi-4-model-b/. Accessed 2/5/2022.
- [21] 2022. TensorFlow Lite – ML for Mobile and Edge Devices. Retrieved from https://www.tensorflow.org/lite. Accessed 2/3/2022.
- [22] 2022. tf.Graph – TensorFlow v2.4.1. Retrieved from https://www.tensorflow.org/api_docs/python/tf/Graph. Accessed 2/3/2022.
- [23] 2022. tf.hub – TensorFlow Hub. Retrieved from https://www.tensorflow.org/hub. Accessed 2/3/2022.
- [24] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation.
- [25] 2019. EmBench: Quantifying performance variations of deep neural networks across modern commodity devices. In The 3rd International Workshop on Deep Learning for Mobile Systems and Applications. 1–6.
- [26] 2008. On the importance of the Pearson correlation coefficient in noise reduction. IEEE Transactions on Speech and Audio Processing 16, 4 (2008), 757–765.
- [27] 2019. Deep learning with edge computing: A review. Proceedings of the IEEE 107, 8 (2019), 1655–1674.
- [28] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation.
- [29] 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
- [30] 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017).
- [31] 2017. Comparison of edge computing implementations: Fog computing, cloudlet and mobile edge computing. In Proceedings of the Global Internet of Things Summit. IEEE, Geneva, Switzerland, 1–6.
- [32] 2021. Scaling deep contrastive learning batch size under memory limited setup. arXiv preprint arXiv:2101.06983 (2021).
- [33] He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, Shuai Zheng, and Yi Zhu. 2020. GluonCV and GluonNLP: Deep learning in computer vision and natural language processing. Journal of Machine Learning Research 21, 23 (2020), 23:1–23:7.
- [34] 2019. Characterizing the deployment of deep neural networks on commercial edge devices. In Proceedings of the IEEE International Symposium on Workload Characterization. IEEE, Orlando, FL, 35–48.
- [35] 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the 4th International Conference on Learning Representations.
- [36] Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, and Xiaodong Wang. 2018. Applied machine learning at facebook: A datacenter infrastructure perspective. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture. 620–629.
- [37] 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, Las Vegas, NV, 770–778.
- [38] 2018. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the 15th European Conference on Computer Vision. Springer, Munich, Germany, 815–832.
- [39] 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
- [40] 2019. Dynamic adaptive DNN surgery for inference acceleration on the edge. In Proceedings of the IEEE Conference on Computer Communications. IEEE, Paris, France, 1423–1431.
- [41] 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [42] 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
- [43] 2020. MNN: A universal and efficient inference engine. In Proceedings of the 3rd Conference on Machine Learning and Systems.
- [44] 2017. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems.
- [45] 2019. Edge computing: A survey. Future Generation Computer Systems 97 (2019), 219–235.
- [46] 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems.
- [47] 2020. Edge AI: On-demand accelerating deep neural network inference via edge computing. IEEE Transactions on Wireless Communications 19, 1 (2020), 447–457.
- [48] 2020. AI on the edge: Characterizing AI-based IoT applications using specialized edge architectures. In Proceedings of the IEEE International Symposium on Workload Characterization.
- [49] 2020. Benchmarking performance and power of USB accelerators for inference with MLPerf. In Proceedings of the International Workshop on Accelerated Machine Learning.
- [50] 2019. Edge computing for autonomous driving: Opportunities and challenges. Proceedings of the IEEE 107, 8 (2019), 1697–1716.
- [51] 2019. Towards deep learning using TensorFlow Lite on RISC-V. In Proceedings of the 3rd Workshop on Computer Architecture Research with RISC-V. Phoenix, AZ.
- [52] 2020. Distributed inference acceleration with adaptive DNN partitioning and offloading. In Proceedings of the IEEE Conference on Computer Communications. IEEE, Toronto, ON, Canada, 854–863.
- [53] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Annual Conference on Neural Information Processing Systems.
- [54] 2022. EdgeFaaSBench: Benchmarking edge devices using serverless computing. In Proceedings of the IEEE International Conference on Edge Computing.
- [55] 2017. Serving at the edge: A scalable IoT architecture based on transparent computing. IEEE Network 31, 5 (2017), 96–105.
- [56] 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
- [57] 2019. Resource characterisation of personal-scale sensing models on edge accelerators. In Proceedings of the International Workshop on Challenges in Artificial Intelligence and Machine Learning for Internet of Things.
- [58] 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [59] 2021. ChatterHub: Privacy invasion via smart home hub. In Proceedings of the IEEE International Conference on Smart Computing. IEEE, 1–8.
- [60] 2016. Edge computing: Vision and challenges. IEEE Internet of Things Journal 3, 5 (2016), 637–646.
- [61] 2016. The promise of edge computing. IEEE Computer 49, 5 (2016), 78–81.
- [62] 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations.
- [63] 2018. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820 (2018).
- [64] 2017. A Berkeley view of systems challenges for AI. arXiv preprint arXiv:1712.05855 (2017).
- [65] 2021. AI multi-tenancy on edge: Concurrent deep learning model executions and dynamic model placements on edge devices. In Proceedings of the 14th IEEE International Conference on Cloud Computing.
- [66] 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [67] 2013. Mini-batch primal and dual methods for SVMs. In Proceedings of the International Conference on Machine Learning. PMLR, 1022–1030.
- [68] 2019. HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Computer Vision Foundation / IEEE, Long Beach, CA, 8612–8620.
- [69] 2020. A systematic methodology for analysis of deep learning hardware and software platforms. In Proceedings of the Conference on Machine Learning and Systems.
- [70] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Peter Vajda, Xiaodong Wang, Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, Sungjoo Yoo, and Peizhao Zhang. 2019. Machine learning at facebook: Understanding inference at the edge. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture. Washington, DC, 331–344.
- [71] 2015. The cloud is not enough: Saving IoT from the cloud. In Proceedings of the 7th USENIX Workshop on Hot Topics in Cloud Computing. USENIX Association, Santa Clara, CA.
- [72] 2019. Eye in the sky: Drone-based object tracking and 3D localization. In Proceedings of the ACM International Conference on Multimedia.
- [73] 2018. pCAMP: Performance comparison of machine learning packages on the edges. In Proceedings of the USENIX Workshop on Hot Topics in Edge Computing.
- [74] 2018. User perceptions of smart home IoT privacy. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 200:1–200:20.
- [75] 2019. Edge intelligence: Paving the last mile of artificial intelligence with edge computing. Proceedings of the IEEE 107, 8 (2019), 1738–1762.
- [76] 2017. Understanding the security of discrete GPUs. In Proceedings of the General Purpose GPUs. 1–11.
Reaching for the Sky: Maximizing Deep Learning Inference Throughput on Edge Devices with AI Multi-Tenancy