Abstract
Since emerging edge applications such as Internet of Things (IoT) analytics and augmented reality have tight latency constraints, hardware AI accelerators have recently been proposed to speed up the deep neural network (DNN) inference these applications run. Resource-constrained edge servers and accelerators tend to be multiplexed across multiple IoT applications, introducing the potential for performance interference between latency-sensitive workloads. In this article, we design analytic models that capture the performance of DNN inference workloads on shared edge accelerators, such as GPUs and Edge TPUs, under different multiplexing and concurrency behaviors. After validating our models through extensive experiments, we use them to design cluster resource management algorithms that intelligently manage multiple applications on edge accelerators while respecting their latency constraints. We implement a prototype of our system in Kubernetes and show that it can host 2.3× more DNN applications in heterogeneous multi-tenant edge clusters with no latency violations compared to traditional knapsack hosting algorithms.
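The abstract describes combining analytic latency models with placement algorithms. As an illustrative sketch only (the class, function, and parameter names below are hypothetical, and the paper's actual models are more detailed), a scheduler could approximate each accelerator's mean inference latency with a simple M/M/1 queueing formula and greedily place an application only where its latency constraint, and those of already co-located applications, would still hold:

```python
# Hypothetical sketch: SLO-aware greedy placement using an M/M/1 latency
# estimate. Names and parameters are illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class Accelerator:
    service_rate: float                   # inferences/sec the device sustains
    load: float = 0.0                     # total arrival rate already placed
    apps: list = field(default_factory=list)

    def est_latency(self, extra_rate: float) -> float:
        """M/M/1 mean response time 1/(mu - lambda) with a new app added."""
        lam = self.load + extra_rate
        if lam >= self.service_rate:
            return float("inf")           # placement would saturate the device
        return 1.0 / (self.service_rate - lam)

def place(apps, accelerators):
    """Greedy first-fit: place each (name, rate, slo) app on the first
    accelerator where the predicted shared latency stays within every
    co-located app's SLO as well as the new app's own SLO."""
    placed = {}
    for name, rate, slo in apps:
        for acc in accelerators:
            lat = acc.est_latency(rate)
            if lat <= slo and all(lat <= s for _, _, s in acc.apps):
                acc.load += rate
                acc.apps.append((name, rate, slo))
                placed[name] = acc
                break                     # app admitted; try next app
    return placed                         # apps absent here were rejected
```

In this toy model, apps that would violate any SLO on every accelerator are simply left unplaced, which mirrors the admission-control flavor of latency-aware hosting; a real system would also account for batching, concurrency modes, and device heterogeneity as the article's models do.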
Model-driven Cluster Resource Management for AI Workloads in Edge Clouds