research-article

Model-driven Cluster Resource Management for AI Workloads in Edge Clouds

Published: 27 March 2023

Abstract

Since emerging edge applications such as Internet of Things (IoT) analytics and augmented reality have tight latency constraints, hardware AI accelerators have recently been proposed to speed up deep neural network (DNN) inference run by these applications. Resource-constrained edge servers and accelerators tend to be multiplexed across multiple IoT applications, introducing the potential for performance interference between latency-sensitive workloads. In this article, we design analytic models to capture the performance of DNN inference workloads on shared edge accelerators, such as GPUs and Edge TPUs, under different multiplexing and concurrency behaviors. After validating our models using extensive experiments, we use them to design various cluster resource management algorithms to intelligently manage multiple applications on edge accelerators while respecting their latency constraints. We implement a prototype of our system in Kubernetes and show that our system can host 2.3× more DNN applications in heterogeneous multi-tenant edge clusters with no latency violations when compared to traditional knapsack hosting algorithms.
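The latency-aware placement problem the abstract contrasts with knapsack hosting can be illustrated with a minimal sketch. This is not the paper's actual model or algorithm: the M/M/1 latency estimate, the greedy tightest-SLO-first ordering, and all names (`App`, `Accelerator`, `place`) are illustrative assumptions chosen to show the general idea of admitting an application onto a shared accelerator only if every co-located workload still meets its latency constraint.

```python
from dataclasses import dataclass, field

@dataclass
class App:
    name: str
    rate: float     # mean request arrival rate (req/s)
    service: float  # mean inference time on this accelerator (s)
    slo: float      # latency SLO (s)

@dataclass
class Accelerator:
    name: str
    apps: list = field(default_factory=list)

    def projected_latency(self, extra=None):
        """Estimate mean response time if `extra` were co-located here,
        treating the accelerator as a single M/M/1 queue shared by all apps."""
        apps = self.apps + ([extra] if extra else [])
        total_rate = sum(a.rate for a in apps)
        if total_rate == 0:
            return 0.0
        # Rate-weighted mean service time across co-located apps.
        mean_service = sum(a.rate * a.service for a in apps) / total_rate
        mu = 1.0 / mean_service
        if total_rate >= mu:
            return float("inf")  # unstable: the queue grows without bound
        return 1.0 / (mu - total_rate)  # M/M/1 mean response time

def place(apps, accelerators):
    """Greedily place apps, tightest SLO first, admitting an app onto an
    accelerator only if the projected shared latency satisfies its own SLO
    and the SLOs of every app already placed there."""
    placement = {}
    for app in sorted(apps, key=lambda a: a.slo):
        for acc in accelerators:
            w = acc.projected_latency(extra=app)
            if w <= app.slo and all(w <= a.slo for a in acc.apps):
                acc.apps.append(app)
                placement[app.name] = acc.name
                break
    return placement
```

In this toy setting, a classifier with a 50 ms SLO and a detector with a 100 ms SLO can share one accelerator because the projected queueing delay stays below both SLOs; a real system would replace the M/M/1 estimate with validated interference-aware models per accelerator type.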


Published in ACM Transactions on Autonomous and Adaptive Systems, Volume 18, Issue 1 (March 2023), 82 pages. ISSN: 1556-4665; EISSN: 1556-4703. DOI: 10.1145/3589019.

Publisher: Association for Computing Machinery, New York, NY, United States.

Publication History

• Received: 22 December 2021
• Revised: 1 July 2022
• Accepted: 10 January 2023
• Online AM: 25 January 2023
• Published: 27 March 2023
