Research Article · Public Access

Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers

Published: 25 March 2016

Abstract

Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different from contention on multicore CPUs and introduces a new set of challenges for reducing QoS violations. To address this open problem, we first identify the underlying causes of QoS violations in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an Nvidia K40 GPU, our evaluation shows that Baymax improves accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.
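The core scheduling idea the abstract describes can be illustrated with a small sketch. This is not the paper's implementation; all names and numbers here are hypothetical. On a non-preemptive accelerator, a best-effort (batch) task that has been issued occupies the device for its entire duration, so a Baymax-style runtime only issues it if its predicted duration leaves enough headroom for the latency-critical (user-facing) task to meet its QoS target:

```python
# Illustrative sketch of duration-prediction-based task reordering for a
# non-preemptive accelerator. Task names, durations, and the API are
# invented for illustration, not taken from the Baymax paper.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    predicted_ms: float   # predicted device occupancy (kernel + transfer)
    is_lc: bool           # latency-critical (user-facing) vs. best-effort

def can_issue_batch(batch: Task, lc_deadline_ms: float,
                    lc_predicted_ms: float, elapsed_ms: float) -> bool:
    """A best-effort task may be issued only if, after it blocks the
    device for its whole predicted duration (no preemption), the
    latency-critical task can still finish within its QoS target."""
    slack = lc_deadline_ms - elapsed_ms - lc_predicted_ms
    return batch.predicted_ms <= slack

def schedule(queue, lc_deadline_ms, lc_predicted_ms):
    """Return the issue order: best-effort tasks whose predicted duration
    would endanger the LC deadline are deferred until after the LC task."""
    issued, deferred, elapsed = [], [], 0.0
    for t in queue:
        if t.is_lc or can_issue_batch(t, lc_deadline_ms,
                                      lc_predicted_ms, elapsed):
            issued.append(t.name)
            elapsed += t.predicted_ms
        else:
            deferred.append(t.name)
    return issued + deferred

# A 25 ms batch task ahead of a 10 ms LC task with a 30 ms deadline would
# consume the LC task's slack, so it is deferred behind the LC task.
order = schedule([Task("batch-A", 25.0, False), Task("lc-query", 10.0, True)],
                 lc_deadline_ms=30.0, lc_predicted_ms=10.0)
```

In this toy run, `order` comes back as `["lc-query", "batch-A"]`, whereas a 5 ms batch task would fit in the slack and run first. The same headroom check generalizes to PCI-e transfers, the other contention source the abstract identifies.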



Published in

ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16), April 2016, 774 pages
ISSN: 0362-1340 · EISSN: 1558-1160 · DOI: 10.1145/2954679
Editor: Andy Gill

Also appears in: ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, March 2016, 824 pages
ISBN: 9781450340915 · DOI: 10.1145/2872362
General Chair: Tom Conte · Program Chair: Yuanyuan Zhou

Copyright © 2016 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
