Abstract
Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different from contention on multicore CPUs and introduces a new set of challenges in reducing QoS violations. To address this open problem, we first identify the underlying causes of QoS violations in accelerator-outfitted servers. Our experiments show that queuing delay for compute resources and PCI-e bandwidth contention during data transfer are the two main factors contributing to the long latency tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications while increasing accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an Nvidia K40 GPU, our evaluation shows that Baymax improves accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.
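The core scheduling idea in the abstract — that on a non-preemptive accelerator a best-effort task can be interleaved only when its predicted duration fits in the slack before the earliest user-facing deadline — can be illustrated with a minimal sketch. This is not Baymax's actual implementation; the class name, queue structure, and the assumption of perfectly accurate duration predictions are all hypothetical simplifications for illustration.

```python
import heapq


class BaymaxLikeScheduler:
    """Illustrative sketch (not the paper's implementation): order kernel
    launches on a non-preemptive accelerator using predicted durations.

    User-facing tasks carry absolute deadlines and are kept in a min-heap
    by deadline; batch tasks are kept in FIFO order. A batch kernel is
    issued ahead of a user-facing one only when its predicted duration
    fits in the slack before the earliest deadline, since a running
    kernel cannot be preempted once launched.
    """

    def __init__(self):
        self.user = []    # heap of (deadline, predicted_ms, name)
        self.batch = []   # FIFO list of (predicted_ms, name)

    def submit_user(self, deadline, predicted_ms, name):
        heapq.heappush(self.user, (deadline, predicted_ms, name))

    def submit_batch(self, predicted_ms, name):
        self.batch.append((predicted_ms, name))

    def next_task(self, now):
        """Pick the next kernel to launch at time `now` (same units as
        predicted_ms). Returns (kind, predicted_ms, name) or None."""
        if self.user:
            deadline, pred, name = self.user[0]
            # Slack left over if the user-facing kernel ran after the
            # candidate batch kernel finished.
            slack = deadline - now - pred
            if self.batch and self.batch[0][0] <= slack:
                # Batch kernel fits without endangering the deadline.
                return ("batch",) + self.batch.pop(0)
            heapq.heappop(self.user)
            return ("user", pred, name)
        if self.batch:
            return ("batch",) + self.batch.pop(0)
        return None
```

In a real system the predicted durations would come from a model of kernel execution time (and PCI-e transfer time), and mispredictions would have to be absorbed by the slack margin; this sketch treats predictions as exact.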
Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems