Abstract
Guaranteeing Quality-of-Service (QoS) of latency-sensitive applications while improving server utilization through application co-location is important yet challenging in modern datacenters. The key challenge is that when applications are co-located on a server, performance interference due to resource contention can be detrimental to application QoS. Although prior work has proposed techniques to identify "safe" co-locations, where application QoS is satisfied, by predicting performance interference on multicores, no such prediction technique exists for accelerators such as GPUs.
In this work, we present Prophet, an approach to precisely predict the performance degradation of latency-sensitive applications on accelerators due to application co-location. We analyzed performance interference on accelerators through a real-system investigation and found that, unlike on multicores where the key contentious resources are shared caches and main memory bandwidth, the key contentious resources on accelerators are instead processing elements, accelerator memory bandwidth, and PCIe bandwidth. Based on this observation, we designed interference models that enable precise prediction of processing-element, accelerator-memory-bandwidth, and PCIe-bandwidth contention on real hardware. Using a novel technique that forecasts the solo-run execution traces of co-located applications with these interference models, Prophet can accurately predict the performance degradation of latency-sensitive applications on non-preemptive accelerators. Using Prophet, we can identify "safe" co-locations on accelerators to improve utilization without violating the QoS target. Our evaluation shows that Prophet can predict the performance degradation with an average prediction error of 5.47% on real systems. Meanwhile, based on the prediction, Prophet achieves accelerator utilization improvements of 49.9% on average while maintaining the QoS target of latency-sensitive applications.
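To make the "safe co-location" idea concrete, the sketch below shows a hypothetical Prophet-style admission check: given solo-run resource profiles for a latency-sensitive victim and a candidate co-runner, it predicts the victim's slowdown from contention on the three resources the paper identifies (processing elements, accelerator memory bandwidth, PCIe bandwidth) and admits the pair only if the QoS slack is respected. The linear oversubscription model, the `Profile` fields, and the threshold are illustrative assumptions, not the paper's actual interference models.

```python
# Hypothetical sketch of a Prophet-style "safe co-location" check.
# The additive contention model below is an illustrative assumption,
# NOT the paper's actual interference model.
from dataclasses import dataclass

@dataclass
class Profile:
    """Solo-run resource demands, each normalized to [0, 1] of capacity."""
    pe: float        # processing-element (e.g., SM) occupancy
    mem_bw: float    # accelerator memory bandwidth
    pcie_bw: float   # PCIe bandwidth

def predicted_slowdown(victim: Profile, corunner: Profile) -> float:
    """Predict the victim's slowdown factor (>= 1.0) under co-location.

    Assumption: when combined demand on a resource exceeds its capacity,
    the victim slows down in proportion to the oversubscription.
    """
    slowdown = 1.0
    for v, c in ((victim.pe, corunner.pe),
                 (victim.mem_bw, corunner.mem_bw),
                 (victim.pcie_bw, corunner.pcie_bw)):
        oversub = max(0.0, v + c - 1.0)  # demand beyond capacity
        slowdown += oversub              # additive penalty per resource
    return slowdown

def is_safe_colocation(victim: Profile, corunner: Profile,
                       qos_slack: float = 1.10) -> bool:
    """A co-location is "safe" if predicted latency stays within QoS slack."""
    return predicted_slowdown(victim, corunner) <= qos_slack

# Example: a latency-sensitive DNN inference job and a best-effort batch job.
dnn = Profile(pe=0.6, mem_bw=0.5, pcie_bw=0.2)
batch = Profile(pe=0.3, mem_bw=0.3, pcie_bw=0.1)
print(is_safe_colocation(dnn, batch))  # True: no resource is oversubscribed
```

The actual system forecasts solo-run execution traces and models each resource's contention in far more detail; this sketch only illustrates the decision structure (per-resource interference prediction feeding a QoS admission test).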
Prophet: Precise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers. ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems.