Prophet: Precise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers

Published: 04 April 2017

Abstract

Guaranteeing the Quality-of-Service (QoS) of latency-sensitive applications while improving server utilization through application co-location is important yet challenging in modern datacenters. The key challenge is that when applications are co-located on a server, performance interference due to resource contention can be detrimental to application QoS. Although prior work has proposed techniques that identify "safe" co-locations, where application QoS is satisfied, by predicting performance interference on multicores, no such prediction technique exists for accelerators such as GPUs.

In this work, we present Prophet, an approach that precisely predicts the performance degradation of latency-sensitive applications on accelerators due to application co-location. Through a real-system investigation of performance interference on accelerators, we found that unlike on multicores, where the key contended resources are shared caches and main memory bandwidth, the key contended resources on accelerators are processing elements, accelerator memory bandwidth, and PCIe bandwidth. Based on this observation, we designed interference models that enable precise prediction of processing element, accelerator memory bandwidth, and PCIe bandwidth contention on real hardware. Using a novel technique that forecasts the solo-run execution traces of co-located applications with these interference models, Prophet accurately predicts the performance degradation of latency-sensitive applications on non-preemptive accelerators. With Prophet, we can identify "safe" co-locations on accelerators that improve utilization without violating the QoS target. Our evaluation shows that Prophet predicts performance degradation with an average error of 5.47% on real systems. Based on its predictions, Prophet improves accelerator utilization by 49.9% on average while maintaining the QoS target of latency-sensitive applications.
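The abstract's core idea, predicting per-resource contention from solo-run profiles and admitting only co-locations that keep predicted latency within the QoS target, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual models: the `KernelProfile` fields, the linear "overflow" contention rule, and all numeric values are assumptions made up for illustration, whereas Prophet's real interference models are fit from real-hardware measurements.

```python
# Hypothetical sketch of Prophet-style "safe co-location" screening.
# All profile fields, coefficients, and numbers here are illustrative
# assumptions, not values or models from the paper.

from dataclasses import dataclass


@dataclass
class KernelProfile:
    """Solo-run profile of a GPU kernel (assumed measurable offline)."""
    solo_latency_ms: float
    pe_occupancy: float    # fraction of processing elements used
    mem_bw_share: float    # fraction of accelerator memory bandwidth used
    pcie_bw_share: float   # fraction of PCIe bandwidth used


def predict_degradation(victim: KernelProfile, aggressor: KernelProfile) -> float:
    """Toy contention model: slowdown accrues once combined demand on a
    resource exceeds its capacity (1.0). A stand-in for the fitted
    per-resource interference models described in the abstract."""
    slowdown = 1.0
    for v, a in [(victim.pe_occupancy, aggressor.pe_occupancy),
                 (victim.mem_bw_share, aggressor.mem_bw_share),
                 (victim.pcie_bw_share, aggressor.pcie_bw_share)]:
        overflow = max(0.0, v + a - 1.0)
        slowdown += overflow  # each oversubscribed resource adds delay
    return slowdown


def is_safe_colocation(victim: KernelProfile, aggressor: KernelProfile,
                       qos_target_ms: float) -> bool:
    """Admit the co-location only if predicted latency meets the QoS target."""
    predicted = victim.solo_latency_ms * predict_degradation(victim, aggressor)
    return predicted <= qos_target_ms


# Example: a latency-sensitive kernel screened against a batch kernel.
latency_sensitive = KernelProfile(solo_latency_ms=10.0, pe_occupancy=0.6,
                                  mem_bw_share=0.5, pcie_bw_share=0.2)
batch = KernelProfile(solo_latency_ms=50.0, pe_occupancy=0.7,
                      mem_bw_share=0.7, pcie_bw_share=0.3)

# PE and memory bandwidth are oversubscribed (0.3 + 0.2 overflow), so the
# predicted slowdown is 1.5x: 10 ms -> 15 ms.
print(is_safe_colocation(latency_sensitive, batch, qos_target_ms=16.0))  # True
print(is_safe_colocation(latency_sensitive, batch, qos_target_ms=14.0))  # False
```

In this toy version, a scheduler would run such a check before dispatching a batch kernel alongside a latency-sensitive one, falling back to solo execution when the predicted degradation would violate the QoS target.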

