research-article

Work stealing for interactive services to meet target latency

Published: 27 February 2016
Abstract

Interactive web services increasingly drive critical business workloads such as search, advertising, games, shopping, and finance. Whereas the optimization of parallel programs and distributed server systems has historically focused on average latency and throughput, the primary metric for interactive applications is instead consistent responsiveness, i.e., minimizing the number of requests that miss a target latency. This paper is the first to show how to generalize work stealing, which is traditionally used to minimize the makespan of a single parallel job, to optimize for a target latency in interactive services with multiple parallel requests.
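To make the metric concrete, the following minimal sketch contrasts mean latency with the fraction of requests that miss a target. The latency values, the 100 ms target, and the function name `miss_rate` are made-up illustrations, not data or code from the paper.

```python
from statistics import mean

def miss_rate(latencies_ms, target_ms):
    """Fraction of requests whose latency exceeds the target."""
    if not latencies_ms:
        return 0.0
    misses = sum(1 for t in latencies_ms if t > target_ms)
    return misses / len(latencies_ms)

# Hypothetical per-request latencies (ms) under two schedules.
schedule_a = [20, 20, 20, 20, 20, 20, 20, 20, 20, 220]  # low mean, one straggler
schedule_b = [60, 60, 60, 60, 60, 60, 60, 60, 60, 90]   # higher mean, no straggler
TARGET_MS = 100  # assumed target latency

for name, lat in (("A", schedule_a), ("B", schedule_b)):
    print(name, "mean:", mean(lat), "miss rate:", miss_rate(lat, TARGET_MS))
```

In this toy data, schedule A has the lower mean latency yet misses the target more often than schedule B, which is the distinction the abstract draws between average-case metrics and consistent responsiveness.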

We design a new adaptive work-stealing policy, called tail-control, that reduces the number of requests that miss a target latency. It uses instantaneous request progress, system load, and a target latency to choose when to parallelize requests with stealing, when to admit new requests, and when to limit the parallelism of large requests. We implement this approach in the Intel Threading Building Blocks (TBB) library and evaluate it on both real-world and synthetic workloads. The tail-control policy substantially reduces the number of requests exceeding the desired target latency and delivers up to 58% relative improvement over various baseline policies. This generalization of work stealing for multiple requests effectively optimizes the number of requests that complete within a target latency, a key metric for interactive services.
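The three decisions named above (steal, admit, limit) can be pictured with a small toy model. This is a sketch under stated assumptions, not the paper's algorithm or its TBB implementation: the `Request` type, the `cutoffs` table, the function names, and the order in which options are tried are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    elapsed_ms: float               # processing time consumed so far (progress)
    has_stealable_work: bool = True

def large_cutoff_ms(active_count, cutoffs):
    """Return the elapsed-time cutoff above which a request counts as 'large'.

    cutoffs maps a minimum number of active requests (load) to a cutoff in ms.
    The table is hypothetical; a real system would derive it from the target
    latency and the workload.
    """
    chosen = float("inf")  # no entry applies: never treat a request as large
    for min_active in sorted(cutoffs):
        if active_count >= min_active:
            chosen = cutoffs[min_active]
    return chosen

def next_action(running, queued, cutoffs):
    """Decide what an idle worker does: steal, admit, or stay idle."""
    cutoff = large_cutoff_ms(len(running), cutoffs)
    # 1. Parallelize: help a running request that is not yet 'large'.
    for req in running:
        if req.has_stealable_work and req.elapsed_ms < cutoff:
            return ("steal", req.name)
    # 2. Admit: start a waiting request if there is one.
    if queued:
        return ("admit", queued[0].name)
    # 3. Limit parallelism: in this simplified model, large requests
    #    receive no additional workers.
    return ("idle", None)

# Assumed cutoffs: under higher load, requests are classified as large sooner.
cutoffs = {1: 200.0, 4: 50.0}
light = [Request("a", 120.0), Request("b", 10.0)]
heavy = [Request("a", 120.0), Request("c", 80.0),
         Request("d", 60.0), Request("e", 70.0)]
print(next_action(light, [], cutoffs))
print(next_action(heavy, [Request("n", 0.0)], cutoffs))
```

The point of the sketch is only that the same request (here `a`, with 120 ms of progress) may be eligible for more workers at low load but not at high load, so the decision depends jointly on progress and load rather than on either alone.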

  • Published in

    ACM SIGPLAN Notices, Volume 51, Issue 8
    PPoPP '16
    August 2016
    405 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/3016078
    • PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
      February 2016
      420 pages
      ISBN: 9781450340922
      DOI: 10.1145/2851141

    Copyright © 2016 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States
