Abstract
Interactive web services increasingly drive critical business workloads such as search, advertising, games, shopping, and finance. Whereas work on optimizing parallel programs and distributed server systems has historically focused on average latency and throughput, the primary metric for interactive applications is instead consistent responsiveness, i.e., minimizing the number of requests that miss a target latency. This paper is the first to show how to generalize work stealing, which is traditionally used to minimize the makespan of a single parallel job, to optimize for a target latency in interactive services with multiple parallel requests.
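To make the difference between the two metrics concrete, the following minimal sketch compares average latency with the number of requests that miss a target. The latency samples, the 100 ms target, and the function name `count_misses` are hypothetical values chosen for illustration, not measurements from the paper.

```python
from statistics import mean


def count_misses(latencies_ms, target_ms):
    """Return how many requests took longer than the target latency."""
    return sum(1 for t in latencies_ms if t > target_ms)


# Hypothetical latency samples (ms) produced by two imaginary schedulers.
sched_a = [10, 10, 10, 10, 160]  # lower mean, but one request misses the target
sched_b = [45, 45, 45, 45, 45]   # higher mean, no request misses the target
target = 100                     # assumed target latency in ms

mean_a, mean_b = mean(sched_a), mean(sched_b)
misses_a = count_misses(sched_a, target)
misses_b = count_misses(sched_b, target)

print(f"A: mean={mean_a} ms, misses={misses_a}")
print(f"B: mean={mean_b} ms, misses={misses_b}")
```

In this toy data, scheduler A wins on average latency while scheduler B wins on the target-latency metric, which is the metric the abstract argues matters for interactive services.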
We design a new adaptive work-stealing policy, called tail-control, that reduces the number of requests that miss a target latency. It uses instantaneous request progress, system load, and a target latency to choose when to parallelize requests with stealing, when to admit new requests, and when to limit parallelism of large requests. We implement this approach in the Intel Threading Building Blocks (TBB) library and evaluate it on real-world and synthetic workloads. The tail-control policy substantially reduces the number of requests exceeding the desired target latency and delivers up to 58% relative improvement over various baseline policies. This generalization of work stealing for multiple requests effectively optimizes the number of requests that complete within a target latency, a key metric for interactive services.
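The three decisions named above (steal, admit, limit) can be pictured with a small sketch of what an idle worker might do. This is an illustrative assumption, not the paper's algorithm or the TBB implementation: the abstract does not give the decision rule, so the `Request` fields, the `large_threshold_ms` formula, and the ordering of the checks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    work_done_ms: float  # processing time consumed so far (request progress)
    stealable: bool      # True if it has tasks another worker could steal


def large_threshold_ms(active_requests, target_ms):
    """Hypothetical rule: under higher load, a request counts as large sooner."""
    return target_ms / max(1, active_requests)


def thief_action(active, queued, target_ms):
    """Decide what an idle worker does next.

    Returns ("steal", rid), ("admit", None), or ("idle", None).
    """
    threshold = large_threshold_ms(len(active), target_ms)
    # 1. Prefer adding parallelism to requests not yet classified as large.
    for r in active:
        if r.stealable and r.work_done_ms < threshold:
            return ("steal", r.rid)
    # 2. Otherwise admit a waiting request instead of helping a large one.
    if queued > 0:
        return ("admit", None)
    # 3. Large requests get no additional workers in this sketch.
    return ("idle", None)


# Usage: two active requests, a 100 ms target, so the assumed threshold is 50 ms.
active = [Request(1, 5.0, True), Request(2, 80.0, True)]
print(thief_action(active, queued=3, target_ms=100))
```

The design choice illustrated here is that load and progress jointly determine which requests receive more workers: short requests are sped up by stealing, while requests that have already consumed much of the latency budget are kept from taking workers away from newly arriving ones.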
Work stealing for interactive services to meet target latency
PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming