Abstract
Interactive services, such as Web search, recommendations, games, and finance, must respond quickly to satisfy customers. Achieving this goal requires optimizing tail (e.g., 99th+ percentile) latency. Although every server is multicore, parallelizing individual requests to reduce tail latency is challenging because (1) service demand is unknown when requests arrive; (2) blindly parallelizing all requests quickly oversubscribes hardware resources; and (3) parallelizing the numerous short requests will not improve tail latency. This paper introduces Few-to-Many (FM) incremental parallelization, which dynamically increases parallelism to reduce tail latency. FM uses request service demand profiles and hardware parallelism in an offline phase to compute a policy, represented as an interval table, which specifies when and how much software parallelism to add. At runtime, FM adds parallelism as specified by the interval table, indexed by dynamic system load and request execution time progress. The longer a request executes, the more parallelism FM adds. We evaluate FM in Lucene, an open-source enterprise search engine, and in Bing, a commercial Web search engine. FM improves the 99th percentile response time by up to 32% in Lucene and by up to 26% in Bing, compared to prior state-of-the-art parallelization. Compared to running requests sequentially in Bing, FM improves tail latency by a factor of two. These results illustrate that incremental parallelism is a powerful tool for reducing tail latency.
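The runtime policy the abstract describes can be sketched as a table lookup: an offline phase produces, for each system load level, a sorted list of execution-time thresholds and the degree of parallelism to use once a request crosses each threshold. The sketch below is illustrative only; the function name, load levels, and table values are assumptions, not taken from the paper.

```python
# Hypothetical sketch of FM's interval-table lookup (names and values are
# illustrative assumptions). Each load level maps to a list of
# (elapsed_ms_threshold, target_parallelism) pairs, sorted by threshold.
# The longer a request runs, the more parallelism it receives; under heavy
# load, parallelism is added later and capped lower to avoid oversubscription.
INTERVAL_TABLE = {
    "low":    [(0, 1), (5, 2), (15, 4), (40, 8)],
    "medium": [(0, 1), (10, 2), (30, 4)],
    "high":   [(0, 1), (25, 2)],
}

def target_parallelism(load_level: str, elapsed_ms: float) -> int:
    """Return the degree of parallelism assigned to a request that has
    executed for elapsed_ms under the given system load."""
    degree = 1
    for threshold, parallelism in INTERVAL_TABLE[load_level]:
        if elapsed_ms >= threshold:
            degree = parallelism  # crossed this interval boundary
        else:
            break  # thresholds are sorted; no later entry applies
    return degree
```

For example, under low load a request that has run for 20 ms would be assigned a parallelism degree of 4, while under high load the same request would still run with a single thread. Short requests, which dominate the workload, never cross a boundary and stay sequential, which is exactly why blindly parallelizing everything is unnecessary.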
Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems