Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services

Published: 14 March 2015

Abstract

Interactive services, such as Web search, recommendations, games, and finance, must respond quickly to satisfy customers. Achieving this goal requires optimizing tail (e.g., 99th+ percentile) latency. Although every server is multicore, parallelizing individual requests to reduce tail latency is challenging because (1) service demand is unknown when requests arrive; (2) blindly parallelizing all requests quickly oversubscribes hardware resources; and (3) parallelizing the numerous short requests will not improve tail latency. This paper introduces Few-to-Many (FM) incremental parallelization, which dynamically increases parallelism to reduce tail latency. FM uses request service demand profiles and hardware parallelism in an offline phase to compute a policy, represented as an interval table, which specifies when and how much software parallelism to add. At runtime, FM adds parallelism as specified by the interval table indexed by dynamic system load and request execution time progress. The longer a request executes, the more parallelism FM adds. We evaluate FM in Lucene, an open-source enterprise search engine, and in Bing, a commercial Web search engine. FM improves the 99th percentile response time up to 32% in Lucene and up to 26% in Bing, compared to prior state-of-the-art parallelization. Compared to running requests sequentially in Bing, FM improves tail latency by a factor of two. These results illustrate that incremental parallelism is a powerful tool for reducing tail latency.
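The policy described above can be sketched in code: an interval table maps the current system load and a request's elapsed execution time to a target degree of parallelism, so that long-running requests are incrementally granted more threads. The sketch below is illustrative only; the table layout, load buckets, and threshold values are assumptions for exposition, not the paper's actual data structures or numbers.

```python
import bisect

# Hypothetical interval table computed offline from service demand
# profiles. For each system-load bucket, a sorted list of
# (elapsed_ms, target_parallelism) thresholds: the longer a request has
# executed, the more parallelism the policy grants. Under high load the
# policy is more conservative. All values are illustrative.
INTERVAL_TABLE = {
    "low":    [(0, 1), (5, 2), (20, 4), (50, 8)],
    "medium": [(0, 1), (10, 2), (40, 4)],
    "high":   [(0, 1), (60, 2)],
}

def target_parallelism(load_bucket, elapsed_ms):
    """Return the degree of parallelism a request should use after
    running for elapsed_ms under the given load bucket."""
    intervals = INTERVAL_TABLE[load_bucket]
    starts = [t for t, _ in intervals]
    # Pick the last threshold whose start time is <= elapsed_ms.
    i = bisect.bisect_right(starts, elapsed_ms) - 1
    return intervals[i][1]
```

At runtime, a scheduler would periodically re-evaluate `target_parallelism` for each in-flight request and spawn additional worker tasks when the returned degree exceeds the request's current parallelism; short requests finish before ever crossing the first threshold and thus run sequentially.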
