
User-level Threading: Have Your Cake and Eat It Too

Published: 27 May 2020

Abstract

An important class of computer software, such as network servers, exhibits concurrency through many loosely coupled and potentially long-running communication sessions. For these applications, a long-standing open question is whether thread-per-session programming can deliver comparable performance to event-driven programming. This paper clearly demonstrates, for the first time, that it is possible to employ user-level threading for building thread-per-session applications without compromising functionality, efficiency, performance, or scalability. We present the design and implementation of a general-purpose, yet nimble, user-level M:N threading runtime that is built from scratch to accomplish these objectives. Its key components are efficient and effective load balancing and user-level I/O blocking. While no other runtime exists with comparable characteristics, an important fundamental finding of this work is that building this runtime does not require particularly intricate data structures or algorithms. The runtime is thus a straightforward existence proof for user-level threading without performance compromises and can serve as a reference platform for future research. It is evaluated in comparison to event-driven software, system-level threading, and several other user-level threading runtimes. An experimental evaluation is conducted using benchmark programs, as well as the popular Memcached application. We demonstrate that our user-level runtime outperforms other threading runtimes and enables thread-per-session programming at high levels of concurrency and hardware parallelism without sacrificing performance.
