Abstract
An important class of computer software, such as network servers, exhibits concurrency through many loosely coupled and potentially long-running communication sessions. For these applications, a long-standing open question is whether thread-per-session programming can deliver performance comparable to event-driven programming. This paper demonstrates, for the first time, that user-level threading can be used to build thread-per-session applications without compromising functionality, efficiency, performance, or scalability. We present the design and implementation of a general-purpose, yet nimble, user-level M:N threading runtime built from scratch to meet these objectives. Its key components are efficient and effective load balancing and user-level I/O blocking. While no other runtime exists with comparable characteristics, a fundamental finding of this work is that building this runtime does not require particularly intricate data structures or algorithms. The runtime is thus a straightforward existence proof for user-level threading without performance compromises and can serve as a reference platform for future research. We evaluate it experimentally against event-driven software, system-level threading, and several other user-level threading runtimes, using benchmark programs as well as the popular Memcached application. The results show that our user-level runtime outperforms other threading runtimes and enables thread-per-session programming at high levels of concurrency and hardware parallelism without sacrificing performance.
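To make the thread-per-session style concrete, the following is a minimal sketch of the programming model only, not of the runtime described above. It assumes ordinary Python threads (kernel-level, 1:1) as a stand-in for user-level threads, and the names `handle_session` and `run_sessions` are hypothetical. Each session is handled by straight-line blocking code, and per-session state lives in a local variable rather than in an explicit state machine.

```python
import socket
import threading


def handle_session(conn: socket.socket) -> None:
    """Serve one session with sequential, blocking code on its own thread."""
    requests_seen = 0  # per-session state kept on this thread's stack
    with conn, conn.makefile("rb") as reader:
        for line in reader:  # blocks only this session's thread
            requests_seen += 1
            text = line.decode().strip().upper()
            conn.sendall(f"{requests_seen}:{text}\n".encode())


def run_sessions(messages_per_session):
    """Open one session per message list, interleave the requests across
    sessions, and return the replies received on each session."""
    sessions = []
    for _ in messages_per_session:
        client, server = socket.socketpair()
        worker = threading.Thread(target=handle_session, args=(server,))
        worker.start()
        sessions.append((client, client.makefile("rb"), worker))

    replies = [[] for _ in sessions]
    rounds = max((len(m) for m in messages_per_session), default=0)
    for i in range(rounds):
        for sid, messages in enumerate(messages_per_session):
            if i < len(messages):
                client, reader, _ = sessions[sid]
                client.sendall(messages[i].encode() + b"\n")
                replies[sid].append(reader.readline().decode().strip())

    for client, reader, worker in sessions:
        reader.close()
        client.close()  # the handler sees end-of-stream and returns
        worker.join()
    return replies


if __name__ == "__main__":
    print(run_sessions([["get a", "get b"], ["set c"]]))
```

In an event-driven design, the counter `requests_seen` would have to be stored in a per-connection object and restored on every callback. A user-level M:N runtime aims to keep the sequential style shown here while avoiding the cost of one kernel thread per session.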
User-level Threading: Have Your Cake and Eat It Too
SIGMETRICS '20: Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems