Abstract
In our hybrid runtime (HRT) model, a parallel runtime system and the application are together transformed into a specialized OS kernel that operates entirely in kernel mode and can thus implement exactly its desired abstractions on top of fully privileged hardware access. We describe the design and implementation of two new tools that support the HRT model. The first, the Nautilus Aerokernel, is a kernel framework specifically designed to enable HRTs for x64 and Xeon Phi hardware. Aerokernel primitives are specialized for HRT creation and thus can operate much faster, up to two orders of magnitude faster, than related primitives in Linux. Aerokernel primitives also exhibit much lower variance in their performance, an important consideration for some forms of parallelism. We have realized several prototype HRTs, including one based on the Legion runtime, and we provide application macrobenchmark numbers for our Legion HRT. The second tool, the hybrid virtual machine (HVM), is an extension to the Palacios virtual machine monitor that allows a single virtual machine to simultaneously support a traditional OS and software stack alongside an HRT with specialized hardware access. The HRT can be booted in a time comparable to a Linux user process startup, and functions in the HRT, which operate over the user process's memory, can be invoked by the process with latencies not much higher than those of a function call.
Enabling Hybrid Parallel Runtimes Through Kernel and Virtualization Support
VEE '16: Proceedings of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments