
Holistic aggregate resource environment

Published: 01 January 2008

Abstract

Within a few short years, we can expect to be dealing with multi-million-thread programs running on million-core systems [16]. This will no doubt stress the contemporary HPC software model, which was developed in a time when 512 cores was a large number. Historical approaches have been further challenged by the growing desire of developers and end users for supercomputer lightweight kernels (LWKs) to support the same environment, libraries, and tools as their desktops. As a result, the emerging workloads of today are far more sophisticated than those of the last two decades, when much of the HPC infrastructure was developed, and feature the use of scripting environments such as Python, dynamic libraries, and complex multi-scale physics frameworks. Complicating this picture is the overwhelming management, monitoring, and reliability problem created by the huge number of nodes in a system of that magnitude.

We believe that a re-evaluation and exploration of distributed system principles is called for in order to address the challenges of ultrascale. To that end, we will be evaluating and extending the Plan 9 [21] distributed system on the largest machines available to us, namely the BG/L [28] and BG/P [10] supercomputers. We chose Plan 9 based on our previous experiences with it, in combination with prior research [17] which determined that Plan 9 was a "right-weight kernel", balancing the trade-offs between LWKs and more general-purpose operating systems such as Linux. To deal with issues of scale, we plan to leverage the high-performance interconnects from system services as well as to explore aggregation as more of a first-class system construct -- providing dynamic hierarchical organization and management of all resources. Our plan is to evaluate the viability of these concepts at scale and to create an alternative development and execution environment which complements the features and capabilities of the existing system software and runtime options. Our intent is to broaden the application base and make the system as a whole more approachable to a larger class of developers and end users.
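The idea of aggregation as a first-class construct can be illustrated with a toy model: per-node statistics are rolled up through a dynamic hierarchy, so a query at the root touches only its immediate children rather than every one of a million leaves. The sketch below is purely illustrative -- the node layout, the statistics chosen, and the fan-out are our assumptions, not the system's actual design.

```python
class Node:
    """One compute node (leaf) or aggregation point (interior)."""
    def __init__(self, name, load=0.0, children=None):
        self.name = name
        self.load = load            # e.g. a monitored load metric
        self.children = children or []

def aggregate(node):
    """Return (node_count, total_load, peak_load) for the subtree.

    Each interior node summarizes its children, so a root query costs
    O(depth * fan-out) messages instead of O(leaves)."""
    count, total, peak = 1, node.load, node.load
    for child in node.children:
        c, t, p = aggregate(child)
        count += c
        total += t
        peak = max(peak, p)
    return count, total, peak

# A tiny two-level hierarchy: one root, two mid-level aggregators,
# four leaf compute nodes (names are hypothetical).
leaves = [Node(f"cn{i}", load=1.0) for i in range(4)]
tree = Node("root", children=[
    Node("mid0", load=0.1, children=leaves[:2]),
    Node("mid1", load=0.1, children=leaves[2:]),
])

count, total, peak = aggregate(tree)
print(count, round(total, 1), peak)   # prints: 7 4.2 1.0
```

In a real deployment the hierarchy would be rebuilt dynamically as nodes join or fail, and the summaries would travel over the machine's high-performance interconnects rather than in-process calls.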

References

  1. G. Ammons, J. Appavoo, M. Butrico, D. Da Silva, D. Grove, K. Kawachiya, O. Krieger, B. Rosenburg, E. Van Hensbergen, and R. W. Wisniewski. Libra: A library operating system for a JVM in a virtualized execution environment. In Proceedings of VEE 2007, 2007.
  2. M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06), pages 335--350, Berkeley, CA, USA, 2006. USENIX Association.
  3. S. Chakravorty et al. HPC-Colony: services and interfaces for very large systems. ACM SIGOPS Operating Systems Review, 40(2), April 2006.
  4. Coraid. The Linux storage people. http://www.coraid.com.
  5. Sun Corporation. Throughput Computing: Changing the economics and ecology of the data center with innovative SPARC technology. Technical report, Sun Corporation, November 2005.
  6. E. Sit, J. Cates, and R. Cox. A DHT-based backup system. In Proceedings of the First IRIS Student Workshop, 2003.
  7. C. Engelmann et al. MOLAR: Adaptive runtime support for high-end computing operating and runtime systems. ACM SIGOPS Operating Systems Review, 40(2), April 2006.
  8. E. Van Hensbergen. P.R.O.S.E.: partitioned reliable operating system environment. SIGOPS Oper. Syst. Rev., 40(2):12--15, 2006.
  9. E. Van Hensbergen, J. McKie, C. Forsyth, and R. Minnich. Night of the Lepus: A Plan 9 perspective on Blue Gene's interconnects. In Proceedings of the Second Annual International Workshop on Plan 9, 2007.
  10. IBM Blue Gene Team. Overview of the Blue Gene/P project. IBM Journal of Research and Development, 52(1/2), January 2008.
  11. J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4/5):589--604, 2005.
  12. S. Kelly and R. Brightwell. Software architecture of the light weight kernel, Catamount. In Proceedings of the 2005 Cray User Group Conference, 2005.
  13. L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133--169, 1998.
  14. L. Lamport and M. Massa. Cheap Paxos. In DSN '04: Proceedings of the 2004 International Conference on Dependable Systems and Networks (DSN '04), page 307, Washington, DC, USA, 2004. IEEE Computer Society.
  15. B. W. Lampson. How to build a highly available system using consensus. In WDAG '96: Proceedings of the 10th International Workshop on Distributed Algorithms, pages 1--17, London, UK, 1996. Springer-Verlag.
  16. M. Leininger. Application and algorithm challenges for future petascale to exascale computing architectures. In OASCR AMR Future Architectures Panel, 2007.
  17. R. Minnich et al. Right-weight kernels: an off-the-shelf alternative to custom light-weight kernels. ACM SIGOPS Operating Systems Review, 40(2), April 2006.
  18. R. G. Minnich, A. Mirtchovski, and L. Ionkov. XCPU: a new, 9P-based, process management system for clusters and grids. In Cluster 2006, 2006.
  19. R. Pike et al. Plan 9 - the documents (volume 2). http://plan9.bell-labs.com/sys/doc.
  20. R. Pike et al. Plan 9 - the manual (volume 1). http://plan9.bell-labs.com/sys/man.
  21. R. Pike, D. Presotto, S. Dorward, B. Flandrena, K. Thompson, H. Trickey, and P. Winterbottom. Plan 9 from Bell Labs. Computing Systems, 8(3):221--254, 1995.
  22. R. Pike, D. Presotto, S. Dorward, D. M. Ritchie, H. Trickey, and P. Winterbottom. The Inferno operating system. Bell Labs Technical Journal, 2(1), Winter 1997.
  23. S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In Proceedings of the Conference on File and Storage Technologies, 2002.
  24. T. Sanders. Meet Larrabee, Intel's answer to a GPU. Vunet.com, 2007. http://www.theinquirer.net/default.aspx?article=37548.
  25. M. Sottile and R. Minnich. Analysis of microbenchmarks for performance tuning of clusters. In Proceedings of Cluster 2004, 2004.
  26. M. J. Sottile and R. G. Minnich. Supermon: A high-speed cluster monitoring system. In CLUSTER '02: Proceedings of the IEEE International Conference on Cluster Computing, page 39, Washington, DC, USA, 2002. IEEE Computer Society.
  27. I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In ACM SIGCOMM 2001, pages 149--160, August 2001.
  28. The BlueGene/L Team. An overview of the BlueGene/L supercomputer. In ACM Supercomputing Conference, 2002.
  29. K. Thompson and G. Collyer. The 64-bit standalone Plan 9 file server. http://plan9.bell-labs.com/sys/doc/fs.pdf.
  30. Uriel. 9P implementations. http://9p.cat-v.org/implementations.
  31. D. Wallace. Compute Node Linux: New frontiers in compute node operating systems. In Proceedings of the Cray User Group, 2007.


Published in

ACM SIGOPS Operating Systems Review, Volume 42, Issue 1 (January 2008), 133 pages
ISSN: 0163-5980
DOI: 10.1145/1341312
Copyright © 2008 Authors
Publisher: Association for Computing Machinery, New York, NY, United States
Article type: research-article
