Adaptive, efficient, parallel execution of parallel programs

Published: 09 June 2014

Abstract

Future multicore processors will be heterogeneous, increasingly unreliable, and subject to dynamically changing operating conditions. Such environments will produce a constantly varying pool of hardware resources, which can greatly complicate the task of efficiently mapping a program's parallelism onto those resources. Compounding this uncertainty is the diverse set of efficiency metrics that users may want to optimize. This paper proposes Varuna, a system that dynamically, continuously, rapidly, and transparently adapts a program's parallelism to best match the instantaneous capabilities of the hardware resources while satisfying different efficiency metrics. Varuna applies to both multithreaded and task-based programs and can be seamlessly inserted between the program and the operating system without changing the source code of either.

We demonstrate Varuna's effectiveness in diverse execution environments using unaltered C/C++ parallel programs from various benchmark suites. Regardless of the execution environment, Varuna always outperformed the state-of-the-art approaches for the efficiency metrics considered.
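
To make the idea of matching a program's parallelism to the available resources and to a chosen efficiency metric more concrete, the sketch below shows a generic hill-climbing controller over the degree of parallelism. It is not Varuna's algorithm, which the abstract does not describe; the function names (`adapt_parallelism`, `time_model`, `energy_model`) and the synthetic cost formulas are assumptions made purely for illustration.

```python
from typing import Callable


def adapt_parallelism(measure_cost: Callable[[int], float],
                      available_cores: int,
                      start: int = 1,
                      max_steps: int = 64) -> int:
    """Hill-climb on the degree of parallelism (DoP) to minimize a cost.

    measure_cost(dop) is a hypothetical callback returning the observed
    cost (e.g. time or energy) when running with `dop` workers.
    """
    dop = max(1, min(start, available_cores))
    best = measure_cost(dop)
    for _ in range(max_steps):
        # Probe the neighbouring DoP values that fit the current core pool.
        candidates = [d for d in (dop - 1, dop + 1)
                      if 1 <= d <= available_cores]
        if not candidates:
            break
        costs = {d: measure_cost(d) for d in candidates}
        nxt = min(costs, key=costs.get)
        if costs[nxt] >= best:
            break  # no neighbour improves the metric: local optimum
        dop, best = nxt, costs[nxt]
    return dop


def time_model(dop: int) -> float:
    """Synthetic execution time (assumed): serial part + parallel part
    divided across workers + per-worker overhead."""
    return 1.0 + 64.0 / dop + 1.0 * dop


def energy_model(dop: int) -> float:
    """Synthetic energy (assumed): power proportional to active workers."""
    return dop * time_model(dop)


if __name__ == "__main__":
    # The same program settles on a different DoP as the core pool
    # shrinks or as the efficiency metric changes.
    print(adapt_parallelism(time_model, available_cores=16))
    print(adapt_parallelism(time_model, available_cores=4))
    print(adapt_parallelism(energy_model, available_cores=16))
```

In a real system, the controller would be re-run whenever the pool of cores changes, and the cost would come from live measurements rather than a closed-form model.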

