Adaptive, Efficient, Parallel Execution of Parallel Programs

Abstract
Future multicore processors will be heterogeneous, increasingly less reliable, and will operate under dynamically changing conditions. Such environments will result in a constantly varying pool of hardware resources, which can greatly complicate the task of efficiently mapping a program's parallelism onto those resources. Coupled with this uncertainty is the diverse set of efficiency metrics that users may desire. This paper proposes Varuna, a system that dynamically, continuously, rapidly, and transparently adapts a program's parallelism to best match the instantaneous capabilities of the hardware resources while satisfying different efficiency metrics. Varuna is applicable to both multithreaded and task-based programs and can be seamlessly inserted between the program and the operating system without requiring changes to the source code of either.
We demonstrate Varuna's effectiveness in diverse execution environments using unaltered C/C++ parallel programs from various benchmark suites. Regardless of the execution environment, Varuna always outperformed the state-of-the-art approaches for the efficiency metrics considered.
Index Terms
Adaptive, efficient, parallel execution of parallel programs
Published in PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation.