Abstract
Workload, platform, and available resources constitute a parallel program's execution environment. Most parallelization efforts statically target an anticipated range of environments, but performance generally degrades outside that range. Existing approaches address this problem with dynamic tuning but do not optimize a multiprogrammed system holistically. Further, they either require manual programming effort or are limited to array-based data-parallel programs.
This paper presents Parcae, a generally applicable automatic system for platform-wide dynamic tuning. Parcae includes (i) the Nona compiler, which creates flexible parallel programs whose tasks can be efficiently reconfigured during execution; (ii) the Decima monitor, which measures resource availability and system performance to detect changes in the environment; and (iii) the Morta executor, which cuts short the life of executing tasks, replacing them with other functionally equivalent tasks better suited to the current environment. Parallel programs made flexible by Parcae outperform their original parallel implementations in many interesting scenarios.
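The monitor/executor interplay described above can be illustrated in miniature. The sketch below is not Parcae's implementation; it is a hypothetical Python analogue in which a Decima-like step times several functionally equivalent configurations (here, degrees of parallelism) on a sample of the work, and a Morta-like step commits to the fastest one. The names `run_with_dop` and `tune` are invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_dop(task, items, dop):
    """Run `task` over `items` with a given degree of parallelism (DOP).

    DOP 1 runs sequentially; higher DOPs use a thread pool. All
    configurations are functionally equivalent: same results, different
    execution strategies.
    """
    if dop == 1:
        return [task(x) for x in items]
    with ThreadPoolExecutor(max_workers=dop) as pool:
        return list(pool.map(task, items))


def tune(task, items, candidate_dops):
    """Monitor-then-commit loop (hypothetical analogue of Decima/Morta).

    Times each candidate configuration on a small sample of the input,
    then runs the full workload with the fastest configuration observed.
    Returns the chosen DOP and the results.
    """
    sample = items[: max(1, len(items) // 10)]
    timings = {}
    for dop in candidate_dops:
        start = time.perf_counter()
        run_with_dop(task, sample, dop)
        timings[dop] = time.perf_counter() - start
    best = min(timings, key=timings.get)  # commit to the fastest config
    return best, run_with_dop(task, items, best)
```

A real system would monitor continuously and reconfigure mid-execution as load changes, rather than sampling once up front; this sketch only shows the measure-then-replace idea.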
Parcae: a system for flexible parallel execution
PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation