Parcae: a system for flexible parallel execution

Published: 11 June 2012

Abstract

Workload, platform, and available resources constitute a parallel program's execution environment. Most parallelization efforts statically target an anticipated range of environments, but performance generally degrades outside that range. Existing approaches address this problem with dynamic tuning but do not optimize a multiprogrammed system holistically. Further, they either require manual programming effort or are limited to array-based data-parallel programs.

This paper presents Parcae, a generally applicable automatic system for platform-wide dynamic tuning. Parcae includes (i) the Nona compiler, which creates flexible parallel programs whose tasks can be efficiently reconfigured during execution; (ii) the Decima monitor, which measures resource availability and system performance to detect change in the environment; and (iii) the Morta executor, which cuts short the life of executing tasks, replacing them with other functionally equivalent tasks better suited to the current environment. Parallel programs made flexible by Parcae outperform original parallel implementations in many interesting scenarios.
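
The monitor/executor split described above can be illustrated with a small sketch. This is not Parcae's API or implementation: every name here (`flexible_map`, `choose_variant`, `core_readings`, and so on) is hypothetical, the monitor is simulated by a list of free-core readings, and reconfiguration happens only at chunk boundaries rather than by cutting short a running task. It shows only the general idea of swapping between functionally equivalent task variants as the environment changes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def square(x: int) -> int:
    """Stand-in for one unit of work (hypothetical)."""
    return x * x


def run_sequential(chunk: Sequence[int]) -> List[int]:
    """Sequential variant of the task."""
    return [square(x) for x in chunk]


def run_parallel(chunk: Sequence[int], workers: int) -> List[int]:
    """Functionally equivalent parallel variant; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, chunk))


def choose_variant(free_cores: int) -> str:
    """Toy policy (an assumption): go parallel only if more than one core is free."""
    return "parallel" if free_cores > 1 else "sequential"


def flexible_map(
    data: Sequence[int],
    core_readings: Sequence[int],
    chunk_size: int = 4,
) -> Tuple[List[int], List[str]]:
    """Process data chunk by chunk, re-selecting the variant before each chunk.

    core_readings simulates a monitor: one free-core count per chunk.
    If the readings run out, the last reading is reused.
    """
    results: List[int] = []
    trace: List[str] = []
    readings = iter(core_readings)
    free_cores = 1  # conservative default before the first reading
    for start in range(0, len(data), chunk_size):
        free_cores = next(readings, free_cores)
        variant = choose_variant(free_cores)
        chunk = data[start:start + chunk_size]
        if variant == "parallel":
            results.extend(run_parallel(chunk, free_cores))
        else:
            results.extend(run_sequential(chunk))
        trace.append(variant)
    return results, trace


if __name__ == "__main__":
    out, trace = flexible_map(list(range(10)), core_readings=[4, 1, 2])
    print(out)
    print(trace)
```

Whichever variant runs, the results are the same; only the execution strategy changes. A real system would take its readings from live measurements instead of a fixed list.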
