Abstract
Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. However, this separation of problem decomposition and parallelism requires a sufficiently large input problem to achieve satisfactory efficiency on a given number of cores. Unfortunately, finding a good match between input size and core count usually requires significant experimentation, which is expensive and sometimes even impractical. In this paper, we propose an automated empirical method for finding the isoefficiency function of a task-based program, binding efficiency, core count, and input size in one analytical expression. This allows the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we find not only (i) the actual isoefficiency function but also (ii) the function one would obtain if the program execution were free of resource contention and (iii) an upper bound that could only be reached if the program were able to maintain its average parallelism throughout its execution. The difference between the three helps to explain low efficiency, and in particular, it helps to differentiate between resource contention and structural conflicts related to task dependencies or scheduling. The insights gained can be used to co-design programs and shared system resources.
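To make the upper bound (iii) concrete, the sketch below derives an isoefficiency-style relation from the classic work/depth view of a task graph: by Brent's bound, T_p >= max(T_1/p, T_inf), so efficiency is capped by (average parallelism)/p. The cost model used here (work Θ(n log n), depth Θ(log² n)) is a purely illustrative assumption, not one of the paper's measured models, and `isoefficiency_input_size` is a hypothetical helper.

```python
import math

def efficiency_bound(work, depth, p):
    """Upper bound on efficiency from Brent's bound:
    T_p >= max(work / p, depth), hence
    E = work / (p * T_p) <= min(1, (work / depth) / p),
    i.e. efficiency is capped by average parallelism divided by p."""
    avg_parallelism = work / depth
    return min(1.0, avg_parallelism / p)

def isoefficiency_input_size(p, target_e, work_of_n, depth_of_n, n_max=10**9):
    """Smallest input size n (by doubling search) whose parallelism
    bound still permits efficiency target_e on p cores."""
    n = 1
    while n <= n_max:
        if efficiency_bound(work_of_n(n), depth_of_n(n), p) >= target_e:
            return n
        n *= 2
    return None  # target efficiency unreachable within n_max

# Toy divide-and-conquer cost model (assumption, for illustration only):
work = lambda n: n * math.log2(n + 1)       # total work
depth = lambda n: math.log2(n + 1) ** 2     # critical path

for p in (8, 32, 128):
    n = isoefficiency_input_size(p, target_e=0.8,
                                 work_of_n=work, depth_of_n=depth)
    print(f"p={p}: need n >= {n} to allow 80% efficiency")
```

Reading the output as a function n(p) for fixed target efficiency is exactly the isoefficiency idea: as the core count grows, the input size must grow too, and the required growth rate characterizes scalability.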
Isoefficiency in Practice: Configuring and Understanding the Performance of Task-based Applications. In PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.