Abstract
Dependency-aware task-based parallel programming models have proven to be successful for developing efficient application software for multicore-based computer architectures. The programming model is amenable to programmers, thereby supporting productivity, whereas hardware performance is achieved through a runtime system that dynamically schedules tasks onto cores in such a way that all dependencies are respected. However, even if the scheduling is completely successful with respect to load balancing, the scaling with the number of cores may be suboptimal due to resource contention. Here we consider the problem of scheduling tasks not only with respect to their interdependencies but also with respect to their usage of resources, such as memory and bandwidth. At the software level, this is achieved by user annotations of the task resource consumption. In the runtime system, the annotations are translated into scheduling constraints. We present experimental results on different hardware that demonstrate performance gains for both model examples and real applications. Furthermore, we provide a set of tools to detect resource sensitivity and predict the performance improvements that can be achieved by resource-aware scheduling. These tools are based solely on parallel execution traces and require no instrumentation or modification of the application code.
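As a rough illustration of how a resource annotation can become a scheduling constraint, the following minimal sketch simulates a greedy scheduler that starts a task only when its dependencies are done, a core is free, and the shared resource has enough capacity left. It is not the runtime system described in the abstract: the `Task` fields, the `bandwidth` annotation, and the `schedule` function are hypothetical names chosen for this example, and the resource model (a single integer capacity) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    duration: int                            # time steps the task runs for
    deps: set = field(default_factory=set)   # names of tasks that must finish first
    bandwidth: int = 0                       # hypothetical annotation: units of a shared resource


def schedule(tasks, cores, bandwidth_capacity):
    """Simulate a greedy scheduler; return {task name: start time}."""
    pending = {t.name: t for t in tasks}
    running = {}   # task name -> (finish time, task)
    done = set()
    start = {}
    now = 0
    while pending or running:
        # Retire tasks that have finished by the current time.
        for name in [n for n, (end, _) in running.items() if end <= now]:
            done.add(name)
            del running[name]
        used = sum(t.bandwidth for _, t in running.values())
        # Start ready tasks in submission order while cores and the resource allow.
        for name in list(pending):
            if len(running) >= cores:
                break
            task = pending[name]
            if task.deps <= done and used + task.bandwidth <= bandwidth_capacity:
                running[name] = (now + task.duration, task)
                used += task.bandwidth
                start[name] = now
                del pending[name]
        if running:
            # Advance to the next completion event.
            now = min(end for end, _ in running.values())
        elif pending:
            raise ValueError("no runnable task: cyclic dependency or demand exceeds capacity")
    return start


# Three bandwidth-hungry loads and one compute task that depends on the first load.
demo = [
    Task("load_a", 2, bandwidth=1),
    Task("load_b", 2, bandwidth=1),
    Task("compute", 1, deps={"load_a"}),
    Task("load_c", 2, bandwidth=1),
]
print(schedule(demo, cores=2, bandwidth_capacity=1))
print(schedule(demo, cores=2, bandwidth_capacity=2))
```

With a capacity of 1, the sketch never runs two loads at the same time, so the compute task is paired with a load instead. With a capacity of 2, the constraint is inactive and the scheduler behaves like a plain dependency-aware one.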