Abstract
Work stealing is a popular approach to scheduling task-parallel programs. The flexibility inherent in work stealing when dealing with load imbalance results in seemingly irregular computation structures, complicating the study of its runtime behavior. In this paper, we present an approach to efficiently trace async-finish parallel programs scheduled using work stealing. We identify key properties that allow us to trace the execution of tasks with low time and space overheads. We also study the usefulness of the proposed schemes in supporting algorithms for data-race detection and retentive stealing presented in the literature. We demonstrate that the perturbation due to tracing is within the variation in the execution time with 99% confidence and the traces are concise, amounting to a few tens of kilobytes per thread in most cases. We also demonstrate that the traces enable significant reductions in the cost of detecting data races and result in low, stable space overheads in supporting retentive stealing for async-finish programs.
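The scheduling discipline the abstract refers to can be illustrated with a minimal sketch: each worker owns a deque, pushes and pops its own tasks LIFO, and an idle worker steals the oldest task (FIFO) from the opposite end of a victim's deque. The names (`Worker`, `spawn`, `steal_from`) and the round-robin driver are illustrative assumptions, not the paper's implementation.

```python
from collections import deque
import random

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()

    def spawn(self, task):
        self.tasks.append(task)  # push to the owner's end of the deque

    def pop(self):
        # the owner pops LIFO from its own end
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # a thief takes the oldest task from the opposite end (FIFO)
        return victim.tasks.popleft() if victim.tasks else None

def run(workers):
    """Round-robin driver: each worker runs a local task or steals one."""
    results = []
    while any(w.tasks for w in workers):
        for w in workers:
            task = w.pop()
            if task is None:
                victims = [v for v in workers if v is not w and v.tasks]
                if victims:
                    task = w.steal_from(random.choice(victims))
            if task is not None:
                results.append(task())
    return results

workers = [Worker(i) for i in range(2)]
for i in range(4):
    workers[0].spawn(lambda i=i: i * i)  # all tasks start on worker 0
print(sorted(run(workers)))              # -> [0, 1, 4, 9]
```

Because steals occur wherever load imbalance happens to arise, the resulting task-to-worker mapping differs from run to run; the paper's contribution is tracing exactly these steal decisions at low overhead.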
Steal Tree: low-overhead tracing of work stealing schedulers
PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation