ABSTRACT
We present a new technique, early phase termination, for eliminating idle processors in parallel computations that use barrier synchronization. This technique simply terminates each parallel phase as soon as there are too few remaining tasks to keep all of the processors busy.
Although this technique completely eliminates the idling that would otherwise occur at barrier synchronization points, it may also change the computation and therefore the result that the computation produces. We address this issue by providing probabilistic distortion models that characterize how the use of early phase termination distorts the result that the computation produces. Our experimental results show that for our set of benchmark applications, 1) early phase termination can improve the performance of the parallel computation, 2) the distortion is small (or can be made small with the use of an appropriate compensation technique), and 3) the distortion models provide accurate and tight distortion bounds. These bounds enable users to evaluate the effect of early phase termination and to confidently accept results from parallel computations that use this technique if they find the distortion bounds acceptable.
Finally, we identify a general computational pattern that works well with early phase termination and explain why computations that exhibit this pattern can tolerate the early termination of parallel tasks without producing unacceptable results.
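The termination rule described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: the function name `run_phase`, the greedy task-assignment policy, and the rule "end the phase the first time a processor finishes a task and finds no unstarted task" are assumptions made for illustration. It also assumes the phase has at least as many tasks as processors.

```python
import heapq


def run_phase(durations, num_procs, early_termination):
    """Simulate one parallel phase with greedy task assignment (illustrative only).

    Returns (phase_end_time, completed_task_indices, total_idle_time).
    Assumes len(durations) >= num_procs.
    """
    # Reverse so that pop() hands out tasks in their original order.
    pending = list(enumerate(durations))[::-1]
    running = []  # min-heap of (finish_time, task_index)

    # Start one task on each processor at time 0.
    for _ in range(num_procs):
        idx, d = pending.pop()
        heapq.heappush(running, (d, idx))

    completed = []
    stop_times = []  # when each processor ran out of work (barrier case)
    now = 0
    while running:
        now, idx = heapq.heappop(running)
        completed.append(idx)
        if pending:
            # The freed processor picks up the next unstarted task.
            nidx, d = pending.pop()
            heapq.heappush(running, (now + d, nidx))
        elif early_termination:
            # A processor is about to idle: end the phase now.
            # Tasks that finish at exactly this instant still count.
            while running and running[0][0] == now:
                completed.append(heapq.heappop(running)[1])
            # Every other in-flight task is discarded.
            return now, completed, 0.0
        else:
            # Conventional barrier: this processor idles until the phase ends.
            stop_times.append(now)

    idle = sum(now - t for t in stop_times)
    return now, completed, idle


if __name__ == "__main__":
    durations = [1, 1, 1, 10]  # one long task creates a load imbalance
    print(run_phase(durations, 2, early_termination=False))
    print(run_phase(durations, 2, early_termination=True))
```

In this toy run the conventional barrier waits for the long task while the other processor idles, whereas early termination ends the phase sooner and discards the unfinished task. The discarded task is the source of the result distortion that the abstract's distortion models are meant to bound.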
Using early phase termination to eliminate load imbalances at barrier synchronization points
Proceedings of the 2007 OOPSLA conference