ABSTRACT
Harnessing the hardware parallelism of emerging multi-core systems necessitates concurrent software. Unfortunately, most existing mainstream software is sequential in nature. Although one could auto-parallelize a given program, the efficacy of auto-parallelization is largely limited to floating-point codes. One way to alleviate this limitation is to use explicit synchronization to parallelize programs that cannot be auto-parallelized. In this regard, efficient placement of the synchronization primitives (for example, post and wait) plays a key role in achieving a high degree of thread-level parallelism (TLP). In this paper, we propose novel compiler techniques for this placement problem. Specifically, given a control flow graph (CFG), the proposed techniques place a post as early as possible and a wait as late as possible in the CFG, subject to dependences. We demonstrate the efficacy of our techniques on a real machine using real codes, specifically from the industry-standard SPEC CPU benchmarks, the Linux kernel, and other widely used open source codes. Our results show that the proposed techniques yield significantly higher levels of TLP than the state of the art.
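To illustrate the placement idea only, here is a minimal hand-written sketch. It assumes Python's `threading.Event` as a stand-in for a post/wait pair; the names `run_pair`, `ready`, and `shared` are hypothetical, and the code is not the compiler technique proposed in the paper. The producer issues the post immediately after the write that the consumer depends on, and the consumer delays the wait until just before the dependent read, so the independent work in both threads can overlap.

```python
import threading

def run_pair():
    ready = threading.Event()   # hypothetical event variable standing in for post/wait
    shared = {}
    log = []

    def producer():
        shared["x"] = 42              # the write the consumer depends on
        ready.set()                   # post: issued right after the write
        log.append(sum(range(1000)))  # independent work, placed after the post

    def consumer():
        local = sum(range(1000))      # independent work, placed before the wait
        ready.wait()                  # wait: delayed until just before the dependent read
        log.append(shared["x"] + local)

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["x"], sorted(log)
```

Placing the post after the producer's independent work, or the wait before the consumer's, would still be correct but would serialize work that does not need to be ordered.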
Techniques for efficient placement of synchronization primitives
PPoPP '09