DOI: 10.1145/1504176.1504207
research-article

Techniques for efficient placement of synchronization primitives

Published: 14 February 2009

ABSTRACT

Harnessing the hardware parallelism of emerging multi-core systems necessitates concurrent software. Unfortunately, most existing mainstream software is sequential in nature. Although one could auto-parallelize a given program, the efficacy of auto-parallelization is largely limited to floating-point codes. One way to alleviate this limitation is to use explicit synchronization to parallelize programs that cannot be auto-parallelized. In this regard, efficient placement of the synchronization primitives (say, post and wait) plays a key role in achieving a high degree of thread-level parallelism (TLP). In this paper, we propose novel compiler techniques for such placement. Specifically, given a control flow graph (CFG), the proposed techniques place a post as early as possible and a wait as late as possible in the CFG, subject to dependences. We demonstrate the efficacy of our techniques on a real machine, using real codes: the industry-standard SPEC CPU benchmarks, the Linux kernel, and other widely used open source codes. Our results show that the proposed techniques yield significantly higher levels of TLP than the state of the art.
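The placement rule stated in the abstract (post as early as possible, wait as late as possible, subject to dependences) can be illustrated on a single straight-line loop body. The sketch below is not the paper's algorithm, which operates on a full CFG; it is a deliberately conservative, hypothetical illustration in which the names `Stmt` and `instrument` are assumptions introduced here. For one shared variable carried across iterations, it puts the wait immediately before the first access and the post immediately after the last access, so statements that do not touch the variable stay outside the synchronized region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """One statement of a loop body (hypothetical representation)."""
    label: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def instrument(block, shared):
    """Insert 'wait' and 'post' markers around accesses to `shared`.

    Conservative rule for a single basic block:
      - wait goes as late as possible: just before the first statement
        that reads or writes `shared`;
      - post goes as early as possible: just after the last statement
        that reads or writes `shared`.
    Statements outside that window can overlap with other iterations.
    """
    accesses = [i for i, s in enumerate(block)
                if shared in s.reads or shared in s.writes]
    if not accesses:
        # No dependence on `shared`: no synchronization is needed.
        return [s.label for s in block]

    first, last = accesses[0], accesses[-1]
    out = []
    for i, s in enumerate(block):
        if i == first:
            out.append("wait")
        out.append(s.label)
        if i == last:
            out.append("post")
    return out


# Example loop body: only the middle statement touches the shared `sum`.
body = [
    Stmt("t = f(i)", writes=frozenset({"t"})),
    Stmt("sum = sum + t", reads=frozenset({"sum", "t"}),
         writes=frozenset({"sum"})),
    Stmt("g(t)", reads=frozenset({"t"})),
]

placed = instrument(body, "sum")
print(placed)
```

In this example the calls to `f` and `g` fall outside the wait/post window, so consecutive iterations can overlap on them. A naive placement that waits at the top of the body and posts at the bottom would serialize the whole iteration.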


        Published in

        PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
        February 2009
        322 pages
        ISBN: 9781605583976
        DOI: 10.1145/1504176

        ACM SIGPLAN Notices, Volume 44, Issue 4 (PPoPP '09)
        April 2009
        294 pages
        ISSN: 0362-1340
        EISSN: 1558-1160
        DOI: 10.1145/1594835

        Copyright © 2009 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 230 of 1,014 submissions, 23%
