
Communication optimizations for global multi-threaded instruction scheduling

Published: 01 March 2008

Abstract

The recent shift in the industry towards chip multiprocessor (CMP) designs has brought the need for multi-threaded applications to mainstream computing. As observed in several limit studies, most of the parallelization opportunities require looking for parallelism beyond local regions of code. To exploit these opportunities, especially for sequential applications, researchers have recently proposed global multi-threaded instruction scheduling techniques, including DSWP and GREMIO. These techniques simultaneously schedule instructions from large regions of code, such as arbitrary loop nests or whole procedures, and have been shown to be effective at extracting threads for many applications. A key enabler of these global instruction scheduling techniques is the Multi-Threaded Code Generation (MTCG) algorithm proposed in [16], which generates multi-threaded code for any partition of the instructions into threads. This algorithm inserts communication and synchronization instructions in order to satisfy all inter-thread dependences.
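To make the MTCG idea concrete, here is a minimal sketch (not the paper's actual MTCG implementation) of how a single inter-thread register dependence can be satisfied by inserting explicit produce/consume operations over a FIFO channel; the thread assignment and queue here are hypothetical:

```python
import threading
import queue

# One FIFO queue models one inter-thread communication channel.
q = queue.Queue()
result = []

def producer_thread():
    x = 2 + 3              # instruction assigned to thread 0 defines x
    q.put(x)               # "produce": send x to the consumer thread

def consumer_thread():
    x = q.get()            # "consume": blocks until x is available
    result.append(x * 10)  # instruction assigned to thread 1 uses x

t0 = threading.Thread(target=producer_thread)
t1 = threading.Thread(target=consumer_thread)
t0.start(); t1.start()
t0.join(); t1.join()
print(result[0])  # 50
```

The blocking `get` provides both the data transfer and the synchronization needed to respect the dependence.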

In this paper, we present a general compiler framework, COCO, to optimize the communication and synchronization instructions inserted by the MTCG algorithm. This framework, based on thread-aware data-flow analyses and graph min-cut algorithms, appropriately models and optimizes all kinds of inter-thread dependences, including register, memory, and control dependences. Our experiments, using a fully automatic compiler implementation of these techniques, demonstrate significant reductions (about 30% on average) in the number of dynamic communication instructions in code parallelized with DSWP and GREMIO. This reduction in communication translates to performance gains of up to 40%.


Supplemental Material

Video

References

  1. S. P. Amarasinghe and M. S. Lam. Communication optimization and code generation for distributed memory machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 126--138, 1993.
  2. A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: a preliminary analysis of tradeoffs. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 292--300, 1992.
  3. S. Chakrabarti, M. Gupta, and J.-D. Choi. Global communication analysis and optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 68--78, 1996.
  4. T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press and McGraw-Hill, 1990.
  5. J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9:319--349, July 1987.
  6. L. R. Ford, Jr. and D. R. Fulkerson. Flows in Networks. Princeton University Press, 1962.
  7. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, 1979.
  8. M. Kandemir, P. Banerjee, A. Choudhary, J. Ramanujam, and N. Shenoy. A global communication optimization technique based on data-flow analysis and linear algebra. ACM Transactions on Programming Languages and Systems, 21(6):1251--1297, 1999.
  9. J. Knoop, O. Rüthing, and B. Steffen. Lazy code motion. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 224--234, June 1992.
  10. M. S. Lam and R. P. Wilson. Limits of control flow on parallelism. In Proceedings of the 19th International Symposium on Computer Architecture, pages 46--57, May 1992.
  11. W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. P. Amarasinghe. Space-time scheduling of instruction-level parallelism on a Raw Machine. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 46--57, 1998.
  12. W. Lee, D. Puppin, S. Swenson, and S. Amarasinghe. Convergent scheduling. In Proceedings of the 35th Annual International Symposium on Microarchitecture, November 2002.
  13. E. Nystrom and A. E. Eichenberger. Effective cluster assignment for modulo scheduling. In Proceedings of the 31st International Symposium on Microarchitecture, pages 103--114, December 1998.
  14. E. M. Nystrom, H.-S. Kim, and W.-M. Hwu. Bottom-up and top-down context-sensitive summary-based pointer analysis. In Proceedings of the 11th Static Analysis Symposium, August 2004.
  15. G. Ottoni and D. I. August. Global multi-threaded instruction scheduling. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 56--68, December 2007.
  16. G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, pages 105--116, November 2005.
  17. D. A. Penry, M. Vachharajani, and D. I. August. Rapid development of a flexible validated processor model. In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation, June 2005.
  18. R. Rangan, N. Vachharajani, A. Stoler, G. Ottoni, D. I. August, and G. Z. N. Cai. Support for high-frequency streaming in CMPs. In Proceedings of the 39th International Symposium on Microarchitecture, pages 259--269, December 2006.
  19. R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August. Decoupled software pipelining with the synchronization array. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 177--188, September 2004.
  20. K. Rich and M. Farrens. Code partitioning in decoupled compilers. In Proceedings of the 6th European Conference on Parallel Processing, pages 1008--1017, Munich, Germany, September 2000.
  21. V. Sarkar. A concurrent execution semantics for parallel program graphs and program dependence graphs. In Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, 1992.
  22. J. W. Sias, S.-Z. Ueng, G. A. Kent, I. M. Steiner, E. M. Nystrom, and W.-M. W. Hwu. Field-testing IMPACT EPIC research results in Itanium 2. In Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004.
  23. G. S. Sohi, S. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995.
  24. M. B. Taylor, W. Lee, S. P. Amarasinghe, and A. Agarwal. Scalar operand networks. IEEE Transactions on Parallel and Distributed Systems, 16(2):145--162, February 2005.
  25. S. Triantafyllis, M. J. Bridges, E. Raman, G. Ottoni, and D. I. August. A framework for unrestricted whole-program optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 61--71, June 2006.
  26. N. Vachharajani, M. Iyer, C. Ashok, M. Vachharajani, D. I. August, and D. A. Connors. Chip multi-processor scalability for single-threaded applications. In Proceedings of the Workshop on Design, Architecture, and Simulation of Chip Multi-Processors, November 2005.
  27. N. Vachharajani, R. Rangan, E. Raman, M. J. Bridges, G. Ottoni, and D. I. August. Speculative decoupled software pipelining. In Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques, September 2007.
  28. Y. Wu and J. R. Larus. Static branch prediction and program profile analysis. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 1--11, December 1994.
  29. J. Zalamea, J. Llosa, E. Ayguadé, and M. Valero. Modulo scheduling with integrated register spilling for clustered VLIW architectures. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pages 160--169, 2001.
  30. A. Zhai, C. B. Colohan, J. G. Steffan, and T. C. Mowry. Compiler optimization of scalar value communication between speculative threads. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 171--183, 2002.
  31. C. Zilles and G. Sohi. Master/slave speculative parallelization. In Proceedings of the 35th Annual International Symposium on Microarchitecture, 2002.


      Reviews

      Olivier Louis Marie Lecarme

      This interesting paper is another example of the extraordinary complications that compiler writers must endure if they want to take advantage, at least partly, of the theoretical capabilities of new multiprocessors. Since chip fabrication is approaching the limits of what can be improved in a single-core processor, the industry is shifting toward multicore processors that, in principle, multiply single-core performance by the number of processors. In practice, however, most programs are not written to use multiple processors, and programmers are reluctant to write highly parallel programs, which are substantially more complicated than sequential ones. Compilers therefore have to detect parallelization opportunities in programs automatically, and this can be done at a coarse level only for very specialized applications.

      If the compiler automatically extracts thread-level parallelism from general-purpose applications, a large number of dependences must be respected. To realize the performance gains of parallelization, the hardware must be extended with specialized instructions that enforce these dependences, and the compiler must be extended with specialized optimization techniques and algorithms that exploit the new mechanisms.

      This paper describes a general compiler framework aimed at optimizing the communication and synchronization instructions inserted by the multithreaded code generation algorithm. The methods and algorithms are described at length, and the framework is implemented and evaluated by simulation. The results are encouraging, but not spectacular. Moreover, some of the algorithms used have cubic complexity, which means that compilation time may become overly costly. All in all, it is not clear whether this approach makes significant progress toward the theoretical peak performance offered by the chip multiprocessor.

      Online Computing Reviews Service
