Abstract
The recent shift in the industry towards chip multiprocessor (CMP) designs has brought the need for multi-threaded applications to mainstream computing. As observed in several limit studies, most of the parallelization opportunities require looking for parallelism beyond local regions of code. To exploit these opportunities, especially for sequential applications, researchers have recently proposed global multi-threaded instruction scheduling techniques, including DSWP and GREMIO. These techniques simultaneously schedule instructions from large regions of code, such as arbitrary loop nests or whole procedures, and have been shown to be effective at extracting threads for many applications. A key enabler of these global instruction scheduling techniques is the Multi-Threaded Code Generation (MTCG) algorithm proposed in [16], which generates multi-threaded code for any partition of the instructions into threads. This algorithm inserts communication and synchronization instructions in order to satisfy all inter-thread dependences.
In this paper, we present a general compiler framework, COCO, to optimize the communication and synchronization instructions inserted by the MTCG algorithm. This framework, based on thread-aware data-flow analyses and graph min-cut algorithms, appropriately models and optimizes all kinds of inter-thread dependences, including register, memory, and control dependences. Our experiments, using a fully automatic compiler implementation of these techniques, demonstrate significant reductions (about 30% on average) in the number of dynamic communication instructions in code parallelized with DSWP and GREMIO. This reduction in communication translates to performance gains of up to 40%.
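The min-cut formulation mentioned in the abstract can be illustrated with a small sketch. This is a generic max-flow/min-cut illustration of the idea (Edmonds-Karp on a profile-weighted control-flow graph), not the paper's actual COCO algorithm; all node names and edge weights below are hypothetical. A value defined before a hot loop and used after it need not be communicated at its definition on every execution of the loop body: cutting the cheapest edges that separate the definition from the uses places the communication where it executes least often.

```python
from collections import defaultdict, deque

def add_edge(adj, cap, u, v, w):
    """Directed edge u->v with capacity w, plus its 0-capacity residual edge."""
    adj[u].append(v)
    adj[v].append(u)
    cap[(u, v)] += w
    cap[(v, u)] += 0

def min_cut(adj, cap, s, t):
    """Edmonds-Karp max-flow; returns (cut_weight, cut_edges)."""
    flow = defaultdict(int)
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] - flow[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[e] - flow[e] for e in path)
        for u, v in path:
            flow[(u, v)] += aug
            flow[(v, u)] -= aug
    # Nodes still reachable from s in the residual graph form one cut side.
    reach, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in reach and cap[(u, v)] - flow[(u, v)] > 0:
                reach.add(v)
                q.append(v)
    cut = [(u, v) for u in reach for v in adj[u]
           if v not in reach and cap[(u, v)] > 0]
    return sum(cap[e] for e in cut), cut

# Toy CFG (hypothetical): value defined at "def" is consumed at "use"
# after a loop; edge weights are profiled execution frequencies.
adj, cap = defaultdict(list), defaultdict(int)
for u, v, w in [("def", "hdr", 1), ("hdr", "body", 100),
                ("body", "hdr", 100), ("hdr", "exit", 1),
                ("exit", "use", 1)]:
    add_edge(adj, cap, u, v, w)

weight, edges = min_cut(adj, cap, "def", "use")
print(weight, edges)  # -> 1 [('def', 'hdr')]
```

Communicating along the cut edges (total profile weight 1) rather than inside the loop body (weight 100) is the kind of saving the dynamic-communication reductions reported above come from, though the paper's framework additionally handles memory and control dependences.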
Supplemental Material
Supplemental material for "Communication optimizations for global multi-threaded instruction scheduling" is available for download.
References
[1] S. P. Amarasinghe and M. S. Lam. Communication optimization and code generation for distributed memory machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 126--138, 1993.
[2] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: a preliminary analysis of tradeoffs. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 292--300, 1992.
[3] S. Chakrabarti, M. Gupta, and J.-D. Choi. Global communication analysis and optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 68--78, 1996.
[4] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press and McGraw-Hill, 1990.
[5] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9:319--349, July 1987.
[6] L. R. Ford, Jr. and D. R. Fulkerson. Flows in Networks. Princeton University Press, 1962.
[7] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, 1979.
[8] M. Kandemir, P. Banerjee, A. Choudhary, J. Ramanujam, and N. Shenoy. A global communication optimization technique based on data-flow analysis and linear algebra. ACM Transactions on Programming Languages and Systems, 21(6):1251--1297, 1999.
[9] J. Knoop, O. Rüthing, and B. Steffen. Lazy code motion. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 224--234, June 1992.
[10] M. S. Lam and R. P. Wilson. Limits of control flow on parallelism. In Proceedings of the 19th International Symposium on Computer Architecture, pages 46--57, May 1992.
[11] W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. P. Amarasinghe. Space-time scheduling of instruction-level parallelism on a Raw machine. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 46--57, 1998.
[12] W. Lee, D. Puppin, S. Swenson, and S. Amarasinghe. Convergent scheduling. In Proceedings of the 35th Annual International Symposium on Microarchitecture, November 2002.
[13] E. Nystrom and A. E. Eichenberger. Effective cluster assignment for modulo scheduling. In Proceedings of the 31st International Symposium on Microarchitecture, pages 103--114, December 1998.
[14] E. M. Nystrom, H.-S. Kim, and W.-m. W. Hwu. Bottom-up and top-down context-sensitive summary-based pointer analysis. In Proceedings of the 11th Static Analysis Symposium, August 2004.
[15] G. Ottoni and D. I. August. Global multi-threaded instruction scheduling. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 56--68, December 2007.
[16] G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, pages 105--116, November 2005.
[17] D. A. Penry, M. Vachharajani, and D. I. August. Rapid development of a flexible validated processor model. In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation, June 2005.
[18] R. Rangan, N. Vachharajani, A. Stoler, G. Ottoni, D. I. August, and G. Z. N. Cai. Support for high-frequency streaming in CMPs. In Proceedings of the 39th International Symposium on Microarchitecture, pages 259--269, December 2006.
[19] R. Rangan, N. Vachharajani, M. Vachharajani, and D. I. August. Decoupled software pipelining with the synchronization array. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 177--188, September 2004.
[20] K. Rich and M. Farrens. Code partitioning in decoupled compilers. In Proceedings of the 6th European Conference on Parallel Processing, pages 1008--1017, Munich, Germany, September 2000.
[21] V. Sarkar. A concurrent execution semantics for parallel program graphs and program dependence graphs. In Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, 1992.
[22] J. W. Sias, S.-Z. Ueng, G. A. Kent, I. M. Steiner, E. M. Nystrom, and W.-m. W. Hwu. Field-testing IMPACT EPIC research results in Itanium 2. In Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004.
[23] G. S. Sohi, S. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995.
[24] M. B. Taylor, W. Lee, S. P. Amarasinghe, and A. Agarwal. Scalar operand networks. IEEE Transactions on Parallel and Distributed Systems, 16(2):145--162, February 2005.
[25] S. Triantafyllis, M. J. Bridges, E. Raman, G. Ottoni, and D. I. August. A framework for unrestricted whole-program optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 61--71, June 2006.
[26] N. Vachharajani, M. Iyer, C. Ashok, M. Vachharajani, D. I. August, and D. A. Connors. Chip multi-processor scalability for single-threaded applications. In Proceedings of the Workshop on Design, Architecture, and Simulation of Chip Multi-Processors, November 2005.
[27] N. Vachharajani, R. Rangan, E. Raman, M. J. Bridges, G. Ottoni, and D. I. August. Speculative decoupled software pipelining. In Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques, September 2007.
[28] Y. Wu and J. R. Larus. Static branch prediction and program profile analysis. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 1--11, December 1994.
[29] J. Zalamea, J. Llosa, E. Ayguadé, and M. Valero. Modulo scheduling with integrated register spilling for clustered VLIW architectures. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pages 160--169, 2001.
[30] A. Zhai, C. B. Colohan, J. G. Steffan, and T. C. Mowry. Compiler optimization of scalar value communication between speculative threads. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 171--183, 2002.
[31] C. Zilles and G. Sohi. Master/slave speculative parallelization. In Proceedings of the 35th Annual International Symposium on Microarchitecture, 2002.