Abstract
Code generation in a compiler is commonly divided into several phases: instruction selection, scheduling, register allocation, spill code generation, and, in the case of clustered architectures, cluster assignment. These phases are interdependent; for instance, a decision in the instruction selection phase affects how an operation can be scheduled. We examine the effect of this separation of phases on the quality of the generated code. To study this, we have formulated optimal methods for code generation with integer linear programming, first for acyclic code and then extended to modulo scheduling of loops. In our experiments we compare optimal modulo scheduling, where all phases are integrated, to modulo scheduling where instruction selection and cluster assignment are done in a separate phase. The results show that, for an architecture with two clusters, the integrated method finds a better solution than the nonintegrated method for 27% of the instances.
Index Terms
Integrated Code Generation for Loops