Abstract
Recent studies show that very long instruction word (VLIW) architectures, which inherently have a wide datapath (e.g., 128 or 256 bits for one VLIW instruction word), can benefit from dynamic implied addressing mode (DIAM), achieving lower power consumption and smaller code size with a small performance overhead. This overhead, claimed to be small, is mainly caused by executing additional special instructions that are generated to convey information that cannot be encoded in the reduced instruction bit-width. In this paper, however, we show that the performance impact of applying DIAM to VLIW architectures cannot be overlooked, especially when applications possess a high level of instruction-level parallelism (ILP), which is mostly the case for loops as a result of aggressive code scheduling. We also propose a way to relieve the performance degradation, focusing on loops, since loops account for almost 90% of total execution time in programs and tend to have high ILP. We first implement the original DIAM compilation technique in a compiler and then augment it with the proposed loop optimization scheme to show that our approach clearly alleviates the performance loss caused by the excessive number of additional instructions, with the help of slightly modified hardware. Moreover, the well-known loop unrolling scheme, which produces denser code in loops at the cost of substantial code size bloat, is integrated into our compiler. The experimental results show that loop unrolling, combined with our augmented DIAM scheme, produces far better code in terms of performance with an acceptable increase in code size.
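As a generic illustration of the loop unrolling trade-off mentioned in the abstract, the sketch below shows the same dot-product loop in rolled form and unrolled by a factor of four. It is not the paper's compiler transformation and says nothing about DIAM itself; the function names `dot_rolled` and `dot_unrolled4` and the unroll factor are assumptions chosen for illustration.

```python
def dot_rolled(a, b):
    """Baseline loop: one multiply-accumulate per iteration."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Same computation with the loop body replicated four times.

    The four multiplies in each iteration are independent of one another,
    which gives a scheduler more operations to pack per instruction word,
    but the body is roughly four times larger and a remainder loop is needed.
    """
    n = len(a)
    total = 0
    main_end = n - (n % 4)  # last index covered by full groups of four
    for i in range(0, main_end, 4):
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        total += a[i + 2] * b[i + 2]
        total += a[i + 3] * b[i + 3]
    # Remainder loop for the 0 to 3 elements left over.
    for i in range(main_end, n):
        total += a[i] * b[i]
    return total
```

Both functions return the same result for any pair of equal-length sequences. The unrolled version trades a larger body for fewer loop-control steps per element, which is the code size versus density tension the abstract describes.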
Improving performance of loops on DIAM-based VLIW architectures
LCTES '14: Proceedings of the 2014 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems