research-article

Improving performance of loops on DIAM-based VLIW architectures

Published: 12 June 2014

Abstract

Recent studies show that very long instruction word (VLIW) architectures, which inherently have a wide datapath (e.g., 128 or 256 bits for one VLIW instruction word), can benefit from the dynamic implied addressing mode (DIAM), achieving lower power consumption and smaller code size with a small performance overhead. This overhead, claimed to be small, is mainly caused by executing additional special instructions generated to convey information that cannot be encoded in the reduced instruction bit-width. In this paper, however, we show that the performance impact of applying DIAM to a VLIW architecture cannot be overlooked, especially when applications exhibit a high level of instruction-level parallelism (ILP), which is typically the case for loops as a result of aggressive code scheduling. We also propose a way to relieve this performance degradation, focusing on loops because they account for almost 90% of total program execution time and tend to have high ILP. We first implement the original DIAM compilation technique in a compiler and then augment it with the proposed loop optimization scheme, showing that our scheme clearly alleviates the performance loss caused by the excessive number of additional instructions, with the help of slightly modified hardware. Moreover, we integrate into our compiler the well-known loop unrolling scheme, which produces denser code in loops at the cost of substantial code size growth. The experimental results show that loop unrolling, combined with our augmented DIAM scheme, produces far better code in terms of performance with an acceptable increase in code size.
