Software thread integration for instruction-level parallelism

Published: 05 September 2013

Abstract

Multimedia applications require significantly higher performance than earlier embedded-system workloads. This demand has driven digital signal processor (DSP) makers to adopt high-performance architectures such as VLIW (Very Long Instruction Word). Despite many efforts to exploit instruction-level parallelism (ILP) in applications, the achieved speed remains a fraction of what it could be, limited by the difficulty of finding enough independent instructions to keep all of the processor's functional units busy.

This article proposes Software Thread Integration (STI) for instruction-level parallelism. STI is a software technique for interleaving multiple threads of control into a single implicitly multithreaded one. We use STI to improve performance on ILP processors by merging parallel procedures into one, which widens the compiler's scope and allows it to create a more efficient instruction schedule. Assuming the parallel procedures are given, we define a methodology for finding the best-performing integrated procedure with minimal compilation time.
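To make the idea of merging parallel procedures concrete, the following is a minimal sketch in C, not the article's implementation. The procedures `scale`, `sum`, and the integrated `scale_and_sum` are hypothetical names chosen for illustration, and the `restrict` qualifiers encode the assumption that the two procedures touch disjoint data. The integrated loop body holds independent operations from both procedures, which is the kind of extra scheduling freedom the abstract describes.

```c
#include <stddef.h>

/* Procedure A (hypothetical): scale each element of a buffer. */
void scale(int *restrict dst, const int *restrict src, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Procedure B (hypothetical): sum the elements of a buffer. */
long sum(const int *src, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += src[i];
    return acc;
}

/*
 * Integrated procedure: the loop bodies of A and B are interleaved in one
 * loop, so a scheduler sees the multiply/store of A and the add of B in the
 * same scope. Assumes A and B are independent (no shared data).
 */
long scale_and_sum(int *restrict dst, const int *restrict a, size_t na, int k,
                   const int *restrict b, size_t nb)
{
    long acc = 0;
    size_t common = na < nb ? na : nb;
    size_t i = 0;

    /* Fused portion: both procedures advance together. */
    for (; i < common; i++) {
        dst[i] = a[i] * k;   /* work from procedure A */
        acc += b[i];         /* work from procedure B */
    }

    /* Clean-up loops: finish whichever procedure had more iterations. */
    for (size_t j = i; j < na; j++)
        dst[j] = a[j] * k;
    for (size_t j = i; j < nb; j++)
        acc += b[j];

    return acc;
}
```

Calling `scale_and_sum(dst, a, na, k, b, nb)` should produce the same `dst` contents and return value as calling `scale` and `sum` separately. Whether the fused version actually runs faster depends on the target and compiler.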

We quantitatively estimate the performance impact of integration, allowing various integration scenarios to be compared and ranked via profitability analysis. During integration of threads, different ILP-improving code transformations are selectively applied according to the control structure and the ILP characteristics of the code, driven by interactions with software pipelining. The estimated profitability is verified and corrected by an iterative compilation approach, compensating for possible estimation inaccuracy. Our modeling methods combined with limited compilation quickly find the best integration scenario without requiring exhaustive integration.
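The combination of estimated profitability and limited iterative compilation can be sketched as a rank-then-verify search. The C code below is an illustrative assumption, not the article's algorithm: `scenario_t`, `pick_scenario`, `top_k`, and `stub_measure` are hypothetical, and the `actual_cycles` field exists only so the stub can stand in for a real compile-and-time step.

```c
#include <stddef.h>
#include <stdlib.h>

/* One candidate integration scenario (all fields are illustrative). */
typedef struct {
    const char *name;      /* label for the scenario */
    double est_cycles;     /* cycle count predicted by a performance model */
    double actual_cycles;  /* used only by the stub below to fake a measurement */
} scenario_t;

typedef double (*measure_fn)(const scenario_t *);

/* Stand-in for "compile and run"; a real flow would invoke the compiler. */
double stub_measure(const scenario_t *s)
{
    return s->actual_cycles;
}

/* qsort comparator: ascending estimated cycles (most profitable first). */
static int by_estimate(const void *x, const void *y)
{
    double a = ((const scenario_t *)x)->est_cycles;
    double b = ((const scenario_t *)y)->est_cycles;
    return (a > b) - (a < b);
}

/*
 * Sort scenarios by estimate, measure only the top_k best-ranked ones, and
 * return the index (in the sorted array) of the best measured scenario.
 * Requires n >= 1; at least one scenario is always measured.
 */
size_t pick_scenario(scenario_t *s, size_t n, size_t top_k, measure_fn measure)
{
    qsort(s, n, sizeof *s, by_estimate);
    if (top_k > n)
        top_k = n;

    size_t best = 0;
    double best_cycles = measure(&s[0]);
    for (size_t i = 1; i < top_k; i++) {
        double c = measure(&s[i]);
        if (c < best_cycles) {
            best_cycles = c;
            best = i;
        }
    }
    return best;
}
```

Measuring only the `top_k` candidates is what keeps compilation time bounded in this sketch. The measurement step corrects the ranking when the model's estimate is off, at the risk of missing a scenario the model ranked too low.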

