Abstract
Multimedia applications require a significantly higher level of performance than previous workloads of embedded systems. They have driven digital signal processor (DSP) makers to adopt high-performance architectures like VLIW (Very-Long Instruction Word). Despite many efforts to exploit instruction-level parallelism (ILP) in the application, the speed is a fraction of what it could be, limited by the difficulty of finding enough independent instructions to keep all of the processor's functional units busy.
This article proposes Software Thread Integration (STI) for instruction-level parallelism. STI is a software technique for interleaving multiple threads of control into a single implicitly multithreaded one. We use STI to improve the performance on ILP processors by merging parallel procedures into one, increasing the compiler's scope and hence allowing it to create a more efficient instruction schedule. Assuming the parallel procedures are given, we define a methodology for finding the best performing integrated procedure with a minimum compilation time.
We quantitatively estimate the performance impact of integration, allowing various integration scenarios to be compared and ranked via profitability analysis. During integration of threads, different ILP-improving code transformations are selectively applied according to the control structure and the ILP characteristics of the code, driven by interactions with software pipelining. The estimated profitability is verified and corrected by an iterative compilation approach, compensating for possible estimation inaccuracy. Our modeling methods combined with limited compilation quickly find the best integration scenario without requiring exhaustive integration.
- Aigner, G., Diwan, A., Heine, D. L., Lam, M. S., Moore, D. L., Murphy, B. R., and Sapuntzakis, C. 1999. An overview of the SUIF2 compiler infrastructure. http://Suif.stanford.edu/Suif/Suif2/doc-2.2.0-4/.Google Scholar
- Aiken, A. and Nicolau, A. 1987. Loop quantization: an analysis and algorithm. Tech. rep., Cornell University, Ithaca, NY. Google Scholar
Digital Library
- Allen, J., Kennedy, K., Porterfield, C., and Warren, J. 1983. Conversion of control dependence to data dependence. In Proceedings of the 10th ACM Symposium on Principles of Programming Languages. 177--189. Google Scholar
Digital Library
- Allen, R. and Johnson, S. 1988. Compiling C for vectorization, parallelization, and inline expansion. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, New York, NY, 241--249. Google Scholar
Digital Library
- Allen, V. H., Janardhan, J., Lee, R. M., and Srinivas, M. 1992. Enhanced region scheduling on a program dependence graph. In Proceedings of the 25th Annual International Symposium on Microarchitecture. IEEE Computer Society Press, Los Alamitos, CA, 72--80. Google Scholar
Digital Library
- Almagor, L., Cooper, K. D., Grosul, A., Harvey, T. J., Reeves, S. W., Subramanian, D., Torczon, L., and Waterman, T. 2004. Finding effective compilation sequences. In Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. ACM Press, New York, NY, 231--239. Google Scholar
Digital Library
- Callahan, D., Cocke, J., and Kennedy, K. 1988. Estimating interlock and improving balance for pipelined architectures. J. Parallel Distrib. Comput. 5, 4, 334--358. Google Scholar
Digital Library
- Carr, S., Ding, C., and Sweany, P. 1996. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Hawaii International Conference on System Sciences. Google Scholar
Digital Library
- Carribault, P., Cohen, A., and Jalby, W. 2005. Deep jam: Conversion of coarse-grain parallelism to instruction-level and vector parallelism for irregular applications. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. 291--302. Google Scholar
Digital Library
- Chang, P. P., Warter, N. J., Mahlke, S., Chen, W. Y., and Hwu, W. W. 1991. Three superblock scheduling models for superscalar and superpipelined processors. Tech. rep., University of Illinois, Urbana, IL.Google Scholar
- Charlesworth, A. 1981. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. IEEE Comput. 14, 3, 18--27. Google Scholar
Digital Library
- Davidson, J. W. and Holler, A. M. 1992. Subprogram inlining: A study of its effects on program execution time. IEEE Trans. Softw. Engi. 18, 2, 89--102. Google Scholar
Digital Library
- Dean, A. G. 2000. Software thread integration for hardware to software migration. Ph.D. Dissertation, Carnegie Mellon University, Pittsburg, PA. Google Scholar
Digital Library
- Dean, A. G. 2002. Compiling for fine-grain concurrency: Planning and performing software thread integration. In Proceedings of the 23rd IEEE Real-Time Systems Symposium (RTSS'02). IEEE Computer Society. 103. Google Scholar
Digital Library
- Dean, A. G. and Shen, J. P. 1998. Techniques for software thread integration in real-time embedded systems. In Proceedings of the 19th IEEE Real-Time Systems Symposium. 322--333. Google Scholar
Digital Library
- Ferrante, J., Ottenstein, K. J., and Warren, J. D. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3, 319--349. Google Scholar
Digital Library
- Fisher, J. A. 1981. Trace scheduling: A technique for global microcode compaction. IEEE Trans. Comput. 30, 7, 278--490. Google Scholar
Digital Library
- Gupta, R. and Soffa, M. L. 1990. Region scheduling: An approach for detecting and redistributing parallelism. IEEE Trans. Softw. Engi. 16, 4, 421--431. Google Scholar
Digital Library
- Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. 2001. Mibench: A free, commercially representative embedded benchmark suite. In Proceedings of the 4th IEEE Annual Workshop on Workload Characterization. Google Scholar
Digital Library
- Hank, R. E., Hwu, W. W., and Rau, B. R. 1995. Region-based compilation: An introduction and motivation. In Proceedings of the 28th Annual International Symposium on Microarchitecture. IEEE Computer Society Press, 158--168. Google Scholar
Digital Library
- Havanki, W., Banerjia, S., and Conte, T. 1998. Treegion scheduling for wide issue processors. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture. IEEE Computer Society, Washington, DC, 266. Google Scholar
Digital Library
- Hwu, W. W. and Chang, P. P. 1989. Inline function expansion for compiling C programs. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, New York, NY, 246--257. Google Scholar
Digital Library
- Hwu, W. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., Hank, R. E., Kiyohara, T., Haab, G. E., Holm, J. G., and Lavery, D. M. 1993. The superblock: An effective technique for VLIW and superscalar compilation. J. Supercomput. 7, 1--2, 229--248. Google Scholar
Digital Library
- Kisuki, T., Knijnenburg, P. M. W., and O'Boyle, M. F. P. 2000. Combined selection of tile sizes and unroll factors using iterative compilation. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, Washington, DC, 237. Google Scholar
Digital Library
- Lam, M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, 318--328. Google Scholar
Digital Library
- Mahlke, S. A., Lin, D. C., Chen, W. Y., Hank, R. E., and Bringmann, R. A. 1992. Effective compiler support for predicated execution using the hyperblock. SIGMICRO Newsl. 23, 1--2, 45--54. Google Scholar
Digital Library
- Qian, Y., Carr, S., and Sweany, P. H. 2002. Optimizing loop performance for clustered VLIW architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 271--280. Google Scholar
Digital Library
- Rau, B. R. and Glaeser, C. D. 1981. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming. IEEE Press, Piscataway, NJ, 183--198. Google Scholar
Digital Library
- So, W. 2007. Software thread integration for insturction level parallelism. Ph.D. Dissertation, North Carolina State University, Raleigh, NC.Google Scholar
- So, W. and Dean, A. G. 2003. Procedure cloning and integration for converting parallelism from coarse to fine grain. In Proceedings of the 7th Workshop on Interaction between Compilers and Computer Architectures. IEEE Computer Society, 27--36. Google Scholar
Digital Library
- So, W. and Dean, A. G. 2005. Complementing software pipelining with software thread integration. In Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. ACM Press. Google Scholar
Digital Library
- So, W. and Dean, A. G. 2006. Reaching fast code faster: Using modeling for efficient software thread integration on a VLIW DSP. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems. ACM Press, 13--23. Google Scholar
Digital Library
- Texas Instruments. 2004. TMS320C6000 Optimizing Compiler User's Guide (Rev. L). Texas Instruments, Dallas, TX.Google Scholar
- Thies, W., Karczmarek, M., and Amarasinghe, S. 2002. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction. Google Scholar
Digital Library
- Tullsen, D. M., Eggers, S. J., and Levy, H. M. 1995. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture. ACM Press, New York, NY, 392--403. Google Scholar
Digital Library
- Wall, D. W. 1991. Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM Press, New York, NY, 176--188. Google Scholar
Digital Library
- Warter, N. J., Bockhaus, J. W., Haab, G. E., and Subramanian, K. 1992. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture. Google Scholar
Digital Library
- Way, T., Breech, B., and Pollock, L. 2001. Demand-driven inlining heuristics in a region-based optimizing compiler for ILP architectures. In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems. 90--95.Google Scholar
- Zhou, H. and Conte, T. M. 2002. Code size efficiency in global scheduling for ILP processors. In Proceedings of the 6th Annual Workshop on Interaction between Compilers and Computer Architectures. IEEE Computer Society, Washington, DC, 79. Google Scholar
Digital Library
Index Terms
Software thread integration for instruction-level parallelism
Recommendations
Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading
To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue super-scalar processors exploit ILP by executing multiple instructions from a ...
An evaluation of speculative instruction execution on simultaneous multithreaded processors
Modern superscalar processors rely heavily on speculative execution for performance. For example, our measurements show that on a 6-issue superscalar, 93% of committed instructions for SPECINT95 are speculative. Without speculation, processor resources ...
Exploiting Java instruction/thread level parallelism with horizontal multithreading
Java bytecodes can be executed with the following three methods: a Java interpretor running on a particular machine interprets bytecodes; a Just-In-Time (JIT) compiler translates bytecodes to the native primitives of the particular machine and the ...






Comments