Abstract
Modern embedded processors are designed to maximize execution efficiency: the amount of performance achieved per unit of energy dissipated while meeting minimum performance levels. To increase this efficiency, we propose utilizing static strands, dependence chains without fan-out, which are exposed by a compiler pass. The dependent instructions in each strand are resequenced to be contiguous and annotated to communicate their location to the hardware. Importantly, the modified application is binary compatible with and functionally identical to the original, allowing transparent execution on a baseline processor. With simple processor modifications, however, these static strands can be easily collapsed and optimized, significantly reducing workload energy. Results show that over 30% of MediaBench and SPEC2000int dynamic instructions can be collapsed, reducing issue logic energy by 20%, bypass energy by 19%, and register file energy by 14%. In addition, by increasing the effective capacity of pipeline resources by almost a third, average IPC can be improved by up to 15%. This performance gain can then be traded for a lower clock frequency that maintains a baseline level of performance, further reducing energy.
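To make the notion of a "dependence chain without fan-out" concrete, the following is a minimal sketch of how such chains might be identified within one straight-line block. It is not the compiler pass described in the paper. The function name `find_strands`, the `(dest, srcs)` instruction encoding, and the `live_out` parameter are all hypothetical, and the sketch assumes each register is written once in the block and ignores any hardware limits on strand length or input count.

```python
from collections import defaultdict

def find_strands(block, live_out=frozenset()):
    """Group instructions of a straight-line block into candidate strands.

    block: list of (dest, [srcs]) in program order. Assumption: each dest is
    written once (an SSA-like simplification). A value may link two
    instructions of a strand only if exactly one later instruction reads it
    and it is not needed after the block (no fan-out).
    """
    producer = {dest: i for i, (dest, _) in enumerate(block)}

    # consumers[v] = indices of instructions that read value v
    consumers = defaultdict(list)
    for i, (_, srcs) in enumerate(block):
        for s in set(srcs):
            if s in producer and producer[s] < i:
                consumers[s].append(i)

    # next_in_chain[i] = j when instruction j is the sole reader of i's result
    next_in_chain = {}
    has_pred = set()
    for i, (dest, _) in enumerate(block):
        readers = consumers[dest]
        if len(readers) == 1 and dest not in live_out:
            j = readers[0]
            if j not in has_pred:  # an instruction joins at most one chain
                next_in_chain[i] = j
                has_pred.add(j)

    # Walk each chain from its head (an instruction with no chain predecessor)
    strands = []
    for i in range(len(block)):
        if i in has_pred or i not in next_in_chain:
            continue
        chain = [i]
        while chain[-1] in next_in_chain:
            chain.append(next_in_chain[chain[-1]])
        strands.append(chain)
    return strands


# Hypothetical block: t1 -> t2 -> t3 -> u2 is a chain of single-use values;
# u1 is read twice (fan-out), so it cannot be an internal link.
block = [
    ("t1", ["a", "b"]),    # 0
    ("t2", ["t1", "c"]),   # 1
    ("t3", ["t2", "d"]),   # 2
    ("u1", ["a", "c"]),    # 3
    ("u2", ["u1", "t3"]),  # 4
    ("u3", ["u1", "e"]),   # 5
]
live_out = {"u2", "u3"}
print(find_strands(block, live_out))  # [[0, 1, 2, 4]]
```

In this sketch, instructions 0, 1, 2, and 4 form one candidate strand. A pass in the spirit of the abstract would then move instruction 4 directly after instruction 2 so the chain is contiguous, which is legal here because instruction 3 does not depend on any value the strand produces.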
Static strands: Safely exposing dependence chains for increasing embedded power efficiency