Abstract
The diversity of today's mobile applications requires embedded processor cores with high resource efficiency; that is, the devices should provide high performance at low area requirements and low power consumption. The fine-grained parallelism supported by the multiple functional units of VLIW architectures offers high throughput at reasonably low clock frequencies compared to single-core RISC processors. To utilize the processor pipeline efficiently, common system architectures have to cope with data hazards caused by data dependencies between consecutive operations. On the one hand, such hazards can be resolved by complex forwarding circuits (i.e., a pipeline bypass) that forward intermediate results to subsequent instructions. On the other hand, the pipeline bypass can strongly affect or even dominate the total resource requirements and degrade the maximum clock frequency. In this work, the CoreVA VLIW architecture is used to develop and analyze application-specific bypass configurations. It is shown that many paths of a comprehensive bypass system are rarely used and may not be required for certain applications. For this reason, several strategies have been implemented to enhance the efficiency of the total system by introducing application-specific bypass configurations. The configuration can be carried out statically, by implementing only the required paths, or at runtime, by dynamically reconfiguring the hardware. An algorithm is proposed that derives an optimized configuration by iteratively disabling single bypass paths. Adopting these application-specific bypass configurations reduces the critical path by 26%. As a result, execution time and energy requirements are reduced by up to 21.5%. Combining Dynamic Frequency Scaling (DFS) with dynamic deactivation and reactivation of bypass paths allows the bypass system to be reconfigured at runtime. This ensures the highest efficiency while processing varying applications.
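The idea of iteratively disabling single bypass paths can be illustrated with a small greedy sketch. This is not the algorithm from the paper, whose details the abstract does not give; it is a minimal illustration under stated assumptions. The `BypassPath` type, the per-path stall counts, the additive delay model, and the function names are all hypothetical, and a real flow would obtain cycle counts from simulation and delays from synthesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BypassPath:
    """Hypothetical description of one forwarding path."""
    name: str
    stall_cycles: int  # assumed extra cycles for the application if the path is disabled
    delay_ps: int      # assumed contribution to the clock period while the path is enabled


def execution_time_ps(enabled, all_paths, base_cycles, base_delay_ps):
    """Toy cost model (assumption): execution time = cycle count * clock period."""
    cycles = base_cycles + sum(p.stall_cycles for p in all_paths if p not in enabled)
    period_ps = base_delay_ps + sum(p.delay_ps for p in enabled)
    return cycles * period_ps


def prune_bypass(all_paths, base_cycles, base_delay_ps):
    """Greedily disable one path per iteration while execution time improves."""
    enabled = set(all_paths)
    best = execution_time_ps(enabled, all_paths, base_cycles, base_delay_ps)
    while enabled:
        candidate, candidate_time = None, best
        # Evaluate disabling each remaining path on its own.
        for path in sorted(enabled, key=lambda p: p.name):
            t = execution_time_ps(enabled - {path}, all_paths, base_cycles, base_delay_ps)
            if t < candidate_time:
                candidate, candidate_time = path, t
        if candidate is None:
            break  # no single removal improves the execution time
        enabled.remove(candidate)
        best = candidate_time
    return enabled, best


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    paths = [
        BypassPath("A", stall_cycles=0, delay_ps=500),     # never used by this application
        BypassPath("B", stall_cycles=1000, delay_ps=100),
        BypassPath("C", stall_cycles=5000, delay_ps=200),
    ]
    kept, time_ps = prune_bypass(paths, base_cycles=10_000, base_delay_ps=4_000)
    print(sorted(p.name for p in kept), time_ps)
```

In this toy model, a path is dropped only when the shorter clock period outweighs the stall cycles its removal introduces. The same trade-off offers one plausible reading of why the reported execution-time gain (up to 21.5%) is smaller than the critical-path reduction (26%).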
Index Terms
A systematic approach for optimized bypass configurations for application-specific embedded processors