Abstract
For most applications that use a dedicated vector coprocessor, its resources are underutilized due to a lack of sustained data parallelism, which often arises from vector-length variations in dynamic environments. The motivation of our work stems from: (a) the mandate for multicore designs to make efficient use of on-chip resources for low power and high performance; (b) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (c) the need to often handle a variety of vector sizes; and (d) the diverse computation needs of vector kernels in application suites. We present a robust design framework for vector coprocessor sharing in multicore environments that maximizes vector unit utilization and performance at substantially reduced energy costs. For our adaptive vector unit, which is attached to multiple cores, we propose three basic shared working policies that enforce coarse-grain, fine-grain, and vector-lane sharing. We benchmark these vector coprocessor sharing policies for a dual-core system and evaluate them using the floating-point performance, resource utilization, and power/energy consumption metrics. Benchmarking for FIR filtering, FFT, matrix multiplication, and LU factorization shows that these coprocessor sharing policies yield high utilization and performance with low energy costs. The proposed policies provide 1.2--2x speedups and reduce the energy needs by about 50% as compared to a system having a single core with an attached vector coprocessor. With the performance expressed in clock cycles, the sharing policies demonstrate 3.62--7.92x speedups compared to optimized Xeon runs. We also introduce performance and empirical power models that can be used by the runtime system to estimate the effectiveness of each policy in a hybrid system that can simultaneously implement this suite of shared coprocessor policies.
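The abstract's last point, a runtime that consults performance and power models to pick a sharing policy per kernel, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual models: the policy names follow the abstract, but the `cycles_per_element` and `power_watts` coefficients and the `choose_policy` selection logic are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class PolicyModel:
    """One coprocessor sharing policy with simple cost models.

    All coefficients below are illustrative, not measured values
    from the paper.
    """
    name: str
    cycles_per_element: float  # performance model: cycles per vector element
    power_watts: float         # empirical average power while active

    def time(self, n_elements: int) -> float:
        # Estimated execution time, in cycles.
        return self.cycles_per_element * n_elements

    def energy(self, n_elements: int) -> float:
        # Energy ~ power * time; multiplying by the clock period would
        # give joules, but relative comparisons do not need it.
        return self.power_watts * self.time(n_elements)

# Hypothetical trade-off: lane sharing is fastest but draws the most power.
POLICIES = [
    PolicyModel("coarse-grain", cycles_per_element=1.8, power_watts=1.0),
    PolicyModel("fine-grain",   cycles_per_element=1.4, power_watts=1.5),
    PolicyModel("vector-lane",  cycles_per_element=1.1, power_watts=1.9),
]

def choose_policy(n_elements: int, optimize_for: str = "energy") -> PolicyModel:
    """Return the policy minimizing the chosen metric for a given kernel size."""
    if optimize_for == "energy":
        return min(POLICIES, key=lambda p: p.energy(n_elements))
    return min(POLICIES, key=lambda p: p.time(n_elements))
```

With these placeholder coefficients, a runtime optimizing for raw speed would select vector-lane sharing, while one optimizing for energy would fall back to coarse-grain sharing; a real implementation would calibrate the models from profiled measurements per kernel and vector length.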
Index Terms
Multicore-based vector coprocessor sharing for performance and energy gains