Abstract
Vector coprocessors (VPs) are commonly assigned exclusively to a single thread or core, and they are often neither performance nor energy efficient because their resources are mismatched with the vector needs of individual applications. In this article we present an easy-to-implement VP virtualization technique that enables a multithreaded VP to simultaneously execute multiple threads of similar or arbitrary vector lengths, improving aggregate utilization. Using a vector register file (VRF) virtualization technique that dynamically allocates physical vector registers to threads, our approach improves programmer productivity by providing each competing thread with a distinct physical register name space at runtime, thus eliminating the need to resolve register-name conflicts statically. We applied the technique to a multithreaded VP and prototyped an FPGA-based multicore processor system that supports VP sharing as well as power gating for better energy efficiency. Under the dynamic creation of disparate threads, our benchmarking results show VP speedups of up to 333% and total energy savings of up to 37% with proper thread scheduling and power gating, compared to a similar-sized system that allows VP access to just one thread at a time.
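The idea of dynamically allocating physical vector registers to threads can be illustrated with a small software model. The sketch below is only a toy under stated assumptions, not the hardware design described in the article: the class `VectorRegisterFile`, its methods `lookup` and `release`, and the allocate-on-first-use policy are all hypothetical names and choices introduced here for illustration.

```python
class VectorRegisterFile:
    """Toy model of a virtualized vector register file (hypothetical, for illustration)."""

    def __init__(self, num_physical):
        # Physical registers not currently owned by any thread.
        self.free = list(range(num_physical))
        # Mapping table: (thread_id, logical_register) -> physical_register.
        self.table = {}

    def lookup(self, thread_id, logical):
        """Translate a thread's logical register name, allocating on first use."""
        key = (thread_id, logical)
        if key not in self.table:
            if not self.free:
                raise RuntimeError("no free physical vector registers")
            self.table[key] = self.free.pop(0)
        return self.table[key]

    def release(self, thread_id):
        """Return every physical register held by a finished thread to the free list."""
        for key in [k for k in self.table if k[0] == thread_id]:
            self.free.append(self.table.pop(key))


vrf = VectorRegisterFile(num_physical=4)
a = vrf.lookup(0, 1)  # thread 0 names logical register 1
b = vrf.lookup(1, 1)  # thread 1 uses the same logical name
assert a != b         # the two threads never share a physical register
```

Because each thread's logical names are translated at runtime, two threads can be written against the same register names without any static coordination, which is the productivity benefit the abstract points to.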
Index Terms
Vector Coprocessor Virtualization for Simultaneous Multithreading