Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

Abstract
Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, and its ample parallelism and data locality lend themselves well to high-performance implementations. Many applications that depend on matrix multiplication can use reduced-precision integer or fixed-point representations to increase performance and energy efficiency while still delivering adequate quality of results. However, precision requirements may vary between application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously exploited the excellent binary-operation performance of FPGAs to offer matrix multiplication performance that scales with the required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes six-input LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC.
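To make the bit-serial idea concrete, the decomposition the abstract alludes to can be sketched in plain Python: an unsigned-integer matrix product is rewritten as a weighted sum of binary (bit-plane) matrix products, which is the kind of operation that maps efficiently onto FPGA LUTs. This is a minimal illustrative sketch under our own naming (`bit_serial_matmul` is not from the paper), not BISMO's actual hardware implementation.

```python
def bit_serial_matmul(A, B, bits_a, bits_b):
    """Compute A @ B for unsigned-integer matrices (lists of lists) as a
    sum of weighted binary matrix products: A = sum_i 2^i * A_i and
    B = sum_j 2^j * B_j, where A_i, B_j are bit-plane (0/1) matrices,
    so A @ B = sum_{i,j} 2^(i+j) * (A_i @ B_j)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for ia in range(bits_a):              # bit position within A's operands
        for ib in range(bits_b):          # bit position within B's operands
            w = 1 << (ia + ib)            # contribution weight 2^(ia+ib)
            for r in range(n):
                for c in range(m):
                    # binary dot product between bit-planes ia and ib
                    dot = sum(((A[r][t] >> ia) & 1) & ((B[t][c] >> ib) & 1)
                              for t in range(k))
                    C[r][c] += w * dot
    return C
```

Because the inner operation is a dot product of two bit vectors, runtime scales with `bits_a * bits_b`: lowering operand precision directly reduces the number of binary passes, which is the precision-proportional performance scaling the abstract describes.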