research-article

Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

Published: 20 August 2019

Abstract

Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lend themselves well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously utilized the excellent binary-operation performance of FPGAs to offer matrix multiplication performance that scales with required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes six-input LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC.
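The bit-serial idea behind the abstract can be illustrated in software. The following is a minimal sketch, not the BISMO hardware design: it assumes unsigned integer operands, and the function names (`bit_plane`, `binary_matmul`, `bit_serial_matmul`) are hypothetical names chosen for this illustration. Each operand matrix is split into binary bit planes, every pair of planes is multiplied with AND and popcount, and the partial results are accumulated with a weight of 2^(i+j).

```python
def bit_plane(matrix, bit):
    """Return the 0/1 matrix holding bit `bit` of every entry."""
    return [[(x >> bit) & 1 for x in row] for row in matrix]


def binary_matmul(a, b):
    """Multiply two 0/1 matrices; each dot product is AND plus popcount."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] & b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def bit_serial_matmul(a, b, a_bits, b_bits):
    """Multiply unsigned integer matrices one pair of bit planes at a time.

    Assumes every entry of `a` fits in `a_bits` bits and every entry of
    `b` fits in `b_bits` bits (illustrative assumption, unsigned only).
    """
    rows, cols = len(a), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(a_bits):
        a_plane = bit_plane(a, i)
        for j in range(b_bits):
            partial = binary_matmul(a_plane, bit_plane(b, j))
            # Weight the binary partial product by 2^(i + j) and accumulate.
            for r in range(rows):
                for c in range(cols):
                    result[r][c] += partial[r][c] << (i + j)
    return result
```

In this sketch the number of binary matrix multiplications is `a_bits * b_bits`, so the work grows with the chosen operand precision, which is the sense in which a bit-serial approach scales with required precision.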


Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 12, Issue 3
Special Section on Security in FPGAs and Regular Articles
September 2019
150 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3357092
Editor: Deming Chen

Copyright © 2019 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 20 August 2019
• Accepted: 1 May 2019
• Revised: 1 March 2019
• Received: 1 December 2018

        Qualifiers

        • research-article
        • Research
        • Refereed
