research-article

A Microcoded Kernel Recursive Least Squares Processor Using FPGA Technology

Published: 24 September 2016

Abstract

Kernel methods apply linear methods in a nonlinear feature space, combining the advantages of linear and nonlinear approaches. Online kernel methods, such as kernel recursive least squares (KRLS) and kernel normalized least mean squares (KNLMS), perform nonlinear regression recursively, with computational requirements similar to those of linear techniques. In this article, an architecture for a microcoded kernel method accelerator is described, and high-performance implementations of sliding-window KRLS (SW-KRLS), fixed-budget KRLS, and KNLMS are presented. The architecture uses pipelining and vectorization for performance and microcoding for reusability. The design can be scaled to trade off capacity, performance, and area. It is compared with central processing unit (CPU), digital signal processor (DSP), and Altera OpenCL implementations. In different configurations on an Altera Arria 10 device, our SW-KRLS implementation delivers a floating-point throughput of approximately 16 GFLOPS, a latency of 5.5 μs, and an energy consumption of 10⁻⁴ J, improvements over a CPU by factors of 12, 17, and 24, respectively.
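The sliding-window KRLS recursion named in the abstract can be illustrated in software. The following is a minimal NumPy sketch, assuming the formulation of Van Vaerenbergh, Vía, and Santamaría (ICASSP 2006): a Gaussian kernel of width `sigma`, a regularization constant `c`, and a window of `N` samples. The inverse of the regularized kernel matrix over the window is grown by a Schur-complement update for each new sample and shrunk when the oldest sample leaves the window. The class `SlidingWindowKRLS`, the helper `gaussian_kernel`, and all parameter values are hypothetical illustrations; this is not the article's microcoded FPGA datapath.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    a, b = np.atleast_2d(a), np.atleast_2d(b)
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))


class SlidingWindowKRLS:
    """Hypothetical sketch of sliding-window KRLS (not the article's hardware)."""

    def __init__(self, window=8, c=1e-2, sigma=1.0):
        self.N, self.c, self.sigma = window, c, sigma
        self.X = None             # last N inputs, one per row
        self.y = np.empty(0)      # matching targets
        self.K_inv = None         # inverse of (K + c*I) over the window
        self.alpha = np.empty(0)  # expansion coefficients, K_inv @ y

    def predict(self, x):
        if self.X is None:
            return 0.0
        k = gaussian_kernel(self.X, x, self.sigma)[:, 0]
        return float(k @ self.alpha)

    def update(self, x, y):
        x = np.atleast_2d(x)
        d = gaussian_kernel(x, x, self.sigma)[0, 0] + self.c
        if self.X is None:
            self.X, self.y = x, np.array([y], dtype=float)
            self.K_inv = np.array([[1.0 / d]])
        else:
            # Grow the inverse: append one row/column via the Schur complement.
            b = gaussian_kernel(self.X, x, self.sigma)[:, 0]
            Kb = self.K_inv @ b
            g = 1.0 / (d - b @ Kb)
            n = len(b)
            grown = np.empty((n + 1, n + 1))
            grown[:n, :n] = self.K_inv + g * np.outer(Kb, Kb)
            grown[:n, n] = -g * Kb
            grown[n, :n] = -g * Kb
            grown[n, n] = g
            self.K_inv = grown
            self.X = np.vstack([self.X, x])
            self.y = np.append(self.y, y)
            if len(self.y) > self.N:
                # Shrink the inverse: drop the oldest sample (first row/column).
                e, f, G = self.K_inv[0, 0], self.K_inv[1:, 0], self.K_inv[1:, 1:]
                self.K_inv = G - np.outer(f, f) / e
                self.X, self.y = self.X[1:], self.y[1:]
        self.alpha = self.K_inv @ self.y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = SlidingWindowKRLS(window=8, c=1e-2, sigma=1.0)
    for xi in rng.uniform(-3.0, 3.0, size=(50, 1)):
        model.update(xi, np.sin(xi[0]))
    print(model.predict(np.array([0.5])))
```

Each update in this sketch is dominated by O(N²) matrix-vector and outer-product operations on the window, regular vector work of the kind that pipelining and vectorization are suited to.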



    Published in

      ACM Transactions on Reconfigurable Technology and Systems, Volume 10, Issue 1
      March 2017
      206 pages
      ISSN: 1936-7406
      EISSN: 1936-7414
      DOI: 10.1145/3002131
      Editor: Steve Wilton

      Copyright © 2016 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 September 2016
      • Revised: 1 May 2016
      • Accepted: 1 May 2016
      • Received: 1 May 2015


      Qualifiers

      • research-article
      • Research
      • Refereed
