Abstract
Kernel methods apply linear methods in a nonlinear feature space and combine the advantages of both. Online kernel methods, such as kernel recursive least squares (KRLS) and kernel normalized least mean squares (KNLMS), perform nonlinear regression recursively, with computational requirements similar to those of linear techniques. In this article, an architecture for a microcoded kernel method accelerator is described, and high-performance implementations of sliding-window KRLS (SW-KRLS), fixed-budget KRLS, and KNLMS are presented. The architecture uses pipelining and vectorization for performance, and microcoding for reusability. The design can be scaled to allow tradeoffs between capacity, performance, and area. The design is compared with central processing unit (CPU), digital signal processor (DSP), and Altera OpenCL implementations. In different configurations on an Altera Arria 10 device, our SW-KRLS implementation delivers floating-point throughput of approximately 16 GFLOPs, latency of 5.5 μs, and energy consumption of 10⁻⁴ J; these are improvements over a CPU by factors of 12, 17, and 24, respectively.
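To make the recursive style of these algorithms concrete, the following is a minimal software sketch of a KNLMS-type update with a coherence-based dictionary. It is an illustration only, not the accelerator's microcode or the authors' implementation. The Gaussian kernel, the class and method names, and the parameter values (`eta`, `eps`, `mu0`, `gamma`) are all assumptions chosen for the example.

```python
import math


def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||a - b||^2) for equal-length sequences."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


class KNLMS:
    """Illustrative KNLMS-style online regressor (hypothetical names and defaults)."""

    def __init__(self, eta=0.5, eps=1e-4, mu0=0.9, gamma=1.0):
        self.eta = eta        # step size
        self.eps = eps        # regularization in the normalization term
        self.mu0 = mu0        # coherence threshold for dictionary growth
        self.gamma = gamma    # Gaussian kernel parameter
        self.dictionary = []  # stored input vectors
        self.alpha = []       # one weight per dictionary entry

    def _kernel_vector(self, x):
        return [gaussian_kernel(x, d, self.gamma) for d in self.dictionary]

    def predict(self, x):
        k = self._kernel_vector(x)
        return sum(a * ki for a, ki in zip(self.alpha, k))

    def update(self, x, y):
        """Process one (input, target) pair and return the a priori error."""
        k = self._kernel_vector(x)
        # Grow the dictionary only if x is not too similar to any stored entry.
        if not k or max(abs(ki) for ki in k) <= self.mu0:
            self.dictionary.append(tuple(x))
            self.alpha.append(0.0)
            k.append(gaussian_kernel(x, x, self.gamma))
        err = y - sum(a * ki for a, ki in zip(self.alpha, k))
        # Normalized LMS step along the kernel vector.
        step = self.eta * err / (self.eps + sum(ki * ki for ki in k))
        self.alpha = [a + step * ki for a, ki in zip(self.alpha, k)]
        return err


# Example: stream samples of sin(x) through the filter, one at a time.
model = KNLMS()
for i in range(31):
    x = (0.1 * i,)
    model.update(x, math.sin(x[0]))
```

Each update costs one kernel evaluation per dictionary entry plus a few vector operations, which is the kind of regular, vectorizable workload the abstract describes. The coherence threshold keeps the dictionary, and therefore the per-sample cost, bounded.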
A Microcoded Kernel Recursive Least Squares Processor Using FPGA Technology