Abstract
As data volumes and application complexity grow, a single hardware accelerator is often not enough to solve a given problem. In particular, the computational and I/O demands of many machine learning tasks often require a cluster of accelerators to make a relevant difference in performance. In this article, we explore the efficient construction of FPGA clusters, using inference over decision tree ensembles as the target application. The article addresses several levels of the problem: (1) a lightweight inter-FPGA communication protocol and routing layer that facilitates communication between the FPGAs, (2) data partitioning and distribution strategies that maximize performance, and (3) an in-depth analysis of how applications can be efficiently distributed over such a cluster. The experimental analysis shows that the resulting system supports inference over decision tree ensembles at significantly higher throughput than existing systems.
Distributed Inference over Decision Tree Ensembles on Clusters of FPGAs