Enhancing the Scalability of Multi-FPGA Stencil Computations via Highly Optimized HDL Components

Published: 12 August 2021

Abstract

Stencil-based algorithms are a relevant class of computational kernels in high-performance systems, as they appear in a plethora of fields, from image processing to seismic simulations, from numerical methods to physical modeling. Among the various incarnations of stencil-based computations, Iterative Stencil Loops (ISLs) and Convolutional Neural Networks (CNNs) are two well-known examples of kernels in this class. Indeed, ISLs apply the same stencil repeatedly until convergence, while CNN layers leverage stencils to extract features from an image. The computationally intensive nature of ISLs, CNNs, and stencil-based workloads in general calls for solutions that deliver efficient implementations in terms of throughput and power efficiency. In this context, FPGAs are ideal candidates for such workloads, as they allow designing architectures tailored to the regular computational pattern of stencils. Moreover, the ever-growing need for higher performance drives FPGA-based architectures to scale to multiple devices and benefit from distributed acceleration. For this reason, we propose a library of HDL components to efficiently compute ISLs and CNN inference on FPGA, along with a scalable multi-FPGA architecture based on custom PCB interconnects. Our solution eases the design flow and guarantees both scalability and performance competitive with state-of-the-art works.
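The ISL pattern mentioned in the abstract, applying the same stencil to a grid until convergence, can be sketched in a few lines. The following is a generic illustration, not code from the paper: a 2D Jacobi relaxation with a 5-point stencil, where the function name, tolerance, and grid are all illustrative choices.

```python
# Minimal illustration (not from the paper) of an Iterative Stencil Loop:
# a 2D Jacobi relaxation that repeatedly applies a 5-point stencil
# until the largest per-point update falls below a tolerance.

def jacobi_isl(grid, tol=1e-4, max_iters=1000):
    """Apply a 5-point averaging stencil until convergence."""
    rows, cols = len(grid), len(grid[0])
    for it in range(max_iters):
        nxt = [row[:] for row in grid]
        delta = 0.0
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                # The stencil: each interior point becomes the mean
                # of its four von Neumann neighbours.
                nxt[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                    + grid[i][j - 1] + grid[i][j + 1])
                delta = max(delta, abs(nxt[i][j] - grid[i][j]))
        grid = nxt
        if delta < tol:  # the ISL terminates on convergence
            break
    return grid, it + 1

# Example: 4x4 grid with a "hot" boundary on the top edge.
g = [[1.0, 1.0, 1.0, 1.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
result, iters = jacobi_isl(g)
```

The inner two loops over a fixed neighbourhood are exactly the regular, data-parallel access pattern that makes this class of kernels a good fit for streaming FPGA architectures.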



Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 14, Issue 3
September 2021, 137 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3472296
Editor: Deming Chen

              Copyright © 2021 Association for Computing Machinery.

              Publisher

              Association for Computing Machinery

              New York, NY, United States

Publication History

• Received: 1 September 2020
• Revised: 1 March 2021
• Accepted: 1 April 2021
• Published: 12 August 2021


              Qualifiers

              • research-article
              • Refereed
