Abstract
Stencil-based algorithms are a prominent class of computational kernels in high-performance systems, appearing in a plethora of fields, from image processing to seismic simulations, and from numerical methods to physical modeling. Among the various incarnations of stencil-based computations, Iterative Stencil Loops (ISLs) and Convolutional Neural Networks (CNNs) are two well-known examples: ISLs apply the same stencil repeatedly until convergence, while CNN layers leverage stencils to extract features from an image. The computationally intensive nature of ISLs, CNNs, and stencil-based workloads in general calls for solutions that deliver implementations efficient in both throughput and power. In this context, FPGAs are ideal candidates for such workloads, as they allow designing architectures tailored to the regular computational pattern of stencils. Moreover, the ever-growing need for higher performance pushes FPGA-based architectures to scale to multiple devices and benefit from distributed acceleration. For these reasons, we propose a library of HDL components to efficiently compute ISLs and CNN inference on FPGA, along with a scalable multi-FPGA architecture based on custom PCB interconnects. Our solution eases the design flow and guarantees both scalability and performance competitive with state-of-the-art works.
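To make the two stencil flavors named in the abstract concrete, the following is a minimal, software-only sketch in Python (a reference model, not the HDL components the paper proposes): an ISL sweeps the same 5-point neighborhood update over the grid until convergence, while a CNN convolution layer applies the same sliding-window pattern once, with learned weights. The 5-point Jacobi kernel and the tolerance-based stopping rule are illustrative assumptions, not taken from the paper.

```python
def jacobi_step(grid):
    """One sweep of a 5-point stencil; boundary cells are kept fixed."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return out

def isl(grid, tol=1e-6, max_iters=10_000):
    """Iterative Stencil Loop: repeat the sweep until the largest
    per-cell change drops below tol (or max_iters is reached)."""
    for it in range(max_iters):
        new = jacobi_step(grid)
        delta = max(abs(new[i][j] - grid[i][j])
                    for i in range(len(grid))
                    for j in range(len(grid[0])))
        grid = new
        if delta < tol:
            return grid, it + 1
    return grid, max_iters

def conv2d(img, kernel):
    """'Valid' 2D convolution: the CNN analogue of a single stencil
    sweep, with the neighborhood weights learned rather than fixed."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(ow)] for i in range(oh)]
```

The structural similarity between `jacobi_step` and `conv2d` is what lets a single library of streaming HDL components serve both workload classes: both read a fixed window of neighbors per output element, which maps naturally to line buffers and shift registers on FPGA.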
Enhancing the Scalability of Multi-FPGA Stencil Computations via Highly Optimized HDL Components