Optimizing N-Dimensional, Winograd-Based Convolution for Manycore CPUs (PPoPP '18)

Abstract
Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 × 3. They achieve only slightly better, and often worse, performance than well-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that on modern ConvNets, our optimized implementation is on average more than 3×, and sometimes 8×, faster than other state-of-the-art CPU implementations on an Intel Xeon Phi manycore processor. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.
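To make the abstract's claim concrete, the sketch below illustrates the core idea behind Winograd-based convolution in its simplest 1D form, F(2,3): two outputs of a 3-tap filter are computed with 4 multiplications instead of the 6 needed by direct convolution. This is a minimal illustrative example, not the paper's N-dimensional, vectorized implementation; the function names and the validation against a direct reference are my own.

```python
# Illustrative 1D Winograd F(2,3) minimal-filtering sketch:
# two outputs of a 3-tap convolution using 4 multiplications
# (direct convolution needs 6). Not the paper's N-D implementation.

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 output samples."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    # Inverse transform combines the 4 products into the 2 outputs.
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    """Reference: 'valid' direct 1D convolution (cross-correlation form)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

if __name__ == "__main__":
    d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
    assert winograd_f23(d, g) == direct_conv(d, g)
    print(winograd_f23(d, g))
```

The same idea generalizes: larger tiles and higher dimensions amortize the input, filter, and inverse transforms over more outputs, which is where the reduction in arithmetic complexity comes from.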
Supplemental Material
Source code and artifact evaluation code are available for download.