
Optimizing N-dimensional, Winograd-based convolution for manycore CPUs

Published: 10 February 2018

Abstract

Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 × 3. They can achieve only slightly better, and often worse, performance than well-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that, on modern ConvNets, our optimized implementation is on average more than 3×, and sometimes 8×, faster than other state-of-the-art CPU implementations on an Intel Xeon Phi manycore processor. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.
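To make the complexity reduction concrete, the following sketch shows the classic 1D Winograd minimal-filtering algorithm F(2, 3), which computes two outputs of a 3-tap filter with 4 multiplications instead of the 6 needed by direct convolution; N-dimensional variants are obtained by nesting the 1D transforms along each axis. This is an illustrative textbook example (function names are ours), not the paper's optimized implementation.

```python
def winograd_f23(d, g):
    """Two outputs of correlating a 4-element input tile d with a
    3-element kernel g, using only 4 element-wise multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Kernel transform (can be precomputed once per kernel).
    u0 = g0
    u1 = (g0 + g1 + g2) / 2.0
    u2 = (g0 - g1 + g2) / 2.0
    u3 = g2
    # Input (data) transform.
    v0 = d0 - d2
    v1 = d1 + d2
    v2 = d2 - d1
    v3 = d1 - d3
    # Element-wise products: the only multiplications in the algorithm.
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3
    # Inverse (output) transform.
    return [m0 + m1 + m2, m1 - m2 - m3]


def direct_correlation(d, g):
    """Reference: sliding 3-tap correlation over a 4-element input."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [0.5, -1.0, 2.0]
    print(winograd_f23(d, g))        # matches direct_correlation(d, g)
    print(direct_correlation(d, g))
```

The multiplication savings grow with tile size, which is why larger tiles (and, as this paper generalizes, higher dimensions and larger kernels) amplify the benefit; the trade-off is reduced numerical stability of the transform matrices.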


