Research Article (Open Access)

Demystifying differentiable programming: shift/reset the penultimate backpropagator

Published: 26 July 2019

Abstract

Deep learning has seen tremendous success over the past decade in computer vision, machine translation, and gameplay. This success rests crucially on gradient-descent optimization and the ability to “learn” parameters of a neural network by backpropagating observed errors. However, neural network architectures are growing increasingly sophisticated and diverse, which motivates an emerging quest for even more general forms of differentiable programming, where arbitrary parameterized computations can be trained by gradient descent. In this paper, we take a fresh look at automatic differentiation (AD) techniques, and especially aim to demystify the reverse-mode form of AD that generalizes backpropagation in neural networks. We uncover a tight connection between reverse-mode AD and delimited continuations, which permits implementing reverse-mode AD purely via operator overloading and without managing any auxiliary data structures. We further show how this formulation of AD can be fruitfully combined with multi-stage programming (staging), leading to an efficient implementation that combines the performance benefits of deep learning frameworks based on explicit reified computation graphs (e.g., TensorFlow) with the expressiveness of pure library approaches (e.g., PyTorch).
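The key idea sketched in the abstract — that reverse-mode AD falls out of continuations, with no tape or explicit graph data structure — can be illustrated with a small example. The paper's actual implementation uses Scala's shift/reset operators; the following is a simplified Python analogue (names `NumR`, `grad`, etc. are illustrative, not the paper's API) in which explicit callbacks play the role of delimited continuations. Each overloaded operation runs its forward computation, invokes the continuation for the rest of the program, and then, as the call stack unwinds, propagates adjoints backward:

```python
class NumR:
    """A number carrying a value and an accumulated adjoint (gradient)."""
    def __init__(self, x):
        self.x = x    # forward value
        self.d = 0.0  # reverse-mode adjoint, accumulated on the way back

def mul(a, b, k):
    # Forward: compute the product, then run the continuation k.
    # Reverse: after k returns, push c's adjoint back to a and b.
    c = NumR(a.x * b.x)
    k(c)
    a.d += b.x * c.d
    b.d += a.x * c.d

def add(a, b, k):
    c = NumR(a.x + b.x)
    k(c)
    a.d += c.d
    b.d += c.d

def grad(f, x):
    """Differentiate f (written in continuation-passing style) at x."""
    z = NumR(x)
    # The innermost continuation seeds the output adjoint with 1.0;
    # backpropagation then happens implicitly as the calls return.
    f(z, lambda r: setattr(r, 'd', 1.0))
    return z.d

# f(x) = x*x + x, so f'(x) = 2x + 1; at x = 3.0 the gradient is 7.0.
def func(x, k):
    mul(x, x, lambda t: add(t, x, k))

print(grad(func, 3.0))  # 7.0
```

The point of the paper's shift/reset formulation is that the programmer never writes these callbacks by hand: the delimited-control operators capture the continuation automatically, so differentiable code reads as ordinary direct-style code while the runtime behaves like the sketch above.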


Supplemental Material

a96-zheng.webm

