Demystifying differentiable programming: shift/reset the penultimate backpropagator

Abstract
Deep learning has seen tremendous success over the past decade in computer vision, machine translation, and gameplay. This success rests crucially on gradient-descent optimization and the ability to “learn” parameters of a neural network by backpropagating observed errors. However, neural network architectures are growing increasingly sophisticated and diverse, which motivates an emerging quest for even more general forms of differentiable programming, where arbitrary parameterized computations can be trained by gradient descent. In this paper, we take a fresh look at automatic differentiation (AD) techniques, and especially aim to demystify the reverse-mode form of AD that generalizes backpropagation in neural networks. We uncover a tight connection between reverse-mode AD and delimited continuations, which permits implementing reverse-mode AD purely via operator overloading and without managing any auxiliary data structures. We further show how this formulation of AD can be fruitfully combined with multi-stage programming (staging), leading to an efficient implementation that combines the performance benefits of deep learning frameworks based on explicit reified computation graphs (e.g., TensorFlow) with the expressiveness of pure library approaches (e.g., PyTorch).
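To make the abstract's central claim concrete, below is a minimal sketch, not the paper's actual implementation: it expresses reverse-mode AD purely through overloaded operators whose backward rules run after a continuation returns, but it passes the continuation `k` explicitly so it compiles on a stock Scala 2 compiler, whereas the paper uses shift/reset delimited-control operators to hide that argument. The names `NumR`, `grad`, and `Demo` are illustrative assumptions, not identifiers fixed by the abstract.

```scala
// Reverse-mode AD via operator overloading and continuations (explicit-callback
// form of the paper's shift/reset formulation). Each operator computes the
// primal value, calls the continuation k (the rest of the forward pass), and
// then, once k returns, propagates the output adjoint y.d backward -- so no
// auxiliary tape or graph data structure is ever managed.
class NumR(val x: Double, var d: Double = 0.0) {
  def +(that: NumR)(k: NumR => Unit): Unit = {
    val y = new NumR(x + that.x)
    k(y)
    this.d += y.d
    that.d += y.d
  }
  def *(that: NumR)(k: NumR => Unit): Unit = {
    val y = new NumR(x * that.x)
    k(y)
    this.d += that.x * y.d
    that.d += this.x * y.d
  }
}

object Demo extends App {
  // grad runs f forward, seeds the final output's adjoint with 1.0, and reads
  // off the input's accumulated adjoint once the backward sweep has unwound.
  def grad(f: NumR => (NumR => Unit) => Unit)(x: Double): Double = {
    val z = new NumR(x)
    f(z) { y => y.d = 1.0 }
    z.d
  }

  // d/dx (x*x + x) = 2x + 1, so at x = 3 this prints 7.0.
  println(grad { x => k => (x * x) { y => (y + x)(k) } }(3.0))
}
```

Because each operator invokes the continuation before updating adjoints, the backward pass falls out of ordinary call/return order; the shift/reset version described in the abstract hides the `k` plumbing behind plain operator syntax, and combining it with staging is what recovers the performance of reified computation graphs.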