Abstract
In this paper we introduce a new approach for learning precise and general probabilistic models of code based on decision tree learning. Our approach directly benefits an emerging class of statistical programming tools which leverage probabilistic models of code learned over large codebases (e.g., GitHub) to make predictions about new programs (e.g., code completion, repair, etc).
The key idea is to phrase the problem of learning a probabilistic model of code as learning a decision tree in a domain specific language over abstract syntax trees (called TGen). This allows us to condition the prediction of a program element on a dynamically computed context. Further, our problem formulation enables us to easily instantiate known decision tree learning algorithms such as ID3, but also to obtain new variants we refer to as ID3+ and E13, not previously explored and ones that outperform ID3 in prediction accuracy.
Our approach is general and can be used to learn a probabilistic model of any programming language. We implemented our approach in a system called Deep3 and evaluated it for the challenging task of learning probabilistic models of JavaScript and Python. Our experimental results indicate that Deep3 predicts elements of JavaScript and Python code with precision above 82% and 69%, respectively. Further, Deep3 often significantly outperforms state-of-the-art approaches in overall prediction accuracy.
- M. Allamanis and C. Sutton. Mining source code repositories at massive scale using language modeling. In MSR, 2013. Google Scholar
Digital Library
- M. Allamanis and C. Sutton. Mining idioms from source code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 472–483, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3056-5. Google Scholar
Digital Library
- M. Allamanis, E. T. Barr, C. Bird, and C. Sutton. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 38–49, New York, NY, USA, 2015. Google Scholar
Digital Library
- ACM. ISBN 978-1-4503-3675-8.Google Scholar
- M. Allamanis, D. Tarlow, A. D. Gordon, and Y. Wei. Bimodal modelling of source code and natural language. In F. R. Bach and D. M. Blei, editors, ICML, volume 37 of JMLR Proceedings, pages 2123–2132. JMLR.org, 2015.Google Scholar
- R. Alur, R. Bod´ık, G. Juniwal, M. M. K. Martin, M. Raghothaman, S. A. Seshia, R. Singh, A. Solar-Lezama, E. Torlak, and A. Udupa. Syntax-guided synthesis. In Formal Methods in Computer-Aided Design, FMCAD 2013, Portland, OR, USA, October 20-23, 2013, pages 1–8, 2013.Google Scholar
- S. Barman, R. Bodik, S. Chandra, E. Torlak, A. Bhattacharya, and D. Culler. Toward tool support for interactive synthesis. In 2015 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!), Onward! 2015, pages 121–136, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3688-8. Google Scholar
Digital Library
- P. Bielik, V. Raychev, and M. T. Vechev. PHOG: probabilistic model for code. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2933–2942, 2016.Google Scholar
- P. Garg, D. Neider, P. Madhusudan, and D. Roth. Learning invariants using decision trees and implication counterexamples. In Proceedings of the 43rd Annual ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POPL 2016, pages 499–512, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3549-2. Google Scholar
Digital Library
- T. Gvero and V. Kuncak. Synthesizing java expressions from free-form queries. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SLASH 2015, Pittsburgh, PA, USA, October 25-30, 2015, pages 416–432, 2015. Google Scholar
Digital Library
- T. Gvero, V. Kuncak, I. Kuraj, and R. Piskac. Complete completion using types and weights. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 27–38. ACM, 2013. ISBN 978-1-4503-2014-6. Google Scholar
Digital Library
- A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu. On the naturalness of software. In Proceedings of the 34th International Conference on Software Engineering, ICSE ’12, pages 837–847, Piscataway, NJ, USA, 2012. IEEE Press. ISBN 978- 1-4673-1067-3. Google Scholar
Digital Library
- T. Hottelier and R. Bodik. Synthesis of layout engines from relational constraints. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 74–88, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3689-5. Google Scholar
Digital Library
- C.-H. Hsiao, M. Cafarella, and S. Narayanasamy. Using web corpus statistics for program analysis. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’14, pages 49–65, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2585-1. Google Scholar
Digital Library
- S. Jha, S. Gulwani, S. A. Seshia, and A. Tiwari. Oracleguided component-based program synthesis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10, pages 215–224, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-719-6. Google Scholar
Digital Library
- S. Karaivanov, V. Raychev, and M. T. Vechev. Phrase-based statistical translation of programming languages. In Onward! 2014, Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, part of SLASH ’14, Portland, OR, USA, October 20-24, 2014, pages 173–184, 2014. doi: 10.1145/ 2661136.2661148. Google Scholar
Digital Library
- E. Kneuss, I. Kuraj, V. Kuncak, and P. Suter. Synthesis modulo recursive functions. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’13, pages 407–426, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2374-1. doi: 10.1145/2509136.2509555. Google Scholar
Digital Library
- P. Liang, M. I. Jordan, and D. Klein. Learning programs: A hierarchical bayesian approach. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 639–646, 2010.Google Scholar
- F. Long and M. Rinard. Automatic patch generation by learning correct code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, pages 298–312, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3549-2. Google Scholar
Digital Library
- C. J. Maddison and D. Tarlow. Structured generative models of natural source code. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 649–657, 2014.Google Scholar
- T. M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1 edition, 1997. Google Scholar
Digital Library
- A. T. Nguyen and T. N. Nguyen. Graph-based statistical language model for code. In Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE ’15, pages 858–868, Piscataway, NJ, USA, 2015. IEEE Press. ISBN 978-1-4799-1934-5. Google Scholar
Digital Library
- T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. A statistical semantic language model for source code. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 532–542, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2237-9. Google Scholar
Digital Library
- J. R. Quinlan. Induction of decision trees. Mach. Learn., 1 (1):81–106, Mar. 1986. ISSN 0885-6125. doi: 10.1023/A: 1022643204877.Google Scholar
Digital Library
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993. Google Scholar
Digital Library
- ISBN 1-55860-238-0.Google Scholar
- V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 419–428, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2784-8. Google Scholar
Digital Library
- V. Raychev, M. Vechev, and A. Krause. Predicting program properties from ”big code”. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’15, pages 111–124. ACM, 2015. ISBN 978-1-4503-3300-9. Google Scholar
Digital Library
- V. Raychev, P. Bielik, M. Vechev, and A. Krause. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, pages 761–774, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3549-2. doi: 10.1145/2837614.2837671. Google Scholar
Digital Library
- M. Raza, S. Gulwani, and N. Milic-Frayling. Compositional program synthesis from natural language and examples. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 792–800, 2015. Google Scholar
Digital Library
- R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here. In Proceedings of the IEEE, 2000.Google Scholar
Cross Ref
- A. Solar-Lezama. Program sketching. STTT, 15(5-6):475– 495, 2013.Google Scholar
Digital Library
- A. Solar-Lezama, L. Tancau, R. Bod´ık, S. A. Seshia, and V. A. Saraswat. Combinatorial sketching for finite programs. In ASPLOS, pages 404–415, 2006. Google Scholar
Digital Library
- A. Stolcke. SRILM-an Extensible Language Modeling Toolkit. International Conference on Spoken Language Processing, 2002.Google Scholar
- I. H. Witten and T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37 (4):1085–1094, 1991. Google Scholar
Digital Library
- T. Zhaopeng, S. Zhendong, and D. Premkumar. On the localness of software. In Foundations of Software Engineering, FSE ’14, New York, NY, USA, 2014. ACM. Google Scholar
Digital Library
Index Terms
Probabilistic model for code with decision trees
Recommendations
Probabilistic model for code with decision trees
OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and ApplicationsIn this paper we introduce a new approach for learning precise and general probabilistic models of code based on decision tree learning. Our approach directly benefits an emerging class of statistical programming tools which leverage probabilistic ...
Learning probabilistic decision trees for AUC
Special issue: ROC analysis in pattern recognitionAccurate ranking, measured by AUC (the area under the ROC curve), is crucial in many real-world applications. Most traditional learning algorithms, however, aim only at high classification accuracy. It has been observed that traditional decision trees ...
Decision trees as possibilistic classifiers
This paper addresses the classification problem with imperfect data. More precisely, it extends standard decision trees to handle uncertainty in both building and classification procedures. Uncertainty here is represented by means of possibility ...







Comments