research-article

Probabilistic model for code with decision trees

Published: 19 October 2016

Abstract

In this paper we introduce a new approach for learning precise and general probabilistic models of code based on decision tree learning. Our approach directly benefits an emerging class of statistical programming tools that leverage probabilistic models of code learned over large codebases (e.g., GitHub) to make predictions about new programs (e.g., code completion and repair).

The key idea is to phrase the problem of learning a probabilistic model of code as learning a decision tree in a domain-specific language over abstract syntax trees (called TGen). This allows us to condition the prediction of a program element on a dynamically computed context. Further, our problem formulation enables us to easily instantiate known decision tree learning algorithms such as ID3, and also to obtain new variants, which we refer to as ID3+ and E13, that have not been explored previously and that outperform ID3 in prediction accuracy.
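To make the reference to ID3 concrete, the following is a minimal sketch of the classic ID3 procedure (greedy splitting by information gain) over hand-written categorical features. It is not the TGen language, the ID3+ or E13 variants, or the Deep3 implementation, none of which are specified in this abstract. The feature names `parent` and `prev`, the toy training rows, and the helper names are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by partitioning the data on `feature`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def id3(rows, labels, features):
    """Return a leaf (Counter of labels) or a node (feature, children, fallback)."""
    if len(set(labels)) == 1 or not features:
        return Counter(labels)
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    if information_gain(rows, labels, best) <= 0:
        return Counter(labels)
    remaining = [f for f in features if f != best]
    children = {}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        children[value] = id3(list(sub_rows), list(sub_labels), remaining)
    # The fallback distribution handles feature values unseen during training.
    return (best, children, Counter(labels))


def predict(tree, row):
    """Follow the tests down to a leaf and return its most frequent label."""
    while isinstance(tree, tuple):
        feature, children, fallback = tree
        tree = children.get(row.get(feature), fallback)
    return tree.most_common(1)[0][0]


# Hypothetical toy data: each row is a hand-picked context around the
# program element to predict; each label is that element.
rows = [
    {"parent": "MemberExpression", "prev": "console"},
    {"parent": "MemberExpression", "prev": "console"},
    {"parent": "MemberExpression", "prev": "Math"},
    {"parent": "CallExpression", "prev": "Math"},
    {"parent": "MemberExpression", "prev": "document"},
]
labels = ["log", "log", "floor", "floor", "getElementById"]

tree = id3(rows, labels, ["parent", "prev"])
print(predict(tree, {"parent": "MemberExpression", "prev": "console"}))
```

In this sketch the context is a fixed feature vector. The abstract's point is that the approach instead computes the conditioning context dynamically, by learning the tree in a language over abstract syntax trees.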

Our approach is general and can be used to learn a probabilistic model of any programming language. We implemented our approach in a system called Deep3 and evaluated it for the challenging task of learning probabilistic models of JavaScript and Python. Our experimental results indicate that Deep3 predicts elements of JavaScript and Python code with precision above 82% and 69%, respectively. Further, Deep3 often significantly outperforms state-of-the-art approaches in overall prediction accuracy.

Published in

ACM SIGPLAN Notices, Volume 51, Issue 10 (OOPSLA '16)
October 2016, 915 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3022671

OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications
October 2016, 915 pages
ISBN: 9781450344449
DOI: 10.1145/2983990

Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
