Learning programs from noisy data

Published: 11 January 2016

Abstract

We present a new approach for learning programs from noisy datasets. Our approach is based on two new concepts: a regularized program generator, which produces a candidate program from a small sample of the entire dataset while avoiding overfitting, and a dataset sampler, which carefully samples the dataset by leveraging the candidate program's score on that dataset. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach to two settings: one where the dataset has a bound on the noise, and another without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypothesis classes formed by a discrete search space. We then present two new kinds of program synthesizers that target the two noise settings. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting, a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human). Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open-source programs, also known as "Big Code". The key idea is to introduce a domain-specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This is a flexible and powerful technique that generalizes several existing works, as we no longer need to decide a priori what the prediction should be conditioned on. Another benefit is that the learned functions are a natural mechanism for explaining the prediction. As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems.
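
To make the feedback-directed loop in the bounded-noise setting concrete, the following is a minimal sketch and not the synthesizer from the paper. It assumes a tiny, hand-enumerated program space, a regularized score of the form "errors on the sample plus a penalty proportional to program size", and a sampler that adds examples the current candidate gets wrong. All names (`PROGRAMS`, `generate`, `sample_more`, `learn`) and the penalty weight `lam` are hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple

Example = Tuple[int, int]                        # (input, expected output)
Program = Tuple[str, int, Callable[[int], int]]  # (name, size, function)

# Hypothetical search space; "size" stands in for the regularization term.
PROGRAMS: List[Program] = [
    ("x", 1, lambda x: x),
    ("x + 1", 2, lambda x: x + 1),
    ("2 * x", 2, lambda x: 2 * x),
    ("2 * x + 1", 3, lambda x: 2 * x + 1),
    ("x * x", 2, lambda x: x * x),
]


def errors(prog: Program, data: List[Example]) -> int:
    """Count the examples in `data` that `prog` does not reproduce."""
    _, _, f = prog
    return sum(1 for x, y in data if f(x) != y)


def generate(sample: List[Example], lam: float = 0.5) -> Program:
    """Regularized generator (assumed form): minimize errors + lam * size."""
    return min(PROGRAMS, key=lambda p: errors(p, sample) + lam * p[1])


def sample_more(prog: Program, dataset: List[Example],
                sample: List[Example], k: int = 2) -> List[Example]:
    """Dataset sampler (assumed form): add up to k examples the candidate misses."""
    missed = [ex for ex in dataset
              if prog[2](ex[0]) != ex[1] and ex not in sample]
    return sample + missed[:k]


def learn(dataset: List[Example], noise_bound: int,
          lam: float = 0.5, max_rounds: int = 20) -> Program:
    """Alternate generator and sampler until the candidate's error count
    on the full dataset is within the assumed noise bound."""
    sample = dataset[:1]
    prog = generate(sample, lam)
    for _ in range(max_rounds):
        if errors(prog, dataset) <= noise_bound:
            break
        new_sample = sample_more(prog, dataset, sample)
        if len(new_sample) == len(sample):  # nothing left to add
            break
        sample = new_sample
        prog = generate(sample, lam)
    return prog


# Five examples of 2 * x + 1 plus one corrupted example, (4, 100).
dataset = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100), (5, 11)]
best = learn(dataset, noise_bound=1)
print(best[0], errors(best, dataset))
```

In this toy run the loop stops once the candidate disagrees with the dataset only on the single corrupted example, so the generator only ever sees a three-example sample. The size penalty makes the generator prefer the smaller of two programs that fit the sample equally well, which is one simple way to discourage fitting incorrect examples.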
