Abstract
We present a new approach for learning programs from noisy datasets. The approach rests on two new concepts: a regularized program generator, which produces a candidate program from a small sample of the entire dataset while avoiding overfitting, and a dataset sampler, which carefully samples the dataset by leveraging the candidate program's score on that dataset. The two components are connected in a continuous feedback-directed loop. We show how to apply this approach in two settings: one where the noise in the dataset is bounded, and one without a noise bound. The second setting leads to a new way of performing approximate empirical risk minimization on hypothesis classes formed by a discrete search space.

We then present two new kinds of program synthesizers, one for each noise setting. First, we introduce a novel regularized bitstream synthesizer that successfully generates programs even in the presence of incorrect examples. We show that the synthesizer can detect errors in the examples while combating overfitting, a major problem in existing synthesis techniques. We also show how the approach can be used in a setting where the dataset grows dynamically via new examples (e.g., provided by a human).

Second, we present a novel technique for constructing statistical code completion systems. These are systems trained on massive datasets of open source programs, also known as "Big Code". The key idea is to introduce a domain-specific language (DSL) over trees and to learn functions in that DSL directly from the dataset. These learned functions then condition the predictions made by the system. This is a flexible and powerful technique that generalizes several existing works, because we no longer need to decide a priori what the predictions should be conditioned on. A further benefit is that the learned functions are a natural mechanism for explaining a prediction. As a result, our code completion system surpasses the prediction capabilities of existing, hard-wired systems.
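The feedback loop between the program generator and the dataset sampler can be illustrated with a toy sketch. This is not the paper's algorithm or implementation. The search space (programs of the form x -> a*x + b), the size-based regularizer, the penalty weight `lam`, and the names `generate`, `sample_more`, and `learn` are all assumptions made up for illustration.

```python
from itertools import product

# Hypothetical toy search space: programs of the form x -> a*x + b.
PROGRAMS = [(a, b) for a, b in product(range(-3, 4), repeat=2)]

def run(program, x):
    a, b = program
    return a * x + b

def size(program):
    # Stand-in regularizer: each nonzero coefficient counts as one unit of size.
    return sum(1 for c in program if c != 0)

def errors(program, examples):
    # Number of examples the program labels incorrectly.
    return sum(1 for x, y in examples if run(program, x) != y)

def generate(sample, lam=0.5):
    # Regularized generator: minimize errors on the sample plus a size penalty.
    return min(PROGRAMS, key=lambda p: errors(p, sample) + lam * size(p))

def sample_more(program, dataset, sample, k=2):
    # Sampler: score the candidate on the whole dataset and add up to k
    # examples it gets wrong that are not in the sample yet.
    missed = [e for e in dataset if e not in sample and run(program, e[0]) != e[1]]
    return sample + missed[:k]

def learn(dataset, max_rounds=20):
    # Feedback-directed loop: alternate generation and sampling until the
    # sampler finds nothing new to add.
    sample = dataset[:2]
    candidate = generate(sample)
    for _ in range(max_rounds):
        new_sample = sample_more(candidate, dataset, sample)
        if len(new_sample) == len(sample):
            break
        sample = new_sample
        candidate = generate(sample)
    return candidate

# Ground truth y = 2x + 1, with one corrupted example at x = 4.
DATASET = [(x, 2 * x + 1) for x in range(10)]
DATASET[4] = (4, 100)
BEST = learn(DATASET)
```

On this toy data the loop settles on `(2, 1)`, that is y = 2x + 1, after seeing only three of the ten examples, and the single example it still disagrees with is the corrupted one at x = 4. Because no program in this small space fits the corrupted point together with the clean ones, the generator keeps the program that minimizes errors plus size rather than chasing the outlier.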
Learning programs from noisy data
POPL '16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages