research-article
Open Access

AL: autogenerating supervised learning programs

Published: 10 October 2019

Abstract

We present AL, a novel automated machine learning system that learns to generate new supervised learning pipelines from an existing corpus of supervised learning programs. In contrast to existing automated machine learning tools, which typically implement a search over manually selected machine learning functions and classes, AL learns to identify the relevant classes in an API by analyzing dynamic program traces that use the target machine learning library. AL constructs a conditional probability model from these traces to estimate the likelihood of the generated supervised learning pipelines and uses this model to guide the search to generate pipelines for new datasets. Our evaluation shows that AL can produce successful pipelines for datasets that previous systems fail to process and produces pipelines with comparable predictive performance for datasets that previous systems process successfully.
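To make the idea concrete, the sketch below shows one simple way a system *could* score and generate component sequences with a conditional probability model learned from traces. This is an illustrative toy, not AL's actual implementation: the paper does not specify its model, and the trace corpus, component names, and greedy search here are all hypothetical stand-ins (a bigram model over scikit-learn class names).

```python
from collections import Counter, defaultdict

# Hypothetical corpus of pipeline component sequences, standing in for
# the API-call sequences a system might mine from dynamic program traces.
TRACES = [
    ["SimpleImputer", "StandardScaler", "LogisticRegression"],
    ["SimpleImputer", "StandardScaler", "RandomForestClassifier"],
    ["SimpleImputer", "LogisticRegression"],
    ["StandardScaler", "LogisticRegression"],
]

def fit_bigram_model(traces):
    """Estimate P(next component | previous component) from the traces."""
    counts = defaultdict(Counter)
    for trace in traces:
        steps = ["<START>"] + trace + ["<END>"]
        for prev, nxt in zip(steps, steps[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
            for prev, ctr in counts.items()}

def generate_pipeline(model, max_len=5):
    """Greedy search: repeatedly pick the most probable next component."""
    pipeline, prev = [], "<START>"
    while len(pipeline) < max_len:
        nxt = max(model[prev], key=model[prev].get)
        if nxt == "<END>":
            break
        pipeline.append(nxt)
        prev = nxt
    return pipeline

model = fit_bigram_model(TRACES)
print(generate_pipeline(model))
# -> ['SimpleImputer', 'StandardScaler', 'LogisticRegression']
```

On this toy corpus the model prefers the most common component-to-component transitions, yielding the impute/scale/classify sequence; a real system would condition on richer context (e.g. dataset features) and search beyond the single greedy path.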

