Abstract
We present AL, a novel automated machine learning system that learns to generate new supervised learning pipelines from an existing corpus of supervised learning programs. In contrast to existing automated machine learning tools, which typically implement a search over manually selected machine learning functions and classes, AL learns to identify the relevant classes in an API by analyzing dynamic program traces that use the target machine learning library. AL constructs a conditional probability model from these traces to estimate the likelihood of the generated supervised learning pipelines and uses this model to guide the search to generate pipelines for new datasets. Our evaluation shows that AL can produce successful pipelines for datasets that previous systems fail to process and produces pipelines with comparable predictive performance for datasets that previous systems process successfully.
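The idea of scoring candidate pipelines with a conditional probability model learned from traces can be illustrated with a small sketch. The abstract does not specify the form of AL's model, so the code below substitutes a simple bigram model with add-one smoothing and a beam search. The trace data, the function names, and the choice of bigram statistics are all assumptions made for illustration and are not AL's actual implementation. Component names are plain strings borrowed from scikit-learn class names; no library is called.

```python
import math
from collections import Counter, defaultdict

START, END = "<start>", "<end>"

# Hypothetical traces: each list is the ordered sequence of API components
# that one existing program invoked (illustrative data, not from the paper).
TRACES = [
    ["StandardScaler", "LogisticRegression"],
    ["StandardScaler", "PCA", "LogisticRegression"],
    ["StandardScaler", "LogisticRegression"],
    ["SimpleImputer", "RandomForestClassifier"],
]


def fit_bigram_model(traces):
    """Count how often each component follows another across all traces."""
    counts = defaultdict(Counter)
    for trace in traces:
        steps = [START, *trace, END]
        for prev, nxt in zip(steps, steps[1:]):
            counts[prev][nxt] += 1
    return counts


def vocabulary(counts):
    """All symbols that ever appear as a successor (includes END)."""
    return sorted({nxt for successors in counts.values() for nxt in successors})


def log_prob(counts, vocab, prev, nxt):
    """log P(nxt | prev) with add-one smoothing over the vocabulary."""
    seen = counts.get(prev, Counter())
    total = sum(seen.values())
    return math.log((seen[nxt] + 1) / (total + len(vocab)))


def generate_pipelines(counts, beam_width=3, max_len=4):
    """Beam search for high-likelihood component sequences.

    Returns (log-likelihood, pipeline) pairs, best first.
    """
    vocab = vocabulary(counts)
    beam = [(0.0, [START])]
    finished = []
    for _ in range(max_len + 1):
        candidates = []
        for score, seq in beam:
            for comp in vocab:
                new_score = score + log_prob(counts, vocab, seq[-1], comp)
                if comp == END:
                    if len(seq) > 1:  # skip the empty pipeline
                        finished.append((new_score, seq[1:]))
                else:
                    candidates.append((new_score, seq + [comp]))
        # Keep only the most likely partial pipelines for the next step.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return sorted(finished, key=lambda c: c[0], reverse=True)


if __name__ == "__main__":
    model = fit_bigram_model(TRACES)
    for score, pipeline in generate_pipelines(model)[:3]:
        print(f"{score:.3f}  {' -> '.join(pipeline)}")
```

In a full system, each high-likelihood candidate would still have to be instantiated and evaluated on the new dataset; the model only orders the search, it does not guarantee that a pipeline runs or performs well.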
Index Terms
AL: autogenerating supervised learning programs