OpenML: networked science in machine learning

Abstract

Many sciences have made significant breakthroughs by adopting online tools that help organize, structure and mine information that is too detailed to be printed in journals. In this paper, we introduce OpenML, a place for machine learning researchers to share and organize data in fine detail, so that they can work more effectively, be more visible, and collaborate with others to tackle harder problems. We discuss how OpenML relates to other examples of networked science and what benefits it brings to machine learning research, to individual scientists, and to students and practitioners.
