research-article

Snorkel: rapid training data creation with weak supervision

Online: 01 November 2017

Abstract

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions, based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8X faster and increase predictive performance by an average of 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to a 1.8X speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average of 3.60% of the predictive performance of large hand-curated training sets.
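To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use Snorkel's actual interface: the function names, the label encoding (1, -1, and 0 for abstain), and the example sentences are all hypothetical. The denoising step is also replaced by a simple unweighted majority vote, whereas the abstract describes a data-programming approach that accounts for the labeling functions' unknown accuracies and correlations.

```python
from typing import Callable, List

# Assumed label encoding for this sketch (not taken from the source).
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


def lf_contains_causes(sentence: str) -> int:
    """Hypothetical heuristic: the word 'causes' hints at a positive relation."""
    return POSITIVE if "causes" in sentence.lower() else ABSTAIN


def lf_contains_not(sentence: str) -> int:
    """Hypothetical heuristic: a standalone 'not' hints at a negative relation."""
    return NEGATIVE if " not " in f" {sentence.lower()} " else ABSTAIN


def lf_contains_induced(sentence: str) -> int:
    """Hypothetical heuristic: 'induced' hints at a positive relation."""
    return POSITIVE if "induced" in sentence.lower() else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_contains_causes,
    lf_contains_not,
    lf_contains_induced,
]


def apply_lfs(sentences: List[str]) -> List[List[int]]:
    """Build a label matrix: one row per sentence, one column per labeling function."""
    return [[lf(s) for lf in LABELING_FUNCTIONS] for s in sentences]


def majority_vote(votes: List[int]) -> int:
    """Unweighted stand-in for denoising: the sign of the summed votes."""
    total = sum(votes)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN


# Made-up example sentences, used only to exercise the functions above.
SENTENCES = [
    "Aspirin causes stomach bleeding",
    "Aspirin does not treat ulcers",
    "The weather is nice",
]

LABEL_MATRIX = apply_lfs(SENTENCES)
TRAINING_LABELS = [majority_vote(row) for row in LABEL_MATRIX]
```

Each heuristic either votes or abstains, so rows of the label matrix can contain conflicts or no votes at all. A majority vote treats every function as equally reliable; estimating their reliability without ground truth is the part this sketch leaves out.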


      • Published in

        Proceedings of the VLDB Endowment, Volume 11, Issue 3
        November 2017
        150 pages
        ISSN: 2150-8097

        Publisher

        VLDB Endowment

        Publication History

        • Online: 1 November 2017
        • Published: 1 November 2017
