Abstract
Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions, based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8X faster and increase predictive performance by an average of 45.5% compared with seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to a 1.8X speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides an average 132% improvement in predictive performance over prior heuristic approaches and comes within an average of 3.60% of the predictive performance of large hand-curated training sets.
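To make the idea of a labeling function concrete, the following is a minimal sketch in plain Python. It is not Snorkel's API: the function names, the label encoding (1, -1, and 0 for abstain), and the example sentences are all assumptions made for illustration. The outputs are combined here with an unweighted majority vote, a simple baseline that stands in for the denoising step; the data programming approach described above instead estimates the accuracies and correlations of the labeling functions.

```python
from typing import Callable, List

# Hypothetical label encoding, chosen for this sketch only.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

LabelingFunction = Callable[[str], int]


def lf_causes(text: str) -> int:
    """Heuristic: the word 'causes' suggests a positive relation."""
    return POSITIVE if "causes" in text.lower() else ABSTAIN


def lf_treats(text: str) -> int:
    """Heuristic: the word 'treats' suggests a negative relation."""
    return NEGATIVE if "treats" in text.lower() else ABSTAIN


def lf_induced(text: str) -> int:
    """Heuristic: the word 'induced' suggests a positive relation."""
    return POSITIVE if "induced" in text.lower() else ABSTAIN


def apply_lfs(lfs: List[LabelingFunction], texts: List[str]) -> List[List[int]]:
    """Build a label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes: List[int]) -> int:
    """Combine one row of votes, ignoring abstains; ties resolve to ABSTAIN."""
    pos = sum(1 for v in votes if v == POSITIVE)
    neg = sum(1 for v in votes if v == NEGATIVE)
    if pos == neg:
        return ABSTAIN
    return POSITIVE if pos > neg else NEGATIVE


texts = [
    "Aspirin causes stomach bleeding in some patients.",
    "Metformin treats type 2 diabetes.",
    "The drug treats chemotherapy-induced nausea.",
    "The patient was discharged.",
]
label_matrix = apply_lfs([lf_causes, lf_treats, lf_induced], texts)
labels = [majority_vote(row) for row in label_matrix]
print(labels)
```

The third sentence shows why unweighted voting is a weak baseline: two heuristics conflict, and without any estimate of which one is more accurate the vote can only abstain.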
Snorkel: rapid training data creation with weak supervision