ABSTRACT
Labeling training data is one of the most costly bottlenecks in developing machine learning-based applications. We present a first-of-its-kind study showing how existing knowledge resources from across an organization can be used as weak supervision in order to bring development time and cost down by an order of magnitude, and introduce Snorkel DryBell, a new weak supervision management system for this setting. Snorkel DryBell builds on the Snorkel framework, extending it in three critical aspects: flexible, template-based ingestion of diverse organizational knowledge, cross-feature production serving, and scalable, sampling-free execution. On three classification tasks at Google, we find that Snorkel DryBell creates classifiers of comparable quality to ones trained with tens of thousands of hand-labeled examples, converts non-servable organizational resources to servable models for an average 52% performance improvement, and executes over millions of data points in tens of minutes.
Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale