Research Article · Open Access
DOI: 10.1145/3299869.3314036

Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale

Published: 25 June 2019

ABSTRACT

Labeling training data is one of the most costly bottlenecks in developing machine learning-based applications. We present a first-of-its-kind study showing how existing knowledge resources from across an organization can be used as weak supervision in order to bring development time and cost down by an order of magnitude, and introduce Snorkel DryBell, a new weak supervision management system for this setting. Snorkel DryBell builds on the Snorkel framework, extending it in three critical aspects: flexible, template-based ingestion of diverse organizational knowledge, cross-feature production serving, and scalable, sampling-free execution. On three classification tasks at Google, we find that Snorkel DryBell creates classifiers of comparable quality to ones trained with tens of thousands of hand-labeled examples, converts non-servable organizational resources to servable models for an average 52% performance improvement, and executes over millions of data points in tens of minutes.
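The core idea the abstract describes can be made concrete with a small sketch. In weak-supervision systems of this kind, heuristic "labeling functions" each vote (or abstain) on every unlabeled example, and their noisy, conflicting votes are aggregated into a single training label. The sketch below is illustrative only: the function names and keyword heuristics are hypothetical, and it uses simple majority vote where Snorkel-style systems instead fit a generative model of each labeling function's accuracy.

```python
# Illustrative weak-supervision sketch (not Snorkel DryBell's actual API).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_keyword(text):
    # Hypothetical heuristic: a keyword lookup standing in for an
    # organizational resource such as a topic model or knowledge graph.
    return POSITIVE if "celebrity" in text.lower() else ABSTAIN

def lf_spreadsheet(text):
    # Another hypothetical heuristic voting for the negative class.
    return NEGATIVE if "spreadsheet" in text.lower() else ABSTAIN

def aggregate(votes):
    # Majority vote over non-abstaining labeling functions; production
    # systems model labeling-function accuracies instead of voting equally.
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

examples = ["Celebrity chef opens restaurant", "Quarterly spreadsheet tips"]
labels = [aggregate([lf_keyword(t), lf_spreadsheet(t)]) for t in examples]
print(labels)  # [1, 0]
```

The aggregated labels then train an end discriminative model, which is what makes non-servable resources (crawlers, aggregate statistics, internal models) usable at serving time.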


Published in

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
June 2019, 2106 pages
ISBN: 9781450356435
DOI: 10.1145/3299869
Publisher: Association for Computing Machinery, New York, NY, United States

Copyright © 2019 Owner/Author. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Acceptance Rates

SIGMOD '19 paper acceptance rate: 88 of 430 submissions (20%). Overall acceptance rate: 785 of 4,003 submissions (20%).
