research-article
Open Access

Asterisk: Generating Large Training Datasets with Automatic Active Supervision

Published: 30 May 2020

Abstract

Labeling datasets is one of the most expensive bottlenecks in machine-learning data preprocessing. Consequently, organizations across many domains apply weak supervision to produce noisy labels; however, because weak supervision relies on cheaper sources, the quality of the generated labels is problematic. In this article, we present Asterisk, an end-to-end framework for generating high-quality, large-scale labeled datasets. The system first automatically generates heuristics to assign initial labels. The framework then applies a novel data-driven active learning process to enhance labeling quality. We present an algorithm that learns the selection policy by accommodating the modeled accuracies of the heuristics along with the outcome of the generative model. Finally, the system employs the output of the active learning process to enhance the quality of the labels. To evaluate the proposed system, we report its performance against four state-of-the-art techniques. In collaboration with our industrial partner, IBM, we test the framework on a wide range of real-world applications. The experiments include 10 datasets of varying sizes, the largest containing 11 million records. The results illustrate the effectiveness of the framework in producing high-quality labels and achieving high classification accuracy with minimal annotation effort.
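The pipeline described above (heuristics assign noisy labels, a label model combines them weighted by their modeled accuracies, and active learning queries the records the model is least certain about) can be sketched roughly as follows. This is a minimal illustration, not Asterisk's actual implementation: the labeling functions, thresholds, and accuracy values are all hypothetical.

```python
import numpy as np

# Three hypothetical labeling heuristics over a 1-D feature; each votes
# +1, -1, or 0 (abstain). Thresholds are illustrative only.
def lf_high(x): return 1 if x > 0.7 else 0
def lf_low(x):  return -1 if x < 0.3 else 0
def lf_mid(x):  return 1 if x > 0.5 else 0

def label_matrix(X, lfs):
    """Apply every heuristic to every record: one row per record."""
    return np.array([[lf(x) for lf in lfs] for x in X])

def weighted_vote(L, acc):
    """Combine noisy votes, weighting each heuristic by its modeled accuracy.

    Returns P(y = +1) per record; abstentions (0) contribute nothing.
    """
    w = np.log(acc / (1 - acc))          # log-odds weight per heuristic
    score = L @ w                        # signed, accuracy-weighted vote
    return 1 / (1 + np.exp(-2 * score))  # squash to a probability

def pick_for_annotation(probs, k):
    """Active-learning step: select the k records the label model is least sure of."""
    uncertainty = 1 - np.abs(probs - 0.5) * 2  # 1 at p=0.5, 0 at p=0 or 1
    return np.argsort(-uncertainty)[:k]

X = np.array([0.05, 0.45, 0.55, 0.95])
L = label_matrix(X, [lf_high, lf_low, lf_mid])
probs = weighted_vote(L, acc=np.array([0.9, 0.9, 0.6]))
queried = pick_for_annotation(probs, k=1)  # the all-abstain record is most ambiguous
```

The key design point mirrored here is that votes are not counted equally: a heuristic modeled as 90% accurate moves the posterior far more than one at 60%, and records where the heuristics abstain or conflict end up near 0.5, which is exactly where the annotation budget is spent.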



      • Published in

        ACM/IMS Transactions on Data Science, Volume 1, Issue 2
        May 2020
        169 pages
        ISSN: 2691-1922
        DOI: 10.1145/3403596

        Copyright © 2020 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 30 May 2020
        • Online AM: 7 May 2020
        • Accepted: 1 February 2020
        • Revised: 1 December 2019
        • Received: 1 July 2019


        Qualifiers

        • research-article
        • Research
        • Refereed
