Abstract
Labeling datasets is one of the most expensive bottlenecks in machine learning pipelines. Organizations across many domains therefore apply weak supervision to produce labels cheaply; however, because weak supervision relies on lower-quality sources, the resulting labels are noisy. In this article, we present Asterisk, an end-to-end framework for generating high-quality, large-scale labeled datasets. The system first automatically generates heuristics to assign initial labels. The framework then applies a novel data-driven active learning process to improve labeling quality: we present an algorithm that learns the selection policy from the modeled accuracies of the heuristics together with the output of the generative model. Finally, the system uses the outcome of the active learning process to enhance the quality of the labels. To evaluate the proposed system, we report its performance against four state-of-the-art techniques. In collaboration with our industrial partner, IBM, we test the framework on a wide range of real-world applications. The experiments cover 10 datasets of varying sizes, the largest containing 11 million records. The results demonstrate the framework's effectiveness in producing high-quality labels and achieving high classification accuracy with minimal annotation effort.
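The pipeline the abstract describes, heuristics that assign noisy labels, an aggregation step weighted by the heuristics' modeled accuracies, and an active learning loop that routes low-confidence instances to an annotator, can be illustrated with a minimal sketch. All names below are hypothetical; this is not the authors' implementation, and the accuracy-weighted log-odds vote stands in for the paper's generative model.

```python
# Illustrative sketch of an Asterisk-style labeling pipeline (hypothetical
# names, not the authors' code). Stage 1: heuristics emit votes in
# {+1, -1, 0} where 0 means abstain. Stage 2: votes are aggregated with
# log-odds weights derived from each heuristic's modeled accuracy.
# Stage 3: the least-confident instances are selected for annotation.
from math import exp, log

def weighted_vote(votes, accuracies):
    """Combine heuristic votes into P(label = +1), weighting each
    non-abstaining heuristic by the log-odds of its modeled accuracy."""
    score = 0.0
    for v, acc in zip(votes, accuracies):
        if v != 0:                           # ignore abstentions
            score += v * log(acc / (1 - acc))
    return 1.0 / (1.0 + exp(-score))

def select_for_annotation(vote_matrix, accuracies, budget):
    """Uncertainty sampling: pick the `budget` instances whose
    aggregated probability is closest to 0.5."""
    probs = [weighted_vote(v, accuracies) for v in vote_matrix]
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:budget], probs

# Toy example: three heuristics with modeled accuracies, four instances.
accuracies = [0.9, 0.7, 0.6]
vote_matrix = [
    [1, 1, 1],    # all heuristics agree: high confidence
    [1, -1, 0],   # disagreement plus an abstention: low confidence
    [-1, -1, 1],  # mild disagreement, dominated by accurate heuristics
    [0, 0, 1],    # only the weakest heuristic fires: low confidence
]
queried, probs = select_for_annotation(vote_matrix, accuracies, budget=2)
print(queried)  # → [3, 1]: the two least-confident instances
```

The labels obtained from the annotator for the queried instances would then feed back into the accuracy estimates, which is the role the active learning loop plays in the framework.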
Asterisk: Generating Large Training Datasets with Automatic Active Supervision