Abstract
Supervised learning, that is, learning from labeled examples, is an area of Machine Learning that has reached substantial maturity. It has generated general-purpose, practically successful algorithms, and its foundations are well understood, captured by theoretical frameworks such as the PAC learning model and Statistical Learning Theory. However, for many contemporary practical problems, such as classifying web pages or detecting spam, additional information is often available in the form of unlabeled data, which is typically much cheaper and more plentiful than labeled data. As a consequence, there has recently been substantial interest in semi-supervised learning, that is, using unlabeled data together with labeled data, since any useful information that reduces the amount of labeled data needed can be a significant benefit. Several techniques have been developed for doing this, along with experimental results on a variety of learning problems. Unfortunately, the standard frameworks for reasoning about supervised learning do not capture the key aspects and assumptions underlying these semi-supervised learning methods.
In this article, we describe an augmented version of the PAC model designed for semi-supervised learning, one that can be used to reason about many of the different approaches taken over the past decade in the Machine Learning community. This model provides a unified framework for analyzing when and why unlabeled data can help, in which one can analyze both sample-complexity and algorithmic issues. The model can be viewed as an extension of the standard PAC model in which, in addition to a concept class C, one also proposes a compatibility notion: a type of compatibility that one believes the target concept should have with the underlying distribution of data. Unlabeled data is then potentially helpful in this setting because it allows one to estimate compatibility over the space of hypotheses, and to reduce the search space from the whole class C down to those hypotheses that, according to one's assumptions, are a priori reasonable with respect to the distribution; a sketch of the central definitions follows. As we show, many of the assumptions underlying existing semi-supervised learning algorithms can be formulated in this framework.
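To make this concrete, the following is a minimal rendering of the model's central objects, using notation we believe matches the body of the article (a compatibility function χ, an "unlabeled error rate" errunl, and a slack parameter τ); the precise formulation is in the full text, so treat this as a paraphrase rather than a verbatim statement:

```latex
% Compatibility of a hypothesis h with the unlabeled data distribution D,
% given a compatibility function \chi : C \times X \to [0,1]:
\[
\chi(h, D) \;=\; \mathop{\mathbb{E}}_{x \sim D}\bigl[\chi(h, x)\bigr],
\qquad
\mathrm{err}_{\mathrm{unl}}(h) \;=\; 1 - \chi(h, D).
\]
% Unlabeled data lets one estimate err_unl over C and restrict the search to
% the a priori reasonable hypotheses:
\[
C_{D,\chi}(\tau) \;=\; \bigl\{\, h \in C \;:\; \mathrm{err}_{\mathrm{unl}}(h) \le \tau \,\bigr\}.
\]
```

The labeled-sample savings come from the second line: a large unlabeled sample lets one estimate χ(h, D) uniformly over the hypothesis space, so supervised learning then only needs to search the reduced class C_{D,χ}(τ) rather than all of C.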
After proposing the model, we then analyze sample-complexity issues in this setting: that is, how much of each type of data one should expect to need in order to learn well, and what the key quantities are that these numbers depend on. We also consider the algorithmic question of how to efficiently optimize for natural classes and compatibility notions, and provide several algorithmic results including an improved bound for Co-Training with linear separators when the distribution satisfies independence given the label.
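To indicate the flavor of sample-complexity statement involved, here is a representative bound for finite hypothesis classes in the doubly realizable case (the target concept has both zero error and zero incompatibility). We reconstruct it from the definitions above, so the exact constants and conditions are assumptions; the precise theorems appear in the body of the article:

```latex
% With enough unlabeled data to estimate compatibility, on the order of
% (1/\epsilon^2)(\ln|C| + \ln(2/\delta)) samples, a labeled sample of size
\[
m_l \;\ge\; \frac{1}{\epsilon}\left[\ln\bigl|C_{D,\chi}(\epsilon)\bigr| + \ln\frac{2}{\delta}\right]
\]
% suffices so that, with probability at least 1 - \delta, every hypothesis
% consistent with the labeled data and compatible with D has error at most
% \epsilon. The gain over the standard PAC bound is that
% \ln|C_{D,\chi}(\epsilon)| replaces \ln|C|.
```

The algorithmic result mentioned last concerns Co-Training. For readers unfamiliar with that setting, the sketch below shows a generic two-view co-training loop in the style of Blum and Mitchell; it is our own illustrative toy, not the article's improved algorithm for linear separators under independence given the label, and the function and parameter names are ours:

```python
# A minimal two-view co-training sketch in the spirit of Blum and Mitchell.
# Each round, the two view-specific classifiers label the unlabeled points
# they are most confident about and fold them into the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1, X2, y, U1, U2, rounds=10, per_round=5):
    """X1, X2: two views of the labeled data; y: labels in {0, 1};
    U1, U2: the same two views of the unlabeled pool."""
    h1 = LogisticRegression()
    h2 = LogisticRegression()
    for _ in range(rounds):
        if len(U1) == 0:
            break
        h1.fit(X1, y)
        h2.fit(X2, y)
        # Confidence of each view's prediction on the unlabeled pool.
        p1 = h1.predict_proba(U1).max(axis=1)
        p2 = h2.predict_proba(U2).max(axis=1)
        # Promote the points some view is most confident about, labeled
        # by whichever of the two views is more confident on each point.
        pick = np.argsort(-np.maximum(p1, p2))[:per_round]
        y_new = np.where(p1[pick] >= p2[pick],
                         h1.predict(U1[pick]),
                         h2.predict(U2[pick]))
        X1 = np.vstack([X1, U1[pick]])
        X2 = np.vstack([X2, U2[pick]])
        y = np.concatenate([y, y_new])
        keep = np.setdiff1d(np.arange(len(U1)), pick)
        U1, U2 = U1[keep], U2[keep]
    return h1.fit(X1, y), h2.fit(X2, y)
```

The point of the two views is that each classifier's confident self-labels act as fresh labeled data for the other; the conditional-independence assumption mentioned in the abstract is what makes such self-labels informative rather than self-reinforcing.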