
A discriminative model for semi-supervised learning

Published: 29 March 2010

Abstract

Supervised learning, that is, learning from labeled examples, is an area of Machine Learning that has reached substantial maturity. It has generated general-purpose, practically successful algorithms, and its foundations are well understood, captured by theoretical frameworks such as the PAC-learning model and Statistical Learning Theory. However, for many contemporary practical problems, such as classifying web pages or detecting spam, there is often additional information available in the form of unlabeled data, which is typically much cheaper and more plentiful than labeled data. As a consequence, there has recently been substantial interest in semi-supervised learning, the use of unlabeled data together with labeled data, since any useful information that reduces the amount of labeled data needed can be a significant benefit. Several techniques have been developed for doing this, along with experimental results on a variety of learning problems. Unfortunately, the standard frameworks for reasoning about supervised learning do not capture the key aspects and assumptions underlying these semi-supervised learning methods.

In this article, we describe an augmented version of the PAC model designed for semi-supervised learning that can be used to reason about many of the different approaches taken over the past decade in the Machine Learning community. This model provides a unified framework for analyzing when and why unlabeled data can help, in which one can analyze both sample-complexity and algorithmic issues. The model can be viewed as an extension of the standard PAC model where, in addition to a concept class C, one also proposes a compatibility notion: a type of compatibility that one believes the target concept should have with the underlying distribution of data. Unlabeled data is then potentially helpful in this setting because it allows one to estimate compatibility over the space of hypotheses, and to reduce the size of the search space from the whole set of hypotheses C down to those that, according to one's assumptions, are a priori reasonable with respect to the distribution. As we show, many of the assumptions underlying existing semi-supervised learning algorithms can be formulated in this framework.
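To make the two-step idea concrete, here is a toy sketch (not from the article): hypotheses are 1-D threshold classifiers, and the assumed compatibility notion is margin-based, so a hypothesis is compatible to the extent that its decision boundary avoids dense regions of the unlabeled data. The function names, the threshold class, and the compatibility cutoff `tau` are all illustrative choices.

```python
import numpy as np

# Illustrative sketch only: hypotheses are 1-D thresholds, and "compatibility"
# is the fraction of unlabeled points lying at least `margin` away from the
# decision boundary (high compatibility = boundary in a low-density region).

def compatibility(threshold, unlabeled, margin=0.5):
    """Estimated compatibility of a threshold hypothesis with the data."""
    return np.mean(np.abs(unlabeled - threshold) >= margin)

def semi_supervised_erm(thresholds, unlabeled, labeled_x, labeled_y, tau=0.9):
    # Step 1: unlabeled data prunes the hypothesis class down to those
    # hypotheses that are a priori reasonable w.r.t. the distribution.
    compatible = [t for t in thresholds if compatibility(t, unlabeled) >= tau]
    # Step 2: ordinary empirical risk minimization over the reduced class.
    def empirical_error(t):
        return np.mean(np.where(labeled_x >= t, 1, -1) != labeled_y)
    return min(compatible, key=empirical_error)
```

Because step 1 shrinks the effective hypothesis space before any labels are consulted, fewer labeled examples are needed to select a good hypothesis in step 2, which is the intuition the sample-complexity analysis formalizes.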

After proposing the model, we then analyze sample-complexity issues in this setting: that is, how much of each type of data one should expect to need in order to learn well, and what the key quantities are that these numbers depend on. We also consider the algorithmic question of how to efficiently optimize for natural classes and compatibility notions, and provide several algorithmic results including an improved bound for Co-Training with linear separators when the distribution satisfies independence given the label.
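The Co-Training setting mentioned above can be sketched as follows. This is a hypothetical illustration of the standard Co-Training loop of Blum and Mitchell: the article's result concerns linear separators, but simple nearest-centroid classifiers are substituted here for brevity, and all names and parameters are illustrative.

```python
import numpy as np

# Sketch of Co-Training: each example has two "views" (X1, X2). A classifier
# trained on the current labeled pool labels its most confident unlabeled
# examples, growing the pool available to the other view's classifier.

def centroid_fit(X, y):
    # Nearest-centroid "classifier": just the two class means.
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)

def centroid_predict(model, X):
    mu_pos, mu_neg = model
    score = np.linalg.norm(X - mu_neg, axis=1) - np.linalg.norm(X - mu_pos, axis=1)
    return np.where(score >= 0, 1, -1), np.abs(score)  # label, confidence

def co_train(X1, X2, y_seed, seed_mask, rounds=4, per_round=5):
    mask, labels = seed_mask.copy(), np.where(seed_mask, y_seed, 0)
    for _ in range(rounds):
        for X in (X1, X2):  # alternate between the two views
            pred, conf = centroid_predict(centroid_fit(X[mask], labels[mask]), X)
            unl = np.flatnonzero(~mask)
            top = unl[np.argsort(-conf[unl])[:per_round]]
            labels[top], mask[top] = pred[top], True
    # Final rule: vote between the two per-view classifiers.
    p1, _ = centroid_predict(centroid_fit(X1[mask], labels[mask]), X1)
    p2, _ = centroid_predict(centroid_fit(X2[mask], labels[mask]), X2)
    return np.where(p1 + p2 >= 0, 1, -1)
```

The independence-given-the-label condition in the article's bound corresponds here to the two views being conditionally independent: each view's errors then look like random noise to the other, which is what makes the bootstrapping from a small labeled seed effective.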

