
Robustness and Regularization of Support Vector Machines

Publication: The Journal of Machine Learning Research

Abstract

We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection against noise while at the same time controlling overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this robustness interpretation to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well.
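
To make this equivalence concrete, the following schematic statement illustrates the kind of result the abstract describes. The notation here is introduced only for illustration and is not quoted from the paper (training pairs (x_i, y_i) with labels y_i in {-1, +1}, per-sample disturbances delta_i, and a total disturbance budget c), and the paper's exact theorem involves mild additional conditions on the training set. In this form, the robust hinge-loss problem over norm-bounded additive disturbances coincides with a norm-regularized SVM:

    \min_{w,b}\; \max_{\sum_{i=1}^{n} \lVert \delta_i \rVert \le c}\; \sum_{i=1}^{n} \max\bigl(1 - y_i(\langle w,\, x_i - \delta_i \rangle + b),\, 0\bigr)
      \;=\; \min_{w,b}\; c\,\lVert w \rVert_{*} \;+\; \sum_{i=1}^{n} \max\bigl(1 - y_i(\langle w, x_i \rangle + b),\, 0\bigr),

where \lVert \cdot \rVert_{*} is the dual of the norm bounding the disturbances. The disturbance budget c thus plays exactly the role of the regularization coefficient, and with the self-dual Euclidean norm the right-hand side is the familiar l2-regularized soft-margin SVM.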

• Published in

  The Journal of Machine Learning Research, Volume 10 (December 2009)
  ISSN: 1532-4435
  EISSN: 1533-7928

  Publisher

  JMLR.org

  Publication History

  • Published: 1 December 2009