DOI: 10.1145/2213556.2213579
tutorial

Approximate computation and implicit regularization for very large-scale data analysis

Published: 21 May 2012

ABSTRACT

Database theory and database practice are typically the domain of computer scientists who adopt what may be termed an algorithmic perspective on their data. This perspective is very different from the more statistical perspective adopted by statisticians, scientific computers, machine learners, and others who work on what may be broadly termed statistical data analysis. In this article, I will address fundamental aspects of this algorithmic-statistical disconnect, with an eye to bridging the gap between these two very different approaches. A concept that lies at the heart of this disconnect is that of statistical regularization, a notion that concerns how robust the output of an algorithm is to the noise properties of the input data. Although it is nearly completely absent from computer science, which historically has taken the input data as given and modeled algorithms discretely, regularization in one form or another is central to nearly every application domain that applies algorithms to noisy data. By using several case studies, I will illustrate, both theoretically and empirically, the nonobvious fact that approximate computation, in and of itself, can implicitly lead to statistical regularization. This and other recent work suggest that, by exploiting in a more principled way the statistical properties implicit in worst-case algorithms, one can in many cases satisfy the bicriteria of having algorithms that are scalable to very large-scale databases and that also have good inferential or predictive properties.
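The central claim, that approximate computation can act as implicit regularization, can be made concrete with a standard textbook illustration. The sketch below is not taken from the tutorial and is not necessarily one of its case studies. It assumes a toy diagonal least-squares problem with made-up data and compares three answers: the exact solution, a gradient descent run stopped early (the "approximate computation"), and an explicitly regularized ridge solution. All function names and numbers are hypothetical.

```python
import math

def gradient_descent_diag(s, y, step, iters):
    """Run `iters` gradient steps on 0.5 * sum((s_i * x_i - y_i)^2), starting from x = 0."""
    x = [0.0] * len(s)
    for _ in range(iters):
        # The gradient with respect to x_i is s_i * (s_i * x_i - y_i).
        x = [xi - step * si * (si * xi - yi) for xi, si, yi in zip(x, s, y)]
    return x

def ridge_diag(s, y, lam):
    """Closed-form ridge (Tikhonov) solution for the same diagonal problem."""
    return [si * yi / (si * si + lam) for si, yi in zip(s, y)]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

# Hypothetical toy data: one well-conditioned direction and one nearly singular one.
s = [1.0, 0.01]
y = [1.0, 0.1]  # the second entry plays the role of noise

exact = [yi / si for si, yi in zip(s, y)]               # noise is amplified by 1 / 0.01
early = gradient_descent_diag(s, y, step=0.5, iters=20) # stopped long before convergence
ridge = ridge_diag(s, y, lam=0.01)                      # explicit regularization

print("exact:", exact, "norm:", norm(exact))
print("early:", early, "norm:", norm(early))
print("ridge:", ridge, "norm:", norm(ridge))
```

In this toy setting the early-stopped iterate nearly matches the exact solution in the well-conditioned direction while leaving the noisy, nearly singular direction heavily damped, which is qualitatively what the ridge penalty does explicitly. The number of iterations plays a role loosely analogous to the inverse of the ridge parameter.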


    • Published in

      PODS '12: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems
      May 2012
      332 pages
      ISBN:9781450312486
      DOI:10.1145/2213556

      Copyright © 2012 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • tutorial

      Acceptance Rates

      Overall Acceptance Rate: 476 of 1,835 submissions, 26%
