ABSTRACT
Database theory and database practice are typically the domain of computer scientists, who adopt what may be termed an algorithmic perspective on their data. This perspective is very different from the more statistical perspective adopted by statisticians, scientific computing researchers, machine learners, and others who work on what may broadly be termed statistical data analysis. In this article, I will address fundamental aspects of this algorithmic-statistical disconnect, with an eye to bridging the gap between these two very different approaches. A concept that lies at the heart of this disconnect is that of statistical regularization, a notion that has to do with how robust the output of an algorithm is to the noise properties of the input data. Although it is nearly completely absent from computer science, which historically has taken the input data as given and modeled algorithms discretely, regularization in one form or another is central to nearly every application domain that applies algorithms to noisy data. Using several case studies, I will illustrate, both theoretically and empirically, the nonobvious fact that approximate computation, in and of itself, can implicitly lead to statistical regularization. This and other recent work suggests that, by exploiting in a more principled way the statistical properties implicit in worst-case algorithms, one can in many cases satisfy the bicriteria of having algorithms that are scalable to very large-scale databases and that also have good inferential or predictive properties.
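The phenomenon the abstract describes, that approximate computation can act as implicit regularization, can be illustrated with a standard toy example (not drawn from the article itself; the data, variable names, and parameter choices below are hypothetical). On an ill-conditioned, noisy least-squares problem, the exact solution amplifies noise along small singular directions, while an "approximate" solver, here gradient descent stopped after a few iterations, behaves much like explicit Tikhonov (ridge) regularization:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy ill-conditioned least-squares problem with a decaying spectrum.
n, d = 100, 20
A = rng.normal(size=(n, d)) @ np.diag(np.logspace(0, -4, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)  # noisy observations

# Exact (unregularized) solution: noise along the small singular
# directions is amplified by factors up to ~1e4.
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]

# "Approximate computation": a few gradient-descent steps from zero.
# Directions with small singular values barely move in 50 steps, so
# early stopping never lets the noise-amplified components appear.
x_approx = np.zeros(d)
step = 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(50):
    x_approx -= step * (A.T @ (A @ x_approx - b))

# Explicit Tikhonov (ridge) solution, for comparison.
lam = 1e-2
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

print(np.linalg.norm(x_exact))   # large: noise blown up
print(np.linalg.norm(x_approx))  # moderate, comparable to ridge
print(np.linalg.norm(x_ridge))
```

The early-stopped iterate stays far smaller in norm than the exact solution, mirroring the effect of the explicit penalty: the number of iterations plays the role of the regularization parameter.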
Approximate computation and implicit regularization for very large-scale data analysis