ABSTRACT
A sketch of a matrix A is another matrix B which is significantly smaller than A but still approximates it well. Finding such sketches efficiently is an important building block in modern algorithms for approximating, for example, the PCA of massive matrices. This task is made more challenging in the streaming model, where each row of the input matrix can only be processed once and storage is severely limited.
In this paper we adapt a well-known streaming algorithm for approximating item frequencies to the matrix sketching setting. The algorithm receives n rows of a large matrix $A \in \mathbb{R}^{n \times m}$ one after the other in a streaming fashion. It maintains a sketch $B \in \mathbb{R}^{\ell \times m}$ containing only $\ell \ll n$ rows but still guarantees that $A^T A \approx B^T B$. More accurately, $\forall x, \|x\| = 1$: $0 \le \|Ax\|^2 - \|Bx\|^2 \le 2\|A\|_f^2/\ell$. Or, $B^T B \preceq A^T A$ and $\|A^T A - B^T B\| \le 2\|A\|_f^2/\ell$.
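The two forms of the guarantee express the same fact. The short derivation below is a reader's note rather than text from the paper; it uses only the symmetry of $A^T A - B^T B$ and the definition of the spectral norm.

```latex
% For any unit vector x:
\[
  \|Ax\|^2 - \|Bx\|^2 \;=\; x^T A^T A\, x - x^T B^T B\, x \;=\; x^T \left(A^T A - B^T B\right) x .
\]
% Lower bound: the quadratic form is nonnegative for every x exactly when
\[
  A^T A - B^T B \succeq 0 \quad\Longleftrightarrow\quad B^T B \preceq A^T A .
\]
% Upper bound: for a symmetric positive semidefinite matrix M = A^T A - B^T B,
\[
  \max_{\|x\| = 1} x^T M x \;=\; \lambda_{\max}(M) \;=\; \|M\| ,
  \qquad\text{so}\qquad
  \|A^T A - B^T B\| \;\le\; \frac{2\|A\|_f^2}{\ell} .
\]
```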
This gives a streaming algorithm whose error decays proportionally to $1/\ell$ using $O(m\ell)$ space. For comparison, random-projection, hashing, or sampling based algorithms produce convergence bounds proportional to $1/\sqrt{\ell}$. Sketch updates per row in A require amortized $O(m\ell)$ operations, and the algorithm is perfectly parallelizable. Our experiments corroborate the algorithm's scalability and improved convergence rate. The presented algorithm also stands out in that it is deterministic, simple to implement, and elementary to prove.
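The abstract does not spell out the update rule, so the following is only a minimal sketch of one way such a sketch could be maintained, assuming NumPy and assuming the number of sketch rows is at most m. The function name `frequent_directions`, its parameters, and the choice of shrinkage threshold are illustrative assumptions, not the paper's reference implementation. Rows are written into empty slots of B; when B is full, its singular values are shrunk so that at least half of its rows become zero again.

```python
import numpy as np


def frequent_directions(rows, ell, m):
    """Stream `rows` (each of length m) into an ell-by-m sketch B.

    Hypothetical helper for illustration; assumes 1 <= ell <= m.
    """
    if not 1 <= ell <= m:
        raise ValueError("this sketch assumes 1 <= ell <= m")
    B = np.zeros((ell, m))
    next_free = 0  # index of the first all-zero row of B
    for row in rows:
        if next_free == ell:
            # B is full: rotate it onto its singular directions and shrink.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2  # squared singular value used as threshold
            shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = shrunk[:, None] * Vt  # rows ell//2 and beyond are now zero
            next_free = ell // 2
        B[next_free] = np.asarray(row, dtype=float)
        next_free += 1
    return B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 50))
    ell = 20
    B = frequent_directions(A, ell, A.shape[1])
    err = np.linalg.norm(A.T @ A - B.T @ B, 2)
    bound = 2 * np.linalg.norm(A, "fro") ** 2 / ell
    print("spectral error:", err, "bound:", bound)
```

In this sketch, each shrink step costs one SVD of an $\ell \times m$ matrix, roughly $O(m\ell^2)$ work, and frees about $\ell/2$ rows, which is where an amortized $O(m\ell)$ cost per input row would come from.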
Simple and deterministic matrix sketching

Edo Liberty

