ABSTRACT
The L1-distance, also known as the Manhattan or taxicab distance, between two vectors x, y in R^n is ∑_{i=1}^n |x_i - y_i|. Approximating this distance is a fundamental primitive on massive databases, with applications to clustering, nearest neighbor search, network monitoring, regression, sampling, and support vector machines. We give the first 1-pass streaming algorithm for this problem in the turnstile model with O*(1/ε^2) space and O*(1) update time. The O* notation hides polylogarithmic factors in ε, n, and the precision required to store vector entries. All previous algorithms either required Ω(1/ε^3) space or Ω(1/ε^2) update time and/or could not work in the turnstile model (i.e., support an arbitrary number of updates to each coordinate). Our bounds are optimal up to O*(1) factors.
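To make the problem statement concrete, the following Python sketch shows the exact L1-distance together with a classical Cauchy-projection sketch in the style of Indyk's stable-distribution approach. This is not the algorithm of this paper: every update touches all k counters, which is exactly the slow update time the abstract contrasts against. The names `l1_distance`, `CauchyL1Sketch`, `n`, `k`, and `seed` are hypothetical, and the random matrix is stored explicitly for readability, which a real streaming algorithm would avoid.

```python
import math
import random
import statistics


def l1_distance(x, y):
    """Exact L1 (Manhattan) distance: sum over i of |x_i - y_i|."""
    return sum(abs(a - b) for a, b in zip(x, y))


class CauchyL1Sketch:
    """Illustrative linear sketch for turnstile updates (not the paper's method)."""

    def __init__(self, n, k, seed=0):
        rng = random.Random(seed)
        # k x n matrix of i.i.d. standard Cauchy entries, stored explicitly.
        # Assumption for illustration: a space-efficient algorithm would
        # generate these entries pseudorandomly instead of storing them.
        self.rows = [
            [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(k)
        ]
        self.counters = [0.0] * k

    def update(self, i, delta):
        """Apply v_i += delta; delta may be negative (turnstile model)."""
        # Cost is proportional to k per update.
        for j, row in enumerate(self.rows):
            self.counters[j] += delta * row[i]

    def estimate(self):
        """Estimate the L1 norm of the sketched vector.

        Each counter is distributed as ||v||_1 times a standard Cauchy
        variable, and the median of |Cauchy| is 1, so the median of the
        absolute counters concentrates around ||v||_1.
        """
        return statistics.median(abs(c) for c in self.counters)
```

To estimate the distance between x and y, one would feed each coordinate of x as a positive update and each coordinate of y as a negative update, so that the sketch holds x - y, and then call `estimate()`. Taking k on the order of 1/ε^2 is the usual choice for a (1 ± ε) approximation with constant probability.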
Index Terms
Fast Manhattan sketches in data streams