DOI: 10.1145/1807085.1807101
research-article

Fast Manhattan sketches in data streams

Published: 06 June 2010

ABSTRACT

The $L_1$-distance, also known as the Manhattan or taxicab distance, between two vectors $x, y \in \mathbb{R}^n$ is $\sum_{i=1}^{n} |x_i - y_i|$. Approximating this distance is a fundamental primitive on massive databases, with applications to clustering, nearest neighbor search, network monitoring, regression, sampling, and support vector machines. We give the first 1-pass streaming algorithm for this problem in the turnstile model with $O^*(1/\varepsilon^2)$ space and $O^*(1)$ update time. The $O^*$ notation hides polylogarithmic factors in $\varepsilon$, $n$, and the precision required to store vector entries. All previous algorithms either required $\Omega(1/\varepsilon^3)$ space or $\Omega(1/\varepsilon^2)$ update time and/or could not work in the turnstile model (i.e., support an arbitrary number of updates to each coordinate). Our bounds are optimal up to $O^*(1)$ factors.
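
To make the problem statement concrete, the following is a minimal sketch of a classical baseline, not the algorithm of this paper: a linear sketch built from standard Cauchy random projections with a median estimator, in the spirit of the stable-distribution approach. The names `CauchyL1Sketch`, `update`, `estimate_distance`, and `l1_distance` are hypothetical and chosen only for illustration. The sketch assumes both streams use the same seed so that they share the same random matrix, and it stores that matrix explicitly for simplicity, which a genuinely small-space algorithm would not do. Each update touches all k counters, which is the kind of per-update cost the abstract says earlier methods incurred.

```python
import math
import random
from statistics import median


def l1_distance(x, y):
    """Exact Manhattan distance: sum over i of |x_i - y_i|."""
    return sum(abs(a - b) for a, b in zip(x, y))


class CauchyL1Sketch:
    """Illustrative linear sketch: k inner products with i.i.d. standard Cauchy rows.

    Assumption: the k x n random matrix is stored explicitly, so this is not
    space-efficient; it only illustrates the turnstile update and the estimator.
    """

    def __init__(self, n, k, seed=0):
        rng = random.Random(seed)
        # A standard Cauchy variate is the tangent of a uniform angle in (-pi/2, pi/2).
        self.rows = [
            [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(k)
        ]
        self.counters = [0.0] * k

    def update(self, i, delta):
        """Turnstile update x_i += delta (delta may be negative); costs O(k) time."""
        for j, row in enumerate(self.rows):
            self.counters[j] += delta * row[i]

    def estimate_distance(self, other):
        """Estimate ||x - y||_1 from two sketches built with the same seed.

        Each counter difference is distributed as ||x - y||_1 times a standard
        Cauchy variable, and the median of |Cauchy| is 1, so the sample median
        of the absolute differences concentrates around the true distance.
        """
        return median(abs(a - b) for a, b in zip(self.counters, other.counters))
```

A usage pattern under these assumptions: construct two sketches with identical `n`, `k`, and `seed`, feed each stream's `(i, delta)` updates to its own sketch, then call `estimate_distance`. Larger `k` tightens the estimate at the price of more space and slower updates, which is the trade-off the paper's bounds address.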


Published in

PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2010
350 pages
ISBN: 9781450300339
DOI: 10.1145/1807085

Copyright © 2010 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 476 of 1,835 submissions, 26%
