DOI: 10.1145/1559795.1559819
research-article

Space-optimal heavy hitters with strong error bounds

Published: 29 June 2009

ABSTRACT

The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of "counter-based algorithms" (including the popular and very space-efficient FREQUENT and SPACESAVING algorithms) provide much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining "tail." This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound.
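To make "counter-based algorithm" concrete, here is a minimal Python sketch of SPACESAVING as it is commonly described: keep at most m counters, and when an unmonitored item arrives with all counters in use, evict the item with the smallest count and let the newcomer inherit that count plus one. The function name `space_saving`, the parameter `m`, and the dictionary representation are assumptions made for illustration; this is not the paper's pseudocode, and a practical implementation would replace the linear-time minimum search with a more efficient structure.

```python
def space_saving(stream, m):
    """Summarize a stream with at most m counters (illustrative sketch).

    Returns a dict mapping monitored items to estimated counts.
    Estimates never underestimate the true count of a monitored item.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    counters = {}
    for x in stream:
        if x in counters:
            # Item already monitored: increment its counter.
            counters[x] += 1
        elif len(counters) < m:
            # A free counter is available: start monitoring the item.
            counters[x] = 1
        else:
            # All counters in use: evict the item with the smallest count
            # and let the new item inherit that count plus one.
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1
    return counters


# Two heavy items followed by a short tail of singletons.
stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
estimates = space_saving(stream, m=4)
print(estimates)
```

In this small example the two heavy items keep exact counts, and the only estimation errors come from the eight tail items competing for the remaining two counters, which is the behavior the tail guarantee describes.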

This tail guarantee allows these algorithms to solve the "sparse recovery" problem. Here, the goal is to recover a faithful representation of the vector of frequencies, $f$. We prove that using space $O(k)$, the algorithms construct an approximation $f^*$ to the frequency vector $f$ so that the $L_1$ error $\|f - f^*\|_1$ is close to the best possible error $\min_{f'} \|f' - f\|_1$, where $f'$ ranges over all vectors with at most $k$ non-zero entries. This improves the previously best known space bound of about $O(k \log n)$ for streams without element deletions (where $n$ is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data, and guarantees for accuracy of merging multiple summarized streams.
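The benchmark in the sparse recovery problem can be computed directly when the true frequencies are known: for non-negative frequencies, the best approximation with at most k non-zero entries keeps the k largest entries exactly, so its L1 error is the total frequency of the remaining tail. The Python sketch below computes that benchmark and the L1 error of a candidate approximation. The helper names `best_k_sparse_error` and `l1_error`, and the example summary `f_star`, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def best_k_sparse_error(freq, k):
    """L1 error of the best approximation with at most k non-zero entries.

    Assumes non-negative frequencies (a stream without deletions):
    keep the k largest entries and pay for everything else.
    """
    ranked = sorted(freq.values(), reverse=True)
    return sum(ranked[k:])


def l1_error(freq, approx):
    """L1 distance between the true frequencies and an approximation."""
    keys = set(freq) | set(approx)
    return sum(abs(freq.get(x, 0) - approx.get(x, 0)) for x in keys)


stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
f = Counter(stream)                 # exact frequency vector
f_star = {"a": 50, "b": 30}         # hypothetical 2-sparse summary

print(best_k_sparse_error(f, k=2))  # total frequency of the tail
print(l1_error(f, f_star))          # error of the candidate summary
```

Comparing the two numbers for a summary produced by a streaming algorithm shows how close that summary comes to the best k-sparse representation.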


Published in

PODS '09: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2009
298 pages
ISBN: 9781605585536
DOI: 10.1145/1559795
General Chair: Jan Paredaens
Program Chair: Jianwen Su

Copyright © 2009 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 476 of 1,835 submissions, 26%
