ABSTRACT
The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of "counter-based algorithms" (including the popular and very space-efficient FREQUENT and SPACESAVING algorithms) provide much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining "tail." This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound.
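To make the notion of a counter-based algorithm concrete, here is a minimal Python sketch in the style of FREQUENT (the Misra-Gries scheme). It is an illustration under stated assumptions, not the authors' implementation: the names `frequent` and `tail_mass`, the parameter `m` (number of counters), and the sample stream are all hypothetical. It prints the largest per-item error next to the stream length and the mass of the tail, which are the two quantities the abstract contrasts.

```python
import random
from collections import Counter


def frequent(stream, m):
    """Summarize `stream` with at most `m` counters (FREQUENT / Misra-Gries style)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < m:
            counters[item] = 1
        else:
            # Table is full and the item is untracked: decrement every
            # counter and drop those that reach zero. The new item is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def tail_mass(freqs, k):
    """Total frequency of everything except the k most frequent items."""
    return sum(sorted(freqs.values(), reverse=True)[k:])


# Hypothetical skewed stream: two heavy items plus 100 singletons.
stream = ["a"] * 500 + ["b"] * 300 + [f"t{i}" for i in range(100)]
random.Random(0).shuffle(stream)

true = Counter(stream)
est = frequent(stream, m=10)
max_err = max(true[x] - est.get(x, 0) for x in true)
print("max error:", max_err)
print("stream length:", len(stream), "tail mass (k=2):", tail_mass(true, 2))
```

In this example the stream has 900 items, but only 100 of them lie outside the two heaviest items, so an error bound stated in terms of the tail is much tighter than one stated in terms of the full stream length.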
This tail guarantee allows these algorithms to solve the "sparse recovery" problem. Here, the goal is to recover a faithful representation of the vector of frequencies, $f$. We prove that using space $O(k)$, the algorithms construct an approximation $f^*$ to the frequency vector $f$ so that the L1 error $\|f - f^*\|_1$ is close to the best possible error $\min_{f'} \|f' - f\|_1$, where $f'$ ranges over all vectors with at most $k$ non-zero entries. This improves the previously best known space bound of about $O(k \log n)$ for streams without element deletions (where $n$ is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data, and guarantees for accuracy of merging multiple summarized streams.
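The two quantities compared in the sparse recovery statement can be computed directly on a small example. The sketch below is illustrative only: the helper names `l1_error` and `best_k_sparse_error`, the frequency vector, and the approximation `f_star` are hypothetical and do not come from the paper. It relies on the standard fact that the best approximation with at most k non-zero entries keeps the k largest entries exactly, so its L1 error equals the sum of the remaining entries.

```python
def l1_error(f, f_star):
    """L1 distance between two sparse frequency vectors stored as dicts."""
    keys = set(f) | set(f_star)
    return sum(abs(f.get(x, 0) - f_star.get(x, 0)) for x in keys)


def best_k_sparse_error(f, k):
    """Smallest L1 error achievable by any vector with at most k non-zero entries.

    Keeping the k largest entries of f exactly is optimal, so the error is
    the sum of all the other entries.
    """
    return sum(sorted(f.values(), reverse=True)[k:])


# Hypothetical true frequencies and a hypothetical 2-sparse approximation,
# such as a small counter summary might return.
f = {"a": 5, "b": 3, "c": 1, "d": 1}
f_star = {"a": 4, "b": 2}

print(l1_error(f, f_star))        # 4: errors of 1 on each of a, b, c, d
print(best_k_sparse_error(f, 2))  # 2: the entries for c and d are dropped
```

Here the approximation has error 4 against a best possible error of 2 for k = 2; the abstract's claim concerns how close a counter-based summary can bring the first number to the second.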