ABSTRACT
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1,...,n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(1/ε-2 + log(n)) bits of space with 2/3 success probability, where 0<ε<1 is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in O(1) worst-case time, and can report an estimate at any point midstream in O(1) worst-case time, thus settling both the space and time complexities simultaneously.
We also give an algorithm to estimate the Hamming norm of a stream, a generalization of the number of distinct elements, which is useful in data cleaning, packet tracing, and database auditing. Our algorithm uses nearly optimal space, and has optimal O(1) update and reporting times.
- S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. In SIGMOD, pages 574--576, 1999. Google Scholar
Digital Library
- A. Akella, A. Bharambe, M. Reiter, and S. Seshan. Detecting DDoS attacks on ISP networks. In Proc. MPDS, 2003.Google Scholar
- N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci., 58(1):137--147, 1999. Google Scholar
Digital Library
- Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM, pages 1--10, 2002. Google Scholar
Digital Library
- Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. SODA, pages 623--632, 2002. Google Scholar
Digital Library
- K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD, pages 199--210, 2007. Google Scholar
Digital Library
- D. K. Blandford and G. E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Alg., 4(2), 2008. Google Scholar
Digital Library
- A. Brodnik. Computation of the least significant set bit. In Proc. ERK, 1993.Google Scholar
- J. Brody and A. Chakrabarti. A multi-round communication lower bound for gap hamming and some consequences. In Proc. CCC, pages 358--368, 2009. Google Scholar
Digital Library
- P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, pages 161--180, 2005. Google Scholar
Digital Library
- L. Carter and M. N. Wegman. Universal classes of hash functions. J. Comput. Syst. Sci., 18(2):143--154, 1979.Google Scholar
- E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci., 55(3):441--453, 1997. Google Scholar
Digital Library
- G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing data streams using hamming norms (how to zero in). IEEE Trans. Knowl. Data Eng., 15(3):529--540, 2003. Google Scholar
Digital Library
- T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In SIGMOD, pages 240--251, 2002. Google Scholar
Digital Library
- D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Struct. Algorithms, 13(2):99--124, 1998. Google Scholar
Digital Library
- M. Durand and P. Flajolet. Loglog counting of large cardinalities (extended abstract). In Proc. ESA, pages 605--617, 2003.Google Scholar
- C. Estan, G. Varghese, and M. E. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw., 14(5):925--937, 2006. Google Scholar
Digital Library
- S. J. Finkelstein, M. Schkolnick, and P. Tiberio. Physical database design for relational databases. ACM Trans. Database Syst., 13(1):91--128, 1988. Google Scholar
Digital Library
- P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. Disc. Math. and Theor. Comp. Sci., AH:127--146, 2007.Google Scholar
- P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182--209, 1985. Google Scholar
Digital Library
- M. L. Fredman and D. E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci., 47(3):424--436, 1993. Google Scholar
Digital Library
- S. Ganguly. Counting distinct items over update streams. Theor. Comput. Sci., 378(3):211--222, 2007. Google Scholar
Digital Library
- P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, pages 541--550, 2001. Google Scholar
Digital Library
- P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. SPAA, pages 281--291, 2001. Google Scholar
Digital Library
- P. Indyk. Algorithms for dynamic geometric problems over data streams. In Proc. STOC, pages 373--380, 2004. Google Scholar
Digital Library
- P. Indyk and D. P. Woodruff. Tight lower bounds for the distinct elements problem. In Proc. FOCS, pages 283---, 2003. Google Scholar
Digital Library
- D. M. Kane, J. Nelson, and D. P. Woodruff. On the exact space complexity of sketching and streaming small norms. In Proc. SODA, pages 1161--1178, 2010. Google Scholar
Digital Library
- R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google Scholar
Digital Library
- M. H. Overmars. The Design of Dynamic Data Structures. Springer, 1983. Google Scholar
Digital Library
- S. Padmanabhan, B. Bhattacharjee, T. Malkemus, L. Cranston, and M. Huras. Multi-dimensional clustering: A new data layout scheme in db2. In SIGMOD, pages 637--641, 2003. Google Scholar
Digital Library
- A. Pagh and R. Pagh. Uniform hashing in constant time and optimal space. SIAM J. Comput., 38(1):85--96, 2008. Google Scholar
Digital Library
- C. R. Palmer, G. Siganos, M. Faloutsos, and C. Faloutsos. The connectivity and fault-tolerance of the internet topology. In NRDM Workshop, 2001.Google Scholar
- P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pages 23--34, 1979. Google Scholar
Digital Library
- A. Shukla, P. Deshpande, J. F. Naughton, and K. Ramasamy. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. VLDB, pages 522--531, 1996. Google Scholar
Digital Library
- A. Siegel. On universal classes of extremely random constant-time hash functions. SIAM J. Computing, 33(3):505--543, 2004. Google Scholar
Digital Library
- D. P. Woodruff. Optimal space lower bounds for all frequency moments. In Proc. SODA, pages 167--175, 2004. Google Scholar
Digital Library
Index Terms
An optimal algorithm for the distinct elements problem
Recommendations
Optimal Streaming and Tracking Distinct Elements with High Probability
Special Issue on Soda'18 and Regular PapersThe distinct elements problem is one of the fundamental problems in streaming algorithms—given a stream of integers in the range { 1,… ,n}, we wish to provide a (1+ε) approximation to the number of distinct elements in the input. After a long line of ...
Range-Efficient Counting of Distinct Elements in a Massive Data Stream
Efficient one-pass estimation of $F_0$, the number of distinct elements in a data stream, is a fundamental problem arising in various contexts in databases and networking. We consider range-efficient estimation of $F_0$: estimation of the number of ...
Optimal streaming and tracking distinct elements with high probability
SODA '18: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete AlgorithmsThe distinct elements problem is one of the fundamental problems in streaming algorithms --- given a stream of integers in the range {1, ... n}, we wish to provide a (1 + ε) approximation of the number of distinct elements in the input. After a long ...






Comments