DOI: 10.1145/2783258.2783279
research-article

Stream Sampling for Frequency Cap Statistics

Published: 10 August 2015

ABSTRACT

Unaggregated data, in a streamed or distributed form, is prevalent and comes from diverse sources such as interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries) and elements with different keys interleave.

Analytics on such data typically uses statistics of one form: the sum, over the keys in a specified segment, of a function f applied to each key's frequency (its total number of occurrences). In particular, Distinct is the number of active keys in the segment, Sum is the sum of their frequencies, and both are special cases of frequency cap statistics, which cap the frequency by a parameter T. One important application of cap statistics is staging advertisement campaigns, where the cap parameter is the maximum number of impressions per user and we estimate the total number of qualifying impressions.
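As an illustration of the definition above, the following minimal sketch computes a frequency cap statistic exactly by aggregating first. The function name `cap_statistic` and the toy stream are assumptions made for this example, not part of the paper.

```python
from collections import Counter
import math

def cap_statistic(elements, T, segment=None):
    """Exact sum, over keys in `segment`, of min(T, frequency of the key).

    `elements` is an iterable of keys (one entry per data element).
    `segment` is an optional set of keys; None selects every key.
    """
    freq = Counter(elements)  # full aggregation: one counter per active key
    return sum(min(T, w) for key, w in freq.items()
               if segment is None or key in segment)

# Toy stream in which elements with different keys interleave.
stream = ["a", "b", "a", "c", "a", "b"]

distinct = cap_statistic(stream, 1)          # T = 1 counts active keys
total = cap_statistic(stream, math.inf)      # T = infinity sums frequencies
capped = cap_statistic(stream, 2)            # at most 2 impressions per key
in_segment = cap_statistic(stream, 2, segment={"a", "c"})
```

Setting T = 1 recovers Distinct and T = ∞ recovers Sum. Note that this exact version keeps one counter per active key, which is the cost that sampling is meant to avoid.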

The number of distinct active keys in the data can be very large, making exact computation of queries costly. Instead, we can estimate these statistics from a sample. An optimal sample for a given function f would include a key with frequency w with probability roughly proportional to f(w). But while such a "gold-standard" sample can be easily computed over the aggregated data (the set of key-frequency pairs), exact aggregation itself is costly and slow. Ideally, we would like to compute and maintain a sample without aggregation.
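One common way to realize such a sample over aggregated data is Poisson sampling with inclusion probability min(1, f(w)/τ), combined with inverse-probability estimates. The sketch below assumes that reading. The threshold `tau`, the helper names, and the synthetic frequencies are illustrative choices and are not taken from the paper.

```python
import random

def cap(T):
    """Return the cap function f(w) = min(T, w)."""
    return lambda w: min(T, w)

def pps_sample(freqs, f, tau, rng):
    """Poisson sample of aggregated (key, frequency) pairs.

    Each key is included independently with probability
    p = min(1, f(w) / tau), i.e. roughly proportional to f(w).
    """
    sample = {}
    for key, w in freqs.items():
        p = min(1.0, f(w) / tau)
        if rng.random() < p:
            sample[key] = (w, p)  # keep p for inverse-probability weighting
    return sample

def estimate(sample, g, segment=None):
    """Unbiased estimate of the sum of g(w) over keys in `segment`."""
    return sum(g(w) / p for key, (w, p) in sample.items()
               if segment is None or key in segment)

# Synthetic aggregated data: 200 keys with frequencies 1..10.
freqs = {k: 1 + k % 10 for k in range(200)}
f = cap(5)
sample = pps_sample(freqs, f, tau=4.0, rng=random.Random(7))
approx = estimate(sample, f)                      # estimate of the cap-5 statistic
exact = sum(f(w) for w in freqs.values())         # 800 for this data
```

A larger `tau` gives a smaller sample and a noisier estimate. The catch described above still applies: `freqs` is the aggregated view, which is the expensive part to obtain.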

We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size. Our design unifies classic solutions for Distinct and Sum. Specifically, our l-capped samples provide nonnegative unbiased estimates of any monotone non-decreasing frequency statistic, and close to gold-standard estimates for frequency cap statistics with T=Θ(l). Furthermore, our design facilitates multi-objective samples, which provide tight estimates for a specified set of statistics using a single smaller sample.
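For context, here is a sketch of two well-known single-pass schemes that are often cited as the classic solutions for these two statistics: hash-based distinct sampling for Distinct, and sample-and-hold for Sum. This is not the paper's l-capped scheme. The fixed sampling rate `p`, the hash construction, and all function names are assumptions made for illustration.

```python
import hashlib
import random

def key_hash(key):
    """Map a key to a reproducible pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def distinct_sample(stream, p):
    """Distinct sampling: keep a key iff its hash is below p.

    The decision depends only on the key, so repeated elements
    do not raise a key's inclusion probability.
    """
    return {key for key in stream if key_hash(key) < p}

def sample_and_hold(stream, p, rng):
    """Sample-and-hold: once a key is cached, count all its later elements."""
    counts = {}
    for key in stream:
        if key in counts:
            counts[key] += 1          # key is held: count every occurrence
        elif rng.random() < p:
            counts[key] = 1           # element sampled: start holding the key
    return counts

def estimate_distinct(sample, p):
    """Inverse-probability estimate of the number of active keys."""
    return len(sample) / p

def estimate_sum(counts, p):
    """Unbiased estimate of the total frequency: c - 1 + 1/p per held key."""
    return sum(c - 1 + 1 / p for c in counts.values())

# Toy stream: 100 keys with frequencies 1..5, interleaved at random.
keys = [f"k{i}" for i in range(100)]
stream = [k for i, k in enumerate(keys) for _ in range(1 + i % 5)]
random.Random(0).shuffle(stream)

held = sample_and_hold(stream, p=0.5, rng=random.Random(1))
sum_estimate = estimate_sum(held, 0.5)
distinct_estimate = estimate_distinct(distinct_sample(stream, 0.5), 0.5)
```

Both schemes read the stream once and store only the sampled keys. Distinct sampling ignores frequency entirely, while sample-and-hold favors frequent keys, which is why neither alone suits a cap statistic with an intermediate T.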


      • Published in

        KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
        August 2015
        2378 pages
        ISBN:9781450336642
        DOI:10.1145/2783258

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        KDD '15 paper acceptance rate: 160 of 819 submissions (20%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
