ABSTRACT
Unaggregated data, in a streamed or distributed form, is prevalent and comes from diverse sources such as interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries) and elements with different keys interleave.
Analytics on such data typically utilizes statistics expressed as a sum, over keys in a specified segment, of a function f applied to the frequency (the total number of occurrences) of the key. In particular, Distinct is the number of active keys in the segment, Sum is the sum of their frequencies, and both are special cases of frequency cap statistics, which cap the frequency by a parameter T. One important application of cap statistics is staging advertisement campaigns, where the cap parameter is the maximum number of impressions per user and we estimate the total number of qualifying impressions.
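As a small illustration of these definitions, the sketch below computes a frequency cap statistic exactly by aggregating a toy stream first. It assumes each data element contributes one occurrence of its key; the function name `cap_statistic`, the `segment` predicate, and the toy stream are hypothetical and only serve the example. With T = 1 the statistic reduces to Distinct, and with an unbounded T it reduces to Sum.

```python
from collections import Counter

def cap_statistic(elements, cap, segment=None):
    """Exact cap statistic: sum over keys in the segment of min(frequency, cap).

    elements: iterable of keys, one entry per data element (assumption).
    segment:  optional predicate selecting the keys of interest (hypothetical).
    """
    # Aggregate first: this is the costly step that sampling tries to avoid.
    freq = Counter(k for k in elements if segment is None or segment(k))
    return sum(min(w, cap) for w in freq.values())

# Toy unaggregated stream: elements with different keys interleave.
stream = ["u1", "u2", "u1", "u3", "u1", "u2"]

distinct = cap_statistic(stream, cap=1)             # number of active keys
total = cap_statistic(stream, cap=float("inf"))     # sum of frequencies
capped = cap_statistic(stream, cap=2)               # at most 2 impressions per user
no_u3 = cap_statistic(stream, cap=2, segment=lambda k: k != "u3")
```

Here the frequencies are u1: 3, u2: 2, u3: 1, so capping at T = 2 counts 2 + 2 + 1 qualifying impressions.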
The number of distinct active keys in the data can be very large, making exact computation of queries costly. Instead, we can estimate these statistics from a sample. An optimal sample for a given function f would include a key with frequency w with probability roughly proportional to f(w). But while such a "gold-standard" sample can be easily computed over the aggregated data (the set of key-frequency pairs), exact aggregation itself is costly and slow. Ideally, we would like to compute and maintain a sample without aggregation.
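One way to picture such a sample over aggregated data is the sketch below. It is not the scheme of the paper: it assumes independent (Poisson) sampling with a threshold `tau`, where a key with frequency w is kept with probability min(1, f(w)/tau), and it uses the standard inverse-probability estimator. The names `gold_standard_sample`, `estimate`, and `tau` are hypothetical.

```python
import random

def gold_standard_sample(freqs, f, tau, rng):
    """Sample keys from aggregated (key -> frequency) data.

    Each key is kept independently with probability p = min(1, f(w) / tau),
    so inclusion is roughly proportional to f(w). The stored value f(w) / p
    is the inverse-probability (adjusted) weight of the key.
    """
    sample = {}
    for key, w in freqs.items():
        fw = f(w)
        if fw <= 0:
            continue  # keys that contribute nothing are never sampled
        p = min(1.0, fw / tau)
        if rng.random() < p:
            sample[key] = fw / p
    return sample

def estimate(sample, segment=None):
    """Unbiased estimate of the sum of f(w) over keys in the segment."""
    return sum(v for k, v in sample.items() if segment is None or segment(k))

freqs = {"u1": 3, "u2": 2, "u3": 1}        # aggregated key-frequency pairs
cap2 = lambda w: min(w, 2)                 # frequency cap statistic with T = 2

# With tau no larger than the smallest f(w), every key is kept and the
# estimate is exact; a larger tau gives a smaller sample and a noisier estimate.
full = gold_standard_sample(freqs, cap2, tau=1.0, rng=random.Random(0))
exact = estimate(full)
small = gold_standard_sample(freqs, cap2, tau=4.0, rng=random.Random(0))
```

In this sketch the expected sample size is the sum of min(1, f(w)/tau) over keys, so `tau` trades sample size against variance. The difficulty the abstract points to remains: the sketch needs the aggregated `freqs` as input.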
We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size. Our design unifies classic solutions for Distinct and Sum. Specifically, our l-capped samples provide nonnegative unbiased estimates of any monotone non-decreasing frequency statistic, and close to gold-standard estimates for frequency cap statistics with T=Θ(l). Furthermore, our design facilitates multi-objective samples, which provide tight estimates for a specified set of statistics using a single smaller sample.
Stream Sampling for Frequency Cap Statistics