ABSTRACT
Random sampling is an essential tool in the processing and transmission of data. It is used to summarize data too large to store or manipulate and to meet resource constraints on bandwidth or battery power. Estimators applied to the sample enable fast approximate processing of queries posed over the original data, and the value of the sample hinges on the quality of these estimators.
Our work targets data sets such as request and traffic logs and sensor measurements, where data is repeatedly collected over multiple instances: time periods, locations, or snapshots. We are interested in operations, such as quantiles and range, that span multiple instances. Subset-sums of these operations are used in applications ranging from planning to anomaly and change detection.
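To make the multi-instance setting concrete, the following is a minimal sketch of an exact (unsampled) subset-sum of a multi-instance operation. The data layout (each key mapped to one value per instance) and the names `data`, `value_range`, and `subset_sum` are assumptions made for illustration; they do not come from the paper.

```python
# Hypothetical toy data: key -> value observed in each of three instances
# (for example, three time periods).
data = {
    "a": (5.0, 3.0, 4.0),
    "b": (0.0, 2.0, 2.0),
    "c": (7.0, 7.0, 1.0),
}


def value_range(values):
    """Range of one key across instances: maximum minus minimum."""
    return max(values) - min(values)


def subset_sum(data, keys, op):
    """Sum a per-key multi-instance operation over a selected subset of keys."""
    return sum(op(data[k]) for k in keys)


# Sum of ranges over keys "a" and "c": (5 - 3) + (7 - 1)
total_range = subset_sum(data, ["a", "c"], value_range)

# Sum of maxima over keys "a" and "b": 5 + 2
total_max = subset_sum(data, ["a", "b"], max)
```

The estimation problem arises when only a sample of each instance is available, so that some of the per-key values needed by `op` are unknown.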
Unbiased low-variance estimators are particularly effective as the relative error decreases with aggregation. The Horvitz-Thompson estimator, known to minimize variance for subset-sums over a sample of a single instance, is not optimal for multi-instance operations because it fails to exploit samples which provide partial information on the estimated quantity.
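As a reference point for the single-instance case, here is a minimal sketch of the Horvitz-Thompson estimator for a subset-sum, assuming independent (Poisson) sampling with known inclusion probabilities. The values, probabilities, and function names are hypothetical. The sketch checks unbiasedness exactly by enumerating every possible sample.

```python
from itertools import product


def ht_subset_sum(sample, probs, keys):
    """Horvitz-Thompson estimate: each sampled key in the subset
    contributes its value divided by its inclusion probability."""
    return sum(v / probs[k] for k, v in sample.items() if k in keys)


def expected_estimate(values, probs, keys):
    """Exact expectation of the estimate, enumerating every outcome
    of independent (Poisson) sampling of the keys."""
    names = list(values)
    total = 0.0
    for included in product([False, True], repeat=len(names)):
        weight = 1.0  # probability of this particular sample
        sample = {}
        for k, inc in zip(names, included):
            weight *= probs[k] if inc else 1.0 - probs[k]
            if inc:
                sample[k] = values[k]
        total += weight * ht_subset_sum(sample, probs, keys)
    return total


values = {"a": 5.0, "b": 2.0, "c": 7.0}  # hypothetical single-instance values
probs = {"a": 0.5, "b": 0.25, "c": 0.8}  # hypothetical inclusion probabilities
subset = {"a", "c"}

# Matches the true subset-sum 5 + 7 up to floating-point error.
mean = expected_estimate(values, probs, subset)
```

Each key contributes a positive amount only when it is sampled, which is why the estimator cannot make use of outcomes that reveal only part of a multi-instance quantity.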
We present a general principled methodology for the derivation of optimal unbiased estimators over sampled instances and aim to understand its potential. We demonstrate significant improvement in estimate accuracy of fundamental queries for common sampling schemes.
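The effect of partial information can be illustrated with a small sketch for the maximum of one key over two instances. It assumes each entry is sampled independently with the same probability `p`, regardless of its value, and that values are nonnegative. The estimator `max_partial` below is one unbiased estimator worked out for this toy setting; it is not claimed to be the paper's optimal estimator, and all names are hypothetical.

```python
from itertools import product


def max_ht(sample, p):
    """Horvitz-Thompson style estimate: positive only when both entries
    are sampled, so that the maximum is known exactly."""
    if len(sample) == 2:
        return max(sample.values()) / (p * p)
    return 0.0


def max_partial(sample, p):
    """Illustrative unbiased estimate that also uses outcomes in which
    only one of the two entries is sampled."""
    if len(sample) == 1:
        (v,) = sample.values()
        return v / (p * (2.0 - p))
    if len(sample) == 2:
        hi, lo = max(sample.values()), min(sample.values())
        return (hi - (1.0 - p) * lo) / (p * p * (2.0 - p))
    return 0.0


def mean_and_variance(values, p, estimator):
    """Exact mean and variance over the four possible sampling outcomes."""
    mean = second = 0.0
    for outcome in product([False, True], repeat=2):
        weight = 1.0  # probability of this outcome
        sample = {}
        for i, included in enumerate(outcome):
            weight *= p if included else 1.0 - p
            if included:
                sample[i] = values[i]
        est = estimator(sample, p)
        mean += weight * est
        second += weight * est * est
    return mean, second - mean * mean


values, p = (4.0, 1.0), 0.5  # hypothetical values in the two instances
ht_mean, ht_var = mean_and_variance(values, p, max_ht)
pi_mean, pi_var = mean_and_variance(values, p, max_partial)
```

Both estimators have expectation equal to the true maximum for this input, while the one that uses single-entry outcomes has the smaller variance here.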
Get the most out of your sample: optimal unbiased estimators using partial information