Abstract
From a high-volume stream of weighted items, we want to create a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. Applied to Internet traffic analysis, the items could be records summarizing the flows of packets streaming by a router. Subsets could be flow records from different time intervals of a worm attack whose signature is later determined. The samples taken in the past thus allow us to trace the history of the attack even though the worm was unknown at the time of sampling.
Estimation from the samples must be accurate even with heavy-tailed distributions where most of the weight is concentrated on a few heavy items. We want the sample to be weight sensitive, giving priority to heavy items. At the same time, we want sampling without replacement in order to avoid selecting heavy items multiple times. To fulfill these requirements we introduce priority sampling, which is the first weight-sensitive sampling scheme without replacement that works in a streaming context and is suitable for estimating subset sums. Testing priority sampling on Internet traffic analysis, we found it to perform an order of magnitude better than previous schemes.
Priority sampling is simple to define and implement: we consider a stream of items i = 0,…,n − 1 with weights w_i. For each item i, we generate an independent uniformly random α_i ∈ (0,1] and create a priority q_i = w_i/α_i. The sample S consists of the k highest-priority items. Let τ be the (k + 1)th highest priority. Each sampled item i ∈ S gets a weight estimate ŵ_i = max{w_i, τ}, while nonsampled items get weight estimate ŵ_i = 0.
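The scheme above can be sketched in a few lines of Python. This is our own illustration, not the paper's code: the function name `priority_sample` and the choice of a size-(k + 1) min-heap (so that τ is available when the stream ends) are implementation choices.

```python
import heapq
import random

def priority_sample(stream, k):
    """Priority sampling sketch: keep the k items of highest priority
    q_i = w_i / alpha_i, with alpha_i uniform in (0, 1].
    Returns (sample, tau): sample is a list of (index, weight, estimate)
    and tau is the (k+1)-th highest priority (0.0 if at most k items)."""
    heap = []  # min-heap of (priority, index, weight), capped at k + 1 entries
    for i, w in enumerate(stream):
        alpha = 1.0 - random.random()     # uniform in (0, 1]
        heapq.heappush(heap, (w / alpha, i, w))
        if len(heap) > k + 1:
            heapq.heappop(heap)           # discard the lowest priority seen
    tau = heapq.heappop(heap)[0] if len(heap) > k else 0.0
    # Each sampled item gets the weight estimate max(w_i, tau).
    return [(i, w, max(w, tau)) for (_, i, w) in heap], tau
```

Note that only k + 1 tuples are ever stored, so the sketch runs in O(log k) time per stream item regardless of stream length.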
Magically, it turns out that the weight estimates are unbiased, that is, E[ŵ_i] = w_i, and by linearity of expectation, we get unbiased estimators over any subset sum simply by adding the sampled weight estimates from the subset. Also, we can estimate the variance of the estimates, and find, surprisingly, that the covariance between estimates ŵ_i and ŵ_j of distinct items i ≠ j is zero.
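As a quick sanity check of unbiasedness (again our own illustration, with an inline single-batch variant of the sampler and made-up weights), averaging the per-item estimates over many independent trials should reproduce each weight w_i, and hence, by adding estimates, any subset sum:

```python
import random

def one_trial(weights, k):
    """One priority-sampling trial: return the weight estimate for every
    item (0.0 for items outside the size-k sample)."""
    pri = sorted(((w / (1.0 - random.random()), j) for j, w in enumerate(weights)),
                 reverse=True)               # priorities q_j, highest first
    tau = pri[k][0]                          # (k+1)-th highest priority
    est = [0.0] * len(weights)
    for _, j in pri[:k]:
        est[j] = max(weights[j], tau)        # sampled: estimate max(w_j, tau)
    return est

random.seed(42)
weights = [50.0, 8.0, 3.0, 1.0, 1.0, 1.0]    # heavy-tailed toy example
trials = 20000
avg = [0.0] * len(weights)
for _ in range(trials):
    for j, e in enumerate(one_trial(weights, 3)):
        avg[j] += e / trials
# avg[j] should be close to weights[j] for every j, so the average estimate
# of any subset sum, e.g. weights[1] + weights[2], is avg[1] + avg[2].
```

Note how the heavy item of weight 50 is sampled in essentially every trial, while the unit-weight items are sampled rarely but with proportionally inflated estimates, which is exactly what keeps the estimates unbiased.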
Finally, we conjecture an extremely strong near-optimality: for any weight sequence, no specialized scheme for sampling k items with unbiased weight estimators achieves a smaller variance sum than priority sampling with k + 1 items. Szegedy settled this conjecture affirmatively at STOC'06.
References
- Adler, R., Feldman, R., and Taqqu, M. 1998. A Practical Guide to Heavy Tails. Birkhäuser.
- Alon, N., Duffield, N., Lund, C., and Thorup, M. 2005. Estimating arbitrary subset sums with few probes. In Proceedings of the 24th ACM Symposium on Principles of Database Systems (PODS). ACM, New York, 317--325.
- Arnold, B., and Balakrishnan, N. 1988. Relations, Bounds and Approximations for Order Statistics. Lecture Notes in Statistics, vol. 53. Springer, New York.
- Brewer, K., and Hanif, M. 1983. Sampling with Unequal Probabilities. Lecture Notes in Statistics, vol. 15. Springer-Verlag, New York.
- Chaudhuri, S., Motwani, R., and Narasayya, V. 1999. On random sampling over joins. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 263--274.
- Cohen, E. 1997. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci. 55, 3, 441--453.
- Cohen, E., Duffield, N., Kaplan, H., Lund, C., and Thorup, M. 2007. Sketching unaggregated data streams for subpopulation-size queries. In Proceedings of the 26th ACM Symposium on Principles of Database Systems (PODS). ACM, New York, 253--262.
- Cohen, E., and Kaplan, H. 2007. Bottom-k sketches: Better and more efficient estimation of aggregates (poster). In Proceedings of the ACM IFIP Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance). ACM, New York, 353--354.
- Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. 2001. Introduction to Algorithms, 2nd ed. MIT Press, McGraw-Hill.
- David, H. 1981. Order Statistics, 2nd ed. Wiley, New York.
- Duffield, N., Lund, C., and Thorup, M. 2004. Flow sampling under hard resource constraints. In Proceedings of the ACM IFIP Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance). ACM, New York, 85--96.
- Duffield, N., Lund, C., and Thorup, M. 2005a. Learn more, sample less: Control of volume and variance in network measurements. IEEE Trans. Inf. Theory 51, 5, 1756--1775.
- Duffield, N., Lund, C., and Thorup, M. 2005b. Optimal combination of sampled network measurements. In Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC). ACM, New York, 91--104.
- Duffield, N., Lund, C., and Thorup, M. 2005c. Sampling to estimate arbitrary subset sums. Tech. Rep. cs.DS/0509026, Computing Research Repository (CoRR). http://arxiv.org/abs/cs.DS/0509026.
- Fan, C., Muller, M., and Rezucha, I. 1962. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. J. Amer. Stat. Assoc. 57, 387--402.
- Hellerstein, J., Haas, P., and Wang, H. 1997. Online aggregation. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 171--182.
- Johnson, T., Muthukrishnan, S., and Rozenbaum, I. 2005. Sampling algorithms in a stream operator. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 1--12.
- Knuth, D. 1969. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA.
- Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., and Weaver, N. 2003. Inside the Slammer worm. IEEE Sec. Priv. Mag. 1, 4, 33--39.
- Muthukrishnan, S. 2005. Data streams: Algorithms and applications. Foundat. Trends Theoret. Comput. Sci. 1, 2.
- Park, K., Kim, G., and Crovella, M. 1996. On the relationship between file sizes, transport protocols, and self-similar network traffic. In Proceedings of the 4th IEEE International Conference on Network Protocols (ICNP). IEEE Computer Society Press, Los Alamitos, CA.
- Särndal, C.-E., Swensson, B., and Wretman, J. 1992. Model Assisted Survey Sampling. Springer-Verlag, New York.
- Sunter, A. B. 1977. List sequential sampling with equal or unequal probabilities without replacement. Applied Statistics 26, 261--268.
- Szegedy, M. 2006. The DLT priority sampling is essentially optimal. In Proceedings of the 38th ACM Symposium on the Theory of Computing (STOC). ACM, New York, 150--158.
- Szegedy, M., and Thorup, M. 2007. On the variance of subset sum estimation. In Proceedings of the 15th European Symposium on Algorithms (ESA). Lecture Notes in Computer Science, vol. 4698. Springer-Verlag, New York, 75--86.
- Thorup, M. 2006. Confidence intervals for priority sampling. In Proceedings of the ACM IFIP Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance). ACM, New York, 252--253.
- Thorup, M. 2007. Equivalence between priority queues and sorting. J. ACM 54, 6, Article 28.
- Vitter, J. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1, 37--57.