ABSTRACT
IP packet streams consist of multiple interleaving IP flows. Statistical summaries of these streams, collected for different measurement periods, are used for characterization of traffic, billing, anomaly detection, inferring traffic demands, configuring packet filters and routing protocols, and more. While queries are posed over the set of flows, the summarization algorithm is applied to the stream of packets. Aggregating traffic into flows before summarization requires storing per-flow counters, which is often infeasible. Therefore, the summary has to be produced over the unaggregated stream.
An important aggregate computed over a summary is an approximation of the size of a subpopulation of flows that is specified a posteriori, for example, flows belonging to an application such as Web or DNS, or flows that originate from a certain Autonomous System. We design efficient streaming algorithms that summarize unaggregated streams and provide corresponding unbiased estimators for subpopulation sizes. In terms of estimation accuracy, our summaries outperform those produced by packet sampling as deployed in Cisco's sampled NetFlow, the most widely deployed such system. The performance of our best method, step sample-and-hold, is close to that of summaries that can be obtained from pre-aggregated traffic.
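To make the setting concrete, the following is a minimal sketch of two simple summaries over an unaggregated packet stream, each paired with an unbiased estimator for a subpopulation size. It illustrates plain packet sampling and the basic sample-and-hold scheme, not the step sample-and-hold method proposed here, whose details the abstract does not give. The function names, the representation of a packet by its flow key `(application, flow_id)`, and the sampling rate `p` are assumptions made for illustration only.

```python
import random
from collections import Counter

def packet_sampling(packets, p, rng):
    """Sample each packet independently with probability p.

    Returns adjusted weights: each sampled packet stands for 1/p packets,
    so the adjusted weight of a flow is an unbiased estimate of its size.
    """
    counts = Counter()
    for flow in packets:
        if rng.random() < p:
            counts[flow] += 1
    return {flow: c / p for flow, c in counts.items()}

def sample_and_hold(packets, p, rng):
    """Basic sample-and-hold (illustrative, not the step variant).

    A packet of a flow that is not yet cached is sampled with probability p;
    once a flow is cached, every later packet of that flow is counted.
    """
    counts = {}
    for flow in packets:
        if flow in counts:
            counts[flow] += 1
        elif rng.random() < p:
            counts[flow] = 1
    # The packets missed before the first sampled one number (1 - p) / p
    # in expectation, so count + (1 - p) / p is an unbiased size estimate.
    return {flow: c + (1 - p) / p for flow, c in counts.items()}

def subpopulation_estimate(adjusted, predicate):
    """Sum adjusted weights of summarized flows selected a posteriori."""
    return sum(w for flow, w in adjusted.items() if predicate(flow))

# Hypothetical stream: one Web flow and two DNS flows, packets interleaved.
rng = random.Random(7)
stream = [("web", 1)] * 50 + [("dns", 2)] * 5 + [("dns", 3)] * 20
rng.shuffle(stream)

is_dns = lambda flow: flow[0] == "dns"
print("packet sampling:", subpopulation_estimate(packet_sampling(stream, 0.2, rng), is_dns))
print("sample-and-hold:", subpopulation_estimate(sample_and_hold(stream, 0.2, rng), is_dns))
```

Both estimators are unbiased for the true DNS subpopulation size (25 packets in this toy stream); individual runs differ, and sample-and-hold typically has lower variance because it counts every packet of a flow once the flow is cached.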
Index Terms
Sketching unaggregated data streams for subpopulation-size queries