DOI: 10.1145/1265530.1265566
Article

Sketching unaggregated data streams for subpopulation-size queries

Published: 11 June 2007

ABSTRACT

IP packet streams consist of multiple interleaving IP flows. Statistical summaries of these streams, collected for different measurement periods, are used for characterization of traffic, billing, anomaly detection, inferring traffic demands, configuring packet filters and routing protocols, and more. While queries are posed over the set of flows, the summarization algorithm is applied to the stream of packets. Aggregation of traffic into flows before summarization requires storage of per-flow counters, which is often infeasible. Therefore, the summary has to be produced over the unaggregated stream.

An important aggregate computed over a summary is an approximation of the size of a subpopulation of flows that is specified a posteriori, for example, flows belonging to an application such as Web or DNS, or flows that originate from a certain Autonomous System. We design efficient streaming algorithms that summarize unaggregated streams and provide corresponding unbiased estimators for subpopulation sizes. In terms of estimation accuracy, our summaries outperform those produced by packet sampling as deployed in Cisco's sampled NetFlow, the most widely deployed such system. The performance of our best method, step sample-and-hold, is close to that of summaries that can be obtained from pre-aggregated traffic.
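To make the setting concrete, the following is a minimal Python sketch of two baseline schemes the abstract refers to: independent packet sampling and the basic (single-rate) sample-and-hold of Estan and Varghese. It is not the paper's step sample-and-hold method, which the abstract does not specify. The function names, the flow-key layout `(application, flow_id)`, and the sampling rate `p` are illustrative assumptions. Each function returns per-flow unbiased size estimates, and a subpopulation size is estimated by summing the estimates of the flows that match a predicate chosen after the fact.

```python
import random
from collections import Counter


def packet_sampling(packets, p, rng):
    """Keep each packet independently with probability p (assumed rate).

    Returns per-flow estimates: sampled count scaled by 1/p, which is
    unbiased for the flow's true packet count.
    """
    counts = Counter()
    for flow_key in packets:
        if rng.random() < p:
            counts[flow_key] += 1
    return {flow: c / p for flow, c in counts.items()}


def sample_and_hold(packets, p, rng):
    """Basic sample-and-hold: a flow enters the cache when one of its
    packets is sampled with probability p; afterwards every packet of
    that flow is counted.

    Returns per-flow estimates: cached count plus (1 - p) / p, which
    compensates in expectation for packets skipped before caching.
    """
    counts = {}
    for flow_key in packets:
        if flow_key in counts:
            counts[flow_key] += 1      # flow already held: count every packet
        elif rng.random() < p:
            counts[flow_key] = 1       # first sampled packet creates the entry
    return {flow: c + (1 - p) / p for flow, c in counts.items()}


def subpopulation_size(estimates, predicate):
    """Sum the estimates of all summarized flows selected by the predicate."""
    return sum(est for flow, est in estimates.items() if predicate(flow))


# Hypothetical stream: one entry per packet, keyed by (application, flow_id).
stream = [("web", 1)] * 3 + [("dns", 2)] * 2 + [("web", 3)]
rng = random.Random(7)
summary = sample_and_hold(stream, p=0.5, rng=rng)
web_estimate = subpopulation_size(summary, lambda flow: flow[0] == "web")
```

Both summaries here grow with the number of sampled flows; a deployed scheme would also need to bound the cache size, which this sketch leaves out.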


Published in

PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2007
328 pages
ISBN: 9781595936851
DOI: 10.1145/1265530

Copyright © 2007 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
Acceptance Rates

Overall Acceptance Rate: 476 of 1,835 submissions, 26%