
Traffic Refinery: Cost-Aware Data Representation for Machine Learning on Network Traffic

Published: 15 December 2021

Abstract

Network management often relies on machine learning to make predictions about performance and security from network traffic. Often, the representation of the traffic is as important as the choice of the model. The features that the model relies on, and the representation of those features, ultimately determine model accuracy, as well as where and whether the model can be deployed in practice. Thus, the design and evaluation of these models ultimately requires understanding not only model accuracy but also the systems costs associated with deploying the model in an operational network. Towards this goal, this paper develops a new framework and system that enables a joint evaluation of both the conventional notions of machine learning performance (e.g., model accuracy) and the systems-level costs of different representations of network traffic. We highlight these two dimensions for two practical network management tasks, video streaming quality inference and malware detection, to demonstrate the importance of exploring different representations to find the appropriate operating point. We demonstrate the benefit of exploring a range of representations of network traffic and present Traffic Refinery, a proof-of-concept implementation that both monitors network traffic at 10 Gbps and transforms traffic in real time to produce a variety of feature representations for machine learning. Traffic Refinery both highlights this design space and makes it possible to explore different representations for learning, balancing systems costs related to feature extraction and model training against model accuracy.
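
The idea of finding an "appropriate operating point" can be illustrated with a small sketch. The code below is not taken from Traffic Refinery. It assumes each candidate representation has already been reduced to two numbers, a model accuracy and a single abstract systems cost, and it uses a plain Pareto-front filter as one possible way to compare them. The class `Representation`, the helper functions, the representation names, and every number are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    """One candidate feature representation (all values are hypothetical)."""
    name: str
    accuracy: float  # model accuracy obtained with this representation (0..1)
    cost: float      # abstract systems cost of extracting it (arbitrary units)


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates, cheapest first.

    A dominates B when A is at least as accurate and at most as costly
    as B, and strictly better on at least one of the two.
    """
    front = []
    for c in candidates:
        dominated = any(
            o.accuracy >= c.accuracy and o.cost <= c.cost
            and (o.accuracy > c.accuracy or o.cost < c.cost)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda r: r.cost)


def cheapest_meeting(candidates, min_accuracy):
    """Pick the lowest-cost candidate whose accuracy meets the target, or None."""
    ok = [c for c in candidates if c.accuracy >= min_accuracy]
    return min(ok, key=lambda r: r.cost) if ok else None


# Placeholder numbers for illustration only; they are NOT results from the paper.
CANDIDATES = [
    Representation("flow_counters", accuracy=0.80, cost=1.0),
    Representation("flow_counters_plus_timing", accuracy=0.90, cost=3.0),
    Representation("per_packet_headers", accuracy=0.89, cost=8.0),
    Representation("raw_payload_bytes", accuracy=0.92, cost=20.0),
]

if __name__ == "__main__":
    for r in pareto_front(CANDIDATES):
        print(f"{r.name}: accuracy={r.accuracy:.2f}, cost={r.cost:.1f}")
    print("chosen:", cheapest_meeting(CANDIDATES, min_accuracy=0.85))
```

With these placeholder values, `per_packet_headers` drops out because a cheaper candidate is also more accurate. A real evaluation would likely track several cost dimensions (for example feature extraction and model training, as the abstract mentions) rather than one scalar.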

Francesco Bronzino et al. Proc. ACM Meas. Anal. Comput. Syst., Vol. 5, No. 3, Article 40. Publication date: December 2021.
