Abstract
Network management often relies on machine learning to make predictions about performance and security from network traffic. The representation of the traffic is often as important as the choice of model: the features a model relies on, and how those features are represented, determine model accuracy as well as where and whether the model can be deployed in practice. Designing and evaluating these models therefore requires understanding not only model accuracy but also the systems costs of deploying the model in an operational network. Toward this goal, this paper develops a new framework and system that enables joint evaluation of both conventional notions of machine learning performance (e.g., model accuracy) and the systems-level costs of different representations of network traffic. We highlight these two dimensions for two practical network management tasks, video streaming quality inference and malware detection, to demonstrate the importance of exploring different representations to find the appropriate operating point. We demonstrate the benefit of exploring a range of representations and present Traffic Refinery, a proof-of-concept implementation that both monitors network traffic at 10 Gbps and transforms traffic in real time to produce a variety of feature representations for machine learning. Traffic Refinery both highlights this design space and makes it possible to explore different representations for learning, balancing systems costs related to feature extraction and model training against model accuracy.
Traffic Refinery: Cost-Aware Data Representation for Machine Learning on Network Traffic. Francesco Bronzino et al. Proc. ACM Meas. Anal. Comput. Syst., Vol. 5, No. 3, Article 40. Publication date: December 2021.