ABSTRACT
Microservices are becoming more complicated, posing new challenges for traditional performance monitoring solutions. On the one hand, the rapid evolution of microservices places a significant burden on the use and maintenance of existing distributed tracing frameworks. On the other hand, complex infrastructure increases the probability of network performance problems and creates more blind spots on the network side. In this paper, we present DeepFlow, a network-centric distributed tracing framework for troubleshooting microservices. DeepFlow provides out-of-the-box tracing via a network-centric tracing plane and implicit context propagation. In addition, it eliminates blind spots in network infrastructure, captures network metrics at low cost, and enhances correlation between different components and layers. We demonstrate analytically and empirically that DeepFlow is capable of locating microservice performance anomalies with negligible overhead. DeepFlow has already identified over 71 critical performance anomalies for more than 26 companies and has been used by hundreds of individual developers. Our production evaluations demonstrate that DeepFlow is able to save users hours of instrumentation effort and reduce troubleshooting time from several hours to just a few minutes.
Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code