DOI: 10.1145/3603269.3604823
research-article
Open Access

Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code

Published: 01 September 2023

ABSTRACT

Microservices are becoming more complicated, posing new challenges for traditional performance monitoring solutions. On the one hand, the rapid evolution of microservices places a significant burden on the utilization and maintenance of existing distributed tracing frameworks. On the other hand, complex infrastructure increases the probability of network performance problems and creates more blind spots on the network side. In this paper, we present DeepFlow, a network-centric distributed tracing framework for troubleshooting microservices. DeepFlow provides out-of-the-box tracing via a network-centric tracing plane and implicit context propagation. In addition, it eliminates blind spots in network infrastructure, captures network metrics in a low-cost way, and enhances correlation between different components and layers. We demonstrate analytically and empirically that DeepFlow is capable of locating microservice performance anomalies with negligible overhead. DeepFlow has already identified over 71 critical performance anomalies for more than 26 companies and has been utilized by hundreds of individual developers. Our production evaluations demonstrate that DeepFlow is able to save users hours of instrumentation efforts and reduce troubleshooting time from several hours to just a few minutes.

  140. Xu Zhao, Yongle Zhang, David Lion, Muhammad Faizan Ullah, Yu Luo, Ding Yuan, and Michael Stumm. 2014. Lprof: A Non-Intrusive Request Flow Profiler for Distributed Systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14). USENIX Association, USA, 629--644.Google ScholarGoogle Scholar
  141. Zipkin. 2023. Zipkin. (Jan. 2023). https://zipkin.io/Google ScholarGoogle Scholar

        • Published in

          ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
          September 2023
          1217 pages
          ISBN:9798400702365
          DOI:10.1145/3603269

          Copyright © 2023 Owner/Author(s)

          This work is licensed under a Creative Commons Attribution International 4.0 License.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 September 2023

          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate 554 of 3,547 submissions, 16%