Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

Published: 05 December 2018

Abstract

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today—logs, counters, and metrics—have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This article presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. We have implemented a prototype of Pivot Tracing for Java-based systems and evaluate it on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We show that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. We show that Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between inter-operating applications, with low execution overhead.
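
To make the happened-before join more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each recorded event has an identifier and a list of causal parents, and it joins events from one point in the system (a hypothetical client tracepoint that records a process name) with causally later events from another point (a hypothetical storage-node tracepoint that records bytes read). All names and data here (`parents`, `client_events`, `datanode_events`, `proc`, `bytes`) are illustrative assumptions, and the brute-force ancestor search stands in for whatever mechanism a real system would use to track causality at runtime.

```python
from collections import defaultdict


def ancestors(event_id, parents):
    """Return the ids of all events that causally precede event_id."""
    seen = set()
    stack = list(parents.get(event_id, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def happened_before_join(left, right, parents):
    """Pair each left event with every right event it happened before."""
    pairs = []
    for r in right:
        preceding = ancestors(r["id"], parents)
        for l in left:
            if l["id"] in preceding:
                pairs.append((l, r))
    return pairs


# Hypothetical causal graph: event id -> ids of its direct causal parents.
parents = {
    "c1": [],
    "c2": [],
    "d1": ["c1"],  # first read issued on behalf of client event c1
    "d2": ["c2"],  # read issued on behalf of client event c2
    "d3": ["d1"],  # follow-up read, transitively caused by c1
}

# Events observed at the (assumed) client tracepoint.
client_events = [
    {"id": "c1", "proc": "FSread"},
    {"id": "c2", "proc": "MRsort"},
]

# Events observed at the (assumed) storage-node tracepoint.
datanode_events = [
    {"id": "d1", "bytes": 100},
    {"id": "d2", "bytes": 250},
    {"id": "d3", "bytes": 50},
]

# Group a metric defined at the storage node by an attribute that is only
# known at the client, using the causal relation to connect the two.
bytes_by_proc = defaultdict(int)
for client, read in happened_before_join(client_events, datanode_events, parents):
    bytes_by_proc[client["proc"]] += read["bytes"]

print(dict(bytes_by_proc))  # {'FSread': 150, 'MRsort': 250}
```

The sketch shows the shape of the query the abstract describes: the metric (bytes read) is defined at one point, while the grouping key (client process) comes from an earlier event elsewhere, and only causally related pairs are combined.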

Published in

ACM Transactions on Computer Systems, Volume 35, Issue 4
November 2017
97 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/3297862

Copyright © 2018 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 5 December 2018
• Accepted: 1 April 2018
• Revised: 1 January 2018
• Received: 1 August 2017

Qualifiers

• research-article
• Research
• Refereed
