Abstract
Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today (logs, counters, and metrics) have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This article presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing lets users define, at runtime, arbitrary metrics at one point of the system, and select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. We have implemented a prototype of Pivot Tracing for Java-based systems and evaluated it on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We show that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. We also show that Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between inter-operating applications, with low execution overhead.
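To make the idea of a happened-before join concrete, the following is a minimal sketch in Python and not the Pivot Tracing implementation. It assumes each event carries a vector clock, and it joins events from two points in a system only when the first causally precedes the second, in the sense of Lamport's happened-before relation. All names here (`happened_before`, `happened_before_join`, `clientA`, `dn1`, the field names, and the sample data) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def happened_before(a, b):
    """Return True if vector clock a causally precedes vector clock b.

    Clocks are dicts mapping a process name to a counter; a missing
    entry counts as 0.
    """
    keys = set(a) | set(b)
    no_greater = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    strictly_less = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return no_greater and strictly_less


def happened_before_join(left, right):
    """Pair every left event with every right event it causally precedes."""
    return [
        (l, r)
        for l in left
        for r in right
        if happened_before(l["clock"], r["clock"])
    ]


# Hypothetical events at a client-side point: which client issued a request.
client_events = [
    {"client": "clientA", "clock": {"clientA": 1}},
    {"client": "clientB", "clock": {"clientB": 1}},
]

# Hypothetical events at a storage-side point: bytes read while serving work.
storage_events = [
    {"bytes": 100, "clock": {"clientA": 2, "dn1": 1}},
    {"bytes": 50, "clock": {"clientB": 2, "dn2": 1}},
    {"bytes": 25, "clock": {"clientA": 3, "dn1": 2}},
    {"bytes": 7, "clock": {"dn3": 1}},  # background work, caused by no client
]

# Metric defined at the storage side, grouped by an attribute that is only
# known at the client side.
totals = defaultdict(int)
for client_event, storage_event in happened_before_join(client_events, storage_events):
    totals[client_event["client"]] += storage_event["bytes"]

totals = dict(totals)
print(totals)
```

In this sketch the background read of 7 bytes matches no client event and is excluded from the grouped totals. The pairwise comparison is quadratic in the number of events and is used here only to show the semantics of the join.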