Abstract
Fay is a flexible platform for the efficient collection, processing, and analysis of software execution traces. Fay provides dynamic tracing through use of runtime instrumentation and distributed aggregation within machines and across clusters. At the lowest level, Fay can be safely extended with new tracing primitives, including even untrusted, fully optimized machine code, and Fay can be applied to running user-mode or kernel-mode software without compromising system stability. At the highest level, Fay provides a unified, declarative means of specifying what events to trace, as well as the aggregation, processing, and analysis of those events.
We have implemented the Fay tracing platform for Windows and integrated it with two powerful, expressive systems for distributed programming. Our implementation is easy to use, can be applied to unmodified production systems, and provides primitives that allow the overhead of tracing to be greatly reduced, compared to previous dynamic tracing platforms. To show the generality of Fay tracing, we reimplement, in experiments, a range of tracing strategies and several custom mechanisms from existing tracing frameworks.
Fay shows that modern techniques for high-level querying and data-parallel processing of disaggregated data streams are well suited to comprehensive monitoring of software execution in distributed systems. Revisiting a lesson from the late 1960s [Deutsch and Grant 1971], Fay also demonstrates the efficiency and extensibility benefits of using safe, statically verified machine code as the basis for low-level execution tracing. Finally, Fay establishes that, by automatically deriving optimized query plans and code for safe extensions, the expressiveness and performance of high-level tracing queries can equal or even surpass that of specialized monitoring tools.
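To make the idea of distributed aggregation "within machines and across clusters" concrete, the following is a minimal Python sketch of two-level aggregation: each machine first reduces its own trace events to a small summary, and only those summaries are merged cluster-wide. It is an illustration only and does not use Fay's actual query interface; the event shape `(machine, function_name)` and the helper names `aggregate_on_machine`, `aggregate_across_cluster`, and `trace_query` are assumptions introduced here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping, Tuple

# Hypothetical event shape (machine name, traced function name).
# This is an assumption for illustration, not Fay's real event format.
Event = Tuple[str, str]


def aggregate_on_machine(events: Iterable[Event]) -> Counter:
    """Count calls per function locally, so only a small summary leaves the machine."""
    return Counter(func for _machine, func in events)


def aggregate_across_cluster(per_machine: Mapping[str, Counter]) -> Counter:
    """Merge the per-machine summaries into one cluster-wide count."""
    total: Counter = Counter()
    for counts in per_machine.values():
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total


def trace_query(events: Iterable[Event]) -> Counter:
    """Group events by machine, aggregate locally, then combine the summaries."""
    by_machine: Dict[str, List[Event]] = {}
    for machine, func in events:
        by_machine.setdefault(machine, []).append((machine, func))
    local = {m: aggregate_on_machine(evs) for m, evs in by_machine.items()}
    return aggregate_across_cluster(local)


if __name__ == "__main__":
    sample = [
        ("m1", "ReadFile"),
        ("m1", "ReadFile"),
        ("m2", "ReadFile"),
        ("m2", "WriteFile"),
    ]
    print(trace_query(sample))
```

Because counting is associative and commutative, the local step can run anywhere close to the event source without changing the final result, which is what makes early aggregation a safe way to cut the volume of trace data that must be moved.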
- Ansel, J., Marchenko, P., Erlingsson, Ú., Taylor, E., Chen, B., Schuff, D. L., Sehr, D., Biffle, C. L., and Yee, B. 2011. Language-independent sandboxing of just-in-time compilation and self-modifying code. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).
- Apache. Hadoop project. http://hadoop.apache.org/.
- Avgustinov, P., Tibble, J., Bodden, E., Hendren, L., Lhotak, O., de Moor, O., Ongkingco, N., and Sittampalam, G. 2006. Efficient trace monitoring. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA).
- Balazinska, M., Balakrishnan, H., Madden, S., and Stonebraker, M. 2005. Fault-tolerance in the Borealis distributed stream processing system. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).
- Barham, P., Donnelly, A., Isaacs, R., and Mortier, R. 2004. Using Magpie for request extraction and workload modelling. In Proceedings of the Conference on Operating System Design and Implementation (OSDI).
- Bershad, B. N., Savage, S., Pardyak, P., Becker, D., Fiuczynski, M., and Sirer, E. G. 1995. Protection is a software issue. In Proceedings of the 5th Workshop on Hot Topics in Operating Systems (HotOS-V).
- Bhatia, S., Kumar, A., Fiuczynski, M. E., and Peterson, L. 2008. Lightweight, high-resolution monitoring for troubleshooting production systems. In Proceedings of the Conference on Operating System Design and Implementation (OSDI).
- Bungale, P. P. and Luk, C.-K. 2007. PinOS: A programmable framework for whole-system dynamic instrumentation. In Proceedings of the 3rd International ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environment (VEE).
- Burrows, M., Erlingsson, Ú., Leung, S.-T. A., Vandevoorde, M. T., Waldspurger, C. A., Walker, K., and Weihl, W. E. 2000. Efficient and flexible value sampling. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
- Cantrill, B. 2006. Hidden in plain sight. ACM Queue 4.
- Cantrill, B. M., Shapiro, M. W., and Leventhal, A. H. 2004. Dynamic instrumentation of production systems. In Proceedings of the USENIX Annual Technical Conference.
- Cao, Q., Abdelzaher, T., Stankovic, J., Whitehouse, K., and Luo, L. 2008. Declarative tracepoints: A programmable and application independent debugging system for wireless sensor networks. In Proceedings of the International Conference on Embedded Networked Sensor Systems (SenSys).
- Chambers, C., Raniwala, A., Perry, F., Adams, S., Henry, R. R., Bradshaw, R., and Weizenbaum, N. 2010. FlumeJava: Easy, efficient data-parallel pipelines. In Proceedings of the ACM SIGPLAN 2010 Conference on Programming Language Design and Implementation (PLDI).
- Dean, J. and Ghemawat, S. 2010. MapReduce: A flexible data processing tool. Comm. ACM 53, 1.
- Deutsch, P. and Grant, C. A. 1971. A flexible measurement tool for software systems. In Proceedings of the IFIP Congress 71.
- Eclipse. Callgraph plug-in. http://wiki.eclipse.org/Linux_Tools_Project/Callgraph/User_Guide.
- Eigler, F. C. 2010. Systemtap tutorial. http://sourceware.org/systemtap/tutorial/.
- Erlingsson, Ú., Abadi, M., Vrable, M., Budiu, M., and Necula, G. C. 2006a. XFI: Software guards for system address spaces. In Proceedings of the Conference on Operating System Design and Implementation (OSDI).
- Erlingsson, Ú., Manasse, M., and McSherry, F. 2006b. A cool and practical alternative to traditional hash tables. In Proceedings of the Workshop on Distributed Data and Structures.
- Etsion, Y., Tsafrir, D., Kirkpatrick, S., and Feitelson, D. G. 2007. Fine grained kernel logging with KLogger: Experience and insights. In Proceedings of the 2007 EuroSys Conference.
- flume. Flume: Open source log collection system. http://github.com/cloudera/flume.
- Gao, D., Jensen, S., Snodgrass, R. T., and Soo, M. D. 2005. Join operations in temporal databases. Int. J. Very Large Datab. (VLDB Journal) 14, 2.
- Glerum, K., Kinshumann, K., Greenberg, S., Aul, G., Orgovan, V., Nichols, G., Grant, D., Loihle, G., and Hunt, G. 2009. Debugging in the (very) large: Ten years of implementation and experience. In Proceedings of the 22nd ACM Symposium on Operating System Principles (SOSP’09).
- Goldsmith, S. F., O’Callahan, R., and Aiken, A. 2005. Relational queries over program traces. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA’05).
- Gupta, A., Mumick, I. S., and Subrahmanian, V. S. 1993. Maintaining views incrementally. In Proceedings of the ACM International Conference on Management of Data.
- Hunt, G. and Brubacher, D. 1998. Detours: Binary interception of Win32 functions. In Proceedings of the USENIX Windows NT Symposium.
- Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys.
- Lee, G. L., Schulz, M., Ahn, D. H., Bernat, A., de Supinski, B. R., Ko, S. Y., and Rountree, B. 2007. Dynamic binary instrumentation and data aggregation on large scale systems. Int. J. Parall. Prog. 35, 3.
- Liblit, B., Aiken, A., Zheng, A. X., and Jordan, M. I. 2003. Bug isolation via remote program sampling. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI) 38, 5.
- Marguerie, F., Eichert, S., and Wooley, J. 2008. LINQ in Action. Manning Publications Co.
- Marian, T., Sagar, A., Chen, T., and Weatherspoon, H. 2011. Fmeter: Extracting indexable low-level system signatures by counting kernel function calls. Tech. rep., Cornell University, Computing and Information Science. http://hdl.handle.net/1813/23568.
- Martin, M., Livshits, B., and Lam, M. S. 2005. Finding application errors and security flaws using PQL: A program query language. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA’05).
- Massie, M. L., Chun, B. N., and Culler, D. E. 2003. The Ganglia distributed monitoring system: Design, implementation and experience. Int. J. Parall. Comput. 30.
- McSherry, F., Yu, Y., Budiu, M., Isard, M., and Fetterly, D. 2011. Scaling Up Machine Learning. Cambridge Univ. Press.
- Microsoft Corp. Determine which queries are holding locks. MSDN. http://msdn.microsoft.com/en-us/library/bb677357.aspx.
- Microsoft Corp. 2003. Introduction to hotpatching. Microsoft TechNet.
- Microsoft Corp. 2006. Kernel patch protection: Frequently asked questions. Windows Hardware Developer Central. http://www.microsoft.com/whdc/driver/kernel/64bitpatch_FAQ.mspx.
- Microsoft Corp. 2010. WDK and developer tools. Windows Hardware Developer Central. http://www.microsoft.com/whdc/DevTools/default.mspx.
- Microsoft Corp. 2011a. Diagnosing and resolving latch contention on SQL Server. Microsoft Download Center. http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=26665.
- Microsoft Corp. 2011b. Introducing SQL Server extended events. MSDN. http://msdn.microsoft.com/en-us/library/bb630354.aspx.
- Microsoft Corp. 2011c. Use the Microsoft Symbol Server to obtain debug symbol files. http://support.microsoft.com/kb/311503.
- Microsoft Corp. 2012. Microsoft StreamInsight. MSDN. http://msdn.microsoft.com/en-us/library/ee362541.aspx.
- Morrisett, G., Walker, D., Crary, K., and Glew, N. 1998. From System F to typed assembly language. In Proceedings of the Symposium on Principles of Programming Languages (POPL).
- Necula, G. C. 1997. Proof-carrying code. In Proceedings of the Symposium on Principles of Programming Languages (POPL).
- Nethercote, N. and Seward, J. 2007. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).
- Oney, W. 2002. Programming the Microsoft Windows Driver Model. Microsoft Press.
- Park, I. and Buch, R. 2007. Improve debugging and performance tuning with ETW. MSDN Magazine.
- Passing, J., Schmidt, A., von Lowis, M., and Polze, A. 2009. NTrace: Function boundary tracing for Windows on IA-32. In Proceedings of the Working Conference on Reverse Engineering.
- Peter, S., Baumann, A., Roscoe, T., Barham, P., and Isaacs, R. 2008. 30 seconds is not enough!: A study of operating system timer usage. In Proceedings of the 2008 EuroSys Conference.
- Pietrek, M. 1997. A crash course on the depths of Win32 structured exception handling. Microsoft Syst. J.
- Prasad, V., Cohen, W., Eigler, F. C., Hunt, M., Keniston, J., and Chen, B. 2005. Locating system problems using dynamic instrumentation. In Proceedings of the Ottawa Linux Symposium.
- Ren, G., Tune, E., Moseley, T., Shi, Y., Rus, S., and Hundt, R. 2010. Google-wide profiling: A continuous profiling infrastructure for data centers. IEEE Micro 30, 4.
- Romer, T. H., Lee, D., Voelker, G. M., Wolman, A., Wong, W. A., Baer, J.-L., Bershad, B. N., and Levy, H. M. 1996. The structure and performance of interpreters. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
- Rostedt, S. 2009. Debugging the kernel using Ftrace. lwn.net.
- Russinovich, M. E., Solomon, D. A., and Ionescu, A. 2009. Microsoft Windows Internals. Microsoft Press.
- Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., and Shanbhag, C. 2010. Dapper, a large-scale distributed systems tracing infrastructure. Tech. rep. 2010-1, Google Inc.
- Skadron, K., Ahuja, P. S., Martonosi, M., and Clark, D. W. 1998. Improving prediction for procedure returns with return-address-stack repair mechanisms. In Proceedings of the Annual ACM/IEEE International Symposium on Microarchitecture (MICRO).
- Small, C. and Seltzer, M. I. 1998. MiSFIT: Constructing safe extensible systems. IEEE Concurr.: Parall. Distrib. Mobile Comput. 6, 3.
- Sookoor, T., Hnat, T., Hooimeijer, P., Weimer, W., and Whitehouse, K. 2009. Macrodebugging: Global views of distributed program execution. In Proceedings of the International Conference on Embedded Networked Sensor Systems (SenSys).
- Srivastava, A., Edwards, A., and Vo, H. 2001. Vulcan: Binary transformation in a distributed environment. Tech. rep. MSR-TR-2001-50, Microsoft Research.
- Stanek, W. 2009. Windows PowerShell(TM) 2.0 Administrator’s Pocket Consultant. Microsoft Press.
- Strosaker, M. Sample real-world use of systemtap. http://zombieprocess.wordpress.com/2008/01/03/sample-real-world-use-of-systemtap/.
- SystemTap. Examples. http://sourceware.org/systemtap/examples/.
- SystemTap. 2006. Bug 2725: function(“*”) probes sometimes crash & burn. http://sources.redhat.com/bugzilla/show_bug.cgi?id=2725.
- Varghese, G. and Lauck, A. 1997. Hashed and hierarchical timing wheels. IEEE/ACM Trans. Netw. 5, 6.
- Verbowski, C., Kiciman, E., Kumar, A., Daniels, B., Lu, S., Lee, J., Wang, Y.-M., and Roussev, R. 2006. Flight data recorder: Monitoring persistent-state interactions to improve systems management. In Proceedings of the Conference on Operating System Design and Implementation (OSDI).
- Wahbe, R., Lucco, S., Anderson, T. E., and Graham, S. L. 1993. Efficient software-based fault isolation. In Proceedings of the 14th ACM Symposium on Operating System Principles (SOSP’93).
- Wisniewski, R. W. and Rosenburg, B. 2003. Efficient, unified, and scalable performance monitoring for multiprocessor operating systems. In Supercomputing.
- Woodard, D. B. and Goldszmidt, M. 2009. Model-based clustering for online crisis identification in distributed computing. Tech. rep. TR-2009-131, MSR.
- Yee, B., Sehr, D., Dardyk, G., Chen, J. B., Muth, R., Ormandy, T., Okasaka, S., Narula, N., and Fullagar, N. 2010. Native client: A sandbox for portable, untrusted x86 native code. Comm. ACM 53, 1, 91-99.
- Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Kumar, P. G., and Currey, J. 2008. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the Conference on Operating System Design and Implementation (OSDI).
- Yu, Y., Gunda, P. K., and Isard, M. 2009. Distributed aggregation for data-parallel computing: Interfaces and implementations. In Proceedings of the 22nd ACM Symposium on Operating System Principles (SOSP’09).
Fay: Extensible Distributed Tracing from Kernels to Clusters
Recommendations
Fay: extensible distributed tracing from kernels to clusters
SOSP '11: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles