Abstract
Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental complexity of troubleshooting any complex software system, but further exacerbated by the paucity of information that is typically available in the production setting. Indeed, for reasons of both overhead and privacy, it is common that only the run-time log generated by a system (e.g., syslog) can be shared with the developers. Unfortunately, such ad-hoc reports are frequently insufficient for detailed failure diagnosis. This paper seeks to improve this situation within the rubric of existing practice. We describe a tool, LogEnhancer, that automatically “enhances” existing logging code to aid in future post-failure debugging. We evaluate LogEnhancer on eight large, real-world applications and demonstrate that it can dramatically reduce the set of potential root failure causes that must be considered while imposing negligible overheads.
Index Terms
Improving Software Diagnosability via Log Enhancement