Abstract
Information produced by Internet applications is inherently the result of processes that execute locally. Think of a web server that uses a CGI script, or a content management system where a post was first edited in a word processor. Given the impact of these processes on the content published online, a consumer of that information may want to understand what those impacts were. For example, knowing where text was copied from to make a post, or whether the CGI script was updated with the latest security patches, may influence confidence in the published content. Capturing and exposing this provenance information is thus important for establishing trust in online content. Furthermore, providers of Internet applications may want access to the same information for debugging or audit purposes. For processes that follow a rigid structure (such as databases or workflows), disclosed provenance systems have been developed that efficiently and accurately capture the provenance of the produced data. However, accurately capturing provenance from unstructured processes, for example, the user-interactive computing used to produce web content, remains an open problem.
In this article, we address the problem of capturing and exposing provenance from unstructured processes. Our approach, called PROV2R (PROVenance Record and Replay), is composed of two parts: (a) the decoupling of provenance analysis from its capture; and (b) the capture of high-fidelity provenance from unmodified programs. We use techniques originating in the security and reverse engineering communities, namely record and replay and taint tracking. Taint tracking fundamentally addresses the data provenance problem but is impractical to apply at runtime because of its extremely high overhead. Through a number of case studies, we demonstrate that PROV2R enables the use of taint analysis for high-fidelity provenance capture while keeping the runtime overhead at manageable levels. In addition, we show how the captured information can be represented using the W3C PROV provenance model for exposure on the Web.
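To make the idea of decoupling capture from analysis concrete, the following is a minimal sketch, not the PROV2R implementation. It assumes a recorded execution has already been reduced to a simple event list (the `TRACE` format, the buffer names, and the file names are all hypothetical), propagates taint labels over that list offline, and prints the result as PROV-N-style `wasDerivedFrom` statements.

```python
from collections import defaultdict

# Hypothetical recorded trace: (operation, destination, sources).
# A real system would obtain such events by replaying a recorded execution;
# this flat list is an assumption made purely for illustration.
TRACE = [
    ("read",  "buf1", ["file:notes.txt"]),
    ("read",  "buf2", ["file:template.html"]),
    ("copy",  "buf3", ["buf1", "buf2"]),
    ("write", "file:post.html", ["buf3"]),
]


def replay_with_taint(trace):
    """Propagate taint labels over a recorded trace, after the fact."""
    taint = defaultdict(set)  # memory location -> files it was derived from
    derived = {}              # output file -> set of input files
    for op, dst, srcs in trace:
        labels = set()
        for src in srcs:
            if src.startswith("file:"):
                labels.add(src)       # data read from a file gets that file's label
            else:
                labels |= taint[src]  # otherwise inherit the source's labels
        if op == "write":
            derived[dst] = labels     # record which inputs reached this output
        else:
            taint[dst] = labels
    return derived


def to_prov_n(derived):
    """Render the derivations as PROV-N-style wasDerivedFrom statements."""
    lines = []
    for out, inputs in sorted(derived.items()):
        for inp in sorted(inputs):
            lines.append(f"wasDerivedFrom({out}, {inp})")
    return lines


if __name__ == "__main__":
    for line in to_prov_n(replay_with_taint(TRACE)):
        print(line)
```

Because the analysis runs over the recorded trace rather than the live process, its cost is paid after the fact instead of at runtime, which is the trade-off the abstract describes. For the hypothetical trace above, the script reports that `file:post.html` was derived from both `file:notes.txt` and `file:template.html`.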
PROV2R: Practical Provenance Analysis of Unstructured Processes