
PROV2R: Practical Provenance Analysis of Unstructured Processes

Published: 18 August 2017

Abstract

Information produced by Internet applications is inherently the result of processes executed locally. Think of a web server that invokes a CGI script, or a content management system where a post was first edited in a word processor. Given the impact of these processes on the content published online, a consumer of that information may want to understand what those impacts were. For example, knowing where text was copied and pasted from to compose a post, or whether the CGI script was updated with the latest security patches, may influence confidence in the published content. Capturing and exposing this information provenance is thus important for establishing trust in online content. Furthermore, providers of Internet applications may wish to access the same information for debugging or audit purposes. For processes that follow a rigid structure (such as databases or workflows), disclosed-provenance systems have been developed that efficiently and accurately capture the provenance of the produced data. However, accurately capturing provenance from unstructured processes, for example, the user-interactive computing used to produce web content, remains an open problem.

In this article, we address the problem of capturing and exposing provenance from unstructured processes. Our approach, called PROV2R (PROVenance Record and Replay), has two parts: (a) decoupling provenance analysis from its capture; and (b) capturing high-fidelity provenance from unmodified programs. We use techniques originating in the security and reverse-engineering communities, namely record and replay and taint tracking. Taint tracking fundamentally addresses the data provenance problem but is impractical to apply at runtime due to its extremely high overhead. Through a number of case studies, we demonstrate that PROV2R enables the use of taint analysis for high-fidelity provenance capture while keeping runtime overhead at manageable levels. In addition, we show how the captured information can be represented with the W3C PROV provenance model for exposure on the Web.
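The core taint-tracking idea the abstract refers to can be sketched in a few lines: every input value is labeled with its origin, and operations that combine values propagate the union of their labels to the result. The toy Python class below is purely illustrative; all names are hypothetical, and the paper's actual implementation operates at byte granularity on whole-system record-and-replay traces, not on Python objects (the `id()`-keyed label map here is a simplification that only works while the tracked objects stay alive).

```python
# Toy illustration of dynamic taint tracking (hypothetical names; not
# the paper's PANDA/replay-based implementation).

class TaintTracker:
    def __init__(self):
        # Map from object identity to the set of origin labels.
        # Simplification: valid only while tracked objects are referenced.
        self.labels = {}

    def source(self, value, label):
        """Mark `value` as originating from `label` (e.g., a file)."""
        self.labels[id(value)] = {label}
        return value

    def combine(self, *values):
        """Union of taint labels, for operations that mix inputs."""
        out = set()
        for v in values:
            out |= self.labels.get(id(v), set())
        return out

    def concat(self, a, b):
        """A 'tainted' concatenation: the result inherits both labels."""
        result = a + b
        self.labels[id(result)] = self.combine(a, b)
        return result

t = TaintTracker()
title = t.source("PROV2R", "file:draft.txt")
body = t.source(" replay-based provenance", "clipboard:paste")
post = t.concat(title, body)
print(sorted(t.labels[id(post)]))  # prints ['clipboard:paste', 'file:draft.txt']
```

Doing this propagation for every machine instruction at runtime is what makes online taint tracking prohibitively slow; PROV2R's record-and-replay decoupling moves this cost offline, onto a replayed execution.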



Published in
ACM Transactions on Internet Technology, Volume 17, Issue 4
Special Issue on Provenance of Online Data and Regular Papers
November 2017, 165 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/3133307
Editor: Munindar P. Singh

Copyright © 2017 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Publication History
• Received: 1 August 2016
• Revised: 1 February 2017
• Accepted: 1 March 2017
• Published: 18 August 2017
