Managing Provenance of Implicit Data Flows in Scientific Experiments

Published: 18 August 2017

Abstract

Scientific experiments modeled as scientific workflows may create, change, or access data products not explicitly referenced in the workflow specification, leading to implicit data flows. The lack of knowledge about implicit data flows makes the experiments hard to understand and reproduce. In this article, we present ProvMonitor, an approach that identifies the creation, change, or access to data products even within implicit data flows. ProvMonitor links this information with the workflow activity that generated it, allowing scientists to compare data products within and across trials of the same workflow and to identify side effects on data evolution caused by implicit data flows. We evaluated ProvMonitor and observed that it could answer queries for scenarios that demand specific knowledge related to implicit provenance.

Published in

ACM Transactions on Internet Technology, Volume 17, Issue 4
Special Issue on Provenance of Online Data and Regular Papers
November 2017, 165 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/3133307
Editor: Munindar P. Singh

Copyright © 2017 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 18 August 2017
• Accepted: 1 February 2017
• Revised: 1 December 2016
• Received: 1 July 2016

Qualifiers

• research-article
• Research
• Refereed
