Abstract
Scientific experiments modeled as scientific workflows may create, change, or access data products that are not explicitly referenced in the workflow specification, leading to implicit data flows. The lack of knowledge about implicit data flows makes experiments hard to understand and reproduce. In this article, we present ProvMonitor, an approach that identifies the creation of, changes to, and access to data products, even within implicit data flows. ProvMonitor links this information to the workflow activity that generated it, allowing scientists to compare data products within and across trials of the same workflow and to identify side effects on data evolution caused by implicit data flows. We evaluated ProvMonitor and observed that it could answer queries for scenarios that demand specific knowledge of implicit provenance.
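To make the idea of an implicit data flow concrete, the following is a minimal sketch, not ProvMonitor's actual implementation. It assumes a simple technique chosen here for illustration: hashing every file in a workspace directory before and after an activity runs, then attributing the differences to that activity. The names `snapshot`, `run_activity`, and `align_sequences`, as well as the file names, are hypothetical. Note that content hashing can reveal created, changed, and removed files but cannot reveal read-only access.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(workspace: Path) -> dict[str, str]:
    """Map each file's path (relative to the workspace) to a SHA-256 digest."""
    return {
        str(p.relative_to(workspace)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(workspace.rglob("*"))
        if p.is_file()
    }


def run_activity(name, action, workspace: Path) -> dict:
    """Run one activity and attribute every file-level difference to it."""
    before = snapshot(workspace)
    action(workspace)
    after = snapshot(workspace)
    shared = before.keys() & after.keys()
    return {
        "activity": name,
        "created": sorted(after.keys() - before.keys()),
        "changed": sorted(k for k in shared if before[k] != after[k]),
        "removed": sorted(before.keys() - after.keys()),
    }


def align_sequences(workspace: Path) -> None:
    """Hypothetical activity whose specification declares only output.txt."""
    (workspace / "output.txt").write_text("aligned\n")  # declared output
    (workspace / "cache.tmp").write_text("scratch\n")   # implicit creation
    (workspace / "params.cfg").write_text("gap=2\n")    # implicit change


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        (ws / "params.cfg").write_text("gap=1\n")
        print(run_activity("align_sequences", align_sequences, ws))
```

In this sketch, `cache.tmp` and the modified `params.cfg` appear in the record even though the hypothetical specification mentions only `output.txt`. Storing one such record per activity and per trial is what would allow data products to be compared within and across trials.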
Managing Provenance of Implicit Data Flows in Scientific Experiments