Abstract
Processing large amounts of data sequentially and systematically requires proper management of workflow components (i.e., modules, data, configurations, and the associations among ports and links) in a Scientific Workflow Management System (SWfMS). Managing data with provenance in a SWfMS so that workflows, modules, and data can be reused is not a simple task. Handling such components is even more burdensome for complex workflows that are frequently assembled and executed to investigate large datasets with different technologies (e.g., various learning algorithms or models). Although many studies propose techniques and technologies for managing and recommending services in a SWfMS, only a few consider managing data in a SWfMS for efficient storage and for facilitating workflow executions. Furthermore, no study has examined the effectiveness and efficiency of such data management in a SWfMS from a user perspective. In this paper, we present and evaluate a GUI version of a novel approach to intermediate data management with two use cases (Plant Phenotyping and Bioinformatics). The technique, which we call GUI-RISPTS (Recommending Intermediate States from Pipelines Considering Tool-States), can facilitate the execution of workflows with processed data (i.e., intermediate outcomes of modules in a workflow) and can thus reduce the computational time of some modules in a SWfMS. We integrated GUI-RISPTS with an existing workflow management system called SciWorCS. In SciWorCS, we present an interface through which users select recommended intermediate states (i.e., modules' outcomes). We investigated the effectiveness of GUI-RISPTS from users' perspectives and measured its overhead in terms of storage and its efficiency in workflow execution.
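The general idea of reusing intermediate outcomes can be illustrated with a minimal sketch. This is not the GUI-RISPTS implementation: the names `state_key`, `IntermediateStore`, and `run_pipeline`, the in-memory dictionary, and the choice to key stored results by dataset identifier plus the ordered (module, tool-state) pairs are all assumptions made for illustration only. The sketch stores the output of every pipeline prefix and, on a later run, resumes from the longest prefix whose modules and configurations match.

```python
import hashlib
import json


def state_key(dataset_id, steps):
    """Hash a dataset id plus an ordered list of (module name, tool-state) pairs."""
    payload = json.dumps([dataset_id, steps], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class IntermediateStore:
    """Hypothetical in-memory store of intermediate module outcomes."""

    def __init__(self):
        self._data = {}

    def save(self, dataset_id, steps, result):
        self._data[state_key(dataset_id, steps)] = result

    def recommend(self, dataset_id, steps):
        """Return (number of reusable steps, stored result) for the longest
        stored prefix of `steps`, or (0, None) if nothing matches."""
        for n in range(len(steps), 0, -1):
            key = state_key(dataset_id, steps[:n])
            if key in self._data:
                return n, self._data[key]
        return 0, None


def run_pipeline(dataset_id, data, steps, modules, store):
    """Run `steps`, skipping any prefix whose outcome is already stored.

    Returns the final result and the names of the modules actually executed.
    """
    skipped, cached = store.recommend(dataset_id, steps)
    if skipped:
        data = cached
    executed = []
    for i in range(skipped, len(steps)):
        name, config = steps[i]
        data = modules[name](data, **config)
        executed.append(name)
        # Store the outcome of this prefix so later runs can resume from it.
        store.save(dataset_id, steps[: i + 1], data)
    return data, executed
```

In this sketch, a second workflow that shares its first module and tool-state with an earlier run but changes a later configuration would re-execute only the changed module. A real system would also have to decide which outcomes are worth storing, which is the storage overhead the abstract refers to.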
Index Terms
Designing for Recommending Intermediate States in A Scientific Workflow Management System