DOI: 10.1145/3183767.3183768
research-article

HyperLoom: A Platform for Defining and Executing Scientific Pipelines in Distributed Environments

Published: 23 January 2018

Abstract

Real-world scientific applications often comprise end-to-end data processing pipelines composed of many interconnected computational tasks of varying granularity. We introduce HyperLoom, an open-source platform for defining and executing such pipelines in distributed environments; it provides a Python interface for defining tasks. HyperLoom is a self-contained system that does not rely on an external scheduler to execute tasks. We have successfully used HyperLoom to execute chemogenomics pipelines used in the pharmaceutical industry for novel drug discovery.
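
To make the idea of a pipeline of interconnected tasks concrete, the following is a minimal sketch in plain Python. It does not use HyperLoom's actual interface, which this abstract does not describe. The task names, the `TASKS` table, and the `run_pipeline` helper are all hypothetical, and everything runs sequentially in one process rather than on a distributed system. It only shows how tasks with declared dependencies can be run in a valid order (Python 3.9 or later, standard library only).

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; none of these names come from HyperLoom.
def load():
    return [3, 1, 2]

def clean(data):
    return sorted(data)

def total(data):
    return sum(data)

def report(cleaned, summed):
    return f"{len(cleaned)} values, sum={summed}"

# Each entry maps a task name to (function, names of the tasks it depends on).
TASKS = {
    "load": (load, []),
    "clean": (clean, ["load"]),
    "total": (total, ["clean"]),
    "report": (report, ["clean", "total"]),
}

def run_pipeline(tasks):
    """Run every task after its dependencies; return all results by name."""
    # TopologicalSorter expects a mapping from node to its predecessors.
    sorter = TopologicalSorter({name: deps for name, (_, deps) in tasks.items()})
    results = {}
    for name in sorter.static_order():
        func, deps = tasks[name]
        # Pass each dependency's result as a positional argument.
        results[name] = func(*(results[d] for d in deps))
    return results

results = run_pipeline(TASKS)
print(results["report"])
```

A real platform of the kind described would additionally decide which worker runs each task and would move intermediate data between nodes; the sketch leaves both out.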

Published In

ACM Other conferences
PARMA-DITAM '18: Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms
January 2018
76 pages
ISBN:9781450364447
DOI:10.1145/3183767

In-Cooperation

  • HiPEAC: HiPEAC Network of Excellence

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Big Data
  2. Chemogenomics
  3. Distributed Computing
  4. HPC
  5. Machine Learning
  6. Scientific Pipeline
  7. Task Scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PARMA-DITAM '18

Acceptance Rates

Overall Acceptance Rate 11 of 24 submissions, 46%

Cited By

  • (2022) Distributed workflows with Jupyter. Future Generation Computer Systems 128:C (282-298). DOI: 10.1016/j.future.2021.10.007. Online publication date: 1-Mar-2022
  • (2022) Analysis of workflow schedulers in simulated distributed environments. The Journal of Supercomputing 78:13 (15154-15180). DOI: 10.1007/s11227-022-04438-y. Online publication date: 14-Apr-2022
  • (2021) EVEREST: A design environment for extreme-scale big data analytics on heterogeneous platforms. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE) (1320-1325). DOI: 10.23919/DATE51398.2021.9473940. Online publication date: 1-Feb-2021
  • (2021) Sustainable data analysis with Snakemake. F1000Research 10 (33). DOI: 10.12688/f1000research.29032.2. Online publication date: 19-Apr-2021
  • (2021) Sustainable data analysis with Snakemake. F1000Research 10 (33). DOI: 10.12688/f1000research.29032.1. Online publication date: 18-Jan-2021
  • (2021) StreamFlow: Cross-Breeding Cloud With HPC. IEEE Transactions on Emerging Topics in Computing 9:4 (1723-1737). DOI: 10.1109/TETC.2020.3019202. Online publication date: 1-Oct-2021
  • (2020) Industry-scale application and evaluation of deep learning for drug target prediction. Journal of Cheminformatics 12:1. DOI: 10.1186/s13321-020-00428-5. Online publication date: 19-Apr-2020
  • (2020) Runtime vs Scheduler: Analyzing Dask’s Overheads. 2020 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS) (1-8). DOI: 10.1109/WORKS51914.2020.00006. Online publication date: Nov-2020
  • (2020) Real-Time Model of Computation over HPC/Cloud Orchestration - The LEXIS Approach. Complex, Intelligent and Software Intensive Systems (255-266). DOI: 10.1007/978-3-030-50454-0_24. Online publication date: 11-Jun-2020
  • (2019) Setting the Configuration Parameters of the Algorithm for the Periodic Vehicle Routing Problem by HPC Power. MATEC Web of Conferences 296 (01009). DOI: 10.1051/matecconf/201929601009. Online publication date: 22-Oct-2019
