DOI: 10.1145/1362622.1362680

Falkon: a Fast and Light-weight tasK executiON framework

Published: 10 November 2007

Abstract

To enable the rapid execution of many tasks on compute clusters, we have developed Falkon, a Fast and Light-weight tasK executiON framework. Falkon integrates (1) multi-level scheduling to separate resource acquisition (via, e.g., requests to batch schedulers) from task dispatch, and (2) a streamlined dispatcher. Falkon's integration of multi-level scheduling and streamlined dispatchers delivers performance not provided by any other system. We describe Falkon architecture and implementation, and present performance results for both microbenchmarks and applications. Microbenchmarks show that Falkon throughput (487 tasks/sec) and scalability (to 54,000 executors and 2,000,000 tasks processed in just 112 minutes) are one to two orders of magnitude better than other systems used in production Grids. Large-scale astronomy and medical applications executed under Falkon by the Swift parallel programming system achieve up to 90% reduction in end-to-end run time, relative to versions that execute tasks via separate scheduler submissions.
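
The two ideas named in the abstract, acquiring resources once and then dispatching many tasks through a lightweight dispatcher, can be illustrated with a minimal sketch. This is not Falkon's implementation: the names `Dispatcher`, `executor`, `acquire_executors`, and `run_tasks` are hypothetical, local threads stand in for executors that a batch scheduler would provide, and an in-memory queue stands in for the dispatcher's network protocol.

```python
import queue
import threading


class Dispatcher:
    """Hypothetical streamlined dispatcher: one shared in-memory task queue."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.results = {}
        self._lock = threading.Lock()

    def submit(self, task_id, func, args):
        # Dispatching a task is only a queue insert; no scheduler is involved.
        self.tasks.put((task_id, func, args))

    def record(self, task_id, value):
        with self._lock:
            self.results[task_id] = value


def executor(dispatcher):
    """Executor loop: pull tasks until a None sentinel arrives."""
    while True:
        item = dispatcher.tasks.get()
        if item is None:
            break
        task_id, func, args = item
        dispatcher.record(task_id, func(*args))


def acquire_executors(dispatcher, count):
    """Level 1: stands in for a single batch-scheduler request for `count` workers."""
    workers = [
        threading.Thread(target=executor, args=(dispatcher,))
        for _ in range(count)
    ]
    for worker in workers:
        worker.start()
    return workers


def run_tasks(calls, num_executors=4):
    """Acquire executors once, then dispatch every (func, args) pair to them."""
    dispatcher = Dispatcher()
    workers = acquire_executors(dispatcher, num_executors)
    # Level 2: many tasks flow through the already-acquired executors.
    for task_id, (func, args) in enumerate(calls):
        dispatcher.submit(task_id, func, args)
    # One sentinel per executor, queued after all tasks, shuts the pool down.
    for _ in workers:
        dispatcher.tasks.put(None)
    for worker in workers:
        worker.join()
    return [dispatcher.results[i] for i in range(len(calls))]


if __name__ == "__main__":
    print(run_tasks([(pow, (2, 3)), (pow, (3, 2))], num_executors=2))
```

The point of the sketch is the cost structure: the expensive acquisition step runs once per pool, while each task costs only a queue operation, in contrast to submitting every task as a separate scheduler job.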

Published In

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing
November 2007
723 pages
ISBN:9781595937643
DOI:10.1145/1362622

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dynamic resource provisioning
  2. grid computing
  3. parallel programming
  4. scheduling

Qualifiers

  • Research-article

Acceptance Rates

SC '07 Paper Acceptance Rate 54 of 268 submissions, 20%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
