Abstract
In this paper, we propose an in-memory computing framework, GPF, that provides a set of genomic formats, APIs, and a fast genomic engine for large-scale genomic data processing. GPF comprises two main components: (1) scalable genomic data formats and APIs, and (2) an advanced execution engine that supports efficient compression of genomic data and eliminates redundancies during execution. We further present both system-specific and algorithm-specific implementations that let users build genomic analysis pipelines without any knowledge of Spark parallel programming. To test the performance of GPF, we built a Whole-Genome-Sequencing (WGS) pipeline on top of it as a test case. Our experimental data indicate that GPF completes WGS analysis of the 146.9G bases of the Human Platinum Genome in 24 minutes, with over 50% parallel efficiency on 2048 CPU cores. Together, these results show that GPF provides a fast and general in-memory computing engine for large-scale genomic data processing.
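The headline figures in the abstract can be put in perspective with a little arithmetic. The Python sketch below derives the aggregate and per-core throughput implied by 146.9G bases in 24 minutes on 2048 cores, and shows how a relative parallel efficiency is commonly computed. The abstract does not state the baseline run used for its efficiency figure, so the baseline values here (128 cores, 200 minutes) are purely hypothetical placeholders, and the function names are illustrative rather than part of GPF.

```python
def throughput_bases_per_second(total_bases: float, minutes: float) -> float:
    """Aggregate processing rate in bases per second."""
    return total_bases / (minutes * 60.0)


def parallel_efficiency(t_ref: float, cores_ref: int,
                        t_par: float, cores_par: int) -> float:
    """Relative parallel efficiency against a reference run.

    Ratio of core-time spent by the reference run to core-time spent
    by the larger run; 1.0 means perfect scaling.
    """
    return (t_ref * cores_ref) / (t_par * cores_par)


# Figures taken from the abstract.
TOTAL_BASES = 146.9e9   # 146.9G bases
RUNTIME_MIN = 24.0      # minutes
CORES = 2048

rate = throughput_bases_per_second(TOTAL_BASES, RUNTIME_MIN)
per_core = rate / CORES

# Hypothetical baseline (NOT from the paper): 128 cores, 200 minutes.
BASELINE_CORES = 128
BASELINE_MIN = 200.0
eff = parallel_efficiency(BASELINE_MIN, BASELINE_CORES, RUNTIME_MIN, CORES)

print(f"aggregate throughput: {rate / 1e6:.1f} Mbases/s")
print(f"per-core throughput:  {per_core / 1e3:.1f} kbases/s")
print(f"efficiency vs. hypothetical baseline: {eff:.1%}")
```

Under this definition, an efficiency above 50% at 2048 cores and 24 minutes means the reference run must have consumed more than half of 24 × 2048 core-minutes, whatever its actual core count was.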
High-performance genomic analysis framework with in-memory computing
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming