research-article

High-performance genomic analysis framework with in-memory computing

Published: 10 February 2018

Abstract

In this paper, we propose GPF, an in-memory computing framework that provides a set of genomic formats, APIs, and a fast genomic engine for large-scale genomic data processing. GPF comprises two main components: (1) scalable genomic data formats and APIs, and (2) an advanced execution engine that supports efficient compression of genomic data and eliminates redundancies in its execution. We further present both system-level and algorithm-specific implementations that let users build genomic analysis pipelines without any knowledge of Spark parallel programming. To test the performance of GPF, we built a Whole-Genome-Sequencing (WGS) pipeline on top of it as a test case. Our experimental results indicate that GPF completes WGS analysis of the Human Platinum Genome (146.9G bases) in 24 minutes, with over 50% parallel efficiency on 2048 CPU cores. Overall, GPF provides a fast and general in-memory engine for large-scale genomic data processing.
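To put the reported figures in context, the following minimal Python sketch computes the aggregate and per-core throughput implied by the abstract (146.9G bases, 24 minutes, 2048 cores) and shows one common way to define parallel efficiency. The abstract does not state the baseline configuration behind the "over 50%" figure, so the baseline core count and runtime below are purely hypothetical placeholders, and the function names are illustrative rather than part of GPF.

```python
# Figures taken from the abstract.
TOTAL_BASES = 146.9e9      # 146.9G bases (Human Platinum Genome)
RUNTIME_MINUTES = 24.0     # reported WGS running time
CORES = 2048               # reported CPU core count


def throughput_bases_per_second(total_bases: float, minutes: float) -> float:
    """Aggregate throughput in bases per second."""
    return total_bases / (minutes * 60.0)


def parallel_efficiency(t_base: float, cores_base: int,
                        t_scaled: float, cores_scaled: int) -> float:
    """Speedup relative to a baseline run, divided by the core-count ratio."""
    speedup = t_base / t_scaled
    scale = cores_scaled / cores_base
    return speedup / scale


aggregate = throughput_bases_per_second(TOTAL_BASES, RUNTIME_MINUTES)
per_core = aggregate / CORES

# HYPOTHETICAL baseline, not from the paper: 128 cores taking 192 minutes.
HYPOTHETICAL_BASE_CORES = 128
HYPOTHETICAL_BASE_MINUTES = 192.0
efficiency = parallel_efficiency(HYPOTHETICAL_BASE_MINUTES, HYPOTHETICAL_BASE_CORES,
                                 RUNTIME_MINUTES, CORES)

print(f"aggregate throughput: {aggregate:,.0f} bases/s")
print(f"per-core throughput:  {per_core:,.0f} bases/s/core")
print(f"efficiency vs. hypothetical baseline: {efficiency:.0%}")
```

Under these assumptions, the reported run corresponds to roughly 100 million bases per second across the cluster, or about 50 thousand bases per second per core. The efficiency value depends entirely on the chosen baseline, so it should be recomputed with the baseline actually used in the paper's evaluation.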

Published in

ACM SIGPLAN Notices, Volume 53, Issue 1 (PPoPP '18)
January 2018, 426 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3200691

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2018, 442 pages
ISBN: 9781450349826
DOI: 10.1145/3178487

Copyright © 2018 ACM

Publisher: Association for Computing Machinery, New York, NY, United States