research-article
Open Access
Artifacts Available
Artifacts Evaluated & Functional

Seq: a high-performance language for bioinformatics

Published: 10 October 2019

Abstract

The scope and scale of biological data are increasing at an exponential rate, as technologies like next-generation sequencing are becoming radically cheaper and more prevalent. Over the last two decades, the cost of sequencing a genome has dropped from $100 million to nearly $100, a factor of over 10^6, and the amount of data to be analyzed has increased proportionally. Yet, as Moore’s Law continues to slow, computational biologists can no longer rely on computing hardware to compensate for the ever-increasing size of biological datasets. In a field where many researchers are primarily focused on biological analysis over computational optimization, the unfortunate solution to this problem is often to simply buy larger and faster machines.

Here, we introduce Seq, the first language tailored specifically to bioinformatics, which marries the ease and productivity of Python with C-like performance. Seq starts with a subset of Python—and is in many cases a drop-in replacement—yet also incorporates novel bioinformatics- and computational genomics-oriented data types, language constructs and optimizations. Seq enables users to write high-level, Pythonic code without having to worry about low-level or domain-specific optimizations, and allows for the seamless expression of the algorithms, idioms and patterns found in many genomics or bioinformatics applications. We evaluated Seq on several standard computational genomics tasks like reverse complementation, k-mer manipulation, sequence pattern matching and large genomic index queries. On equivalent CPython code, Seq attains a performance improvement of up to two orders of magnitude, and a 160× improvement once domain-specific language features and optimizations are used. With parallelism, we demonstrate up to a 650× improvement. Compared to optimized C++ code, which is already difficult for most biologists to produce, Seq frequently attains up to a 2× improvement, and with shorter, cleaner code. Thus, Seq opens the door to an age of democratization of highly-optimized bioinformatics software.
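For readers unfamiliar with the evaluation tasks named above, the following is a minimal plain-Python sketch of two of them, reverse complementation and k-mer enumeration. It is an illustration only: it is not Seq syntax and not the authors' benchmark code, and the function names `reverse_complement` and `kmers` are hypothetical names chosen for this example.

```python
# Plain-Python sketch of two tasks named in the abstract.
# Illustrative only: not Seq syntax and not the authors' benchmark code.

# Translation table mapping each DNA base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    # Complement every base, then reverse the result.
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int):
    """Yield every length-k substring (k-mer) of seq, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


if __name__ == "__main__":
    read = "AACGT"
    print(reverse_complement(read))  # ACGTT
    print(list(kmers(read, 3)))      # ['AAC', 'ACG', 'CGT']
```

Code of this shape is the kind of high-level baseline that the abstract's CPython comparison refers to; the reported speedups come from the paper's own measurements, not from this sketch.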
