Abstract
The scope and scale of biological data are increasing at an exponential rate, as technologies like next-generation sequencing are becoming radically cheaper and more prevalent. Over the last two decades, the cost of sequencing a genome has dropped from $100 million to nearly $100 (a factor of over 10^6), and the amount of data to be analyzed has increased proportionally. Yet, as Moore’s Law continues to slow, computational biologists can no longer rely on computing hardware to compensate for the ever-increasing size of biological datasets. In a field where many researchers focus primarily on biological analysis rather than computational optimization, the unfortunate solution to this problem is often to simply buy larger and faster machines.
Here, we introduce Seq, the first language tailored specifically to bioinformatics, which marries the ease and productivity of Python with C-like performance. Seq starts with a subset of Python (and is in many cases a drop-in replacement), yet also incorporates novel bioinformatics- and computational genomics-oriented data types, language constructs and optimizations. Seq enables users to write high-level, Pythonic code without having to worry about low-level or domain-specific optimizations, and allows for the seamless expression of the algorithms, idioms and patterns found in many genomics or bioinformatics applications. We evaluated Seq on several standard computational genomics tasks: reverse complementation, k-mer manipulation, sequence pattern matching and large genomic index queries. Compared to equivalent CPython code, Seq attains a performance improvement of up to two orders of magnitude, and a 160× improvement once domain-specific language features and optimizations are used. With parallelism, we demonstrate up to a 650× improvement. Compared to optimized C++ code, which is already difficult for most biologists to produce, Seq frequently attains up to a 2× improvement, with shorter, cleaner code. Thus, Seq opens the door to an age of democratization of highly optimized bioinformatics software.
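To make two of the benchmark tasks named above concrete, here is a minimal sketch of reverse complementation and k-mer extraction in plain Python. It is only an illustration of the kind of high-level code the abstract has in mind as the CPython baseline: it does not use Seq's own data types or syntax, and the function names `reverse_complement`, `kmers` and `canonical_kmers` are hypothetical, not taken from the paper.

```python
# Plain-Python sketch (NOT Seq syntax) of two tasks named in the abstract.
# All names below are illustrative assumptions, not the paper's API.

# Translation table mapping each DNA base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int):
    """Yield every length-k substring (k-mer) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def canonical_kmers(seq: str, k: int):
    """Yield, for each k-mer, the lexicographically smaller of the
    k-mer and its reverse complement."""
    for kmer in kmers(seq, k):
        yield min(kmer, reverse_complement(kmer))


if __name__ == "__main__":
    print(reverse_complement("AACG"))
    print(list(kmers("ACGTA", 3)))
    print(list(canonical_kmers("ACGTA", 3)))
```

Loops like these are short to write but slow under an interpreter on genome-scale input, which is the gap the abstract describes Seq as closing.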
Index Terms
Seq: a high-performance language for bioinformatics