Abstract
High-quality precision medicine requires a thorough understanding of a patient's genetic composition. In particular, the unique variations in an individual's genome must be identified in order to specify the necessary treatment. A variant calling workflow is a pipeline of state-of-the-art software tools for alignment, sorting, and variant calling on whole genome sequencing (WGS) data; it identifies the variations in an individual's genome relative to a reference genome. Currently, such workflows run on high-performance computers (with additional GPUs or FPGAs) or on cloud computers. These systems are large and costly, and they rely on the internet for genome data transfer, which makes them unusable in remote locations without internet connectivity. Processing the data at a different facility also raises privacy concerns.
To overcome these limitations, we present, for the first time, SWARAM: a cost-efficient, offline, scalable, portable, and energy-efficient computing system for variant calling workflow processing. The system uses a novel architecture and algorithms that match against partial reference genomes, so that it can work within the smaller memory sizes typically available in tiny processing systems. Extensive tests on a standard benchmark dataset (NA12878 Illumina platinum genome) confirm that the time SWARAM takes to transfer the data and complete the variant calling workflow is competitive with that of a 32-core Intel Xeon server at similar accuracy, while SWARAM costs less than a fifth as much and consumes less than 40% of the server's energy. The original scripts and code we developed for executing the variant calling workflow on SWARAM are available in the associated GitHub repository: https://github.com/Rammohanty/swaram.
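The idea of matching against partial reference genomes can be illustrated with a small sketch. This is not the SWARAM algorithm, which the abstract does not detail; it only shows, under simplifying assumptions, how a reference can be split into overlapping chunks that each fit in a small memory, searched one chunk at a time, and the hits mapped back to global coordinates. The function names (`split_reference`, `find_in_chunk`, `map_reads`) are hypothetical, and exact substring search stands in for a real aligner.

```python
from typing import Dict, Iterator, List, Set, Tuple


def split_reference(reference: str, chunk_len: int, overlap: int) -> Iterator[Tuple[int, str]]:
    """Yield (global_offset, chunk) pairs; consecutive chunks share `overlap` bases."""
    if chunk_len <= overlap:
        raise ValueError("chunk_len must exceed overlap")
    step = chunk_len - overlap
    for start in range(0, len(reference), step):
        yield start, reference[start:start + chunk_len]
        if start + chunk_len >= len(reference):
            break  # the last chunk already reaches the end of the reference


def find_in_chunk(chunk: str, read: str) -> List[int]:
    """Return every local position at which `read` occurs exactly in `chunk`."""
    hits = []
    pos = chunk.find(read)
    while pos != -1:
        hits.append(pos)
        pos = chunk.find(read, pos + 1)
    return hits


def map_reads(reference: str, reads: List[str], chunk_len: int, overlap: int) -> Dict[str, List[int]]:
    """Match reads against one partial reference at a time and merge the hits.

    Assumption: overlap >= len(read) - 1 for every read, so a read that spans
    a chunk boundary is still fully contained in at least one chunk.
    """
    found: Dict[str, Set[int]] = {read: set() for read in reads}
    for offset, chunk in split_reference(reference, chunk_len, overlap):
        # Only this chunk needs to be resident in memory during the search.
        for read in reads:
            for local in find_in_chunk(chunk, read):
                found[read].add(offset + local)  # a set removes hits seen twice in overlaps
    return {read: sorted(positions) for read, positions in found.items()}


if __name__ == "__main__":
    ref = "ACGTACGTTAGCCGATAGGCTA"
    print(map_reads(ref, ["CGTTAG", "GATAGG", "TTTT"], chunk_len=10, overlap=5))
```

The overlap is the key design choice in this sketch: without it, a read straddling two chunks would be missed, and with it, duplicate hits must be merged, which the set handles here.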
SWARAM: Portable Energy and Cost Efficient Embedded System for Genomic Processing