Abstract
The data deluge in medical image processing requires faster and more efficient systems. Owing to recent advances in heterogeneous architectures, there has been a resurgence in research aimed at domain-specific accelerators. In this article, we develop an experimental system, SuperDragon, for evaluating the acceleration of EMAN, a single-particle cryo-electron microscopy (Cryo-EM) 3D reconstruction package, on a hybrid parallel architecture of CPUs, GPUs, and FPGAs. Based on a comprehensive workload characterization, we exploit multigrained parallelism in the Cryo-EM 3D reconstruction algorithm and investigate a proper computational mapping to the underlying heterogeneous architecture. The package is restructured with task-level (MPI), thread-level (OpenMP), and data-level (GPU and FPGA) parallelism. In particular, the proposed FPGA accelerator is a stream architecture that emphasizes the importance of optimizing computing-dominated data access patterns. In addition, the configurable computing streams are constructed by arranging the hardware modules and bypassing channels to form a linear deep pipeline. Compared with the multicore (six-core) program, the GPU and FPGA implementations achieve speedups of 8.4 and 2.25 times in execution time and improve power efficiency by factors of 7.2 and 14.2, respectively.
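The reported figures can be combined into a rough estimate of relative power draw. The sketch below assumes that "power efficiency" means performance per watt, which the abstract does not define. Under that assumption, power relative to the six-core baseline equals the speedup divided by the efficiency gain. The function names and the `REPORTED` table are illustrative and do not come from the paper.

```python
# Assumption (not stated in the abstract): power efficiency = performance per watt.
# Then efficiency_gain = speedup / relative_power, which rearranges to
# relative_power = speedup / efficiency_gain, all relative to the six-core CPU run.


def relative_power(speedup: float, efficiency_gain: float) -> float:
    """Average power draw relative to the baseline, under the assumption above."""
    if speedup <= 0 or efficiency_gain <= 0:
        raise ValueError("speedup and efficiency_gain must be positive")
    return speedup / efficiency_gain


def relative_energy(efficiency_gain: float) -> float:
    """Energy per run relative to the baseline (power x time), same assumption."""
    if efficiency_gain <= 0:
        raise ValueError("efficiency_gain must be positive")
    return 1.0 / efficiency_gain


# Figures quoted in the abstract: (speedup, power-efficiency gain).
REPORTED = {"GPU": (8.4, 7.2), "FPGA": (2.25, 14.2)}

if __name__ == "__main__":
    for name, (speedup, gain) in REPORTED.items():
        print(
            f"{name}: ~{relative_power(speedup, gain):.2f}x baseline power, "
            f"~{relative_energy(gain):.2f}x baseline energy"
        )
```

Read this way, the GPU version would draw somewhat more power than the CPU baseline but finish much sooner, while the FPGA version would draw a small fraction of the baseline power. These are derived estimates under the stated assumption, not measurements reported by the authors.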
Supplemental Material
Supplemental movie, appendix, image, and software files for SuperDragon: A Heterogeneous Parallel System for Accelerating 3D Reconstruction of Cryo-Electron Microscopy Images