Abstract
The modern processor landscape is varied and diverse, so developers need a way to quickly and fairly compare devices for use with particular applications. This article expands the authors' previously published computational-density metrics and presents an analysis of a new generation of device architectures, including CPUs, DSPs, FPGAs, GPUs, and hybrid architectures. New memory metrics are also added to the existing suite to characterize the memory resources of various processing devices. Finally, a new relational metric, realizable utilization (RU), is introduced, which quantifies the fraction of the computational-density metric that an application achieves in a given implementation. The RU metric provides valuable feedback to application developers and architecture designers by highlighting the upper bound on application-specific optimization and providing a quantifiable measure of theoretical versus realizable performance. Overall, the analysis in this article quantifies the performance tradeoffs among the architectures studied, the memory characteristics of different device types, and the efficiency of device architectures.
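As defined above, RU is the ratio of the performance an implementation actually achieves to the device's computational-density upper bound. A minimal sketch of that ratio follows; the function name, parameter names, and GOPS units are illustrative assumptions, not taken from the article:

```python
def realizable_utilization(achieved_gops: float, computational_density_gops: float) -> float:
    """Fraction of a device's computational density (CD) that an
    application implementation actually achieves, in [0.0, 1.0].

    Names and units here are illustrative; the article defines RU only
    as achieved performance divided by the CD metric for the device.
    """
    if computational_density_gops <= 0:
        raise ValueError("computational density must be positive")
    return achieved_gops / computational_density_gops

# An implementation sustaining 350 GOPS on a device whose CD metric is
# 1000 GOPS realizes 35% of the theoretical upper bound.
print(realizable_utilization(350.0, 1000.0))  # 0.35
```

Because RU is bounded above by 1.0, it directly exposes how much headroom (if any) remains for further optimization of a given implementation on a given device.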
Supplemental Material
Available for Download
Supplemental movie, appendix, image, and software files for "Analysis of Fixed, Reconfigurable, and Hybrid Devices with Computational, Memory, I/O, & Realizable-Utilization Metrics"