ABSTRACT
Future mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, it remains a keen challenge to program such a heterogeneous multicore platform, since these specialized accelerators feature ISAs and functionality that are significantly different from the general-purpose CPU cores. In this paper, we present EXOCHI: (1) Exoskeleton Sequencer (EXO), an architecture to represent heterogeneous accelerators as ISA-based MIMD architecture resources, and a shared virtual memory heterogeneous multithreaded program execution model that tightly couples specialized accelerator cores with general-purpose CPU cores, and (2) C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages. The CHI compiler extends the OpenMP pragma for heterogeneous multithreading programming, and produces a single fat binary with code sections corresponding to different instruction sets. The runtime can judiciously spread parallel computation across the heterogeneous cores to optimize performance and power.
We have prototyped the EXO architecture on a physical heterogeneous platform consisting of an Intel® Core™ 2 Duo processor and an 8-core 32-thread Intel® Graphics Media Accelerator X3000. In addition, we have implemented the CHI integrated programming environment with the Intel® C++ Compiler, runtime toolset, and debugger. On the EXO prototype system, we have enhanced a suite of production-quality media kernels for video and image processing to utilize the accelerator through the CHI programming interface, achieving significant speedup (1.41X to 10.97X) over execution on the IA32 CPU alone.