ABSTRACT
Communication and multimedia applications with increased data rates and enhanced functionality continuously raise the bar for the computational requirements of future microprocessors. To meet these computational demands, it is necessary to exploit sub-word parallelism efficiently. We propose to make sub-word data movement a first-class operation in microprocessor architectures by introducing a Sub-word Permutation Unit (SPU) in the execution pipeline. The SPU is evaluated in the context of the MMX media co-processor for the Intel Pentium architectures, but our results can be extended to any processor that supports sub-word parallelism. We find that the SPU allows us to orchestrate sub-word data placement prior to computation, thus allowing the MMX functional units to concentrate on performing calculations. Furthermore, we introduce a decoupled SPU control mechanism at the basic block level, which allows static optimization to eliminate data-movement overhead in tight loops, where most media and signal processing occurs. We demonstrate that improvements of 4% to 20% can be obtained on key media and signal processing kernels with as little as a 1% increase in hardware resources.
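To make the idea of orchestrating sub-word placement before computation concrete, the following is a minimal software sketch, not the SPU design itself. It assumes a 64-bit word split into four 16-bit lanes; the names `pack`, `unpack`, `permute`, and `packed_add` are hypothetical helpers introduced only for illustration.

```python
MASK16 = 0xFFFF  # assumption: four 16-bit sub-words per 64-bit word


def pack(lanes):
    """Pack four 16-bit lanes (lane 0 = least significant) into one word."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & MASK16) << (16 * i)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (16 * i)) & MASK16 for i in range(4)]


def permute(word, control):
    """Hypothetical permutation step: output lane i takes input lane control[i]."""
    lanes = unpack(word)
    return pack([lanes[c] for c in control])


def packed_add(a, b):
    """Lane-wise 16-bit wrapping add, in the spirit of a packed add."""
    return pack([(x + y) & MASK16 for x, y in zip(unpack(a), unpack(b))])


a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])

# Reorder b's lanes first, so the arithmetic step pairs a[i] with b[3 - i]
# without spending any of its own work on data movement.
result = unpack(packed_add(a, permute(b, [3, 2, 1, 0])))
```

In this sketch the permutation is a separate step from the arithmetic, which mirrors the separation the abstract describes: placement is settled first, and the packed operation only computes.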
Index Terms
Efficient orchestration of sub-word parallelism in media processors