DOI: 10.1145/1007912.1007946 (SPAA conference proceedings)
Article

Efficient orchestration of sub-word parallelism in media processors

Published: 27 June 2004

ABSTRACT

Communication and multimedia applications with increased data rates and enhanced functionality continuously raise the bar for the computational requirements of future microprocessors. In order to meet these computational demands it is necessary to exploit sub-word parallelism efficiently. We propose to make sub-word data movement a first-class operation in microprocessor architectures by introducing a Sub-word Permutation Unit (SPU) in the execution pipeline. The SPU is evaluated in the context of the MMX media co-processor for the Intel Pentium architectures, but our results can be extended to any processor that supports sub-word parallelism. We find that the SPU allows us to orchestrate sub-word data placement prior to computation, thus allowing the MMX functional units to concentrate on performing calculations. Furthermore, we introduce a decoupled SPU control mechanism at the basic block level which allows static optimization to eliminate data-movement overhead in tight loops, where most media and signal processing occurs. We demonstrated that anywhere from 4% to 20% improvement can be obtained on key media and signal processing kernels with as little as 1% increase in hardware resources.
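
To make the idea of sub-word data placement concrete, the following is a minimal software sketch, not the SPU design from the paper. It assumes a 64-bit word holding four 16-bit sub-words (one MMX-style layout) and models a permutation as a list of source-lane indices. The function names (`unpack4x16`, `pack4x16`, `permute4x16`, `packed_add16`) and the control encoding are hypothetical, chosen only for illustration.

```python
MASK16 = 0xFFFF  # one 16-bit sub-word


def unpack4x16(word):
    """Split a 64-bit word into four 16-bit lanes; lane 0 is least significant."""
    return [(word >> (16 * i)) & MASK16 for i in range(4)]


def pack4x16(lanes):
    """Combine four 16-bit lanes into one 64-bit word; lane 0 is least significant."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & MASK16) << (16 * i)
    return word


def permute4x16(word, control):
    """Rearrange sub-words: output lane i takes input lane control[i].

    The control list is an assumed encoding, not the paper's control format.
    """
    lanes = unpack4x16(word)
    return pack4x16([lanes[src] for src in control])


def packed_add16(a, b):
    """Lane-wise 16-bit wrap-around add, analogous to a packed-word add."""
    return pack4x16(
        [(x + y) & MASK16 for x, y in zip(unpack4x16(a), unpack4x16(b))]
    )


# Place the data first (reverse the lanes of b), then compute.
a = pack4x16([1, 2, 3, 4])
b = pack4x16([10, 20, 30, 40])
b_reversed = permute4x16(b, [3, 2, 1, 0])
result = unpack4x16(packed_add16(a, b_reversed))
print(result)  # [41, 32, 23, 14]
```

In this sketch the permutation is a separate step from the arithmetic, which mirrors the abstract's point that data placement can be handled before computation so that the arithmetic step only performs calculations.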
