
An efficient architecture for loop based data preloading