ABSTRACT
This paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned into large sections, the bottom-level algorithm must more carefully analyze program structure when producing the final schedule.Our analysis reveals that at this bottom level, good scheduling depends upon carefully balancing instruction contention for processing elements and operand latency between producer and consumer instructions. We develop a parameterizable instruction scheduler that more effectively optimizes this trade-off. We use this scheduler to determine the contention-latency sweet spot that generates the best instruction schedule for each application. To avoid this application-specific tuning, we also determine the parameters that produce the best performance across all applications. The result is a contention-latency setting that generates instruction schedules for all applications in our workload that come within 17% of the best schedule for each.
- S.P. Amarasinghe and M.S. Lam. Communication optimization and code generation for distributed memory machines. In Proceedings of the Conference on Programming Language Design and Implementation, 1993.]] Google Scholar
Digital Library
- J.M. Anderson and M.S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the Conference on Programming Language Design and Implementation, 1993.]] Google Scholar
Digital Library
- Arvind and R. Nikhil. Executing a program on the mit tagged-token dataflow architecture. IEEE Transactions on Computers, 39(3), 1990.]] Google Scholar
Digital Library
- D. Buell et al. Splash 2: FPGAs in a Custom Computing Machine. IEEE Computer Society, 1996.]]Google Scholar
- T.M. Chilimbi, M.D. Hill, and J.R. Larus. Cache-conscious structure layout. In Proceedings of the Conference on Programming Language Design and Implementation, 1999.]] Google Scholar
Digital Library
- K. Coons, X. Chen, S. Kushwaha, K. McKinley, and D. Burger. A spatial path scheduling algorithm for EDGE architectures. In Symposium on Architectural Support for Programming Languages and Operating Systems, 2006.]] Google Scholar
Digital Library
- D.E. Culler, A. Sah, K.E. Schauser, T. von Eicken, and J. Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1991.]] Google Scholar
Digital Library
- A.L. Davis. The architecure and system method of DDM1: A recursively structured data driven machine. In Proceedings of the Annual Symposium on Computer Architecture, Palo Alto, California, April 3-5, 1978. IEEE Computer Society and ACM SIGARCH.]] Google Scholar
Digital Library
- J.B. Dennis. A preliminary architecture for a basic dataflow processor. In Proceedings of the Symposium on Computer Architecture, 1975.]] Google Scholar
Digital Library
- G. Desoli. Instruction assignment for clustered VLIW DSP compilers: A new approach. Technical Report HPL-98-13, Hewlett-Packard Laboratories, January 1998.]]Google Scholar
- J. Ellis. Bulldog: A Compiler for VLIW Architectures. PhD thesis, MIT, 1986.]] Google Scholar
Digital Library
- J.R. Ellis. Bulldog: A Compiler for VLIW Architectures. ACM doctoral dissertation award; 1985. The MIT Press, 1986.]] Google Scholar
Digital Library
- P. Faraboschi, G. Brown, J.A. Fisher, G. Desoli, and F. Homewood. Lx: A technology platform for customizable VLIW embedded processing. In International Symposium on Computer Architecture, 2000.]] Google Scholar
Digital Library
- V.G. Grafe, G.S. Davidson, J.E. Hoch, and V.P. Holmes. The Epsilon dataflow processor. In Proceedings of the International Symposium on Computer Architecture, 1989.]] Google Scholar
Digital Library
- J.R. Gurd, C.C. Kirkham, and I. Watson. The Manchester prototype dataflow computer. Communications of the ACM, 28(1), 1985.]] Google Scholar
Digital Library
- D.R. Kerns and S.J. Eggers. Balanced scheduling: Instruction scheduling when memory latency is uncertain. In Proceedings of the Conference on Programming Language Design and Implementation, 1993.]] Google Scholar
Digital Library
- M. Kishi, H. Yasuhara, and Y. Kawamura. DDDP-A distributed data driven processor. In Proceedings of the International Symposium on Computer Architecture, 1983.]] Google Scholar
Digital Library
- W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe. Space-time scheduling of instruction-level parallelism on a Raw machine. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1998.]] Google Scholar
Digital Library
- W. Lee et al. Space-time scheduling of instruction-level parallelism on a Raw machine. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating, 1998.]] Google Scholar
Digital Library
- W. Lee, D. Puppin, S. Swenson, and S. Amarasinghe. Convergent scheduling. In Proceedings of the International Symposium on Microarchitecture, 2002.]] Google Scholar
Digital Library
- J.L. Lo and S.J. Eggers. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism. In Proceedings of the Conference on Programming Language Design and Implementation, 1995.]] Google Scholar
Digital Library
- P.G. Lowney, S.M. Freudenberger, T.J. Karzes, W.D. Lichtenstein, R.P. Nix, J.S. O'Donnell, and J. Ruttenberg. The multiflow trace scheduling compiler. J. Supercomputing, 1993.]] Google Scholar
Digital Library
- K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz. Smart memories: A modular reconfigurable architecture. In International Symposium on Computer Architecture, 2002.]] Google Scholar
Digital Library
- R. Nagarajan, S.K. Kushwaha, D. Burger, K.S. McKinley, C. Lin, and S.W. Keckler. Static placement, dynamic issue (SPDI) scheduling for EDGE architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004.]] Google Scholar
Digital Library
- R. Nagarajan, K. Sankaralingam, D. Burger, and S. Keckler. A design space evaluation of grid processor architectures. In Proceedings of the International Symposium on Microarchitecture, 2001.]] Google Scholar
Digital Library
- E. Özer, S. Banerjia, and T.M. Conte. Unified assign and schedule: A new approach to scheduling for clustered register file microarchitectures. In Proceedings of the International Symposium on Microarchitecture, 1998.]] Google Scholar
Digital Library
- G. Papadopoulos and D. Culler. Monsoon: An explicit token-store architecture. In Proceedings of the International Symposium on Computer Architecture, 1990.]] Google Scholar
Digital Library
- G.M. Papadopoulos and K.R. Traub. Multithreading: A revisionist view of dataflow architectures. In Proceedings of the International Symposium on Computer Architecture, 1991.]] Google Scholar
Digital Library
- T.A. Proebsting and C.N. Fischer. Linear-time, optimal code scheduling for delayed-load architectures. In Proceedings of the Conference on Programming Language Design and Implementation, 1991.]] Google Scholar
Digital Library
- G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In Proceedings of the Conference on Programming Language Design and Implementation, 1998.]] Google Scholar
Digital Library
- S. Sakai, y. Yamaguchi, K. Hiraki, Y. Kodama, and T. Yuba. An architecture of a dataflow single chip processor. In Proceedings of the International Symposium on Computer Architecture, 1989.]] Google Scholar
Digital Library
- J. Sanchez and A. Gonzalez. Instruction scheduling for clustered VLIW architectures. In Proceedings of the International Symposium on System Synthesis, 2000.]] Google Scholar
Digital Library
- T. Shimada, K. Hiraki, K. Nishida, and S. Sekiguchi. Evaluation of a prototype data flow processor of the sigma-1 for scientific computations. In Proceedings of the International Symposium on Computer Architecture, 1986.]] Google Scholar
Digital Library
- Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In Proceedings of the Conference on Programming Language Design and Implementation, 1999.]] Google Scholar
Digital Library
- SPEC. Spec CPU 2000 benchmark specifications. SPEC2000 Benchmark Release, 2000.]]Google Scholar
- S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. In Proceedings of the International Symposium on Microarchitecture, 2003.]] Google Scholar
Digital Library
- S. Swanson, A. Putnam, M. Mercaldi, K. Michelson, A. Petersen, A. Schwerin, M. Oskin, and S. Eggers. Area-performance trade-offs in tiled dataflow architectures. In Proceedings of the International Symposium on Computer Architecture, 2006.]] Google Scholar
Digital Library
- R. von Hanxleden and K. Kennedy. Give-n-take - a balanced code placement framework. In Proceedings of the Conference on Programming Language Design and Implementation, 1994.]] Google Scholar
Digital Library
- E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software: Raw machines. Computer, 30(9), 1997.]] Google Scholar
Digital Library
- K. Wilken, J. Liu, and M. Heffernan. Optimal instruction scheduling using integer programming. In Proceedings of the Conference on Programming Language Design and Implementation, 2000.]] Google Scholar
Digital Library
- M.E. Wolf and M.S. Lam. A data locality optimizing algorithm. In Proceedings of the Conference on Programming Language Design and Implementation, 1991.]] Google Scholar
Digital Library
- T. Yang and A. Gerasoulis. PYRROS: static task scheduling and code generation for message passing multiprocessors. In Proceedings of the International Conference on Supercomputing, 1992.]] Google Scholar
Digital Library
- J. Zalamea, J. Llosa, E. Ayguade, and M. Valero. Modulo scheduling with integrated register spilling for clustered vliw architectures. In Proceedings of International Symposium on Microarchitecture, 2001.]] Google Scholar
Digital Library
Index Terms
Instruction scheduling for a tiled dataflow architecture
Recommendations
Instruction scheduling for a tiled dataflow architecture
Proceedings of the 2006 ASPLOS ConferenceThis paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned ...
Instruction scheduling for a tiled dataflow architecture
Proceedings of the 2006 ASPLOS ConferenceThis paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned ...
Instruction scheduling for a tiled dataflow architecture
Proceedings of the 2006 ASPLOS ConferenceThis paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned ...








Comments