ABSTRACT
Clustered architectures address the bottleneck that centralized register files create in superscalar and VLIW processors. Their main challenge is compiler support: operations must be partitioned effectively across the resources available on each cluster. In this work, we present a novel technique for clustering operations based on graph partitioning methods. Our approach introduces new methods of assigning weights to the nodes and edges of the dataflow graph to guide the partitioner. Node weights reflect each operation's resource usage within a cluster, while a slack distribution method assigns edge weights that reflect the cost of inserting intercluster moves. A multilevel graph partitioning algorithm, which globally divides a dataflow graph into multiple parts in a hierarchical manner, uses these weights to efficiently estimate partition quality. On a four-cluster architecture, our algorithm achieves an average improvement of 20% on DSP kernels and 5% on SPECint2000.
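The idea of slack-driven edge weights can be illustrated with a minimal sketch. It is not the paper's slack distribution method: it simply computes, for each dataflow edge, how many cycles the consumer could be delayed without lengthening the critical path, and then applies a hypothetical two-level rule (the `high` and `low` values, the `move_latency` parameter, and all function names are assumptions made for illustration). Edges with no room to absorb an intercluster move get a large weight, so a partitioner that minimizes cut weight would tend to keep their endpoints on the same cluster.

```python
from collections import defaultdict


def edge_slack(latency, edges):
    """Return {(producer, consumer): slack in cycles} for a dataflow DAG.

    latency: {op: cycles}; edges: list of (producer, consumer) pairs.
    """
    succs = defaultdict(list)
    indeg = {op: 0 for op in latency}
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1

    # Topological order (Kahn's algorithm).
    order, ready = [], [op for op in latency if indeg[op] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succs[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # Earliest start time: longest latency-weighted path from any source.
    early = {op: 0 for op in latency}
    for u in order:
        for v in succs[u]:
            early[v] = max(early[v], early[u] + latency[u])

    # Latest start time that keeps the critical path length unchanged.
    length = max(early[op] + latency[op] for op in latency)
    late = {op: length - latency[op] for op in latency}
    for u in reversed(order):
        for v in succs[u]:
            late[u] = min(late[u], late[v] - latency[u])

    return {(u, v): late[v] - early[u] - latency[u] for u, v in edges}


def edge_weights(slack, move_latency=1, high=10, low=1):
    # Hypothetical rule: an edge that cannot absorb a move is costly to cut.
    return {e: (high if s < move_latency else low) for e, s in slack.items()}


def cut_cost(weights, cluster_of):
    # Total weight of edges whose endpoints sit on different clusters.
    return sum(w for (u, v), w in weights.items()
               if cluster_of[u] != cluster_of[v])


# Diamond-shaped graph: a -> b -> d is critical, a -> c -> d has one cycle of slack.
latency = {"a": 1, "b": 2, "c": 1, "d": 1}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
weights = edge_weights(edge_slack(latency, edges))

off_critical = cut_cost(weights, {"a": 0, "b": 0, "c": 1, "d": 0})  # 2
on_critical = cut_cost(weights, {"a": 0, "b": 1, "c": 0, "d": 0})   # 20
```

In this toy graph, moving the off-critical operation `c` to another cluster cuts only low-weight edges, while moving `b` cuts both critical edges, which is the kind of distinction the edge weights are meant to expose to the partitioner.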
Region-based hierarchical operation partitioning for multicluster processors





