DOI: 10.1145/781131.781165 (PLDI conference proceedings)

Region-based hierarchical operation partitioning for multicluster processors

ABSTRACT

Clustered architectures are a solution to the bottleneck of centralized register files in superscalar and VLIW processors. The main challenge associated with clustered architectures is compiler support to effectively partition operations across the available resources on each cluster. In this work, we present a novel technique for clustering operations based on graph partitioning methods. Our approach incorporates new methods of assigning weights to nodes and edges within the dataflow graph to guide the partitioner. Nodes are assigned weights to reflect their resource usage within a cluster, while a slack distribution method intelligently assigns weights to edges to reflect the cost of inserting moves across clusters. A multilevel graph partitioning algorithm, which globally divides a dataflow graph into multiple parts in a hierarchical manner, uses these weights to efficiently generate estimates for the quality of partitions. We found that our algorithm was able to achieve an average of 20% improvement in DSP kernels and 5% improvement in SPECint2000 for a four-cluster architecture.
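The abstract does not give the weighting formulas, so the following is only a minimal sketch of the general idea that edges with little scheduling slack should be expensive to cut. It is not the paper's slack distribution method: it computes a naive per-edge slack from earliest and latest start times and does not divide shared slack among the edges of a path. The function names, the `move_latency` and `penalty` parameters, and the example graph are all assumptions made for illustration.

```python
from collections import defaultdict


def edge_slacks(latency, edges):
    """Return {(u, v): slack} for a dataflow DAG.

    latency: {op: cycles}; edges: list of (producer, consumer) pairs.
    """
    succs, preds = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    # Topological order (Kahn's algorithm).
    indeg = {n: len(preds[n]) for n in latency}
    ready = [n for n in latency if indeg[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    # Earliest start: longest latency-weighted path from any source.
    asap = {n: 0 for n in latency}
    for n in order:
        for s in succs[n]:
            asap[s] = max(asap[s], asap[n] + latency[n])
    length = max(asap[n] + latency[n] for n in latency)

    # Latest start that does not lengthen the critical path.
    alap = {n: length - latency[n] for n in latency}
    for n in reversed(order):
        for s in succs[n]:
            alap[n] = min(alap[n], alap[s] - latency[n])

    return {(u, v): alap[v] - asap[u] - latency[u] for u, v in edges}


def edge_weights(slacks, move_latency=1, penalty=10):
    """Hypothetical weighting: an edge costs more when its slack
    cannot absorb an intercluster move of move_latency cycles."""
    return {e: 1 + penalty * max(0, move_latency - s) for e, s in slacks.items()}


def cut_cost(weights, cluster_of):
    """Total weight of edges whose endpoints sit in different clusters."""
    return sum(w for (u, v), w in weights.items() if cluster_of[u] != cluster_of[v])


if __name__ == "__main__":
    # Hypothetical four-operation diamond; b is the long-latency operation.
    latency = {"a": 1, "b": 2, "c": 1, "d": 1}
    edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")]
    weights = edge_weights(edge_slacks(latency, edges))
    print(cut_cost(weights, {"a": 0, "b": 0, "c": 1, "d": 0}))
    print(cut_cost(weights, {"a": 0, "b": 1, "c": 0, "d": 0}))
```

A partitioner driven by such weights would prefer to move the off-critical-path operation `c` to another cluster rather than `b`, because cutting the edges around `c` is cheaper. Node weights for resource usage and the multilevel coarsening and refinement steps mentioned in the abstract are outside the scope of this sketch.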
