A constraint programming approach for integrated spatial and temporal scheduling for clustered architectures

Published: 05 September 2013

Abstract

Many embedded processors use clustering to scale up instruction-level parallelism in a cost-effective manner. In a clustered architecture, the registers and functional units are partitioned into smaller units and clusters communicate through register-to-register copy operations. Texas Instruments, for example, has a series of architectures for embedded processors which are clustered. Such an architecture places a heavier burden on the compiler, which must now assign instructions to clusters (spatial scheduling), assign instructions to cycles (temporal scheduling), and schedule copy operations to move data between clusters. We consider instruction scheduling of local blocks of code on clustered architectures to improve performance. Scheduling for space and time is known to be a hard problem. Previous work has proposed greedy approaches based on list scheduling to simultaneously perform spatial and temporal scheduling and phased approaches based on first partitioning a block of code to do spatial assignment and then performing temporal scheduling. Greedy approaches risk making mistakes that are then costly to recover from, and partitioning approaches suffer from the well-known phase ordering problem. In this article, we present a constraint programming approach for scheduling instructions on clustered architectures. We employ a problem decomposition technique that solves spatial and temporal scheduling in an integrated manner. We analyze the effect of different hardware parameters—such as the number of clusters, issue-width, and intercluster communication cost—on application performance. We found that our approach was able to achieve an improvement of up to 26%, on average, over a state-of-the-art technique on superblocks from SPEC 2000 benchmarks.
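The interaction between spatial and temporal scheduling described above can be illustrated with a small sketch. This is not the constraint programming model from the article: it is a brute-force enumeration over cluster assignments combined with a simple greedy list scheduler, written only to show why the two decisions are coupled. All names (`schedule_length`, `best_assignment`, the toy dependency graph) are hypothetical, and the machine model is an assumption: every cluster issues `issue_width` instructions per cycle, and an intercluster copy is modeled as `copy_cost` extra cycles of latency rather than as a separate instruction that occupies an issue slot.

```python
from itertools import product


def schedule_length(deps, latency, assign, issue_width, copy_cost):
    """Greedy list scheduling (temporal) for a fixed cluster assignment (spatial).

    deps maps each instruction to its predecessors; its keys are assumed to be
    in topological order. Returns the cycle at which the last result is ready.
    The greedy order is not guaranteed to be optimal for a given assignment.
    """
    start = {}
    used = {}  # (cluster, cycle) -> number of instructions already issued
    for ins in deps:
        cluster = assign[ins]
        ready = 0
        for pred in deps[ins]:
            avail = start[pred] + latency[pred]
            if assign[pred] != cluster:
                avail += copy_cost  # assumed model: a copy only adds latency
            ready = max(ready, avail)
        cycle = ready
        while used.get((cluster, cycle), 0) >= issue_width:
            cycle += 1  # issue slot is full, try the next cycle
        used[(cluster, cycle)] = used.get((cluster, cycle), 0) + 1
        start[ins] = cycle
    return max(start[i] + latency[i] for i in deps)


def best_assignment(deps, latency, n_clusters, issue_width, copy_cost):
    """Try every cluster assignment and keep the shortest greedy schedule."""
    names = list(deps)
    best = None
    for clusters in product(range(n_clusters), repeat=len(names)):
        assign = dict(zip(names, clusters))
        length = schedule_length(deps, latency, assign, issue_width, copy_cost)
        if best is None or length < best[0]:
            best = (length, assign)
    return best


# Toy block: two independent chains (a -> c, b -> d) that join in e.
deps = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"]}
latency = {name: 1 for name in deps}

one_cluster = schedule_length(deps, latency, {n: 0 for n in deps}, 1, 1)
cheap_copy = best_assignment(deps, latency, n_clusters=2, issue_width=1, copy_cost=1)
costly_copy = best_assignment(deps, latency, n_clusters=2, issue_width=1, copy_cost=10)
print(one_cluster, cheap_copy[0], costly_copy[0])
```

On this toy block, a single cluster with issue width 1 needs 5 cycles. With two clusters and a copy cost of 1, splitting the two chains across clusters shortens the schedule to 4 cycles. With a copy cost of 10, the best choice is to keep everything on one cluster again. Exhaustive enumeration grows exponentially with block size, which is one reason more scalable search techniques are of interest for this problem.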
