Abstract
Many embedded processors use clustering to scale up instruction-level parallelism in a cost-effective manner. In a clustered architecture, the registers and functional units are partitioned into smaller units called clusters, and the clusters communicate through register-to-register copy operations. Texas Instruments, for example, offers a series of clustered architectures for embedded processors. Such an architecture places a heavier burden on the compiler, which must now assign instructions to clusters (spatial scheduling), assign instructions to cycles (temporal scheduling), and schedule copy operations to move data between clusters. We consider instruction scheduling of local blocks of code on clustered architectures to improve performance. Scheduling for space and time is known to be a hard problem. Previous work has proposed two kinds of approaches: greedy approaches, based on list scheduling, that perform spatial and temporal scheduling simultaneously, and phased approaches that first partition a block of code to do the spatial assignment and then perform temporal scheduling. Greedy approaches risk making mistakes that are then costly to recover from, and partitioning approaches suffer from the well-known phase ordering problem. In this article, we present a constraint programming approach for scheduling instructions on clustered architectures. We employ a problem decomposition technique that solves spatial and temporal scheduling in an integrated manner. We analyze the effect of different hardware parameters (the number of clusters, the issue width, and the intercluster communication cost) on application performance. We found that our approach achieved an improvement of up to 26%, on average, over a state-of-the-art technique on superblocks from SPEC 2000 benchmarks.
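To make the interaction between spatial and temporal scheduling concrete, the following is a minimal Python sketch. It is not the constraint programming model of the article: it enumerates every cluster assignment by brute force and, for each one, builds a greedy temporal schedule. The function names, the uniform per-cluster issue width, and the modeling of a copy operation as extra latency on a cross-cluster dependency (rather than as an instruction that occupies an issue slot) are all assumptions made for illustration.

```python
from itertools import product


def schedule_length(deps, latency, assign, issue_width, copy_cost):
    """Greedy temporal schedule for a fixed cluster assignment.

    deps:     dict mapping an instruction to the list of its predecessors;
              instructions are numbered 0..n-1 in topological order.
    latency:  latency[i] is the latency of instruction i in cycles.
    assign:   assign[i] is the cluster of instruction i.
    Assumption: a cross-cluster dependency simply delays the consumer by
    copy_cost cycles; the copy does not occupy an issue slot.
    """
    n = len(latency)
    start = [0] * n
    used = {}  # (cluster, cycle) -> number of instructions issued there
    for i in range(n):
        # Earliest cycle at which every operand is available on i's cluster.
        ready = 0
        for p in deps.get(i, []):
            avail = start[p] + latency[p]
            if assign[p] != assign[i]:
                avail += copy_cost
            ready = max(ready, avail)
        # Delay further until the cluster has a free issue slot.
        cycle = ready
        while used.get((assign[i], cycle), 0) >= issue_width:
            cycle += 1
        used[(assign[i], cycle)] = used.get((assign[i], cycle), 0) + 1
        start[i] = cycle
    return max(start[i] + latency[i] for i in range(n))


def best_assignment(deps, latency, clusters, issue_width, copy_cost):
    """Try every cluster assignment and keep the shortest greedy schedule."""
    best = None
    for assign in product(range(clusters), repeat=len(latency)):
        length = schedule_length(deps, latency, assign, issue_width, copy_cost)
        if best is None or length < best[0]:
            best = (length, assign)
    return best


# Two independent chains (0 -> 1 and 2 -> 3) that join at instruction 4.
deps = {1: [0], 3: [2], 4: [1, 3]}
latency = [1, 1, 1, 1, 1]
cheap = best_assignment(deps, latency, clusters=2, issue_width=1, copy_cost=1)
costly = best_assignment(deps, latency, clusters=2, issue_width=1, copy_cost=3)
```

On this small example, a cheap copy makes it worthwhile to spread the two chains across clusters, while an expensive copy makes a single cluster the better choice. The enumeration grows exponentially with block size, which is one reason integrated scheduling calls for a more sophisticated search than this sketch.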