Abstract
Many modern embedded processors, such as DSPs, support partitioned memory banks (also called X--Y memory or dual-bank memory) along with parallel load/store instructions to achieve higher code density and performance. To utilize the parallel load/store instructions effectively, the compiler must partition the memory-resident values and assign each to the X or Y bank. This paper presents a post-register-allocation solution that merges the generated load/store instructions into their parallel counterparts; simultaneously, our framework assigns values to the X or Y memory bank. We first remove as many load/stores and register--register moves as possible using the iterated-coalescing register allocator of George and Appel [1996]. We then attempt to parallelize the remaining load/stores using a multipass approach. The basic phase of our approach attempts to merge load/stores without duplication or web splitting. We model this problem as a graph-coloring problem in which each value is colored either X or Y. We then construct a motion scheduling graph (MSG), based on the range of motion of each load/store instruction; the MSG captures which instructions could potentially be merged. We propose a notion of pseudofixed boundaries so that load/store motion is less constrained by register dependencies. We prove that the coloring problem for the MSG is NP-complete and solve it with two heuristic algorithms of differing complexity. We then propose a two-level iterative process that applies instruction duplication, variable duplication, web splitting, and local conflict elimination to merge the remaining load/stores. Finally, we clean up multiply-aliased load/stores. To improve performance further, we incorporate profiling information into each stage, with corresponding modifications to the algorithms. We show that our framework parallelizes a large number of load/stores without much growth in the data and code segments.
The average speedup of our optimization pass reaches roughly 13% without profile information and 17% with profile information. Average code- and data-segment growth is kept within 13%.
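The bank-assignment step described in the abstract can be illustrated with a small sketch. This is not the paper's actual MSG-based algorithm; it is a minimal greedy two-coloring heuristic, under the assumption that we are given a "merge graph" whose weighted edges connect pairs of variables that adjacent load/stores could access in parallel if, and only if, the two variables land in different banks. The function name `assign_banks` and the edge-weight interpretation (e.g., profile-derived frequencies) are illustrative assumptions.

```python
# Hypothetical sketch of X/Y bank assignment as greedy two-coloring.
# `variables` is a list of memory-resident values; `merge_edges` is a
# list of (u, v, w) triples meaning a pair of load/stores touching u
# and v could be fused into one parallel instruction worth weight w,
# provided u and v end up in *different* banks.
def assign_banks(variables, merge_edges):
    """Greedily place each variable in bank 'X' or 'Y' so that as much
    merge-edge weight as possible connects opposite banks."""
    # Total incident weight per variable; heaviest variables are placed
    # first so the most profitable merges are decided early.
    weight = {v: 0 for v in variables}
    for u, v, w in merge_edges:
        weight[u] += w
        weight[v] += w

    bank = {}
    for v in sorted(variables, key=lambda x: -weight[x]):
        # For each bank, sum the weight earned by placing v there,
        # i.e., the weight of edges to already-placed neighbors that
        # sit in the *opposite* bank.
        score = {'X': 0, 'Y': 0}
        for a, b, w in merge_edges:
            other = b if a == v else a if b == v else None
            if other is not None and other in bank:
                score['Y' if bank[other] == 'X' else 'X'] += w
        bank[v] = 'X' if score['X'] >= score['Y'] else 'Y'
    return bank
```

For example, with edges `('a','b',10)` and `('b','c',5)`, the heuristic places `b` first and then puts `a` and `c` in the opposite bank, satisfying both potential merges. The paper's actual framework goes well beyond this: it proves the MSG coloring problem NP-complete and layers duplication and web splitting on top of the basic coloring.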
References
- Aarts, E. and Korst, J. 1989. Simulated Annealing and Boltzmann Machines. Courier Int'l.
- Aho, A. V., Sethi, R., and Ullman, J. D. 1986. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, MA.
- Briggs, P., Cooper, K., and Torczon, L. 1994. Improvements to graph coloring register allocation. ACM Transactions on Programming Languages and Systems.
- Chaitin, G. J., Auslander, M. A., Chandra, A. K., Cocke, J., Hopkins, M. E., and Markstein, P. 1981. Register allocation via coloring. Computer Languages 6, 47--57.
- Cho, J., Paek, Y., and Whalley, D. 2002. Register and memory assignment for non-orthogonal architectures via graph coloring and MST algorithms. Proc. of LCTES '02 (June), 130--138.
- Cooper, K. D. and Harvey, T. J. 1998. Compiler-controlled memory. In Proc. of the 8th ASPLOS (Oct.).
- Cooper, K. D. and McIntosh, N. 1999. Enhanced code compression for embedded RISC processors. Proc. SIGPLAN '99 Conf. Programming Language Design and Implementation (May), 139--149.
- Davidson, J. W. and Jinturkar, S. 1994. Memory access coalescing: A technique for eliminating redundant memory accesses. Proc. SIGPLAN '94 Conf. Programming Language Design and Implementation (June), 186--195.
- George, L. and Appel, A. W. 1996. Iterated register coalescing. In Proc. SIGPLAN '96 Conf. Programming Language Design and Implementation.
- Gross, J. and Yellen, J. 1999. Graph Theory and Its Applications. CRC Press, Boca Raton, FL.
- Knoop, J., Rüthing, O., and Steffen, B. 1992. Lazy code motion. Proc. SIGPLAN '92 Conf. Programming Language Design and Implementation (July).
- Leupers, R. and Kotte, D. 2001. Variable partitioning for dual memory bank DSPs. ICASSP (May).
- Mach-SUIF Backend Compiler. 2000. The Machine-SUIF 2.1 compiler documentation set. Harvard University (Sept.). http://ececs.harvard.edu/hube/research/machsuif.html.
- Papadimitriou, C. H. and Steiglitz, K. 1998. Combinatorial Optimization: Algorithms and Complexity. Dover Publications.
- Powell, B., Lee, E. A., and Newman, W. C. 1992. Direct synthesis of optimized DSP assembly code from signal flow block diagrams. Proceedings International Conference on Acoustics, Speech, and Signal Processing, 553--556.
- Saghir, M. A. R., Chow, P., and Lee, C. G. 1996. Exploiting dual data-memory banks in digital signal processors. Proc. of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, 234--243.
- Stanford SUIF Compiler Infrastructure. 2000. The SUIF 2 Compiler Documentation Set. Stanford University (Sept.). http://suif.stanford.edu/suif/index.html.
- Sudarsanam, A. and Malik, S. 2000. Simultaneous reference allocation in code generation for dual data memory bank ASIPs. ACM Trans. on Design Automation of Electronic Systems 5, 242--264 (Apr.).
- Zhuang, X., Pande, S., and Greenland, J. S., Jr. 2002. A framework for parallelizing load/stores on embedded processors. In Proc. of International Conference on Parallel Architectures and Compilation Techniques, 68--70 (Sep.).
Index Terms
Parallelizing load/stores on dual-bank memory embedded processors