Abstract
We have built a runtime compilation system that takes unmodified sequential binaries and improves their performance on off-the-shelf multiprocessors using dynamic vectorization and loop-level parallelization techniques. Our system, Azure, is purely software based and requires no specific hardware support for speculative thread execution, yet it is able to break even in most cases; that is, the achieved speedup exceeds the cost of runtime monitoring and compilation, often by significant amounts.
Key to this remarkable performance is an offline preprocessing step that extracts a mostly correct control flow graph (CFG) from the binary program ahead of time. This statically obtained CFG is incomplete in that it may be missing some edges corresponding to computed branches. We describe how such additional control flow edges are discovered and handled at runtime, so that an incomplete static analysis never leads to an incorrect optimization result.
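As a rough illustration of the runtime side, the sketch below shows one way a newly observed computed-branch target could be folded into a statically extracted CFG. It is not Azure's actual mechanism: the class `RuntimeCFG`, its methods, and the conservative invalidation policy (discard optimized code for any region that contains either endpoint of the new edge) are all assumptions made for this example.

```python
class RuntimeCFG:
    """Hypothetical wrapper around a statically extracted, possibly incomplete CFG."""

    def __init__(self, static_edges):
        # static_edges: mapping from block name to iterable of successor names
        self.edges = {src: set(dsts) for src, dsts in static_edges.items()}
        # region head -> set of blocks the optimized code for that region covers
        self.optimized = {}

    def install_optimized(self, head, blocks):
        """Record that optimized code for the region rooted at `head` is live."""
        self.optimized[head] = set(blocks)

    def on_computed_branch(self, src, target):
        """Called when a computed branch in `src` is observed jumping to `target`.

        Returns the heads of regions whose optimized code was discarded.
        """
        if target in self.edges.get(src, set()):
            return []  # edge was already known; nothing to do

        # Newly discovered edge: add it to the CFG.
        self.edges.setdefault(src, set()).add(target)

        # Assumed policy: any optimized region touching either endpoint was
        # built from an incomplete CFG, so fall back to the original code.
        stale = [head for head, blocks in self.optimized.items()
                 if src in blocks or target in blocks]
        for head in stale:
            del self.optimized[head]
        return stale
```

Discarding more regions than strictly necessary costs performance but never correctness, which is the property the paragraph above asks for.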
The availability of a mostly correct CFG enables us to statically partition a binary executable into single-entry multiple-exit regions and to identify potential parallelization candidates ahead of execution. Program regions that are not candidates for parallelization can thereby be excluded completely from runtime monitoring and dynamic recompilation. Azure's extremely low overhead is a direct consequence of this design.
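To make the idea of single-entry multiple-exit regions concrete, here is a minimal sketch that partitions a CFG in the style of extended basic blocks: a block starts a new region if it is the program entry or does not have exactly one predecessor, and every other block joins the region of its unique predecessor. This is a swapped-in textbook technique, not necessarily the partitioning Azure uses; the function names and the "region contains a branch back to its head" test for loop candidates are assumptions for illustration. The sketch also assumes every block is reachable from the entry.

```python
from collections import defaultdict


def partition_single_entry(cfg, entry):
    """Partition a CFG into single-entry regions keyed by their head block.

    cfg: mapping from block name to a list of successor block names.
    """
    preds = defaultdict(set)
    for src, dsts in cfg.items():
        for dst in dsts:
            preds[dst].add(src)

    blocks = set(cfg) | set(preds)
    # A head is the entry block or any block without exactly one predecessor.
    heads = {b for b in blocks if b == entry or len(preds[b]) != 1}

    region_of = {}
    for head in sorted(heads):
        region_of[head] = head
        stack = [head]
        while stack:
            block = stack.pop()
            for succ in cfg.get(block, ()):
                # A non-head successor has this block as its only predecessor,
                # so control can reach it only from inside the region.
                if succ not in heads and succ not in region_of:
                    region_of[succ] = head
                    stack.append(succ)

    regions = defaultdict(set)
    for block, head in region_of.items():
        regions[head].add(block)
    return dict(regions)


def loop_candidates(cfg, regions):
    """Return heads of regions in which some block branches back to the head."""
    return sorted(head for head, blocks in regions.items()
                  if any(head in cfg.get(b, ()) for b in blocks))
```

For the CFG `{"A": ["B"], "B": ["C", "D"], "C": ["B"], "D": []}` with entry `"A"`, block `B` heads a region containing `B`, `C`, and `D`, and only that region is reported as a loop candidate; region `A` could be left out of runtime monitoring entirely.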
Index Terms
Mostly static program partitioning of binary executables