Mostly static program partitioning of binary executables

Published: 03 July 2009

Abstract

We have built a runtime compilation system that takes unmodified sequential binaries and improves their performance on off-the-shelf multiprocessors using dynamic vectorization and loop-level parallelization techniques. Our system, Azure, is purely software based and requires no specific hardware support for speculative thread execution, yet it is able to break even in most cases; that is, the achieved speedup exceeds the cost of runtime monitoring and compilation, often by significant amounts.

Key to this remarkable performance is an offline preprocessing step that extracts a mostly correct control flow graph (CFG) from the binary program ahead of time. This statically obtained CFG is incomplete in that it may be missing some edges corresponding to computed branches. We describe how such additional control flow edges are discovered and handled at runtime, so that an incomplete static analysis never leads to an incorrect optimization result.
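
The abstract does not say how newly discovered edges are handled, so the following is only a minimal sketch of one plausible policy, not Azure's mechanism. It assumes a hypothetical `RuntimeCFG` class that records every computed-branch target seen at runtime and invalidates any optimized region that the new edge enters at a block other than its head, since such an edge would break the region's single-entry property. All names here (`RuntimeCFG`, `observe_computed_branch`, `region_of`, `invalidated`) are illustrative assumptions.

```python
class RuntimeCFG:
    """Hypothetical wrapper around an incomplete, statically extracted CFG."""

    def __init__(self, static_edges, region_of):
        # static_edges: block -> iterable of successor blocks found offline.
        self.edges = {src: set(dsts) for src, dsts in static_edges.items()}
        # region_of: block -> head block of the region that contains it.
        self.region_of = dict(region_of)
        # Heads of regions whose optimized code may no longer be trusted.
        self.invalidated = set()

    def observe_computed_branch(self, src, target):
        """Record a computed branch; return True if the edge was unknown."""
        known = self.edges.setdefault(src, set())
        if target in known:
            return False  # edge was already in the static CFG
        known.add(target)
        head = self.region_of.get(target)
        enters_mid_region = head is not None and head != target
        from_outside = self.region_of.get(src) != head
        if enters_mid_region and from_outside:
            # Assumed policy: the region is no longer single-entry,
            # so discard whatever was optimized under that assumption.
            self.invalidated.add(head)
        return True


# Example: the static analysis missed the edge dispatch -> body.
cfg = RuntimeCFG(
    static_edges={"dispatch": ["case_a"], "loop": ["body"], "body": ["loop"]},
    region_of={"dispatch": "dispatch", "case_a": "dispatch",
               "loop": "loop", "body": "loop"},
)
first = cfg.observe_computed_branch("dispatch", "case_a")   # already known
second = cfg.observe_computed_branch("dispatch", "body")    # newly discovered
```

In this sketch a missed edge costs only a discarded optimization: the region is flagged before execution continues along the new edge, so the incomplete static CFG never leaves stale optimized code in use.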

The availability of a mostly correct CFG enables us to statically partition a binary executable into single-entry multiple-exit regions and to identify potential parallelization candidates ahead of execution. Program regions that are not candidates for parallelization can thereby be excluded completely from runtime monitoring and dynamic recompilation. Azure's extremely low overhead is a direct consequence of this design.
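
As a rough illustration of static partitioning, the sketch below splits a CFG into single-entry multiple-exit regions using the classic extended-basic-block rule: a block starts a new region if it is the program entry or does not have exactly one predecessor, and every other block joins the region of its sole predecessor. This is a stand-in technique, not the partitioning algorithm used by Azure, and the candidate test (a region with an edge back to its own head, that is, a loop) is likewise an assumption. It also assumes every block is reachable from the entry and appears as a key in `cfg`.

```python
from collections import defaultdict


def partition_regions(cfg, entry):
    """Map each block to the head of its single-entry region.

    cfg: dict mapping block -> list of successor blocks.
    """
    preds = defaultdict(set)
    for block, succs in cfg.items():
        for succ in succs:
            preds[succ].add(block)

    # A head is the entry or any block without exactly one predecessor.
    heads = [b for b in cfg if b == entry or len(preds[b]) != 1]
    head_set = set(heads)

    region_of = {}
    for head in heads:
        stack = [head]
        while stack:
            block = stack.pop()
            if block in region_of:
                continue
            region_of[block] = head
            # Grow the region only through non-head successors.
            for succ in cfg.get(block, []):
                if succ not in head_set and succ not in region_of:
                    stack.append(succ)
    return region_of


def loop_candidates(cfg, region_of):
    """Return heads of regions that contain an edge back to their own head."""
    return {
        head
        for block, head in region_of.items()
        if head in cfg.get(block, [])
    }


# Example CFG: a setup block followed by a simple loop.
example_cfg = {
    "entry": ["loop"],
    "loop": ["body", "exit"],
    "body": ["loop"],
    "exit": [],
}
regions = partition_regions(example_cfg, "entry")
candidates = loop_candidates(example_cfg, regions)
```

Here the `entry` region is not a candidate, so under the design described above it would never need runtime monitoring or recompilation; only the region headed by `loop` would.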

Published in

ACM Transactions on Programming Languages and Systems, Volume 31, Issue 5
June 2009
152 pages
ISSN: 0164-0925
EISSN: 1558-4593
DOI: 10.1145/1538917

Copyright © 2009 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

  • Published: 3 July 2009
  • Accepted: 1 November 2008
  • Revised: 1 December 2007
  • Received: 1 August 2006

Qualifiers

  • research-article
  • Research
  • Refereed
