Limits of region-based dynamic binary parallelization

Published: 16 March 2013

Abstract

Efficiently executing sequential legacy binaries on chip multi-processors (CMPs) composed of many, small cores is one of today's most pressing problems. Single-threaded execution is a suboptimal option due to CMPs' lower single-core performance, while multi-threaded execution relies on prior parallelization, which is severely hampered by the low-level binary representation of applications compiled and optimized for a single-core target. A recent technology to address this problem is Dynamic Binary Parallelization (DBP), which creates a Virtual Execution Environment (VEE) taking advantage of the underlying multicore host to transparently parallelize the sequential binary executable. While still in its infancy, DBP has received broad interest within the research community. The combined use of DBP and thread-level speculation (TLS) has been proposed as a technique to accelerate legacy uniprocessor code on modern CMPs. In this paper, we investigate the limits of DBP and seek to gain an understanding of the factors contributing to these limits and the costs and overheads of its implementation. We have performed an extensive evaluation using a parameterizable DBP system targeting a CMP with light-weight architectural TLS support. We demonstrate that there is room for a significant reduction of up to 54% in the number of instructions on the critical paths of legacy SPEC CPU2006 benchmarks. However, we show that it is much harder to translate these savings into actual performance improvements, with a realistic hardware-supported implementation achieving a speedup of 1.09 on average.



      Reviews

      Andre Maximo

Parallel computation is everywhere, from small portable devices to laptops, desktops, and data centers. This wealth of available parallelism challenges computer scientists and developers alike to rethink algorithms and rewrite old sequential source code. However, we still have a long way to go. This paper offers a deep analysis of a different way to address the challenge: the authors investigate the performance limits of dynamic binary parallelization (DBP), a technique for transparently parallelizing single-threaded binary executables. The success of instruction-level parallelism on single-core architectures with multiple arithmetic logic units (ALUs) motivates DBP as a higher-level form of automatic parallelization on multi-core architectures.

The authors first analyze the proposed architecture, explaining how regions of binary code are identified for parallel execution, and then present their experiments and extensive results. The key idea of DBP is to reduce the number of critical-path instructions by overlapping the execution of segments of the instruction stream. The overlapping is done by speculative cores launched by a master core, which remains responsible for the correct execution of the original binary. However, this strategy can degrade performance if many speculative threads are invalidated, mainly due to data dependences between threads.

The experiments cover a variety of applications from a benchmark suite. The results, for an eight-core reduced instruction set computing (RISC) processor relative to single-core execution, show an interesting 54 percent reduction in the number of instructions on critical paths, yet only a mismatched average speedup of 1.43 times. This is still far from the ideal performance gain, which suggests that the main bottleneck is bandwidth rather than instruction processing.

Online Computing Reviews Service
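The mechanism the review describes can be made concrete with a toy model. The following sketch is purely illustrative and not the paper's implementation: segments of an instruction stream run speculatively in parallel, and a speculative segment commits only if it does not read a location written by an earlier segment in the same window (a data-dependence violation forces it back onto the sequential path). All names (`Segment`, `run_with_tls`, the `cost` abstraction) are invented for this example.

```python
# Toy model of thread-level speculation (illustrative only).
from dataclasses import dataclass

@dataclass
class Segment:
    sid: int
    reads: set      # memory locations the segment reads
    writes: set     # memory locations the segment writes
    cost: int = 10  # abstract instruction count

def run_with_tls(segments, num_cores=4):
    """Return (sequential_cost, speculative_cost, invalidations)."""
    seq_cost = sum(s.cost for s in segments)
    spec_cost = 0
    invalidations = 0
    i = 0
    while i < len(segments):
        window = segments[i:i + num_cores]
        committed_writes = set()
        committed = 0
        for s in window:
            # The first segment in each window is non-speculative.
            if committed > 0 and s.reads & committed_writes:
                invalidations += 1  # dependence violation: stop committing
                break
            committed_writes |= s.writes
            committed += 1
        committed = max(committed, 1)
        # A parallel window costs as much as its longest committed segment.
        spec_cost += max(s.cost for s in window[:committed])
        i += committed
    return seq_cost, spec_cost, invalidations

segs = [
    Segment(0, reads={"a"}, writes={"b"}),
    Segment(1, reads={"c"}, writes={"d"}),  # independent: overlaps
    Segment(2, reads={"b"}, writes={"e"}),  # reads seg 0's write: violation
    Segment(3, reads={"f"}, writes={"g"}),
]
print(run_with_tls(segs))  # → (40, 20, 1)
```

Even in this toy model, one dependence violation halves the window's effective parallelism, which mirrors the review's point that invalidated speculative threads can erase much of the theoretical critical-path savings.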
