Abstract
Efficiently executing sequential legacy binaries on chip multi-processors (CMPs) composed of many small cores is one of today's most pressing problems. Single-threaded execution is a suboptimal option due to CMPs' lower single-core performance, while multi-threaded execution relies on prior parallelization, which is severely hampered by the low-level binary representation of applications compiled and optimized for a single-core target. A recent technology to address this problem is Dynamic Binary Parallelization (DBP), which creates a Virtual Execution Environment (VEE) that takes advantage of the underlying multicore host to transparently parallelize the sequential binary executable. While still in its infancy, DBP has received broad interest within the research community. The combined use of DBP and thread-level speculation (TLS) has been proposed as a technique to accelerate legacy uniprocessor code on modern CMPs. In this paper, we investigate the limits of DBP and seek to understand the factors contributing to these limits, as well as the costs and overheads of its implementation. We have performed an extensive evaluation using a parameterizable DBP system targeting a CMP with lightweight architectural TLS support. We demonstrate that there is room for a significant reduction of up to 54% in the number of instructions on the critical paths of legacy SPEC CPU2006 benchmarks. However, we show that it is much harder to translate these savings into actual performance improvements, with a realistic hardware-supported implementation achieving a speedup of 1.09 on average.
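To put the two headline numbers side by side, the following is a minimal back-of-the-envelope sketch, not the paper's methodology. It assumes, purely for illustration, that run time scales linearly with the number of instructions on the critical path, so a fractional reduction r would bound the speedup at 1 / (1 - r). The function name `ideal_speedup` is hypothetical. Note also that 54% is a best case ("up to") while 1.09 is an average, so the two figures are not directly comparable.

```python
def ideal_speedup(critical_path_reduction: float) -> float:
    """Idealized speedup bound for a fractional critical-path reduction.

    Assumption (not from the paper): run time is proportional to the
    number of instructions on the critical path, with zero overhead.
    """
    if not 0.0 <= critical_path_reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - critical_path_reduction)


best_case_bound = ideal_speedup(0.54)  # "up to 54%" from the abstract
reported_average = 1.09                # average speedup from the abstract

print(f"idealized bound: {best_case_bound:.2f}x")
print(f"reported average: {reported_average:.2f}x")
```

Under this simplified model the best-case bound is roughly 2.17x, which gives a sense of how much of the theoretical headroom is consumed by the implementation costs and overheads the abstract refers to.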
Limits of region-based dynamic binary parallelization
VEE '13: Proceedings of the 9th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments