Abstract
A growing number of modern processors support non-contiguous SIMD data accesses. However, the translation of such instructions has been overlooked in the Dynamic Binary Translation (DBT) area. For example, the popular QEMU dynamic binary translator emulates guest memory instructions with strides as a sequence of scalar instructions, leaving significant room for performance improvement when the host machine has SIMD instructions available. Structured loads/stores, such as VLDn/VSTn in ARM NEON, are one type of strided SIMD data access instruction. They are widely used in signal processing, multimedia, mathematical, and 2D matrix transposition applications. Efficient translation of such structured loads/stores is therefore critical when migrating ARM executables to other ISAs. It is also challenging: the translation of structured loads/stores is itself non-trivial, and the difference between guest and host register configurations must be taken into account. In this work, we present the design and implementation of structured load/store translation in DBT, including target code generation and efficient SIMD register mapping. Our proposed register mapping mechanisms are not limited to structured loads/stores; they can be extended to handle ordinary SIMD instructions. On a set of OpenCV benchmarks, our QEMU-based system achieves a maximum speedup of 5.41x, with an average improvement of 2.93x. On a set of BLAS benchmarks, it achieves a maximum speedup of 2.19x, with an average improvement of 1.63x.
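To make the notion of a structured load/store concrete, the following C sketch models the semantics of a two-way, 8-bit structured load and store (in the spirit of NEON's VLD2.8/VST2.8 on two 64-bit registers) using plain scalar loops, roughly the element-by-element style of emulation the abstract describes. The type `dreg_t` and the functions `vld2_u8` and `vst2_u8` are hypothetical names chosen for illustration; they are not taken from QEMU or from the paper's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* a 64-bit register holds eight 8-bit lanes */

/* Hypothetical model of one 64-bit guest SIMD register. */
typedef struct {
    uint8_t lane[LANES];
} dreg_t;

/* Scalar model of a two-way structured load: reads 16 interleaved
 * bytes and de-interleaves them, so that even-indexed bytes land in
 * d0 and odd-indexed bytes land in d1. */
static void vld2_u8(dreg_t *d0, dreg_t *d1, const uint8_t *mem)
{
    assert(d0 != NULL && d1 != NULL && mem != NULL);
    for (size_t i = 0; i < LANES; i++) {
        d0->lane[i] = mem[2 * i];
        d1->lane[i] = mem[2 * i + 1];
    }
}

/* Scalar model of the matching structured store: re-interleaves the
 * lanes of d0 and d1 into 16 consecutive bytes. */
static void vst2_u8(uint8_t *mem, const dreg_t *d0, const dreg_t *d1)
{
    assert(d0 != NULL && d1 != NULL && mem != NULL);
    for (size_t i = 0; i < LANES; i++) {
        mem[2 * i] = d0->lane[i];
        mem[2 * i + 1] = d1->lane[i];
    }
}
```

Each call touches every element individually, which is the cost a translator could avoid by mapping the whole operation onto host SIMD loads plus shuffle or permute instructions. How that mapping is done, and how guest registers are assigned to host registers, is the subject of the paper and is not shown here.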
Index Terms
Dynamic translation of structured Loads/Stores and register mapping for architectures with SIMD extensions
LCTES 2017: Proceedings of the 18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems





