Abstract
Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide’s state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. We abstract from a forest of concrete dependency trees which contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75% performance improvement, four kernels from IrfanView, leading to 4.97× performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25× improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop’s filters with our lifted implementations, giving 1.12× speedup without affecting the user experience.
- Idapro, hexrays. URL http://www.hex-rays.com/idapro/.Google Scholar
- Mcsema: Static translation of x86 into llvm. 2014.Google Scholar
- K. Anand, M. Smithson, K. Elwazeer, A. Kotha, J. Gruen, N. Giles, and R. Barua. A compiler-level intermediate representation based binary analysis and rewriting system. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys ’13, pages 295– 308, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1994-2.. URL http://doi.acm.org/10.1145/2465351.2465380. Google Scholar
Digital Library
- J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O’Reilly, and S. Amarasinghe. Opentuner: An extensible framework for program autotuning. In International Conference on Parallel Architectures and Compilation Techniques, Edmonton, Canada, August 2014. Google Scholar
Digital Library
- V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI ’00, pages 1–12, New York, NY, USA, 2000. ACM. ISBN 1-58113-199-2.. URL http://doi.acm.org/10.1145/349299. Google Scholar
Digital Library
- 349303.Google Scholar
- G. Balakrishnan and T. Reps. Analyzing memory accesses in x86 executables. In E. Duesterwald, editor, Compiler Construction, volume 2985 of Lecture Notes in Computer Science, pages 5–23. Springer Berlin Heidelberg, 2004. ISBN 978-3-540-21297-3..Google Scholar
- F. Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’05, pages 41–41, Berkeley, CA, USA, 2005. Google Scholar
Digital Library
- USENIX Association. URL www.qemu.org.Google Scholar
- D. Bruening, Q. Zhao, and S. Amarasinghe. Transparent dynamic instrumentation. In Proceedings of the 8th ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments, VEE ’12, pages 133– 144, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1176-2.. URL http://doi.acm.org/10.1145/2151024.2151043. Google Scholar
Digital Library
- D. Brumley, I. Jager, T. Avgerinos, and E. J. Schwartz. BAP: A binary analysis platform. In Proceedings of the 23rd International Conference on Computer Aided Verification, CAV’11, pages 463–469, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-642-22109-5. URL http://dl.acm.org/citation.cfm?id=2032305.2032342. Google Scholar
Digital Library
- S. Campanoni, T. Jones, G. Holloway, V. J. Reddi, G.-Y. Wei, and D. Brooks. HELIX: Automatic parallelization of irregular programs for chip multiprocessing. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12, pages 84–93, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1206-6.. URL http://doi.acm.org/10.1145/2259016.2259028. Google Scholar
Digital Library
- V. Chipounov and G. Candea. Reverse engineering of binary device drivers with RevNIC. In Proceedings of the 5th European Conference on Computer Systems, EuroSys ’10, pages 167–180, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-577-2.. URL http://doi. acm.org/10.1145/1755913.1755932. Google Scholar
Digital Library
- V. Chipounov, V. Kuznetsov, and G. Candea. S2E: A platform for in-vivo multi-path analysis of software systems. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 265–278, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0266-1.. URL http://doi.acm.org/10.1145/1950365.1950396. Google Scholar
Digital Library
- K. ElWazeer, K. Anand, A. Kotha, M. Smithson, and R. Barua. Scalable variable and data type detection in a binary rewriter. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 51–60, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6.. URL http: //doi.acm.org/10.1145/2491956.2462165. Google Scholar
Digital Library
- A. Fokin, E. Derevenetc, A. Chernov, and K. Troshina. Smartdec: Approaching c++ decompilation. In Proceedings of the 2011 18th Working Conference on Reverse Engineering, WCRE ’11, pages 347– 356, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4582-0.. URL http://dx.doi.org/10.1109/WCRE. 2011.49. Google Scholar
Digital Library
- B. Guo, M. Bridges, S. Triantafyllis, G. Ottoni, E. Raman, and D. August. Practical and accurate low-level pointer analysis. In Code Generation and Optimization, 2005. CGO 2005. International Symposium on, pages 291–302, March 2005.. Google Scholar
Digital Library
- R. N. Horspool and N. Marovac. An approach to the problem of detranslation of computer programs. The Computer Journal, 23(3): 223–229, 1980.Google Scholar
Cross Ref
- S. Kamil, C. Chan, L. Oliker, J. Shalf, and S. Williams. An auto-tuning framework for parallel multicore stencil computations. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12, April 2010..Google Scholar
Cross Ref
- A. Kotha, K. Anand, M. Smithson, G. Yellareddy, and R. Barua. Automatic parallelization in a binary rewriter. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, pages 547–557, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4299-7.. URL http://dx.doi.org/10.1109/MICRO.2010.27. Google Scholar
Digital Library
- J. Li, C. Wu, and W.-C. Hsu. Dynamic register promotion of stack variables. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’11, pages 21–31, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-1-61284-356-8. URL http://dl.acm.org/citation.cfm? id=2190025.2190050. Google Scholar
Digital Library
- N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’07, pages 89–100, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-633-2.. URL http://doi.acm.org/10.1145/ 1250734.1250746. Google Scholar
Digital Library
- S. Paris. Adobe systems. personal communication, 2014.Google Scholar
- J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, 2013. Google Scholar
Digital Library
- . URL http://doi.acm.org/10.1145/2491956.2462176.Google Scholar
- M. Research. Phoenix compiler and shared source common language infrastructure. URL http://www.research.microsoft. com/phoenix.Google Scholar
- D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. G. Kang, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena. BitBlaze: A new approach to computer security via binary analysis. In Proceedings of the 4th International Conference on Information Systems Security. Keynote invited paper., Hyderabad, India, Dec. 2008. Google Scholar
Digital Library
- K. Stock, M. Kong, T. Grosser, L.-N. Pouchet, F. Rastello, J. Ramanujam, and P. Sadayappan. A framework for enhancing data reuse via associative reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 65–76, New York, NY, USA, 2014. ACM. ISBN 978- 1-4503-2784-8.. URL http://doi.acm.org/10.1145/2594291. Google Scholar
Digital Library
- 2594342.Google Scholar
- Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The pochoir stencil compiler. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, pages 117–128, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0743-7.. URL http://doi.acm. org/10.1145/1989493.1989508. Google Scholar
Digital Library
- W. Thies, V. Chandrasekhar, and S. Amarasinghe. A practical approach to exploiting coarse-grained pipeline parallelism in c programs. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 356–369, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-3047-8.. URL http://dx.doi.org/10.1109/MICRO.2007.7. Google Scholar
Digital Library
- H. Vandierendonck, S. Rul, and K. De Bosschere. The paralax infrastructure: Automatic parallelization with a helping hand. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, pages 389–400, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0178-7.. URL http://doi.acm.org/10.1145/1854273.1854322. Google Scholar
Digital Library
- Z. Wang, G. Tournavitis, B. Franke, and M. F. P. O’boyle. Integrating profile-driven parallelism detection and machine-learning-based mapping. ACM Trans. Archit. Code Optim., 11(1):2:1–2:26, Feb. 2014. ISSN 1544-3566.. URL http://doi.acm.org/10.1145/ 2579561. Google Scholar
Digital Library
- R. Wilhelm, M. Sagiv, and T. Reps. Shape analysis. In D. Watt, editor, Compiler Construction, volume 1781 of Lecture Notes in Computer Science, pages 1–17. Springer Berlin Heidelberg, 2000. ISBN 978-3-540-67263-0.. URL http://dx.doi.org/10.1007/ 3-540-46423-9\_1. Google Scholar
Digital Library
- S. Williams, D. D. Kalamkar, A. Singh, A. M. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, and L. Oliker. Optimization of geometric multigrid for emerging multi-and manycore processors. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 96. IEEE Computer Society Press, 2012. Google Scholar
Digital Library
- Y. Wu. Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, PLDI ’02, pages 210–221, New York, NY, USA, 2002. ACM. ISBN 1-58113-463-0.. URL http: //doi.acm.org/10.1145/512529.512555. Google Scholar
Digital Library
- Q. Zhao, R. Rabbah, S. Amarasinghe, L. Rudolph, and W.-F. Wong. Ubiquitous memory introspection. In International Symposium on Code Generation and Optimization, San Jose, CA, Mar 2007. URL http://groups.csail.mit.edu/commit/papers/ 07/zhao-cgo07-umi.pdf. Google Scholar
Digital Library
Index Terms
Helium: lifting high-performance stencil kernels from stripped x86 binaries to halide DSL code
Recommendations
Helium: lifting high-performance stencil kernels from stripped x86 binaries to halide DSL code
PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and ImplementationHighly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 ...
Automatic code generation and tuning for stencil kernels on modern shared memory architectures
In this paper, we present Patus, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. Patus, which stands for " P arallel A uto tu ned S ...
A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops
In the era of multicores, many applications that require substantial computing power and data crunching can now run on desktop PCs. However, to achieve the best possible performance, developers must write applications in a way that exploits both ...






Comments