
Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code

Published: 03 June 2015

Abstract

Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide’s state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically, so we instead rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate, and output buffer shapes. We abstract from a forest of concrete dependency trees, which contain absolute memory addresses, to symbolic trees suitable for high-level code generation. This is done by canonicalizing the trees, clustering them by structure, inferring higher-dimensional buffer accesses, and finally solving a set of linear equations based on buffer accesses to lift them to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop, giving a 75% performance improvement; four kernels from IrfanView, giving 4.97× performance; and one stencil from the miniGMG multigrid benchmark, giving a 4.25× improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop’s filters with our lifted implementations, giving a 1.12× speedup without affecting the user experience.
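To make the "set of linear equations based on buffer accesses" step concrete, here is a minimal Python sketch of one way such a system could be set up and solved. It is an illustration only, not Helium's implementation: the function names, the sample trace, and the row width of 640 elements are all hypothetical. It assumes each observation pairs an output coordinate (x, y) with the flat element offset read from an input buffer, and it looks for coefficients a, b, c such that offset = a*x + b*y + c.

```python
from fractions import Fraction
from itertools import combinations


def solve_linear(A, b):
    """Solve the square system A*v = b exactly using Gauss-Jordan elimination."""
    n = len(A)
    # Augmented matrix with exact rational arithmetic (no rounding error).
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]


def fit_affine_index(samples):
    """Return [a, b, c] with offset = a*x + b*y + c for every sample, or None.

    samples is a list of ((x, y), offset) pairs (hypothetical trace format).
    """
    for triple in combinations(samples, 3):
        A = [[x, y, 1] for (x, y), _ in triple]
        rhs = [offset for _, offset in triple]
        try:
            a, b, c = solve_linear(A, rhs)
        except ValueError:
            continue  # degenerate (collinear) sample points, try another triple
        # A solvable triple fixes the only possible affine map; check the rest.
        if all(a * x + b * y + c == offset for (x, y), offset in samples):
            return [a, b, c]
        return None
    return None


# Hypothetical trace: output (x, y) reads input element (x - 1, y + 1)
# from a row-major buffer that is 640 elements wide.
samples = [((1, 0), 640), ((2, 0), 641), ((1, 1), 1280), ((5, 3), 2564)]
coeffs = fit_affine_index(samples)
print([int(v) for v in coeffs])  # prints [1, 640, 639]
```

In this sketch the recovered coefficient on y (640) plays the role of the row stride, and the constant 639 = 640*1 - 1 encodes the stencil offset (-1, +1). Exact rational arithmetic is used so that a trace which is not affine is rejected outright instead of being fitted approximately.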

