Lifting Assembly to Intermediate Representation: A Novel Approach Leveraging Compilers

Published: 25 March 2016

Abstract

Translating low-level machine instructions into a higher-level intermediate language (IL) is one of the central steps in many binary analysis and instrumentation systems. Existing systems build such translators manually. As a result, it takes a great deal of effort to support new architectures. Even for widely deployed architectures, full instruction sets may not be modeled; for example, mature systems such as Valgrind still lack support for AVX, FMA4, and SSE4.1 on x86 processors. To overcome these difficulties, we propose a novel approach that leverages the knowledge of instruction set semantics that is already embedded in modern compilers such as GCC. In particular, we present a learning-based approach for automating the translation of assembly instructions to a compiler's architecture-neutral IL. We present an experimental evaluation that demonstrates the ability of our approach to easily support many architectures (x86, ARM, and AVR), including their advanced instruction sets. Our implementation is available as open-source software.
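To make the idea of learning a translation from compiler output more concrete, the following is a toy sketch only, not the algorithm described in the paper. It assumes that a compiler can be made to emit pairs of (assembly, IL) for the code it generates, generalizes the register operands in each pair into placeholders, and reuses the resulting templates to lift instructions with operands it has not seen. The IL syntax, the training pairs, and the function names `generalize`, `learn`, and `lift` are all hypothetical and chosen for illustration; the IL only loosely resembles GCC's RTL.

```python
import re

# Hypothetical training data: (assembly, IL) pairs such as a compiler might
# emit side by side. The IL syntax is invented for this illustration.
PAIRS = [
    ("add %eax, %ebx", "(set (reg ebx) (plus (reg ebx) (reg eax)))"),
    ("add %ecx, %edx", "(set (reg edx) (plus (reg edx) (reg ecx)))"),
    ("mov %eax, %ecx", "(set (reg ecx) (reg eax))"),
]

ASM_REG = re.compile(r"%(\w+)")
IL_REG = re.compile(r"\(reg (\w+)\)")
IL_SLOT = re.compile(r"\(reg R(\d+)\)")


def unique_regs(asm):
    """Return the registers used in an assembly string, in order, without repeats."""
    return list(dict.fromkeys(ASM_REG.findall(asm)))


def generalize(asm, il):
    """Replace concrete registers by numbered placeholders R0, R1, ... on both sides."""
    index = {reg: i for i, reg in enumerate(unique_regs(asm))}
    asm_t = ASM_REG.sub(lambda m: f"%R{index[m.group(1)]}", asm)
    # Registers that appear only in the IL are left untouched.
    il_t = IL_REG.sub(
        lambda m: f"(reg R{index[m.group(1)]})" if m.group(1) in index else m.group(0),
        il,
    )
    return asm_t, il_t


def learn(pairs):
    """Build a table from generalized assembly to generalized IL."""
    rules = {}
    for asm, il in pairs:
        key, template = generalize(asm, il)
        if rules.setdefault(key, template) != template:
            raise ValueError(f"conflicting examples for {key!r}")
    return rules


def lift(asm, rules):
    """Translate one instruction using learned templates, or return None if unknown."""
    regs = unique_regs(asm)
    key, _ = generalize(asm, "")
    template = rules.get(key)
    if template is None:
        return None
    return IL_SLOT.sub(lambda m: f"(reg {regs[int(m.group(1))]})", template)


rules = learn(PAIRS)
print(lift("add %esi, %edi", rules))
print(lift("sub %eax, %ebx", rules))
```

In this sketch the `add` template learned from two register combinations is applied to a third, while `sub`, which never appeared in the training pairs, is reported as unknown rather than guessed. A real system would need to handle memory operands, immediates, flags, and instructions whose semantics depend on operand values, none of which this toy covers.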



Published in

ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16), April 2016, 774 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2954679
Editor: Andy Gill

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, March 2016, 824 pages
ISBN: 9781450340915
DOI: 10.1145/2872362
General Chair: Tom Conte
Program Chair: Yuanyuan Zhou

Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article
