Abstract
We address a fundamental problem in reverse engineering of object-oriented code: the reconstruction of a program's class hierarchy from its stripped binary. Existing approaches rely heavily on structural information that is not always available, e.g., calls to parent constructors. As a result, these approaches often leave gaps in the hierarchies they construct, or fail to construct them altogether. Our main insight is that behavioral information can be used to infer subclass/superclass relations, supplementing any missing structural information. Thus, we propose the first statistical approach for static reconstruction of class hierarchies based on behavioral similarity. We capture the behavior of each type using a statistical language model (SLM), define a metric for pairwise similarity between types based on the Kullback-Leibler divergence between their SLMs, and lift it to determine the most likely class hierarchy. We implemented our approach in a tool called ROCK and used it to automatically reconstruct the class hierarchies of several real-world stripped C++ binaries. Our results demonstrate that ROCK obtained significantly more accurate class hierarchies than those obtained using structural analysis alone.
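The similarity step described above can be illustrated with a minimal sketch. It is not ROCK's implementation. It assumes a smoothed unigram distribution as a stand-in for the statistical language model, and hypothetical type names and event traces. The direction of the divergence (child relative to candidate parent) is also an assumption made for illustration. The sketch stops at ranking candidate parents for one type and does not perform the lifting to a full hierarchy.

```python
import math
from collections import Counter

def unigram_model(sequences, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over `vocab` (a stand-in for an SLM)."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values()) + alpha * len(vocab)
    return {tok: (counts[tok] + alpha) / total for tok in vocab}

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for distributions over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

def rank_candidate_parents(child, models):
    """Rank every other type by D(child || candidate), most similar first."""
    scores = [
        (name, kl_divergence(models[child], model))
        for name, model in models.items()
        if name != child
    ]
    return sorted(scores, key=lambda pair: pair[1])

# Hypothetical per-type event traces (e.g. sequences of operations applied to objects).
traces = {
    "Stream": [["open", "read", "close"], ["open", "write", "close"]],
    "FileStream": [["open", "read", "seek", "read", "close"],
                   ["open", "write", "flush", "close"]],
    "Timer": [["start", "tick", "tick", "stop"], ["start", "stop"]],
}

vocab = sorted({tok for seqs in traces.values() for seq in seqs for tok in seq})
models = {name: unigram_model(seqs, vocab) for name, seqs in traces.items()}

ranked = rank_candidate_parents("FileStream", models)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

Smoothing keeps every probability strictly positive, so the divergence is always finite even when one type never exhibits an event that another does.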
Statistical Reconstruction of Class Hierarchies in Binaries
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems