Abstract
Reverse engineering is an important tool in mitigating vulnerabilities in binaries. As a lot of software is developed in object-oriented languages, reverse engineering of object-oriented code is of critical importance. One of the major hurdles in reverse engineering binaries compiled from object-oriented code is the use of dynamic dispatch. In the absence of debug information, any dynamic dispatch may seem to jump to many possible targets, posing a significant challenge to a reverse engineer trying to track the program flow. We present a novel technique that allows us to statically determine the likely targets of virtual function calls. Our technique uses object tracelets – statically constructed sequences of operations performed on an object – to capture potential runtime behaviors of the object. Our analysis automatically pre-labels some of the object tracelets by relying on instances where the type of an object is known. The resulting type-labeled tracelets are then used to train a statistical language model (SLM) for each type. We then use the resulting ensemble of SLMs over unlabeled tracelets to generate a ranking of their most likely types, from which we deduce the likely targets of dynamic dispatches. We have implemented our technique and evaluated it over real-world C++ binaries. Our evaluation shows that when there are multiple alternative targets, our approach can drastically reduce the number of targets that have to be considered by a reverse engineer.
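The train-then-rank idea in the abstract can be illustrated with a toy sketch. It is not the paper's implementation: the add-one-smoothed bigram model stands in for whatever SLM the authors use, and the type names (`Shape`, `Stream`), the event strings, and the helpers `train_models` and `rank_types` are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


class BigramModel:
    """Toy add-one-smoothed bigram model over tracelet events.

    Assumption: a stand-in for a per-type statistical language model,
    not the model used in the paper.
    """

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.pair_counts = Counter()     # (previous event, event) -> count
        self.context_counts = Counter()  # previous event -> count

    def train(self, tracelet):
        prev = "<s>"  # start-of-tracelet marker
        for event in tracelet:
            self.pair_counts[(prev, event)] += 1
            self.context_counts[prev] += 1
            prev = event

    def log_prob(self, tracelet):
        total, prev = 0.0, "<s>"
        for event in tracelet:
            num = self.pair_counts[(prev, event)] + 1
            den = self.context_counts[prev] + self.vocab_size
            total += math.log(num / den)
            prev = event
        return total


def train_models(labeled):
    """Train one model per type from type-labeled tracelets (hypothetical helper)."""
    vocab = {e for ts in labeled.values() for t in ts for e in t}
    models = {}
    for type_name, tracelets in labeled.items():
        model = BigramModel(len(vocab))
        for tracelet in tracelets:
            model.train(tracelet)
        models[type_name] = model
    return models


def rank_types(models, tracelet):
    """Rank candidate types for an unlabeled tracelet, most likely first."""
    return sorted(models, key=lambda name: models[name].log_prob(tracelet),
                  reverse=True)


# Hypothetical labeled tracelets: sequences of operations on an object.
labeled = {
    "Shape": [["write(0)", "call(vf0)", "read(4)"],
              ["write(0)", "read(4)", "call(vf0)"]],
    "Stream": [["write(0)", "call(vf1)", "call(vf2)"],
               ["write(0)", "call(vf1)", "write(8)"]],
}
models = train_models(labeled)
ranking = rank_types(models, ["write(0)", "call(vf1)", "call(vf2)"])
print(ranking)
```

In this sketch the ranking is the only output; mapping the top-ranked types to candidate virtual-call targets would require the per-type virtual tables, which the sketch does not model.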
Estimating types in binaries using predictive modeling
POPL '16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages