
Estimating types in binaries using predictive modeling

Published: 11 January 2016

Abstract

Reverse engineering is an important tool in mitigating vulnerabilities in binaries. As a lot of software is developed in object-oriented languages, reverse engineering of object-oriented code is of critical importance. One of the major hurdles in reverse engineering binaries compiled from object-oriented code is the use of dynamic dispatch. In the absence of debug information, any dynamic dispatch may seem to jump to many possible targets, posing a significant challenge to a reverse engineer trying to track the program flow. We present a novel technique that allows us to statically determine the likely targets of virtual function calls. Our technique uses object tracelets – statically constructed sequences of operations performed on an object – to capture potential runtime behaviors of the object. Our analysis automatically pre-labels some of the object tracelets by relying on instances where the type of an object is known. The resulting type-labeled tracelets are then used to train a statistical language model (SLM) for each type. We then use the resulting ensemble of SLMs over unlabeled tracelets to generate a ranking of their most likely types, from which we deduce the likely targets of dynamic dispatches. We have implemented our technique and evaluated it over real-world C++ binaries. Our evaluation shows that when there are multiple alternative targets, our approach can drastically reduce the number of targets that have to be considered by a reverse engineer.
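The pipeline described in the abstract (train one SLM per type on labeled tracelets, then rank types for an unlabeled tracelet) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes an add-one-smoothed bigram model as a simple stand-in for the SLM, and the event strings (`call vt+0`, `read f+8`, and so on), the type names `Circle` and `Square`, and the helpers `BigramModel` and `rank_types` are all hypothetical names chosen for illustration.

```python
import math
from collections import defaultdict


class BigramModel:
    """Add-one-smoothed bigram model over tracelet events (a stand-in SLM)."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, tracelet):
        # Prepend a start marker so the first event also has a context.
        seq = ["<s>"] + list(tracelet)
        for prev, cur in zip(seq, seq[1:]):
            self.counts[prev][cur] += 1
            self.totals[prev] += 1

    def log_prob(self, tracelet):
        # Sum of smoothed log-probabilities of each event given its predecessor.
        seq = ["<s>"] + list(tracelet)
        v = len(self.vocab)
        return sum(
            math.log((self.counts[prev][cur] + 1) / (self.totals[prev] + v))
            for prev, cur in zip(seq, seq[1:])
        )


def rank_types(labeled, unlabeled):
    """Train one model per type, then rank types by likelihood of `unlabeled`."""
    vocab = {e for ts in labeled.values() for t in ts for e in t} | set(unlabeled)
    models = {}
    for type_name, tracelets in labeled.items():
        model = BigramModel(vocab)
        for t in tracelets:
            model.train(t)
        models[type_name] = model
    return sorted(models, key=lambda n: models[n].log_prob(unlabeled), reverse=True)


# Hypothetical type-labeled tracelets (events are made-up operation strings).
labeled = {
    "Circle": [["call vt+0", "read f+8", "call vt+4"], ["call vt+0", "read f+8"]],
    "Square": [["call vt+0", "write f+12", "call vt+8"], ["write f+12", "call vt+8"]],
}
unlabeled = ["call vt+0", "read f+8"]

ranking = rank_types(labeled, unlabeled)
print(ranking)
```

In this toy setting, the top-ranked type would narrow the candidate targets of a virtual call made on the object; a real system would need a richer event alphabet and a more capable sequence model than a bigram.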



Published in

ACM SIGPLAN Notices, Volume 51, Issue 1
POPL '16
January 2016
815 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2914770
Editor: Andy Gill

POPL '16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
January 2016
815 pages
ISBN: 9781450335492
DOI: 10.1145/2837614

Copyright © 2016 ACM

Publisher

Association for Computing Machinery

New York, NY, United States
