
Large-scale and Robust Code Authorship Identification with Deep Feature Learning

Published: 19 July 2021

Abstract

Successful software authorship de-anonymization has both software forensics applications and privacy implications. However, the process requires efficient extraction of authorship attributes. Extracting such attributes is challenging, owing to the variety of software code formats, from executable binaries with different toolchain provenance to source code in different programming languages. Moreover, the quality of the attributes is bounded by the availability of software samples: a certain number of samples per author and a specific size per sample. To this end, this work proposes a deep learning-based approach for software authorship attribution that enables large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. The proposed approach learns deep authorship representations using a recurrent neural network and employs an ensemble random forest classifier for scalability in de-anonymizing programmers. Comprehensive experiments evaluate the approach over the entire Google Code Jam (GCJ) dataset across all years (2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results show high accuracy despite requiring a smaller number of samples per author. Experimenting with source code, our approach identifies 8,903 GCJ authors, the largest-scale dataset used to date, with an accuracy of 92.3%. On the real-world dataset, it achieves an identification accuracy of 94.38% for 745 C programmers on GitHub. Moreover, the approach is resilient to language specifics: it identifies authors across four programming languages (C, C++, Java, and Python) as well as authors writing in mixed languages (e.g., Java/C++ and Python/C++). Finally, it is resistant to sophisticated obfuscation (e.g., using the Tigress C obfuscator), with an accuracy of 93.42% for a set of 120 authors.
Experimenting with executable binaries, our approach achieves 95.74% accuracy in identifying 1,500 programmers from software binaries. Similar results were obtained when the binaries are generated with different compilation options and optimization levels, and with symbol information removed. Moreover, the approach achieves 93.86% accuracy in identifying 1,500 programmers of obfuscated binaries, using all obfuscation features offered by the Obfuscator-LLVM tool.
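The pipeline the abstract describes has two stages: a recurrent network maps a variable-length code sample to a fixed-size deep representation, and a downstream classifier (an ensemble random forest in the paper) assigns that representation to an author. The toy sketch below illustrates only the first stage with an untrained, pure-Python RNN forward pass; the weights, dimensions, and token vocabulary here are illustrative stand-ins, not the paper's trained model or actual features.

```python
import math
import random

def rnn_features(tokens, vocab, dim=8, seed=0):
    """Toy RNN forward pass: map a token sequence of any length to a
    fixed-size feature vector (the final hidden state). Weights are
    random stand-ins; in the actual approach they would be learned
    end-to-end from many code samples per author."""
    rng = random.Random(seed)
    # Random token embeddings and recurrent weights (illustrative only).
    emb = {t: [rng.uniform(-1, 1) for _ in range(dim)] for t in vocab}
    U = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    h = [0.0] * dim
    for t in tokens:
        x = emb.get(t, [0.0] * dim)  # unknown tokens map to zeros
        # h_t = tanh(x_t + U h_{t-1}); tanh keeps each component in [-1, 1]
        h = [math.tanh(x[i] + sum(U[i][j] * h[j] for j in range(dim)))
             for i in range(dim)]
    return h

# Hypothetical token stream from a lexed C snippet.
vocab = ["for", "int", "i", "=", "0", ";", "<", "n", "++", "printf"]
feat = rnn_features(["for", "int", "i", "=", "0", ";", "i", "<", "n"], vocab)
print(len(feat))  # → 8: fixed-size vector regardless of sequence length
```

The fixed-size vectors produced this way are what make the second stage scalable: any off-the-shelf ensemble classifier can then be trained on (vector, author) pairs, independent of the source language or binary format the tokens came from.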

