research-article

A Visualizable Evidence-Driven Approach for Authorship Attribution

Published: 09 March 2015

Abstract

The Internet provides an ideal anonymous channel for concealing computer-mediated malicious activities, because the network-based origins of critical electronic textual evidence (e.g., emails, blogs, forum posts, and chat logs) can be easily repudiated. Authorship attribution is the study of identifying the actual author of given anonymous documents based on the text itself, and for decades many linguistic stylometry and computational techniques have been studied extensively for this purpose. However, most previous research emphasizes improving attribution accuracy, and little work has addressed constructing and visualizing the evidential traits. In addition, these sophisticated techniques are difficult for cyber investigators or linguistic experts to interpret. In this article, based on the End-to-End Digital Investigation (EEDI) framework, we propose a visualizable evidence-driven approach, named VEA, which aims to facilitate the work of cyber investigation. Our comprehensive controlled experiment and our stratified experiment on the real-life Enron email dataset demonstrate that our approach can achieve even higher accuracy than traditional methods, while its output can be easily visualized and interpreted as evidential traits. In addition to identifying the most plausible author of a given text, our approach also estimates the confidence of the predicted result based on a given identification context and presents visualizable linguistic evidence for each candidate.


Published in

ACM Transactions on Information and System Security, Volume 17, Issue 3
March 2015
124 pages
ISSN: 1094-9224
EISSN: 1557-7406
DOI: 10.1145/2744298
Editor: Gene Tsudik

Copyright © 2015 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 9 March 2015
• Accepted: 1 November 2014
• Revised: 1 October 2014
• Received: 1 March 2014

Qualifiers

• research-article
• Research
• Refereed
