Abstract
The Internet provides an ideal anonymous channel for concealing computer-mediated malicious activities, because the network-based origins of critical electronic textual evidence (e.g., emails, blogs, forum posts, and chat logs) can be easily repudiated. Authorship attribution is the study of identifying the actual author of given anonymous documents based on the text itself, and for decades many linguistic stylometry and computational techniques have been studied extensively for this purpose. However, most previous research focuses on improving attribution accuracy, and little work has addressed constructing and visualizing evidential traits. In addition, these sophisticated techniques are difficult for cyber investigators or linguistic experts to interpret. In this article, based on the End-to-End Digital Investigation (EEDI) framework, we propose a visualizable evidence-driven approach, namely VEA, which aims to facilitate the work of cyber investigation. Our comprehensive controlled experiment and our stratified experiment on the real-life Enron email dataset demonstrate that our approach can achieve higher accuracy than traditional methods, while its output can be easily visualized and interpreted as evidential traits. In addition to identifying the most plausible author of a given text, our approach also estimates the confidence of the predicted result in a given identification context and presents visualizable linguistic evidence for each candidate.
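To make the three outputs named in the abstract concrete (a most plausible author, a confidence per candidate, and inspectable linguistic evidence), the following is a minimal illustrative sketch. It is not the VEA algorithm, whose details the abstract does not give. It assumes character trigram profiles compared by cosine similarity, uses each candidate's share of the total similarity as a crude stand-in for confidence, and treats the most heavily shared n-grams as evidence. All function names and parameters are hypothetical.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Relative frequencies of character n-grams in the lowercased text."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1
    return {gram: count / total for gram, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def attribute(anonymous_text, known_texts, n=3, top_k=3):
    """Return (best_author, confidence, evidence) for an anonymous text.

    Assumption: confidence is each candidate's share of the summed
    similarity, a rough proxy rather than a calibrated probability.
    """
    target = ngram_profile(anonymous_text, n)
    profiles = {author: ngram_profile(text, n)
                for author, text in known_texts.items()}
    scores = {author: cosine(target, profile)
              for author, profile in profiles.items()}
    total = sum(scores.values())
    confidence = {author: (score / total if total else 1.0 / len(scores))
                  for author, score in scores.items()}
    # Evidence: n-grams shared with each candidate, ranked by their
    # contribution to the dot product of the two profiles.
    evidence = {}
    for author, profile in profiles.items():
        shared = [(gram, target[gram] * profile[gram])
                  for gram in target if gram in profile]
        shared.sort(key=lambda item: item[1], reverse=True)
        evidence[author] = shared[:top_k]
    best = max(scores, key=scores.get)
    return best, confidence, evidence


if __name__ == "__main__":
    known = {
        "alice": "please find the attached report, kind regards",
        "bob": "yo whats up, gonna call u later lol",
    }
    best, confidence, evidence = attribute(
        "kind regards, please find the report attached", known)
    print(best, confidence, evidence[best])
```

The per-candidate evidence lists are the part an investigator could render visually, for example by highlighting the top shared n-grams in the anonymous text in a color assigned to each candidate.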
A Visualizable Evidence-Driven Approach for Authorship Attribution