Abstract
Authorship identification aims to identify the true author of an anonymous document from a set of candidate authors. The task has applications in several domains, such as law enforcement and information retrieval. These application domains are not limited to a specific language, community, or ethnicity. However, most existing solutions are designed for English, and little attention has been paid to Thai. The existing solutions are not directly applicable to Thai because of the linguistic differences between the two languages. Moreover, the existing solution designed for Thai is unable to (i) handle outliers in the dataset, (ii) scale as the set of candidate authors grows, and (iii) perform well when the number of writing samples per candidate author is low. We identify a stylometric feature space for the Thai authorship identification task. Based on this feature space, we present an authorship identification solution that transforms each document into a collection of point sets and classifies it with the probabilistic k nearest neighbors classifier. Specifically, this document transformation allows us to (i) use set distance measures associated with an outlier handling mechanism, (ii) capture stylistic variations within a document, and (iii) produce multiple predictions for a query document. We create a new Thai authorship identification corpus containing 547 documents from 200 authors, which is significantly larger than the corpus used by the existing study (a 32-fold increase in the number of candidate authors). The experimental results show that our solution overcomes the limitations of the existing solution and outperforms all competitors, with an accuracy of 91.02%. Moreover, we investigate the effectiveness of each category of stylometric features with an ablation study. We found that combining all categories of stylometric features outperforms every other combination.
Finally, we cross-compare the feature spaces and classification methods of all solutions. We found that (i) our solution scales as the number of candidate authors increases, (ii) our method outperforms all the competitors, and (iii) our feature space provides better performance than the feature space used by the existing study.
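The point-set idea described above can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: each document is taken to be already converted into a list of numeric feature vectors (one per text fragment), the outlier-tolerant set distance is taken to be a partial (ranked) Hausdorff distance, and the classifier is a plain vote-fraction k nearest neighbors rather than the probabilistic k nearest neighbors method the paper uses. All function names, parameters, and toy values are hypothetical.

```python
import math
from collections import Counter


def directed_partial_hausdorff(a, b, frac=0.8):
    """Ranked nearest-neighbour distance from point set a to point set b.

    Taking the frac-quantile instead of the maximum lets a few outlying
    points in `a` be ignored (an assumed outlier handling mechanism).
    """
    nearest = sorted(min(math.dist(p, q) for q in b) for p in a)
    rank = max(1, math.ceil(frac * len(nearest)))
    return nearest[rank - 1]


def set_distance(a, b, frac=0.8):
    """Symmetric partial Hausdorff distance between two point sets."""
    return max(
        directed_partial_hausdorff(a, b, frac),
        directed_partial_hausdorff(b, a, frac),
    )


def knn_author_probabilities(query, corpus, k=3, frac=0.8):
    """Return {author: vote share} over the k nearest documents.

    `corpus` is a list of (author, point_set) pairs. The vote share is a
    simple stand-in for a probabilistic prediction, not the paper's method.
    """
    ranked = sorted(corpus, key=lambda item: set_distance(query, item[1], frac))
    votes = Counter(author for author, _ in ranked[:k])
    total = sum(votes.values())
    return {author: count / total for author, count in votes.items()}


# Toy data: two hypothetical authors, two documents each, 2-D feature vectors.
toy_corpus = [
    ("A", [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]),
    ("A", [(0.1, 0.0), (0.0, 1.1), (1.0, 0.1)]),
    ("B", [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]),
    ("B", [(5.1, 5.0), (5.0, 6.1), (6.0, 5.1)]),
]
# The last point is a deliberate outlier fragment.
toy_query = [(0.0, 0.1), (0.1, 1.0), (1.0, 0.0), (9.0, 9.0)]
toy_probs = knn_author_probabilities(toy_query, toy_corpus, k=3, frac=0.75)
```

In this toy setup the outlying fragment dominates the distance when `frac=1.0` (the classic Hausdorff distance) but is ignored at `frac=0.75`, which is the kind of behaviour an outlier handling mechanism is meant to provide.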
StyloThai: A Scalable Framework for Stylometric Authorship Identification of Thai Documents