short-paper

StyloThai: A Scalable Framework for Stylometric Authorship Identification of Thai Documents

Published: 09 January 2020

Abstract

Authorship identification helps to identify the true author of a given anonymous document from a set of candidate authors. Applications of this task can be found in several domains, such as law enforcement and information retrieval. These application domains are not limited to a specific language, community, or ethnicity. However, most existing solutions are designed for English, and little attention has been paid to Thai. The English-oriented solutions are not directly applicable to Thai because of the linguistic differences between the two languages. Moreover, the existing solution designed for Thai is unable to (i) handle outliers in the dataset, (ii) scale as the candidate author set grows, and (iii) perform well when the number of writing samples for each candidate author is low. We identify a stylometric feature space for the Thai authorship identification task. Based on this feature space, we present an authorship identification solution that uses the probabilistic k-nearest-neighbors classifier by transforming each document into a collection of point sets. Specifically, this document transformation allows us to (i) use set distance measures associated with an outlier handling mechanism, (ii) capture stylistic variations within a document, and (iii) produce multiple predictions for a query document. We create a new Thai authorship identification corpus containing 547 documents from 200 authors, which is significantly larger than the corpus used by the existing study (a 32-fold increase in the number of candidate authors). The experimental results show that our solution overcomes the limitations of the existing solution and outperforms all competitors with an accuracy of 91.02%. Moreover, we investigate the effectiveness of each stylometric feature category with an ablation study. We found that combining all categories of stylometric features outperforms the other combinations.
Finally, we cross-compare the feature spaces and classification methods of all solutions. We found that (i) our solution scales as the number of candidate authors increases, (ii) our method outperforms all competitors, and (iii) our feature space provides better performance than the feature space used by the existing study.
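The point-set idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's method: the toy feature (relative symbol frequencies over a small alphabet), the fixed fragment size, the averaged Hausdorff-style set distance, and the function names are all hypothetical. In particular, plain vote shares among the k nearest point sets stand in for the probabilistic k-nearest-neighbors classifier. Averaging nearest-point distances, rather than taking their maximum, makes a single atypical fragment less influential.

```python
import math
from collections import Counter


def fragment_features(text, alphabet):
    """Map one text fragment to a point: relative frequency of each symbol."""
    counts = Counter(text)
    total = max(len(text), 1)
    return [counts[ch] / total for ch in alphabet]


def to_point_set(document, alphabet, size=20):
    """Split a document into fixed-size fragments and featurize each one."""
    chunks = [document[i:i + size] for i in range(0, len(document), size)]
    return [fragment_features(chunk, alphabet) for chunk in chunks if chunk]


def directed_distance(a, b):
    """Average, over points in a, of the distance to the nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def set_distance(a, b):
    """Symmetric set distance: the larger of the two directed averages."""
    return max(directed_distance(a, b), directed_distance(b, a))


def knn_author_probabilities(query, labelled_sets, k=3):
    """Return {author: share of the k nearest point sets written by them}.

    Vote shares are a simplification, not a probabilistic kNN model.
    """
    ranked = sorted(labelled_sets, key=lambda item: set_distance(query, item[1]))
    nearest = ranked[:k]
    votes = Counter(author for author, _ in nearest)
    return {author: n / len(nearest) for author, n in votes.items()}


if __name__ == "__main__":
    # Toy data only: two hypothetical authors with different symbol habits.
    alphabet = "ab"
    training = [
        ("author_1", to_point_set("aaaaaaaaab" * 4, alphabet)),
        ("author_1", to_point_set("a" * 40, alphabet)),
        ("author_2", to_point_set("bbbbbbbbba" * 4, alphabet)),
    ]
    query = to_point_set("aaaaaaaabb" * 4, alphabet)
    print(knn_author_probabilities(query, training, k=2))
```

Because each document becomes several points rather than one vector, stylistic variation between fragments is preserved, and the returned dictionary can rank more than one candidate author for a query document.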


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 3
May 2020
228 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3378675

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 9 January 2020
• Revised: 1 October 2019
• Accepted: 1 October 2019
• Received: 1 August 2019

Qualifiers

• short-paper
• Research
• Refereed
