research-article

Authorship Attribution for a Resource Poor Language—Urdu

Published: 13 December 2021

Abstract

Authorship attribution refers to examining the writing style of authors to determine the most likely author of a document from a given set of candidate authors. Because authorship attribution has a wide range of applications, a plethora of studies have been conducted for various Western, as well as Asian, languages. However, authorship attribution research in the Urdu language has only just begun, although Urdu is widely acknowledged as a prominent South Asian language. Furthermore, the existing studies on authorship attribution in Urdu have addressed the considerably easier problem of having fewer than 20 candidate authors, which is far from real-world settings. Therefore, the findings from these studies may not be applicable to real-world settings. To that end, we have made three key contributions: First, we have developed a large authorship attribution corpus for Urdu, which is a low-resource language. The corpus is composed of over 2.6 million tokens and 21,938 news articles by 94 authors, which makes it a closer substitute for real-world settings. Second, we have analyzed hundreds of stylometry features used in the literature to identify 194 features that are applicable to the Urdu language and developed a taxonomy of these features. Finally, we have performed 66 experiments using two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show the following: (a) our developed corpus is many times larger than the existing corpora, and it is more challenging than its counterparts for the authorship attribution task, and (b) the Convolutional Neural Network is the most effective technique, as it achieved a nearly perfect F1 score of 0.989 for an existing corpus and 0.910 for our newly developed corpus.

      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 3
        May 2022
        413 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3505182

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 13 December 2021
        • Accepted: 1 August 2021
        • Revised: 1 July 2021
        • Received: 1 August 2020

        Qualifiers

        • research-article
        • Refereed