research-article

UrduAI: Writeprints for Urdu Authorship Identification

Published: 31 October 2021

Abstract

The authorship identification task aims to identify the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Moreover, the accuracy of existing Urdu authorship identification solutions drops as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. In addition, due to the unavailability of reliable POS taggers or sentence segmenters, all existing authorship identification studies on Urdu text are limited to word n-gram features. To overcome these limitations, we formulate a stylometric feature space that is not limited to word n-gram features. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on a nearest neighbors classifier to predict the original author of the anonymous text sample. To evaluate our solution, we create a significantly larger corpus than those used in existing studies and conduct several experimental studies, which show that our solution overcomes the limitations of existing studies and achieves an accuracy of 94.03%, which is higher than that of all previous authorship identification works.
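
The pipeline outlined in the abstract (text sample to point set, then nearest neighbors prediction) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration, not the authors' method: the three toy features, the fixed-size token windows, the averaged nearest-point distance between point sets, and the names `to_point_set`, `point_set_distance`, and `predict_author` are all hypothetical. The abstract does not state which features or which distance the actual system uses, and the candidate retrieval step is replaced here by a brute-force scan.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]

# Assumed punctuation inventory (Urdu full stop, comma, question mark, plus ASCII marks).
PUNCTUATION = set("۔،؟!.,?;:")


def fragment_features(tokens: Sequence[str]) -> Point:
    """Map one non-empty fragment to a toy stylometric vector (assumed features)."""
    n = len(tokens)
    mean_word_length = sum(len(t) for t in tokens) / n
    type_token_ratio = len(set(tokens)) / n
    punctuation_rate = sum(1 for t in tokens if t[-1] in PUNCTUATION) / n
    # Note: features are left unscaled here; a real system would normalize them.
    return (mean_word_length, type_token_ratio, punctuation_rate)


def to_point_set(text: str, window: int = 50) -> List[Point]:
    """Split a text into whitespace-token windows and turn each into a point."""
    tokens = text.split()
    chunks = [tokens[i:i + window] for i in range(0, len(tokens), window)]
    return [fragment_features(chunk) for chunk in chunks]


def directed_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Average distance from each point in a to its nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def point_set_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric point-set distance: the larger of the two directed averages."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(directed_distance(a, b), directed_distance(b, a))


def predict_author(anonymous: str,
                   candidates: Dict[str, List[str]],
                   window: int = 50) -> str:
    """1-nearest-neighbor prediction over all candidate samples (brute force)."""
    query = to_point_set(anonymous, window)
    best_author, best_distance = "", math.inf
    for author, samples in candidates.items():
        for sample in samples:
            d = point_set_distance(query, to_point_set(sample, window))
            if d < best_distance:
                best_author, best_distance = author, d
    return best_author
```

With a toy corpus such as `{"A": ["aa bb cc dd ee ff gg hh"], "B": ["aaaaaaaa bbbbbbbb cccccccc dddddddd"]}` and `window=4`, an anonymous sample made of two-letter words lies closest to author A's point set. Representing a document as several points rather than one averaged vector keeps fragment-level variation visible to the distance function, which is one plausible reason to prefer a point-set representation when training samples are scarce.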

                • Published in

                  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 2
                  March 2022
                  413 pages
                  ISSN: 2375-4699
                  EISSN: 2375-4702
                  DOI: 10.1145/3494070

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 31 October 2021
                  • Accepted: 1 July 2021
                  • Revised: 1 April 2021
                  • Received: 1 September 2020


                  Qualifiers

                  • research-article
                  • Refereed
