Abstract
Authorship attribution is the task of examining writing style to determine which author, from a given set of candidates, most likely wrote a document. Owing to its wide range of applications, authorship attribution has been studied extensively for various Western and Asian languages. However, authorship attribution research for Urdu has only just begun, even though Urdu is widely acknowledged as a prominent South Asian language. Furthermore, existing studies on Urdu have addressed the considerably easier problem of fewer than 20 candidate authors, which is far from real-world settings, so their findings may not carry over to such settings. To address this gap, we make three key contributions. First, we have developed a large authorship attribution corpus for Urdu, a low-resource language. The corpus comprises over 2.6 million tokens and 21,938 news articles by 94 authors, which makes it a closer approximation of real-world settings. Second, we have analyzed hundreds of stylometry features used in the literature, identified 194 features that are applicable to Urdu, and developed a taxonomy of these features. Finally, we have performed 66 experiments on two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show that (a) our corpus is many times larger than the existing corpora and is more challenging for the authorship attribution task, and (b) Convolutional Neural Networks are the most effective technique, achieving a nearly perfect F1 score of 0.989 on an existing corpus and 0.910 on our newly developed corpus.