Abstract
The authorship identification task aims to identify the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Moreover, existing Urdu authorship identification solutions lose accuracy as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. In addition, due to the unavailability of reliable POS taggers and sentence segmenters, all existing authorship identification studies on Urdu text are limited to word n-gram features. To overcome these limitations, we formulate a stylometric feature space that is not limited to word n-gram features. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on a nearest neighbors classifier to predict the original author of the anonymous text sample. To evaluate our solution, we create a significantly larger corpus than those used in existing studies and conduct several experiments, which show that our solution overcomes the limitations of existing studies and achieves an accuracy of 94.03%, higher than all previous authorship identification works.
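The pipeline outlined in the abstract (text sample to point set, then nearest-neighbor prediction) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the whitespace tokenization, the three per-chunk features in `chunk_features`, the `chunk_size` parameter, and the choice of a modified Hausdorff distance to compare point sets. The candidate retrieval step is replaced by a plain scan over all labeled samples.

```python
import math


def chunk_features(words):
    """Hypothetical stylometric features for one chunk of words (one point)."""
    n = len(words)
    avg_word_len = sum(len(w) for w in words) / n
    type_token_ratio = len(set(words)) / n
    short_word_ratio = sum(1 for w in words if len(w) <= 3) / n
    return (avg_word_len, type_token_ratio, short_word_ratio)


def to_point_set(text, chunk_size=20):
    """Split a text into fixed-size word chunks; each chunk becomes one point."""
    words = text.split()  # assumption: whitespace tokenization
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [chunk_features(c) for c in chunks if c]


def directed_distance(a, b):
    """Mean distance from each point in a to its closest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def modified_hausdorff(a, b):
    """Symmetric point-set distance: the larger of the two directed distances."""
    return max(directed_distance(a, b), directed_distance(b, a))


def predict_author(anonymous_text, labeled_samples, chunk_size=20):
    """1-nearest-neighbor prediction over (author, text) pairs."""
    query = to_point_set(anonymous_text, chunk_size)
    if not query:
        raise ValueError("anonymous text contains no words")
    best_author, best_dist = None, math.inf
    for author, text in labeled_samples:
        points = to_point_set(text, chunk_size)
        if not points:
            continue  # skip empty training samples
        d = modified_hausdorff(query, points)
        if d < best_dist:
            best_author, best_dist = author, d
    return best_author
```

Calling `predict_author(text, [("author_a", sample_a), ("author_b", sample_b)])` returns the label of the closest labeled sample. A real system would need a proper Urdu tokenizer, a richer feature space, and an index to avoid scanning every sample.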
UrduAI: Writeprints for Urdu Authorship Identification