Abstract
Authorship identification is the process of extracting and analysing an author's writing style to determine who wrote a text. Writing style can reveal the author and his or her characteristics, which is very useful in digital forensics and cyber investigations. In the literature, authorship identification tasks have been addressed on both long and short documents and in different languages, such as English, Arabic, Chinese, and Greek. This survey reviews authorship identification tasks for the Arabic language, contributing to this area of research by exploring performance and challenges specific to Arabic. A total of 27 prominent Arabic studies, covering each authorship identification domain, were reviewed with respect to the data used, the features selected, the methods applied, and the results obtained. The review concludes that the results of authorship identification tasks depend mostly on the selected features and the dataset used. Furthermore, the effective features differ from one dataset to another according to the variety of Arabic involved. However, all authorship identification tasks involving the Arabic language face considerable data pre-processing challenges because of Arabic's concatenative morphology.
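As a rough illustration of what "extracting a writing style" can mean in practice, the sketch below builds character n-gram profiles and assigns an unknown text to the candidate author with the most similar profile. It is a minimal sketch, not a method taken from any of the surveyed studies: the function names (`char_ngrams`, `cosine`, `attribute`), the trigram default, and the choice of cosine similarity are all assumptions made for illustration, and the Arabic-specific pre-processing that the abstract identifies as the main difficulty is deliberately left out.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a text (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    # Empty profiles have no direction, so treat them as dissimilar.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(unknown: str, known: dict[str, str], n: int = 3) -> str:
    """Return the candidate author whose profile is closest to the unknown text.

    `known` maps an author label to a sample of that author's writing.
    """
    target = char_ngrams(unknown, n)
    return max(
        known,
        key=lambda author: cosine(target, char_ngrams(known[author], n)),
    )
```

Character n-grams are used here only because they need no tokenizer or morphological analyser. A realistic pipeline for Arabic would add normalisation and segmentation before feature extraction, and those steps are where the pre-processing challenges noted above arise.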
Index Terms
Survey of Authorship Identification Tasks on Arabic Texts