Abstract
The semantic word similarity task aims to quantify the degree of similarity between a pair of words. In the literature, efforts have been made to create standard evaluation resources for developing, evaluating, and comparing methods for semantic word similarity. Most of these efforts have focused on English and a few other languages. However, the problem has not been thoroughly explored for South Asian languages, particularly Urdu. To fill this gap, this study presents a large benchmark corpus of 518 word pairs for the Urdu semantic word similarity task, manually annotated by 12 annotators. To demonstrate how the proposed corpus can be used to develop and evaluate Urdu semantic word similarity systems, we applied two state-of-the-art methods: (1) a word embedding-based method and (2) a Sentence Transformer-based method. As another major contribution, we propose a feature fusion method that combines Sentence Transformer and word embedding features. The best results were obtained with the proposed feature fusion method (combining the best features of both methods), with a Pearson correlation of 0.67. To foster research on Urdu, an under-resourced language, the corpus will be made freely and publicly available for research purposes.
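The evaluation described in the abstract can be illustrated with a minimal sketch: a system assigns each word pair a similarity score, and those scores are compared with the human annotations using Pearson correlation. The abstract does not specify how the features are fused or how pairs are scored, so this sketch assumes cosine similarity between vectors and, purely for illustration, fusion by concatenation. The function names `cosine`, `pearson`, `fuse`, and `score_pairs` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fuse(word_vec, sentence_vec):
    """Assumed fusion strategy: concatenate the two feature vectors.

    The paper's actual fusion method may differ; this is an illustration only.
    """
    return list(word_vec) + list(sentence_vec)


def score_pairs(pairs, embed):
    """Score each (word1, word2) pair with cosine similarity of its embeddings.

    `embed` is any callable mapping a word to a vector (hypothetical here).
    """
    return [cosine(embed(w1), embed(w2)) for w1, w2 in pairs]


if __name__ == "__main__":
    # Toy vectors standing in for real word and sentence embeddings.
    word_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    sent_vecs = {"a": [0.5, 0.5], "b": [0.6, 0.4], "c": [0.1, 0.9]}

    def fused_embed(word):
        return fuse(word_vecs[word], sent_vecs[word])

    pairs = [("a", "b"), ("a", "c"), ("b", "c")]
    human_scores = [3.8, 0.5, 0.9]  # made-up annotator averages for the toy pairs
    system_scores = score_pairs(pairs, fused_embed)
    print(pearson(system_scores, human_scores))
```

In practice the toy dictionaries would be replaced by vectors from a trained word embedding model and a Sentence Transformer model, and the human scores by the averaged annotator ratings in the corpus.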
Developing a Large Benchmark Corpus for Urdu Semantic Word Similarity