Abstract
Semantic word similarity is a quantitative measure of how similar two words are in context. Evaluating semantic word similarity models requires a benchmark corpus. However, despite Urdu's millions of speakers and the large volume of Urdu digital text on the Internet, there is no benchmark corpus for the Cross-lingual Semantic Word Similarity task for the Urdu language. This article reports our efforts to develop such a corpus. The newly developed corpus is based on the SemEval-2017 Task 2 English dataset and contains 1,945 cross-lingual English–Urdu word pairs. For each word pair, semantic similarity scores were assigned by 11 native Urdu speakers. In addition to corpus generation, this article reports the evaluation results of a baseline approach, namely “Translation Plus Monolingual Analysis,” for automatically identifying the semantic similarity of English–Urdu word pairs. The results showed that the path length similarity measure performs better for words translated with Google and Bing. The newly created corpus and evaluation results are freely available online for further research and development.
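To make the baseline idea concrete, the following is a minimal Python sketch of a translate-then-compare pipeline: the Urdu word is first mapped to English, and a monolingual path-length measure is then applied to the two English words. Everything here is an illustrative assumption rather than the article's actual setup: the toy taxonomy `HYPERNYMS`, the lookup table `URDU_TO_ENGLISH` (a stand-in for a machine translation system), and all function names are hypothetical, and the score uses one common formulation of path similarity, 1 / (1 + shortest path length).

```python
from collections import deque

# Hypothetical toy taxonomy: child -> parent (hypernym) links.
HYPERNYMS = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "car": "vehicle",
    "vehicle": "artifact",
}

# Hypothetical stand-in for a machine translation system (Urdu -> English).
URDU_TO_ENGLISH = {
    "کتا": "dog",
    "بلی": "cat",
    "گاڑی": "car",
}


def build_graph(hypernyms):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernyms.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the edge count or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, word_a, word_b):
    """Path-length similarity: 1 / (1 + shortest path length)."""
    dist = shortest_path_length(graph, word_a, word_b)
    return None if dist is None else 1.0 / (1.0 + dist)


def cross_lingual_similarity(english_word, urdu_word, graph, translations):
    """Translate the Urdu word, then score the resulting English pair."""
    translated = translations.get(urdu_word)
    if translated is None:
        return None  # no translation available for this word
    return path_similarity(graph, english_word, translated)


GRAPH = build_graph(HYPERNYMS)
```

In this sketch, a pair whose words are not connected in the taxonomy, or whose Urdu word has no translation, yields `None` rather than a score of zero, so that missing coverage stays distinguishable from genuine dissimilarity. A real system would replace the lookup table with a translation service and the toy taxonomy with a full lexical resource.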
Developing a Cross-lingual Semantic Word Similarity Corpus for English–Urdu Language Pair