
Developing a Cross-lingual Semantic Word Similarity Corpus for English–Urdu Language Pair

Published: 18 November 2021

Abstract

Semantic word similarity is a quantitative measure of how contextually similar two words are. Evaluating semantic word similarity models requires a benchmark corpus. However, despite the millions of Urdu speakers and the large volume of Urdu digital text on the Internet, no benchmark corpus exists for the Cross-lingual Semantic Word Similarity task for the Urdu language. This article reports our efforts in developing such a corpus. The newly developed corpus is based on the SemEval-2017 Task 2 English dataset and contains 1,945 cross-lingual English–Urdu word pairs. For each word pair, semantic similarity scores were assigned by 11 native Urdu speakers. In addition to corpus generation, this article reports the evaluation results of a baseline approach, namely “Translation Plus Monolingual Analysis,” for automated identification of semantic similarity between English–Urdu word pairs. The results showed that the path length similarity measure performs better for words translated with Google and Bing. The newly created corpus and evaluation results are freely available online for further research and development.
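The abstract does not spell out the implementation of the baseline, so the following is only a minimal sketch of the general idea behind "Translation Plus Monolingual Analysis" with a path length measure. It assumes a hypothetical toy hypernym taxonomy (`HYPERNYM`) in place of a real lexical resource and a hypothetical lookup table (`UR_TO_EN`) in place of a machine translation service. The similarity used is the common definition 1 / (1 + shortest path length); the names and data are illustrative and are not taken from the article.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent); a stand-in for a real
# lexical resource such as WordNet. Not taken from the article.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
    "car": "vehicle",
    "vehicle": "artifact",
    "artifact": "entity",
}

# Hypothetical Urdu -> English lookup; a stand-in for a translation service.
UR_TO_EN = {"کتا": "dog", "بلی": "cat", "گاڑی": "car"}


def build_graph(hypernym):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernym.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the edge count or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, word_a, word_b):
    """Path length similarity: 1 / (1 + shortest path length)."""
    dist = shortest_path_length(graph, word_a, word_b)
    return None if dist is None else 1.0 / (1 + dist)


def cross_lingual_similarity(graph, english_word, urdu_word, table=UR_TO_EN):
    """Translate the Urdu word to English, then score monolingually."""
    translated = table.get(urdu_word)
    if translated is None:
        return None  # no translation available in the toy table
    return path_similarity(graph, english_word, translated)


GRAPH = build_graph(HYPERNYM)
```

Calling `cross_lingual_similarity(GRAPH, "dog", "بلی")` translates the Urdu word to "cat" and scores it against "dog" on the toy taxonomy. In a full evaluation, scores like these would be compared with the human similarity judgments in the corpus.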


    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 2
      March 2022
      413 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3494070

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 18 November 2021
      • Accepted: 1 June 2021
      • Revised: 1 April 2021
      • Received: 1 April 2020
      Qualifiers

      • research-article
      • Refereed
