research-article

Urdu Short Paraphrase Detection at Sentence Level

Published: 12 April 2023

Abstract

Paraphrase detection systems uncover the relationship between two text fragments and classify them as paraphrased when they convey the same idea and as non-paraphrased otherwise. Previous research has mainly focused on developing paraphrase detection resources for English, and there have been very few efforts for South Asian languages. In particular, no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resource language, mainly because no sentence-level corpora are available. The existing studies on Urdu focus only on text reuse detection at the passage and document levels. Therefore, this study aims to develop a large-scale, manually annotated benchmark corpus for Urdu paraphrase detection at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. As a secondary contribution, several techniques were proposed, developed, and compared, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques, with N-gram treated as the baseline technique. The experimental results indicate that the proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task. Furthermore, performance increases when the features of the proposed ST technique and the baseline N-gram technique are combined for the classification task. In addition, the proposed techniques were applied to the UPPC corpus to check their performance at the document level. The best result was obtained using the feature-fusion technique (F1 = 0.855). Our corpus is available and free to download for research purposes.
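
The abstract does not specify the exact features, similarity measures, embedding models, or classifier used, so the following is only an illustrative sketch of what fusing N-gram features with sentence-embedding features could look like. It assumes word n-gram Jaccard overlap as the N-gram feature and cosine similarity between two precomputed sentence vectors as the embedding feature. The function names (`word_ngrams`, `jaccard`, `cosine`, `fused_features`) and the `max_n` parameter are hypothetical and not taken from the paper.

```python
from math import sqrt


def word_ngrams(sentence, n):
    """Return the set of word n-grams of a whitespace-tokenised sentence."""
    tokens = sentence.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def fused_features(s1, s2, emb1, emb2, max_n=3):
    """Concatenate n-gram overlap scores (n = 1..max_n) with an embedding cosine.

    emb1 and emb2 are assumed to be sentence vectors produced elsewhere,
    for example by a sentence-transformer model (an assumption of this sketch).
    """
    ngram_scores = [
        jaccard(word_ngrams(s1, n), word_ngrams(s2, n))
        for n in range(1, max_n + 1)
    ]
    return ngram_scores + [cosine(emb1, emb2)]
```

The resulting feature vector for each sentence pair could then be passed to any binary classifier to predict paraphrased versus non-paraphrased; which classifier the authors used is not stated in the abstract.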

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 4
      April 2023
      682 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3588902

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 12 April 2023
      • Online AM: 28 February 2023
      • Accepted: 24 February 2023
      • Revised: 22 February 2023
      • Received: 15 October 2022
