Abstract
Paraphrase detection systems examine the relationship between two text fragments and classify them as paraphrased when they convey the same idea, and as non-paraphrased otherwise. Previous research has mainly focused on developing paraphrase detection resources for English, and there have been very few efforts for South Asian languages. In particular, no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resource language, mainly because no sentence-level corpora are available. Existing studies on Urdu focus only on text reuse detection at the passage and document levels. Therefore, this study develops a large-scale, manually annotated benchmark corpus for sentence-level Urdu paraphrase detection, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. As a secondary contribution, several techniques were proposed, developed, and compared, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques, with N-gram treated as the baseline technique. The experimental results indicate that the proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task: performance increases when the features of the proposed ST technique and the baseline N-gram technique are combined for classification. The proposed techniques were also applied to the UPPC corpus to check their performance at the document level. The best result was obtained using the feature-fusion technique (F1 = 0.855). The corpus is freely available to download for research purposes.
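The feature-fusion idea can be illustrated with a minimal sketch. The abstract does not specify the exact features, similarity measures, or classifier, so everything below is an assumption made for illustration: the function names are hypothetical, Jaccard overlap stands in for the N-gram baseline features, and `toy_embedding` is a character-bigram placeholder for a real sentence-transformer encoder.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n):
    """Return the set of word n-grams of a whitespace-tokenised sentence."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(a, b, n):
    """Jaccard overlap of word n-gram sets (assumed baseline feature)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def toy_embedding(text):
    """Placeholder for a sentence encoder: character-bigram counts.

    A real system would call a sentence-transformer model here instead.
    """
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def fused_features(a, b, max_n=3):
    """Concatenate lexical (n-gram) and embedding-based similarity features."""
    lexical = [ngram_overlap(a, b, n) for n in range(1, max_n + 1)]
    semantic = [cosine(toy_embedding(a), toy_embedding(b))]
    return lexical + semantic
```

Under these assumptions, each sentence pair yields one fused vector (three n-gram overlaps plus one embedding similarity), which would then be passed to any standard classifier trained on the paraphrased and non-paraphrased labels.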
Index Terms
Urdu Short Paraphrase Detection at Sentence Level