Abstract
Cross-Lingual Text Reuse Detection (CLTRD) has recently attracted the attention of the research community because large amounts of digital text are readily available for reuse in multiple languages through online digital repositories. In addition, efficient machine translation systems are freely available to translate text from one language into another, which makes it easy to reuse text across languages and, consequently, difficult to detect such reuse. In the literature, the most prominent and widely used approach for CLTRD is Translation plus Monolingual Analysis (T+MA). To detect cross-lingual text reuse for the English-Urdu language pair, T+MA has so far been used only with lexical approaches, namely N-gram Overlap, Longest Common Subsequence, and Greedy String Tiling. This shows that T+MA has not been thoroughly explored for the English-Urdu language pair. To fill this gap, this study presents an in-depth comparison of 26 approaches based on T+MA. These approaches include semantic similarity approaches (semantic tagger-based and WordNet-based approaches), a probabilistic approach (Kullback-Leibler distance), monolingual word embedding-based approaches, a Siamese recurrent architecture, and monolingual sentence transformer-based approaches for the English-Urdu language pair. The evaluation was carried out on the CLEU benchmark corpus for both the binary and the ternary classification tasks. Our extensive experimentation shows that our proposed approach, a combination of the 26 approaches, obtained F1 scores of 0.77 and 0.61 for the binary and ternary classification tasks, respectively, and outperformed the previously reported approaches [41] (F1 = 0.73 for the binary and F1 = 0.55 for the ternary classification task) on the CLEU corpus.
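To make the T+MA idea concrete, the following is a minimal sketch of the lexical baseline mentioned above (word N-gram Overlap), not the implementation evaluated in the study. It assumes that the Urdu text is first translated into English and then compared with the English text in a single language. The function names, the whitespace tokenisation, the containment formulation of overlap, and the `toy_translate` lookup (a stand-in for a real machine translation system, with a placeholder key instead of real Urdu text) are all illustrative assumptions.

```python
from typing import Callable, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a lowercased, whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(derived: str, source: str, n: int = 1) -> float:
    """Fraction of the derived text's distinct n-grams that also occur in the source."""
    derived_grams = word_ngrams(derived, n)
    if not derived_grams:
        return 0.0
    return len(derived_grams & word_ngrams(source, n)) / len(derived_grams)


def tma_score(
    source_en: str,
    derived_ur: str,
    translate_ur_to_en: Callable[[str], str],
    n: int = 1,
) -> float:
    """Translation step followed by monolingual (English-English) analysis."""
    derived_en = translate_ur_to_en(derived_ur)  # translation step
    return ngram_containment(derived_en, source_en, n)  # monolingual analysis step


def toy_translate(text: str) -> str:
    """Hypothetical stand-in for a machine translation system (assumption)."""
    lookup = {"UR_SENTENCE_1": "the cat sat on the mat"}  # placeholder key, not real Urdu
    return lookup.get(text, "")


if __name__ == "__main__":
    score = tma_score("a cat sat on a mat", "UR_SENTENCE_1", toy_translate, n=1)
    print(f"unigram containment: {score:.2f}")
```

A score like this could then be thresholded, or passed as a feature to a classifier, to decide between reuse classes. How the study maps similarity scores to the binary and ternary labels is not stated in the abstract, so that step is left out here.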
- [1] . 2015. PDLK: Plagiarism detection using linguistic knowledge. Exp. Syst. Applic. 42, 22 (2015), 8936–8946.
- [2] . 2012. The construction of Indonesian-English cross language plagiarism detection system using fingerprinting technique. Jurnal Ilmu Komputer dan Informasi 5, 1 (2012), 16–23.
- [3] . 2014. Arabic-English cross-language plagiarism detection using winnowing algorithm. Inf. Technol. J. 13, 14 (2014), 2349.
- [4] . 2015. Developing bilingual plagiarism detection corpus using sentence aligned parallel corpus: Notebook for PAN at CLEF 2015. In CLEF (Working Notes). Retrieved from https://pan.webis.de/downloads/publications/papers/asghari_2015.pdf.
- [5] . 2019. CrossLang: The system of cross-lingual plagiarism detection. In Workshop on Document Intelligence at NeurIPS 2019. Retrieved from https://openreview.net/pdf?id=BkxiG6qqIr.
- [6] . 2013. Methods for cross-language plagiarism detection. Knowl.-based Syst. 50 (2013), 211–217.
- [7] . 2010. Plagiarism detection across distant language pairs. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling’10). 37–45.
- [8] . 2008. On cross-lingual plagiarism analysis using a statistical model. PAN 212 (2008). Retrieved from https://webis.de/events/pan-08/pan08-talks/barroncedeno08a-talk-cross-lingual-plagiarism-analysis-using-statistical-model.pdf.
- [9] . 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 632–642. DOI: https://doi.org/10.18653/v1/D15-1075.
- [10] . 2018. Universal Sentence Encoder. arXiv:cs.CL/1803.11175.
- [11] . 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 670–680. DOI: https://doi.org/10.18653/v1/D17-1070.
- [12] . 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:cs.CL/1810.04805.
- [13] . 2020. Language-agnostic BERT Sentence Embedding. arXiv:cs.CL/2007.01852.
- [14] . 2021. Russian paraphrasers: Paraphrase with transformers. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing. 11–19.
- [15] . 2016. A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection. In Proceedings of the 10th Edition of the Language Resources and Evaluation Conference.
- [16] . 2017. Deep investigation of cross-language plagiarism detection methods. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora. Association for Computational Linguistics, 6–15. DOI: https://doi.org/10.18653/v1/W17-2502.
- [17] . 2017. Using word embedding for cross-language plagiarism detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, 415–421. Retrieved from https://www.aclweb.org/anthology/E17-2066.
- [18] . 2013. Cross-language plagiarism detection using a multilingual semantic network. In Proceedings of the European Conference on Information Retrieval. Springer, 710–713.
- [19] . 2016. Cross-language plagiarism detection over continuous-space-and knowledge graph-based representations of language. Knowl.-based Syst. 111 (2016), 87–99.
- [20] . 2016. Word embedding evaluation and combination. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). 300–305.
- [21] . 2020. CORD19STS: COVID-19 Semantic Textual Similarity Dataset. arXiv:cs.CL/2007.02461.
- [22] . 2012. Cross-language high similarity search using a conceptual thesaurus. In Proceedings of the International Conference of the Cross-language Evaluation Forum for European Languages. Springer, 67–75.
- [23] . 2019. Design and development of a large cross-lingual plagiarism corpus for Urdu-English language pair. Sci. Program. 2019, Article ID 2962040 (2019), 11 pages. https://doi.org/10.1155/2019/2962040
- [24] . 2020. QuASE: Question-answer Driven Sentence Encoding. arXiv:cs.CL/1909.00333.
- [25] . 2017. Gain customer insights using NLP techniques. SAE International Journal of Materials and Manufacturing 10, 3 (2017), 333–337.
- [26] . 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 1735–1780.
- [27] . 2008. Similarity measures for text document clustering. In Proceedings of the 6th New Zealand Computer Science Research Student Conference (NZCSRSC’08). 9–56.
- [28] . 2020. SentiLARE: Linguistic knowledge enhanced language representation for sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 6975–6988.
- [29] . 2015. Short text similarity with word embeddings. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 1411–1420.
- [30] . 2018. 2L-APD: A two-level plagiarism detection system for Arabic documents. Cybern. Inf. Technol. 18, 1 (2018), 124–138.
- [31] . 2013. Cross lingual text reuse detection based on keyphrase extraction and similarity measures. In Multilingual Information Access in South Asian Languages. Springer, 71–78.
- [32] . 2006. Record linkage: Similarity measures and algorithms. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 802–803.
- [33] . 2016. Cosine similarity to determine similarity measure: Study case in online essay assessment. In Proceedings of the 4th International Conference on Cyber and IT Service Management. IEEE, 1–6.
- [34] . 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:cs.CL/1907.11692.
- [35] . 2015. The comparation of distance-based similarity measure to detection of plagiarism in Indonesian text. In Proceedings of the International Conference on Soft Computing, Intelligence Systems, and Information Technology. Springer, 155–164.
- [36] . 2020. rmassidda@ DaDoEval: Document dating using sentence embeddings at EVALITA 2020. In Proceedings of 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA’20). Retrieved from: CEUR.org.
- [37] . 1995. WordNet: A lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
- [38] . 2021. Deep learning based text classification: A comprehensive review. ACM Comput. Surv. 54, 3 (2021), 1–40.
- [39] . 2020. Finding and generating a missing part for story completion. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 156–166.
- [40] . 2016. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
- [41] . 2019. CLEU-A cross-language English-Urdu corpus and benchmark for text reuse experiments. J. Assoc. Inf. Sci. Technol. 70, 7 (2019), 729–741.
- [42] . 2020. Objective-Based Hierarchical Clustering of Deep Embedding Vectors. arXiv:cs.LG/2012.08466.
- [43] . 2020. Using natural language processing to identify similar patent documents. LU-CS-EX (2020). Retrieved from https://lup.lub.lu.se/student-papers/search/publication/9008699.
- [44] . 2016. An IR-based approach utilizing query expansion for plagiarism detection in MEDLINE. IEEE/ACM Trans. Comput. Biol. Bioinf. 14, 4 (2016), 796–804.
- [45] . 2016. Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP. 148–157.
- [46] . 2017. Accurate sentence matching with hybrid siamese networks. In Proceedings of the ACM on Conference on Information and Knowledge Management. 2235–2238.
- [47] . 2016. From Word Embeddings to Item Recommendation. arXiv:cs.LG/1601.01356.
- [48] . 2017. Making Sense of Word Embeddings. arXiv:cs.CL/1708.03390.
- [49] . 2010. A new approach for cross-language plagiarism analysis. In Proceedings of the International Conference of the Cross-language Evaluation Forum for European Languages. Springer, 15–26.
- [50] . 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2227–2237. DOI: https://doi.org/10.18653/v1/N18-1202.
- [51] . 2011. Cross-language plagiarism detection. Lang. Resour. Eval. 45, 1 (2011), 45–62.
- [52] . 2010. Overview of the 2nd international competition on plagiarism detection. In CEUR Workshop Proceedings, Vol. 1176.
- [53] . 2011. Overview of the 3rd international competition on plagiarism detection. In CEUR Workshop Proceedings, Vol. 1177.
- [54] . 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, 997–1005.
- [55] . 2020. COMET: A Neural Framework for MT Evaluation. arXiv:cs.CL/2009.09025.
- [56] . 2019. Sentence-BERT: Sentence Embeddings Using Siamese BERT-networks. arXiv:cs.CL/1908.10084.
- [57] . 2020. Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. arXiv:cs.CL/2004.09813.
- [58] . 2017. Measuring short text reuse for the Urdu language. IEEE Access 6 (2017), 7412–7421.
- [59] . 2020. Mono- and Cross-lingual Paraphrased Text Reuse and Extrinsic Plagiarism Detection. Ph.D. Dissertation. Lancaster University (United Kingdom).
- [60] . 2021. Augmented SBERT: Data Augmentation Method for Improving Bi-encoders for Pairwise Sentence Scoring Tasks. arXiv:cs.CL/2010.08240.
- [61] Princeton University. 2010. About WordNet. Retrieved from https://wordnet.princeton.edu/citing-wordnet.
- [62] . 2016. A siamese long short-term memory architecture for human re-identification. In Proceedings of the European Conference on Computer Vision. Springer, 135–153.
- [63] . 2016. A survey on similarity measures in text mining. Mach. Learn. Applic. 3, 2 (2016), 19–28.
- [64] . 2020. Siamese neural network architecture for homoglyph attacks detection. ICT Express 6, 1 (2020), 16–19.
- [65] . 2019. Cross-lingual document similarity estimation and dictionary generation with comparable corpora. Knowl. Inf. Syst. 58, 3 (2019), 729–743.
- [66] . 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 1112–1122. DOI: https://doi.org/10.18653/v1/N18-1101.
- [67] . 2017. Siamese recurrent architecture for visual tracking. In Proceedings of the IEEE International Conference on Image Processing (ICIP’17). IEEE, 1152–1156.
- [68] . 2021. Pretrained transformers for text ranking: BERT and beyond. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 1154–1156.
- [69] . 2018. Rule-based vs. neural net approaches to semantic textual similarity. In Proceedings of the 1st Workshop on Linguistic Resources for Natural Language Processing. 12–17.
Index Terms
Cross-lingual Text Reuse Detection Using Translation Plus Monolingual Analysis for English-Urdu Language Pair