Abstract
In recent years, Cross-Lingual Text Reuse Detection (CLTRD) has attracted the interest of the research community because of the availability of large digital repositories and automatic Machine Translation (MT) systems. These systems are openly accessible, which makes text easier to reuse across languages and such reuse harder to detect. Previous studies have developed corpora and methods for CLTRD at the sentence/passage level for the English-Urdu language pair. However, large standard corpora and methods for document-level CLTRD are lacking for this language pair. To address this gap, the main contribution of this study is a large benchmark cross-lingual (English-Urdu) text reuse corpus, called the TREU (Text Reuse for English-Urdu) corpus. It contains real cases of English-to-Urdu text reuse at the document level, with the source text in English and the derived text in Urdu. The corpus is manually labelled into three categories (Wholly Derived = 672, Partially Derived = 888, and Non Derived = 697). A second contribution is the evaluation of the TREU corpus with a diverse range of methods, to demonstrate its usefulness and show how it can support the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level. The best results for both the binary (F1 = 0.78) and ternary (F1 = 0.66) classification tasks are obtained using a combination of all Translation plus Mono-lingual Analysis (T+MA) based methods. The TREU corpus is publicly available to promote CLTRD research in an under-resourced language, i.e., Urdu.
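To make the T+MA idea concrete, the following is a minimal sketch and not the method evaluated in the study. It assumes the Urdu derived text is first translated into English and then compared with the English source by word n-gram containment. The function names (`word_ngrams`, `containment`, `tma_score`), the translation direction, the choice of bigrams, and the `translate_ur_to_en` callable are all illustrative assumptions; a real MT system would have to be supplied by the caller.

```python
from typing import Callable, Set, Tuple


def word_ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(source: str, derived: str, n: int = 2) -> float:
    """Fraction of the derived text's n-grams that also occur in the source."""
    derived_grams = word_ngrams(derived, n)
    if not derived_grams:
        return 0.0
    return len(derived_grams & word_ngrams(source, n)) / len(derived_grams)


def tma_score(
    source_en: str,
    derived_ur: str,
    translate_ur_to_en: Callable[[str], str],  # hypothetical MT hook
    n: int = 2,
) -> float:
    """Translate the Urdu text into English, then compare the two monolingually."""
    translated = translate_ur_to_en(derived_ur)
    return containment(source_en, translated, n)


# Usage with a stub translator standing in for a real MT system (assumption).
def stub_translator(_text: str) -> str:
    return "the minister opened the new school today"


score = tma_score(
    "the minister opened the new school in lahore today",
    "<Urdu document text>",
    stub_translator,
)
```

A score like this could then serve as one feature for a binary or ternary classifier over the Wholly Derived, Partially Derived, and Non Derived labels; the thresholds or classifier would need to be tuned on the corpus.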
Index Terms
Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair