research-article

Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

Published: 16 June 2023

Abstract

In recent years, the problem of Cross-Lingual Text Reuse Detection (CLTRD) has gained the interest of the research community due to the availability of large digital repositories and automatic Machine Translation (MT) systems. Because these systems are readily and openly accessible, reusing text across languages has become easy, while detecting such reuse remains hard. Previous studies have developed corpora and methods for CLTRD at the sentence/passage level for the English-Urdu language pair; however, large standard corpora and methods at the document level are still lacking. To overcome this limitation, the main contribution of this study is the development of a large benchmark cross-lingual (English-Urdu) text reuse corpus, called the TREU (Text Reuse for English-Urdu) corpus. It contains real cases of English-to-Urdu text reuse at the document level. The corpus is manually labelled into three categories (Wholly Derived = 672, Partially Derived = 888, and Non Derived = 697), with the source text in English and the derived text in Urdu. A further contribution of this study is the evaluation of the TREU corpus using a diverse range of methods, demonstrating its usefulness and how it can support the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level. The best evaluation results, for both the binary (F1 = 0.78) and ternary (F1 = 0.66) classification tasks, are obtained using a combination of all Translation plus Mono-lingual Analysis (T+MA) based methods. The TREU corpus is publicly available to promote CLTRD research in an under-resourced language, i.e., Urdu.
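The T+MA approach described above can be sketched as a two-stage pipeline: the Urdu derived document is first machine-translated into English, and the two documents are then compared monolingually. The sketch below is a minimal illustration, not the paper's actual method: it assumes the translation step has already been performed, uses word n-gram containment as the monolingual similarity measure, and the classification thresholds are purely hypothetical.

```python
# Illustrative sketch of Translation plus Monolingual Analysis (T+MA).
# Assumption: the Urdu derived document has already been machine-translated
# into English, so both inputs are English text. The containment measure
# and thresholds below are illustrative, not the paper's configuration.

def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(source_text, derived_text, n=1):
    """Fraction of the derived document's n-grams also found in the source."""
    src = ngrams(source_text.lower().split(), n)
    der = ngrams(derived_text.lower().split(), n)
    if not der:
        return 0.0
    return len(src & der) / len(der)

def classify(score, high=0.6, low=0.2):
    """Map a similarity score to the TREU labels (thresholds hypothetical)."""
    if score >= high:
        return "Wholly Derived"
    if score >= low:
        return "Partially Derived"
    return "Non Derived"

source = "the minister announced a new education policy on monday"
derived_translated = "the minister announced a new education policy"
score = containment(source, derived_translated, n=2)
print(round(score, 2), classify(score))  # -> 1.0 Wholly Derived
```

In practice, a supervised classifier trained on scores from several such similarity measures (rather than fixed thresholds) would be used for the binary and ternary classification tasks.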



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6
June 2023, 635 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3604597

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 16 June 2023
      • Online AM: 1 May 2023
      • Accepted: 8 April 2023
      • Revised: 20 March 2023
      • Received: 26 July 2022
