Cross-lingual Text Reuse Detection Using Translation Plus Monolingual Analysis for English-Urdu Language Pair

Published: 31 October 2021

Abstract

Cross-Lingual Text Reuse Detection (CLTRD) has recently attracted the attention of the research community because a large amount of digital text is readily available for reuse in multiple languages through online digital repositories. In addition, efficient machine translation systems are freely available to translate text from one language into another, which makes it easy to reuse text across languages and, consequently, difficult to detect such reuse. In the literature, the most prominent and widely used approach for CLTRD is Translation plus Monolingual Analysis (T+MA). For the English-Urdu language pair, T+MA has so far been used only with lexical approaches, namely, N-gram Overlap, Longest Common Subsequence, and Greedy String Tiling. This shows that T+MA has not been thoroughly explored for the English-Urdu language pair. To fill this gap, this study presents an in-depth and detailed comparison of 26 approaches that are based on T+MA. These approaches include semantic similarity approaches (semantic tagger-based approaches, WordNet-based approaches), a probabilistic approach (the Kullback-Leibler distance approach), monolingual word embedding-based approaches, a Siamese recurrent architecture, and monolingual sentence transformer-based approaches for the English-Urdu language pair. The evaluation was carried out using the CLEU benchmark corpus, for both the binary and the ternary classification tasks. Our extensive experimentation shows that our proposed approach, which is a combination of the 26 approaches, obtained F1 scores of 0.77 and 0.61 for the binary and ternary classification tasks, respectively, and outperformed the previously reported approaches [41] (F1 = 0.73 for the binary and F1 = 0.55 for the ternary classification task) on the CLEU corpus.
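To make the monolingual-analysis half of T+MA concrete, the following is a minimal sketch of two of the lexical measures named above, N-gram Overlap (here in its containment form) and Longest Common Subsequence. It assumes that the Urdu text has already been machine-translated into English by some external system; the whitespace tokenizer, the normalization choices, and all function names are illustrative assumptions, not the implementation evaluated in this study.

```python
from typing import List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(source: str, derived: str, n: int = 1) -> float:
    """Fraction of the derived text's n-grams that also occur in the source."""
    src = ngrams(tokenize(source), n)
    der = ngrams(tokenize(derived), n)
    return len(src & der) / len(der) if der else 0.0


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal tokens, else carry the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(source: str, derived: str) -> float:
    """LCS length normalized by the length of the shorter text (an assumed choice)."""
    a, b = tokenize(source), tokenize(derived)
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


if __name__ == "__main__":
    # Hypothetical pair: an English source and an English translation of an Urdu text.
    english_source = "the minister opened the new hospital today"
    translated_urdu = "the new hospital was opened by the minister"
    print(ngram_containment(english_source, translated_urdu, n=2))
    print(lcs_similarity(english_source, translated_urdu))
```

Scores of this kind would typically be passed to a classifier or compared against thresholds to assign the binary or ternary reuse labels; that step is omitted here.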

REFERENCES

  [1] Abdi Asad, Idris Norisma, Alguliyev Rasim M., and Aliguliyev Ramiz M. 2015. PDLK: Plagiarism detection using linguistic knowledge. Exp. Syst. Applic. 42, 22 (2015), 8936–8946.
  [2] Alfikri Zakiy Firdaus and Purwarianti Ayu. 2012. The construction of Indonesian-English cross language plagiarism detection system using fingerprinting technique. Jurnal Ilmu Komputer dan Informasi 5, 1 (2012), 16–23.
  [3] Aljohani Adel and Mohd Masnizah. 2014. Arabic-English cross-language plagiarism detection using winnowing algorithm. Inf. Technol. J. 13, 14 (2014), 2349.
  [4] Asghari Habibollah, Khoshnava Khadijeh, Fatemi Omid, and Faili Heshaam. 2015. Developing bilingual plagiarism detection corpus using sentence aligned parallel corpus: Notebook for PAN at CLEF 2015. In CLEF (Working Notes). Retrieved from https://pan.webis.de/downloads/publications/papers/asghari_2015.pdf.
  [5] Bakhteev Oleg, Ogaltsov Alexandr, Khazov Andrey, Safin Kamil, and Kuznetsova Rita. 2019. CrossLang: The system of cross-lingual plagiarism detection. In Workshop on Document Intelligence at NeurIPS 2019. Retrieved from https://openreview.net/pdf?id=BkxiG6qqIr.
  [6] Barrón-Cedeño Alberto, Gupta Parth, and Rosso Paolo. 2013. Methods for cross-language plagiarism detection. Knowl.-based Syst. 50 (2013), 211–217.
  [7] Barrón-Cedeño Alberto, Rosso Paolo, Agirre Eneko, and Labaka Gorka. 2010. Plagiarism detection across distant language pairs. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling’10). 37–45.
  [8] Barrón-Cedeño Alberto, Rosso Paolo, Pinto David, and Juan Alfons. 2008. On cross-lingual plagiarism analysis using a statistical model. PAN 212 (2008). Retrieved from https://webis.de/events/pan-08/pan08-talks/barroncedeno08a-talk-cross-lingual-plagiarism-analysis-using-statistical-model.pdf.
  [9] Bowman Samuel R., Angeli Gabor, Potts Christopher, and Manning Christopher D. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 632–642. DOI: https://doi.org/10.18653/v1/D15-1075.
  [10] Cer Daniel, Yang Yinfei, Kong Sheng yi, Hua Nan, Limtiaco Nicole, John Rhomni St., Constant Noah, Guajardo-Cespedes Mario, Yuan Steve, Tar Chris, Sung Yun-Hsuan, Strope Brian, and Kurzweil Ray. 2018. Universal Sentence Encoder. arxiv:cs.CL/1803.11175.
  [11] Conneau Alexis, Kiela Douwe, Schwenk Holger, Barrault Loïc, and Bordes Antoine. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 670–680. DOI: https://doi.org/10.18653/v1/D17-1070.
  [12] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv:cs.CL/1810.04805.
  [13] Feng Fangxiaoyu, Yang Yinfei, Cer Daniel, Arivazhagan Naveen, and Wang Wei. 2020. Language-agnostic BERT Sentence Embedding. arxiv:cs.CL/2007.01852.
  [14] Fenogenova Alena. 2021. Russian paraphrasers: Paraphrase with transformers. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing. 11–19.
  [15] Ferrero Jeremy, Agnes Frederic, Besacier Laurent, and Schwab Didier. 2016. A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection. In Proceedings of the 10th Edition of the Language Resources and Evaluation Conference.
  [16] Ferrero Jeremy, Besacier Laurent, Schwab Didier, and Agnes Frederic. 2017. Deep investigation of cross-language plagiarism detection methods. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora. Association for Computational Linguistics, 6–15. DOI: https://doi.org/10.18653/v1/W17-2502.
  [17] Ferrero Jeremy, Besacier Laurent, Schwab Didier, and Agnes Frederic. 2017. Using word embedding for cross-language plagiarism detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, 415–421. Retrieved from https://www.aclweb.org/anthology/E17-2066.
  [18] Franco-Salvador Marc, Gupta Parth, and Rosso Paolo. 2013. Cross-language plagiarism detection using a multilingual semantic network. In Proceedings of the European Conference on Information Retrieval. Springer, 710–713.
  [19] Franco-Salvador Marc, Gupta Parth, Rosso Paolo, and Banchs Rafael E. 2016. Cross-language plagiarism detection over continuous-space-and knowledge graph-based representations of language. Knowl.-based Syst. 111 (2016), 87–99.
  [20] Ghannay Sahar, Favre Benoit, Esteve Yannick, and Camelin Nathalie. 2016. Word embedding evaluation and combination. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). 300–305.
  [21] Guo Xiao, Mirzaalian Hengameh, Sabir Ekraam, Jaiswal Ayush, and Abd-Almageed Wael. 2020. CORD19STS: COVID-19 Semantic Textual Similarity Dataset. arxiv:cs.CL/2007.02461.
  [22] Gupta Parth, Barrón-Cedeño Alberto, and Rosso Paolo. 2012. Cross-language high similarity search using a conceptual thesaurus. In Proceedings of the International Conference of the Cross-language Evaluation Forum for European Languages. Springer, 67–75.
  [23] Haneef Israr, Nawab Rao Muhammad Adeel, Munir Ehsan Ullah, and Bajwa Imran Sarwar. 2019. Design and development of a large cross-lingual plagiarism corpus for Urdu-English language pair. Sci. Program. 2019, Article ID 2962040 (2019), 11 pages. https://doi.org/10.1155/2019/2962040
  [24] He Hangfeng, Ning Qiang, and Roth Dan. 2020. QuASE: Question-answer Driven Sentence Encoding. arxiv:cs.CL/1909.00333.
  [25] Akella Kanna, Venkatachalam N., Gokul K., Choi Keunho, and Tyakal Ramachandraprabhu. 2017. Gain customer insights using NLP techniques. SAE International Journal of Materials and Manufacturing 10, 3 (2017), 333–337.
  [26] Hochreiter Sepp and Schmidhuber Jurgen. 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 1735–1780.
  [27] Huang Anna. 2008. Similarity measures for text document clustering. In Proceedings of the 6th New Zealand Computer Science Research Student Conference (NZCSRSC’08). 956.
  [28] Ke Pei, Ji Haozhe, Liu Siyang, Zhu Xiaoyan, and Huang Minlie. 2020. SentiLARE: Linguistic knowledge enhanced language representation for sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 6975–6988.
  [29] Kenter Tom and Rijke Maarten De. 2015. Short text similarity with word embeddings. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 1411–1420.
  [30] Khorsi Ahmed, Cherroun Hadda, Schwab Didier, et al. 2018. 2L-APD: A two-level plagiarism detection system for Arabic documents. Cybern. Inf. Technol. 18, 1 (2018), 124–138.
  [31] Kothwal Rambhoopal and Varma Vasudeva. 2013. Cross lingual text reuse detection based on keyphrase extraction and similarity measures. In Multilingual Information Access in South Asian Languages. Springer, 71–78.
  [32] Koudas Nick, Sarawagi Sunita, and Srivastava Divesh. 2006. Record linkage: Similarity measures and algorithms. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 802–803.
  [33] Lahitani Alfirna Rizqi, Permanasari Adhistya Erna, and Setiawan Noor Akhmad. 2016. Cosine similarity to determine similarity measure: Study case in online essay assessment. In Proceedings of the 4th International Conference on Cyber and IT Service Management. IEEE, 1–6.
  [34] Liu Yinhan, Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis Mike, Zettlemoyer Luke, and Stoyanov Veselin. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arxiv:cs.CL/1907.11692.
  [35] Mardiana Tari, Adji Teguh Bharata, and Hidayah Indriana. 2015. The comparation of distance-based similarity measure to detection of plagiarism in Indonesian text. In Proceedings of the International Conference on Soft Computing, Intelligence Systems, and Information Technology. Springer, 155–164.
  [36] Massidda Riccardo. 2020. rmassidda@ DaDoEval: Document dating using sentence embeddings at EVALITA 2020. In Proceedings of 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA’20). Retrieved from: CEUR.org.
  [37] Miller George A. 1995. WordNet: A lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
  [38] Minaee Shervin, Kalchbrenner Nal, Cambria Erik, Nikzad Narjes, Chenaghlu Meysam, and Gao Jianfeng. 2021. Deep learning based text classification: A comprehensive review. ACM Comput. Surv. 54, 3 (2021), 1–40.
  [39] Mori Yusuke, Yamane Hiroaki, Mukuta Yusuke, and Harada Tatsuya. 2020. Finding and generating a missing part for story completion. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 156–166.
  [40] Mueller Jonas and Thyagarajan Aditya. 2016. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
  [41] Muneer Iqra, Sharjeel Muhammad, Iqbal Muntaha, Nawab Rao Muhammad Adeel, and Rayson Paul. 2019. CLEU-A cross-language English-Urdu corpus and benchmark for text reuse experiments. J. Assoc. Inf. Sci. Technol. 70, 7 (2019), 729–741.
  [42] Naumov Stanislav, Yaroslavtsev Grigory, and Avdiukhin Dmitrii. 2020. Objective-Based Hierarchical Clustering of Deep Embedding Vectors. arxiv:cs.LG/2012.08466.
  [43] Navrozidis Jakob and Jansson Hannes. 2020. Using natural language processing to identify similar patent documents. LU-CS-EX (2020). Retrieved from https://lup.lub.lu.se/student-papers/search/publication/9008699.
  [44] Nawab Rao Muhammad Adeel, Stevenson Mark, and Clough Paul. 2016. An IR-based approach utilizing query expansion for plagiarism detection in MEDLINE. IEEE/ACM Trans. Comput. Biol. Bioinf. 14, 4 (2016), 796–804.
  [45] Neculoiu Paul, Versteegh Maarten, and Rotaru Mihai. 2016. Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP. 148–157.
  [46] Nicosia Massimo and Moschitti Alessandro. 2017. Accurate sentence matching with hybrid siamese networks. In Proceedings of the ACM on Conference on Information and Knowledge Management. 2235–2238.
  [47] Ozsoy Makbule Gulcin. 2016. From Word Embeddings to Item Recommendation. arxiv:cs.LG/1601.01356.
  [48] Pelevina Maria, Arefyev Nikolay, Biemann Chris, and Panchenko Alexander. 2017. Making Sense of Word Embeddings. arxiv:cs.CL/1708.03390.
  [49] Pereira Rafael Corezola, Moreira Viviane P., and Galante Renata. 2010. A new approach for cross-language plagiarism analysis. In Proceedings of the International Conference of the Cross-language Evaluation Forum for European Languages. Springer, 15–26.
  [50] Peters Matthew, Neumann Mark, Iyyer Mohit, Gardner Matt, Clark Christopher, Lee Kenton, and Zettlemoyer Luke. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2227–2237. DOI: https://doi.org/10.18653/v1/N18-1202.
  [51] Potthast Martin, Barrón-Cedeño Alberto, Stein Benno, and Rosso Paolo. 2011. Cross-language plagiarism detection. Lang. Resour. Eval. 45, 1 (2011), 45–62.
  [52] Potthast Martin, Cedeño Alberto Barrón, Eiselt Andreas, Stein Benno, and Rosso Paolo. 2010. Overview of the 2nd international competition on plagiarism detection. In CEUR Workshop Proceedings, Vol. 1176.
  [53] Potthast Martin, Eiselt Andreas, Cedeño Luis Alberto Barrón, Stein Benno, and Rosso Paolo. 2011. Overview of the 3rd international competition on plagiarism detection. In CEUR Workshop Proceedings, Vol. 1177.
  [54] Potthast Martin, Stein Benno, Barrón-Cedeño Alberto, and Rosso Paolo. 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, 997–1005.
  [55] Rei Ricardo, Stewart Craig, Farinha Ana C., and Lavie Alon. 2020. COMET: A Neural Framework for MT Evaluation. arxiv:cs.CL/2009.09025.
  [56] Reimers Nils and Gurevych Iryna. 2019. Sentence-BERT: Sentence Embeddings Using Siamese BERT-networks. arxiv:cs.CL/1908.10084.
  [57] Reimers Nils and Gurevych Iryna. 2020. Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. arxiv:cs.CL/2004.09813.
  [58] Sameen Sara, Sharjeel Muhammad, Nawab Rao Muhammad Adeel, Rayson Paul, and Muneer Iqra. 2017. Measuring short text reuse for the Urdu language. IEEE Access 6 (2017), 7412–7421.
  [59] Sharjeel Muhammad. 2020. Mono- and Cross-lingual Paraphrased Text Reuse and Extrinsic Plagiarism Detection. Ph.D. Dissertation. Lancaster University (United Kingdom).
  [60] Thakur Nandan, Reimers Nils, Daxenberger Johannes, and Gurevych Iryna. 2021. Augmented SBERT: Data Augmentation Method for Improving Bi-encoders for Pairwise Sentence Scoring Tasks. arxiv:cs.CL/2010.08240.
  [61] Princeton University. 2010. About WordNet. Retrieved from https://wordnet.princeton.edu/citing-wordnet.
  [62] Varior Rahul Rama, Shuai Bing, Lu Jiwen, Xu Dong, and Wang Gang. 2016. A siamese long short-term memory architecture for human re-identification. In Proceedings of the European Conference on Computer Vision. Springer, 135–153.
  [63] Vijaymeena M. K. and Kavitha K. 2016. A survey on similarity measures in text mining. Mach. Learn. Applic. 3, 2 (2016), 19–28.
  [64] Vinayakumar R. and Soman K. P. 2020. Siamese neural network architecture for homoglyph attacks detection. ICT Express 6, 1 (2020), 16–19.
  [65] Štajner Tadej and Mladenic Dunja. 2019. Cross-lingual document similarity estimation and dictionary generation with comparable corpora. Knowl. Inf. Syst. 58, 3 (2019), 729–743.
  [66] Williams Adina, Nangia Nikita, and Bowman Samuel. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 1112–1122. DOI: https://doi.org/10.18653/v1/N18-1101.
  [67] Xu Xiaqing, Ma Bingpeng, Chang Hong, and Chen Xilin. 2017. Siamese recurrent architecture for visual tracking. In Proceedings of the IEEE International Conference on Image Processing (ICIP’17). IEEE, 1152–1156.
  [68] Yates Andrew, Nogueira Rodrigo, and Lin Jimmy. 2021. Pretrained transformers for text ranking: BERT and beyond. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 1154–1156.
  [69] Zhang Linrui and Moldovan Dan. 2018. Rule-based vs. neural net approaches to semantic textual similarity. In Proceedings of the 1st Workshop on Linguistic Resources for Natural Language Processing. 12–17.

      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 2
        March 2022
        413 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3494070

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 31 October 2021
        • Revised: 1 June 2021
        • Accepted: 1 June 2021
        • Received: 1 August 2020

        Qualifiers

        • research-article
        • Refereed