An Automatic and a Machine-assisted Method to Clean Bilingual Corpus

Published: 09 October 2019

Abstract

This article presents two methods for cleaning parallel corpora. The first is a machine-assisted technique suited to small parallel corpora; the second is an automatic method suited to large ones. Both are evaluated with a baseline SMT system (Moses). The machine-assisted technique uses two features: word alignment and the lengths of the source- and target-language sentences. These features are used to detect mistranslations in the corpus, which are then handled by a human translator. Experiments with this method are conducted on the English-to-Indian Language Machine Translation (EILMT) corpus (English-Hindi), where the Bilingual Evaluation Understudy (BLEU) score improves by 0.47% on the cleaned corpus. The automatic method combines two features: the lengths of the source- and target-language sentences, and the Viterbi alignment score generated by a Hidden Markov Model for each sentence pair. Each feature has its own threshold, and both thresholds are chosen using a small, manually annotated parallel corpus of 206 sentence pairs. Experiments with this method are conducted on the HindEnCorp corpus, released in the workshop of the Association for Computational Linguistics (ACL 2014), where the BLEU score improves by 0.6% on the cleaned corpus. A comparison of the two methods on the EILMT corpus is also presented.
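
The two-threshold filtering idea behind the automatic method can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: the abstract does not state how sentence length is turned into a feature, how the alignment score is scaled, or how the thresholds are derived from the 206 annotated pairs. The sketch assumes a longer-to-shorter token-length ratio, a per-pair alignment log-score computed beforehand by some HMM aligner (`alignment_score`), and a simple accuracy-maximizing grid search. All names (`ScoredPair`, `is_clean`, `filter_corpus`, `pick_thresholds`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass
class ScoredPair:
    source: str
    target: str
    # Assumption: a Viterbi alignment log-score precomputed by an HMM aligner.
    alignment_score: float


def length_ratio(source: str, target: str) -> float:
    """Longer-to-shorter length ratio in whitespace tokens (always >= 1)."""
    n_src, n_tgt = len(source.split()), len(target.split())
    if n_src == 0 or n_tgt == 0:
        return float("inf")  # an empty side can never pass the ratio test
    return max(n_src, n_tgt) / min(n_src, n_tgt)


def is_clean(pair: ScoredPair, max_ratio: float, min_score: float) -> bool:
    """Keep a pair only if it passes both thresholds."""
    return (length_ratio(pair.source, pair.target) <= max_ratio
            and pair.alignment_score >= min_score)


def filter_corpus(pairs: Iterable[ScoredPair],
                  max_ratio: float, min_score: float) -> List[ScoredPair]:
    """Return the pairs that pass both thresholds, in their original order."""
    return [p for p in pairs if is_clean(p, max_ratio, min_score)]


def pick_thresholds(annotated: Sequence[Tuple[ScoredPair, bool]],
                    ratio_grid: Sequence[float],
                    score_grid: Sequence[float]) -> Tuple[float, float]:
    """Grid search for the threshold pair with the highest accuracy on a
    manually annotated set (label True = correct translation).
    Assumption: accuracy is the selection criterion; ties keep the first hit."""
    best, best_acc = (ratio_grid[0], score_grid[0]), -1.0
    for max_ratio in ratio_grid:
        for min_score in score_grid:
            hits = sum(is_clean(p, max_ratio, min_score) == label
                       for p, label in annotated)
            acc = hits / len(annotated)
            if acc > best_acc:
                best, best_acc = (max_ratio, min_score), acc
    return best


# Toy data with made-up scores, for illustration only.
good = ScoredPair("the cat sat", "billi baithi thi", -2.0)
too_long = ScoredPair("hello", "yah ek bahut lamba vakya hai", -3.0)
misaligned = ScoredPair("good morning", "shubh ratri", -15.0)

annotated = [(good, True), (too_long, False), (misaligned, False)]
max_ratio, min_score = pick_thresholds(annotated, [2.0, 10.0], [-20.0, -10.0])
kept = filter_corpus([good, too_long, misaligned], max_ratio, min_score)
```

In this toy setup the length-ratio test rejects the pair with mismatched lengths and the alignment-score test rejects the pair whose lengths match but whose score is low, which is the reason for combining the two features.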

        • Published in

          ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 19, Issue 1
          January 2020
          345 pages
          ISSN:2375-4699
          EISSN:2375-4702
          DOI:10.1145/3338846

          Copyright © 2019 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 9 October 2019
          • Accepted: 1 May 2019
          • Revised: 1 July 2018
          • Received: 1 February 2017

          Qualifiers

          • research-article
          • Research
          • Refereed
