Abstract
This article presents two methods of cleaning parallel corpora. The first is a machine-assisted technique suited to small parallel corpora; the second is an automatic method suited to large ones. Both are evaluated with a baseline SMT system (MOSES). The machine-assisted technique uses two features, word alignment and the lengths of the source- and target-language sentences, to detect mistranslations in the corpus, which are then handled by a human translator. Experiments with this method are conducted on the English-to-Indian Language Machine Translation (EILMT) corpus (English-Hindi), where the Bilingual Evaluation Understudy (BLEU) score improves by 0.47% on the cleaned corpus. The automatic method combines two features: the lengths of the source- and target-language sentences, and the Viterbi alignment score generated by a Hidden Markov Model for each sentence pair. A separate threshold is used for each feature, and both thresholds are set using a small, manually annotated parallel corpus of 206 sentence pairs. Experiments with this method are conducted on the HindEnCorp corpus, released at the workshop of the Association for Computational Linguistics (ACL 2014), where the BLEU score improves by 0.6% on the cleaned corpus. A comparison of the two methods on the EILMT corpus is also presented.
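The two-threshold filtering idea behind the automatic method can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the length feature is computed or how the two thresholds are combined, so the shorter-to-longer token ratio, the logical AND of the two tests, the names `SentencePair`, `length_ratio`, and `clean_corpus`, and all threshold values are assumptions. The alignment score is assumed to be precomputed by an external HMM word aligner and supplied as a number where higher means better.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SentencePair:
    source: str
    target: str
    # Assumed to be precomputed by an external HMM aligner
    # (for example, a length-normalised Viterbi log-probability).
    alignment_score: float


def length_ratio(pair: SentencePair) -> float:
    """Shorter-to-longer ratio of whitespace token counts, in [0.0, 1.0]."""
    src_len = len(pair.source.split())
    tgt_len = len(pair.target.split())
    if src_len == 0 or tgt_len == 0:
        return 0.0  # an empty side can never be a valid translation
    return min(src_len, tgt_len) / max(src_len, tgt_len)


def clean_corpus(
    pairs: Iterable[SentencePair],
    min_length_ratio: float,
    min_alignment_score: float,
) -> List[SentencePair]:
    """Keep only pairs that pass both thresholds (AND is an assumption)."""
    return [
        pair
        for pair in pairs
        if length_ratio(pair) >= min_length_ratio
        and pair.alignment_score >= min_alignment_score
    ]
```

In practice the two thresholds would be tuned on a small annotated sample, such as the 206 manually labelled sentence pairs mentioned above, rather than chosen by hand.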
An Automatic and a Machine-assisted Method to Clean Bilingual Corpus