Abstract
Phrase-based Statistical Machine Translation (PBSMT) is commonly used for automatic translation. However, PBSMT runs into difficulty when the source language, the target language, or both are morphologically rich. Factored models are useful in such cases because they treat each word as a vector of factors. These factors can carry any information about the surface word, and the model can use that information during translation. The objective of the current work is to handle morphological inflections in Hindi, Marathi, and Malayalam using factored translation models when translating from English. Statistical MT approaches face the problem of data sparsity when translating into a morphologically rich language, because a parallel corpus is very unlikely to contain all morphological forms of every word. We propose a simple and effective solution: generate these unseen morphological forms and inject them into the original training corpus, thereby enriching the input with the various morphological forms of words. We observe that morphology injection improves the quality of translation in terms of both adequacy and fluency. We verify this with experiments on three morphologically rich languages when translating from English. From the detailed evaluations, we observe an order of magnitude improvement in translation quality.
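The two ideas in the abstract, words as vectors of factors and injection of unseen morphological forms, can be illustrated with a minimal sketch. This is not the authors' implementation. The factor layout (`lemma|number|case`), the suffix table `TOY_SUFFIXES`, the function names, and the transliterated Hindi forms are assumptions chosen for illustration; only the vertical-bar factor separator follows the convention of Moses-style factored corpora.

```python
from typing import Dict, List, Tuple

FACTOR_SEP = "|"  # factor separator used in Moses-style factored corpora

Pair = Tuple[str, str]  # (factored source token, target surface form)


def factored(lemma: str, number: str, case: str) -> str:
    """Join the factors of one source word into a single factored token."""
    return FACTOR_SEP.join((lemma, number, case))


# Hypothetical toy paradigm for one noun class, keyed by (number, case).
# Real systems would derive this from a morphological resource.
TOY_SUFFIXES: Dict[Tuple[str, str], str] = {
    ("sg", "dir"): "aa",
    ("pl", "dir"): "e",
    ("sg", "obl"): "e",
    ("pl", "obl"): "on",
}


def inject_forms(corpus: List[Pair], src_lemma: str, tgt_stem: str) -> List[Pair]:
    """Return the corpus plus every (factored source, inflected target)
    pair from the toy paradigm that the corpus does not already contain."""
    seen = set(corpus)
    augmented = list(corpus)
    for (number, case), suffix in TOY_SUFFIXES.items():
        pair = (factored(src_lemma, number, case), tgt_stem + suffix)
        if pair not in seen:
            augmented.append(pair)
            seen.add(pair)
    return augmented


if __name__ == "__main__":
    # The original corpus has seen only the singular direct form.
    corpus = [("boy|sg|dir", "ladkaa")]
    for src, tgt in inject_forms(corpus, "boy", "ladk"):
        print(src, "->", tgt)
```

In this sketch the source-side factors disambiguate which target inflection is wanted, so the injected pairs give the translation model evidence for forms that never occurred in the parallel data.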
Role of Morphology Injection in SMT: A Case Study from Indian Language Perspective