Role of Morphology Injection in SMT: A Case Study from Indian Language Perspective

Published: 15 September 2017

Abstract

Phrase-based Statistical Machine Translation (PBSMT) is commonly used for automatic translation. However, PBSMT runs into difficulty when the source language, the target language, or both are morphologically rich. Factored models have been found useful in such cases because they treat each word as a vector of factors. These factors can carry any information about the surface word, and the model can use that information during translation. The objective of the current work is to handle morphological inflections in Hindi, Marathi, and Malayalam using factored translation models when translating from English. Statistical MT approaches face data sparsity when translating into a morphologically rich language, because a parallel corpus is very unlikely to contain every morphological form of every word. We propose a simple and effective solution: generate the unseen morphological forms and inject them into the original training corpus, thereby enriching the input with the various morphological forms of words. We observe that morphology injection improves translation quality in terms of both adequacy and fluency. We verify this with experiments on three morphologically rich languages when translating from English. In the detailed evaluations, we observed an order of magnitude improvement in translation quality.
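The idea of generating unseen forms and adding them to the training data can be illustrated with a small sketch. This is not the authors' implementation: the romanized Hindi stem, the suffix table, the factor layout (lemma, number, case on the English side), and the function names are all assumptions made for illustration. Only the `|` factor separator follows the convention of the Moses toolkit.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (factored English token, Hindi surface form)

# Toy paradigm (assumption, for illustration only): suffix chosen by
# number and case for a masculine noun stem such as "ladk" (boy).
SUFFIXES: Dict[Tuple[str, str], str] = {
    ("sg", "direct"): "a",
    ("sg", "oblique"): "e",
    ("pl", "direct"): "e",
    ("pl", "oblique"): "on",
}


def to_factored(*factors: str) -> str:
    """Join factors into one token using the Moses-style '|' separator."""
    return "|".join(factors)


def generate_forms(en_lemma: str, hi_stem: str) -> List[Pair]:
    """Produce every (factored source, target surface) pair in the paradigm."""
    return [
        (to_factored(en_lemma, number, case), hi_stem + suffix)
        for (number, case), suffix in SUFFIXES.items()
    ]


def inject(corpus: List[Pair], new_pairs: List[Pair]) -> List[Pair]:
    """Append generated pairs that the corpus does not already contain."""
    seen = set(corpus)
    augmented = list(corpus)
    for pair in new_pairs:
        if pair not in seen:
            seen.add(pair)
            augmented.append(pair)
    return augmented


# The original corpus has seen only one form of the noun.
corpus = [("boy|sg|direct", "ladka")]
augmented = inject(corpus, generate_forms("boy", "ladk"))
for source, target in augmented:
    print(source, "->", target)
```

In this sketch the three forms missing from the original corpus are added as new training pairs, while the form that was already present is left alone, so the existing data is never duplicated.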
