Abstract
Statistical Machine Translation (SMT) is a preferred Machine Translation approach that converts text from one language into another by automatically learning translations from a parallel corpus. SMT has produced quality translations for many foreign languages, but only a few works have attempted it for South Indian languages. This article discusses experiments conducted with SMT for Malayalam and analyzes how the methods defined for SMT in foreign languages affect Malayalam, a Dravidian language. The baseline SMT model does not work for Malayalam because of characteristics such as its agglutinative nature and morphological richness. Hence, the challenge is to identify precisely where the baseline SMT model has to be modified so that it accommodates these language-specific peculiarities and gives better English-to-Malayalam translations. The alignments between English and Malayalam sentence pairs, which are subjected to the training process in SMT, play a crucial role in producing quality output translations. Therefore, this work focuses on improving the translation model of SMT by refining the alignments between English–Malayalam sentence pairs. The phrase alignment algorithms align the verb and noun phrases in the sentence pairs and develop a new set of alignments for the English–Malayalam sentence pairs. These alignment sets refine the alignments that GIZA++ produces as a result of the EM training algorithm. The improved Phrase-Based SMT model trained using these refined alignments resulted in better translation quality, as indicated by the AER and BLEU scores.
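To make the AER (Alignment Error Rate) mentioned above concrete, the following is a minimal sketch of the standard definition, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted alignment, S the sure gold links, and P the possible gold links (with S treated as a subset of P). The function name, the link representation, and the toy link sets are assumptions made for illustration; they are not taken from the article's data or code.

```python
from typing import Set, Tuple

# A link pairs a source word index with a target word index (assumed representation).
Link = Tuple[int, int]


def alignment_error_rate(predicted: Set[Link], sure: Set[Link], possible: Set[Link]) -> float:
    """Return AER = 1 - (|A & S| + |A & P|) / (|A| + |S|); lower is better."""
    # Sure links always count as possible links as well.
    possible = possible | sure
    if not predicted and not sure:
        return 0.0  # nothing to align and nothing predicted
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


# Hypothetical gold annotation for one sentence pair.
sure_links = {(0, 0), (1, 2), (2, 1)}
possible_links = {(3, 3)}

# Hypothetical baseline alignment with one wrong link, (1, 1).
baseline = {(0, 0), (1, 1), (2, 1)}
# Hypothetical refined alignment after a phrase-level correction.
refined = {(0, 0), (1, 2), (2, 1), (3, 3)}

baseline_aer = alignment_error_rate(baseline, sure_links, possible_links)
refined_aer = alignment_error_rate(refined, sure_links, possible_links)
print(f"baseline AER = {baseline_aer:.3f}, refined AER = {refined_aer:.3f}")
```

In this toy case the refined link set scores a lower AER than the baseline, which is the direction of improvement the abstract reports; the actual figures depend on the article's corpus and gold alignments.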
Index Terms
Malayalam Natural Language Processing: Challenges in Building a Phrase-Based Statistical Machine Translation System