short-paper

Morphological Segmentation to Improve Crosslingual Word Embeddings for Low Resource Languages

Published: 21 June 2020

Abstract

Crosslingual word embeddings learned from multiple parallel corpora help in understanding the relationships between languages and in improving the prediction quality of machine translation. However, for low-resource languages with complex, agglutinative morphology, inducing good-quality crosslingual embeddings is challenging because of the many complex morphological forms and rare words. This holds even for languages that share a common linguistic structure. In this work, we show that performing a simple morphological segmentation of the corpora before generating crosslingual word embeddings for both roots and suffixes greatly improves prediction quality and captures semantic similarities more effectively. To demonstrate this, we chose two related languages, Telugu and Kannada, of the Dravidian language family. We also tested our method on Hindi, a widely spoken North Indian language belonging to the Indo-European language family, and observed encouraging results.
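As a rough illustration of the preprocessing idea only, the sketch below splits each corpus token into a root and a marked suffix before any embedding training takes place. It is not the segmenter used in the paper: the suffix list, the `+` marker, the minimum root length, and the function names are all hypothetical, and English toy words stand in for Telugu, Kannada, or Hindi data.

```python
from typing import List, Sequence

# Hypothetical toy suffix inventory (an assumption for illustration);
# a real system would use a language-specific or learned segmenter.
TOY_SUFFIXES = ("ing", "ed", "s")
SUFFIX_MARK = "+"  # prefix that keeps suffix tokens distinct from root tokens


def segment_word(word: str,
                 suffixes: Sequence[str] = TOY_SUFFIXES,
                 min_root_len: int = 3) -> List[str]:
    """Split a word into [root, marked suffix] by longest-suffix matching.

    Words with no matching suffix, or whose root would be shorter than
    min_root_len, are returned unchanged as a single token.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        root = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(root) >= min_root_len:
            return [root, SUFFIX_MARK + suffix]
    return [word]


def segment_corpus(sentences: Sequence[Sequence[str]]) -> List[List[str]]:
    """Apply segment_word to every token of every tokenized sentence."""
    return [
        [piece for word in sentence for piece in segment_word(word)]
        for sentence in sentences
    ]


# Toy tokenized corpus (placeholder data, not from the paper).
corpus = [["walking", "dogs"], ["walked", "cat"]]
segmented = segment_corpus(corpus)
```

After this step, roots and suffixes are separate tokens, so an embedding model trained on `segmented` would learn one vector per root and one per suffix rather than one per inflected surface form, which is the effect on rare word forms that the abstract describes.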

