Abstract
Crosslingual word embeddings learned from parallel corpora help capture the relationships between languages and improve the prediction quality of machine translation. However, for low-resource languages with complex, agglutinative morphology, inducing good-quality crosslingual embeddings is challenging because of the proliferation of morphological forms and rare words. This holds even for languages that share a common linguistic structure. In this work, we show that applying a simple morphological segmentation to the corpora before generating crosslingual word embeddings for both roots and suffixes substantially improves prediction quality and captures semantic similarities more effectively. To demonstrate this, we chose two related languages of the Dravidian family, Telugu and Kannada. We also tested our method on Hindi, a widely spoken North Indian language of the Indo-European family, and observed encouraging results.
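The preprocessing step the abstract describes, segmenting each word into a root and a suffix so that both receive their own embedding, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the suffix list and the `+` bound-morpheme marker are assumptions for the example, and a real system would use an unsupervised segmenter such as Morfessor over the full corpus.

```python
# Hypothetical sketch: segment words into root + suffix tokens before
# embedding training, so roots and suffixes each get their own vectors.
# The suffix inventory below is a toy stand-in, not real Telugu/Kannada data.

SUFFIXES = ["lo", "ki", "nu", "ku"]  # illustrative agglutinative case markers

def segment(word, suffixes=SUFFIXES):
    """Split a word into [root, '+suffix'] by longest suffix match, if any."""
    for suf in sorted(suffixes, key=len, reverse=True):
        # require the root to keep at least two characters
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return [word[: -len(suf)], "+" + suf]  # '+' marks a bound suffix token
    return [word]

def preprocess(sentence):
    """Replace each token with its root and suffix tokens for embedding training."""
    out = []
    for tok in sentence.split():
        out.extend(segment(tok))
    return out
```

The segmented token stream would then be fed to an ordinary embedding trainer (e.g., a skip-gram model) on both sides of the parallel corpus, so that frequent suffixes stop fragmenting the vocabulary into rare inflected forms.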
Morphological Segmentation to Improve Crosslingual Word Embeddings for Low Resource Languages