
NeuMorph: Neural Morphological Tagging for Low-Resource Languages—An Experimental Study for Indic Languages

Published: 10 August 2019

Abstract

This article addresses morphological tagging for low-resource languages. Five Indic languages serve as the reference set, and two severely resource-poor languages, Coptic and Kurmanji, are also considered. The task is to predict the morphological tag (case, degree, gender, etc.) of a word in context. We hypothesize that predicting the tag of a word does not always require a long context such as the entire sentence. We therefore study the usefulness of the convolution operation, which results in a convolutional neural network (CNN) based morphological tagger. Our proposed model (BLSTM-CNN) achieves insightful results in comparison to the present state of the art. Following the recent trend, the task is carried out under three settings: single language, across languages, and across keys. Whereas previous models used only character-level features, we show that adding word vectors to the character-level embeddings significantly improves the performance of all the models. Because obtaining high-quality word vectors for resource-poor languages remains a challenge, the proposed character-level BLSTM-CNN proves most effective in that scenario.
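The locality hypothesis in the abstract can be illustrated with a minimal sketch of a one-dimensional convolution over word embeddings. This is not the authors' BLSTM-CNN. The function name `conv1d_same`, the toy embeddings, the kernel weights, and the window width of three are all assumptions made for illustration. The sketch only shows that each per-word feature depends on a fixed local window rather than on the whole sentence.

```python
from typing import List

Vector = List[float]


def conv1d_same(tokens: List[Vector], kernel: List[Vector], bias: float = 0.0) -> List[float]:
    """Slide one convolution filter over a sentence with zero padding.

    tokens: one embedding vector per word (hypothetical toy values).
    kernel: one weight vector per window position; its length is the window width.
    Returns one feature per word, computed only from that word's local window.
    """
    width = len(kernel)
    if width % 2 == 0:
        raise ValueError("use an odd window width so the window is centred")
    dim = len(kernel[0])
    half = width // 2
    pad = [[0.0] * dim for _ in range(half)]
    padded = pad + tokens + pad
    features = []
    for i in range(len(tokens)):
        total = bias
        for k in range(width):
            total += sum(w * x for w, x in zip(kernel[k], padded[i + k]))
        features.append(max(0.0, total))  # ReLU non-linearity
    return features


# Toy 2-dimensional embeddings for a five-word sentence (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # window of three words

features = conv1d_same(sentence, kernel)

# Replace only the last word. The windows of the first three words never
# reach position 4, so their features stay the same.
altered = sentence[:4] + [[9.0, 9.0]]
altered_features = conv1d_same(altered, kernel)
```

With a window of three, only the features of the fourth and fifth words change when the last word is replaced. A recurrent layer over the full sentence would, in general, let that change propagate to every position.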

