research-article

Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches

Published: 16 June 2023

Abstract

Automatic part-of-speech (POS) tagging is a preprocessing step in many natural language processing tasks, such as named entity recognition, speech processing, information extraction, word sense disambiguation, and machine translation. It has already achieved promising results for English and European languages. However, for Indian languages, particularly Odia, it is not yet well explored because of the lack of supporting tools and resources and the morphological richness of the language. We were unable to locate an open source POS tagger for Odia, and only a handful of attempts have been made to develop one. The main contribution of this research work is to develop an Odia POS tagger using statistical approaches, such as the maximum entropy Markov model and conditional random field (CRF), as well as deep learning based approaches, including the convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM). A publicly accessible corpus annotated with the Bureau of Indian Standards (BIS) tagset is used in our work. However, the datasets for most languages around the globe are annotated with the Universal Dependencies (UD) tagset. Hence, to maintain uniformity, the Odia dataset should use the same tagset. Thus, following the BIS and UD guidelines, we constructed a mapping from the BIS tagset to the UD tagset. The maximum entropy Markov model, CRF, Bi-LSTM, and CNN models are trained using the Indian Languages Corpora Initiative corpus with the BIS and UD tagsets. We experimented with various feature sets as input to the statistical models to prepare a baseline system and observed the impact of the constructed feature sets. The deep learning based model includes the Bi-LSTM network, the CNN network, the CRF layer, character sequence information, and a pre-trained word vector. Seven different combinations of neural sequence labeling models are implemented, and their performance measures are investigated. It has been observed that the Bi-LSTM model with the character sequence feature and pre-trained word vector achieved an accuracy of 94.58%.
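
The BIS-to-UD conversion mentioned in the abstract can be pictured as a lookup table applied to every tagged token. The sketch below is only an illustration under stated assumptions: the abstract does not reproduce the authors' mapping, so the tag pairs shown are a small, plausible subset chosen for this example, and the names `BIS_TO_UD` and `bis_to_ud` are hypothetical.

```python
# Illustrative subset only: these BIS -> UD pairs are assumptions made for this
# sketch, not the mapping table constructed in the article.
BIS_TO_UD = {
    "N_NN": "NOUN",      # common noun
    "N_NNP": "PROPN",    # proper noun
    "PR_PRP": "PRON",    # personal pronoun
    "V_VM": "VERB",      # main verb
    "V_VAUX": "AUX",     # auxiliary verb
    "JJ": "ADJ",         # adjective
    "RB": "ADV",         # adverb
    "PSP": "ADP",        # postposition
    "CC_CCD": "CCONJ",   # coordinating conjunction
    "CC_CCS": "SCONJ",   # subordinating conjunction
    "QT_QTC": "NUM",     # cardinal number
    "RD_PUNC": "PUNCT",  # punctuation
}


def bis_to_ud(tagged_sentence, fallback="X"):
    """Convert (token, BIS tag) pairs into (token, UD tag) pairs.

    Tags missing from the table fall back to the UD catch-all tag "X".
    """
    return [(token, BIS_TO_UD.get(tag, fallback)) for token, tag in tagged_sentence]


# Placeholder tokens stand in for Odia words.
sentence = [("w1", "N_NNP"), ("w2", "N_NN"), ("w3", "V_VM"), ("w4", "RD_PUNC")]
print(bis_to_ud(sentence))
```

A plain dictionary keeps the conversion deterministic and easy to audit. A real mapping would need to cover the full BIS tagset and decide how to handle tags with no direct UD counterpart.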



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6
June 2023
635 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3604597


Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 16 June 2023
• Online AM: 30 March 2023
• Accepted: 18 March 2023
• Revised: 17 March 2023
• Received: 22 July 2022
