Abstract
Automatic part-of-speech (POS) tagging is a preprocessing step for many natural language processing tasks, such as named entity recognition, speech processing, information extraction, word sense disambiguation, and machine translation. It has already achieved promising results for English and other European languages. However, for Indian languages, and particularly for Odia, it is not yet well explored, because supporting tools and resources are scarce and the language is morphologically rich. We were unable to locate an open source POS tagger for Odia, and only a handful of attempts have been made to develop one. The main contribution of this research work is to present statistical approaches, namely the maximum entropy Markov model and the conditional random field (CRF), as well as deep learning based approaches, including the convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM), to develop an Odia POS tagger. A publicly accessible corpus annotated with the Bureau of Indian Standards (BIS) tagset is used in our work. However, datasets for most languages around the globe are annotated with the Universal Dependencies (UD) tagset. To maintain uniformity, the Odia dataset should use the same tagset. Thus, following the BIS and UD guidelines, we constructed a mapping from the BIS tagset to the UD tagset. The maximum entropy Markov model, CRF, Bi-LSTM, and CNN models are trained on the Indian Languages Corpora Initiative corpus with both the BIS and UD tagsets. We experimented with various feature sets as input to the statistical models to prepare a baseline system and observed the impact of the constructed feature sets. The deep learning based models combine a Bi-LSTM network, a CNN, a CRF layer, character sequence information, and pre-trained word vectors. Seven different combinations of neural sequence labeling models are implemented, and their performance is investigated. The Bi-LSTM model with the character sequence feature and pre-trained word vectors achieved an accuracy of 94.58%.
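The idea of mapping a BIS-annotated corpus onto UD tags can be sketched as a simple lookup table. The sketch below is illustrative only: the tag pairs are assumptions based on the general meaning of the BIS and UD categories, not the mapping table constructed in this work, and the names `BIS_TO_UD`, `bis_to_ud`, and `convert_sentence` are hypothetical.

```python
# Illustrative subset only: these BIS -> UD pairs are assumptions,
# not the mapping table constructed in the paper.
BIS_TO_UD = {
    "N_NN": "NOUN",      # common noun
    "N_NNP": "PROPN",    # proper noun
    "PR_PRP": "PRON",    # personal pronoun
    "V_VM": "VERB",      # main verb
    "V_VAUX": "AUX",     # auxiliary verb
    "JJ": "ADJ",         # adjective
    "RB": "ADV",         # adverb
    "PSP": "ADP",        # postposition
    "CC_CCD": "CCONJ",   # coordinating conjunction
    "CC_CCS": "SCONJ",   # subordinating conjunction
    "QT_QTC": "NUM",     # cardinal
    "RP_INJ": "INTJ",    # interjection
    "RD_PUNC": "PUNCT",  # punctuation
    "RD_SYM": "SYM",     # symbol
}


def bis_to_ud(tag: str) -> str:
    """Return the UD tag for a BIS tag, falling back to 'X' (other)."""
    return BIS_TO_UD.get(tag, "X")


def convert_sentence(tagged):
    """Convert a list of (token, BIS tag) pairs to (token, UD tag) pairs."""
    return [(token, bis_to_ud(tag)) for token, tag in tagged]


if __name__ == "__main__":
    # Placeholder tokens; a real corpus would supply Odia words here.
    sentence = [("w1", "N_NNP"), ("w2", "V_VM"), ("w3", "RD_PUNC")]
    print(convert_sentence(sentence))
```

Because several fine-grained BIS tags can collapse onto one coarser UD tag, a conversion like this is many-to-one, and any BIS tag missing from the table is sent to the UD catch-all tag `X` here rather than raising an error.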
Index Terms
Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches