Abstract
Named entity recognition has been one of the most widely researched natural language processing technologies over the past two decades. For the South African languages, however, relatively little research and development work has been done. This changed with the release of the NCHLT named entity annotated resources, a collection of named entity annotated data and Conditional Random Field-based named entity recognisers for ten of the official languages.
In this work, we provide a detailed description and linguistic analysis of the named entity (NE) annotated data for isiXhosa, an agglutinative language, analysing the morphosyntactic features relevant to the three main NE types: person, location, and organisation. From the data, we identify suffix and capitalisation features that may be good predictors of the different NE types. Based on these features, we describe the named entity recogniser and feature set developed as part of the NCHLT release. The recogniser achieves high overall precision (0.9713) but relatively low overall recall (0.7409), with recall lowest for person names (0.5963), resulting in an overall F-score of 0.8406. Although there are various avenues for improving the recogniser, this is a significant release for a historically under-resourced language.
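As a quick check on the reported figures, the overall F-score can be recomputed from the overall precision and recall. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state the weighting, and the function name `f_score` is only illustrative.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1.0 gives the harmonic mean (F1).

    Assumption: the abstract's "F-score" is the balanced F1 measure.
    """
    if precision <= 0.0 and recall <= 0.0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Overall precision and recall as reported in the abstract.
overall_f1 = f_score(0.9713, 0.7409)
print(f"{overall_f1:.4f}")
```

Under that assumption, the result rounds to the reported 0.8406, so the three overall figures are mutually consistent.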
Index Terms
IsiXhosa Named Entity Recognition Resources