research-article

IsiXhosa Named Entity Recognition Resources

Published: 27 December 2022

Abstract

Named entity recognition has been one of the most widely researched natural language processing technologies over the past two decades. For the South African languages, however, relatively little research and development work has been done. This changed with the release of the NCHLT named entity annotated resources, a collection of named entity annotated data and Conditional Random Field-based named entity recognisers for ten of the official languages.

In this work, we provide a detailed description and linguistic analysis of the named entity (NE) annotated data for isiXhosa, an agglutinative language, by analysing the morphosyntactic features relevant to the three main types of NE, viz. person, location, and organisation. From the data, we identify suffix and capitalisation features that may be good predictors of the different NE types. Based on these features, we describe the named entity recogniser and feature set developed as part of the NCHLT release. The recogniser achieves high overall precision (0.9713) but relatively low overall recall (0.7409), with recall especially low for person names (0.5963), resulting in an overall F-score of 0.8406. Although there are various avenues to improve the named entity recogniser, this is a significant release for a historically under-resourced language.
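The overall scores reported above can be checked against one another. The short sketch below assumes that the F-score in the abstract is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state the weighting, and the function name `f_score` is illustrative only, not part of the NCHLT release.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta = 1 gives the balanced F1 measure."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both scores are zero, so the harmonic mean is taken to be zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Overall figures quoted in the abstract.
overall_precision = 0.9713
overall_recall = 0.7409

# Assumption: the reported F-score is balanced (beta = 1).
overall_f1 = f_score(overall_precision, overall_recall)
print(f"{overall_f1:.4f}")  # prints 0.8406
```

Under that assumption the computed value agrees with the reported 0.8406, and it shows how the low recall pulls the F-score well below the precision.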


    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2
      February 2023
      624 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3572719

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 27 December 2022
      • Online AM: 2 June 2022
      • Accepted: 12 April 2022
      • Received: 7 September 2021
      Qualifiers

      • research-article
      • Refereed