Abstract
Igbo, an African language with around 32 million speakers worldwide, is one of many languages that have few or none of the language processing resources needed for advanced language technology applications. In this article, we describe our approach to creating an initial set of resources for Igbo, including an electronic text corpus, a part-of-speech (POS) tagset, and a POS-tagged subcorpus. We discuss how the texts were gathered and preprocessed and how the POS-tagged corpus was developed. We also discuss some of the problems encountered during corpus and tagset development and the solutions adopted for them.
Supplemental Material
Supplemental movie, appendix, image and software files for A Basic Language Resource Kit Implementation for the IgboNLP Project