A Basic Language Resource Kit Implementation for the IgboNLP Project

Published: 11 January 2018

Abstract

Igbo, an African language with around 32 million speakers worldwide, is one of the many languages that have few or none of the language processing resources needed for advanced language technology applications. In this article, we describe our approach to creating an initial set of resources for Igbo, including an electronic text corpus, a part-of-speech (POS) tagset, and a POS-tagged subcorpus. We discuss how the texts were gathered and preprocessed, and how the POS-tagged corpus was developed. We also discuss some of the problems encountered during corpus and tagset development and the solutions arrived at for these problems.
