short-paper

Keyword Extraction from Arabic Documents using Term Equivalence Classes

Published: 20 April 2015

Abstract

The rapid growth of the Internet and other computing facilities in recent years has produced a large amount of text in electronic form, which has increased interest in automatic text processing applications, including keyword extraction and term indexing. Although keywords are useful for many applications, most documents available online are not provided with keywords. We describe a method for extracting keywords from Arabic documents. The method identifies keywords by combining linguistic and statistical analysis of the text, without using prior knowledge of its domain or information from any related corpus. The text is first preprocessed to extract the main linguistic information, such as the roots and morphological patterns of derivative words. A cleaning phase is then applied to eliminate meaningless words from the text. The most frequent terms are clustered into equivalence classes: derivative words generated from the same root, and non-derivative words generated from the same stem, are placed together, and their counts are accumulated. A vector space model is then used to capture the most frequent N-grams in the text. Experiments carried out on a real-world dataset show that the proposed method achieves an average precision of 31% and an average recall of 53% when tested against manually assigned keywords.
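The equivalence-class step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a hypothetical lookup table, `ROOT_OR_STEM`, standing in for a real morphological analyzer, a tiny hypothetical stopword set for the cleaning phase, and Latin transliterations of Arabic words for readability. The N-gram step here is plain counting over class keys, not the vector space model the abstract mentions.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a morphological analyzer: maps a surface word
# to its root (derivative words) or stem (non-derivative words).
# Words are Latin transliterations chosen only for illustration.
ROOT_OR_STEM = {
    "kAtb": "ktb",   # writer
    "ktAb": "ktb",   # book
    "mktbp": "ktb",  # library
    "mdrsp": "drs",  # school
    "drs": "drs",    # lesson
}

# Hypothetical stopword list used as a toy "cleaning phase".
STOPWORDS = {"fy", "mn", "ElY"}


def class_key(word):
    """Return the equivalence-class key; unknown words form their own class."""
    return ROOT_OR_STEM.get(word, word)


def equivalence_classes(tokens):
    """Group tokens by class key and count each surface form per class."""
    classes = defaultdict(Counter)
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        classes[class_key(tok)][tok] += 1
    return classes


def rank_classes(classes, top_k=5):
    """Rank classes by accumulated count (ties broken alphabetically by key)."""
    totals = {key: sum(members.values()) for key, members in classes.items()}
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


def frequent_ngrams(tokens, n=2, top_k=3):
    """Count N-grams over class keys after stopword removal.

    Note: removing stopwords first means some N-grams span words that
    were not adjacent in the original text.
    """
    keys = [class_key(t) for t in tokens if t not in STOPWORDS]
    grams = Counter(tuple(keys[i:i + n]) for i in range(len(keys) - n + 1))
    return grams.most_common(top_k)


tokens = ["kAtb", "ktAb", "fy", "mktbp", "mdrsp", "drs", "ktAb"]
classes = equivalence_classes(tokens)
ranked = rank_classes(classes)
top_bigram = frequent_ngrams(tokens, n=2, top_k=1)
```

In this toy input, three different surface forms share the key `ktb`, so their counts accumulate into a single class that outranks any individual word. That is the effect the abstract attributes to grouping by root or stem.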
