
The Contribution of Stemming and Semantics in Arabic Topic Segmentation

Published: 11 January 2018

Abstract

Topic segmentation is one of the pillars of Natural Language Processing, yet research on it for the Arabic language remains scarce. This article aims to improve Arabic Topic Segmentation (ATS) by investigating two segmenters, ArabC99 and ArabTextTiling. The study is carried out on two independent levels, pre-processing and segmentation, which are the basic steps of topic segmentation. At the pre-processing level, we examine the effect of different Arabic stemming algorithms on ATS and find that Light10 is more appropriate for this step than the other stemmers examined. Based on this conclusion, we proceed to the second level and propose two Arabic segmenters, ArabC99-LS-LSA and ArabTextTiling-LS-LSA, both of which use external semantic knowledge obtained through Latent Semantic Analysis (LSA). The evaluation results show that LSA improves ATS. Hence, the main outcome of this article is a multilevel improvement of ATS based on Light10 and LSA.
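To make the two levels concrete, here is a minimal Python sketch of the general idea: a light-stemming pre-processing pass followed by a lexical-cohesion score at each sentence gap, in the spirit of TextTiling-style segmentation. It is not the article's ArabTextTiling or ArabC99 implementation. The affix lists are a small illustrative assumption and are not the Light10 lists, and the names `light_stem`, `gap_similarities`, and `weakest_gap` are hypothetical. No LSA is used here; an LSA-based variant would presumably compare vectors in a latent semantic space rather than raw term counts.

```python
import math
from collections import Counter

# Illustrative affix lists only (an assumption); these are NOT the Light10 lists.
# Longer prefixes come first so that they are matched before shorter ones.
PREFIXES = ("وال", "بال", "ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(sentences, block=2):
    """Similarity between the sentence blocks on either side of each gap."""
    bags = [Counter(light_stem(w) for w in s.split()) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block):gap], Counter())
        right = sum(bags[gap:gap + block], Counter())
        sims.append(cosine(left, right))
    return sims


def weakest_gap(sentences, block=2):
    """Index of the sentence that starts the new segment at the least cohesive gap."""
    sims = gap_similarities(sentences, block)
    return sims.index(min(sims)) + 1
```

A low similarity at a gap suggests a candidate topic boundary. Stemming matters here because surface variants such as a noun with and without the definite article only count as the same term once their affixes are stripped.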




Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 17, Issue 2
June 2018
134 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3160862

Copyright © 2018 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 11 January 2018
• Accepted: 1 October 2017
• Revised: 1 August 2017
• Received: 1 October 2016

Qualifiers

• research-article
• Research
• Refereed
